๋ฐ์ดํ„ฐ ์—”์ง€๋‹ˆ์–ด๋ง

์‹ค์‹œ๊ฐ„ ๋ถ„์„ ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์ถ• - 01. Kafka, Iceberg ์„ค์น˜

Tempo 2025. 5. 15. 08:03
RedPanda, Iceberg๋ฅผ Docker Compose๋กœ ๊ตฌ์„ฑํ•ด์„œ ์‹ค์‹œ๊ฐ„ ํŒŒ์ดํ”„๋ผ์ธ ๊ธฐ์ดˆ ๊ตฌ์„ฑ

 

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” Redpanda์™€ Iceberg, Minio๋ฅผ ๊ตฌ์„ฑํ•ด์„œ ์‹ค์‹œ๊ฐ„ ๋ฐ์ดํ„ฐ๋ ˆ์ดํฌ ํ™˜๊ฒฝ์„ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

 

📌 Goals

  • Install Redpanda
  • Install Apache Iceberg
  • Install MinIO
  • Implement all of the above with Docker Compose

โ„๏ธ Iceberg๋ž€ ๋ฌด์—‡์ธ๊ฐ€์š”?

Apache Iceberg๋Š” ๋Œ€๊ทœ๋ชจ ํ…Œ์ด๋ธ”์„ ์œ„ํ•œ ์˜คํ”ˆ์†Œ์Šค ๋ฐ์ดํ„ฐ ๋ ˆ์ดํฌ ํฌ๋งท์ž…๋‹ˆ๋‹ค. ๊ธฐ์กด Hive ๋ฉ”ํƒ€์Šคํ† ์–ด ๊ธฐ๋ฐ˜์˜ ๋А๋ฆฌ๊ณ  ๋น„ํšจ์œจ์ ์ธ ์ฟผ๋ฆฌ๋ฅผ ๊ทน๋ณตํ•˜๊ณ ์ž ์„ค๊ณ„๋˜์—ˆ์œผ๋ฉฐ, Spark, Trino, Flink ๋“ฑ ๋‹ค์–‘ํ•œ ๋ถ„์„ ๋„๊ตฌ์™€ ์‰ฝ๊ฒŒ ํ†ตํ•ฉ๋ฉ๋‹ˆ๋‹ค.

โœ… Iceberg์˜ ํŠน์ง•

  • ACID ํŠธ๋žœ์žญ์…˜ ์ง€์›
  • Schema Evolution (์Šคํ‚ค๋งˆ ๋ณ€๊ฒฝ) ๊ฐ€๋Šฅ
  • Partitioning ์ „๋žต์ด ๋›ฐ์–ด๋‚˜ ์ฟผ๋ฆฌ ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • MinIO ๊ฐ™์€ S3 ํ˜ธํ™˜ ์Šคํ† ๋ฆฌ์ง€์™€๋„ ์‰ฝ๊ฒŒ ์—ฐ๋™

 

📦 Setting Up the Environment with Docker Compose

The official Docker Compose example for Iceberg is available in the documentation (https://iceberg.apache.org/spark-quickstart/#docker-compose). As of May 2025, however, the Iceberg Docker image does not ship the Spark jars needed for Kafka, so we install the extra packages ourselves so that Spark can consume Kafka topics.

๐Ÿ“„ ๋„์ปค ์ด๋ฏธ์ง€ ๋ณ€๊ฒฝ

์šฐ์„  ๋„์ปค ์ปดํฌ์ฆˆ๋ฅผ ์ œ๊ณตํ•˜๋Š” git ์ €์žฅ์†Œ๋ฅผ clone ํ•ฉ๋‹ˆ๋‹ค.

git clone https://github.com/databricks/docker-spark-iceberg.git

 

After cloning the repository, modify the file at the path below as follows.

  • Add the snippet below at line 92 of docker-spark-iceberg > spark > Dockerfile
  • Based on Spark 3.5.5
  • If Iceberg fails at runtime, the cause is often a version conflict between packages such as kafka-clients and spark-sql-kafka-0-10, so align their versions.
# Download the Kafka-related Spark jars
RUN curl -s https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/${SPARK_VERSION}/spark-sql-kafka-0-10_2.12-${SPARK_VERSION}.jar -Lo /opt/spark/jars/spark-sql-kafka-0-10_2.12-${SPARK_VERSION}.jar
RUN curl -s https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.4.1/kafka-clients-3.4.1.jar -Lo /opt/spark/jars/kafka-clients-3.4.1.jar
RUN curl -s https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar -Lo /opt/spark/jars/commons-pool2-2.11.1.jar
RUN curl -s https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/${SPARK_VERSION}/spark-token-provider-kafka-0-10_2.12-${SPARK_VERSION}.jar -Lo /opt/spark/jars/spark-token-provider-kafka-0-10_2.12-${SPARK_VERSION}.jar
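The download URLs above all follow Maven Central's standard repository layout, so once you fix a Spark and Scala version the paths are mechanical. A minimal sketch, assuming SPARK_VERSION=3.5.5 and Scala 2.12 (the versions used in this post):

```shell
# Sketch: how the spark-sql-kafka jar URL is assembled from the
# Spark and Scala versions (3.5.5 / 2.12, as used in this post).
SPARK_VERSION=3.5.5
SCALA_VERSION=2.12
ARTIFACT="spark-sql-kafka-0-10_${SCALA_VERSION}"
URL="https://repo1.maven.org/maven2/org/apache/spark/${ARTIFACT}/${SPARK_VERSION}/${ARTIFACT}-${SPARK_VERSION}.jar"
echo "$URL"
```

The same pattern applies to the kafka-clients, commons-pool2, and spark-token-provider jars, with the group path and version swapped accordingly.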

 

๐Ÿ” Spark, Kafka Maven ๋งž์ถค ๋ฒ„์ „ ์ฐพ๊ธฐ

1. Spark-sql-kafka ํŽ˜์ด์ง€์— ์ ‘์†ํ•ฉ๋‹ˆ๋‹ค.(https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10)

2. ํ•ด๋‹นํ•˜๋Š” Spark ๋ฒ„์ „์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ์„ ํƒ ์‹œ Scala ๋ฒ„์ „์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค. ์ €๋Š” scala ๋ฒ„์ „์„ 2.12๋กœ ๋งž์ท„์Šต๋‹ˆ๋‹ค.

(Image: Spark core)

3. Spark ๋ฒ„์ „ ์„ ํƒ ์‹œ ํ•ด๋‹น ๋ฒ„์ „๊ณผ ๊ด€๋ จ๋œ ๋˜๋Š” ํ˜ธํ™˜์„ฑ์ด ๋งž๋Š” ํŒจํ‚ค์ง€๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

4. ์•„๋ž˜ ๋ช…๋ น์–ด๋กœ customํ•œ ๋„์ปค ์ด๋ฏธ์ง€๋กœ ๋นŒ๋“œํ•ฉ๋‹ˆ๋‹ค.

docker buildx build -t rt/iceberg --platform=linux/amd64,linux/arm64 .
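As a side note on step 2, the Scala version is encoded in the Maven artifact name after the underscore, and it must match the Scala build of your Spark distribution. A quick shell sketch:

```shell
# The artifact name encodes the Scala build after the underscore:
# spark-sql-kafka-0-10_2.12 is compiled for Scala 2.12, which must match
# the Scala version your Spark image was built with.
ARTIFACT="spark-sql-kafka-0-10_2.12"
SCALA_VERSION="${ARTIFACT##*_}"   # strip everything up to the last underscore
echo "$SCALA_VERSION"
```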

 

๐Ÿณ Docker compose ๊ตฌ์„ฑ

ํŽธ์˜๋ฅผ ์œ„ํ•ด Delta-lake๋„ ๋ฏธ๋ฆฌ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.
version: "3.8"

services:
  spark-iceberg:
    image: rt/iceberg
    container_name: spark-iceberg
    build: spark/
    networks:
      iceberg_net:
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
      - 10000:10000
      - 10001:10001
  rest:
    image: apache/iceberg-rest-fixture
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: |
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
  delta-lake:
    # ์šด์˜์ฒด์ œ์— ๋”ฐ๋ผ ์ด๋ฏธ์ง€๋ฅผ ๋ณ€๊ฒฝํ•˜์—ฌ ๋ฐฐํฌํ•ด์•ผ ํ•จ
    image: deltaio/delta-docker:latest_arm64
    container_name: delta_quickstart
    volumes:
      - rustbuild:/tmp
    ports:
      - "8088:8888"
    # entrypoint: ["bash", "deltaio/delta-docker:latest_arm64"]
    networks:
      - iceberg_net
    depends_on:
      - minio
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - HADOOP_CONF_DIR=/tmp/hadoop-conf
      - HADOOP_OPTIONAL_TOOLS=hadoop-aws
      - SPARK_CONF__spark.hadoop.fs.s3a.endpoint=http://minio:9000
      - SPARK_CONF__spark.hadoop.fs.s3a.access.key=admin
      - SPARK_CONF__spark.hadoop.fs.s3a.secret.key=password
      - SPARK_CONF__spark.hadoop.fs.s3a.path.style.access=true
      - SPARK_CONF__spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
  redpanda-0:
    command:
      - redpanda
      - start
      - --kafka-addr internal://0.0.0.0:9092,external://0.0.0.0:19092
      # Address the broker advertises to clients that connect to the Kafka API.
      # Use the internal addresses to connect to the Redpanda brokers'
      # from inside the same Docker network.
      # Use the external addresses to connect to the Redpanda brokers'
      # from outside the Docker network.
      - --advertise-kafka-addr internal://redpanda-0:9092,external://localhost:19092
      - --pandaproxy-addr internal://0.0.0.0:8082,external://0.0.0.0:18082
      # Address the broker advertises to clients that connect to the HTTP Proxy.
      - --advertise-pandaproxy-addr internal://redpanda-0:8082,external://localhost:18082
      - --schema-registry-addr internal://0.0.0.0:8081,external://0.0.0.0:18081
      # Redpanda brokers use the RPC API to communicate with each other internally.
      - --rpc-addr redpanda-0:33145
      - --advertise-rpc-addr redpanda-0:33145
      # Mode dev-container uses well-known configuration properties for development in containers.
      - --mode dev-container
      # Tells Seastar (the framework Redpanda uses under the hood) to use 1 core on the system.
      - --smp 1
      - --default-log-level=info
    image: docker.redpanda.com/redpandadata/redpanda:v25.1.4
    container_name: redpanda-0
    volumes:
      - redpanda-0:/var/lib/redpanda/data
    networks:
      - iceberg_net
    ports:
      - 18081:18081
      - 18082:18082
      - 19092:19092
      - 19644:9644
  console:
    container_name: redpanda-console
    image: docker.redpanda.com/redpandadata/console:v3.1.0
    networks:
      - iceberg_net
    entrypoint: /bin/sh
    command: -c 'echo "$$CONSOLE_CONFIG_FILE" > /tmp/config.yml; /app/console'
    environment:
      CONFIG_FILEPATH: /tmp/config.yml
      CONSOLE_CONFIG_FILE: |
        kafka:
          brokers: ["redpanda-0:9092"]
        schemaRegistry:
          enabled: true
          urls: ["http://redpanda-0:8081"]
        redpanda:
          adminApi:
            enabled: true
            urls: ["http://redpanda-0:9644"]
    ports:
      - 18080:8080
    depends_on:
      - redpanda-0
volumes:
  rustbuild:
  redpanda-0: null
networks:
  iceberg_net:
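With the file above saved as docker-compose.yml, the stack can be brought up in the background; a minimal sketch, run from the cloned repository root:

```shell
# Start every service defined above in detached mode, then list their status.
# Run from the directory containing docker-compose.yml.
docker compose up -d
docker compose ps
```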

 

ํ•ด๋‹น ๋„์ปค ์ปดํฌ์ฆˆ๋ฅผ ์‹คํ–‰ ์‹œ ๊ฐ URL์—์„œ ์„œ๋น„์Šค๋ฅผ ํ™•์ธ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋„์ปค ์ด๋ฏธ์ง€ ์‹คํ–‰ ํ™•์ธ

  • Iceberg Jupyter notebook: http://localhost:8888/
  • Redpanda Console(Kafka): http://localhost:18080/
  • Minio: http://localhost:9001/ (ID: admin, PW: password)

โœ… ๋‹ค์Œ ๊ธ€: Kafka - Iceberg ์—ฐ๊ฒฐ๋กœ ์‹ค์‹œ๊ฐ„ ๋ฐ์ดํ„ฐ ์ ์žฌ

๋‹ค์Œ ๊ธ€์—์„  Iceberg์— ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•˜๊ณ  Kafka ํ† ํ”ฝ์„ ์ƒ์„ฑ, ๋ฉ”์‹œ์ง€๋ฅผ ์ƒ์„ฑํ•ด์„œ ์‹ค์‹œ๊ฐ„์œผ๋กœ ์ ์žฌํ•˜๋Š” ์‹ค์Šต์„ ์ง„ํ–‰ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

๋ฐ˜์‘ํ˜•