๋ฐ์ดํ„ฐ ์—”์ง€๋‹ˆ์–ด๋ง

์‹ค์‹œ๊ฐ„ ๋ถ„์„ ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์ถ• - 04. Delta Lake ์†Œ๊ฐœ ๋ฐ ํ™˜๊ฒฝ ์„ค์ •

Tempo 2025. 6. 12. 11:09
์ด๋ฒˆ ๊ธ€์—์„œ๋Š” ์‹ค์‹œ๊ฐ„ ๋ถ„์„ ํŒŒ์ดํ”„๋ผ์ธ์˜ ํ•ต์‹ฌ ๊ธฐ์ˆ  ์ค‘ ํ•˜๋‚˜์ธ Delta Lake์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•˜๊ณ , Docker Compose ๊ธฐ๋ฐ˜์˜ ํ™˜๊ฒฝ ์„ค์ • ๋ฐ PySpark๋ฅผ ํ™œ์šฉํ•œ ์ฃผ์š” ๊ธฐ๋Šฅ์„ ์‹ค์Šตํ•œ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, Delta Lake์˜ ์ฃผ์š” ํŠน์ง•๊ณผ Iceberg์™€์˜ ์ฐจ์ด์ ์„ ๋น„๊ตํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

์ด์ „ docker compose ๊ธฐ๋ฐ˜ delta-lake ์„ค์น˜ - https://jongwho.tistory.com/37

 

์‹ค์‹œ๊ฐ„ ๋ถ„์„ ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์ถ• - 01. Kafka, Iceberg ์„ค์น˜

RedPanda, Iceberg๋ฅผ Docker Compose๋กœ ๊ตฌ์„ฑํ•ด์„œ ์‹ค์‹œ๊ฐ„ ํŒŒ์ดํ”„๋ผ์ธ ๊ธฐ์ดˆ ๊ตฌ์„ฑ ์ด๋ฒˆ ๊ธ€์—์„œ๋Š” Redpanda์™€ Iceberg, Minio๋ฅผ ๊ตฌ์„ฑํ•ด์„œ ์‹ค์‹œ๊ฐ„ ๋ฐ์ดํ„ฐ๋ ˆ์ดํฌ ํ™˜๊ฒฝ์„ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๐Ÿ“Œ ๋ชฉํ‘œRedpanda ์„ค์น˜Apache Iceberg ์„ค

jongwho.tistory.com

 


โœ… Delta Lake๋ž€?

Delta Lake๋Š” Apache Spark ๊ธฐ๋ฐ˜์˜ ACID ํŠธ๋žœ์žญ์…˜์„ ์ง€์›ํ•˜๋Š” ์ €์žฅ์†Œ ๊ณ„์ธต(Storage Layer) ์œผ๋กœ, ๋ฐ์ดํ„ฐ ๋ ˆ์ดํฌ(Data Lake)์˜ ์‹ ๋ขฐ์„ฑ๊ณผ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผœ ์ค๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ HDFS๋‚˜ S3์™€ ๊ฐ™์€ ๊ฐ์ฒด ์ €์žฅ์†Œ ์œ„์— ๊ตฌ์ถ•๋˜์–ด, Append-Only ํŠน์„ฑ์„ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ ์ €์žฅ ๋ฐฉ์‹์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๋ฌธ์ œ์ (์˜ˆ: ๋ฐ์ดํ„ฐ ์ •ํ•ฉ์„ฑ, ๋™์‹œ์„ฑ ์ด์Šˆ ๋“ฑ)์„ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ”ง ์„ค์น˜ ๋ฐ ์‹คํ–‰

๊ธฐ์กด ๊ธ€์—์„œ ์„ค์น˜ํ•œ docker compose์— delta-lake๊ฐ€ ์„ค์น˜๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. Iceberg๋„ ๊ฐ™์ด ์„ค์น˜๋˜์–ด ์žˆ์–ด Delta-Lake ๋…ธํŠธ๋ถ ์‹คํ–‰ ์‹œ ํฌํŠธ์— ํ˜ผ๋™์ด ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

docker-compose up -d

๊ตฌ์„ฑ ์š”์†Œ:

  • Spark (with Delta Lake)
  • Jupyter Notebook
    Delta Lake์˜ Quick Start ์˜ˆ์ œ ๋…ธํŠธ๋ถ์„ ํ†ตํ•ด Delta ํ…Œ์ด๋ธ” ์ƒ์„ฑ ๋ฐ ์ฟผ๋ฆฌ๋ฅผ ์‹ค์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ ‘๊ทผ ๋ฐฉ๋ฒ•: delta_quickstart ๋‚ด jupyter notebook ์ฃผ์†Œ์—์„œ ํฌํŠธ๋ฅผ 8088๋กœ ๋ณ€๊ฒฝ
    • ์˜ˆ: http://127.0.0.1:8088/lab?token=b9a9d31ea453fd8e224eeba50c42053bff12391ea1531aba

๐Ÿงช Delta Lake ๊ธฐ๋ณธ ๋‚ด์šฉ ์†Œ๊ฐœ

Delta Lake์˜ ๊ธฐ๋ณธ ์‚ฌ์šฉ๋ฒ•์„ Jupyter Notebook์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์‹ค์Šตํ•˜์˜€์Šต๋‹ˆ๋‹ค.

1. Delta ํ…Œ์ด๋ธ” ์ƒ์„ฑ

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakeDemo") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")

2. Delta ํ…Œ์ด๋ธ” ์กฐํšŒ

df = spark.read.format("delta").load("/tmp/delta-table")
df.show()

3. Append ๋ฐ Update

# Append
new_data = spark.range(5, 10)
new_data.write.format("delta").mode("append").save("/tmp/delta-table")

# Update
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")
deltaTable.update(
    condition="id % 2 == 0",
    set={"id": "id + 100"}
)

4. Time Travel

# ์ „์ฒด ํžˆ์Šคํ† ๋ฆฌ ํ…Œ์ด๋ธ” ํ™•์ธ
delta_table_history = (DeltaTable
                        .forPath(spark, "/tmp/delta-table")
                        .history()
                      )

(delta_table_history
   .select("version", "timestamp", "operation", "operationParameters", "operationMetrics", "engineInfo")
   .show()
)

# ์ด์ „ ๋ฒ„์ „์œผ๋กœ ํ…Œ์ด๋ธ” ๋ณ€๊ฒฝ(๋ฒ„์ „:0) ๋ฐ ์กฐํšŒ
df = (spark
        .read
        .format("delta")
        .option("versionAsOf", 0) # we pass an option `versionAsOf` with the required version number we are interested in
        .load("/tmp/delta-table")
        .orderBy("id")
      )

df.show()

๐Ÿ’ก Delta Lake ์ฃผ์š” ๊ธฐ๋Šฅ ์š”์•ฝ

ACID ํŠธ๋žœ์žญ์…˜ ๋ฉ€ํ‹ฐ ์“ฐ๋ ˆ๋“œ ํ™˜๊ฒฝ์—์„œ๋„ ๋ฐ์ดํ„ฐ ์ •ํ•ฉ์„ฑ ์œ ์ง€
Schema Enforcement & Evolution ์Šคํ‚ค๋งˆ ๊ณ ์ • ๋ฐ ๋ณ€๊ฒฝ ์ง€์›
Time Travel ํŠน์ • ์‹œ์ ์˜ ๋ฐ์ดํ„ฐ ์Šค๋ƒ…์ƒท ์กฐํšŒ ๊ฐ€๋Šฅ
Merge (UPSERT) ๊ธฐ์กด ํ…Œ์ด๋ธ”์— ์กฐ๊ฑด ๊ธฐ๋ฐ˜ ์‚ฝ์ž… ๋ฐ ๊ฐฑ์‹  ์ง€์›
๋ฐ์ดํ„ฐ ์••์ถ• ๋ฐ ํด๋Ÿฌ์Šคํ„ฐ๋ง (Z-Ordering) ์ฟผ๋ฆฌ ์„ฑ๋Šฅ ์ตœ์ ํ™”

 

๐Ÿ”Delta Lake์™€ Iceberg์˜ ์ฃผ์š” ์ฐจ์ด

  Delta Lake Apache Icecberg
์†Œ์† Databricks ์ฃผ๋„, ์˜คํ”ˆ์†Œ์Šค (Linux Foundation) Apache Software Foundation, ๋ฒค๋” ์ค‘๋ฆฝ
ํŒŒํ‹ฐ์…”๋‹ ์‚ฌ์šฉ์ž๊ฐ€ ๋ช…์‹œ์ ์œผ๋กœ ํŒŒํ‹ฐ์…˜ ์ •์˜ ํ•„์š” ์ˆจ๊ฒจ์ง„ ํŒŒํ‹ฐ์…”๋‹์œผ๋กœ ์ž๋™ ์ตœ์ ํ™”, ์‚ฌ์šฉ์ž ๋ถ€๋‹ด ๊ฐ์†Œ
๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๊ด€๋ฆฌ JSON ๊ธฐ๋ฐ˜ ํŠธ๋žœ์žญ์…˜ ๋กœ๊ทธ, ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์—์„œ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๋ถ€ํ•˜ ๊ฐ€๋Šฅ Parquet ๊ธฐ๋ฐ˜ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ, ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์™€ ํŒŒํ‹ฐ์…˜์— ๋” ํšจ์œจ์ 
์Šคํ‚ค๋งˆ ์ง„ํ™” ์Šคํ‚ค๋งˆ ๋ณ€๊ฒฝ ์ง€์›, MergeSchema ์˜ต์…˜์œผ๋กœ ์œ ์—ฐ์„ฑ ์ œ๊ณต ๊ณ ๊ธ‰ ์Šคํ‚ค๋งˆ ์ง„ํ™”(์ปฌ๋Ÿผ ์ˆœ์„œ ๋ณ€๊ฒฝ, ํƒ€์ž… ๋ณ€๊ฒฝ ๋“ฑ) ์ง€์›
์„ฑ๋Šฅ ์ตœ์ ํ™” Z-Order ์ธ๋ฑ์‹ฑ, ๋ฐ์ดํ„ฐ ์Šคํ‚คํ•‘, ์ปดํŒฉ์…˜ ์Šค๋ƒ…์ƒท ๊ธฐ๋ฐ˜ ์ฟผ๋ฆฌ ์ตœ์ ํ™”, ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์ธ๋ฑ์‹ฑ
์—”์ง„ ํ†ตํ•ฉ Spark์— ์ตœ์ ํ™”, ๋‹ค๋ฅธ ์—”์ง„๋„ ์ง€์› Spark, Flink, Trino ๋“ฑ ๋‹ค์–‘ํ•œ ์—”์ง„์— ๊ท ๋“ฑํ•œ ์ง€์›
UPSERT ์„ฑ๋Šฅ Merge ๋ช…๋ น์œผ๋กœ ํšจ์œจ์ , Databricks ํ™˜๊ฒฝ์—์„œ ์ตœ์ ํ™” Merge ์„ฑ๋Šฅ์€ ๊ฐœ์„  ์ค‘, ์—”์ง„์— ๋”ฐ๋ผ ์„ฑ๋Šฅ ์ฐจ์ด ์กด์žฌ
์ปค๋ฎค๋‹ˆํ‹ฐ/์ƒํƒœ๊ณ„ Databricks ์ค‘์‹ฌ, ํด๋ผ์šฐ๋“œ ํ™˜๊ฒฝ์— ๊ฐ•์  Apache ์ปค๋ฎค๋‹ˆํ‹ฐ ์ค‘์‹ฌ, ๋ฒค๋” ์ค‘๋ฆฝ์ ์ด๊ณ  ์œ ์—ฐํ•œ ํ†ตํ•ฉ

๐Ÿ‘‰ ์ •๋ฆฌ:

  • Delta Lake๋Š” Spark ์ค‘์‹ฌ์˜ ํ™˜๊ฒฝ์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์œผ๋ฉฐ, ์‹ค์‹œ๊ฐ„ ์ŠคํŠธ๋ฆฌ๋ฐ ๋ฐ ๋ฐฐ์น˜ ํ†ตํ•ฉ์— ๊ฐ•์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • Iceberg๋Š” ๋‹ค์–‘ํ•œ ์—”์ง„๊ณผ์˜ ํ†ตํ•ฉ ๋ฐ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๊ด€๋ฆฌ ์ธก๋ฉด์—์„œ ๋” ๋ฐœ์ „๋œ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

โœ… ์–ธ์ œ ๋ฌด์—‡์„ ์„ ํƒํ• ๊นŒ?

  • Delta Lake:
    • Databricks ํ™˜๊ฒฝ์—์„œ ๋ฐ์ดํ„ฐ ๋ ˆ์ดํฌ๋ฅผ ์šด์˜ํ•˜๊ฑฐ๋‚˜ Spark ์ค‘์‹ฌ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ ์ ํ•ฉ.
    • ๊ฐ„๋‹จํ•œ ์„ค์ •๊ณผ ๋น ๋ฅธ UPSERT ์ž‘์—…์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ.
    • ํด๋ผ์šฐ๋“œ ๊ธฐ๋ฐ˜ ๋ฐ์ดํ„ฐ ๋ ˆ์ดํฌ์—์„œ ํ†ตํ•ฉ ๊ด€๋ฆฌ์™€ ์ตœ์ ํ™”๋œ ์„ฑ๋Šฅ์„ ์›ํ•  ๋•Œ.
  • Apache Iceberg:
    • ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹(์ˆ˜์‹ญ์–ต ํ–‰ ์ด์ƒ)๊ณผ ๋ณต์žกํ•œ ํŒŒํ‹ฐ์…”๋‹ ์š”๊ตฌ์‚ฌํ•ญ์ด ์žˆ๋Š” ๊ฒฝ์šฐ.
    • ๋‹ค์–‘ํ•œ ์ฟผ๋ฆฌ ์—”์ง„(Flink, Trino ๋“ฑ)์„ ํ˜ผํ•ฉํ•ด ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜ ๋ฒค๋” ์ข…์†์„ฑ์„ ํ”ผํ•˜๊ณ  ์‹ถ์„ ๋•Œ.
    • ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๊ด€๋ฆฌ์™€ ์Šคํ‚ค๋งˆ ์ง„ํ™”์˜ ์œ ์—ฐ์„ฑ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ.

๐Ÿ“Œ ๋‹ค์Œ ๊ธ€: Structed Streaming & Delta Lake

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” Delta Lake์˜ ๊ธฐ๋ณธ ๊ฐœ๋…๊ณผ ํ™˜๊ฒฝ ์„ค์ •, ๊ทธ๋ฆฌ๊ณ  PySpark ๊ธฐ๋ฐ˜์˜ ์ฃผ์š” ๊ธฐ๋Šฅ์„ ์‹ค์Šตํ•ด ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ ๊ธ€์—์„œ๋Š” Structured Streaming๊ณผ Delta Lake๋ฅผ ๊ฒฐํ•ฉํ•œ ์‹ค์‹œ๊ฐ„ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์„ฑ์„ ๋‹ค๋ฃจ๊ฒ ์Šต๋‹ˆ๋‹ค.

๋ฐ˜์‘ํ˜•