
Diving into Apache Iceberg: My First-Hand Experience

· bhaskar

Apache Iceberg is a modern table format for large analytic datasets.

Today, I’ve been experimenting with Apache Iceberg, and I’m genuinely excited about what I’m discovering. Ever wondered why tech giants like Netflix and ByteDance are going all-in on this technology?

  • Schema evolution that actually makes sense
  • Hidden partitioning that feels like magic
  • ACID transactions that simply deliver

Branch.io’s success story validates my experiments - they saw 20x faster queries and a 40% cost reduction after implementation.

🤓 Currently testing it with Spark integration, and I must say, the developer experience is surprisingly smooth.

Example

# Create an Iceberg table with hidden partitioning (daily on event_timestamp, 16 hash buckets on customer_id)
spark.sql("""
CREATE TABLE customer_events (
    event_id BIGINT,
    customer_id BIGINT,
    event_timestamp TIMESTAMP,
    event_type STRING
) USING iceberg
PARTITIONED BY (days(event_timestamp), bucket(16, customer_id))
""")

# Add new columns over time
spark.sql("ALTER TABLE customer_events ADD COLUMN device_type STRING")

# Concurrent operations with ACID guarantees
def process_events(date_partition):
    spark.sql(f"""
    MERGE INTO customer_events t
    USING (
        SELECT * FROM daily_events
        WHERE event_timestamp >= '{date_partition}'
    ) s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
    """)

# Query with time travel
spark.sql("""
SELECT *
FROM customer_events VERSION AS OF 12345
WHERE event_timestamp > current_date() - INTERVAL 7 DAYS
""")

Quick start

https://github.com/databricks/docker-spark-iceberg

Case Study: Branch.io

Source: https://www.youtube.com/watch?v=WrDBI4Baw6o

Initial Data Platform Architecture

Microservices → Kafka → Flink / Spark → S3 (Parquet) + Hive Metastore → Trino

Scale Metrics:

  • 40 billion daily events
  • 100TB daily data ingestion
  • 500+ Hive tables
  • 10PB total data size

Key Components:

  • Kafka for message passing
  • Apache Flink for streaming
  • Apache Spark for batch processing
  • S3 storage with Parquet format
  • Hive Metastore for metadata
  • Trino for ad-hoc queries

Challenges with Parquet

  1. Mutability Issues:

    • Unsafe data changes requiring full partition rewrites
    • Long processing times for data deletion requests (see the sketch after this list)
    • Two-step transaction process leading to potential data visibility issues
  2. Query Performance:

    • Limited optimization options (only partition filtering)
    • Large data scans for specific customer queries
    • Complex workarounds needed for performance improvement
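
To make the mutability point concrete: with Iceberg, a deletion request is a single row-level DML statement committed as a new snapshot, whereas with plain Parquet on Hive the affected partitions have to be rewritten wholesale. A minimal sketch reusing the customer_events table from earlier (paths and IDs are purely illustrative):

# Iceberg: row-level delete, no partition rewrite required
spark.sql("DELETE FROM customer_events WHERE customer_id = 42")

# Plain Parquet/Hive (illustrative): read the partition, filter, and rewrite it.
# Staged to a separate path because Spark cannot safely overwrite a path it is reading from.
partition = "s3://bucket/events/date=2024-01-01/"          # hypothetical location
staged    = "s3://bucket/events_staged/date=2024-01-01/"   # hypothetical location
spark.read.parquet(partition) \
     .filter("customer_id != 42") \
     .write.mode("overwrite").parquet(staged)
# ...then swap the staged files into place and repair the Hive metastore entry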

Iceberg Migration Architecture

Microservices → Kafka → Flink / Spark → Iceberg tables (analytics warehouse, Iceberg metadata)

Migration Strategy:

  • Parallel table creation approach
  • Double-write strategy during migration (sketched after this list)
  • Quality check processes
  • Backfill process for historical data
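
The migration code itself isn’t public, but the double-write idea from the strategy above is easy to sketch: during the transition every batch lands in both the legacy Hive/Parquet table and the new Iceberg table, and the quality checks reconcile the two. The table names and the incoming DataFrame are assumptions:

def double_write(batch_df):
    # Legacy path: keep appending to the existing Hive/Parquet table
    batch_df.write.mode("append").insertInto("legacy.customer_events")

    # New path: append the same batch to the Iceberg table (DataFrameWriterV2)
    batch_df.writeTo("warehouse.customer_events_iceberg").append()

# Quality check: row counts for the same window should match before cutover
legacy_count  = spark.table("legacy.customer_events").count()
iceberg_count = spark.table("warehouse.customer_events_iceberg").count()
assert legacy_count == iceberg_count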

Tools Developed:

  1. Column Metrics Configuration
  2. Disaster Recovery Tool
  3. Table Maintainer Process (see the sketch after this list)
  4. Monitoring and Alerting System
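
The internals of these tools aren’t shown in the talk, but a table maintainer for Iceberg typically wraps the maintenance procedures that ship with the Spark integration, and column metrics are tuned through standard table properties. A hedged sketch of the kind of housekeeping such a process runs (the catalog and table names are placeholders):

# Compact the small files produced by streaming writes
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.customer_events')")

# Expire old snapshots to keep metadata and storage growth bounded
spark.sql("""
CALL my_catalog.system.expire_snapshots(
  table => 'db.customer_events',
  older_than => TIMESTAMP '2024-06-01 00:00:00'
)
""")

# Remove data files no longer referenced by any snapshot
spark.sql("CALL my_catalog.system.remove_orphan_files(table => 'db.customer_events')")

# Column metrics: keep detailed stats only for columns that queries actually filter on
spark.sql("""
ALTER TABLE my_catalog.db.customer_events SET TBLPROPERTIES (
  'write.metadata.metrics.default' = 'counts',
  'write.metadata.metrics.column.customer_id' = 'full',
  'write.metadata.metrics.column.event_timestamp' = 'full'
)
""")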

Benefits Achieved

  1. Performance:

    • 20x improvement for customer ID filtered queries
    • 20-50% reduction in other query times
    • 40% cost reduction for specific data products
  2. Storage:

    • 18% reduction in storage costs
    • Better utilization of S3 Intelligent-Tiering
  3. Development:

    • Simplified schema management
    • Easier merge operations
    • Improved data quality processes

This migration showcases a successful implementation of Apache Iceberg in a large-scale data platform, delivering significant improvements in performance, cost, and operational efficiency.

Shared storage

  • Allows multiple engines to access the same data directly (see the sketch after this list)
  • Eliminates need for data duplication across systems
  • Enables centralization of data storage
  • Simplifies data integration during company mergers/acquisitions
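
A hedged sketch of what the shared-storage idea looks like in practice: register an Iceberg catalog backed by a shared metastore in Spark, and any other engine pointed at the same catalog (Trino, Flink, another Spark cluster) reads and writes the very same table files. The catalog name, metastore URI, and warehouse path below are placeholders, and the Iceberg Spark runtime jar is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-iceberg-catalog")
    # Register an Iceberg catalog backed by a shared Hive Metastore
    .config("spark.sql.catalog.shared", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.shared.type", "hive")
    .config("spark.sql.catalog.shared.uri", "thrift://metastore-host:9083")
    .config("spark.sql.catalog.shared.warehouse", "s3://datalake/warehouse")
    .getOrCreate()
)

# Spark reads the table; Trino or Flink configured against the same metastore
# and warehouse query the identical files - no copies, no sync jobs
spark.table("shared.db.customer_events").show(5)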

Magnus Architecture (ByteDance’s Enhanced Iceberg):

Computing Layer:

  • ETL: Spark, Flink, Krypton
  • Training: Primus, PyTorch, TensorFlow, Ray
  • Cross-language support via Arrow

Core Layer:

  • Column-level updates
  • Git-like branching
  • High-speed vectorized merge-read engine
  • Global indexing with HBase

Storage Layer:

  • Multiple file formats support
  • 30-50% storage savings
  • HDFS/object storage backend

References