Diving into Apache Iceberg My First-Hand Experience
Apache Iceberg is a modern table format for large analytic datasets.
Today, I’ve been experimenting with Apache Iceberg, and I’m genuinely excited about what I’m discovering. Ever wondered why tech giants like Netflix and ByteDance are going all-in on this technology?
Schema evolution that actually makes sense Hidden partitioning that feels like magic ACID transactions that simply deliver Branch.io’s success story validates my experiments - they saw 20x faster queries and 40% cost reduction after implementation.
🤓 Currently testing it with Spark integration, and I must say, the developer experience is surprisingly smooth.
Example
# Create table with evolving schema and partitioning
spark.sql("""
CREATE TABLE customer_events (
event_id BIGINT,
customer_id BIGINT,
event_timestamp TIMESTAMP,
event_type STRING
) USING iceberg
PARTITIONED BY (days(event_timestamp), bucket(16, customer_id))
""")
# Add new columns over time
spark.sql("ALTER TABLE customer_events ADD COLUMN device_type STRING")
# Concurrent operations with ACID guarantees
def process_events(date_partition):
spark.sql(f"""
MERGE INTO customer_events t
USING daily_events s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHERE s.event_timestamp >= '{date_partition}'
""")
# Query with time travel
spark.sql("""
SELECT *
FROM customer_events VERSION AS OF 12345
WHERE event_timestamp > current_date() - INTERVAL 7 DAYS
""")
Quick start
https://github.com/databricks/docker-spark-iceberg
Case Study: Branch.io
Source: https://www.youtube.com/watch?v=WrDBI4Baw6o
Initial Data Platform Architecture
Scale Metrics:
- 40 billion daily events
- 100TB daily data ingestion
- 500+ Hive tables
- 10PB total data size
Key Components:
- Kafka for message passing
- Apache Flink for streaming
- Apache Spark for batch processing
- S3 storage with Parquet format
- Hive Metastore for metadata
- Trino for ad-hoc queries
Challenges with Parquet
Mutability Issues:
- Unsafe data changes requiring full partition rewrites
- Long processing times for data deletion requests
- Two-step transaction process leading to potential data visibility issues
Query Performance:
- Limited optimization options (only partition filtering)
- Large data scans for specific customer queries
- Complex workarounds needed for performance improvement
Iceberg Migration Architecture
Migration Strategy:
- Parallel table creation approach
- Double-write strategy during migration
- Quality check processes
- Backfill process for historical data
Tools Developed:
- Column Metrics Configuration
- Disaster Recovery Tool
- Table Maintainer Process
- Monitoring and Alerting System
Benefits Achieved
Performance:
- 20x improvement for customer ID filtered queries
- 20-50% reduction in other query times
- 40% cost reduction for specific data products
Storage:
- 18% reduction in storage costs
- Better utilization of S3 Intelligent-Tiering
Development:
- Simplified schema management
- Easier merge operations
- Improved data quality processes
This migration showcases a successful implementation of Apache Iceberg in a large-scale data platform, delivering significant improvements in performance, cost, and operational efficiency.
Related Ideas
Shared storage
- Allows multiple engines to access the same data directly
- Eliminates need for data duplication across systems
- Enables centralization of data storage
- Simplifies data integration during company mergers/acquisitions
Magnus Architecture (ByteDance’s Enhanced Iceberg):
Computing Layer:
- ETL: Spark, Flink, Krypton
- Training: Primus, PyTorch, TensorFlow, Ray
- Cross-language support via Arrow
Core Layer:
- Column-level updates
- Git-like branching
- High-speed vectorized merge-read engine
- Global indexing with HBase
Storage Layer:
- Multiple file formats support
- 30-50% storage savings
- HDFS/object storage backend
References
- Shared warehouse storage, the quiet revolution led by Apache Iceberg - YouTube https://www.youtube.com/watch?v=_kYKTQq14Tc
- Building an EB Scale Feature Store at ByteDance - YouTube https://www.youtube.com/watch?v=UPjr0qZ0-Do
- From Hive to Iceberg Crowdstrike’s Journey - YouTube https://www.youtube.com/watch?v=cFBSr-4pdBs
- Journey to the Iceberg Parquet to Iceberg Migration (Branch.io) - YouTube https://www.youtube.com/watch?v=WrDBI4Baw6o
- Unleashing Analytical Productivity at Pinterest with Apache Iceberg - YouTube https://www.youtube.com/watch?v=k8MUTt0qh6s
- Leveraging Iceberg in Netflix Studio and Creative Production - YouTube https://www.youtube.com/watch?v=xrMaG2G7KX0