
Diving into Apache Iceberg: My First-Hand Experience

· bhaskar

Apache Iceberg is a modern table format for large analytic datasets.

Today, I’ve been experimenting with Apache Iceberg, and I’m genuinely excited about what I’m discovering. Ever wondered why tech giants like Netflix and ByteDance are going all-in on this technology?

  • Schema evolution that actually makes sense
  • Hidden partitioning that feels like magic
  • ACID transactions that simply deliver

Branch.io’s success story validates my experiments - they saw 20x faster queries and a 40% cost reduction after implementation.

🤓 Currently testing it with Spark integration, and I must say, the developer experience is surprisingly smooth.

Example

# Create an Iceberg table with hidden partitioning (daily on event_timestamp, 16 hash buckets on customer_id)
spark.sql("""
CREATE TABLE customer_events (
    event_id BIGINT,
    customer_id BIGINT,
    event_timestamp TIMESTAMP,
    event_type STRING
) USING iceberg
PARTITIONED BY (days(event_timestamp), bucket(16, customer_id))
""")

# Add new columns over time
spark.sql("ALTER TABLE customer_events ADD COLUMN device_type STRING")

# Concurrent operations with ACID guarantees
def process_events(date_partition):
    spark.sql(f"""
    MERGE INTO customer_events t
    USING (
        SELECT * FROM daily_events
        WHERE event_timestamp >= '{date_partition}'
    ) s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
    """)

# Query with time travel
spark.sql("""
SELECT *
FROM customer_events VERSION AS OF 12345
WHERE event_timestamp > current_date() - INTERVAL 7 DAYS
""")

Quick start

https://github.com/databricks/docker-spark-iceberg

Case Study: Branch.io

Source: https://www.youtube.com/watch?v=WrDBI4Baw6o

Initial Data Platform Architecture

Microservices → Kafka → Flink / Spark → S3 (Parquet) + Hive Metastore → Trino

Scale Metrics:

  • 40 billion daily events
  • 100TB daily data ingestion
  • 500+ Hive tables
  • 10PB total data size

Key Components:

  • Kafka for message passing
  • Apache Flink for streaming
  • Apache Spark for batch processing
  • S3 storage with Parquet format
  • Hive Metastore for metadata
  • Trino for ad-hoc queries

Challenges with Parquet

  1. Mutability Issues:

    • Unsafe data changes requiring full partition rewrites
    • Long processing times for data deletion requests (see the sketch after this list)
    • Two-step transaction process leading to potential data visibility issues
  2. Query Performance:

    • Limited optimization options (only partition filtering)
    • Large data scans for specific customer queries
    • Complex workarounds needed for performance improvement
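
To make the mutability point concrete: with Iceberg, a deletion request is a single row-level DML statement committed as a new snapshot, whereas with plain Parquet on Hive the affected partitions have to be rewritten wholesale. A minimal sketch reusing the customer_events table from earlier (paths and IDs are purely illustrative):

# Iceberg: row-level delete, no partition rewrite required
spark.sql("DELETE FROM customer_events WHERE customer_id = 42")

# Plain Parquet/Hive (illustrative): read the partition, filter, and rewrite it.
# Staged to a separate path because Spark cannot safely overwrite a path it is reading from.
partition = "s3://bucket/events/date=2024-01-01/"          # hypothetical location
staged    = "s3://bucket/events_staged/date=2024-01-01/"   # hypothetical location
spark.read.parquet(partition) \
     .filter("customer_id != 42") \
     .write.mode("overwrite").parquet(staged)
# ...then swap the staged files into place and repair the Hive metastore entry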

Iceberg Migration Architecture

Microservices → Kafka → Flink / Spark → Iceberg tables (analytics warehouse, Iceberg metadata)

Migration Strategy:

  • Parallel table creation approach
  • Double-write strategy during migration (sketched after this list)
  • Quality check processes
  • Backfill process for historical data
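
The migration code itself isn’t public, but the double-write idea from the strategy above is easy to sketch: during the transition every batch lands in both the legacy Hive/Parquet table and the new Iceberg table, and the quality checks reconcile the two. The table names and the incoming DataFrame are assumptions:

def double_write(batch_df):
    # Legacy path: keep appending to the existing Hive/Parquet table
    batch_df.write.mode("append").insertInto("legacy.customer_events")

    # New path: append the same batch to the Iceberg table (DataFrameWriterV2)
    batch_df.writeTo("warehouse.customer_events_iceberg").append()

# Quality check: row counts for the same window should match before cutover
legacy_count  = spark.table("legacy.customer_events").count()
iceberg_count = spark.table("warehouse.customer_events_iceberg").count()
assert legacy_count == iceberg_count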

Tools Developed:

  1. Column Metrics Configuration
  2. Disaster Recovery Tool
  3. Table Maintainer Process (see the sketch after this list)
  4. Monitoring and Alerting System
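
The internals of these tools aren’t shown in the talk, but a table maintainer for Iceberg typically wraps the maintenance procedures that ship with the Spark integration, and column metrics are tuned through standard table properties. A hedged sketch of the kind of housekeeping such a process runs (the catalog and table names are placeholders):

# Compact the small files produced by streaming writes
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.customer_events')")

# Expire old snapshots to keep metadata and storage growth bounded
spark.sql("""
CALL my_catalog.system.expire_snapshots(
  table => 'db.customer_events',
  older_than => TIMESTAMP '2024-06-01 00:00:00'
)
""")

# Remove data files no longer referenced by any snapshot
spark.sql("CALL my_catalog.system.remove_orphan_files(table => 'db.customer_events')")

# Column metrics: keep detailed stats only for columns that queries actually filter on
spark.sql("""
ALTER TABLE my_catalog.db.customer_events SET TBLPROPERTIES (
  'write.metadata.metrics.default' = 'counts',
  'write.metadata.metrics.column.customer_id' = 'full',
  'write.metadata.metrics.column.event_timestamp' = 'full'
)
""")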

Benefits Achieved

  1. Performance:

    • 20x improvement for customer ID filtered queries
    • 20-50% reduction in other query times
    • 40% cost reduction for specific data products
  2. Storage:

    • 18% reduction in storage costs
    • Better utilization of S3 Intelligent-Tiering
  3. Development:

    • Simplified schema management
    • Easier merge operations
    • Improved data quality processes

This migration showcases a successful implementation of Apache Iceberg in a large-scale data platform, delivering significant improvements in performance, cost, and operational efficiency.

Shared storage

  • Allows multiple engines to access the same data directly (see the sketch after this list)
  • Eliminates need for data duplication across systems
  • Enables centralization of data storage
  • Simplifies data integration during company mergers/acquisitions
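
A hedged sketch of what the shared-storage idea looks like in practice: register an Iceberg catalog backed by a shared metastore in Spark, and any other engine pointed at the same catalog (Trino, Flink, another Spark cluster) reads and writes the very same table files. The catalog name, metastore URI, and warehouse path below are placeholders, and the Iceberg Spark runtime jar is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-iceberg-catalog")
    # Register an Iceberg catalog backed by a shared Hive Metastore
    .config("spark.sql.catalog.shared", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.shared.type", "hive")
    .config("spark.sql.catalog.shared.uri", "thrift://metastore-host:9083")
    .config("spark.sql.catalog.shared.warehouse", "s3://datalake/warehouse")
    .getOrCreate()
)

# Spark reads the table; Trino or Flink configured against the same metastore
# and warehouse query the identical files - no copies, no sync jobs
spark.table("shared.db.customer_events").show(5)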

Magnus Architecture (ByteDance’s Enhanced Iceberg):

Computing Layer:

  • ETL: Spark, Flink, Krypton
  • Training: Primus, PyTorch, TensorFlow, Ray
  • Cross-language support via Arrow

Core Layer:

  • Column-level updates
  • Git-like branching
  • High-speed vectorized merge-read engine
  • Global indexing with HBase

Storage Layer:

  • Multiple file formats support
  • 30-50% storage savings
  • HDFS/object storage backend

References