Apache Iceberg is an open table format for large analytic tables, typically stored on object storage. RisingWave provides comprehensive support for working with Iceberg tables in your streaming pipelines, with two distinct approaches designed for different use cases and architectural patterns.

The pipeline way: use Iceberg as sources/sinks

This approach is ideal when you have existing Iceberg tables created by systems like Spark, Flink, or batch jobs, and you want RisingWave to read from or write to them as part of a larger data ecosystem.
  • Use cases:
    • Existing data lakes with Iceberg tables managed by other systems.
    • Multi-system architectures where multiple applications need to read/write the same Iceberg tables.
    • Integration into existing data workflows and pipelines.
  • Key capabilities (see the example after this list):
    • Read from Iceberg: Ingest data from existing Iceberg tables into RisingWave for stream processing.
    • Write to Iceberg: Stream processed results from RisingWave into existing Iceberg tables.
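
As a rough illustration of this pattern, the sketch below defines an Iceberg source and an Iceberg sink in RisingWave SQL. All object names, bucket paths, and credentials are hypothetical placeholders, and the exact connector parameters depend on your catalog type and storage backend, so treat it as a starting point rather than a copy-paste recipe.

    -- Read an existing Iceberg table into RisingWave (hypothetical names and credentials).
    CREATE SOURCE iceberg_source
    WITH (
        connector = 'iceberg',
        catalog.type = 'storage',
        warehouse.path = 's3://my-bucket/warehouse',
        s3.region = 'us-east-1',
        s3.access.key = 'xxx',
        s3.secret.key = 'yyy',
        database.name = 'demo_db',
        table.name = 'demo_table'
    );

    -- Stream the results of a materialized view into an existing Iceberg table.
    CREATE SINK iceberg_sink FROM processed_mv
    WITH (
        connector = 'iceberg',
        type = 'upsert',
        primary_key = 'id',
        catalog.type = 'storage',
        warehouse.path = 's3://my-bucket/warehouse',
        s3.region = 'us-east-1',
        s3.access.key = 'xxx',
        s3.secret.key = 'yyy',
        database.name = 'demo_db',
        table.name = 'results_table'
    );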

The database way: create and manage Iceberg tables natively

Choose this approach when you want RisingWave to be the primary owner of your Iceberg tables. RisingWave handles table creation, schema management, and the complete lifecycle while storing data in the standard Iceberg format.
  • Key benefits:
    • Simplified architecture: No external catalog setup required with the hosted catalog option.
    • Streaming-first: Direct path from streaming sources to Iceberg format.
    • Native management: Tables work like any other RisingWave table for queries and operations.
    • Ecosystem compatibility: Standard Iceberg tables readable by Spark, Trino, Flink, etc.
  • Key capabilities (see the example after this list):
    • Iceberg table engine: Create tables using ENGINE = iceberg to store data natively in the Iceberg format.
    • Hosted Iceberg catalog: Use RisingWave’s built-in catalog service to eliminate external catalog setup.
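
A minimal sketch of this flow is shown below. It assumes a connection parameter along the lines of hosted_catalog to enable the built-in catalog and a session variable such as iceberg_engine_connection to point the table engine at that connection; treat these exact names as assumptions and check the current RisingWave documentation for your version. All object names are hypothetical.

    -- Create a connection that uses RisingWave's hosted Iceberg catalog (hypothetical names).
    CREATE CONNECTION my_iceberg_conn WITH (
        type = 'iceberg',
        warehouse.path = 's3://my-bucket/warehouse',
        s3.region = 'us-east-1',
        s3.access.key = 'xxx',
        s3.secret.key = 'yyy',
        hosted_catalog = true
    );

    -- Tell the Iceberg table engine which connection to use.
    SET iceberg_engine_connection = 'public.my_iceberg_conn';

    -- Create a table whose data is stored in the Iceberg format but managed by RisingWave.
    CREATE TABLE user_events (
        user_id BIGINT,
        event_type VARCHAR,
        event_time TIMESTAMPTZ,
        PRIMARY KEY (user_id, event_time)
    ) ENGINE = iceberg;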

Understanding RisingWave’s Iceberg integration

Storage architecture

RisingWave’s own internal storage system (Hummock) also persists data on object storage (such as S3), but in a row-based format optimized for RisingWave’s internal operations. When you work with Iceberg, data is stored or accessed in the columnar Iceberg format on object storage, which is designed for analytical workloads and ecosystem interoperability.

Advanced features

Both approaches support advanced Iceberg features:
    • Time travel: Query historical snapshots of your data (see the example after this list).
  • Schema evolution: Handle changing table schemas over time.
  • Partitioning: Optimize query performance with table partitioning.
  • Multiple storage backends: S3, Google Cloud Storage, Azure Blob Storage.
  • Various catalog types: Hosted, JDBC, AWS Glue, REST, Storage, Hive, Snowflake.
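
As an illustration of time travel, the sketch below queries an Iceberg source as of a past point in time and as of a specific snapshot. It assumes the hypothetical iceberg_source defined earlier and assumes a FOR SYSTEM_TIME / FOR SYSTEM_VERSION batch-read clause for Iceberg sources; verify the exact syntax against the RisingWave version you run.

    -- Query the table as it existed at a given timestamp (hypothetical source name).
    SELECT * FROM iceberg_source
    FOR SYSTEM_TIME AS OF TIMESTAMPTZ '2024-06-01 00:00:00+00';

    -- Or pin the read to a specific Iceberg snapshot id.
    SELECT * FROM iceberg_source
    FOR SYSTEM_VERSION AS OF 1234567890123456789;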