Bring Your Own Iceberg
Learn how to connect RisingWave to existing Iceberg tables managed by external systems, enabling you to read from and write to your existing data lake.
Bring Your Own Iceberg (BYOI) refers to scenarios where external systems manage the Iceberg tables and RisingWave connects to them as a client. In this approach, you have existing Iceberg tables (typically created by Spark, Flink, batch ETL jobs, or other systems) and you want RisingWave to read from or write to these tables.
When to use Bring Your Own Iceberg
Choose this approach when:
- Existing data lake: You already have Iceberg tables managed by other systems (Spark, Flink, dbt, etc.).
- Multi-system architecture: Multiple applications/engines need to read/write the same Iceberg tables.
- Integration requirements: You need to integrate RisingWave into existing data workflows and pipelines.
- External catalog infrastructure: You have existing catalog services (AWS Glue, JDBC databases, REST catalogs) managing your Iceberg metadata.
- Data lake ingestion: You want to stream processed data from RisingWave into your existing data lake.
- Batch+Stream hybrid: Combining batch processing (other systems) with stream processing (RisingWave) on the same tables.
Key capabilities
Read from Iceberg tables (Iceberg Source)
Ingest data from existing Iceberg tables into RisingWave for stream processing, as shown in the example after this list:
- Streaming ingestion: Continuously read new data as it’s added to the Iceberg table.
- Time travel: Read historical snapshots of the data.
- Schema evolution: Automatically handle schema changes in the source table.
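For illustration, the sketch below creates a source on an existing Iceberg table, assuming a storage-based catalog on S3. The warehouse path, database, table, and credential values are placeholders; the exact parameter set depends on your catalog type and storage configuration (see the catalog and object storage guides below).

```sql
CREATE SOURCE user_events_iceberg
WITH (
    connector = 'iceberg',
    catalog.type = 'storage',                     -- assumed: filesystem-based catalog
    warehouse.path = 's3://my-bucket/warehouse',  -- placeholder warehouse location
    database.name = 'analytics_db',               -- placeholder database
    table.name = 'user_events',                   -- placeholder table
    s3.region = 'us-east-1',                      -- placeholder region
    s3.access.key = '<access-key>',               -- placeholder credentials
    s3.secret.key = '<secret-key>'
);
```

No column definitions are declared here because RisingWave derives the schema from the Iceberg table's metadata. Once created, the source can be queried directly or used as input to materialized views.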
Write to Iceberg tables (Iceberg Sink)
Stream processed results from RisingWave into existing Iceberg tables, as sketched after this list:
- Multiple sink modes: Append-only or upsert, depending on your use case.
- Exactly-once delivery: Ensure data consistency with configurable delivery semantics.
- External accessibility: Data written by RisingWave is immediately available to other systems.
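As a sketch, an upsert sink from a materialized view into an existing Iceberg table could look like the following. The view name, key column, warehouse path, and credentials are placeholders, and the required parameters vary with your catalog type.

```sql
CREATE SINK enriched_events_to_lake FROM enriched_events_mv  -- placeholder materialized view
WITH (
    connector = 'iceberg',
    type = 'upsert',                              -- or 'append-only'
    primary_key = 'event_id',                     -- required for upsert sinks; placeholder column
    catalog.type = 'storage',
    warehouse.path = 's3://my-bucket/warehouse',  -- placeholder warehouse location
    database.name = 'analytics_db',               -- placeholder database
    table.name = 'user_events_enriched',          -- placeholder target table
    s3.region = 'us-east-1',
    s3.access.key = '<access-key>',
    s3.secret.key = '<secret-key>'
);
```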
Supported catalog types
BYOI works with various external catalog systems (see the configuration fragments after this list):
- AWS Glue: Managed metadata service on AWS
- JDBC catalogs: PostgreSQL, MySQL, or other JDBC-compatible databases storing metadata
- REST catalogs: RESTful catalog services including AWS S3 Tables
- Storage catalogs: Direct filesystem-based metadata (S3/HDFS)
- Hive Metastore: Traditional Hadoop ecosystem catalog
- Snowflake: Snowflake-managed Iceberg catalogs
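To give a flavor of how the connection properties change across catalogs, the fragments below sketch the catalog-related parameters for a few common setups. They are meant to be combined with the storage and table parameters shown in the earlier examples; hosts, URIs, and credentials are placeholders, and the full parameter list for each catalog type is covered in the catalog configuration guide.

```sql
-- AWS Glue (fragment): metadata in Glue, data files in S3.
catalog.type = 'glue',
warehouse.path = 's3://my-bucket/warehouse',

-- REST catalog (fragment): endpoint is a placeholder.
catalog.type = 'rest',
catalog.uri = 'http://rest-catalog-host:8181',

-- JDBC catalog (fragment): connection values are placeholders.
catalog.type = 'jdbc',
catalog.uri = 'jdbc:postgresql://catalog-db-host:5432/iceberg_catalog',
catalog.jdbc.user = 'iceberg',
catalog.jdbc.password = '<password>',

-- Hive Metastore (fragment): thrift URI is a placeholder.
catalog.type = 'hive',
catalog.uri = 'thrift://hive-metastore-host:9083',
```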
Architecture patterns
Lambda architecture
- Batch layer: Spark/Flink writes to Iceberg tables.
- Speed layer: RisingWave reads from Iceberg (batch results) and streams real-time updates.
- Serving layer: Analytics tools query the combined results.
Multi-engine data lake
- Multiple writers: Different systems write to the same Iceberg tables.
- Multiple readers: Various engines read from shared tables.
- RisingWave role: Provides real-time streaming capabilities to the data lake.
ETL/ELT pipelines
- Extract: RisingWave reads from various sources.
- Transform: Stream processing in RisingWave.
- Load: Write results to the existing data lake via an Iceberg sink, as sketched below.
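As a sketch of how these steps fit together in RisingWave SQL, the pipeline below assumes a hypothetical Kafka topic of orders and an existing Iceberg warehouse; the topic, broker, table, and credential values are all placeholders.

```sql
-- Extract: ingest a stream (hypothetical Kafka topic and broker).
CREATE SOURCE orders (
    order_id BIGINT,
    amount DOUBLE PRECISION,
    ordered_at TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka-broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- Transform: incremental hourly aggregation as a materialized view.
CREATE MATERIALIZED VIEW hourly_revenue AS
SELECT window_start, SUM(amount) AS revenue
FROM TUMBLE(orders, ordered_at, INTERVAL '1 hour')
GROUP BY window_start;

-- Load: stream the aggregated results into an existing Iceberg table.
CREATE SINK hourly_revenue_to_lake FROM hourly_revenue
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'window_start',
    catalog.type = 'storage',
    warehouse.path = 's3://my-bucket/warehouse',  -- placeholder warehouse location
    database.name = 'analytics_db',
    table.name = 'hourly_revenue',
    s3.region = 'us-east-1',
    s3.access.key = '<access-key>',
    s3.secret.key = '<secret-key>'
);
```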
What’s included in this section
- Read from Iceberg: Complete guide to creating Iceberg sources for data ingestion.
- Write to Iceberg: Complete guide to creating Iceberg sinks for data export.
- Catalog Configuration: Setup and configuration for external catalog systems.
- Object Storage Configuration: Configuration for S3, GCS, and Azure Blob storage.
Next steps
- Identify your catalog: Determine what catalog system manages your existing Iceberg tables.
- Start with reading: Create an Iceberg source to ingest existing data.
- Add streaming outputs: Set up Iceberg sinks to export processed results.
- Configure catalogs: Review catalog configuration for your specific setup.
Comparing approaches: If you want RisingWave to create and manage Iceberg tables directly, see RisingWave Managed Iceberg instead.