Apache Iceberg is an open table format that provides reliable, high-performance storage for data lakes. RisingWave integrates with Iceberg to combine real-time computation with open data lake storage. You can work with external Iceberg tables managed by other systems or create internal Iceberg tables managed directly by RisingWave.

External Iceberg tables

External Iceberg tables are managed outside of RisingWave, such as S3 Tables, Snowflake-managed Iceberg tables, Databricks-managed Iceberg tables, or self-managed Iceberg deployments. RisingWave connects to these tables through their catalogs and treats them as data sources or data sinks.

Reading from external Iceberg tables

RisingWave can continuously read data from append-only Iceberg tables. It monitors snapshots and automatically loads newly appended data, allowing you to consume the table as a live data stream.
Example
CREATE SOURCE orders_src
WITH (
  connector = 'iceberg',
  warehouse.path = 's3://lakehouse/warehouse',
  database.name = 'sales',
  table.name = 'orders',
  catalog.type = 'glue',
  s3.region = 'us-west-2'
);

Ad hoc analytics on Iceberg data

After the source is created, you can query it directly with SQL. RisingWave retrieves the current snapshot of the Iceberg table at query time.
-- Inspect recent orders for a specific product
SELECT order_id, user_id, amount, order_ts
FROM orders_src
WHERE product_id = 12345
ORDER BY order_ts DESC
LIMIT 20;

-- Aggregate daily revenue over the past week
SELECT date_trunc('day', order_ts) AS day, SUM(amount) AS revenue
FROM orders_src
WHERE order_ts >= now() - interval '7 days'
GROUP BY 1
ORDER BY day;
These queries run on demand against the latest Iceberg snapshot, making RisingWave useful for interactive analytics without setting up a separate ETL process.

Continuous analytics with materialized views

For real-time, incremental analytics, create a materialized view on the Iceberg source. RisingWave automatically keeps the view up to date as new snapshots are committed to the Iceberg table.
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT
  date_trunc('day', order_ts) AS day,
  SUM(amount) AS total_sales,
  COUNT(DISTINCT user_id) AS active_users
FROM orders_src
GROUP BY 1;
As RisingWave ingests new Iceberg snapshots, the view's results update incrementally. You can query the view directly:
SELECT * FROM mv_daily_sales ORDER BY day DESC LIMIT 10;

This approach provides low-latency analytics on Iceberg data while maintaining compatibility with the underlying catalog and storage system.

Writing to external Iceberg tables

RisingWave can also write query results or materialized view outputs to Iceberg tables. The resulting data remains fully compatible with other Iceberg engines such as Spark, Trino, and DuckDB.
  • Supports append-only, upsert, and force-append-only data modes
  • Guarantees exactly-once delivery
  • Can perform optional file compaction for efficiency
Example
CREATE SINK daily_sales_sink FROM mv_daily_sales
WITH (
  connector = 'iceberg',
  type = 'append-only',
  warehouse.path = 's3://lakehouse/warehouse',
  database.name = 'sales',
  table.name = 'daily_sales',
  catalog.type = 'rest',
  catalog.uri = 'http://lakekeeper:8181',
  enable_compaction = true
);

With this configuration, RisingWave acts as a real-time transformation layer between streaming systems and Iceberg storage.
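For tables that receive updates rather than pure appends, the sink can be created in upsert mode instead. The sketch below is illustrative: the source view orders_enriched and its key column order_id are assumed names, and the primary_key option designates the column used to deduplicate rows in the target table.
-- Hypothetical upsert sink; orders_enriched and order_id are example names
CREATE SINK orders_upsert_sink FROM orders_enriched
WITH (
  connector = 'iceberg',
  type = 'upsert',
  primary_key = 'order_id',
  warehouse.path = 's3://lakehouse/warehouse',
  database.name = 'sales',
  table.name = 'orders_enriched',
  catalog.type = 'rest',
  catalog.uri = 'http://lakekeeper:8181'
);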

Internal Iceberg tables

Internal Iceberg tables are created and managed directly by RisingWave. They behave like standard RisingWave tables but store their data in Iceberg format on object storage. This allows you to persist computed or aggregated results in an open format that can be accessed by other query engines.

Creating internal Iceberg tables

You can create internal Iceberg tables using standard SQL syntax.
CREATE TABLE fact_orders (
  order_id BIGINT,
  user_id BIGINT,
  amount DOUBLE PRECISION,
  ts TIMESTAMP
) ENGINE = iceberg;
RisingWave automatically manages schema, metadata, and data persistence. Data is stored in Parquet format and can be queried by any Iceberg-compatible engine such as Spark, Trino, or DuckDB. You can query, join, and build materialized views on internal tables just like any other RisingWave table.
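For instance, continuing the fact_orders example above, an incrementally maintained rollup can be built directly on the Iceberg-backed table (the view name and aggregation are illustrative):
-- Incrementally maintained aggregate over the Iceberg-backed table
CREATE MATERIALIZED VIEW mv_user_spend AS
SELECT user_id, SUM(amount) AS total_spend
FROM fact_orders
GROUP BY user_id;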

Catalog service

RisingWave provides two hosted catalog options for managing Iceberg metadata, schema versions, and table state:
  • JDBC hosted catalog — backed by RisingWave’s internal PostgreSQL-compatible metastore. See JDBC hosted catalog
  • REST hosted catalog — powered by Lakekeeper and compatible with the Iceberg REST catalog API. See REST hosted catalog
Both options allow external Iceberg engines to read and write RisingWave-managed tables using standard Iceberg protocols. If you prefer to use an existing metadata system, RisingWave also supports external catalogs such as AWS Glue, Hive Metastore, or Nessie. For details, see Catalog configuration.
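As a sketch of the interoperability this enables, an external engine such as Spark SQL, once configured against the hosted REST catalog, could read a RisingWave-managed table with plain SQL. The catalog name rw below is an assumption; the database and table names follow the fact_orders example above.
-- From an external engine (e.g. Spark SQL) with RisingWave's
-- REST catalog registered under the assumed name `rw`
SELECT order_id, amount
FROM rw.sales.fact_orders
LIMIT 10;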

Compaction service

RisingWave provides a built-in compaction service that automatically merges small Parquet files, expires outdated snapshots, and maintains efficient file layouts. This ensures good query performance and stable storage usage during continuous ingestion.

Using RisingWave’s compaction service is optional. You can also connect an external compactor such as Tabular’s Iceberg compactor, Databricks-managed compaction, Amazon EMR, or a self-hosted Spark job. When using an external compactor, RisingWave writes data in a compaction-friendly format that allows other systems to safely perform maintenance. Compaction can be enabled or disabled per table or configured globally, depending on performance and cost requirements.
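For example, when a table is maintained by an external compactor, RisingWave's built-in compaction can be turned off on the sink via the enable_compaction option (shown enabled in the sink example earlier; the sink and table names here are illustrative):
-- Hypothetical sink whose files are compacted externally
CREATE SINK raw_events_sink FROM raw_events
WITH (
  connector = 'iceberg',
  type = 'append-only',
  warehouse.path = 's3://lakehouse/warehouse',
  database.name = 'sales',
  table.name = 'raw_events',
  catalog.type = 'rest',
  catalog.uri = 'http://lakekeeper:8181',
  enable_compaction = false
);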

Catalog and compaction summary

Component          | Default option                   | Alternative options                          | Description
-------------------|----------------------------------|----------------------------------------------|------------------------------------------------------
Catalog service    | RisingWave built-in REST catalog | Glue, Hive, Nessie, or custom REST catalogs  | Stores metadata and schema information
Compaction service | RisingWave built-in compactor    | Amazon EMR or self-hosted Spark              | Optimizes file layout and merges small Parquet files

Typical architecture

[ Kafka / CDC / APIs ]
          |
          v
    RisingWave SQL Engine
   ├─ Reads from external Iceberg
   ├─ Performs real-time computation
   ├─ Builds materialized views
   └─ Writes results to internal or external Iceberg
          |
          v
[ Iceberg Tables in Object Storage ]

RisingWave connects streaming systems with Iceberg-based data lakes. Use external Iceberg tables to analyze or enrich existing datasets, and internal Iceberg tables to persist computed results in an open, queryable format.

Summary

Capability         | External Iceberg tables                  | Internal Iceberg tables
-------------------|------------------------------------------|--------------------------------------------
Read support       | Continuous and ad-hoc queries            | Supported
Write support      | Append, upsert, or force-append-only     | Fully managed by RisingWave
Catalog ownership  | External system                          | RisingWave or external catalog
Compaction         | Optional via sink or external compactor  | Optional via RisingWave or external compactor
Interoperability   | Compatible with other Iceberg engines    | Compatible with the Iceberg standard
Typical use        | Connect to existing Iceberg data         | Persist computed or aggregated data

Choosing between external and internal tables

  • Use external Iceberg tables if you already have an Iceberg environment such as S3 Tables, Snowflake, or Databricks, and want RisingWave to process or update that data.
  • Use internal Iceberg tables if you want RisingWave to handle both computation and Iceberg data management with its built-in catalog and compaction services.
  • Combine both approaches to build a unified, real-time lakehouse architecture.

Next steps

  • Read from Iceberg tables
  • Write to Iceberg tables
  • Create and manage internal Iceberg tables
  • Configure catalogs and compaction services