Read data from Iceberg tables
Learn how to ingest data from existing Iceberg tables into RisingWave using the Iceberg source connector.
This guide explains how to connect RisingWave to existing Iceberg tables for batch and streaming ingestion as part of the Bring Your Own Iceberg approach. Use this when you have Iceberg tables created and managed by external systems (such as Spark, Flink, or batch ETL jobs) and want to bring that data into RisingWave for real-time processing.
When to use Iceberg sources
Choose Iceberg sources when:
- Existing data lake: You have Iceberg tables populated by other systems that you want to ingest into RisingWave.
- Lambda/Kappa architecture: You want to combine batch-processed data (in Iceberg) with real-time streams.
- Multi-engine integration: Different systems write to Iceberg tables and RisingWave needs to process that data.
- Historical data ingestion: You need to ingest large amounts of historical data stored in Iceberg format.
Prerequisites
- An existing Apache Iceberg table managed by external systems.
- Access credentials for the underlying storage system (e.g., S3 access key and secret key).
- Network connectivity between RisingWave and your storage system.
- Knowledge of your Iceberg catalog type and configuration.
Basic connection example
The following example shows how to connect to an Iceberg table stored on S3 using AWS Glue as the catalog:
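A minimal sketch, assuming an S3 warehouse registered in Glue; the source name, bucket, credentials, and table identifiers are illustrative:

```sql
CREATE SOURCE my_iceberg_source
WITH (
    connector = 'iceberg',
    catalog.type = 'glue',
    warehouse.path = 's3://my-bucket/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = 'your-access-key',
    s3.secret.key = 'your-secret-key',
    database.name = 'analytics',
    table.name = 'user_events'
);
```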
Replace the placeholders with your actual values.
RisingWave automatically derives column names and data types from the Iceberg table metadata. Use the DESCRIBE statement to view the schema:
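For example, with the source defined above:

```sql
DESCRIBE my_iceberg_source;
```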
Configuration examples
AWS Glue catalog
For tables managed by AWS Glue Data Catalog:
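A sketch mirroring the basic example; setting catalog.type to 'glue' tells RisingWave to resolve the table through the Glue Data Catalog:

```sql
CREATE SOURCE glue_iceberg_source
WITH (
    connector = 'iceberg',
    catalog.type = 'glue',
    warehouse.path = 's3://my-bucket/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = 'your-access-key',
    s3.secret.key = 'your-secret-key',
    database.name = 'analytics',
    table.name = 'user_events'
);
```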
REST catalog
For tables managed by a REST catalog service:
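A sketch assuming a REST catalog service reachable at a placeholder URI:

```sql
CREATE SOURCE rest_iceberg_source
WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'http://rest-catalog-host:8181',
    warehouse.path = 's3://my-bucket/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = 'your-access-key',
    s3.secret.key = 'your-secret-key',
    database.name = 'analytics',
    table.name = 'user_events'
);
```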
JDBC catalog
For tables managed by a JDBC catalog (PostgreSQL/MySQL):
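A sketch assuming the catalog metadata lives in a PostgreSQL database; the JDBC URI, catalog name, and credentials are placeholders:

```sql
CREATE SOURCE jdbc_iceberg_source
WITH (
    connector = 'iceberg',
    catalog.type = 'jdbc',
    catalog.uri = 'jdbc:postgresql://catalog-host:5432/iceberg_catalog',
    catalog.jdbc.user = 'catalog_user',
    catalog.jdbc.password = 'catalog_password',
    catalog.name = 'my_catalog',
    warehouse.path = 's3://my-bucket/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = 'your-access-key',
    s3.secret.key = 'your-secret-key',
    database.name = 'analytics',
    table.name = 'user_events'
);
```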
Storage catalog
For tables using direct filesystem metadata:
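A sketch for a catalog-less setup where table metadata is read directly from the warehouse path:

```sql
CREATE SOURCE storage_iceberg_source
WITH (
    connector = 'iceberg',
    catalog.type = 'storage',
    warehouse.path = 's3://my-bucket/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = 'your-access-key',
    s3.secret.key = 'your-secret-key',
    database.name = 'analytics',
    table.name = 'user_events'
);
```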
Query data
Once created, you can query data from the Iceberg source:
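For example, assuming the source created earlier:

```sql
-- Ad-hoc batch scan of the current table snapshot
SELECT * FROM my_iceberg_source LIMIT 10;

-- Aggregate over the full table
SELECT count(*) FROM my_iceberg_source;
```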
Create streaming jobs
Streaming ingestion is supported for append-only Iceberg tables. If you created the source before RisingWave v2.3, you might need to recreate it to enable streaming functionality.
You can create materialized views that continuously process data from the Iceberg source:
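A sketch, assuming my_iceberg_source is append-only and exposes a user_id column:

```sql
-- Continuously maintained aggregation over the Iceberg source
CREATE MATERIALIZED VIEW user_event_counts AS
SELECT user_id, count(*) AS event_count
FROM my_iceberg_source
GROUP BY user_id;
```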
Time travel
Query historical snapshots of your Iceberg tables:
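A sketch of the batch time-travel syntax; the timestamp and snapshot ID below are placeholders:

```sql
-- Read the table as of a wall-clock time
SELECT * FROM my_iceberg_source
FOR SYSTEM_TIME AS OF TIMESTAMPTZ '2025-01-01 00:00:00+00:00';

-- Read a specific snapshot by its ID
SELECT * FROM my_iceberg_source
FOR SYSTEM_VERSION AS OF 1234567890123456789;
```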
System tables
Access Iceberg metadata through system tables:
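A sketch, assuming the rw_iceberg_snapshots and rw_iceberg_files system tables available in recent RisingWave versions:

```sql
-- Inspect snapshots of the table backing the source
SELECT * FROM rw_iceberg_snapshots
WHERE source_name = 'my_iceberg_source';

-- Inspect the data files behind each snapshot
SELECT * FROM rw_iceberg_files
WHERE source_name = 'my_iceberg_source';
```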
Configuration parameters
Required parameters
| Parameter | Description | Example |
|---|---|---|
| `connector` | Must be `'iceberg'` | `'iceberg'` |
| `database.name` | Iceberg database/namespace name | `'analytics'` |
| `table.name` | Iceberg table name | `'user_events'` |
Optional parameters
| Parameter | Description | Default |
|---|---|---|
| `commit_checkpoint_interval` | Commit every N checkpoints | 60 |
Storage and catalog configuration
For detailed configuration options:
- Object storage: Object storage configuration
- Catalogs: Catalog configuration
Integration patterns
Lambda architecture pattern
Combine batch and streaming data processing:
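A hypothetical sketch: kafka_events is assumed to be an existing streaming source with the same columns as the Iceberg table:

```sql
-- Merge batch history (Iceberg) with the live stream (Kafka)
CREATE MATERIALIZED VIEW all_events AS
SELECT user_id, event_time, payload FROM my_iceberg_source
UNION ALL
SELECT user_id, event_time, payload FROM kafka_events;
```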
Multi-engine data lake
Connect to tables managed by multiple systems:
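A hypothetical sketch: two sources over tables written by different engines but registered in the same Glue catalog (storage credentials omitted for brevity):

```sql
CREATE SOURCE spark_events WITH (
    connector = 'iceberg',
    catalog.type = 'glue',
    warehouse.path = 's3://my-bucket/warehouse',
    database.name = 'spark_db',
    table.name = 'events'
    -- plus s3.* credentials as in the earlier examples
);

CREATE SOURCE flink_metrics WITH (
    connector = 'iceberg',
    catalog.type = 'glue',
    warehouse.path = 's3://my-bucket/warehouse',
    database.name = 'flink_db',
    table.name = 'metrics'
    -- plus s3.* credentials as in the earlier examples
);
```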
Best practices
- Monitor checkpoint intervals: Adjust `commit_checkpoint_interval` based on your latency requirements.
- Use time travel for debugging: Leverage historical snapshots to troubleshoot data issues.
- Combine with real-time sources: Create comprehensive views that merge batch and stream data.
- Optimize query patterns: Structure your materialized views to match your query patterns.
- Handle schema evolution: Be prepared for schema changes in upstream Iceberg tables.
Limitations
- Append-only streaming: Only append-only Iceberg tables support streaming ingestion.
- Schema changes: Major schema changes may require recreating the source.
- Catalog permissions: Ensure RisingWave has read access to your catalog and storage.
Troubleshooting
Connection issues
- Verify catalog configuration and connectivity.
- Check storage permissions and network access.
- Ensure credentials are correct.
Schema issues
- Use `DESCRIBE` to verify the derived schema.
- Check for unsupported data types.
- Verify that the table exists in the specified database.
Performance issues
- Monitor checkpoint intervals and adjust if needed.
- Consider partitioning in your source tables.
- Review query patterns and create appropriate indexes.
Next steps
- Set up a sink: Write processed data back to Iceberg with Write to Iceberg.
- Configure catalogs: Review Catalog configuration for your specific setup.
- Storage setup: Configure your object storage in Object storage configuration.
- Explore managed approach: Consider RisingWave Managed Iceberg for new tables.