What is streaming ETL?
Streaming ETL (Extract, Transform, Load) is the continuous process of extracting data from source systems, transforming it in real time, and loading the results into downstream destinations — without waiting for batch schedules. Unlike traditional batch ETL that runs on hourly or daily intervals, streaming ETL processes each event as it arrives, keeping downstream systems always up to date.

Streaming ETL vs. batch ETL
| | Batch ETL | Streaming ETL |
|---|---|---|
| Processing frequency | Scheduled (hourly, daily) | Continuous (event-by-event) |
| Data freshness | Minutes to hours old | Seconds old |
| Latency | High | Low |
| Complexity | Simpler (bounded datasets) | More complex (unbounded streams, ordering, late data) |
| Tools | Spark, Airflow, dbt | RisingWave, Flink, Kafka Streams + Connect |
| Best for | Historical reporting, large backfills | Real-time analytics, monitoring, operational systems |
A streaming ETL pipeline
A typical streaming ETL pipeline has three stages:

1. Extract
Continuously ingest data from source systems:
- Database CDC: Capture row-level changes from PostgreSQL, MySQL, SQL Server, or MongoDB.
- Message brokers: Consume events from Kafka, Pulsar, Kinesis, or MQTT.
- Object storage: Read files from S3, GCS, or Azure Blob.
- Webhooks: Receive push-based events from SaaS applications.
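As a concrete sketch of the extract stage, the following RisingWave SQL ingests row-level changes from a PostgreSQL table using the native `postgres-cdc` connector. All names, hosts, and credentials here are illustrative placeholders, not values from this article:

```sql
-- Hypothetical example: continuously capture changes from a PostgreSQL
-- table into RisingWave (connection details are placeholders).
CREATE TABLE orders (
  order_id   BIGINT PRIMARY KEY,
  user_id    BIGINT,
  amount     DECIMAL,
  created_at TIMESTAMP
) WITH (
  connector = 'postgres-cdc',
  hostname = 'db.example.com',
  port = '5432',
  username = 'rw_user',
  password = 'secret',
  database.name = 'shop',
  schema.name = 'public',
  table.name = 'orders'
);
```

Once created, `orders` stays in sync with the upstream table as inserts, updates, and deletes occur — no Debezium or Kafka in between.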
2. Transform
Apply real-time transformations to the data:
- Filtering: Remove irrelevant or invalid records.
- Enrichment: Join streaming data with reference tables or other streams.
- Aggregation: Compute running totals, averages, counts, or percentiles.
- Deduplication: Remove duplicate events.
- Schema mapping: Rename, cast, or restructure fields for the target system.
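Several of these transformations can be combined in a single incrementally maintained materialized view. The sketch below assumes hypothetical `orders` and `users` tables; the names and columns are illustrative:

```sql
-- Hypothetical example: filter, enrich, and aggregate in one
-- materialized view that updates as new events arrive.
CREATE MATERIALIZED VIEW revenue_by_country AS
SELECT
  u.country,
  COUNT(*)      AS order_count,    -- aggregation: running counts
  SUM(o.amount) AS total_revenue   -- aggregation: running totals
FROM orders AS o
JOIN users AS u ON o.user_id = u.id  -- enrichment: join with reference data
WHERE o.amount > 0                   -- filtering: drop invalid records
GROUP BY u.country;
```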
3. Load
Deliver transformed results to downstream systems:
- Data warehouses: Snowflake, BigQuery, ClickHouse.
- Data lakes: Apache Iceberg, Delta Lake, S3 (Parquet/JSON).
- Databases: PostgreSQL, MySQL, Elasticsearch, Redis.
- Message brokers: Kafka, Pulsar (for further downstream consumption).
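As a sketch of the load stage, a `CREATE SINK` statement streams a transformed result (here, a hypothetical `revenue_by_country` materialized view) to Kafka. The topic, broker address, and key column are placeholders:

```sql
-- Hypothetical example: continuously deliver an updating result set
-- to a Kafka topic (connection details are placeholders).
CREATE SINK revenue_to_kafka FROM revenue_by_country
WITH (
  connector = 'kafka',
  topic = 'revenue-by-country',
  properties.bootstrap.server = 'broker:9092',
  primary_key = 'country'
) FORMAT UPSERT ENCODE JSON;
```

`FORMAT UPSERT` is used because an aggregated view emits updates, not just appends; the sink keys each change by `country` so downstream consumers can apply it correctly.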
Streaming ETL with RisingWave
RisingWave enables streaming ETL entirely in SQL — no Java code, no workflow orchestrators, no separate transformation and loading tools.

Why RisingWave for streaming ETL?
- All-in-one system: Extract, transform, and load in a single SQL-based platform — no Kafka Connect, no Airflow, no Flink.
- Cascading materialized views: Build multi-layered transformation pipelines where each layer updates incrementally.
- Native CDC: Connect directly to source databases without Debezium or Kafka.
- Multiple destinations: Sink to Iceberg, Snowflake, PostgreSQL, ClickHouse, Kafka, S3, and many more — simultaneously.
- SQL-native: Any SQL-proficient team member can build and maintain streaming ETL pipelines.
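The cascading materialized views mentioned above can be sketched as a two-layer pipeline, where each layer reads from the one before it and updates incrementally. The table and column names below are illustrative assumptions:

```sql
-- Hypothetical sketch of a layered pipeline (names are illustrative).
-- Layer 1: cleansing — drop records with missing amounts.
CREATE MATERIALIZED VIEW clean_events AS
SELECT event_id, user_id, event_time, amount
FROM raw_events
WHERE amount IS NOT NULL;

-- Layer 2: windowed aggregation over the cleansed layer.
CREATE MATERIALIZED VIEW hourly_totals AS
SELECT window_start, SUM(amount) AS total
FROM TUMBLE(clean_events, event_time, INTERVAL '1 hour')
GROUP BY window_start;
```

When a row lands in `raw_events`, both layers update incrementally; there is no orchestrator rerunning the whole chain.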
Common streaming ETL use cases
- Real-time data warehouse ingestion: Keep analytics tables fresh with continuous updates instead of nightly batch loads.
- CDC-based replication: Replicate and transform database changes across systems in real time.
- Event stream enrichment: Join clickstream or IoT data with reference tables before loading into analytics systems.
- Data lake ingestion: Continuously write transformed data into Iceberg or Delta Lake tables with automatic compaction.
- Cross-system synchronization: Keep multiple downstream systems (search, cache, analytics) in sync with source-of-truth databases.
Related topics
- What is Stream Processing? — Foundations of continuous data processing
- What is CDC? — Native CDC connectors for real-time database replication
- Data ingestion — All ingestion patterns and connectors
- Data delivery — Sink connectors and delivery semantics