Continuous streaming ingestion
What it is: Real-time, continuous data ingestion from streaming sources that automatically updates as new data arrives.
When to use: For real-time analytics, event-driven applications, live dashboards, and when you need immediate data freshness.
Example: Kafka streaming ingestion
Example: Database CDC (Change Data Capture)
Example: Message queues (MQTT, NATS, Pulsar)
One-time batch ingestion
What it is: Loading data once from external sources like databases, data lakes, or files.
When to use: For initial data loads, historical data import, or when you need to load static datasets.
Example: Batch load from a database
The postgres-cdc connector can be used to perform a one-time snapshot of a PostgreSQL table. For other databases, such as MySQL, you can use the corresponding CDC connector and set snapshot.mode to initial_only.
Example: Load from cloud storage (S3, GCS, Azure)
Example: Load from a data lake (Iceberg)
Periodic ingestion with external orchestration
What it is: RisingWave doesn’t have a built-in scheduler, but you can achieve periodic ingestion using external orchestration tools like Cron or Airflow.
When to use: For scheduled data updates, daily/hourly batch processing, or when you need precise control over ingestion timing.
Example: Incremental loading pattern
A common pattern is to use a control table to track the last load time and only ingest new data.
Other ingestion methods
Direct data insertion
You can always insert data directly into a standard table using the INSERT statement.
Test data generation
For development and testing, you can use the built-in datagen connector to generate mock data streams.
Ingestion method support matrix
| Data Source | Continuous Streaming | One-Time Batch | Periodic | Notes |
|---|---|---|---|---|
| Apache Kafka | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
| Redpanda | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
| Apache Pulsar | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
| AWS Kinesis | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
| Google Pub/Sub | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
| NATS JetStream | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
| MQTT | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
| PostgreSQL CDC | ✅ | ✅ | ⚠️ | CDC for streaming; direct connection for batch |
| MySQL CDC | ✅ | ✅ | ⚠️ | CDC for streaming; direct connection for batch |
| SQL Server CDC | ✅ | ✅ | ⚠️ | CDC for streaming; direct connection for batch |
| MongoDB CDC | ✅ | ✅ | ⚠️ | CDC for streaming; direct connection for batch |
| AWS S3 | ❌ | ✅ | ⚠️ | Batch only; periodic via external tools |
| Google Cloud Storage | ❌ | ✅ | ⚠️ | Batch only; periodic via external tools |
| Azure Blob | ❌ | ✅ | ⚠️ | Batch only; periodic via external tools |
| Apache Iceberg | ❌ | ✅ | ⚠️ | Batch only; periodic via external tools |
| Datagen | ✅ | ❌ | ❌ | Test data generation only |
| Direct INSERT | ❌ | ✅ | ⚠️ | Manual insertion; periodic via external tools |
- ✅ Natively Supported: Built-in support for this ingestion method.
- ❌ Not Supported: This ingestion method is not available for this source.
- ⚠️ External Tools Required: Requires external orchestration tools (e.g., Cron, Airflow).
Best practices
Choose the right method
- Streaming: Use for real-time requirements and continuous data flows.
- Batch: Use for historical data, large one-time loads, or static datasets.
- Periodic: Use for scheduled updates with external orchestration tools.
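For the periodic case, the control-table pattern described under incremental loading can be sketched as a single scheduled run. The example below is a minimal Python illustration, not RisingWave's own mechanism: it uses the standard-library sqlite3 module as a stand-in for both the external source and the target so that it runs anywhere, and the table names (orders, orders_copy, load_control) and the updated_at column are hypothetical.

```python
import sqlite3


def incremental_load(source, target):
    """Copy rows newer than the stored watermark from source to target.

    Returns the number of rows copied in this run.
    """
    # Read the last load time from the control table (hypothetical schema).
    (last_loaded,) = target.execute(
        "SELECT last_load_time FROM load_control WHERE job = 'orders'"
    ).fetchone()

    # Fetch only rows changed after the watermark. Timestamps are ISO-8601
    # strings here, so plain string comparison orders them correctly.
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_loaded,),
    ).fetchall()
    if not rows:
        return 0

    # Insert the new rows and advance the watermark in one transaction,
    # so a failed run does not move the watermark past unloaded data.
    with target:
        target.executemany(
            "INSERT INTO orders_copy (id, amount, updated_at) VALUES (?, ?, ?)",
            rows,
        )
        target.execute(
            "UPDATE load_control SET last_load_time = ? WHERE job = 'orders'",
            (rows[-1][2],),
        )
    return len(rows)


def demo():
    """Set up in-memory stand-ins and run the load three times."""
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.0, "2024-01-01T00:00:00"), (2, 20.0, "2024-01-02T00:00:00")],
    )
    target.execute(
        "CREATE TABLE orders_copy (id INTEGER, amount REAL, updated_at TEXT)"
    )
    target.execute(
        "CREATE TABLE load_control (job TEXT PRIMARY KEY, last_load_time TEXT)"
    )
    target.execute(
        "INSERT INTO load_control VALUES ('orders', '1970-01-01T00:00:00')"
    )

    first = incremental_load(source, target)   # loads both existing rows
    source.execute("INSERT INTO orders VALUES (3, 30.0, '2024-01-03T00:00:00')")
    second = incremental_load(source, target)  # loads only the new row
    third = incremental_load(source, target)   # nothing new to load
    return first, second, third
```

An external scheduler such as Cron or Airflow would call something like incremental_load on each tick. Note that the strict `>` comparison assumes no late-arriving rows share the watermark timestamp; a real job may need an overlap window or a tie-breaking key.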
Performance considerations
- Streaming ingestion gives the freshest data, because changes are applied as they arrive rather than on a schedule.
- Batch loading is efficient for large datasets that do not need continuous updates.
- Use materialized views to pre-compute and store results for fast querying.
Data consistency
- CDC provides high-fidelity replication of database changes.
- For message queues, understand the delivery guarantees (e.g., at-least-once) of your system.
- Use transactions for atomic operations when inserting data manually.
- Monitor data quality and set up alerts.
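To make the at-least-once point concrete: under that guarantee the same message may be delivered more than once, so whatever consumes it has to tolerate duplicates. The following is a minimal Python sketch of deduplication by message ID. It is independent of any particular broker and does not describe how RisingWave handles duplicates internally; the message_id field is an assumed unique key set by the producer.

```python
def apply_once(messages, seen_ids=None):
    """Return the messages whose IDs have not been processed before.

    `messages` is an iterable of (message_id, payload) pairs. The ID is a
    hypothetical unique key assigned by the producer.
    """
    seen_ids = set() if seen_ids is None else seen_ids
    applied = []
    for message_id, payload in messages:
        if message_id in seen_ids:
            continue  # redelivered duplicate: skip it
        seen_ids.add(message_id)
        applied.append((message_id, payload))
    return applied


# Simulated delivery in which message 2 is redelivered after a retry.
delivered = [(1, "created"), (2, "paid"), (2, "paid"), (3, "shipped")]
unique = apply_once(delivered)
```

In practice the set of seen IDs would need to be persisted or bounded; an in-memory set is used here only to keep the sketch short.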
Monitoring and operations
- Monitor streaming lag for real-time sources to ensure data freshness.
- Track batch job success and failure rates.
- Set up alerts for data quality issues.
- Use RisingWave’s system tables and dashboards for monitoring.
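As a rough illustration of the streaming-lag check, the sketch below compares the timestamp of the newest ingested event with the current time and flags the source as stale once the gap exceeds a threshold. It is an assumption-laden example: how you obtain the newest event time (for instance, from a maximum over an event-time column) depends on your schema, and the function names and threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def streaming_lag(latest_event_time, now=None):
    """Return how far the newest ingested event trails the current time."""
    now = now or datetime.now(timezone.utc)
    return now - latest_event_time


def is_stale(latest_event_time, threshold, now=None):
    """Return True when the lag exceeds the allowed threshold."""
    return streaming_lag(latest_event_time, now) > threshold


# Fixed timestamps so the example is deterministic.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 1, 11, 58, 30, tzinfo=timezone.utc)

lag = streaming_lag(latest, now)                       # 90 seconds
fresh_enough = not is_stale(latest, timedelta(minutes=5), now)
too_old = is_stale(latest, timedelta(minutes=1), now)
```

A check like this could run from the same external scheduler used for periodic ingestion and feed whatever alerting system is already in place.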