Overview of data delivery

To stream data out of RisingWave, you must create a sink. A sink is an external target that you can send data to. Use the CREATE SINK statement to create a sink. You need to specify what data to be exported, the format, and the sink parameters. Sinks become visible right after you create them, regardless of the backfilling status. Therefore, it’s important to understand that the data in the sinks may not immediately reflect the latest state of their upstream sources due to the latency of the sink, connector, and backfilling process. To determine whether the process is complete and the data in the sink is consistent, refer to Monitor statement progress. Currently, RisingWave supports the following sink connectors:

Sink Connector	Connector Parameter
Apache Doris	`connector = 'doris'`
Apache Iceberg	`connector = 'iceberg'`
AWS Kinesis	`connector = 'kinesis'`
Cassandra and ScyllaDB	`connector = 'cassandra'`
ClickHouse	`connector = 'clickhouse'`
CockroachDB	`connector = 'jdbc'`
Delta Lake	`connector = 'deltalake'`
Elasticsearch	`connector = 'elasticsearch'`
Google BigQuery	`connector = 'bigquery'`
Google Pub/Sub	`connector = 'google_pubsub'`
JDBC: MySQL \| PostgreSQL \| TiDB	`connector = 'jdbc'`
Kafka	`connector = 'kafka'`
MQTT	`connector = 'mqtt'`
NATS	`connector = 'nats'`
Pulsar	`connector = 'pulsar'`
Redis	`connector = 'redis'`
Amazon Redshift	`connector = 'redshift'`
Snowflake	`connector = 'snowflake'`
Snowflake v2	`connector = 'snowflake_v2'`
StarRocks	`connector = 'starrocks'`
Microsoft SQL Server	`connector = 'sqlserver'`

Sink decoupling

Typically, sinks in RisingWave operate in a blocking manner. This means that if the downstream target system experiences performance fluctuations or becomes unavailable, it can potentially impact the stability of the RisingWave instance. However, sink decoupling can be implemented to address this issue. Sink decoupling introduces a buffering queue between a RisingWave sink and the downstream system. This buffering mechanism helps maintain the stability and performance of the RisingWave instance, even when the downstream system is temporarily slow or unavailable. The sink_decouple session variable can be specified to enable or disable sink decoupling. The default value for the session variable is default. To enable sink decoupling for all sinks created in the sessions, set sink_decouple as true or enable.

SET sink_decouple = true;

To disable sink decoupling, set sink_decouple as false or disable, regardless of the default setting.

SET sink_decouple = false;

Sink decoupling is enabled by default for all sinks in RisingWave. An internal system table rw_sink_decouple is provided to query whether a created sink has enabled sink decoupling or not.

dev=> select sink_id, is_decouple from rw_sink_decouple;
 sink_id | is_decouple
---------+-------------
       2 | f
       5 | t
(2 rows)

Upsert sinks and primary keys

For each sink, you can specify the data format. The available data formats are upsert, append-only, and debezium. To determine which data format is supported by each sink connector, please refer to the detailed guide listed above. In the upsert sink, a non-null value updates the last value for the same key or inserts a new value if the key doesn’t exist. A NULL value indicates the deletion of the corresponding key. When creating an upsert sink, note whether or not you need to specify the primary key in the following situations.

If the downstream system supports primary keys and the table in the downstream system has a primary key, you must specify the primary key with the primary_key field when creating an upsert JDBC sink.
If the downstream system supports primary keys but the table in the downstream system has no primary key, then RisingWave does not allow users to create an upsert sink. A primary key must be defined in the table in the downstream system.
If the downstream system does not support primary keys, then users must define the primary key when creating an upsert sink.

Sink buffering behavior

A sink will buffer incoming data within each barrier interval if the stream’s internal primary key (stream key) differs from the user-defined sink primary key and the sink is not append-only. This mismatch can cause update events for the same sink key to be split across multiple upstream fragments in a distributed execution, leading to out-of-order operations if sent directly. Buffering allows the sink to compact and reorder updates so that delete events are emitted before insert events, ensuring correct semantics. If the stream key matches the sink primary key exactly, or if the sink is configured as append-only or force_append_only, no buffering is performed and the sink emits changes immediately as they arrive.

Sink data in parquet or json format

RisingWave supports sinking data in Parquet or JSON formats to cloud storage services, including S3, Google Cloud Storage (GCS), Azure Blob Storage, and WebHDFS. This eliminates the need for complex data lake setups. Once the data is saved, the files can be queried using RisingWave’s batch processing engine through the file_scan API. You can also leverage third-party OLAP query engines to enhance data processing capabilities. Below is an example to sink data to S3:

CREATE SINK test_file_sink FROM test
WITH (
    connector = 's3',
    s3.region_name = '{config['S3_REGION']}',
    s3.bucket_name = '{config['S3_BUCKET']}',
    s3.credentials.access = '{config['S3_ACCESS_KEY']}',
    s3.credentials.secret = '{config['S3_SECRET_KEY']}',
    s3.endpoint_url = 'https://{config['S3_ENDPOINT']}'
    s3.path = '',
    type = 'append-only',
    force_append_only='true'
) FORMAT PLAIN ENCODE PARQUET(force_append_only='true');

File sink currently supports only append-only mode, so please change the query to append-only and specify this explicitly after the FORMAT ... ENCODE ... statement.

Batching strategy for file sink

Added in v2.1.0.

RisingWave implements batching strategies for file sinks, including S3, Google Cloud Storage (GCS), Azure Blob Storage, and WebHDFS. This optimizes file management by preventing the generation of numerous small files. The batching strategy is available for Parquet, JSON, and CSV encode.

File organization

You can specify path_partition_prefix option in the WITH clause to organize files into subdirectories based on their creation time. The available options are month, day, or hour. If not specified, files will be stored directly in the root directory without any time-based subdirectories. Regarding file naming rules, currently, files follow the naming pattern /Option<path_partition_prefix>/executor_id + timestamp.suffix. Timestamp differentiates files batched by the rollover interval. The output files look like below:

path/2024-09-20/47244640257_1727072046.parquet
path/2024-09-20/47244640257_1727072055.parquet

Example

CREATE SINK s1 
FROM t
WITH (
    connector = 's3',
    max_row_count = '100',
    rollover_seconds = '10',
    type = 'append-only',
    path_partition_prefix = 'day'
) FORMAT PLAIN ENCODE PARQUET (force_append_only=true);

In this example, if the number of rows in the file exceeds 100, or if writing has continued for more than 10 seconds, the writing of this file will be completed. Once completed, the file will be visible in the downstream sink system.

Get started

Work with data

Deploy & operate

Performance

Troubleshooting

Reference

Cloud

Overview of data delivery

Sink decoupling

Upsert sinks and primary keys

Sink buffering behavior

Sink data in parquet or json format

Batching strategy for file sink

Category

File organization

Example

Get started

Work with data

Deploy & operate

Performance

Troubleshooting

Reference

Cloud

​Sink decoupling

​Upsert sinks and primary keys

​Sink buffering behavior

​Sink data in parquet or json format

​Batching strategy for file sink

​Category

​File organization

​Example

Sink decoupling

Upsert sinks and primary keys

Sink buffering behavior

Sink data in parquet or json format

Batching strategy for file sink

Category

File organization

Example