What is a source?

A source in RisingWave is a connection to an external data system that streams data into the database. Sources define where data comes from and how it should be parsed — they are the entry point for all data ingestion in RisingWave. When you create a source with CREATE SOURCE, RisingWave establishes a connection to the external system and begins reading data according to the specified format and encoding. Sources do not persist data inside RisingWave — they provide a live window into the external data stream.

How sources work

A source is defined with CREATE SOURCE, which specifies the schema, the connector, and how incoming data is formatted and encoded:

-- Create a source that reads from a Kafka topic
CREATE SOURCE pageviews (
  user_id INT,
  page_url VARCHAR,
  view_time TIMESTAMP
) WITH (
  connector = 'kafka',
  topic = 'pageviews',
  properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

Once created, a source can be:
  • Queried directly (for supported connectors like Kafka, S3, Iceberg) — useful for ad-hoc exploration and validation.
  • Referenced in materialized views — to build continuous streaming pipelines.
  • Used as input for sinks — to transform and deliver data to downstream systems.
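For example, the pageviews source defined above could be queried ad hoc and then wired into a materialized view. A minimal sketch (the view name and aggregation are illustrative):

-- Ad-hoc exploration (supported for connectors such as Kafka)
SELECT * FROM pageviews LIMIT 10;

-- Continuous pipeline: counts views per page, updated as new events arrive
CREATE MATERIALIZED VIEW pageview_counts AS
SELECT page_url, COUNT(*) AS views
FROM pageviews
GROUP BY page_url;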

Source vs. Table

A common question is when to use a source versus a table with connector settings. The key difference is data persistence:
                                        Source                            Table (with connector)
  Persists data in RisingWave           No                                Yes
  Supports DML (INSERT, UPDATE, DELETE) No                                Yes
  Usable with CDC connectors            No (a table must be used)         Yes
  Primary key required                  No                                Required for CDC
  Best for                              Exploration, stateless pipelines  Persistent storage, CDC, updates
Rule of thumb: Use a source when you only need to stream data through to materialized views or sinks without storing the raw data. Use a table when you need to persist the raw data, use CDC connectors, or need to update/delete individual rows.
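As a sketch of the table-based alternative, the same Kafka topic could instead be ingested with CREATE TABLE, which persists the rows inside RisingWave and allows DML (the table name here is illustrative):

-- Same topic as above, but rows are persisted in RisingWave
CREATE TABLE pageviews_tbl (
  user_id INT,
  page_url VARCHAR,
  view_time TIMESTAMP
) WITH (
  connector = 'kafka',
  topic = 'pageviews',
  properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;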

Supported source connectors

RisingWave supports a wide range of source connectors:
  • Message brokers: Apache Kafka, Apache Pulsar, Amazon Kinesis, MQTT, NATS, Google Pub/Sub
  • Databases (CDC): PostgreSQL, MySQL, SQL Server, MongoDB — via native CDC connectors, no Kafka or Debezium required
  • Object storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
  • Data lakes: Apache Iceberg
  • Other: Webhooks, Datagen (for testing)
For the full list, see Supported source connectors.
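For quick experiments that need no external system, the Datagen connector mentioned above can generate synthetic rows locally. A minimal sketch (field names and rates are illustrative; consult the Datagen connector reference for the exact options):

-- Generates synthetic rows for testing, no broker needed
CREATE SOURCE test_stream (
  id INT,
  name VARCHAR
) WITH (
  connector = 'datagen',
  fields.id.kind = 'sequence',
  fields.id.start = '1',
  fields.name.kind = 'random',
  fields.name.length = '8',
  datagen.rows.per.second = '5'
) FORMAT PLAIN ENCODE JSON;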

Data formats and encoding

Sources support multiple data formats:
  • JSON — most common; schema inferred or specified in DDL
  • Avro — requires schema registry URL (Kafka sources)
  • Protobuf — requires schema specification
  • CSV — for file-based sources
  • Bytes — raw bytes for custom parsing
  • Debezium JSON / Maxwell JSON — for CDC events from Kafka topics

-- Avro source with schema registry
CREATE SOURCE events WITH (
  connector = 'kafka',
  topic = 'events',
  properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE AVRO (
  schema.registry = 'http://schema-registry:8081'
);
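
CDC events arriving through a Kafka topic in Debezium JSON land in a table with a primary key rather than a source, since updates and deletes must be applied to stored rows. A sketch (the topic and column names are illustrative):

-- Debezium-formatted CDC events from Kafka require a table with a primary key
CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  status VARCHAR
) WITH (
  connector = 'kafka',
  topic = 'orders_cdc',
  properties.bootstrap.server = 'broker:9092'
) FORMAT DEBEZIUM ENCODE JSON;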