> ## Documentation Index
> Fetch the complete documentation index at: https://docs.risingwave.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Data formats and encoding options

> Configure RisingWave to handle various data formats and encodings from your sources.

When connecting to a data source, you specify how the data is formatted and encoded using the `FORMAT` and `ENCODE` keywords in your `CREATE SOURCE` or `CREATE TABLE` statement.  This page explains the supported combinations and how to configure them.

## Understanding `FORMAT` and `ENCODE`

* **`FORMAT`:** Specifies the high-level structure of the data. Options:
  * `PLAIN`: Standard formats like JSON, Avro, Protobuf, CSV, Bytes, or Parquet.
  * `UPSERT`: A stream of updates and deletes (key-value pairs), typically from Kafka.
  * `DEBEZIUM`: The Debezium change data capture (CDC) format.
  * `DEBEZIUM_MONGO`: The Debezium CDC format for MongoDB.

* **`ENCODE`:** Specifies the specific data encoding. Options:
  * `JSON`: JavaScript Object Notation.
  * `AVRO`: Apache Avro.
  * `PROTOBUF`: Protocol Buffers.
  * `CSV`: Comma-Separated Values.
  * `BYTES`: Raw byte stream (no decoding).
  * `PARQUET`: Apache Parquet

`FORMAT` and `ENCODE` work together to define how RisingWave interprets incoming data.

## Supported combinations

| FORMAT           | ENCODE     | Description                                                 |
| :--------------- | :--------- | :---------------------------------------------------------- |
| `PLAIN`          | `JSON`     | Standard JSON data.                                         |
| `UPSERT`         | `JSON`     | JSON for updates/deletes (key-value pairs).                 |
| `PLAIN`          | `AVRO`     | Standard Avro data (requires schema registry).              |
| `UPSERT`         | `AVRO`     | Avro for updates/deletes (requires schema registry).        |
| `DEBEZIUM`       | `AVRO`     | Debezium-formatted Avro (requires schema registry).         |
| `PLAIN`          | `PROTOBUF` | Standard Protobuf data (requires schema location/registry). |
| `DEBEZIUM`       | `JSON`     | Debezium-formatted JSON data.                               |
| `DEBEZIUM_MONGO` | `JSON`     | Debezium-formatted JSON data for MongoDB.                   |
| `PLAIN`          | `CSV`      | Comma-Separated Values.                                     |
| `PLAIN`          | `BYTES`    | Raw byte stream.                                            |
| `PLAIN`          | `PARQUET`  | Apache Parquet                                              |

## General considerations

* **Schema Registry:** For Avro and Protobuf, you often need a schema registry (Confluent Schema Registry or AWS Glue Schema Registry).  The `schema.registry` parameter (where applicable) can accept multiple addresses; RisingWave tries each until it finds the schema.
* **`TopicNameStrategy`:**  For Avro, RisingWave uses the `TopicNameStrategy` by default for the schema registry, looking for a schema with the subject name `{topic name}-value`.

## JSON handling

RisingWave directly decodes JSON data. Different `FORMAT` options add specific behaviors:

### `PLAIN JSON`

Standard JSON ingestion. Define the schema in your `CREATE SOURCE`/`CREATE TABLE` statement or use a schema registry.

```sql theme={null}
FORMAT PLAIN
ENCODE JSON [ (
   schema.registry = 'schema_registry_url [, ...]',
   [schema.registry.username = 'username'],
   [schema.registry.password = 'password']
) ]
```

## `UPSERT JSON`

For Kafka topics with updates/deletes (key-value pairs). RisingWave inserts, updates, or deletes rows based on the key and value.

When using `FORMAT UPSERT`, you must include the `INCLUDE KEY` clause and define a primary key. This format can be used with both `CREATE SOURCE` and `CREATE TABLE`.

```sql theme={null}
FORMAT UPSERT
ENCODE JSON [ (
   schema.registry = 'schema_registry_url [, ...]',
   [schema.registry.username = 'username'],
   [schema.registry.password = 'password']
) ]
```

**Example with CREATE SOURCE:**

```sql Example theme={null}
CREATE SOURCE upsert_source (
    PRIMARY KEY (rw_key),
    id INT,
    name VARCHAR
)
INCLUDE KEY AS rw_key
WITH (
    connector='kafka',
    topic='user_updates',
    properties.bootstrap.server='localhost:9092'
)
FORMAT UPSERT ENCODE JSON;
```

<Note>
  When using `FORMAT UPSERT` with `CREATE SOURCE`, be aware that upsert sources cannot be used directly with stateful operations like JOIN, aggregations, or window functions. For these use cases, either materialize the source as a view first, or use `CREATE TABLE` instead. See [Upsert sources](/sql/commands/sql-create-source#upsert-sources) for details.
</Note>

### `DEBEZIUM JSON`

For Debezium CDC data. You can usually define the schema directly. The `ignore_key` option (default: false) lets you consume only the payload.

```sql theme={null}
FORMAT DEBEZIUM
ENCODE JSON [ (
   [ ignore_key = 'true | false ' ]
) ]
```

### `DEBEZIUM_MONGO JSON`

When loading data from MongoDB via Kafka topics in Debezium Mongo JSON format, the source table schema has a few limitations. The table schema must have the columns \_id and payload, where \_id comes from the MongoDB document's id and is the primary key, and payload is type jsonb and contains the rest of the document. If the document's \_id is type ObjectID, then when creating the column in RisingWave, specify the type of \_id as varchar. If the document's \_id is of type int32 or int64, specify the type of \_id as int or bigint in RisingWave.

```sql theme={null}
FORMAT DEBEZIUM_MONGO
ENCODE JSON
```

## AVRO handling

RisingWave supports Apache Avro. You must use a schema registry with Avro.

### `PLAIN AVRO`

Standard Avro ingestion.

```sql theme={null}
FORMAT PLAIN
ENCODE AVRO (
    schema.registry = 'schema_registry_url [, ...]',
)
```

You can ingest Avro map type into RisingWave map type or jsonb:

```sql theme={null}
FORMAT [ DEBEZIUM | UPSERT | PLAIN ] ENCODE AVRO (
    map.handling.mode = 'map' | 'jsonb'
)
```

### `UPSERT AVRO`

Handles Avro updates/deletes (key-value pairs) from Kafka.

```sql theme={null}
FORMAT UPSERT
ENCODE AVRO (
   schema.registry = 'schema_registry_url [, ...]',
)
```

### `DEBEZIUM AVRO`

For Debezium-formatted Avro. Specify the message and schema.registry.

```sql theme={null}
FORMAT DEBEZIUM
ENCODE AVRO (
    message = 'main_message',
    schema.registry = 'schema_registry_url [, ...]',
    [ignore_key = 'true | false']
)
```

## Protobuf handling

### `PLAIN PROTOBUF`

For data in protobuf format, you must specify a message (fully qualified by package path) and a schema location. The schema location can be an actual Web location that is in `http://...`, `https://...`, or `S3://...` format. For Kafka data in protobuf, instead of providing a schema location, you can provide a Confluent Schema Registry that RisingWave can get the schema from.
For more details about using Schema Registry for Kafka data, see [Kafka source configurations](/ingestion/sources/kafka-config).

If you provide a file location, the schema file must be a `FileDescriptorSet`, which can be compiled from a `.proto` file with a command like this:

```bash theme={null}
protoc -I=$include_path --include_imports --descriptor_set_out=schema.pb schema.proto
```

```sql theme={null}
FORMAT PLAIN
ENCODE PROTOBUF (
   message = 'com.example.MyMessage',
   schema.location = 'location' | schema.registry = 'schema_registry_url [, ...]',
)
```

### CSV handling

#### `PLAIN CSV`

To consume data in CSV format, you can use `ENCODE PLAIN FORMAT CSV` with options. Configurable options include delimiter and without\_header.

```sql theme={null}
FORMAT PLAIN
ENCODE CSV (
    delimiter = 'delimiter',
    without_header = 'false' | 'true'
)
```

## Bytes handling

### `PLAIN BYTES`

RisingWave allows you to read data streams without decoding the data by using the `BYTES` row format. However, the table or source can have exactly one field of `BYTEA` data.

```sql theme={null}
FORMAT PLAIN
ENCODE BYTES
```

## Parquet handling

### `PLAIN PARQUET`

Parquet format allows you to efficiently store and retrieve large datasets by utilizing a columnar storage architecture. RisingWave supports reading Parquet files from object storage systems including Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage.

```sql theme={null}
FORMAT PLAIN
ENCODE PARQUET
```

When using a file source to read Parquet files, the schema must be defined according to the following data type mappings between Parquet and RisingWave:

| Parquet data type       | RisingWave file source data type |
| :---------------------- | :------------------------------- |
| boolean                 | boolean                          |
| int16                   | smallint                         |
| int32                   | int                              |
| int64                   | bigint                           |
| float                   | real                             |
| double                  | double precision                 |
| string                  | varchar                          |
| date                    | date                             |
| decimal                 | decimal                          |
| int8                    | smallint                         |
| uint8                   | smallint                         |
| uint16                  | int                              |
| uint32                  | bigint                           |
| uint64                  | decimal                          |
| float16                 | double precision                 |
| `timestamp(_, Some(_))` | timestamptz                      |
| `timestamp(_, None)`    | timestamp                        |

<Note>
  Parquet sources require case-sensitive column names. However, PostgreSQL converts unquoted column names to lowercase by default. To preserve case sensitivity when defining the schema, use double quotes around column names.
</Note>

## Parameter reference

| FORMAT   | ENCODE   | Parameter                     | Description                                                                                                                                                                                             | Required     |
| :------- | :------- | :---------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :----------- |
| PLAIN    | AVRO     | `schema.registry`             | URL(s) of the schema registry.                                                                                                                                                                          | Yes          |
| PLAIN    | AVRO     | `schema.registry.username`    | The username for schema registry. Optional, but if set it must be used together with `schema.registry.password`.                                                                                        | No           |
| PLAIN    | AVRO     | `schema.registry.password`    | The password for schema registry. Optional, but if set it must be used together with `schema.registry.username`.                                                                                        | No           |
| PLAIN    | AVRO     | `schema.registry.ca_pem_path` | Path to a custom root CA certificate file (PEM) for the schema registry TLS connection, or `'ignore'` to skip certificate verification (not recommended for production). Supports `SECRET ... AS FILE`. | No           |
| UPSERT   | AVRO     | `schema.registry`             | URL(s) of the schema registry.                                                                                                                                                                          | Yes          |
| UPSERT   | AVRO     | `schema.registry.username`    | The username for schema registry. Required if `schema.registry.password` is set.                                                                                                                        | No           |
| UPSERT   | AVRO     | `schema.registry.password`    | The password for schema registry. Required if `schema.registry.username` is set.                                                                                                                        | No           |
| UPSERT   | AVRO     | `schema.registry.ca_pem_path` | Path to a custom root CA certificate file (PEM) for the schema registry TLS connection, or `'ignore'` to skip certificate verification (not recommended for production). Supports `SECRET ... AS FILE`. | No           |
| DEBEZIUM | AVRO     | `schema.registry`             | URL(s) of the schema registry.                                                                                                                                                                          | Yes          |
| DEBEZIUM | AVRO     | `schema.registry.username`    | The username for schema registry. Required if `schema.registry.password` is set; configure together with `schema.registry.password`.                                                                    | No           |
| DEBEZIUM | AVRO     | `schema.registry.password`    | The password for schema registry. Required if `schema.registry.username` is set; configure together with `schema.registry.username`.                                                                    | No           |
| DEBEZIUM | AVRO     | `schema.registry.ca_pem_path` | Path to a custom root CA certificate file (PEM) for the schema registry TLS connection, or `'ignore'` to skip certificate verification (not recommended for production). Supports `SECRET ... AS FILE`. | No           |
| DEBEZIUM | AVRO     | `message`                     | The main message name.                                                                                                                                                                                  | Yes          |
| DEBEZIUM | AVRO     | `ignore_key`                  | Whether to ignore the key (default: false).                                                                                                                                                             | No           |
| PLAIN    | JSON     | `schema.registry`             | URL(s) of the schema registry (optional).                                                                                                                                                               | No           |
| PLAIN    | JSON     | `schema.registry.username`    | The username for schema registry (optional).                                                                                                                                                            | No           |
| PLAIN    | JSON     | `schema.registry.password`    | The password for schema registry (optional).                                                                                                                                                            | No           |
| PLAIN    | JSON     | `schema.registry.ca_pem_path` | Path to a custom root CA certificate file (PEM) for the schema registry TLS connection, or `'ignore'` to skip certificate verification (not recommended for production). Supports `SECRET ... AS FILE`. | No           |
| UPSERT   | JSON     | `schema.registry`             | URL(s) of the schema registry (optional).                                                                                                                                                               | No           |
| UPSERT   | JSON     | `schema.registry.username`    | The username for schema registry (optional).                                                                                                                                                            | No           |
| UPSERT   | JSON     | `schema.registry.password`    | The password for schema registry (optional).                                                                                                                                                            | No           |
| UPSERT   | JSON     | `schema.registry.ca_pem_path` | Path to a custom root CA certificate file (PEM) for the schema registry TLS connection, or `'ignore'` to skip certificate verification (not recommended for production). Supports `SECRET ... AS FILE`. | No           |
| PLAIN    | PROTOBUF | `message`                     | Fully qualified name of the Protobuf message.                                                                                                                                                           | Yes          |
| PLAIN    | PROTOBUF | `schema.location`             | URL of the schema file (alternative to `schema.registry`).                                                                                                                                              | Yes (one of) |
| PLAIN    | PROTOBUF | `schema.registry`             | URL(s) of the schema registry (alternative to `schema.location`).                                                                                                                                       | Yes (one of) |
| PLAIN    | PROTOBUF | `schema.registry.ca_pem_path` | Path to a custom root CA certificate file (PEM) for the schema registry TLS connection, or `'ignore'` to skip certificate verification (not recommended for production). Supports `SECRET ... AS FILE`. | No           |
| DEBEZIUM | JSON     | `ignore_key`                  | Whether to ignore the key (default: false).                                                                                                                                                             | No           |
| PLAIN    | CSV      | `delimiter`                   | Delimiter character.                                                                                                                                                                                    | Yes          |
| PLAIN    | CSV      | `without_header`              | Whether the CSV data has a header row (default: false).                                                                                                                                                 | No           |
| PLAIN    | AVRO     | `map.handling.mode`           | How to ingest Avro map type. Available values: 'map'(default) and 'jsonb'.                                                                                                                              | No           |
| UPSERT   | AVRO     | `map.handling.mode`           | How to ingest Avro map type. Available values: 'map'(default) and 'jsonb'.                                                                                                                              | No           |
| DEBEZIUM | AVRO     | `map.handling.mode`           | How to ingest Avro map type. Available values: 'map'(default) and 'jsonb'.                                                                                                                              | No           |

## General parameters

### `timestamptz.handling.mode`

The `timestamptz.handling.mode` parameter controls the input format for timestamptz values. It accepts the following values:

* `micro`: The input number will be interpreted as the number of microseconds since 1970-01-01T00:00:00Z in UTC.

* `milli`: The input number will be interpreted as the number of milliseconds since 1970-01-01T00:00:00Z in UTC.

* `guess_number_unit`: This has been the default setting and restricts the range of timestamptz values to \[1973-03-03 09:46:40, 5138-11-16 09:46:40) in UTC.

* `utc_string`: This format is the least ambiguous and can usually be correctly inferred without needing explicit specification.

* `utc_without_suffix`: Allows the user to indicate that a naive timestamp is in UTC, rather than local time.

You can set this parameter for these three combinations:

* `FORMAT PLAIN ENCODE JSON`
* `FORMAT UPSERT ENCODE JSON`
* `FORMAT DEBEZIUM ENCODE JSON`

You cannot set this parameter for `FORMAT DEBEZIUM_MONGO ENCODE JSON`.
