Sink data from RisingWave to Elasticsearch

You can deliver the data that has been ingested and transformed in RisingWave to Elasticsearch to serve searches or analytics. Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. It centrally stores your data for lightning-fast search, fine‑tuned relevancy, and powerful analytics that scale with ease. The Elasticsearch sink connecter in RisingWave will perform index operations via the bulk API, flushing updates whenever one of these criteria is met:

1,000 operations
5mb of updates
5 seconds since the last flush (assuming new actions are queued)

The Elasticsearch sink connector in RisingWave provides at-least-once delivery semantics. Events may be redelivered in case of failures.

Prerequisites

Ensure the Elasticsearch cluster (version 7.x or 8.x) is accessible from RisingWave.
If you are running RisingWave locally from binaries, make sure that you have JDK 11 or later versions installed in your environment.

Create an Elasticsearch sink

Use the following syntax to create an Elasticsearch sink. Once a sink is created, any insert or update to the sink will be streamed to the specified Elasticsearch endpoint.

CREATE SINK sink_name
[ FROM sink_from | AS select_query ]
WITH (
  connector = 'elasticsearch',
  primary_key = '<primary key of the sink_from object>',
  { index = '<your Elasticsearch index>' | index_column = '<your index column>' },
  url = 'http://<ES hostname>:<ES port>',
  username = '<your ES username>',
  password = '<your password>',
  delimiter='<delimiter>'
);

Parameters

Parameter	Description
sink_name	Name of the sink to be created.
sink_from	A clause that specifies the direct source from which data will be output. `sink_from` can be a materialized view or a table. Either this clause or a SELECT query must be specified.
AS select_query	A SELECT query that specifies the data to be output to the sink. Either this query or a FROM clause must be specified. See SELECT for the syntax and examples of the SELECT command.
primary_key	Optional. The primary keys of the sink. If the primary key has multiple columns, set a delimiter in the delimiter parameter below to join them.
index	Required if `index_column` is not set. Name of the Elasticsearch index that you want to write data to.
index_column	Allows writing to multiple indexes dynamically based on the column’s value. This parameter is mutually exclusive with `index`. When `index` is set, the write index is `index`; When `index_column` is set, the target index is the value of this column, and it must be of `string` type. Avoiding setting this column as the first column, because the sink defaults to the first column as the key.
routing_column	Optional. Allows setting a column as a routing key, so that it can direct writes to specific shards based on its value.
retry_on_conflict	Optional. Number of retry attempts after an optimistic locking conflict occurs.
batch_size_kb	Optional. Maximum size (in kilobytes) for each request batch sent to Elasticsearch. If the data exceeds this size，the current batch will be sent.
batch_num_messages	Optional. Maximum number of messages per request batch sent to Elasticsearch. If the number of messages exceeds this size，the current batch will be sent.
concurrent_requests	Optional. Maximum number of concurrent threads for sending requests.
url	Required. URL of the Elasticsearch REST API endpoint.
username	Optional. elastic user name for accessing the Elasticsearch endpoint. It must be used with password.
password	Optional. Password for accessing the Elasticsearch endpoint. It must be used with username.
delimiter	Optional. Delimiter for Elasticsearch ID when the sink’s primary key has multiple columns.

For versions under 8.x, there was once a parameter type. In Elasticsearch 6.x, users could directly set the type, but starting from 7.x, it is set to not recommended and the default value is unified to _doc. In version 8.x, the type has been completely removed. See Elasticsearch’s official documentation for more details. So, if you are using Elasticsearch 7.x, we set it to the official’s recommended value, which is _doc. If you are using Elasticsearch 8.x, this parameter has been removed by the Elasticsearch official, so no setting is required.

Primary keys and Elasticsearch IDs

The Elasticsearch sink defaults to the upsert sink type. It does not support the append-only sink type. If you want to customize your Elasticsearch ID, please specify it via the primary_key parameter. RisingWave will combine multiple primary key values into a single string with the delimiter you set, and use it as the Elasticsearch ID. If you don’t want to customize your Elasticsearch ID, RisingWave will use the first column in the sink definition as the Elasticsearch ID.

Data type mapping

ElasticSearch uses a mechanism called dynamic field mapping to dynamically create fields and determine their types automatically. It treats all integer types as long and all floating-point types as float. To ensure data types in RisingWave are mapped to the data types in Elasticsearch correctly, we recommend that you specify the mapping via index templates or dynamic templates before creating the sink.

RisingWave Data Type	ElasticSearch Field Type
boolean	boolean
smallint	long
integer	long
bigint	long
numeric	text
real	float
double precision	float
character varying	text
bytea	text
date	date
time without time zone	text
timestamp without time zone	text
timestamp with time zone	text
interval	text
struct	object
array	array
JSONB	object (RisingWave’s Elasticsearch sink will send JSONB as a JSON string, and Elasticsearch will convert it into an object)

Elasticsearch doesn’t require users to explicitly CREATE TABLE. Instead, it infers the schema on-the-fly based on the first record ingested. For example, if a record contains a jsonb {v1: 100}, v1 will be inferred as a long type. However, if the next record is {v1: "abc"}, the ingestion will fail because "abc" is inferred as a string and the two types are incompatible. This behavior should be noted, or your data may be less than it should be. In terms of monitoring, you can check out Grafana, where there is a panel for all sink write errors.

Get started

Working with data

Deploy & operate

Performance

Troubleshooting

Reference

Cloud

Sink data from RisingWave to Elasticsearch

Prerequisites

Create an Elasticsearch sink

Parameters

Primary keys and Elasticsearch IDs

Data type mapping

Get started

Working with data

Deploy & operate

Performance

Troubleshooting

Reference

Cloud

​Prerequisites

​Create an Elasticsearch sink

​Parameters

​Primary keys and Elasticsearch IDs

​Data type mapping

Prerequisites

Create an Elasticsearch sink

Parameters

Primary keys and Elasticsearch IDs

Data type mapping