Overview of data processing

It involves applying various operations such as filtering, aggregating, and joining data to derive meaningful information. In the upcoming section, we will delve into the techniques used in the process of transforming and querying data.

Declare data processing logic in SQL

RisingWave uses Postgres-compatible SQL as the interface for declaring data processing. We are committed to making it easy to use while being powerful in expression. Easy to use: By aligning with PostgreSQL’s syntax, functions, and data types, RisingWave eases the learning curve and enhances accessibility for users.Even when creating stream tasks, there is no need to know many concepts related to streaming processing. Powerful: RisingWave fully supports and optimizes a variety of SQL features, including advanced features such OVER window and different kinds of JOINs. At the same time, we are also committed to expanding the expressive power of SQL, such as by adding semi-structured data types and corresponding expressions.

Ad hoc (on read) vs. Streaming (on write)

There are 2 execution modes in our system serving different analytics purposes. The results of these two modes are the same and the difference lies in the timing of data processing, whether it occurs at the time of data ingestion(on write) or when the query is executed(on read).

Understanding execution modes

Streaming: RisingWave allows users to predefine SQL queries with CREATE MATERIALIZED VIEW statement. RisingWave continuously listens changes in upstream tables (in the FROM clause) and incrementally update the results automatically. Ad-hoc: Also like traditional databases, RisingWave allows users to send SELECT statement to query the result. At this point, RisingWave reads the data from the current snapshot, processes it, and returns the results.

Both modes have their unique advantages. Here are some considerations: Cost & Performance: Compared to traditional databases, streaming mode can pre-compute and store results. The heavy lifting is done upfront, eliminating duplicate computation over each ad hoc query, and therefore has better performance and a lower cost. Flexibility: The streaming mode is less flexible to changes in query requirements. The ad-hoc query usually is created on-the-fly to fulfill immediate and specific information needs. Unlike predefined queries, ad-hoc queries are generated in real-time based on your current requirements. They are commonly used in data analysis, decision-making, and exploratory data tasks, where flexibility and quick access to information are crucial.

Examples

Create a table

To illustrate, let’s consider a hypothetical scenario where we have a table called sales_data. This table stores information about product IDs (product_id) and their corresponding sales amounts (sales_amount).

table of sales_data

product_id | sales_amount
------------+--------------
          1 |           75
          2 |          150
          2 |          125
          1 |          100
          3 |          200

You can use the following statement to create this table.

CREATE TABLE sales_data (
    product_id INT,
    sales_amount INT
);

INSERT INTO sales_data (product_id, sales_amount)
VALUES
    (1, 100),
    (1, 75),
    (2, 150),
    (2, 125),
    (3, 200);

Create a materialized view to build continuous streaming pipeline

Based on the sales_data table, we can create a materialized view called mv_sales_summary to calculate the total sales amount for each product.

CREATE MATERIALIZED VIEW mv_sales_summary AS
SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY product_id;

By the SQL statement above, you have successfully transformed the data from the sales_data table into a materialized view called mv_sales_summary. This materialized view provides the total sales amount for each product. Utilizing materialized views allows for precomputing and storing aggregated data, which in turn improves query performance and simplifies data analysis tasks.

Ad-hoc query on materialized view’s result

Then we can directly query the materialized view to retrieve the transformed data:

SELECT * FROM mv_sales_summary;

----RESULT
 product_id | total_sales
------------+-------------
          1 |         175
          2 |         275
          3 |         200
(3 rows)

Also, analysts or applications can send more flexible and complex ad-hoc queries, such as querying how many products have higher sales volumes than a specific product

SELECT count(*) FROM mv_sales_summary where mv_sales_summary.total_sales >
    (SELECT total_sales FROM mv_sales_summary where product_id = 1);

----RESULT
 count
-------
     2
(1 row)

Get started

Working with data

Deploy & operate

Performance

Troubleshooting

Reference

Cloud

Overview of data processing

Declare data processing logic in SQL

Ad hoc (on read) vs. Streaming (on write)

Understanding execution modes

Examples

Create a table

Create a materialized view to build continuous streaming pipeline

Ad-hoc query on materialized view’s result

Get started

Working with data

Deploy & operate

Performance

Troubleshooting

Reference

Cloud

​Declare data processing logic in SQL

​Ad hoc (on read) vs. Streaming (on write)

​Understanding execution modes

​Examples

​Create a table

​Create a materialized view to build continuous streaming pipeline

​Ad-hoc query on materialized view’s result

Declare data processing logic in SQL

Ad hoc (on read) vs. Streaming (on write)

Understanding execution modes

Examples

Create a table

Create a materialized view to build continuous streaming pipeline

Ad-hoc query on materialized view’s result