A JOIN clause, also known as a join, combines the results of two or more table expressions based on certain conditions, such as whether the values of some columns are equal.
SET streaming_force_filter_inside_join = true
.
t.v2 > t2.v4 + 1000
is evaluated within the join rather than afterward.
=
) and one inequality condition (>=
, >
, <=
, or <
). The inequality condition applies to all data types that support inequality comparison while a time-related type is commonly used.
For example, suppose you have two tables:
stock_prices
: Contains stock price data at certain timestamps.
stock_name | stock_time | price |
---|---|---|
TSLA | 2024-09-24 09:30:00 | 250 |
TSLA | 2024-09-24 10:30:00 | 252 |
TSLA | 2024-09-24 11:30:00 | 255 |
AMZN | 2024-09-24 09:30:00 | 3300 |
AMZN | 2024-09-24 10:30:00 | 3310 |
AMZN | 2024-09-24 11:30:00 | 3320 |
GOOG | 2024-09-24 09:30:00 | 1400 |
GOOG | 2024-09-24 10:30:00 | 1410 |
GOOG | 2024-09-24 11:30:00 | 1420 |
market_data
: Contains market sentiment data at different timestamps.
stock_name | market_time | sentiment |
---|---|---|
TSLA | 2024-09-24 09:00:00 | 0.7 |
TSLA | 2024-09-24 10:00:00 | 0.8 |
TSLA | 2024-09-24 11:00:00 | 0.9 |
AMZN | 2024-09-24 09:00:00 | 0.6 |
AMZN | 2024-09-24 10:00:00 | 0.65 |
AMZN | 2024-09-24 11:00:00 | 0.7 |
NVDA | 2024-09-24 09:00:00 | 0.55 |
NVDA | 2024-09-24 10:00:00 | 0.6 |
NVDA | 2024-09-24 11:00:00 | 0.65 |
market_data
where the market_time
is less than or equal to the stock_time
:
stock_name | stock_time | price | sentiment |
---|---|---|---|
TSLA | 2024-09-24 09:30:00 | 250 | 0.7 |
TSLA | 2024-09-24 10:30:00 | 252 | 0.8 |
TSLA | 2024-09-24 11:30:00 | 255 | 0.9 |
AMZN | 2024-09-24 09:30:00 | 3300 | 0.6 |
AMZN | 2024-09-24 10:30:00 | 3310 | 0.65 |
AMZN | 2024-09-24 11:30:00 | 3320 | 0.7 |
stock_name | stock_time | price | sentiment |
---|---|---|---|
TSLA | 2024-09-24 09:30:00 | 250 | 0.7 |
TSLA | 2024-09-24 10:30:00 | 252 | 0.8 |
TSLA | 2024-09-24 11:30:00 | 255 | 0.9 |
AMZN | 2024-09-24 09:30:00 | 3300 | 0.6 |
AMZN | 2024-09-24 10:30:00 | 3310 | 0.65 |
AMZN | 2024-09-24 11:30:00 | 3320 | 0.7 |
GOOG | 2024-09-24 09:30:00 | 1400 | NULL |
GOOG | 2024-09-24 10:30:00 | 1410 | NULL |
GOOG | 2024-09-24 11:30:00 | 1420 | NULL |
market_data
, so they show the closest preceding sentiment.
GOOG has no corresponding data in market_data
, so the sentiment column is NULL.
join_conditions
must be an equality condition based on the watermarks of the two table expressions. For the syntax of <time_window_expression>
, see Time window functions.
For example, suppose you have these two sources:
interval_condition
must be a watermark-based range.
For example, for sources s1
and s2
used in the above section, you can create an interval join:
FOR SYSTEM_TIME AS OF PROCTIME()
is included in the right table expression.transaction_id | product_id | quantity | sale_date | process_time |
---|---|---|---|---|
1 | 101 | 3 | 2023-06-18 | 2023-06-18 10:15:00 |
2 | 102 | 2 | 2023-06-19 | 2023-06-19 15:30:00 |
3 | 101 | 1 | 2023-06-20 | 2023-06-20 11:45:00 |
products
:
id | product_name | price | valid_from | valid_to |
---|---|---|---|---|
101 | Product A | 20 | 2023-06-01 00:00:00 | 2023-06-15 23:59:59 |
101 | Product A | 25 | 2023-06-16 00:00:00 | 2023-06-19 23:59:59 |
101 | Product A | 22 | 2023-06-20 00:00:00 | NULL |
102 | Product B | 15 | 2023-06-01 00:00:00 | NULL |
products
table and form a wider table. To further improve performance, you can create an index for table products
, and join sales
with the index instead.
transaction_id | product_id | quantity | sale_date | product_name | price |
---|---|---|---|---|---|
1 | 101 | 3 | 2023-06-18 | Product A | 25 |
2 | 102 | 2 | 2023-06-19 | Product B | 15 |
3 | 101 | 1 | 2023-06-20 | Product A | 22 |
sales
:
transaction_id | product_id | quantity | sale_date | product_name | price |
---|---|---|---|---|---|
1 | 101 | 4 | 2023-06-18 | Product A | 25 |
2 | 102 | 3 | 2023-06-19 | Product B | 15 |
3 | 101 | 2 | 2023-06-20 | Product A | 22 |
JOIN
operator is used to join two streams of data together. It can be used to join two streams of data that are both being ingested in real-time.
Whenever there’s some updates on one side of the join, it will look for matches on the other side of the join.
Consider a fact and dimension table join:
PROCESS TIME JOIN
, which will only look up the dimension table at the time of the join.
This will be more efficient, as whenever there’s dimension table update,
it will not recompute the join result for all matching fact side rows.