Clickstream analysis
Overview
Whether a website gets a few hundred visitors a day or a few million, clickstream data can reveal valuable insights by using different metrics from the website. Not only does clickstream data reveal how popular a particular page on a website is, but it is also a useful substitute for traditional market research methods. We can determine the effectiveness of a digital marketing campaign, UX design, and more based on user behaviors without using traditional surveys.
Clickstream data provides a detailed log of how users travel through a particular website. It shows how long a user has been on a specific page, how they arrived at it, and what the user clicked on next. If many users quickly leave a page, it suggests that improvements are needed to encourage more user interaction.
In this tutorial, you will learn how to track the number of clicks a webpage gets over time with RisingWave. We have set up a demo cluster for this tutorial so you can easily try it out.
Prerequisites
- Ensure you have Docker and Docker Compose installed in your environment. Note that Docker Compose is included in Docker Desktop for Windows and macOS. If you use Docker Desktop, ensure that it is running before launching the demo cluster.
- Ensure that the PostgreSQL interactive terminal,
psql
, is installed in your environment. For detailed instructions, see Download PostgreSQL.
Step 1: Launch the demo cluster
In the demo cluster, we packaged RisingWave and a workload generator. The workload generator will start to generate random traffic and feed it into Kafka as soon as the cluster is started.
First, clone the risingwave repository to the environment.
git clone https://github.com/risingwavelabs/risingwave.git
Now navigate to the integration_tests/clickstream
directory and start the demo cluster from the docker-compose file.
cd risingwave/integration_tests/clickstream
docker compose up -d
The default command-line syntax in Compose V2 starts with docker compose
. See details in the Docker docs.
If you're using Compose V1, use docker-compose
instead.
Necessary RisingWave components will be started, including the compute node, metadata node, and MinIO. The workload generator will start to generate random data and feed them into Kafka topics. In this demo cluster, data of materialized views will be stored in the MinIO instance.
Now connect to RisingWave to manage data streams and perform data analysis.
psql -h localhost -p 4566 -d dev -U root
Step 2: Connect RisingWave to data streams
Now that we have set up the data stream in Kafka (in JSON) using the demo cluster, we can connect to the streams with the following SQL statement. The data contains information about what each user is clicking on and the timestamp of each event.
CREATE SOURCE user_behaviors (
user_id VARCHAR,
target_id VARCHAR,
target_type VARCHAR,
event_timestamp TIMESTAMP WITH TIME ZONE,
behavior_type VARCHAR,
parent_target_type VARCHAR,
parent_target_id VARCHAR
) WITH (
connector = 'kafka',
topic = 'user_behaviors',
properties.bootstrap.server = 'message_queue:29092',
scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
RisingWave is connected to the streams but has not started to consume data yet. For data to be processed, we need to define materialized views. After a materialized view is created, RisingWave will consume data from the specified offset.
Step 3: Create materialized views
In this tutorial, we will create a materialized view that counts how many times a thread was clicked on over a day.
First, the tumble()
function will map each event into a 10-minute window to create an intermediary table t
, where the events will be aggregated on target_id
and window_time
to get the number of clicks for each thread. The events will also be filtered by target_type
and behavior_type
.
Next, the hop()
function will create 24-hour time windows every 10 minutes. Each event will be mapped to corresponding windows. Finally, they will be grouped by target_id
and window_time
to calculate the total number of clicks of each thread within 24 hours.
Please refer to Time window functions for an explanation of the tumble and hop functions and aggregations.
CREATE MATERIALIZED VIEW thread_view_count AS WITH t AS (
SELECT
target_id,
COUNT() AS view_count,
window_start AS window_time
FROM
TUMBLE(
user_behaviors,
event_timestamp,
INTERVAL '10 minutes'
)
WHERE
target_type = 'thread'
AND behavior_type = 'show'
GROUP BY
target_id,
window_start
)
SELECT
target_id,
SUM(t.view_count) AS view_count,
window_start,
window_end
FROM
HOP(
t,
t.window_time,
INTERVAL '10 minutes',
INTERVAL '1440 minutes'
)
GROUP BY
target_id,
window_start,
window_end;
Step 4: Query the results
We can query the most often viewed threads with the following statement.
SELECT * FROM thread_view_count
ORDER BY view_count DESC, window_start
LIMIT 5;
The result may look like this:
target_id | view_count | window_start | window_end
-----------+------------+---------------------------+---------------------------
thread58 | 15 | 2022-09-22 06:50:00+00:00 | 2022-09-23 06:50:00+00:00
thread58 | 15 | 2022-09-22 07:00:00+00:00 | 2022-09-23 07:00:00+00:00
thread58 | 15 | 2022-09-22 07:10:00+00:00 | 2022-09-23 07:10:00+00:00
thread58 | 15 | 2022-09-22 07:20:00+00:00 | 2022-09-23 07:20:00+00:00
thread58 | 15 | 2022-09-22 07:30:00+00:00 | 2022-09-23 07:30:00+00:00
(5 rows)
We can also query results by specifying a time interval. To learn more about data and time functions and operators, see Date and time.
SELECT * FROM thread_view_count
WHERE window_start > ('2022-09-23 06:50Z' :: TIMESTAMP WITH TIME ZONE - INTERVAL '1 day')
AND window_start <
('2022-09-23 07:40Z' :: TIMESTAMP WITH TIME ZONE - INTERVAL '1 day' + INTERVAL '10 minutes')
AND target_id = 'thread58'
ORDER BY window_start;
The result looks like this:
target_id | view_count | window_start | window_end
-----------+------------+---------------------------+---------------------------
thread58 | 15 | 2022-09-22 06:50:00+00:00 | 2022-09-23 06:50:00+00:00
thread58 | 15 | 2022-09-22 07:00:00+00:00 | 2022-09-23 07:00:00+00:00
thread58 | 15 | 2022-09-22 07:10:00+00:00 | 2022-09-23 07:10:00+00:00
thread58 | 15 | 2022-09-22 07:20:00+00:00 | 2022-09-23 07:20:00+00:00
thread58 | 15 | 2022-09-22 07:30:00+00:00 | 2022-09-23 07:30:00+00:00
thread58 | 15 | 2022-09-22 07:40:00+00:00 | 2022-09-23 07:40:00+00:00
(6 rows)
When you finish, run the following command to disconnect RisingWave.
\q
Optional: To remove the containers and the data generated, use the following command.
docker compose down -v
Summary
In this tutorial, we learn:
- How to get time-windowed aggregate results by using the tumble and hop time window functions.
- How to compare time intervals.