Improve user profiling, retention, and selection with StarRocks

The user portrait is closely linked with user operations, which have gradually developed from rough to delicate. Some companies have even taken a P2P strategy. Thus user profiling is widely used in many scenarios, including:

  • Marketing
  • Group analysis
  • Risk Warning
  • Effectiveness analysis
  • Channel analysis

While user profiling also has bought many business difficulties:

  • The data volume is enormous, which requires a flexible retrieval method
  • Combined label calculation calls for sophisticated development
  • Accurate de-duplication calculation, which costs huge resources
  • Aggregate label collection requires high query concurrency


Generally, marketers will check the attribute conditions with combination relationships and push channels according to needs. The data platform needs to spell out the corresponding query by selecting the combination conditions and calculating and returning the number of customers in the selection.

Solution Based on Elasticsearch and Hbase

When the volume of data is small, users tend to use Elasticsearch as the storage and query engine. When the volume of data is large, users use Hbase to store the detailed data and Elasticsearch to speed up the queries,as shown in the figure below.


But such a solution has limitations:


It takes complicated procedures to filter customer groups, which needs to aggregate after searching multiple indexes separately.


Heavy structure and complicated maintenance

Poor SQL capabilities (cannot complete join and aggregation operations), high development costs

Inflexible expansion and requires customized development


Solution Based on StarRocks

StarRocks is a new generation of full-scene MPP database, which has superior performance covering all kinds of OLAP analytics scenarios in a timely manner. StarRocks has a vectorization engine, CBO optimizer, and can support a variety of data models.

StarRocks advantages

  • Support standard SQL
  • Strong multi-table join and aggregation queries
  • Supports Bitmap data structure, which can accomplish:
    • Set calculation
    • Precise deduplication
    • Convert column to multiple rows by using Bitmap_to_array and unnest
    • Support high concurrency query of detailed data
    • Simple architecture, easy to operate and maintain

This can be illustrated by the following examples:

User Retention

The data structure of StarRocks is based on Roaring Bitmap. Traditional de-duplication analysis requires the execution of "select distinct count...from...groupby join..." statistics similar to SQL, which will form a large number of SQL task scheduling and large table joins, thus bringing heavy pressure to the cluster.

Based on Roaring Bitmap, StarRocks can easily implement analysis such as UV, retention, funnel, etc., that use a lot of intersection operations, which can be easily used in user behavior and user profiling.

Example 1: Find out user retention of yesterday

select tag, bitmap_intersect(user_id) 
from (
   select tag, date, bitmap_union(user_id) user_id 
   from table 
   where date in ('2020-05-18', '2020-05-19') 
   group by tag, date) a 
group by tag;

The combination of this function and the bitmap_to_string function can obtain the specific data of the intersection.

Example 2: Find the specific users retained yesterday:

select tag, bitmap_to_string(bitmap_intersect(user_id)) 
from (
  select tag, date, bitmap_union(user_id) user_id 
  from table 
  where date in ('2020-05-18', '2020-05-19') 
  group by tag, date) a 
group by tag;

User Selection

The combination of StarRocks Bitmap_union and materialized view can further speed up user selection on the duplicate table. For example, we want to find a list of users who have clicked shopping cart and viewed favorite page in the detailed list.

Step 1: Create a table

create table t1(
  event_day date,
  event_time datetime,
  uid int,
  action string,
  page string,
  product_code string,
  from_days int)
DUPLICATE KEY(`event_day`,`event_time`, `uid`)
  PARTITION p20210401 VALUES LESS THAN ('2021-04-01'),
  PARTITION p20210501 VALUES LESS THAN ('2021-05-01')

Step 2:Import the sample data

mysql> select * from t1;
| event_day |     event_time      |    uid    | action | page   | product_code | from_days |
| 2021-04-03 | 2021-04-03 10:01:30 | 274649163 | click | shopping_cart  | MDS      |     1     |
| 2021-04-03 | 2021-04-03 10:04:30 | 274649163 | view  | favorite_page  | MDS      |     4     |
| 2021-04-03 | 2021-04-03 10:03:30 | 274649164 | click | shopping_cart  | MDS      |     2     |
| 2021-04-03 | 2021-04-03 10:06:30 | 274649165 | click | shopping_cart  | MMS      |     8     |
| 2021-04-03 | 2021-04-03 10:08:30 | 274649164 | view  | favorite page  | MDS      |     3     |
| 2021-04-03 | 2021-04-03 10:09:30 | 274649165 | view  | shopping_cart  | MDS      |    10     |
6 rows in set (0.01 sec)

Step: Create a materialized view

SELECT event_day ,action, page , bitmap_union(to_bitmap(uid)) b_uid   
  FROM t1 
  GROUP BY event_day , action, page;

Step 4: Select your target clients

WITH tbl_c AS (
  SELECT bitmap_union(to_bitmap(uid)) b_uid   
  FROM t1 
  WHERE event_day = '2021-04-03'
  AND action='click' and page= 'shopping_cart'
  GROUP BY action, page
) , tbl_v AS(
  SELECT bitmap_union(to_bitmap(uid)) b_uid   
  FROM t1 
  WHERE event_day = '2021-04-03'
  AND action='view' and page='favorite_page'
  GROUP BY action, page
) ,tbl_u AS(
  SELECT b_uid from tbl_c
  SELECT b_uid from tbl_v
  bitmap_count(bitmap_intersect(b_uid)) uid_ct,
  bitmap_to_string(bitmap_intersect(b_uid)) uid_list
FROM tbl_u

We can use bitmap_and(not) to pursue the intersection of specific labels, such as "male" or "high spenders".

with t1 as (
 SELECT bitmap_union(user_ids) uid ,create_time
 FROM tbl 
 WHERE create_time = '2021-07-09' AND tag_name = 'high spenders'
 group by create_time
), t2 as (
 SELECT bitmap_union(user_ids) uid ,create_time
 FROM tbl 
 WHERE create_time = '2021-07-09' AND tag_name = 'male'
 group by create_time
select bitmap_and(t1.uid, t2.uid)
from t1 join t2
on t1.create_time = t2.create_time

StarRocks is currently deployed in production environments at several companies. StarRocks has increased the speed of profiling generation in all dimensions by 30%+ for a company providing microblogging and social networking service.

Get Started for Free!

Download Now