Row Key Definition

Define a Row Key in a DDL statement

The user can define the row key for a relation by adding certain properties to the relation’s DDL statement in the WITH clause:

  • key.format: This is the data format for the row key (i.e. how key bytes are serialized) in the messages coming from the relation’s topic. Supported formats for a row key are: primitive, json, protobuf, and avro.

  • key.type: It defines the structure of the row key’s fields:

    • For a row key with a primitive format, the key.type defines the data type of key values. Valid data types for a primitive row key are: SMALLINT, INTEGER, BIGINT, DOUBLE, and VARCHAR.

    • For a row key in the json, avro, or protobuf format, the key.type is a struct whose fields define the names and data types of the key’s fields.

Examples

The following DDL statement defines a stream that has a primitive row key of the VARCHAR type:

CREATE STREAM pageviews (
   viewtime BIGINT,
   userid VARCHAR,
   pageid VARCHAR
)
WITH (
   'topic'='pageviews',
   'value.format'='json',
   'key.format'='primitive',
   'key.type'='VARCHAR'
);

The following DDL statement defines a stream that has a row key in the json format. The row key in each record can be accessed via the rowkey function, and it is a struct that has a field named id of the INTEGER data type:

CREATE STREAM orders (
   order_time BIGINT,
   item_name VARCHAR,
   price BIGINT
)
WITH (
   'topic'='sales',
   'value.format'='json',
   'key.format'='json',
   'key.type'='"STRUCT<id INTEGER>"'
);

The following DDL statement defines a stream that has a row key in the protobuf format. The row key in each record has two fields: an INTEGER field named id and a VARCHAR field named region. A protobuf descriptor for the row key is defined with the name customer_key_msg within the descriptor source of the relation’s topic.

CREATE STREAM customers (
   name VARCHAR,
   address VARCHAR,
   acct_balance INTEGER
)
WITH (
   'topic' = 'customers',
   'value.format' = 'protobuf',
   'key.format' = 'protobuf',
   'key.type' = 'STRUCT<id INTEGER, region VARCHAR>',
   'value.descriptor.name' = 'customer_value_msg',
   'key.descriptor.name' = 'customer_key_msg'
);

Row Key in CREATE AS SELECT Statements

CAS with Simple SELECT

In a CAS statement with a SELECT clause consisting of only projection and filter operators, using SELECT and WHERE clauses, respectively, the row key definition for the new relation will be the same as the row key definition (if any) of the source.

As an example, the below CSAS statement creates a new pagevisits stream by filtering some records from the pageviews stream, which we defined above (see Examples). Given that pageviews records have row keys with the primitive format,pagevisits records will have row keys with a similar definition:

CREATE STREAM
    pagevisits
AS
    SELECT userid, pageid
    FROM pageviews
    WHERE userid != 'User_2';

When running CAS, the key.format for the new relation (sink) can be specified in the sink’s WITH clause. For example, the below CSAS statement is similar to the above one; However, it changes the row key format to json for the new pagevisits stream, while the source relation pageviews has its row key with the primitive format:

CREATE STREAM
    pagevisits
WITH ('key.format'='json')
AS
    SELECT userid, pageid
    FROM pageviews
    WHERE userid != 'User_2';

CCAS with GROUP BY

If the SELECT clause in a CREATE CHANGELOG AS SELECT statement has a GROUP BY clause, the row key for the new relation will consist of columns in the GROUP BY clause.

The key.format for the new relation will be the same as a source relation’s key.format. Similar to the example above, the key.format can be changed using the sink’s WITH clause.

For example, the below query runs grouping and aggregation on pageviews using a tumbling window to create a new changelog visits_rate. The row key for visits_rate will have three fields: window_start, window_end, and userid as they are the columns referred in the GROUP BY clause:

CREATE CHANGELOG
    visits_rate
AS
    SELECT window_start, window_end, userid, count(pageid) AS page_count
    FROM TUMBLE(pageviews, size 5 second)
    GROUP BY window_start, window_end, userid;

The below query creates a new changelog region_stats by running a grouping and aggregation on the source stream users. Each record in region_stats will have a row key with a single field named location:

CREATE CHANGELOG
    region_stats
AS
    SELECT contactinfo->city AS location, count(userid) AS usr_cnt
    FROM "users"
    WHERE interests[2] != 'Game'
    GROUP BY contactinfo->city;

CSAS with JOIN

If the SELECT clause in a CREATE STREAM AS SELECT statement has a JOIN clause, the row key for the new relation will consist of the column from the left relation in the JOIN criteria.

The key.format for the new relation will be the same as the left source relation’s key.format. Similar to the example above, the key.format can be changed using the sink’s WITH clause.

For example, the below query runs an interval join on two streams, pageviews and users, to create a new stream named pvusers. Each record in pvusers will have a row key with one field, userid, as userid is the join column in pageviews, which is left-side of the join, referred in the join criteria: p.userid = u.userid:

CREATE STREAM
    pvusers
AS
    SELECT p.userid, u.registertime
    FROM pageviews p JOIN "users" u WITHIN 5 minutes
    ON p.userid = u.userid
    WHERE p.userid != 'User_5';

Last updated