Meet Your AI Copilot fot Data Learn More

Kyligence Cloud Model Design Principles Part 2 - “Perfect” Layout

Author
Lori Lu
Solution Architect & Technology Advocate
May. 18, 2022
 

Data Modelling sounds scary, like a big chore and months of work. But it is not so bad, and the process is actually super FUN. It feels just like building up an F1 car, and everything is designed for speed from the ground up.

Following through this blog series, hopefully, any data practitioner can design a model meeting a query SLA such as 95%+ < 3s & slowest < 10s and then surprise the business users.

 
 

The Building Blocks for a “PERFECT” Query Performance

 

In this blog, let’s look at the first building block — “Perfect Layout” in detail.

 
The formula for a “PERFECT” Query Performance

 

What is a “Perfect” Layout?

 
 

In a Kyligence data model, a layout represents a group of curated keys — partition key, shard key, and sort key. Those carefully selected keys divide a physical data model into folders and ultimately individual parquet files organized in a columnar fashion.

 

Why a “Perfect” Layout Matters?

 
Data Pruning is like Coffee Filtering

A “perfect” layout is an effective way to restrict the amount of data for analysis. A “perfect” structure allows for extremely granular pruning of data blocks. It enables the query engine to efficiently locate the relevant data blocks at query runtime without having to scan all data blocks. The less data to analyze, the faster the query executes.

 

How to Create a “Perfect” Layout?

 
Kyligence Physical Model Structure

Partition Key — Folder-Level Pruning
 

Partition Key is a time-unit column responsible for splitting precomputed data up into physical folders, also known as Segments in Kyligence. This key enables folder-level pruning when a query filters on this partition column. Otherwise, the Kyligence query engine will do a full scan of all folders/segments, which significantly impacts query performance.

 

Strategies for Selecting Partition Key

 
  • This key should be a time-unit column and is most actively referenced in selective filters.
  • This column is strongly recommended to be made a compulsory part of all queries for better query performance.

 

Note: In some cases, Secondary Partition Key can be helpful for folder-level pruning. The rules for defining the primary and secondary partition keys are generally the same and the only difference is that the candidates of the secondary partition key can be of any data type.

 
Shard Key — File-Level Pruning

A shard key is a column in a data model that is explicitly designated to distribute rows into parquet files in accordance with a hash function. Thus, rows containing the same values of this shard key column will be co-located in the same physical file. This is to say, leveraging the sharding feature will improve scan efficiency in queries filtered on this column by skipping files that do not match filtering predicates.

 

Strategies for Selecting Shard Key:

  • This column is heavily used in filter operations.
  • Shard works best when a query searches for a handful of distinct values in this column. In other words, it does not support range search. Since adjacent elements in range might hash to different files, there is no efficient way to scan files to locate all values within the range.
 

Note: Use this shard key column in conjunction with “=” or “IN” operator

 
Sort Key — Row-Level Pruning
 

Sort Key is responsible for sorting/ordering rows within a file. It consists of one or more columns, which can be of any data type. A properly designed sort key enables precise pruning of rows at query runtime. For example, assume a query that specifies a filter predicate that targets 10% of the values should ideally only scan 10% of the files.

Strategies for Selecting Sort Key:

  • Sort key columns are the most frequently referenced ones in where clauses across all queries
  • For the popular columns used in filter operations, assess the filtering capability of each candidate and then pick the top 5 columns as sort keys. Ideally, the top 3 to 5 columns are expected to filter out the majority of unrequested data for blasting fast query response time.
 

In this context, Data Skew actually becomes a benefit that can be leveraged to boost query performance.

Example: User querying records of “Computed_Risk”= “HIGH“ and “Data_Level” = “Rule“

The following data distribution statistics are suggesting that “Computed_Risk” column has better filtering capability than “Data_Level” column for a query searching for high-risk, rule-level records. For this particular query, ordering by “Computed_Risk” and then “Data_Level” will only require the engine to access only 1% of all lines, approximately 87% less than ordering by “Data_Level” and then “Computed_Risk”.

 
 

What's Next

In Part 3, we will look at the second building block — “Perfect” Index fromthe “PERFECT” Query Performance Formula. Stay Tuned!