Data modelling sounds scary, like a big chore that takes months of work. But it is not so bad, and the process is actually super FUN. It feels like building an F1 car: everything is designed for speed from the ground up.
By following this blog series, any data practitioner should be able to design a model that meets a query SLA such as “at least 95% of queries finish in under 3 seconds, and the slowest finishes in under 10 seconds”, and then surprise the business users.
In this blog, let’s look at the first building block, the “Perfect Layout”, in detail.
In a Kyligence data model, a layout represents a group of curated keys: the partition key, the shard key, and the sort key. Those carefully selected keys divide a physical data model into folders and, ultimately, into individual parquet files organized in a columnar fashion.
A “perfect” layout is an effective way to restrict the amount of data for analysis. A “perfect” structure allows for extremely granular pruning of data blocks. It enables the query engine to efficiently locate the relevant data blocks at query runtime without having to scan all data blocks. The less data to analyze, the faster the query executes.
Partition Key is a time-unit column responsible for splitting precomputed data up into physical folders, also known as Segments in Kyligence. This key enables folder-level pruning when a query filters on this partition column. Otherwise, the Kyligence query engine will do a full scan of all folders/segments, which significantly impacts query performance.
Strategies for Selecting Partition Key
Note: In some cases, a secondary partition key can help with folder-level pruning. The rules for defining the primary and secondary partition keys are generally the same; the only difference is that a candidate for the secondary partition key can be of any data type.
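To make folder-level pruning concrete, here is a minimal sketch in Python. It is not Kyligence code: the segment names, the monthly date ranges, and the `prune_segments` helper are all assumptions made up for illustration. It only shows the idea that a filter on the partition column lets the engine skip whole folders, while a query without such a filter has to visit every segment.

```python
from datetime import date

# Hypothetical segments: each one stands for a folder covering a half-open
# date range [start, end) of the partition column. Names are assumptions.
SEGMENTS = {
    "seg_2023_01": (date(2023, 1, 1), date(2023, 2, 1)),
    "seg_2023_02": (date(2023, 2, 1), date(2023, 3, 1)),
    "seg_2023_03": (date(2023, 3, 1), date(2023, 4, 1)),
}


def prune_segments(segments, filter_start=None, filter_end=None):
    """Return names of segments overlapping [filter_start, filter_end).

    With no filter on the partition column, every segment must be scanned.
    """
    if filter_start is None or filter_end is None:
        return sorted(segments)
    return sorted(
        name
        for name, (start, end) in segments.items()
        if start < filter_end and filter_start < end
    )


with_filter = prune_segments(SEGMENTS, date(2023, 2, 10), date(2023, 2, 20))
without_filter = prune_segments(SEGMENTS)
print(with_filter)     # only the February segment is touched
print(without_filter)  # all three segments are scanned
```

In this toy setup, the filtered query touches one folder out of three, which is the effect the partition key is meant to deliver.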
A shard key is a column in a data model that is explicitly designated to distribute rows into parquet files according to a hash function. Rows containing the same value in the shard key column are therefore co-located in the same physical file. In other words, sharding improves scan efficiency for queries that filter on this column, because the engine can skip files that cannot match the filter predicates.
Strategies for Selecting Shard Key:
Note: Use the shard key column in conjunction with the “=” or “IN” operator.
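The following Python sketch illustrates why hash-based sharding pairs well with “=” and “IN” filters. It is an illustration under stated assumptions, not the engine's implementation: the CRC32 hash, the shard count of 4, the `customer_id` column, and the helper names are all hypothetical. Because a hash does not preserve ordering, an equality value maps to exactly one file, whereas a range predicate cannot be mapped to a subset of files this way.

```python
import zlib

NUM_SHARDS = 4  # hypothetical number of files per segment


def shard_of(value, num_shards=NUM_SHARDS):
    """Map a shard-key value to a file index with a deterministic hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_shards


def write_shards(rows, shard_key, num_shards=NUM_SHARDS):
    """Distribute rows into per-shard lists that stand in for parquet files."""
    files = [[] for _ in range(num_shards)]
    for row in rows:
        files[shard_of(row[shard_key], num_shards)].append(row)
    return files


def files_to_scan(values, num_shards=NUM_SHARDS):
    """For an "=" or "IN" predicate, only the files the values hash to matter."""
    return sorted({shard_of(v, num_shards) for v in values})


# 200 hypothetical rows spread over 20 customers, 10 rows per customer.
rows = [{"customer_id": f"C{i % 20}", "amount": i} for i in range(200)]
files = write_shards(rows, "customer_id")

target = "C7"
candidates = files_to_scan([target])
matches = [r for idx in candidates for r in files[idx] if r["customer_id"] == target]
print(len(candidates), "of", NUM_SHARDS, "files scanned;", len(matches), "rows found")
```

All rows for `C7` live in a single file, so the other files never need to be opened for this query.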
Sort Key is responsible for sorting/ordering rows within a file. It consists of one or more columns, which can be of any data type. A properly designed sort key enables precise pruning of rows at query runtime. For example, a query whose filter predicate targets 10% of the values should ideally scan only about 10% of the data.
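As a rough illustration of the 10% example, the Python sketch below compares a filter on sorted versus unsorted data. The data is synthetic, and binary search over an in-memory list is only a stand-in for whatever the engine actually does inside a parquet file; the function names are hypothetical. The point is that sorting turns the matching rows into one contiguous range.

```python
from bisect import bisect_left, bisect_right

# 1,000 synthetic rows whose sort-key column takes 10 values (0..9),
# 100 rows per value.
unsorted_keys = [i % 10 for i in range(1000)]
sorted_keys = sorted(unsorted_keys)


def rows_touched_sorted(keys, value):
    """On sorted data, binary search locates the one contiguous matching range."""
    lo = bisect_left(keys, value)
    hi = bisect_right(keys, value)
    return hi - lo


def rows_touched_unsorted(keys, value):
    """On unsorted data, matches can be anywhere, so every row is inspected."""
    return len(keys)


touched_sorted = rows_touched_sorted(sorted_keys, 3)
touched_unsorted = rows_touched_unsorted(unsorted_keys, 3)
print(touched_sorted, touched_unsorted)
```

Here the filter value covers 10% of the rows, and on sorted data the matching range is exactly that 10% (ignoring the handful of probes the binary search itself makes).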
Strategies for Selecting Sort Key:
In this context, Data Skew actually becomes a benefit that can be leveraged to boost query performance.
Example: A user queries records where “Computed_Risk” = “HIGH” and “Data_Level” = “Rule”.
The data distribution statistics for these two columns suggest that the “Computed_Risk” column has better filtering capability than the “Data_Level” column for a query searching for high-risk, rule-level records. For this particular query, ordering by “Computed_Risk” and then “Data_Level” requires the engine to access only 1% of all rows, approximately 87% less than ordering by “Data_Level” and then “Computed_Risk”.
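One way to reason about the column order is to compare how many rows each filter lets through and put the most selective column first. The Python sketch below does this on made-up data: the row counts, the “LOW” and “Account” values, and the helper names are assumptions for illustration, and they do not reproduce the statistics or percentages quoted above.

```python
from collections import Counter

# Hypothetical rows (100 in total); the real distribution is not reproduced here.
rows = (
    [{"Computed_Risk": "HIGH", "Data_Level": "Rule"}] * 2
    + [{"Computed_Risk": "HIGH", "Data_Level": "Account"}] * 3
    + [{"Computed_Risk": "LOW", "Data_Level": "Rule"}] * 38
    + [{"Computed_Risk": "LOW", "Data_Level": "Account"}] * 57
)


def selectivity(rows, column, value):
    """Fraction of rows that survive the filter column = value."""
    counts = Counter(row[column] for row in rows)
    return counts[value] / len(rows)


def order_sort_key(rows, predicates):
    """Order columns so the most selective filter (smallest fraction) leads."""
    return sorted(predicates, key=lambda col: selectivity(rows, col, predicates[col]))


predicates = {"Computed_Risk": "HIGH", "Data_Level": "Rule"}
risk_fraction = selectivity(rows, "Computed_Risk", "HIGH")
level_fraction = selectivity(rows, "Data_Level", "Rule")
suggested_order = order_sort_key(rows, predicates)
print(risk_fraction, level_fraction, suggested_order)
```

In this synthetic data set the skewed “Computed_Risk” column keeps 5% of rows while “Data_Level” keeps 40%, so the sketch suggests leading the sort key with “Computed_Risk”. This is the sense in which data skew works in the query's favor.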
In Part 3, we will look at the second building block, the “Perfect” Index, from the “PERFECT” Query Performance Formula. Stay tuned!