Data Modelling sounds scary, like a big chore and months of work. But it is not so bad, and the process is actually super FUN. It feels just like building up an F1 car, and everything is designed for speed from the ground up.
By following this blog series, any data practitioner can hopefully design a model that meets a query SLA such as "more than 95% of queries return in under 3 seconds and the slowest query returns in under 10 seconds", and then surprise the business users.
In this blog, let’s look at the first building block — “Perfect Layout” in detail.
In a Kyligence data model, a layout represents a group of curated keys — partition key, shard key, and sort key. Those carefully selected keys divide a physical data model into folders and ultimately individual parquet files organized in a columnar fashion.
A “perfect” layout is an effective way to restrict the amount of data for analysis. A “perfect” structure allows for extremely granular pruning of data blocks. It enables the query engine to efficiently locate the relevant data blocks at query runtime without having to scan all data blocks. The less data to analyze, the faster the query executes.
Partition Key is a time-unit column responsible for splitting precomputed data up into physical folders, also known as Segments in Kyligence. This key enables folder-level pruning when a query filters on this partition column. Otherwise, the Kyligence query engine will do a full scan of all folders/segments, which significantly impacts query performance.
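To make folder-level pruning concrete, here is a minimal sketch in plain Python. It is not Kyligence code: the segment names, the date ranges, and the `prune_segments` helper are all hypothetical, and the sketch only assumes that each segment covers a half-open date range on the partition column.

```python
from datetime import date

# Hypothetical segments: each folder covers a half-open date range [start, end).
segments = {
    "seg_2023_01": (date(2023, 1, 1), date(2023, 2, 1)),
    "seg_2023_02": (date(2023, 2, 1), date(2023, 3, 1)),
    "seg_2023_03": (date(2023, 3, 1), date(2023, 4, 1)),
}

def prune_segments(segments, lo=None, hi=None):
    """Return the segments whose range overlaps the filter range [lo, hi).

    With no filter on the partition column (lo and hi both None),
    nothing can be skipped and every segment is returned.
    """
    selected = []
    for name, (start, end) in segments.items():
        if lo is not None and end <= lo:
            continue  # segment ends before the filter range starts
        if hi is not None and start >= hi:
            continue  # segment starts after the filter range ends
        selected.append(name)
    return selected

# Query filtered on the partition column: only one folder is touched.
with_filter = prune_segments(segments, date(2023, 2, 10), date(2023, 2, 20))
# Query without a partition filter: full scan of all folders.
without_filter = prune_segments(segments)

print(with_filter)
print(without_filter)
```

The filtered query touches a single folder, while the unfiltered one falls back to scanning every segment, which is the behavior described above.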
Strategies for Selecting Partition Key:
Note: In some cases, Secondary Partition Key can be helpful for folder-level pruning. The rules for defining the primary and secondary partition keys are generally the same and the only difference is that the candidates of the secondary partition key can be of any data type.
A shard key is a column in a data model that is explicitly designated to distribute rows into parquet files according to a hash function. Rows that share the same value in the shard key column are therefore co-located in the same physical file. As a result, sharding improves scan efficiency for queries that filter on this column, because the engine can skip files that cannot match the filter predicate.
Strategies for Selecting Shard Key:
Note: Use this shard key column in conjunction with the “=” or “IN” operator.
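The idea can be sketched in a few lines of Python. This is an illustration only, not the engine's implementation: the shard count, the `customer_id` column, the use of CRC32 as the hash, and the helper names are all assumptions made for the example.

```python
import zlib

NUM_SHARDS = 4  # hypothetical number of files per segment

def shard_of(value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard-key value to a file index with a stable hash (CRC32 here)."""
    return zlib.crc32(value.encode("utf-8")) % num_shards

def write_shards(rows, key):
    """Distribute rows into files so that equal key values share a file."""
    files = {i: [] for i in range(NUM_SHARDS)}
    for row in rows:
        files[shard_of(row[key])].append(row)
    return files

def files_to_scan(values):
    """For `key = v` or `key IN (...)`, return only the files that can hold v."""
    return {shard_of(v) for v in values}

rows = [{"customer_id": f"C{i:03d}", "amount": i} for i in range(100)]
files = write_shards(rows, "customer_id")

# WHERE customer_id = 'C042': a single file is opened, the others are skipped.
targets = files_to_scan(["C042"])
hits = [r for i in targets for r in files[i] if r["customer_id"] == "C042"]
print(targets, hits)
```

A hash does not preserve ordering, so a range predicate such as `customer_id > 'C042'` cannot be mapped to a subset of files in this scheme. That is consistent with the note above about using “=” or “IN”.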
Sort Key is responsible for sorting/ordering rows within a file. It consists of one or more columns, which can be of any data type. A properly designed sort key enables precise pruning of rows at query runtime. For example, a query whose filter predicate targets 10% of the values should ideally scan only about 10% of the data.
Strategies for Selecting Sort Key:
In this context, Data Skew actually becomes a benefit that can be leveraged to boost query performance.
Example: A user queries records where “Computed_Risk” = “HIGH” and “Data_Level” = “Rule”.
The data distribution statistics suggest that the “Computed_Risk” column filters more effectively than the “Data_Level” column for a query searching for high-risk, rule-level records. For this particular query, ordering by “Computed_Risk” and then “Data_Level” requires the engine to access only 1% of all rows, approximately 87% less than ordering by “Data_Level” and then “Computed_Risk”.
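One way to reason about the column order is to compare how selective each filter is and put the most selective column first. The sketch below does this on made-up rows; the distribution is hypothetical and does not reproduce the statistics referenced above, and `selectivity` and `rank_sort_key` are illustrative helpers rather than product features.

```python
from collections import Counter

# Hypothetical, skewed sample: only a small share of rows is HIGH risk.
rows = (
    [{"Computed_Risk": "HIGH", "Data_Level": "Rule"}] * 2
    + [{"Computed_Risk": "LOW", "Data_Level": "Rule"}] * 58
    + [{"Computed_Risk": "LOW", "Data_Level": "Entity"}] * 40
)

def selectivity(rows, column, value):
    """Fraction of rows where column == value (smaller means a stronger filter)."""
    counts = Counter(row[column] for row in rows)
    return counts[value] / len(rows)

def rank_sort_key(rows, predicates):
    """Order the filtered columns from most selective to least selective."""
    return sorted(predicates, key=lambda col: selectivity(rows, col, predicates[col]))

predicates = {"Computed_Risk": "HIGH", "Data_Level": "Rule"}
sort_key = rank_sort_key(rows, predicates)

for col, val in predicates.items():
    print(col, val, selectivity(rows, col, val))
print(sort_key)
```

On this sample, “Computed_Risk” = “HIGH” keeps a far smaller share of the rows than “Data_Level” = “Rule”, so it is ranked first in the sort key. This is the sense in which skew on a frequently filtered column works in the model's favor.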
In Part 3, we will look at the second building block, the “Perfect” Index, from the “PERFECT” Query Performance Formula. Stay tuned!