In this blog, we'll look at the third building block of "PERFECT" query performance: Spark tuning. If you have not read the previous posts in this series, you can find them here: Part 1, Part 2, Part 3.
Parquet file block size is one of the hidden tricks that can be used to fine-tune a model for extreme performance. This parameter represents the size of a data block in a Parquet file and determines the minimum amount of data a Spark task reads. Slicing a Parquet file into smaller chunks by lowering this number produces higher parallelism, and therefore shorter execution times, when resources are unlimited.
In resource-limited settings, however, things are more complicated. The belief that "the smaller the block size, the better the query performance" does not hold here, and there is no fixed formula for calculating a magic number: the right block size is an experimental conclusion that depends on various factors, including the resources available and the complexity of the queries. Through extensive testing and hands-on work, I have found a sweet spot that guarantees excellent query performance while keeping resource utilization high. To make your life easier, here are the best Parquet block size pairs for writing and reading jobs to get you started:
Parquet file block size of 32m: start by experimenting with a model at 32m and 64m and comparing the results; the two should give roughly the same query performance.
Parquet file block size of 64m: in most cases, 64m gives the best query performance.
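To make these values concrete, here is a minimal PySpark sketch showing how a Parquet block (row group) size of 32m could be applied to a plain Spark writing job. It uses the standard Spark and Parquet settings rather than Kyligence's internal configuration, and the bucket paths are illustrative assumptions.

```python
from pyspark.sql import SparkSession

BLOCK_SIZE_32M = 32 * 1024 * 1024  # 32m, one of the recommended starting values

spark = (
    SparkSession.builder
    .appName("parquet-block-size-demo")
    # Propagate the Parquet writer setting through the Hadoop configuration.
    .config("spark.hadoop.parquet.block.size", str(BLOCK_SIZE_32M))
    .getOrCreate()
)

# Hypothetical source and target paths, used only for illustration.
df = spark.read.parquet("s3://my-bucket/source_table/")

(
    df.write
    .mode("overwrite")
    # The per-write option overrides the session-level value for this job only.
    .option("parquet.block.size", str(BLOCK_SIZE_32M))
    .parquet("s3://my-bucket/target_table/")
)
```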
Kyligence Workspace Config Center
The Parquet block size for model building jobs (a.k.a. Spark writing jobs) can be set either in the Kyligence Workspace Config Center or on a Kyligence model's setting page. For query jobs (a.k.a. Spark reading jobs), this parameter can only be configured at the workspace level. It is highly recommended that the two values match to ensure the best query performance.
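Because the writing and reading settings should match, it can be useful to confirm the block size a building job actually produced. The sketch below, assuming pyarrow is installed and using a hypothetical file path, inspects the row groups of a generated Parquet file; their uncompressed sizes should land near the configured 32m or 64m value.

```python
import pyarrow.parquet as pq

# Hypothetical path to one of the Parquet files produced by a building job.
pf = pq.ParquetFile("/data/target_table/part-00000.snappy.parquet")
meta = pf.metadata

print(f"row groups: {meta.num_row_groups}")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # total_byte_size is the uncompressed size of the row group; it should sit
    # close to the configured Parquet block size (e.g. 32m or 64m).
    print(f"row group {i}: {rg.total_byte_size / (1024 * 1024):.1f} MB, {rg.num_rows} rows")
```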
Making the Final Touches to a Kyligence Data Model
Only One Step Away from a “Perfect” Kyligence Data Model
Stay Tuned!