Kyligence Cloud Model Design Principles — Part 4 Spark Tuning

Lori Lu

Solution Architect & Technology Advocate

May 22, 2022

In this blog, we’ll look at the third building block for “PERFECT” query performance: Spark tuning. If you have not read the previous blogs in this series, please start with Part 1, Part 2, and Part 3.

a “little bit” Spark Tuning

Parquet file block size is one of the hidden knobs for fine-tuning a model for extreme performance. This parameter sets the size of a data block in a parquet file and determines the minimum amount of data a Spark task reads. Making this number smaller slices a parquet file into smaller chunks, which produces higher parallelism and thus a shorter execution time, provided resources are unlimited.
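To make the parallelism argument concrete, here is a minimal sketch that assumes one Spark read task per parquet block. That is a simplification of how Spark actually plans file splits, and the 1 GiB file size and the function name `estimated_read_tasks` are hypothetical, chosen only for illustration.

```python
import math

MIB = 1024 * 1024


def estimated_read_tasks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Simplified model (an assumption): one read task per parquet block."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    return math.ceil(file_size_bytes / block_size_bytes)


# Hypothetical 1 GiB parquet file, read with three candidate block sizes.
file_size = 1024 * MIB
for block_mib in (128, 64, 32):
    tasks = estimated_read_tasks(file_size, block_mib * MIB)
    print(f"{block_mib}m blocks -> about {tasks} read tasks")
```

Halving the block size doubles the number of tasks in this model, which only shortens execution time if there are enough free task slots to run them at once.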

However, in resource-limited settings, things are a bit more complicated. The belief that “the smaller the block size, the better the query performance” does not hold here. There is also no fixed formula for a magic block size, because the right number is an experimental result that depends on various factors, including the resources available and the complexity of the queries. Through loads of testing and hands-on work, I have found a sweet spot that delivers excellent query performance and high resource utilization. To make your life easier, here are the parquet block sizes, for both writing and reading jobs, that you should start with:

Use Case 1 — Model with COUNT_DISTINCT measures

Start by comparing the performance of a model at 32m and at 64m. The two settings should give roughly the same query performance.

Parquet file block size — 32m

Use Case 2 — Model without COUNT_DISTINCT measures

In most cases, 64m should give the best query performance.

Parquet file block size — 64m
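Why a mid-sized block can beat both a smaller and a larger one when resources are limited can be illustrated with a toy model. This is a rough sketch, not a description of Spark's scheduler: it assumes tasks run in "waves" over a fixed number of task slots, and that each task pays a fixed startup overhead. Every number below (1 GiB of data, 16 slots, 100 MiB/s scan speed, 0.5 s overhead) is a made-up assumption, and `estimated_scan_seconds` is a hypothetical helper.

```python
import math


def estimated_scan_seconds(
    total_mib: float,
    block_mib: float,
    task_slots: int,
    mib_per_second: float = 100.0,    # assumed scan speed of one task
    per_task_overhead_s: float = 0.5,  # assumed fixed cost of starting a task
) -> float:
    """Toy estimate: tasks run in waves of `task_slots` tasks at a time."""
    tasks = math.ceil(total_mib / block_mib)
    waves = math.ceil(tasks / task_slots)
    seconds_per_task = per_task_overhead_s + block_mib / mib_per_second
    return waves * seconds_per_task


# Hypothetical workload: 1 GiB of parquet data and 16 available task slots.
for block_mib in (128, 64, 32, 16):
    seconds = estimated_scan_seconds(1024, block_mib, task_slots=16)
    print(f"{block_mib}m -> {seconds:.2f} s (toy estimate)")
```

Under these made-up numbers, 64m comes out ahead: 128m leaves half of the slots idle, while 32m and 16m need extra waves and pay the per-task overhead more often. Real workloads will differ, which is why the block size has to be found by testing.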

How to Configure Parquet File Block Size in Kyligence

Kyligence Workspace Config Center

The parquet block size for model building jobs (aka Spark writing jobs) can be set either in the Kyligence workspace config center or on a Kyligence model settings page. For query jobs (aka Spark reading jobs), this parameter can only be configured at the workspace level. It is highly recommended that the two numbers match to ensure the best query performance.
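The "keep the two numbers matched" advice can be sketched as a small helper that derives both the write-side and the read-side value from a single size string. The property names used here, `spark.hadoop.parquet.block.size` and `spark.sql.files.maxPartitionBytes`, are plain Spark and Parquet settings chosen as an analogy. The names of the corresponding Kyligence settings are not given in this post and may differ, and `parse_size` and `matched_block_settings` are hypothetical helpers.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '32m' or '64m' into a number of bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def matched_block_settings(block_size: str) -> dict:
    """Build write-side and read-side settings from one shared value.

    Assumption: plain Spark/Parquet property names, used for illustration
    only. They are not taken from the Kyligence config center.
    """
    size = str(parse_size(block_size))
    return {
        "spark.hadoop.parquet.block.size": size,    # writing (build) jobs
        "spark.sql.files.maxPartitionBytes": size,  # reading (query) jobs
    }


for key, value in matched_block_settings("32m").items():
    print(f"{key}={value}")
```

Deriving both values from one input removes the chance of the write and read sizes drifting apart when one of them is changed later.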

What’s Next

Making the Final Touches to a Kyligence Data Model

Only One Step Away from a “Perfect” Kyligence Data Model

Stay Tuned!
