Meet Your AI Copilot fot Data Learn More
Your AI Copilot for Data
Kyligence Zen Kyligence Zen
Kyligence Enterprise Kyligence Enterprise
Metrics Platform
OLAP Platform
Customers
Definitive Guide to Decision Intelligence
Recommended
Resources
Apache Kylin
About
Partners
Modern corporations are increasingly looking for near real-time analytics and insights to make actionable decisions. To help organizations understand more about the benefits of ClickHouse, we will focus on how you can achieve sub-second analysis in Excel with Kyligence Tiered Storage.
This article is based on the benchmark dataset of the retail industry, which contains over 100 MILLION rows in the transaction table, I will operate with dimensions and measures in Excel pivot to compare the query performance among Object Storage, Base Index and Tiered Storage.
Microsoft Excel is, undoubtedly, the most popular spreadsheet software in the market. Each iteration of this software gives the users a better tool for their statistical needs.PivotTable in Excel is a powerful tool to calculate, summarize, and analyze data that allows you to see comparisons, patterns, and trends in your data.
ClickHouse is an open-source, OLAP (Online Analytical Processing) database originally developed by the company Yandex, for the needs of its Metrica solution (similar to Google Analytics). Yandex is the first search engine used in Russia.
ClickHouse was developed with a simple objective: to filter and aggregate as much data as possible and as quickly as possible. Similar to other solutions of the same type (eg. Druid, Pinot), ClickHouse uses a column-oriented model for data storage. It also relies on various parallelization and vectorization mechanisms to take the most advantage of multi-core architectures. Hence, Clickhouse can support data volumes of several petabytes.
First, I used Kyligence Cloud, if you also like to try Kyligence cloud, you may visit Kyligence to get a free trial. In order to analyze data, it is necessary to import data. I used a 100 million rows dataset which can fully exhibit strong query performance of ClickHouse.
The SSB (Star Schema Benchmark) dataset is designed to measure the performance of database products in support of classical data warehousing applications and is based on the TPC-H benchmark.There are 5 tables in the SSB dataset - table sampling is turned on by default. With sampling I can get statistical information of the entire dataset.
I ran a sample SQL query to check the row amount of SSB.LINEORDER table. As you can see from the screenshot below, there are over 1 million rows in this table.
In Kyligence Cloud, I enabled AI Augmented Mode to build models with SQL.
Created a model automatically from SQL command.
Below is my SQL command which was used for creating the model.
SELECT SSB.LINEORDER.LO_ORDERKEY, SSB.LINEORDER.LO_LINENUMBER, SSB.LINEORDER.LO_CUSTKEY, SSB.LINEORDER.LO_PARTKEY, SSB.LINEORDER.LO_SUPPKEY, SSB.LINEORDER.LO_ORDERDATE, SSB.LINEORDER.LO_ORDERPRIOTITY, SSB.LINEORDER.LO_SHIPPRIOTITY, SSB.LINEORDER.LO_TAX, SSB.LINEORDER.LO_COMMITDATE, SSB.LINEORDER.LO_SHIPMODE, SSB.PART.P_PARTKEY, SSB.PART.P_NAME, SSB.PART.P_MFGR, SSB.PART.P_CATEGORY, SSB.PART.P_BRAND, SSB.PART.P_COLOR, SSB.PART.P_TYPE, SSB.PART.P_SIZE, SSB.PART.P_CONTAINER, SSB.SUPPLIER.S_SUPPKEY, SSB.SUPPLIER.S_NAME, SSB.SUPPLIER.S_ADDRESS, SSB.SUPPLIER.S_CITY, SSB.SUPPLIER.S_NATION, SSB.SUPPLIER.S_REGION, SSB.SUPPLIER.S_PHONE, SSB.CUSTOMER.C_CUSTKEY, SSB.CUSTOMER.C_NAME, SSB.CUSTOMER.C_ADDRESS, SSB.CUSTOMER.C_CITY, SSB.CUSTOMER.C_NATION, SSB.CUSTOMER.C_REGION, SSB.CUSTOMER.C_PHONE, SSB.CUSTOMER.C_MKTSEGMENT, SSB.CALENDAR.D_DATEKEY, SSB.CALENDAR.D_DATE, SSB.CALENDAR.D_DAYOFWEEK, SSB.CALENDAR.D_MONTH, SSB.CALENDAR.D_YEAR, SSB.CALENDAR.D_YEARMONTHNUM, SSB.CALENDAR.D_YEARMONTH, SSB.CALENDAR.D_DAYNUMINWEEK, SSB.CALENDAR.D_DAYNUMINMONTH, SSB.CALENDAR.D_DAYNUMINYEAR, SSB.CALENDAR.D_MONTHNUMINYEAR, SSB.CALENDAR.D_WEEKNUMINYEAR, SSB.CALENDAR.D_SELLINGSEASON, SSB.CALENDAR.D_LASTDAYINWEEKFL, SSB.CALENDAR.D_LASTDAYINMONTHFL, SSB.CALENDAR.D_HOLIDAYFL, SSB.CALENDAR.D_WEEKDAYFL, SUM(SSB.LINEORDER.LO_REVENUE) AS SUM_REVENUE, SUM(SSB.LINEORDER.LO_SUPPLYCOST) AS SUM_COST FROM SSB.LINEORDER LEFT JOIN SSB.PART ON SSB.LINEORDER.LO_PARTKEY=SSB.PART.P_PARTKEY LEFT JOIN SSB.CUSTOMER ON SSB.LINEORDER.LO_CUSTKEY=SSB.CUSTOMER.C_CUSTKEY LEFT JOIN SSB.SUPPLIER ON SSB.LINEORDER.LO_SUPPKEY=SSB.SUPPLIER.S_SUPPKEY LEFT JOIN SSB.CALENDAR ON SSB.LINEORDER.LO_ORDERDATE=SSB.CALENDAR.D_DATEKEY GROUP BY SSB.LINEORDER.LO_ORDERKEY, SSB.LINEORDER.LO_LINENUMBER, SSB.LINEORDER.LO_CUSTKEY, SSB.LINEORDER.LO_PARTKEY, SSB.LINEORDER.LO_SUPPKEY, SSB.LINEORDER.LO_ORDERDATE, SSB.LINEORDER.LO_ORDERPRIOTITY, SSB.LINEORDER.LO_SHIPPRIOTITY, SSB.LINEORDER.LO_TAX, SSB.LINEORDER.LO_COMMITDATE, SSB.LINEORDER.LO_SHIPMODE, SSB.PART.P_PARTKEY, SSB.PART.P_NAME, SSB.PART.P_MFGR, SSB.PART.P_CATEGORY, SSB.PART.P_BRAND, SSB.PART.P_COLOR, SSB.PART.P_TYPE, SSB.PART.P_SIZE, SSB.PART.P_CONTAINER, SSB.SUPPLIER.S_SUPPKEY, SSB.SUPPLIER.S_NAME, SSB.SUPPLIER.S_ADDRESS, SSB.SUPPLIER.S_CITY, SSB.SUPPLIER.S_NATION, SSB.SUPPLIER.S_REGION, SSB.SUPPLIER.S_PHONE, SSB.CUSTOMER.C_CUSTKEY, SSB.CUSTOMER.C_NAME, SSB.CUSTOMER.C_ADDRESS, SSB.CUSTOMER.C_CITY, SSB.CUSTOMER.C_NATION, SSB.CUSTOMER.C_REGION, SSB.CUSTOMER.C_PHONE, SSB.CUSTOMER.C_MKTSEGMENT, SSB.CALENDAR.D_DATEKEY, SSB.CALENDAR.D_DATE, SSB.CALENDAR.D_DAYOFWEEK, SSB.CALENDAR.D_MONTH, SSB.CALENDAR.D_YEAR, SSB.CALENDAR.D_YEARMONTHNUM, SSB.CALENDAR.D_YEARMONTH, SSB.CALENDAR.D_DAYNUMINWEEK, SSB.CALENDAR.D_DAYNUMINMONTH, SSB.CALENDAR.D_DAYNUMINYEAR, SSB.CALENDAR.D_MONTHNUMINYEAR, SSB.CALENDAR.D_WEEKNUMINYEAR, SSB.CALENDAR.D_SELLINGSEASON, SSB.CALENDAR.D_LASTDAYINWEEKFL, SSB.CALENDAR.D_LASTDAYINMONTHFL, SSB.CALENDAR.D_HOLIDAYFL, SSB.CALENDAR.D_WEEKDAYFL ;
Before saving the current model, cancel the option of Add Base Indexes.
Finally we have a start schema model for data analysis.
After building the model, let's explore MDX in Kyligence Cloud.
I need to set a name for MDX dataset. Here, the dataset was Kyligence Cloud model.
You could even define more complicated relationships among MDX datasets.
When defining Semantics, you can define more objects, like Dimension, Measure, Hierarchy, Calculated Measure, etc. These semantics will be synchronized to uplevel BI tools. In Tableau, PowerBI or Excel, end users can drag & drop MDX objects to analyze the data. Let's create a Profit measure as an example, which equals subtracting cost from revenue. profit = revenue - cost.
Opened Microsoft Excel from Windows 11 , or other version of Windows.
Click Data > Get Data > From Database > From Analysis Services. Note: this service only works in Windows. In MacOS, Analysis Services is not yet compatible.
Kyligence Cloud provides enough clear connection information, which helped me set up a connection between Excel and MDX.
First, add "YEAR" "CATEGORY" in Rows, "Revenue" in Values.
It took 17s to get all the results.
From Query History, the MDX query had been parsed into different parts of SQL queries. The queries were answered by OBJECT STORAGE, which means they got push down to the original dataset.
Second, add "YEAR" "CATEGORY" in Rows, "Supplycost" in Values.
It took 11s.
Third, add "YEAR" "CATEGORY" "REGION""NATION" in Rows, "Profit" in Values.
If there is no base index or when a base index is missing, click + Index to add the base indexes in the drop-down box.
We often treat base index as safety index because it includes all dimensions and measures. Base index can answer the ad hoc queries on the fly, then avoid queries pushing down to object storage. To benefit from the base index, you should build your segments based on the base index.
After finishing building the base index, I found queries could be answered by model instead of object storage. Hence, base indexes can avoid queries pushed downto original datasets.
Here I'm listing the query performance of base table index, with 3 of the same queries in Excel as before.
Kyligence Cloud 4.5 provides our users with the ClickHouse option. To benefit from ClickHouse, user has to add another node named Tiered Storage Node when installing Wordspace in their clouds.
Also, Tiered Storage needs to be turned on in the "Setting" page.
Chose the base table index, then "Load to the Tiered Storage".
It will take some time to load data to ClickHouse.
After the data was loaded into ClickHouse, I ran the same 3 queries in Excel pivot as before.
Then, I compared these three engines: "Query Pushdown," "Base table index" and "Tiered Storage with ClickHouse." The performance of Tiered Storage with the same queries was really impressive. I tried some other combinations of dimensions and measures, and the performance of ClickHouse was relatively fast even in self-service analysis or ad-hoc.
Powered by the technology of ClickHouse, Kyligence Cloud 4.5 offers our users the advanced feature of Tiered Storage, which can help users quickly and cold start to query without pre-computation, and can significantly improve the performance of ultra-multi-dimensional flexible analysis and detailed query analysis. At the same time, the two-level storage design based on HDFS/object storage and ClickHouse can fully cover various analysis scenarios, bringing users more possibilities for analysis and exploration based on massive data.
Tiered storage not only supports flexible ad-hoc query scenarios, such as tagging analysis or user behavior analysis, which will bring greater assistance to the enterprise's refined operation and auxiliary business decision-making; it also can be used with the Kyligence unified model to manage data so as to seamlessly connect with mainstream business BI tools and Excel capabilities to empower business analysts to analyze.
After enabling tiered storage, users do not need to worry about how data is distributed and how to import, they only need to focus on modeling according to business needs. This can greatly reduce the cost of modeling and model optimization. Tiered Storage can make business more agile in the future.
The following figure shows the architecture of tiered storage:
Unlock potentials of analytics query accelerators for swift data processing and insights from cloud data lakes. Explore advanced features of Kyligence Zen.
Optimize data analytics with AWS S3. Leverage large language models and accelerate decision-making.
Optimize data analytics with Snowflake's Data Copilot. Leverage large language models and accelerate decision-making.
Discover the 7 top AI analytics tools! Learn about their pros, cons, and pricing, and choose the best one to transform your business.
Discover operational and executive SaaS metrics that matter for customers success, importance, and why you should track them with Kyligence Zen.
Unlock the future of augmented analytics with this must-read blog. Discover the top 5 tools that are reshaping the analytics landscape.
What website metrics matter in business? Learn about categories, vital website metrics, how to measure them, and how Kyligence simplifies it.
Already have an account? Click here to login
You'll get
A complete product experience
A guided demo of the whole process, from data import, modeling to analysis, by our data experts.
Q&A session with industry experts
Our data experts will answer your questions about customized solutions.
Please fill in your contact information.We'll get back to you in 1-2 business days.
Industrial Scenario Demostration
Scenarios in Finance, Retail, Manufacturing industries, which best meet your business requirements.
Consulting From Experts
Talk to Senior Technical Experts, and help you quickly adopt AI applications.