Build the Common Data Language with the Metrics Platform Start Now
By Use Cases
By BI Tools
Subscribe to our newsletter>
Get the latest products updates, community events and other news.
The Apache Kylin community is pleased to announce the release of Apache Kylin v2.5.0.
Apache Kylin is an open source Distributed Analytics OLAP engine designed to provide SQL interface and multi-dimensional analysis (extreme OLAP) on Big Data supporting extremely large datasets.
This is a major release after 2.4.0. There are many enhancements in this release. Here just highlight the major ones:
Now Kylin’s Spark engine will run all distributed jobs in Spark, including fetch distinct dimension values, converting cuboid files to HBase HFile, merging segments, merging dictionaries, etc. The default configurations are tuned so the user can get an out-of-box experience. The overall performance with the previous version is close, but we assume Spark has more room to improve. The related tasks are KYLIN-3427, KYLIN-3441, KYLIN-3442.
There are also improvements in the job management. Now you can get the job link on the web console once Spark starts to run. If you discard the job, Kylin will kill the Spark job to release the resource in time. If Kylin is restarted, it can resume from the previous job instead of re-submitting a new job.
In the past, HBase is the only option for Kylin metadata. In some cases, this is not applicable, for example using replicated HBase cluster for Kylin’s HA (the replicated HBase is read only). Now we introduce the MySQL metastore to fulfill such need. This function is in beta now. Check KYLIN-3488 for more.
Hybrid is an advanced model for compositing multiple Cubes. It can be used for the Cube schema change issue. This function had no GUI in the past so only a small portion of Kylin users know it. Now we added the web GUI for it so all users can try it.
The OLAP Cube planner can greatly optimize the cube structure, save the computing/storage resources and improve the query performance. It was introduced in v2.3 but is disabled by default. In order to let more users seeing and trying it, we enable it by default in v2.5. The algorithm will automatically optimize the cube by your data statistics on the first build.
Segment (partition) pruning can efficiently reduce the disk and network I/O, so to greatly improve the query performance. In the past, Kylin only prunes segments by the partition column’s value. If the query doesn’t have the partition column as the filtering condition, the pruning won’t work, all segments will be scanned.
Now from v2.5, Kylin will record the min/max value for EVERY dimension at the segment level. Before scanning a segment, it will compare the query’s conditions with the min/max index. If not matched, the segment will be skipped. Check KYLIN-3370 for more.
When segments get merged, their dictionaries also need to be merged. In the past, the merging happens in Kylin’s JVM, which takes a lot of memory and CPU resources. In extreme case (if you have a couple of concurrent jobs) it may crash the Kylin process. Since this, some users have to allocate much more memory to Kylin job node or run multiple job nodes to balance the workload.
Now from v2.5, Kylin will submit this task to Hadoop MR or Spark, so this bottleneck can be solved. Check KYLIN-3471 for more.
Global Dictionary is a must for bitmap count distinct. The GD can be very large if the column has a very high cardinality. In the cube building phase, Kylin need to translate the non-integer values to integers by the GD. Although the GD has been split into several slices, the values are often scrambled. Kylin needs swap in/out the slices into memory repeatedly, which causes the building slowly.
The enhancement introduces a new step to build a shrunken dictionary for each data block. Then each task only loads the shrunken dictionary, which is quite small, so there is no swap in/out any more in the cubing step. Then the performance can be 3x faster than before. Check KYLIN-3491 for more.
OLAP Cube size estimation is used in several steps, such as decides the MR/Spark job partition number, calculates the HBase region number etc. It will affect the build performance much. The estimation can be wild when there is COUNT DISTINCT, TOPN measures because their size is flexible. The incorrect estimation may cause too many data partitions and then too many tasks. In the past, users need to tune several parameters to make the size estimation more close to real size, that is hard to do.
Now Kylin will correct the size estimation automatically based on the collected data statistics. This can make the estimation much closer with the real size than before. Check KYLIN-3453 for more.
Hadoop 3 and HBase 2 starts to be adopted by many users. Now we provide new binary packages compiled with the new Hadoop and HBase API. We tested them on Hortonworks HDP 3.0 and Cloudera CDH 6.0.
All of the changes can be found in the release notes.
To download Apache Kylin v2.5.0 source code or binary package, visit the download page.
Follow the upgrade guide.
If you face issue or question, please send mail to Apache Kylin dev or user mailing list: firstname.lastname@example.org , email@example.com; Before sending, please make sure you have subscribed the mailing list by dropping an email to firstname.lastname@example.org or email@example.com.
Great thanks to everyone who contributed!
Get all the facts about Apache Kylin and find out how its extreme OLAP technology matches up against Kyligence. Learn more about the benefits these two augmented OLAP solutions provide on our Kylin vs. Kyligence comparison page.
The driving force behind Meituan’s success is not simply a robust analytics system, but the OLAP engine that system is built upon - Apache Kylin.
Cloud Analytics News will share the important news on Apache Kylin, Kyligence Cloud, and related technologies. In this edition, we cover Apache Kylin 4.X beta, the launch of Kyligence Cloud 4, Pivot to Snowflake, and more.
UnionPay was able to consolidate the 1,200 Cognos cubes into 2 Kyligence cubes and a single ETL process. Besides extending the life of the analytics executed against this data, there was a massive improvement in operational efficiency.
A peek behind the curtain of the world's leading open source big data analytics project, Apache Kylin.
An introduction to Apache Kylin's new storage and compute architecture, Apache Parquet. This article introduces Kylin's query principles, Parquet storage, and accurate duplicate removal
99 Almaden Boulevard Suite #663
San Jose, CA 95113
+1 (669) 256-3378
Ⓒ 2023 Kyligence, Inc. All rights reserved.
Already have an account? Click here to login
您还可以在云平台中 部署 Kyligence
直接获得 30 天免费试用
请填写真实信息，我们会在 1-2 个工作日内电话与您联系。