Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (extreme OLAP) on big data, supporting extremely large datasets.
Apache Kylin v2.6.0 is a major release after 2.5.0 and includes many enhancements. All changes are listed in the release notes; the major ones are highlighted below.
SDK for JDBC Sources
Apache Kylin already supports several data sources like Amazon Redshift and SQL Server through JDBC.
To help developers handle SQL dialect differences and easily implement a new data source through JDBC, Kylin provides a new data source SDK with APIs for:
- Synchronizing metadata and data from a JDBC source
- Building OLAP cubes from a JDBC source
- Pushing queries down to the JDBC source engine when no cube matches
Check KYLIN-3552 for more.
Memcached as Distributed Cache
In the past, Kylin did not use query caches efficiently, for two reasons: an aggressive cache expiration strategy and local-only caching. Because of the aggressive expiration strategy, useful cache entries were often evicted unnecessarily.
Because query caches were stored on individual servers, they could not be shared between servers. And because the size of a local cache is limited, not all useful query results could be cached.
To address these shortcomings, we changed the query cache expiration strategy to use signature checking and introduced Memcached as Kylin’s distributed cache, so that Kylin servers can share cached results with each other.
It is also easy to add Memcached servers to scale out the distributed cache; with enough Memcached servers, as many results as needed can be cached. We also introduced a segment-level query cache, which not only speeds up queries but also reduces the number of RPCs to HBase.
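As a rough illustration of expiration by signature checking, the sketch below stores each cached result together with a signature of the data it was computed from and treats a lookup as a miss when the current signature differs. It is a minimal sketch, not Kylin's implementation: the class `SignatureCheckedCache`, the signature format (segment names plus last build times), and the in-memory map standing in for Memcached are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of signature-checked query caching; not Kylin's actual classes. */
final class SignatureCheckedCache {

    /** A cached result plus the signature of the data it was computed from. */
    private static final class CachedResult {
        final String result;
        final String signature;

        CachedResult(String result, String signature) {
            this.result = result;
            this.signature = signature;
        }
    }

    // In-memory stand-in for a shared store such as Memcached (assumption).
    private final Map<String, CachedResult> store = new ConcurrentHashMap<>();

    /** Assumed signature: segment names and last build times, in sorted order. */
    static String signatureOf(Map<String, Long> segmentBuildTimes) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : new TreeMap<>(segmentBuildTimes).entrySet()) {
            sb.append(e.getKey()).append('@').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    void put(String sql, String result, String signature) {
        store.put(sql, new CachedResult(result, signature));
    }

    /** Returns the cached result only if the underlying data is unchanged. */
    Optional<String> get(String sql, String currentSignature) {
        CachedResult cached = store.get(sql);
        if (cached == null) {
            return Optional.empty();
        }
        if (!cached.signature.equals(currentSignature)) {
            store.remove(sql, cached); // data changed since caching: drop the stale entry
            return Optional.empty();
        }
        return Optional.of(cached.result);
    }

    public static void main(String[] args) {
        Map<String, Long> segments = new HashMap<>();
        segments.put("20190101_20190201", 1000L);
        SignatureCheckedCache cache = new SignatureCheckedCache();
        String sql = "select count(*) from sales";
        cache.put(sql, "42", signatureOf(segments));
        System.out.println(cache.get(sql, signatureOf(segments)).isPresent()); // true
        segments.put("20190101_20190201", 2000L); // the segment was rebuilt
        System.out.println(cache.get(sql, signatureOf(segments)).isPresent()); // false
    }
}
```

With this kind of check, an entry stays valid for as long as the data behind it is unchanged, instead of being expired on a fixed schedule.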
The related tasks are KYLIN-2895, KYLIN-2894, KYLIN-2896, KYLIN-2897, KYLIN-2898, KYLIN-2899.
ForkJoinPool for Fast Extreme OLAP Cubing
In the past, fast OLAP cubing used split threads, task threads, and a main thread to build the OLAP cube, which required complex join and error-handling logic.
The new implementation leverages the ForkJoinPool from the JDK. The event split logic is handled in the main thread, while cuboid tasks and sub-tasks run in the fork/join pool. Cube results are collected asynchronously and can be written to output earlier.
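The general shape of cuboid tasks and sub-tasks in a fork/join pool might look like the sketch below. It is only an illustration under stated assumptions: `ForkJoinCubingSketch` and `CuboidTask` are hypothetical names, cuboids are modeled as dimension bitmasks, and the aggregation work is replaced by a placeholder that publishes the cuboid ID to a concurrent queue. Only `ForkJoinPool`, `RecursiveAction`, and `ConcurrentLinkedQueue` are real JDK classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

/** Hypothetical sketch of cuboid tasks in a fork/join pool; not Kylin's builder. */
final class ForkJoinCubingSketch {

    /** Builds one cuboid, publishes it, then forks one sub-task per child cuboid. */
    static final class CuboidTask extends RecursiveAction {
        private final long cuboidId;      // bitmask of the dimensions kept in this cuboid
        private final int limit;          // only dimensions below this index may be dropped
        private final Queue<Long> output; // results become visible as soon as they are ready

        CuboidTask(long cuboidId, int limit, Queue<Long> output) {
            this.cuboidId = cuboidId;
            this.limit = limit;
            this.output = output;
        }

        @Override
        protected void compute() {
            // Placeholder for the real aggregation: just publish the cuboid ID.
            output.add(cuboidId);

            // Drop one dimension at a time; the limit ensures each cuboid is built once.
            List<CuboidTask> children = new ArrayList<>();
            for (int dim = 0; dim < limit; dim++) {
                long child = cuboidId & ~(1L << dim);
                if (child != 0) { // skip the empty cuboid
                    children.add(new CuboidTask(child, dim, output));
                }
            }
            invokeAll(children); // fork the children and wait for them to finish
        }
    }

    /** Builds every non-empty cuboid for the given number of dimensions. */
    static Queue<Long> buildAll(int dimensionCount) {
        Queue<Long> output = new ConcurrentLinkedQueue<>();
        long baseCuboid = (1L << dimensionCount) - 1;
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new CuboidTask(baseCuboid, dimensionCount, output));
        } finally {
            pool.shutdown();
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(buildAll(4).size()); // 15 cuboids for 4 dimensions
    }
}
```

Because each task publishes its result before forking its children, a consumer reading the queue can start writing output without waiting for the whole tree to finish.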
Check KYLIN-2932 for more.
Improve HLLCounter Performance
In the past, creating an HLLCounter and computing the harmonic mean were both inefficient.
The new implementation speeds up HLLCounter creation by copying registers from another HLLCounter instead of merging. It speeds up the harmonic mean computation in HLLCSnapshot by:
- Using a table to cache all 1/2^r values instead of computing them on the fly
- Replacing floating-point addition with integer addition in the larger loop
- Removing a branch, e.g. no longer checking whether registers[i] is zero (a minor improvement)
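One way to combine these three ideas is sketched below. This is not the code from HLLCSnapshot: the class `HarmonicMeanSketch`, the register-value bound `MAX_RANK`, and the histogram approach are assumptions chosen for illustration. The loop over all registers only increments integer counters, with no zero check, and the precomputed 1/2^r table is consulted in a second, much shorter loop.

```java
/** Hypothetical sketch of the harmonic-mean optimizations; not Kylin's HLLCSnapshot code. */
final class HarmonicMeanSketch {

    private static final int MAX_RANK = 64; // assumed upper bound of a register value

    // Lookup table: INVERSE_POW2[r] == 1 / 2^r, computed once instead of on the fly.
    private static final double[] INVERSE_POW2 = new double[MAX_RANK + 1];

    static {
        INVERSE_POW2[0] = 1.0;
        for (int r = 1; r <= MAX_RANK; r++) {
            INVERSE_POW2[r] = INVERSE_POW2[r - 1] / 2;
        }
    }

    /** Returns the sum of 1/2^register over all registers (values assumed in 0..MAX_RANK). */
    static double inversePowerSum(byte[] registers) {
        // Larger loop: integer additions only, and no branch on zero registers.
        int[] histogram = new int[MAX_RANK + 1];
        for (byte register : registers) {
            histogram[register]++;
        }
        // Smaller loop: one multiply-add per possible register value.
        double sum = 0;
        for (int r = 0; r <= MAX_RANK; r++) {
            sum += histogram[r] * INVERSE_POW2[r];
        }
        return sum;
    }

    /** Harmonic mean of 2^register over all registers. */
    static double harmonicMean(byte[] registers) {
        return registers.length / inversePowerSum(registers);
    }

    public static void main(String[] args) {
        byte[] registers = {0, 1, 2, 2};
        System.out.println(inversePowerSum(registers)); // 2.0
        System.out.println(harmonicMean(registers));    // 2.0
    }
}
```

A side effect of the histogram is that `histogram[0]` already holds the number of zero registers, so that count comes for free when it is needed.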
Check KYLIN-3656 for more.
Improve Cuboid Recommendation Algorithm
In the past, to add cuboids that were not prebuilt, the cube planner relied on mandatory cuboids, which were selected when their rollup row count exceeded a threshold.
This approach had two shortcomings:
- The estimate of the rollup row count was unreliable
- It was hard to determine the rollup row count threshold for recommending mandatory cuboids
The new implementation estimates the row count of un-prebuilt cuboids from the rollup ratio rather than from the exact rollup row count. With better row count estimates for un-prebuilt cuboids, the cost-based cube planner algorithm decides which cuboids to build, which takes over the role of the threshold previously used for mandatory cuboids.
With this improvement, the threshold for recommending mandatory cuboids is no longer needed; mandatory cuboids can only be set manually and are no longer recommended.
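To make the idea of estimating by rollup ratio concrete, here is a minimal sketch. It assumes, purely for illustration, that each query sample records how many rows were scanned in a prebuilt source cuboid and how many rows were returned after rolling up to the un-prebuilt target cuboid; the class `RollupRatioSketch`, its methods, and the fallback ratio of 1.0 are hypothetical and not taken from Kylin's cube planner.

```java
/** Hypothetical sketch of row-count estimation by rollup ratio; not Kylin's cube planner. */
final class RollupRatioSketch {

    /**
     * Averages returned/scanned over the samples.
     * Each sample is {scannedRows, returnedRows} (an assumed layout).
     */
    static double averageRollupRatio(long[][] samples) {
        double total = 0;
        int used = 0;
        for (long[] sample : samples) {
            if (sample[0] > 0) { // ignore samples that scanned nothing
                total += (double) sample[1] / sample[0];
                used++;
            }
        }
        // Assumption: with no usable samples, expect no reduction at all.
        return used == 0 ? 1.0 : total / used;
    }

    /** Scales the known row count of the prebuilt source cuboid by the ratio. */
    static long estimateRowCount(long sourceCuboidRows, double rollupRatio) {
        return Math.max(1L, Math.round(sourceCuboidRows * rollupRatio));
    }

    public static void main(String[] args) {
        long[][] samples = {{1000, 100}, {2000, 400}}; // ratios 0.1 and 0.2
        double ratio = averageRollupRatio(samples);
        System.out.println(estimateRowCount(1_000_000L, ratio)); // 150000
    }
}
```

The point of using a ratio is that it can be applied to the full row count of the source cuboid, whereas the raw row counts seen by individual queries depend on each query's filters.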
Check KYLIN-3540 for more.
To download Apache Kylin v2.6.0 source code or binary package, visit the download page.
To upgrade from an earlier version, follow the upgrade guide.
If you face issues or have any questions, please send mail to the Apache Kylin dev or user mailing list: email@example.com, firstname.lastname@example.org. Before sending, please make sure you have subscribed to the mailing list by sending an email to email@example.com or firstname.lastname@example.org.
Great thanks to everyone who contributed!