Why Kylin Is the Only OLAP Engine for Sub-Second Count Distinct Queries

Author
Shaofeng Shi
Technical Partner & Principal Architect, Kyligence Engineering
Apr. 22, 2020

This article is the fourth in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 1 | Part 2 | Part 3


In our last article, we delved deeper into the development behind Apache Kylin’s approach to generating faster, accurate Count Distinct queries. By now, you may be wondering what else sets Kylin apart from other solutions that also deliver accurate Count Distinct queries.

This is a reasonable question, and the perfect topic for closing out our Count Distinct series. Keep reading to learn why Apache Kylin is the only OLAP engine for accurate, sub-second Count Distinct queries.


OLAP as the Optimal Approach to Count Distinct on Big Data

For extremely large datasets, Apache Kylin is one of the few OLAP engines that can deliver sub-second query latency. It achieves this through its precomputation approach, which splits the work into two phases: the offline construction of the OLAP cube, and the online querying of that cube.

Moreover, these two phases can run on two independent clusters without affecting each other. Building the cube may take a large chunk of time, but it drastically speeds up subsequent queries. For workloads with frequent queries, a single build that can serve a variety of needs is extremely useful. By contrast, an engine that lacks precomputation must start every query from the raw data.

Not only does this require a large amount of computing resources, but its performance, concurrency, and efficiency will hardly satisfy the demanding requirements of any modern business.
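The two-phase idea can be illustrated with a minimal sketch. This is not Kylin's implementation: the rows, the function names `build_cube` and `query_distinct_users`, and the use of plain Python sets in place of a real cube are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw fact rows of (day, user_id); not taken from the article.
ROWS = [
    ("2020-04-20", "alice"),
    ("2020-04-20", "bob"),
    ("2020-04-20", "alice"),
    ("2020-04-21", "bob"),
    ("2020-04-21", "carol"),
]


def build_cube(rows):
    """Offline phase: scan the raw rows once and keep one user set per day."""
    cube = defaultdict(set)
    for day, user in rows:
        cube[day].add(user)
    return dict(cube)


def query_distinct_users(cube, days):
    """Online phase: merge the precomputed per-day sets; raw rows are not read."""
    merged = set()
    for day in days:
        merged |= cube.get(day, set())
    return len(merged)


# Built once, then reused by every later query.
CUBE = build_cube(ROWS)
```

Under these assumptions, a query over any combination of days touches only the small precomputed structures, which is why the one-time build cost pays off when queries are frequent.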

Apache Kylin Architecture Diagram

After introducing Bitmap and Global Dictionary, Kylin was able to achieve sub-second latency in accurate Count Distinct queries. Looking across the entire field of big data, it can be said that this is the only truly universal solution (quoted from one user at a large internet company). But with so many big data analysis engines available, such as Spark SQL, Presto, ClickHouse, Phoenix, and Redshift, can none of them achieve the same results?

In reality, it is not impossible for other engines to achieve the same results; the difference lies in the limitations imposed by their architectures. Without any data preparation beforehand, fast and accurate Count Distinct queries require a large amount of computational resources.

For example, the data can be preloaded into RAM and the nodes connected over a 10G network, but this is precisely what many users can’t afford. Additionally, as data volume and concurrency grow, performance and stability degrade significantly. Users then see sudden slowdowns, and the system eventually becomes unusable.
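One reason accurate Count Distinct is expensive without preparation is that distinct counts are not additive: per-node counts cannot simply be summed, so the values themselves have to be brought together. The sketch below uses made-up data, and the names `naive_sum_of_partials` and `exact_distinct` are hypothetical.

```python
# Hypothetical user IDs seen on two nodes; "bob" appears on both.
node_a = ["alice", "bob", "alice"]
node_b = ["bob", "carol"]


def naive_sum_of_partials(*partitions):
    """Sum each partition's distinct count; overcounts values shared across nodes."""
    return sum(len(set(p)) for p in partitions)


def exact_distinct(*partitions):
    """Bring all values together before deduplicating, as an exact answer requires."""
    seen = set()
    for p in partitions:
        seen.update(p)
    return len(seen)
```

Here the naive sum gives 4 while the exact answer is 3. At query time, closing that gap means moving raw values between nodes, which is the work that precomputed bitmaps avoid.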


Lightning-Fast, Accurate Count Distinct – Made Possible by Apache Kylin

Not only does Kylin support estimated Count Distinct, but it also supports accurate Count Distinct, and users have the freedom to select their desired deduplicating algorithm depending on their situation. Compared to other available technologies, Kylin’s accurate distinct counting excels in the following areas:

  • Bitmaps are generated and compressed automatically during the offline build, so queries neither shuffle data nor lose values, which delivers low latency and 100% accuracy at the same time
  • Unique visitor (UV) values can be combined, satisfying the need for flexible queries
  • Queries use the standard SQL function, which keeps them seamlessly compatible with the systems already in place
  • Supports both int and string types
  • Simple to use, no programming required
  • Through Kylin’s UDAFs, bitmaps also support intersection calculations and analytic functions
  • Has run stably in production for several years at many large enterprises, including eBay, Meituan, Didi, Cisco, and Vivo
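The bitmap-related points in this list can be sketched in a few lines. This is an illustration under stated assumptions, not Kylin's code: `GlobalDictionary`, `to_bitmap`, and `cardinality` are hypothetical names, and an uncompressed Python integer stands in for the compressed bitmap that Kylin builds offline.

```python
class GlobalDictionary:
    """Hypothetical helper: maps each distinct string to a dense integer ID."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # A new value gets the next unused ID; a known value keeps its ID.
        return self._ids.setdefault(value, len(self._ids))


def to_bitmap(values, dictionary):
    """Set one bit per distinct value; duplicates set the same bit again."""
    bitmap = 0
    for value in values:
        bitmap |= 1 << dictionary.encode(value)
    return bitmap


def cardinality(bitmap):
    """The exact distinct count is the number of set bits."""
    return bin(bitmap).count("1")


dictionary = GlobalDictionary()
day1 = to_bitmap(["alice", "bob", "alice"], dictionary)
day2 = to_bitmap(["bob", "carol"], dictionary)

uv_both_days = cardinality(day1 | day2)  # union: UV rolled up across both days
retained = cardinality(day1 & day2)      # intersection: users seen on both days
```

Because the dictionary gives string values integer IDs, the same bitmap operations cover both int and string columns. Union and intersection are bitwise operations on already-deduplicated data, so combining UV values across dimensions stays exact.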

Apache Kylin’s precise Count Distinct feature is a major achievement for the developers in Kylin’s community who have overcome all of these complex issues through intelligence, determination, and their ceaseless research efforts.


Looking Forward: Better Analytics on Big Data

The need for an improved approach to Count Distinct queries was a direct result of today’s modern analytics landscape. Now, more than ever, analysts, as well as the IT and data engineers who support them, must find ways of working with larger and larger datasets to uncover new insights.

We have entered the era of big data, and continued pressure will be placed on businesses, and the technology those businesses employ, to embrace big data or fall behind. Apache Kylin and the open source community supporting it remain committed to overcoming the challenges of this new era and delivering solutions to all those who are trying to make a difference with their data.

If you’re curious about Apache Kylin, and how you can get involved with the community, you can learn more about Kylin. Also, if you’re interested in how Kylin compares to its enterprise-ready counterpart, Kyligence, check out this comparison page.


I would like to express my gratitude to Ye Rui Sun, Da Yue Gao, Kai Sen Kang, Yang Hong Zhong, and Guo Wei Jin. This series would not have been possible without your support and hard work.