This article is the fourth in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 1 | Part 2 | Part 3
In our last article, we delved deeper into the development behind Apache Kylin’s approach to generating faster, accurate Count Distinct queries. By now, you may be wondering what else sets Kylin apart from other solutions that also deliver accurate Count Distinct queries.
This is a reasonable question, and the perfect topic for closing out our Count Distinct series. Keep reading to learn why Apache Kylin is the only OLAP engine for accurate, sub-second Count Distinct queries.
For extremely large datasets, Apache Kylin is one of the few OLAP engines that can deliver sub-second query latency. It does so through precomputation, which splits the work into two phases: offline construction of the OLAP cube, and online queries against that cube.
Moreover, these two phases can run on two independent clusters without affecting each other. Building the cube may take a long time, but it drastically speeds up subsequent queries. For workloads that query the same data frequently, a single build that can serve many different queries is extremely useful. An engine that lacks precomputation, on the other hand, must start from the raw data for every query. Not only does this require a large amount of computing resources, but its performance, concurrency, and efficiency will struggle to meet the demands of a modern business.
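To make the two phases concrete, here is a minimal sketch in Python. It is not Kylin's implementation: the rows, the per-day grouping, and the function names (build_cube, query_cube, query_raw) are all hypothetical, chosen only to contrast a query served from a precomputed structure with one that rescans the raw data.

```python
from collections import defaultdict

# Hypothetical raw fact rows: (day, user_id). Not taken from any real dataset.
ROWS = [
    ("2023-01-01", "alice"),
    ("2023-01-01", "bob"),
    ("2023-01-01", "alice"),
    ("2023-01-02", "bob"),
    ("2023-01-02", "carol"),
]


def build_cube(rows):
    """Offline phase: scan the raw rows once, keeping one set of users per day."""
    cube = defaultdict(set)
    for day, user in rows:
        cube[day].add(user)
    return dict(cube)


def query_cube(cube, days):
    """Online phase: merge the precomputed sets; the raw rows are never read."""
    users = set()
    for day in days:
        users |= cube.get(day, set())
    return len(users)


def query_raw(rows, days):
    """No precomputation: every query rescans all of the raw rows."""
    wanted = set(days)
    return len({user for day, user in rows if day in wanted})


cube = build_cube(ROWS)
print(query_cube(cube, ["2023-01-01"]))                # 2
print(query_cube(cube, ["2023-01-01", "2023-01-02"]))  # 3
```

Both query functions return the same answers. The difference is cost: query_cube touches one small aggregate per day, while query_raw pays for a full scan on every call.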
After introducing Bitmap and Global Dictionary, Kylin was able to achieve sub-second latency for accurate Count Distinct queries. Across the entire field of big data, this may be the only truly universal solution (as one user at a large internet company put it). But with so many big data analysis engines available, such as Spark SQL, Presto, ClickHouse, Phoenix, and Redshift, can none of them achieve the same results?
In reality, it is not impossible for other engines to achieve the same results; the difference lies in the limitations set by their frameworks. Without any data preparation beforehand, fast and accurate Count Distinct queries require a large amount of computational resources.
For example, data can be preloaded into RAM and exchanged between nodes over a 10G network connection, but this is precisely the kind of setup many users can't afford. Additionally, as data volume and concurrency grow, performance and stability decrease significantly, and users experience sudden slowdowns that make the system unusable.
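The Bitmap and Global Dictionary combination mentioned above can be sketched as follows. This is an illustration under stated assumptions, not Kylin's code: the GlobalDictionary class and the helper functions are hypothetical, and a plain Python integer stands in for the compressed bitmap structure a production engine would use. The point is that once every value has one integer ID shared across all segments, per-segment bitmaps can be merged with a bitwise OR and still give an exact distinct count.

```python
class GlobalDictionary:
    """Assigns each distinct value one dense integer ID, shared by all segments."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # Reuse the existing ID, or hand out the next unused one.
        return self._ids.setdefault(value, len(self._ids))


def build_bitmap(values, dictionary):
    """Set bit i for every value whose dictionary ID is i."""
    bitmap = 0
    for value in values:
        bitmap |= 1 << dictionary.encode(value)
    return bitmap


def count_distinct(*bitmaps):
    """OR the bitmaps together and count the set bits."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return bin(merged).count("1")


# Two hypothetical segments, e.g. two days of user IDs.
dictionary = GlobalDictionary()
segment_1 = build_bitmap(["alice", "bob", "alice"], dictionary)
segment_2 = build_bitmap(["bob", "carol"], dictionary)

print(count_distinct(segment_1))             # 2
print(count_distinct(segment_2))             # 2
print(count_distinct(segment_1, segment_2))  # 3, not 2 + 2
```

Adding the per-segment counts would give 4, because "bob" appears in both segments. Merging the bitmaps first counts him once, which is why the dictionary has to be global: if each segment assigned its own IDs, the same bit position could mean different users in different segments.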
Kylin supports both estimated and accurate Count Distinct, and users are free to select the deduplication algorithm that suits their situation. Compared to other available technologies, Kylin’s accurate distinct counting excels in the following areas:
Apache Kylin’s precise Count Distinct feature is a major achievement for the developers in Kylin’s community who have overcome all of these complex issues through intelligence, determination, and their ceaseless research efforts.
The need for an improved approach to Count Distinct queries is a direct result of today’s analytics landscape. Now, more than ever, analysts, as well as the IT and data engineers who support them, must find ways of working with larger and larger datasets to uncover new insights.
We have entered the era of big data, and continued pressure will be placed on businesses, and the technology those businesses employ, to embrace big data or fall behind. Apache Kylin and the open source community supporting it remain committed to overcoming the challenges of this new era and delivering solutions to all those who are trying to make a difference with their data.
If you’re curious about Apache Kylin, and how you can get involved with the community, you can learn more about Kylin. Also, if you’re interested in how Kylin compares to its enterprise-ready counterpart, Kyligence, check out this comparison page.
I would like to express my gratitude to Ye Rui Sun, Da Yue Gao, Kai Sen Kang, Yang Hong Zhong, and Guo Wei Jin. This series would not have been possible without your support and hard work.
Ⓒ 2023 Kyligence, Inc. All rights reserved.