This article is the fourth in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 1 | Part 2 | Part 3
In our last article, we delved deeper into the development behind Apache Kylin’s approach to generating faster, accurate Count Distinct queries. By now, you may be wondering what else sets Kylin apart from other solutions that also deliver accurate Count Distinct queries.
This is a reasonable question, and the perfect topic for closing out our Count Distinct series. Keep reading to learn why Apache Kylin is the only OLAP engine for accurate, sub-second Count Distinct queries.
Apache Kylin is one of the few OLAP engines that can deliver sub-second query latency on extremely large datasets. It achieves this through its precomputation approach, which splits the work into two phases: offline construction of the OLAP cube, and online queries against that cube.
Moreover, these two phases can run on two independent clusters without affecting each other. Building the cube may take a large chunk of time, but it drastically speeds up subsequent queries. For workloads with frequent queries, a single build that can satisfy a variety of needs is extremely useful. An engine that lacks precomputation, on the other hand, must start from the raw values for every query.
Not only does this require a large amount of computing resources, but its performance, concurrency, and efficiency will hardly satisfy the demanding requirements of any modern business.
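The two-phase idea can be illustrated with a minimal Python sketch. This is not Kylin's implementation: the row data, the `build_cube` and `query_distinct_users` names, and the use of plain Python sets are all assumptions made for illustration. The point is only that the expensive scan of raw rows happens once, offline, and that queries afterwards merge small precomputed results.

```python
from collections import defaultdict

# Hypothetical raw fact rows: (date, user_id). Illustrative data only.
ROWS = [
    ("2023-01-01", "alice"),
    ("2023-01-01", "bob"),
    ("2023-01-01", "alice"),
    ("2023-01-02", "bob"),
    ("2023-01-02", "carol"),
]


def build_cube(rows):
    """Offline phase: scan the raw rows once and keep one user set per date."""
    cube = defaultdict(set)
    for date, user in rows:
        cube[date].add(user)
    return dict(cube)


def query_distinct_users(cube, dates):
    """Online phase: merge the precomputed sets; raw rows are never re-read."""
    merged = set()
    for date in dates:
        merged |= cube.get(date, set())
    return len(merged)


cube = build_cube(ROWS)
one_day = query_distinct_users(cube, ["2023-01-01"])
two_days = query_distinct_users(cube, ["2023-01-01", "2023-01-02"])
print(one_day, two_days)
```

Because the per-date results keep the full set of users rather than just a number, they can be merged across dates without double counting users who appear on more than one day.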
After introducing Bitmap and Global Dictionary, Kylin was able to achieve sub-second latency for accurate Count Distinct queries. Looking across the entire field of big data, it can be said that this is the only true universal solution (quoted from one user at a large internet company). But with so many big data analysis engines available, such as Spark SQL, Presto, ClickHouse, Phoenix, and Redshift, can none of these achieve the same results?
In reality, it is not impossible for other engines to achieve the same results; the difference lies in the limitations set by their frameworks. Without any data preparation beforehand, fast and accurate Count Distinct queries require a large amount of computational resources.
For example, data can be warmed up in RAM and exchanged between nodes over a 10G network connection, but this is precisely the kind of setup many users can’t afford. Additionally, as data volume and concurrency grow, performance and stability decrease significantly, causing sudden drops in responsiveness and, eventually, a loss of usability.
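By contrast, the Bitmap and Global Dictionary combination mentioned earlier does its heavy lifting ahead of time. The sketch below is a rough illustration under stated assumptions, not Kylin's actual code: the `GlobalDictionary` class and the helper functions are hypothetical names, and a plain Python integer stands in for a real bitmap structure. The dictionary maps each distinct value to a small integer ID, each precomputed result sets one bit per ID, and an accurate distinct count is the number of set bits after OR-ing the bitmaps together.

```python
class GlobalDictionary:
    """Hypothetical dictionary assigning each distinct value a dense integer ID."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # The same value always receives the same ID, regardless of where it appears.
        if value not in self._ids:
            self._ids[value] = len(self._ids)
        return self._ids[value]


def build_bitmap(values, dictionary):
    """Set one bit per encoded value; duplicate values set the same bit."""
    bitmap = 0
    for value in values:
        bitmap |= 1 << dictionary.encode(value)
    return bitmap


def count_distinct(*bitmaps):
    """Merge bitmaps with OR, then count the set bits for an exact result."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return bin(merged).count("1")


dictionary = GlobalDictionary()
day1 = build_bitmap(["alice", "bob", "alice"], dictionary)
day2 = build_bitmap(["bob", "carol"], dictionary)
print(count_distinct(day1), count_distinct(day1, day2))
```

The dictionary has to be shared across all bitmaps. If each day assigned IDs independently, the same bit could mean different users on different days, and the merged count would no longer be accurate.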
Not only does Kylin support estimated Count Distinct, but it also supports accurate Count Distinct, and users have the freedom to select their desired deduplicating algorithm depending on their situation. Compared to other available technologies, Kylin’s accurate distinct counting excels in the following areas:
Apache Kylin’s precise Count Distinct feature is a major achievement for the developers in Kylin’s community who have overcome all of these complex issues through intelligence, determination, and their ceaseless research efforts.
The need for an improved approach to Count Distinct queries was
a direct result of today’s modern analytics landscape. Now, more than ever,
analysts, as well as the IT and data engineers who support them, must find ways
of working with larger and larger datasets to uncover new insights.
We have entered the era of big data, and continued pressure will be placed on businesses, and the technology those businesses employ, to embrace big data or fall behind. Apache Kylin and the open source community supporting it remain committed to overcoming the challenges of this new era and delivering solutions to all those who are trying to make a difference with their data.
If you’re curious about Apache Kylin, and how you can get involved with the community, you can learn more about Kylin. Also, if you’re interested in how Kylin compares to its enterprise-ready counterpart, Kyligence, check out this comparison page.
I would like to express my gratitude to Ye Rui Sun, Da Yue Gao, Kai Sen Kang, Yang Hong Zhong, and Guo Wei Jin. This series would not have been possible without your support and hard work.