This article is the fourth in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 1 | Part 2 | Part 3
In our last article, we delved deeper into the development behind Apache Kylin’s approach to generating faster, accurate Count Distinct queries. By now, you may be wondering what else sets Kylin apart from other solutions that also deliver accurate Count Distinct queries.
This is a reasonable question, and the perfect topic for closing out our Count Distinct series. Keep reading to learn why Apache Kylin is the only OLAP engine for accurate, sub-second Count Distinct queries.
For extremely large datasets, Apache Kylin is one of the few OLAP engines that can deliver sub-second analysis. It achieves this through its precomputation approach, which splits the work into two phases: offline construction of the OLAP cube, and online queries against that cube.
Moreover, these two phases can run on two independent clusters without affecting each other. Building the cube may take a large chunk of time, but it drastically speeds up subsequent queries. For workloads that involve frequent queries, a single build that can satisfy a variety of needs is extremely useful. By contrast, an engine that lacks precomputation must always start computing from the original data.
Not only does this require a large amount of computing resources, but its performance, concurrency, and efficiency will hardly satisfy the demanding requirements of any modern business.
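The two-phase idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kylin's actual storage format or build process: the rows, the `build_cube`, `query_cube`, and `query_scan` names, and the two-dimension layout are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (date, city, user_id). Illustrative data only.
ROWS = [
    ("2024-01-01", "Beijing", "u1"),
    ("2024-01-01", "Beijing", "u2"),
    ("2024-01-01", "Shanghai", "u1"),
    ("2024-01-02", "Beijing", "u1"),
    ("2024-01-02", "Shanghai", "u3"),
]


def build_cube(rows):
    """Offline phase: collect the distinct users for every dimension combination."""
    cube = defaultdict(set)
    for date, city, user in rows:
        # None means "this dimension is not filtered".
        for key in ((date, city), (date, None), (None, city), (None, None)):
            cube[key].add(user)
    return cube


def query_cube(cube, date=None, city=None):
    """Online phase: a single lookup, with no scan over the raw rows."""
    return len(cube.get((date, city), ()))


def query_scan(rows, date=None, city=None):
    """No precomputation: every query rescans the original rows."""
    return len({
        user for d, c, user in rows
        if (date is None or d == date) and (city is None or c == city)
    })


cube = build_cube(ROWS)                      # paid once, offline
print(query_cube(cube, date="2024-01-01"))   # answered from the cube
print(query_scan(ROWS, date="2024-01-01"))   # same answer, full scan each time
```

The build step touches every row once, while each later query is a dictionary lookup. The scan version repeats the full pass over the rows for every query, which is the cost the paragraph above attributes to engines without precomputation.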
After introducing Bitmap and Global Dictionary, Kylin was able to achieve sub-second latency in accurate Count Distinct queries. Looking across the entire field of big data, it can be said that this is the only true universal solution (quoted from one user at a large internet company). But with so many big data analysis engines available, such as Spark SQL, Presto, ClickHouse, Phoenix, Redshift, and so on, can none of these achieve the same results?
In reality, it is not impossible for other engines to achieve the same results; the difference lies in the limitations set by their frameworks. Without any data preparation beforehand, achieving fast and accurate Count Distinct queries requires a large amount of computational resources.
For example, data can be warmed up in RAM and moved between nodes over a 10G network connection, but this is precisely the kind of setup many users can't afford. Additionally, as data volume and concurrency grow, performance and stability decrease significantly, causing sudden drops that make the system unusable.
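To make the Bitmap and Global Dictionary combination mentioned above more concrete, here is a minimal sketch. It assumes a plain Python integer as a stand-in for a compressed bitmap, and the `GlobalDictionary`, `to_bitmap`, and `cardinality` names are hypothetical; it is not Kylin's implementation.

```python
class GlobalDictionary:
    """Assigns every distinct string a stable, dense integer ID."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # Keep the existing ID, or assign the next unused one.
        return self._ids.setdefault(value, len(self._ids))


def to_bitmap(values, dictionary):
    """Set one bit per distinct value; duplicates set the same bit again."""
    bitmap = 0
    for value in values:
        bitmap |= 1 << dictionary.encode(value)
    return bitmap


def cardinality(bitmap):
    """The exact distinct count is the number of set bits."""
    return bin(bitmap).count("1")


dictionary = GlobalDictionary()
day1 = to_bitmap(["u1", "u2", "u3"], dictionary)
day2 = to_bitmap(["u2", "u3", "u4", "u2"], dictionary)

# Per-day counts cannot simply be added, because u2 and u3 appear on both days.
print(cardinality(day1) + cardinality(day2))
# OR-ing the bitmaps merges the days and still yields an exact distinct count.
print(cardinality(day1 | day2))
```

Because the dictionary is shared across both days, the same user always maps to the same bit, so bitmaps built separately can be merged with a bitwise OR and still give an exact answer.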
Not only does Kylin support estimated Count Distinct, but it also supports accurate Count Distinct, and users have the freedom to select their desired deduplicating algorithm depending on their situation. Compared to other available technologies, Kylin’s accurate distinct counting excels in the following areas:
Apache Kylin’s precise Count Distinct feature is a major achievement for the developers in Kylin’s community who have overcome all of these complex issues through intelligence, determination, and their ceaseless research efforts.
The need for an improved approach to Count Distinct queries was a direct result of today’s modern analytics landscape. Now, more than ever, analysts, as well as the IT and data engineers who support them, must find ways of working with larger and larger datasets to uncover new insights.
We have entered the era of big data, and continued pressure will be placed on businesses, and the technology those businesses employ, to embrace big data or fall behind. Apache Kylin and the open source community supporting it remain committed to overcoming the challenges of this new era and delivering solutions to all those who are trying to make a difference with their data.
If you’re curious about Apache Kylin, and how you can get involved with the community, you can learn more about Kylin. Also, if you’re interested in how Kylin compares to its enterprise-ready counterpart, Kyligence, check out this comparison page.
I would like to express my gratitude to Ye Rui Sun, Da Yue Gao, Kai Sen Kang, Yang Hong Zhong, and Guo Wei Jin. This series would not have been possible without your support and hard work.