This article is the second in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 1 | Part 3 | Part 4
In our last article, we examined the critical role Distinct Counting (also known as Count Distinct) plays when it comes to working with massive datasets. While Distinct Counting is invaluable for navigating today’s data-driven business landscape, it’s not without its weaknesses.
Distinct Counting with Big Data is a resource-intensive computational process, and performing it in a timely way is a tall order. Effectively utilizing Big Data to generate actionable insights is what sets market leaders ahead of their industry peers, but speed is a factor as well. No matter how much data you have or how much of it you’re able to use, it is meaningless if managing and analyzing it takes too long.
Fortunately, two approaches exist that make Distinct Counting quick when it comes to working with Big Data: HyperLogLog (HLL) and Bitmap.
As we discussed previously, HyperLogLog (HLL) and Bitmap are two popular algorithms for optimizing the calculations required when performing Distinct Counting on Big Data. We won’t rehash the argument for or against them here, but here’s a recap of the main points we’ve already covered:
Generally speaking, HLL is fast and uses very little space, but it only produces an estimate; Bitmap may take up a lot more space than HLL, but it does guarantee an exact count.
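To make this trade-off concrete, here is a minimal Python sketch of both ideas. It is an illustration only, not Kylin's implementation: the function names `bitmap_distinct` and `hll_distinct`, the use of SHA-1 as the hash, and the register count (`p=12`, i.e. 4,096 registers) are all assumptions made for the example.

```python
import hashlib
import math


def bitmap_distinct(ids, max_id):
    """Exact distinct count of integer IDs in [0, max_id], one bit per possible ID."""
    bits = bytearray(max_id // 8 + 1)
    for i in ids:
        bits[i >> 3] |= 1 << (i & 7)           # setting the same bit twice changes nothing
    count = sum(bin(b).count("1") for b in bits)
    return count, len(bits)                     # (exact count, bytes used)


def hll_distinct(ids, p=12):
    """Approximate distinct count using 2**p one-byte HyperLogLog registers."""
    m = 1 << p
    registers = bytearray(m)
    for i in ids:
        digest = hashlib.sha1(str(i).encode()).digest()
        h = int.from_bytes(digest[:8], "big")   # treat the first 8 bytes as a 64-bit hash
        idx = h >> (64 - p)                     # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)        # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)            # bias-correction constant (valid for m >= 128)
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:           # small-range correction (linear counting)
        estimate = m * math.log(m / zeros)
    return estimate, len(registers)             # (estimate, bytes used)


# 500,000 distinct visitor IDs, each seen twice.
visits = list(range(500_000)) * 2
exact, bitmap_bytes = bitmap_distinct(visits, max_id=499_999)
approx, hll_bytes = hll_distinct(visits)
print("bitmap:", exact, "users in", bitmap_bytes, "bytes")
print("HLL:   ", round(approx), "users in", hll_bytes, "bytes")
```

The bitmap grows with the range of IDs it has to cover, while the HLL sketch stays at a fixed size no matter how many users pass through it. The price is that the HLL result is only an estimate: its typical relative error is roughly 1.04 / sqrt(m), or about 1.6% for the 4,096 registers assumed here, and it shrinks only as more registers are added.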
So, which is the correct approach? The truth is that it really depends. For most organizations, HLL has proven to be the approach of choice. As a result, its use has become ubiquitous in the field of Big Data.
This, of course, is a major reason why Apache Kylin has always supported HLL calculations. The rapid increase in dataset size, along with the storage constraints as well as the speed and processing requirements of modern businesses, left few other options. The cost of lower accuracy seemed to be worth it.
And what if someone asked us about possible errors in our calculations? We believed that, with results running into the thousands of millions, users would not pay much attention to a 1% error.
However, we came to discover that many users did not share this belief. In some situations, even a slight error in the result was unacceptable.
For example, in traffic redirection or advertisement placement, costs are calculated by summing the number of channels or clicks from viewers. Even a slight mistake in these values is intolerable for both sides of the business. On one hand, the buyer is worried about paying too much, while on the other, the provider is worried about receiving too little.
As we discussed above, HLL is not 100% accurate. 99% of the time its margin of error is within 1%, with the remaining 1% of the time resulting in even larger margins of error. If the error does happen to be extremely large, it stands to reason that it would lead to extreme problems.
Additionally, if we must do multiplication or division with our UV (Unique Visitors) results, then this error will increase in magnitude. For instance, (user growth rate) = (today’s user count) / (yesterday’s user count); if the numerator is estimated slightly too high and the denominator slightly too low, the two errors compound, which could ultimately result in a huge mistake, and you won’t be able to determine how large this error is. The absolute numbers are large as well: if we have 100,000,000 users, then a 1% error means an error of 1,000,000 users.
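A short worked example shows how the error compounds in such a ratio. This is a sketch under stated assumptions: the helper `ratio_bounds` and the visitor counts are hypothetical, and each day's count is assumed to be off by at most 1% in either direction.

```python
def ratio_bounds(today, yesterday, rel_err=0.01):
    """Lowest and highest today/yesterday ratio when each count may be off by +/- rel_err."""
    low = today * (1 - rel_err) / (yesterday * (1 + rel_err))
    high = today * (1 + rel_err) / (yesterday * (1 - rel_err))
    return low, high


# Hypothetical counts: 100,000,000 users yesterday, 105,000,000 today (true growth of 5%).
true_ratio = 105_000_000 / 100_000_000
low, high = ratio_bounds(105_000_000, 100_000_000)

print(f"true ratio:    {true_ratio:.4f}")
print(f"reported from: {low:.4f} to {high:.4f}")
```

With these numbers, a true growth of 5% could be reported as anything from roughly 2.9% to 7.1%. A 1% error on each count has become an uncertainty of about two percentage points on the ratio, which is a large share of the growth figure itself.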
For websites and apps with a steady flow of visitors, this error is more than enough to completely overshadow the actual effects of their operations, meaning the data is not able to provide useful feedback or insight for the business.
So, if you do not want to receive a phone call from your boss or business partners in the middle of the night to check on the data, then it is in your best interest to figure out a solution for the accuracy of this calculation (so you can secure a good night’s sleep).
When it came to developing Apache Kylin, we very quickly realized that simply having a good estimate was not enough. Kylin needed to support accurate Distinct Counting. If it couldn’t, users would surely lose out on opportunities and significant insights during major calculations.
To address this challenge, the Kylin community came together to develop a new approach. In our next article, we’ll delve into the development process and explain how Kylin was able to find a way to deliver both speed and accuracy when using Distinct Count with Big Data.