This article is the second in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 1 | Part 3 | Part 4
In our last article, we examined the critical role Distinct Counting (also known as Count Distinct) plays when it comes to working with massive datasets. While Distinct Counting is invaluable for navigating today’s data-driven business landscape, it’s not without its weaknesses.
Distinct Counting with Big Data is a resource-intensive computational process, and performing it in a timely way is a tall order. Effectively utilizing Big Data to generate actionable insights is what sets market leaders ahead of their industry peers, but speed is a factor as well. No matter how much data you have, or how much of it you are able to use, it is meaningless if it takes too long to manage and analyze.
Fortunately, two approaches exist that make Distinct Counting quick when it comes to working with Big Data: HyperLogLog (HLL) and Bitmap.
As we discussed previously, HyperLogLog (HLL) and Bitmap are two popular algorithms for optimizing the calculations required when performing Distinct Counting on Big Data. We won’t rehash the argument for or against them here, but here’s a recap of the main points we’ve already covered:
Generally speaking, HLL is fast and compact, but it returns only an approximate count; Bitmap may take up a lot more space than HLL, but it does guarantee an exact result.
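To make the Bitmap side of that trade-off concrete, here is a minimal sketch that uses a plain Python integer as the bitmap. It assumes user IDs are small non-negative integers; the function names and the sample IDs are hypothetical, and this is not how Kylin implements Bitmap internally. HLL is not shown.

```python
def build_bitmap(user_ids):
    """Set bit i for every non-negative integer ID i; duplicates are no-ops."""
    bitmap = 0
    for user_id in user_ids:
        bitmap |= 1 << user_id
    return bitmap


def count_distinct(bitmap):
    """The exact distinct count is the number of set bits."""
    return bin(bitmap).count("1")


# Hypothetical visitor IDs for two days (illustrative data only).
monday = build_bitmap([1, 5, 5, 9])
tuesday = build_bitmap([5, 9, 12])

print(count_distinct(monday))            # distinct visitors on Monday
print(count_distinct(monday | tuesday))  # distinct visitors across both days
print((monday | tuesday).bit_length())   # bits used grows with the largest ID
```

Merging two bitmaps with a bitwise OR keeps the count exact, which is where the accuracy guarantee comes from. The cost is also visible: the number of bits tracks the largest ID rather than the number of distinct values.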
So, which is the correct approach? The truth is that it really depends. For most organizations, HLL has proven to be the approach of choice. As a result, its use has become ubiquitous in the field of Big Data.
This, of course, is a major reason why Apache Kylin has always supported HLL calculations. The rapid increase in dataset size, along with the storage constraints as well as the speed and processing requirements of modern businesses, left few other options. The cost of lower accuracy seemed to be worth it.
If someone had asked us about possible errors in our calculations, we would have answered that, among results numbering in the thousands of millions, users would not pay much attention to a 1% error.
However, we came to discover that many users often did not share this belief. In some situations, having even a slight error in the result was unacceptable.
For example, in traffic redirection or advertisement placement, costs are calculated by summing the number of channels or clicks from viewers. Even a slight mistake in these values is intolerable for both sides of the business. On one hand, the buyer is worried about paying too much, while on the other, the provider is worried about receiving too little.
As we discussed above, HLL is not 100% accurate. Its margin of error is within 1% about 99% of the time; the remaining 1% of the time, the margin of error is even larger. If the error does happen to be extremely large, it stands to reason that it would lead to extreme problems.
Additionally, if we must multiply or divide our UV (Unique Visitors) results, the error grows in magnitude. For instance, (user growth rate) = (today’s user count) / (yesterday’s user count); if the numerator is estimated slightly too high and the denominator slightly too low, the two errors compound, and you have no way of knowing how large the resulting error is. The absolute numbers are large as well: with 100,000,000 users, a 1% error means an error of 1,000,000 users.
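As a rough illustration of how the error compounds in a ratio, the sketch below computes the range of growth ratios that could be reported when each count is off by up to 1% in either direction. The counts and the function name are hypothetical, and the 1% bound is the typical case described above, not a guarantee.

```python
def growth_ratio_bounds(today, yesterday, rel_err):
    """Range of today/yesterday when each count may be off by +/- rel_err."""
    # Worst case low: numerator underestimated, denominator overestimated.
    low = today * (1 - rel_err) / (yesterday * (1 + rel_err))
    # Worst case high: numerator overestimated, denominator underestimated.
    high = today * (1 + rel_err) / (yesterday * (1 - rel_err))
    return low, high


# Hypothetical counts: a true day-over-day growth of 2%.
today, yesterday = 102_000_000, 100_000_000
true_ratio = today / yesterday
low, high = growth_ratio_bounds(today, yesterday, 0.01)

print(f"true growth ratio:       {true_ratio:.4f}")
print(f"possible reported range: {low:.4f} to {high:.4f}")
```

Under these assumptions, a real 2% increase could be reported as anything from a slight decline to roughly 4% growth, so the estimate alone cannot tell you whether traffic went up or down.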
For websites and apps with a steady flow of visitors, this error is more than enough to completely overshadow the actual effect of their operations, meaning the data cannot provide useful feedback or insight for the business.
So, if you do not want to receive a phone call from your boss or business partners in the middle of the night to check on the data, then it is in your best interest to figure out a solution for the accuracy of this calculation (so you can secure a good night’s sleep).
When it came to developing Apache Kylin, we very quickly realized that simply having a good estimate was not enough. Kylin needed to support accurate Distinct Counting. If it couldn’t, users would surely lose out on opportunities and significant insights during major calculations.
To address this challenge, the Kylin community came together to develop a new approach. In our next article, we’ll delve into the development process and explain how Kylin was able to find a way to deliver both speed and accuracy when using Distinct Count with Big Data.
Ⓒ 2023 Kyligence, Inc. All rights reserved.