Why Accuracy Is So Important for Distinct Counting

Author
Shaofeng Shi
Technical Partner & Principle Architect, Kyligence Engineering
Mar. 11, 2020

This article is the second in a four-part series that looks at Distinct Counting, how it works with Big Data, and how to perform it quickly on even the largest datasets. Read part one here.

In our last article, we examined the critical role Distinct Counting (also known as Count Distinct) plays when it comes to working with massive datasets. While Distinct Counting is invaluable for navigating today’s data-driven business landscape, it’s not without its weaknesses.

Distinct Counting with Big Data is a resource-intensive computational process and performing it in a timely way is a tall order. Effectively utilizing Big Data to generate actionable insights is what sets market leaders ahead of their industry peers, but speed is a factor as well. It doesn’t matter how much data you have, or how much you’re able to use, it’s meaningless if it takes too long to manage and analyze it.

Fortunately, two approaches exist that make Distinct Counting quick when it comes to working with Big Data: HyperLogLog (HLL) and Bitmap.

HyperLogLog (HLL) vs. Bitmap

As we discussed previously, HyperLogLog (HLL) and Bitmap are two popular algorithms for optimizing the calculations required when performing Distinct Counting on Big Data. We won’t rehash the argument for or against them here, but here’s a recap of the main points we’ve already covered:

  1. HLL has a low complexity level in terms of storage space. Depending on the level of accuracy, one HLL takes up between 1KB to 64KB of space. On the other hand, since Bitmap uses one bit to represent each ID, as the dataset size increases, the space required will also increase. Bitmap requires a lot more storage space than HLL by an order of one or two.
  2. HLL supports multiple types of data entries as inputs making it very easy to use; Bitmap only supports int/long types of values as input, therefore, if the original value is a string, then the user needs to map it into int/long before we can use Bitmap.
  3. HLL supports multiple data types because it uses Hash Functions, but because it uses Hash Functions and probability estimations, there’s no guarantee of accuracy. Its theoretical margin of error is still over 1%.
  4. Bitmap faithfully uses a bit, (1) or (0), for every ID. So, as long as it can guarantee the fact that every unique user has a unique ID value, Bitmap’s results are usually very accurate.

Generally speaking, HLL is very good but it lacks accuracy; Bitmap may take up a lot more space than HLL, but it does guarantee accuracy.

Speed vs. Accuracy with Distinct Counting

So, which is the correct approach? The truth is that it really depends. For most organizations, HLL has proven to be the approach of choice. As a result, its use has become ubiquitous in the field of Big Data.

This, of course, is a major reason why Apache Kylin has always supported HLL calculations. The rapid increase in dataset size, along with the storage constraints as well as the speed and processing requirements of modern businesses, left few other options. The cost of lower accuracy seemed to be worth it.

Apache Kylin Architecture
Apache Kylin Overview

If someone had asked us about the existence of possible errors in our calculation? We believed that within the thousands of millions of results, users would not pay as much attention to that 1% error.

However, we came to discover that many users often did not share this belief. In some situations, having even a slight error in the result was unacceptable.

For example, in traffic redirection or advertisement placement, costs are calculated through the summation of the number of channels or clicks from viewers. Having a slight mistake in the values is intolerable for both sides of this business. On one hand, the buyer is worried about paying too much, while on the other, the provider is worried about receiving too little.

As we discussed above, HLL is not 100% accurate. 99% of the time its margin of error is within 1%, with the remaining 1% of the time resulting in even larger margins of error. If the error does happen to be extremely large, it stands to reason that it would lead to extreme problems.

Additionally, if we must do multiplication or division with our UV (Unique Visitors) results, then this error will increase in magnitude. For instance, (users increase rate) = (today’s user rate) / (yesterday’s user rate); if the numerator is slightly larger and the denominator is slightly smaller, then this could ultimately result in a huge mistake and you won’t be able to determine how large this error is. If we have 100,000,000 counts of users, then a 1% error means an error of 1,000,000 users.

For websites and apps with a constant flowrate of visitors, this error is more than enough to completely overshadow their actual operational effects, meaning the data is not able to provide useful feedback or insight for the business.

So, if you do not want to receive a phone call from your boss or business partners in the middle of the night to check on the data, then it is in your best interest to figure out a solution for the accuracy of this calculation (so you can secure a good night’s sleep).

How to Ensure Speed and Accuracy for Analytics on Big Data

When it came to developing Apache Kylin, we very quickly realized that simply having a good estimate was not enough. Kylin needed to support accurate Distinct Counting. If it couldn’t, users would surely lose out on opportunities and significant insights during major calculations.

To address this challenge, the Kylin community came together to develop a new approach. In our next article, we’ll delve into the development process and explain how Kylin was able to find a way to deliver both speed and accuracy when using Distinct Count with Big Data.