Making Distinct Counting Work for Big Data

Shaofeng Shi
Technical Partner & Principle Architect, Kyligence Engineering
Feb. 19, 2020

This article is the first in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 2 | Part 3 | Part 4

For many organizations, Big Data presents a big opportunity in the form of new customer insights that can place them firmly ahead of the competition. Just having the data isn’t enough, however, being able to dig into that data and effectively mine it for ideas is critical for any kind of analytics success.

There’s no shortage of ways to slice and dice data, but when it comes to Big Data, Distinct Counting is possibly one of the most important approaches. That said, Distinct Counting faces unique limitations when working with Big Data that can seriously hinder effective analytics. Fortunately, there are ways of avoiding these pitfalls to effectively employ Distinct Counting on any size dataset.

What Is Distinct Counting?

Distinct Counting (also referred to as Count Distinct) is a commonly used analyzing function for Big Data analysis. It refers to the number of unique values in a column or array of data – in SQL the function is count(distinct col). The difference between the function count(distinct col) and count(col) is the distinct descriptor. Its role is to remove the duplicate values, therefore earning its name “Distinct Count.”

Distinct Counting has a variety of uses. A common use case is with websites and apps that are counting values. Here, PV/UV is the most commonly used index, where UV (Unique Visitor) is the de-duplicated value, causing each unique visitor to be counted as one. For the owner of a website or app, PV (Page View) represents the frequency or time of uses, UV represents the number of users, and both values are important. Combining these two numbers, we can more accurately understand the users and any changes in the frequency of PV/UV.

Challenges Running Distinct Count Calculations with Big Data

Since Distinct Count operations involve the comparison of multiple values, calculation is a bit more complicated than the simple PV example we used above. A solo computer can barely perform these calculations on low volumes of data, and as the amount of data increases, the time and resources required grow significantly and using a single node to process the data becomes difficult.

At this point, we need to rely on a distributed framework like MapReduce or Spark for parallel processing to divide and make sense of Big Data.

Those who have used MapReduce should be very familiar with its WordCount example. The following figure explains how MapReduce counts the amount of duplicating terms.

Think about it this way: If the number of visitors to your website or app gets too large, say 10,000,000 visitors, but the visiting record notes 100,000,000 (assuming every viewer visits 10 times), and if every user’s ID is already shown by using int, then one simple Distinct Count calculation is 100,000,000 * 4 bytes = 400 MB = 3,200 Mb of data to be shuffled.

Assuming we are using the gigabit network to compute, with a 3 second delay for transport in addition to disk reading, sorting, serialization, and deserialization, we will end up with a total time of at least 10 seconds.

In real life scenarios the situation could be even more complicated:

• User ID could be an email, passport number, or phone number consisting of many characters and taking up a large space.
• You may need to filter out some information prior to Distinct Count, taking up larger computing resources - for instance: researching a certain area’s UV in the past 2 days.
• Excessive viewer count (when behavioral logs often note over 100,000,000,000+ counts).
• Internet or Disk IO could be occupied, so the query performance will be inconsistent with the original posting.

Overall, Distinct Counting with Big Data is often a resource-intensive computational process and perfecting this process to finish within one second latency is extremely difficult. If such a query becomes more popular, then we will definitely need to optimize the data structure and its calculation.

Calculations for Distinct Counting with Big Data

Many researchers have already realized there is room for optimization here and have developed a variety of formulas and data structures in response. The most popular two being HyperLogLog and Bitmap.

The similarity of the two algorithms is that both of them use extremely refined structures to store a set of distinct values (or complete set). Not only will this return the distinct value, but this structure can also perform follow up calculations (for example yesterday’s and today’s Distinct Count). Compared to de-duplicating at the origin value every single time, the efficiency of storage and calculations are greatly improved in both of these algorithms.

However, these two algorithms have very obvious differences:

1. HyperLogLog (HLL) has a low complexity level in terms of storage space (log(log(n)), giving it the name HLL), where it changes regardless of the dataset. Depending on the level of accuracy, one HLL takes up between 1KB to 64KB of space. On the other hand, since Bitmap uses one bit to represent each ID, as the dataset size increases, the space required will also increase. Storing 100,000,000 values with raw Bitmap will require around 12MB of space. As we can see here, Bitmap requires a lot more storage space than HLL by an order of one or two.
2. HLL supports multiple types of data entries as input making it very easy to use; Bitmap only supports int/long types of values as input. Therefore, if the original value is a string, then the user needs to map it into int/long before we can use Bitmap.
3. The reason why HLL supports multiple data types is because it uses Hash Functions. This function maps inputted values into binary bytes, then the binary byte undergoes bucketing until the first 1 appears in the final position, ultimately estimating how many multiple values are within this bucket. Since HLL uses Hash Functions and probability estimations, HLL calculated results are destined to be inaccurate. Even though HLL has multiple correction algorithms to reduce the margin of error, the fact of its inaccuracy remains. As accurate as it can be, its theoretical margin of error is still over 1%.
4. Bitmap faithfully uses a bit, (1) or (0), for every ID. So, as long as it can guarantee the fact that every unique user has a unique ID value, Bitmap’s results are usually very accurate.

Both of these calculations have their pros and cons: overall, HLL is very good but it lacks accuracy; Bitmap may take up a lot more space than HLL, but it does guarantee accuracy.

Understanding the Importance of Accuracy in Distinct Counting

So, how much does accuracy matter? When you’re talking about error rates of around 1%, it may not seem like a big deal – and it might not be. For many, however, Big Data has shifted the thinking around this.

There was a time when data was limited and ways of collecting it were few, this isn’t the case any longer. Now, with businesses investing substantially in new ways to collect and analyze all the data they can, even small error rates can significantly impact the result of a model or an analyst’s ability to confidently identify new opportunities when working with datasets in the hundreds of millions of rows or more.

In the next part of this series, we’ll delve deeper into why accuracy is important, and the challenges faced by using HLL with Big Data.