This article is the first in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 2 | Part 3 | Part 4
For many organizations, Big Data presents a big opportunity in the form of new customer insights that can place them firmly ahead of the competition. Just having the data isn’t enough, however. Being able to dig into that data and effectively mine it for ideas is critical to any kind of analytics success.
There’s no shortage of ways to slice and dice data, but when it comes to Big Data, Distinct Counting is possibly one of the most important approaches. That said, Distinct Counting faces unique limitations when working with Big Data that can seriously hinder effective analytics. Fortunately, there are ways of avoiding these pitfalls to effectively employ Distinct Counting on any size dataset.
Distinct Counting (also referred to as Count Distinct) is a commonly used aggregate function in Big Data analysis. It returns the number of unique values in a column or array of data; in SQL, the function is count(distinct col). The difference between count(distinct col) and count(col) is the distinct keyword, which removes duplicate values before counting, hence the name “Distinct Count.”
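To make the difference concrete, here is a minimal sketch that runs both functions against an in-memory SQLite database from Python. The table name `visits`, the column name `user_id`, and the sample rows are assumptions made up for illustration; they do not come from any real system.

```python
import sqlite3

# Hypothetical table, column, and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER)")
conn.executemany(
    "INSERT INTO visits (user_id) VALUES (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,)],
)

# count(col) counts every non-NULL row;
# count(distinct col) counts each unique value once.
total_rows, distinct_users = conn.execute(
    "SELECT count(user_id), count(DISTINCT user_id) FROM visits"
).fetchone()

print(total_rows, distinct_users)  # 6 3
```

Six rows go in, but only three different IDs appear among them, so the two functions return different numbers for the same column.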
Distinct Counting has a variety of uses. A common one is traffic measurement for websites and apps. Here, PV and UV are the most commonly used metrics: PV (Page View) counts every visit, while UV (Unique Visitor) is the de-duplicated value, so each visitor is counted once no matter how often they return. For the owner of a website or app, PV represents how often the product is used and UV represents how many users it has, and both values are important. Looking at the two numbers together gives a more accurate picture of the users and of any changes in how frequently they visit.
Since Distinct Count operations involve comparing values against one another, the calculation is more complicated than a simple count such as PV. A single computer can handle it only for low volumes of data; as the amount of data increases, the time and resources required grow significantly, and processing the data on a single node becomes difficult.
At this point, we need to rely on a distributed framework like MapReduce or Spark for parallel processing to divide and make sense of Big Data.
Those who have used MapReduce should be very familiar with its WordCount example. The following figure explains how MapReduce counts the number of times each term occurs.
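As a rough illustration of the WordCount flow, the sketch below simulates the map, shuffle, and reduce phases in a single Python process. It is not Hadoop or Spark code; the function names and sample input are assumptions chosen for readability. In a real cluster, the shuffle step moves data across the network so that all pairs with the same key reach the same reducer.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group pairs by key, as if routing each key to one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big opportunity", "big data"]  # made-up sample input
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
distinct_words = len(word_counts)  # number of keys = distinct count

print(word_counts)     # {'big': 3, 'data': 2, 'opportunity': 1}
print(distinct_words)  # 3
```

The number of keys that survive the shuffle is the distinct count, which is why a Distinct Count query has to move the raw values between nodes before it can produce an answer.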
Think about it this way: suppose your website or app has 10,000,000 visitors and the visit log holds 100,000,000 records (assuming every visitor visits 10 times). If every user ID is stored as a 4-byte int, then one simple Distinct Count calculation has 100,000,000 * 4 bytes = 400 MB = 3,200 Mb of data to shuffle.
On a gigabit network, transport alone takes about 3 seconds. Add disk reading, sorting, serialization, and deserialization, and the total time comes to at least 10 seconds.
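The arithmetic behind these figures can be checked with a few lines of Python. The sketch assumes decimal units (1 MB = 1,000,000 bytes) and a gigabit link that is fully available to the shuffle, which is optimistic; the variable names are illustrative.

```python
records = 100_000_000            # visit records (10,000,000 visitors x 10 visits)
bytes_per_id = 4                 # each user ID stored as a 4-byte int

shuffle_bytes = records * bytes_per_id
shuffle_megabytes = shuffle_bytes / 1_000_000    # 400 MB
shuffle_megabits = shuffle_megabytes * 8         # 3,200 Mb

link_megabits_per_second = 1_000                 # gigabit network, assumed fully usable
transfer_seconds = shuffle_megabits / link_megabits_per_second

print(shuffle_megabytes, shuffle_megabits, transfer_seconds)  # 400.0 3200.0 3.2
```

The roughly 3 seconds of transfer time is a lower bound for the network step only; the remaining time in the 10-second estimate comes from disk reading, sorting, serialization, and deserialization.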
In real life scenarios the situation could be even more complicated:
Overall, Distinct Counting with Big Data is often a resource-intensive computation, and tuning it to finish with sub-second latency is extremely difficult. If such queries are run frequently, we need to optimize both the data structure and the calculation.
Many researchers have already realized there is room for optimization here and have developed a variety of algorithms and data structures in response. The two most popular are HyperLogLog (HLL) and Bitmap.
What the two algorithms have in common is that both use a very compact structure to store the complete set of distinct values. This structure not only returns the distinct count but can also be reused in follow-up calculations (for example, combining yesterday’s and today’s Distinct Counts). Compared with de-duplicating the original values every single time, both algorithms greatly improve the efficiency of storage and calculation.
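The Bitmap idea, including the follow-up calculation across two days, can be sketched with a plain Python integer used as a bit set. This is a simplified illustration that assumes user IDs are small non-negative integers; the function names and sample IDs are made up, and production systems typically rely on compressed bitmap formats instead.

```python
def to_bitmap(user_ids):
    # Set bit i for every user ID i; a duplicate just sets the same bit again.
    bitmap = 0
    for user_id in user_ids:
        bitmap |= 1 << user_id
    return bitmap

def distinct_count(bitmap):
    # The number of set bits equals the number of distinct IDs.
    return bin(bitmap).count("1")

yesterday = to_bitmap([1, 2, 2, 5])     # made-up visit logs
today = to_bitmap([2, 5, 7, 7, 9])

both_days = yesterday | today           # union: users seen on either day
returning = yesterday & today           # intersection: users seen on both days

print(distinct_count(yesterday))  # 3
print(distinct_count(today))      # 4
print(distinct_count(both_days))  # 5
print(distinct_count(returning))  # 2
```

Note that the two-day total (5) is not the sum of the daily counts (3 + 4). Distinct counts cannot simply be added, which is why keeping the set structure around, and not just the final number, makes follow-up calculations possible.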
However, these two algorithms have very obvious differences:
Both algorithms have their pros and cons. Overall, HLL is very space-efficient but returns only an approximate count; Bitmap may take up a lot more space than HLL, but it guarantees an exact result.
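To show where the approximation in HLL comes from, here is a minimal, simplified HyperLogLog sketch in Python. It is not the implementation used by any particular engine: the SHA-1 hash, the precision `p=12` (4,096 registers), and the class and method names are assumptions for illustration, and refinements found in production versions are omitted.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m    # fixed size, however many values are added

    def add(self, value):
        # Take a 64-bit hash of the value.
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Union of two sketches: keep the larger rank in each register.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant, valid for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)

hll = HyperLogLog()
for user_id in range(50_000):
    hll.add(user_id)
    hll.add(user_id)    # duplicates do not change the registers

print(hll.count())      # close to 50,000, but generally not exact
```

With 4,096 registers the expected relative error is around 1.6% (roughly 1.04 divided by the square root of the register count), and the memory used stays the same whether the sketch has seen fifty thousand values or fifty million. An exact Bitmap has no such error but needs space that grows with the range of IDs.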
So, how much does accuracy matter? When you’re talking about error rates of around 1%, it may not seem like a big deal – and it might not be. For many, however, Big Data has shifted the thinking around this.
There was a time when data was limited and ways of collecting it were few. This isn’t the case any longer. Now, with businesses investing substantially in new ways to collect and analyze all the data they can, even small error rates can significantly impact the result of a model or an analyst’s ability to confidently identify new opportunities when working with datasets in the hundreds of millions of rows or more.
In the next part of this series, we’ll delve deeper into why accuracy is important and into the challenges of using HLL with Big Data.