Excel Your KPIs with AI Copilot Start for free today
Your AI Copilot for Data
Kyligence Zen Kyligence Zen
Kyligence Enterprise Kyligence Enterprise
Metrics Platform
OLAP Platform
Customers
Resources
Apache Kylin
About
Partners
This article is the first in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 2 | Part 3 | Part 4
For many organizations, Big Data presents a big opportunity in the form of new customer insights that can place them firmly ahead of the competition. Just having the data isn’t enough, however, being able to dig into that data and effectively mine it for ideas is critical for any kind of analytics success.
There’s no shortage of ways to slice and dice data, but when it comes to Big Data, Distinct Counting is possibly one of the most important approaches. That said, Distinct Counting faces unique limitations when working with Big Data that can seriously hinder effective analytics. Fortunately, there are ways of avoiding these pitfalls to effectively employ Distinct Counting on any size dataset.
Distinct Counting (also referred to as Count Distinct) is a commonly used analyzing function for Big Data analysis. It refers to the number of unique values in a column or array of data – in SQL the function is count(distinct col). The difference between the function count(distinct col) and count(col) is the distinct descriptor. Its role is to remove the duplicate values, therefore earning its name “Distinct Count.”
Distinct Counting has a variety of uses. A common use case is with websites and apps that are counting values. Here, PV/UV is the most commonly used index, where UV (Unique Visitor) is the de-duplicated value, causing each unique visitor to be counted as one. For the owner of a website or app, PV (Page View) represents the frequency or time of uses, UV represents the number of users, and both values are important. Combining these two numbers, we can more accurately understand the users and any changes in the frequency of PV/UV.
Since Distinct Count operations involve the comparison of multiple values, calculation is a bit more complicated than the simple PV example we used above. A solo computer can barely perform these calculations on low volumes of data, and as the amount of data increases, the time and resources required grow significantly and using a single node to process the data becomes difficult.
At this point, we need to rely on a distributed framework like MapReduce or Spark for parallel processing to divide and make sense of Big Data.
Those who have used MapReduce should be very familiar with its WordCount example. The following figure explains how MapReduce counts the amount of duplicating terms.
Think about it this way: If the number of visitors to your website or app gets too large, say 10,000,000 visitors, but the visiting record notes 100,000,000 (assuming every viewer visits 10 times), and if every user’s ID is already shown by using int, then one simple Distinct Count calculation is 100,000,000 * 4 bytes = 400 MB = 3,200 Mb of data to be shuffled.
Assuming we are using the gigabit network to compute, with a 3 second delay for transport in addition to disk reading, sorting, serialization, and deserialization, we will end up with a total time of at least 10 seconds.
In real life scenarios the situation could be even more complicated:
Overall, Distinct Counting with Big Data is often a resource-intensive computational process and perfecting this process to finish within one second latency is extremely difficult. If such a query becomes more popular, then we will definitely need to optimize the data structure and its calculation.
Many researchers have already realized there is room for optimization here and have developed a variety of formulas and data structures in response. The most popular two being HyperLogLog and Bitmap.
The similarity of the two algorithms is that both of them use extremely refined structures to store a set of distinct values (or complete set). Not only will this return the distinct value, but this structure can also perform follow up calculations (for example yesterday’s and today’s Distinct Count). Compared to de-duplicating at the origin value every single time, the efficiency of storage and calculations are greatly improved in both of these algorithms.
However, these two algorithms have very obvious differences:
Both of these calculations have their pros and cons: overall, HLL is very good but it lacks accuracy; Bitmap may take up a lot more space than HLL, but it does guarantee accuracy.
So, how much does accuracy matter? When you’re talking about error rates of around 1%, it may not seem like a big deal – and it might not be. For many, however, Big Data has shifted the thinking around this.
There was a time when data was limited and ways of collecting it were few, this isn’t the case any longer. Now, with businesses investing substantially in new ways to collect and analyze all the data they can, even small error rates can significantly impact the result of a model or an analyst’s ability to confidently identify new opportunities when working with datasets in the hundreds of millions of rows or more.
In the next part of this series, we’ll delve deeper into why accuracy is important, and the challenges faced by using HLL with Big Data.
Learn about the fundamentals of a data product and how we help build better data products with real customer success stories.
Learn how the North Star Metric framework can boost business growth. Explore real-world NSM examples, implementation, and the Kyligence unique advantage.
Dive into the world of AI web analytics with Kyligence Copilot. Discover how AI-driven insights revolutionize traditional web analytics, and learn practical steps to implement it for your business.
What are marketing metrics? Discover the definition of marketing metrics, importance, types you should track, and how to analyze them using Kyligence Zen and Kyligence Copilot, AI tools for data
Will AI replace data analysts' jobs? Find the answer from expert arguments, benefits and limitations of AI, and how to maintain relevance as a data analyst.
Unlock the potential of self-service analytics with AI. Dive into the transformative power of accessible data and gain insights from industry leader Gartner. Discover why AI-driven self-service analytics is the standout trend of 2023.
Learn how the 5 Whys root cause analysis unravel complex data challenges. See how to implement automated root cause analysis with Kyligence Copilot.
Learn how top companies use data AI for business forecasting, personalization, and more. Analyze complex business metrics better with Kyligence Zen Copilot.
"Discover how AI Copilot for Data revolutionizes retail analytics, making complex insights easier to grasp. Explore the power of artificial intelligence in simplifying data interpretation for smarter retail decisions."
Already have an account? Click here to login
You'll get
A complete product experience
A guided demo of the whole process, from data import, modeling to analysis, by our data experts.
Q&A session with industry experts
Our data experts will answer your questions about customized solutions.
Please fill in your contact information.We'll get back to you in 1-2 business days.