This article is the third in a four-part series that looks at Count Distinct, how it works with Big Data, and how to perform it quickly on even the largest datasets. Part 1 | Part 2 | Part 4
By now, you’re familiar with the challenges Count Distinct (also
referred to as distinct counting) faces when working with big data, popular
approaches to addressing these challenges (as well as their drawbacks), and why
ensuring accurate count distinct queries is difficult (but also critical).
We've also discussed the two most common algorithms used for Count Distinct, HyperLogLog (HLL) and Bitmap, and their tradeoffs when it comes to delivering timely and accurate query results. While both approaches have their strengths, they still leave a lot to be desired when it comes to the performance and accuracy modern analytics work requires.
To address these shortcomings, modern approaches to fast and accurate Count Distinct queries have emerged. One of the leading examples of this comes out of the development work done as part of the open source project Apache Kylin – the world’s leading open source solution for query acceleration.
If you are already familiar with Apache Kylin, you know that Kylin precomputes data according to a user's specified measures and dimensions, then calculates and stores the indexed dimensional values (such as daily sales counts, sales totals, and so on) in an OLAP Cube.
For Count Distinct measures, storing a single value is not enough: a user's query may need to traverse multiple cells and merge their results, and a plain int count cannot be merged without double counting. In the past, Kylin would therefore serialize the HLL object and store it in the matching cell alongside the cube's dimensional values.
At query time, the serialized object was deserialized and passed to the SQL executor for merge calculations (through Kylin's Aggregate Function). After the merge returned a final object, Kylin would go back to the HLL object and retrieve the distinct count value. By the same logic, as long as HLL can be replaced with Bitmap, we can store and query distinct count values exactly.
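The store-serialized, merge-at-query pattern described above can be sketched as follows (a minimal toy: a plain Python set stands in for the HLL or Bitmap object, and pickle stands in for Kylin's serializer; the cell names are illustrative):

```python
import pickle

# build time: each cube cell stores a serialized dedup object,
# not a bare count
cell_mon = pickle.dumps({"user_a", "user_b"})
cell_tue = pickle.dumps({"user_b", "user_c"})

# query time: deserialize each cell, merge via the aggregate
# function, then extract the final distinct count
merged = pickle.loads(cell_mon) | pickle.loads(cell_tue)
distinct_count = len(merged)   # 3, not 2 + 2

# a bare int per cell could not do this: 2 + 2 = 4 would
# double-count user_b
```

The key property is that the stored object is mergeable: unions never double-count, which is exactly what a plain integer per cell cannot guarantee.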
Great, we now have a clear plan, but we are still facing two challenges:
1. Bitmap uses too much space
As mentioned previously, Bitmap takes up more storage space than HLL, though still far less than storing the collection of original values. A Bitmap covering values up to 100,000,000 requires around 100,000,000/8 bytes, which is about 12 MB. As dimensions and cardinality increase, the cube can consume a great deal of storage.
After some additional research, Kylin adopted a compressed Bitmap format: Roaring Bitmap. Roaring Bitmap splits each 32-bit integer into its upper 16 bits and lower 16 bits. The upper 16 bits serve as a key, every key has its own container, and the remaining lower 16 bits are placed into that container.
There are three container types for different situations: Array Containers, Bitmap Containers, and Run Containers. Each container type has its own specialized way of compressing a given set of data. In practice, Roaring Bitmap greatly reduces Bitmap's required storage space and RAM usage.
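As a rough sketch of the split-and-bucket idea (a toy in plain Python, not the real RoaringBitmap library; the 4096-element array-to-bitmap threshold matches the published Roaring format):

```python
from collections import defaultdict

ARRAY_LIMIT = 4096  # Roaring switches from array to bitmap containers here

def build_containers(values):
    buckets = defaultdict(set)
    for v in values:
        key = v >> 16          # upper 16 bits select the container
        low = v & 0xFFFF       # lower 16 bits are stored inside it
        buckets[key].add(low)
    containers = {}
    for key, lows in buckets.items():
        if len(lows) <= ARRAY_LIMIT:
            # sparse: a sorted array of 16-bit values (2 bytes each)
            containers[key] = ("array", sorted(lows))
        else:
            # dense: a fixed 8 KB bitmap (65536 bits)
            bitmap = bytearray(8192)
            for low in lows:
                bitmap[low >> 3] |= 1 << (low & 7)
            containers[key] = ("bitmap", bitmap)
    return containers

containers = build_containers([1, 2, 70000, 70001])
# values 1 and 2 share key 0; 70000 and 70001 share key 1
```

Sparse containers pay 2 bytes per value, dense ones a flat 8 KB, so each key's container is close to the cheaper of the two representations (Run Containers, omitted here, compress long consecutive runs further).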
2. Bitmap only accepts int/long types as input
Another challenge that needed to be addressed: Bitmap only accepts int/long values (or any type that can be cast directly to these two types) as input. If the values to be deduplicated are not of these types, the user must first build a 1:1 mapping to integers before proceeding with Bitmap distinct count, which greatly increases complexity.
Fortunately, Kylin already constructs a dictionary for each dimension and uses it to map dimension values 1:1 into int values. Substituting int values for strings greatly reduces the space used in subsequent cube calculation and storage.
This presents an interesting question: if Kylin used a dictionary to encode the distinct count column first, wouldn't that make Bitmap work?
Generally, this would work, but Kylin's dimension dictionary is not a perfect fit for distinct count, mainly because it is built per segment and cannot guarantee that the same value maps to the same int across all segments.
Kylin therefore needed to develop a dictionary that could be appended to over time while guaranteeing a unique, consistent mapping across all segments. Since it exists only to serve distinct count, it does not need to support reverse mapping. To distinguish it from a regular dictionary, we refer to this type of dictionary as the Global Dictionary (in code, AppendTrieDictionary), meaning that it serves each and every segment (it can also serve multiple cubes).
Compared to a regular dictionary, a Global Dictionary sacrifices order preservation and no longer supports two-way mapping (decoding an int back into its original string value). It is made exclusively for accurate distinct count over non-int values, so be mindful of this difference when using it.
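The contract of such an append-only, forward-only dictionary can be sketched in a few lines (a toy stand-in; the real AppendTrieDictionary is a paged trie, but the guarantees are the same: a value always maps to the same int across builds, and there is no reverse lookup):

```python
class GlobalDictionary:
    """Append-only, forward-only mapping from values to ints."""

    def __init__(self):
        self._ids = {}

    def get_or_append(self, value):
        # existing values keep their id; new values get the next id,
        # so the mapping stays stable as later segments are built
        if value not in self._ids:
            self._ids[value] = len(self._ids)
        return self._ids[value]

d = GlobalDictionary()
first = d.get_or_append("user_a")   # assigned id 0
d.get_or_append("user_b")           # assigned id 1
again = d.get_or_append("user_a")   # still 0, even in a later segment
```

Because ids are never reassigned, Bitmaps built from different segments can be merged safely; the price is that nothing here can turn an id back into its original string.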
After solving the challenges above, Kylin could be used on massive datasets, running cube computations according to user-designed models and storing the distinct values of the relevant dimensions and measures in Bitmap format.
Queries then operate directly on these Bitmaps, for example:
select count(distinct UserID) from access_log where date = '2019/09/09'
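A toy sketch of how a query like the one above is answered from precomputed per-cell Bitmaps (plain Python ints serve as bitsets here, one bit per dictionary-encoded UserID; the cell layout is illustrative):

```python
def bitmap_of(user_ids):
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid   # one bit per dictionary-encoded user id
    return bits

# assume two cube cells cover date = '2019/09/09' (e.g. two regions)
cell_a = bitmap_of([0, 1, 2])      # users 0, 1, 2
cell_b = bitmap_of([2, 3])         # user 2 appears in both cells

# merging is a bitwise OR, so overlapping users are counted once;
# the popcount of the result is the exact distinct count
merged = cell_a | cell_b
distinct_users = bin(merged).count("1")   # exact count: 4
```

Unlike summing per-cell counts (3 + 2 = 5), the OR-then-popcount approach is exact regardless of how the data is split across cells.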
The work was not yet over, however. Over years of practice, the Kylin community's developers have continued to improve Kylin's accurate Count Distinct capabilities, making them more robust and performant. These improvements include:
1. Using Segment Global Dictionary
As discussed, to make sure that the same value is always mapped to the same int across segments, the Global Dictionary was developed, and it can only grow over time. As values accumulate, the Global Dictionary expands, and eventually loading it into memory becomes a strain.
The Global Dictionary is paged, so it does not need to be loaded into RAM all at once; even so, if RAM is insufficient, its effectiveness still suffers.
In some scenarios, the business only looks at each day's deduplicated results and does not care about deduplication across separate days. In that case there is no need to maintain a global mapping; a single segment-level dictionary mapping is enough (segments are often constructed per day).
This segment-level Global Dictionary is much smaller than a truly global one and much simpler to load into RAM for processing. Once construction is complete, it can even be deleted to save space. Be aware of one drawback, however: with a segment Global Dictionary, the cube's segments no longer support merging.
2. Small per-Mapper dictionaries speed up cube construction
As the Global Dictionary grows during construction, pages of it are loaded into RAM and evicted again when RAM reaches its limit. If the incoming data is out of order, or scattered across many pages of the Global Dictionary, the repeated page swapping becomes expensive.
To address this issue, Kylin introduced an optimization. Before encoding, it first collects the distinct values in each Mapper's deduplication column, looks up the corresponding int value for each in the Global Dictionary, and generates a small dictionary used only by the current Mapper.
During cube construction, having each Mapper use this small dictionary rather than the full one reduces the number of dictionary pages swapped in and out, which improves performance.
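The per-Mapper optimization can be sketched as follows (names and data layout are illustrative, not Kylin's actual classes):

```python
def build_mapper_dict(mapper_rows, global_dict):
    """Resolve this mapper's distinct values against the global
    dictionary once, producing a small in-memory lookup table."""
    distinct = {row["user"] for row in mapper_rows}   # one dedup pass
    return {v: global_dict[v] for v in distinct}      # small sub-dictionary

# the full (paged, potentially huge) dictionary
global_dict = {"user_a": 0, "user_b": 1, "user_c": 2, "user_d": 3}

# one mapper's slice of the input only touches two distinct users
rows = [{"user": "user_b"}, {"user": "user_b"}, {"user": "user_d"}]

small = build_mapper_dict(rows, global_dict)
# encoding now hits only the small table, not the paged dictionary
codes = [small[r["user"]] for r in rows]   # [1, 1, 3]
```

The full dictionary is consulted once per distinct value instead of once per row, so a mapper whose data clusters on few values avoids most of the page churn.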
3. Building an external Global Dictionary with distributed Hive/Spark
Maintaining the Global Dictionary inside Kylin also presents its own limitations.
Therefore, contributors in the Kylin community proposed externalizing the Global Dictionary into a Hive table (two columns: the original value and its encoded int value). Hive/Spark generates and appends to this table in the cluster and performs distributed mapping during cube construction, reducing the load on Kylin's task nodes. Moreover, the external Global Dictionary is easy to reuse, effectively becoming an enterprise data asset. This feature was officially released in Kylin 3.0.
4. Bitmap values are returned directly
Kylin uses Bitmaps to compute unique visitor (UV) values; however, some queries are simple enough that the cube can answer them in a single pass, with no secondary Bitmap merging required. In that circumstance, the HBase nodes do not need to pass the Bitmaps to Kylin at all.
Instead, each node only has to return its resulting value, greatly reducing the network transmission from HBase nodes to Kylin's query nodes. Kylin automatically decides whether the query exactly matches the precomputed values, and from that determines whether to apply this optimization.
Kylin's approach to fast, accurate Count Distinct queries is one of many, but hopefully, after learning about its development, you're starting to see what makes it special. In an era where data is plentiful and time-to-insight is everything, being able to deliver lightning-fast Count Distinct queries with true accuracy is a notable achievement.
In the final installment of this series, we’ll take one final look at Kylin’s approach to Count Distinct, and why it’s the only OLAP engine capable of deduplicating at sub-second speeds.