By now, you’re familiar with the challenges Count Distinct (also referred to as distinct counting) faces when working with big data, popular approaches to addressing these challenges (as well as their drawbacks), and why ensuring accurate count distinct queries is difficult (but also critical).
We’ve also compared the two most common algorithms used for Count Distinct, HyperLogLog (HLL) and Bitmap, and their tradeoffs when it comes to delivering timely and accurate query results. While both approaches have their strengths, they still leave a lot to be desired in terms of the performance and accuracy modern analytics work requires.
To address these shortcomings, modern approaches to fast and accurate Count Distinct queries have emerged. One of the leading examples of this comes out of the development work done as part of the open source project Apache Kylin – the world’s leading open source solution for query acceleration.
Accurate Count Distinct with Apache Kylin
If you are already familiar with Apache Kylin, then you know how Kylin precomputes data according to a user’s specified measures and dimensions, then calculates and stores the indexed dimensional values (such as daily sales records, sales totals, and so on) in an OLAP Cube.
For Count Distinct measures, storing a single value per cell is not enough: a query may need to traverse multiple cells and merge their results, and a plain integer count cannot be merged without double-counting. In the past, Kylin would first serialize an HLL object, then store it in the cell matching the cube’s dimensional values.
At query time, the object would be deserialized and passed to the SQL executor for merge calculations (through Kylin’s aggregate function); after the merge, the final HLL object would be read to retrieve the distinct count value. Following the same approach, if HLL is swapped out for Bitmap, then, in theory, we can store and retrieve distinct count values with full accuracy.
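To make the idea concrete, here is a minimal sketch (not Kylin’s actual code) of why per-cell bitmaps merge exactly: each cube cell stores the set of user IDs it saw as a bitmap, and a query spanning several cells unions the bitmaps before counting.

```python
def bitmap_from_ids(ids):
    """Pack integer IDs into a Python int used as a bitmap."""
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

# Two hypothetical cube cells (e.g. two hours of the same day)
# with overlapping users.
cell_a = bitmap_from_ids([1, 5, 9])
cell_b = bitmap_from_ids([5, 9, 42])

# Merging is a bitwise OR; counting is a popcount. The overlap (5 and 9)
# is deduplicated exactly, which a plain per-cell integer count cannot do.
merged = cell_a | cell_b
print(bin(merged).count("1"))  # 4 distinct users, not 3 + 3 = 6
```

A serialized HLL sketch merges the same way in principle, but only yields an estimate; the bitmap union is exact.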
Great, we now have a clear plan, but we are still facing two challenges:
1. Bitmap uses too much space
As was mentioned previously, Bitmap takes up more storage space than HLL, though far less than storing the collection of original values. A Bitmap covering a cardinality of up to 100,000,000 requires around 100,000,000 / 8 bytes, which is about 12MB. As dimensions and cardinality increase, however, the cube can consume a great deal of storage space.
After some additional research, Kylin adopted a compressed Bitmap format: Roaring Bitmap. Roaring Bitmap splits a 32-bit integer into its upper 16 bits and lower 16 bits. The upper 16 bits become a key, every key has its own container, and the lower 16 bits are placed into that container.
There are three types of containers for different situations: Array Containers, Bitmap Containers, and Run Containers, each with its own specialized way of compressing a given set of data. In practice, Roaring Bitmap greatly reduces Bitmap’s required storage space and RAM usage.
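The partitioning idea can be sketched as follows. This is an illustrative simplification, not a real implementation (run containers and many optimizations found in actual Roaring Bitmap libraries are omitted); the 4096-entry threshold is the point at which a sorted array of 16-bit values stops being cheaper than a 65,536-bit bitmap.

```python
ARRAY_LIMIT = 4096  # beyond this, an array container costs more than a bitmap

def add(containers, value):
    """Insert a 32-bit unsigned value, keyed by its upper 16 bits."""
    key, low = value >> 16, value & 0xFFFF
    kind, data = containers.get(key, ("array", set()))
    if kind == "array":
        data.add(low)
        if len(data) > ARRAY_LIMIT:  # upgrade: array -> bitmap container
            bm = 0
            for v in data:
                bm |= 1 << v
            kind, data = "bitmap", bm
    else:
        data |= 1 << low
    containers[key] = (kind, data)

containers = {}
for v in (70000, 70001, 3):
    add(containers, v)
# 70000 and 70001 share their upper 16 bits (key 1); 3 lands under key 0.
print(sorted(containers))  # [0, 1]
```

Because nearby values share a key, dense runs of IDs cluster into few containers, which is where the compression comes from.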
2. Bitmap only accepts int/long types as input
Another challenge needed to be addressed: Bitmap only accepts int/long values (or any type that can be directly cast to one of the two) as input. If the values to be deduplicated are of any other type, the user has to build a 1:1 mapping to integers before Bitmap distinct count can proceed, which greatly increases complexity.
Fortunately, Kylin already builds a dictionary for each dimension and uses it to map values 1:1 to int values, substituting ints for strings to greatly reduce the space used in subsequent cube calculation and storage.
This raises an interesting question: if Kylin used a dictionary to encode values first, wouldn’t that make Bitmap-based distinct count work?
Generally, this would work, but Kylin’s dimensional dictionary is not a good fit for distinct count, mainly for the following reasons:
- Dimensional dictionaries are order-preserving, and therefore cannot be modified after construction.
- Dimensional dictionaries are built per segment: each new segment gets a brand-new dictionary. As a result, the same string value can be encoded to different int values in two different segments, and different string values across segments can be encoded to the same int value. Used with Bitmap, this would produce incorrect results for cross-segment distinct counts, so this is not a viable direction to take.
Kylin needed to develop a dictionary that could be appended to while guaranteeing a consistent, unique mapping across all segments. Since it only serves distinct count, it does not need to support reverse mapping. To distinguish it from a regular dictionary, we refer to this type of dictionary as the Global Dictionary (in code this is called AppendTrieDictionary), meaning that it serves each and every segment (and can even serve multiple cubes).
Compared to a regular dictionary, a Global Dictionary sacrifices order preservation and no longer supports two-way mapping (decoding an int back to its original string value). It is built exclusively for accurate distinct count over non-int values, so be mindful of this difference when using it.
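The contract can be illustrated with a toy sketch (a hypothetical class, not Kylin’s AppendTrieDictionary): the mapping is append-only, one-way, and stable across segments, so the same string always encodes to the same int.

```python
class GlobalDictionary:
    def __init__(self):
        self._ids = {}  # string -> int; no reverse map is kept

    def encode(self, value):
        """Return the id for value, appending a new one if unseen."""
        if value not in self._ids:
            self._ids[value] = len(self._ids) + 1  # ids only ever grow
        return self._ids[value]

gd = GlobalDictionary()
segment_1 = [gd.encode(u) for u in ("alice", "bob")]
segment_2 = [gd.encode(u) for u in ("bob", "carol")]
# "bob" gets the same id in both segments, so cross-segment
# bitmap unions count him exactly once.
print(segment_1, segment_2)  # [1, 2] [2, 3]
```

A per-segment dictionary, by contrast, could assign "bob" id 2 in one segment and id 1 in another, corrupting any cross-segment union.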
With the challenges above solved, Kylin could now run on massive data sets, using cube computations based on user-designed models to store the unique values of various dimensions and measures in Bitmap form.
Queries then operate on these Bitmaps, for example:
select count(distinct UserID) from access_log where date = '2019/09/09'
Further Improvements to Kylin’s Count Distinct Approach
The work was not yet over, however, and after years of practice, the Kylin community’s developers have continued to improve Kylin’s accurate Count Distinct capabilities, making them more robust and performant. These improvements include:
1. Using Segment Global Dictionary
As was discussed, to guarantee that the same value always maps to the same int across segments, a Global Dictionary was developed that can grow over time. As values accumulate, the Global Dictionary expands, and eventually loading it into memory becomes a strain.
The Global Dictionary is paged, so it does not all need to be loaded into RAM in one go; still, if there is not enough RAM, performance suffers.
In some scenarios, the business only cares about each day’s deduplicated results, not deduplication across separate days. In that case there is no need to maintain a global mapping; a segment-level dictionary is enough (segments are often built per day).
This segment-level Global Dictionary is much smaller than a full Global Dictionary and much easier to load into RAM for processing. Once construction is complete, it can even be deleted to save space. Be aware of the tradeoff, however: with a segment-level Global Dictionary, the cube’s segments can no longer be merged.
2. Shrinking the dictionary per Mapper to speed up cube construction
As the Global Dictionary grows, only some of its pages can be held in RAM during construction; when RAM reaches its limit, pages are evicted. If the incoming data is relatively unordered and scattered across many pages of the Global Dictionary, the time spent repeatedly swapping pages in and out grows accordingly.
To address this issue, Kylin introduced an optimization: before encoding, it first collects the distinct values of the deduplication column within each Mapper’s data, looks up their corresponding int values in the Global Dictionary, and generates a small dictionary used only by that Mapper.
During cube construction, having each Mapper use this small dictionary instead of the full one reduces the number of dictionary pages swapped in and out, which improves performance.
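A hedged sketch of this per-Mapper shrunken dictionary (illustrative names and data, not Kylin’s code): collect the distinct values in one Mapper’s input split, look each one up in the global dictionary once, and hand the Mapper a small local map for encoding.

```python
# Pretend this is the huge, paged Global Dictionary.
global_dict = {"alice": 1, "bob": 2, "carol": 3, "dave": 4}

def build_mapper_dict(rows, column):
    """Map the distinct values in this Mapper's split to their global ids."""
    distinct = {row[column] for row in rows}
    return {v: global_dict[v] for v in distinct}

mapper_rows = [{"user": "bob"}, {"user": "bob"}, {"user": "carol"}]
local_dict = build_mapper_dict(mapper_rows, "user")

# Only 2 entries are resident during encoding, however large global_dict is,
# so encoding never touches the big dictionary's pages again.
encoded = [local_dict[r["user"]] for r in mapper_rows]
print(encoded)  # [2, 2, 3]
```

Each global-dictionary page is thus touched at most once per Mapper, during the distinct-value lookup, rather than once per row.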
3. Using Hive/Spark to build an external Global Dictionary in a distributed way
Using a Global Dictionary also presents its own limitations. For example:
- The dictionary must be constructed on a single task node, which limits performance. If there are multiple Count Distinct tasks running, jobs end up waiting on one another.
- A Global Dictionary cannot be used for decoding, nor can it be used directly by other big data applications, which wastes a potential data asset.
Therefore, contributors in the Kylin community proposed externalizing the Global Dictionary into a Hive table (two columns: one for the original value, the other for the encoded int value). By generating and appending to it with Hive/Spark in the cluster, and performing the mapping in a distributed way during cube construction, Kylin reduces the load on its task nodes. Moreover, the external Global Dictionary can easily be reused, becoming an enterprise data asset. This feature was officially released in Kylin 3.0.
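The append semantics can be sketched as follows (an assumed simplification, modeling the two-column Hive table as a plain dict): only unseen values get new ids, numbered after the current maximum, so existing encodings never change.

```python
# The external dictionary table's current content: value -> encoded int.
existing = {"alice": 1, "bob": 2}

def append_batch(table, new_values):
    """Append only novel values, ids continuing from the current max."""
    next_id = max(table.values(), default=0) + 1
    for v in new_values:
        if v not in table:  # existing rows are never rewritten
            table[v] = next_id
            next_id += 1
    return table

append_batch(existing, ["bob", "carol", "dave"])
# alice and bob keep ids 1 and 2; carol and dave get 3 and 4.
print(existing)
```

Because the table is append-only, each batch can be computed as a distributed join against the current table, with only the novel rows inserted.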
4. Bitmap values are returned directly
Kylin uses Bitmap to compute unique visitor (UV) values; some queries, however, are simple enough that the cube can produce the final result in a single pass, with no further Bitmap merging required. In that case, each HBase node does not need to send its Bitmap to Kylin at all.
Instead, it only has to return the resulting count, greatly reducing the network transmission from the HBase nodes to Kylin’s query nodes. Kylin automatically determines whether a query matches the precomputed results exactly and, on that basis, whether to apply this optimization.
Developing the Right Approach to Count Distinct
Kylin’s approach to fast, accurate Count Distinct queries is one of many, but hopefully, after learning about its development, you’re starting to see what makes it special. In an era where data is plentiful and time-to-insight is everything, being able to deliver lightning-fast Count Distinct queries with true accuracy is a notable achievement.
In the final installment of this series, we’ll take one final look at Kylin’s approach to Count Distinct, and why it’s the only OLAP engine capable of deduplicating at sub-second speeds.