Precomputation is commonly used in information retrieval and analysis. With precomputation, we compute the results once and then store them in a table or other data structure, either on a disk or in memory. In this table, each input (or combination of inputs) maps to an output value. In order to answer a question, we just need to find the output value corresponding to the input value(s).
Early Forms of Precomputation
The form you are likely most familiar with, and probably the most common form of precomputation, is the multiplication table (Figure 1). In this table, to find the result of 7*8, all you need to do is find row 7 (or 8) and then move across to column 8 (or 7) – and there is your answer: 56! Most of us memorized the multiplication table in grade school, so we don’t even realize we are using this table every time we multiply today.
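The lookup-table idea behind the multiplication table can be sketched in a few lines. This is a minimal illustration, not anything from the article: we pay the cost of every multiplication once, up front, and every later "query" becomes a dictionary lookup.

```python
# Precompute a 12x12 multiplication table once; afterwards, every
# multiplication in range is answered by a single table lookup.
mul_table = {(a, b): a * b for a in range(1, 13) for b in range(1, 13)}

def multiply(a: int, b: int) -> int:
    """Look the answer up instead of computing it."""
    return mul_table[(a, b)]

print(multiply(7, 8))  # 56 — row 7, column 8
```

The trade-off is the classic one for precomputation: storage (144 entries here) in exchange for constant-time answers.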
Another, more advanced, example of precomputation is the table of logarithms (Figure 2). If you don’t remember how to use it, you’re certainly not alone. Fun fact: the first tables of logarithms were published independently by Scottish mathematician John Napier in 1614 and Swiss mathematician Justus Byrgius in 1620. Napier actually started his work in 1594, so it took him 20 years to complete the tables.
From Database to Big Data to The Cloud
Precomputation has been used in databases for many years. To save time on joining tables and calculating columns, we can run these queries upfront and save the results into materialized views. Future queries will be directed to these materialized views to retrieve their results.
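The materialized-view pattern described above can be sketched with Python's built-in `sqlite3` module. Note the assumptions: SQLite has no native materialized views, so this emulates one with `CREATE TABLE ... AS SELECT`, and all table and column names are illustrative.

```python
import sqlite3

# Emulate a materialized view: run the join/aggregation once, store the
# result as a table, and point future queries at the stored result.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (product_id INTEGER, amount REAL);
    CREATE TABLE products (product_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget');

    -- The expensive query runs once, up front...
    CREATE TABLE mv_sales_by_product AS
        SELECT p.name, SUM(o.amount) AS total
        FROM orders o JOIN products p ON o.product_id = p.product_id
        GROUP BY p.name;
""")

# ...and later queries read the precomputed result directly,
# skipping the join entirely.
rows = con.execute(
    "SELECT name, total FROM mv_sales_by_product ORDER BY name"
).fetchall()
print(rows)  # [('Gadget', 7.5), ('Widget', 15.0)]
```

Real databases with native materialized views (e.g. PostgreSQL, Oracle) also handle refreshing the stored result when the underlying tables change, which this sketch omits.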
In 1993, Edgar F. Codd, the father of the relational database, coined the term OLAP, for On-Line Analytical Processing, in a white paper published by Arbor Software. In OLAP, transactional data is extracted from operational systems and loaded into data warehouses, where the data is organized for fast reporting. Even so, query performance was still not satisfactory for interactive analysis.
To address this, OLAP systems precompute common queries over the data warehouse and store the results in a data structure called a cube. When an OLAP product receives a query, it simply looks up the answer already stored in the cube and fetches the result, which reduces query times significantly. Over time, “OLAP” and “cube” became synonymous, although “OLAP” was later broadened into the MOLAP, ROLAP, and HOLAP architectures.
As illustrated in the cube diagram here, if you need to analyze sales volume based on three attributes – e.g. Products, Cities, and Time – you can create a three-dimensional cube. This cube is basically a three-dimensional spreadsheet.
Keep in mind that although the data structure is called a cube, in reality most applications have more than three dimensions – there is just no easy way to illustrate that in a diagram. OLAP cubes became quite popular in the late ’90s with products like Microsoft SSAS, Cognos, and MicroStrategy.
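The cube described above can be sketched as a mapping from dimension coordinates to precomputed measures. This is a toy illustration with made-up dimension values, not how any particular OLAP product stores its cubes:

```python
# A toy three-dimensional sales cube: each cell maps a
# (product, city, month) coordinate to an aggregated sales total.
facts = [
    ("Laptop", "Paris", "Jan", 100),
    ("Laptop", "Paris", "Feb", 120),
    ("Laptop", "Tokyo", "Jan", 80),
    ("Phone",  "Paris", "Jan", 60),
]

cube = {}
for prod, city, month, amount in facts:
    key = (prod, city, month)
    cube[key] = cube.get(key, 0) + amount  # aggregate once, up front

# "Laptop sales in Paris in January?" is now a single cell lookup,
# with no scan over the underlying fact rows.
print(cube[("Laptop", "Paris", "Jan")])  # 100
```

Adding a fourth dimension is just a fourth element in the key tuple, which is why cubes in practice routinely go beyond three dimensions even though the diagram cannot.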
In the big data and cloud era, we have seen various technologies try to address query performance issues for large volumes of data. From query engines to cloud data warehouses to data virtualization products, these technologies use different forms of MPP (Massively Parallel Processing) architecture. Although they have improved dramatically over the years, their performance still degrades when running complex aggregate queries against large amounts of data.
Caching, as a special type of precomputation, is commonly used by these products to improve performance. Query engines cache result sets in memory so the exact same query can reuse the same result set. Cloud Data Warehouses bring tables into the compute layer to avoid future network traffic between the compute and storage layers.
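The result-set caching described above amounts to memoization keyed on the query text. A minimal sketch, using Python's `functools.lru_cache` and a stand-in `run_query` function of my own invention (no real query engine is involved):

```python
from functools import lru_cache

calls = []  # records how many times the "engine" is actually hit

@lru_cache(maxsize=128)
def run_query(sql: str) -> str:
    calls.append(sql)              # stands in for real query execution
    return f"results for: {sql}"  # stands in for a real result set

run_query("SELECT COUNT(*) FROM sales")
run_query("SELECT COUNT(*) FROM sales")  # cache hit: engine untouched
print(len(calls))  # 1
```

This also shows the limitation the next paragraph points out: the cache helps only when the incoming query is byte-for-byte identical to one seen before, so a slightly different ad-hoc query gets no benefit.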
These techniques speed up only a very limited subset of queries and don’t solve the fundamental problem: they don’t truly support citizen analysts conducting ad-hoc analysis on large amounts of data at the speed of thought.
Kyligence – A New Generation of Precomputation Technology
Kyligence fundamentally changes how modern analytics is done with its breakthrough precomputation technology. When the creators of the Apache Kylin project founded the company back in 2016, they had a vision to create a platform that could enable citizen analysts to do their jobs without worrying about the size of their data or the number of concurrent users. Today, its products are used by some of the world’s largest banks, insurance companies, retailers, and manufacturers.
Know the Questions Before They Are Asked
Wouldn’t it be nice to be fully prepared and know all the questions you will be asked before you walk into an interview? That’s exactly the approach Kyligence takes. By learning from past query histories, analysts’ behaviors, data profiles, and system logs, Kyligence automatically predicts the common questions people ask and prepares answers based on those predictions.
This process is driven by its AI engine, and the more queries the engine processes, the more accurately it predicts the questions. System analysts can further refine the engine’s predictions by applying their own knowledge of the underlying business processes.
Blazing Fast Processing Speed
Knowing what questions users are going to ask is just the first step. The software needs to prepare the answers in time so they are available when users need them. This is the ‘compute’ part of precomputation technology. Kyligence’s Spark-based compute engine adopts many optimization techniques to speed up the building of aggregate indexes.
In addition to the typical batch mode, indexes can also be refreshed through a pre-scheduled incremental load, or real-time updates from messaging products such as Kafka. The fast processing speed allows users to incorporate the latest information into the aggregate index and gives them the most accurate picture of their business.
Built for the Modern Data Platform
Kyligence’s flagship products, Kyligence Cloud and Kyligence Enterprise, are built from the ground up for the modern data platform. They leverage the latest technologies to increase manageability and reduce infrastructure cost.
In the cloud, Kyligence can be deployed on AWS, Azure, and Google Cloud, either directly from the Kyligence portal or through the AWS and Azure Marketplace. It works with modern Cloud Data Warehouses as well as cloud data storage systems. Kyligence Cloud takes advantage of the elasticity of the cloud platform so that enterprises don’t have to overpay for infrastructure and still have the capability to support usage spikes.
Kyligence’s aggregate index is stored in Parquet files, a modern storage format well suited to analytics workloads. The aggregate index lives either in cloud object stores or, when deployed on-premises, on Hadoop file systems. This type of distributed storage makes it possible to build an aggregate index hundreds of times larger than those of other precomputation products.
One of our clients stores petabytes of data in the aggregate index so that all historical information is always available for their analysts. Businesses no longer have to choose between having complete access to historical data and the speed of their analytics.
Traditional precomputation products, such as OLAP cubes, were very rigid: adding an extra measure or dimension meant rebuilding the cube, which led to extra development costs and system downtime. Kyligence handles source schema changes, adjusts measures and dimensions accordingly, and conveniently updates the aggregate index.
The system also automatically tunes the index to balance the often-conflicting requirements of query performance, storage space, and update time. Users can adjust these settings, and the software trains itself to fine-tune the indexes based on user behavior.
In this blog post, we looked at the history of precomputation and introduced Kyligence’s game-changing precomputation technology, which opens the door for enterprises and their data scientists to conduct analytics at unprecedented speed and scale.