
OLAP vs. Other Approaches to Big Data Analytics

Author
Kyligence
Jun. 29, 2022
 

 

It doesn’t take long to find a blog or article proclaiming: OLAP is dead! The argument goes that because the cloud offers unlimited scale, you no longer need to precompute results to get good performance. Concepts associated with OLAP (Online Analytical Processing) such as star schema, multi-dimensional analysis, and data cubes are painted with the broad brush of “yesterday’s analytics.”

 
The Pain Points of Big Data
 

Most people see and experience similar pain when dealing with big data. This can include slow queries on big volumes of data, low concurrency due to competition over scarce system resources, and heterogeneous systems that have to be maintained to accommodate different data needs, both legacy and new.

 

People come up with different approaches to address this pain, and some of those approaches reflect the tendency to gravitate towards what’s new and cool in technology. People look down on “old” things like OLAP, OLAP Cubes, and multi-dimensional analysis, dismissing them as dead, just as SQL was declared dead when MapReduce debuted.

 

But is new necessarily better? How much MapReduce code is being written these days? Spark has killed MapReduce, and Hadoop is teetering on the precipice. But Spark has not killed SQL, which may be the stickiest technology out there.

 
What Is OLAP?
 

If you're relatively new to the big data and business intelligence field, you may be unfamiliar with the term OLAP, so it helps to define it here to provide some context for the rest of this article.

 

OLAP (short for OnLine Analytical Processing) is an approach designed to quickly answer analytics queries involving multiple dimensions. It does this by rolling up large, sometimes separate datasets into a multidimensional database known as an OLAP Cube. This OLAP Cube is optimized for easy analysis and enables the "slicing and dicing" of data from different viewpoints for a streamlined query experience.
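To make the idea of rolling data up into a cube concrete, here is a minimal sketch in Python. The fact rows, the dimension names, and the helpers build_cube and query are all hypothetical and exist only for illustration; real OLAP engines store and prune cuboids far more carefully than this.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, year, sales)
FACTS = [
    ("EU", "widget", 2021, 100),
    ("EU", "gadget", 2021, 50),
    ("US", "widget", 2021, 70),
    ("US", "widget", 2022, 30),
]
DIMENSIONS = ("region", "product", "year")


def build_cube(facts, dimensions):
    """Precompute SUM(sales) for every subset of dimensions (every cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            cuboid = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]
            cube[tuple(dimensions[i] for i in dims)] = dict(cuboid)
    return cube


def query(cube, **filters):
    """Answer SUM(sales) for equality filters with a single lookup."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


cube = build_cube(FACTS, DIMENSIONS)
total = query(cube)                                      # roll-up over everything
eu_sales = query(cube, region="EU")                      # slice on one dimension
us_widget = query(cube, region="US", product="widget")   # dice on two dimensions
```

Once the cube is built, each question is a dictionary lookup rather than a pass over the fact rows. The trade-off is that three dimensions already produce eight cuboids, which is why deciding what to precompute matters as dimensions grow.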

 

This approach has played a critical role in business intelligence analytics for years, especially with regard to big data. The data aggregation and precomputation that OLAP and OLAP Cubes enable have proven to be a great way to avoid the excessive processing times and slow query speeds that plague modern BI tools and complex big data infrastructures. If you're curious about what modern big data OLAP analytics looks like, check out our presentation and blog on the topic of augmented analytics.

 
Comparing OLAP Analytics to Big Data Approaches
 

Before we dive into why OLAP analytics is still so relevant, let’s agree on some basic laws of physics, mathematics, and business:

 

Law 1 - Memory is the most expensive storage

 

Memory is faster and getting cheaper, but it cannot compare to disk storage on cost and is still a relatively expensive and scarce resource, even with flash drives today.

 

Law 2 - Tiered storage makes sense

 

It doesn’t make financial sense to store all your data in the fastest and most expensive storage tier. There will always be a trade-off for different tiers of storage that have different performance/price ratios.

 

Law 3 - More data, slower scanning

 

As data volumes grow, SQL operations (joins, aggregates, unions) take longer and consume more CPU resources. On top of that, network throughput always has an upper limit.

 
 

Now, let’s see how different approaches try to address the pain of big data, and why they break down as data volumes grow.

 
In-Memory
 

Put intermediate results in memory in the hope that subsequent queries will reuse them. This rests on two points: 1) memory is fast, so queries will be fast, and 2) a “cache”-like mechanism can accelerate similar queries by reusing previous results. But if you agree with the basic laws above, you know that memory is limited.

 

Your intermediate results can get you an Out of Memory error very easily, and the cache gets stale quickly when queries are too specific and fluid. Then, you may ask, how about “materializing” some of those results in a less constraining space, like a disk? You would basically be doing some modeling and OLAP already in that way, but without the advantage of defining or changing that model.
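The point about caches going stale when queries are too specific can be sketched with a tiny result cache. This is an illustrative model only: ResultCache and fake_execute are hypothetical names, and the cache is keyed on the exact query text, which is an assumption about how a simple cache might behave rather than a description of any particular product.

```python
from collections import OrderedDict


class ResultCache:
    """A tiny LRU cache keyed on the exact query text (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def run(self, sql, execute):
        if sql in self.entries:
            self.entries.move_to_end(sql)
            self.hits += 1
            return self.entries[sql]
        self.misses += 1
        result = execute(sql)
        self.entries[sql] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return result


def fake_execute(sql):
    """Stand-in for an expensive scan; returns a dummy result."""
    return len(sql)


# A repeated dashboard query: every run after the first is a hit.
dashboard = ResultCache(capacity=2)
for _ in range(5):
    dashboard.run("SELECT region, SUM(sales) FROM t GROUP BY region", fake_execute)

# Ad hoc queries that each differ slightly: every run is a miss.
adhoc = ResultCache(capacity=2)
for day in range(1, 6):
    adhoc.run(f"SELECT SUM(sales) FROM t WHERE day = {day}", fake_execute)

print(dashboard.hits, dashboard.misses)  # hits vs. misses for the repeated query
print(adhoc.hits, adhoc.misses)          # hits vs. misses for the fluid queries
```

The fluid workload pays for the memory without ever getting a hit, and the bounded capacity means older results are evicted before anyone asks for them again.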

 


 
Massively Parallel Processing (MPP) Databases
 

Divide and conquer. Do things in parallel on smaller chunks of data. This is a very solid strategy, and how Hadoop works, but it’s limited by another set of basic laws. Concurrency suffers when data has to be moved around the system over the network, or when multiple users try to do separate things on the same system.

 

What’s worse is the potential waste of repeated scans of the same detailed raw dataset incurred by similar queries. You may ask, why not pre-scan and compile the results for those similar queries to reduce the need for additional full data scans? Bingo, you got OLAP!
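The waste from repeated scans can be sketched as follows. The detail table, the four partitions standing in for nodes, and the row counter are all hypothetical; the sketch only counts rows touched and does not model network or concurrency effects.

```python
from collections import defaultdict

# Hypothetical detail table of (region, amount) rows, split across 4 "nodes".
ROWS = [("EU" if i % 2 else "US", i) for i in range(1000)]
PARTITIONS = [ROWS[i::4] for i in range(4)]

stats = {"rows_scanned": 0}


def scan_partition(partition, region):
    """One node's share of the work: a full scan of its chunk."""
    stats["rows_scanned"] += len(partition)
    return sum(amount for r, amount in partition if r == region)


def mpp_query(region):
    """Divide and conquer: scan every partition, then combine the partial sums."""
    return sum(scan_partition(p, region) for p in PARTITIONS)


# Ten similar queries each rescan all of the detail rows.
answers = [mpp_query("EU") for _ in range(10)]
scanned_without_precompute = stats["rows_scanned"]

# Pre-scan once, keep the grouped totals, then answer by lookup.
stats["rows_scanned"] = 0
totals = defaultdict(int)
for partition in PARTITIONS:
    stats["rows_scanned"] += len(partition)
    for region, amount in partition:
        totals[region] += amount
lookups = [totals["EU"] for _ in range(10)]
scanned_with_precompute = stats["rows_scanned"]

print(scanned_without_precompute, scanned_with_precompute)
```

Both paths return the same answers, but the first touches the detail rows once per query while the second touches them once in total.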

 
Data Virtualization
 

Again, another hot idea. It’s ideal and could be your ultimate data platform, but to quote a CTO from a top New York bank, “none of them work.” The problem is that it’s not always possible to get only small aggregated datasets from here and there and combine them together on a virtualization layer. 

 

On top of that, moving massive amounts of data between storage systems results in prohibitively poor performance.

 
Cloud Data Warehouses
 

Extract data from your data lake and then load it to an RDBMS-based data warehouse for analysis. A lot of companies are doing this, especially those who had an RDBMS-based data warehouse before adopting Hadoop and other big data technologies.

 

The cost of extra ETL and maintenance is significant. To make matters worse, some teams end up with a spaghetti ETL process that also pulls data from the RDBMS back into the data lake to join it with other datasets that are too expensive to move to the RDBMS. This can become a nightmare for your data ops team.

 
The Future Is Augmented OLAP Analytics for Big Data
 

So, why is OLAP analytics still relevant in the era of big data? Just like SQL, it’s based on a sound theory that has been proven universal over time; the concept holds and can meet the increasingly demanding environment we now face. 

 

What people are really complaining about is the older generation of OLAP technologies such as Cognos and SSAS. Those complaints are valid: rigid manual modeling that requires heavy maintenance, size limits on data cubes, and a scale-up architecture that hit the wall a long time ago.

 

Apache Kylin OLAP Architecture 

 

However, open source technologies like Apache Kylin, and its commercial enterprise counterpart Kyligence, combine proven classic precomputation theory with new big data and AI technologies to create Augmented OLAP. This modern, extreme OLAP engine is still modeled, but created and maintained by an intelligent, AI-augmented engine. 

   
Augmented OLAP vs. pay per query approaches
 

Augmented OLAP is capable of virtually limitless scale-out to interactively query petabytes of data and hundreds of dimensions in a single data cube. The Kyligence AI-Augmented Engine enables fast, incremental cube building that leverages the underlying MPP architecture. To make things even better, Kyligence has introduced a Unified Semantic Layer that organizes models and metadata to support data governance and cell-level data security.
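The general idea behind incremental building can be sketched in a few lines: aggregate only the newly arrived batch and merge it into the existing totals instead of rebuilding from all of the history. This is a simplified illustration, not a description of how Kylin or Kyligence implement it; the batches and the aggregate_segment helper are hypothetical, and the merge works here because sums can be combined from partial results.

```python
from collections import Counter


def aggregate_segment(rows):
    """Aggregate one batch of (region, amount) rows into partial sums."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial


# Hypothetical daily batches of detail rows.
day1 = [("EU", 10), ("US", 5)]
day2 = [("EU", 7), ("APAC", 3)]

# Incremental build: aggregate only the new batch and merge it in.
cube = aggregate_segment(day1)
cube.update(aggregate_segment(day2))  # Counter.update adds to existing totals

# Full rebuild over all history, for comparison.
full_rebuild = aggregate_segment(day1 + day2)
print(cube == full_rebuild)
```

The incremental path only reads the second day's rows when they arrive, yet ends up with the same totals as the full rebuild.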

 

The other approaches I mentioned earlier all have their place in the big data ecosystem and serve some use cases very well, when data volumes are not extreme and the number of concurrent users is small.

 

One can consider those approaches (data virtualization, MPP databases, Cloud Data Warehouses) a pay per query or “pay as you go” proposition, and Augmented OLAP can be seen as a flat-rate plan. For any medium to large data volume and any user base of more than a dozen people in a company, the flat-rate plan actually costs less.

 
Why augmented OLAP costs less than pay per query approaches
 

With precomputed result sets in OLAP, you query the back-end data source once and store the results in multi-dimensional cubes. Without precomputation, if ten analysts each run the exact same expensive join at the same time, the source database will get hammered. With Kylin/Kyligence, the join is performed once and the results can then be served up via simple, inexpensive lookups. That’s not hype, it’s arithmetic.
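That arithmetic can be written out directly. The figures below are hypothetical and chosen only to make the comparison visible; the function names are likewise illustrative, and "rows touched" is a deliberately crude stand-in for real query cost.

```python
def source_rows_direct(analysts, join_rows):
    """Every analyst's query repeats the full join on the source."""
    return analysts * join_rows


def source_rows_precomputed(analysts, join_rows, lookup_rows):
    """The join runs once at build time; each query is a small lookup."""
    return join_rows + analysts * lookup_rows


# Hypothetical workload, chosen only to make the arithmetic visible.
ANALYSTS = 10
JOIN_ROWS = 1_000_000   # rows touched by one run of the expensive join
LOOKUP_ROWS = 100       # rows touched by one lookup against the cube

direct = source_rows_direct(ANALYSTS, JOIN_ROWS)
precomputed = source_rows_precomputed(ANALYSTS, JOIN_ROWS, LOOKUP_ROWS)
print(f"direct: {direct:,} rows, precomputed: {precomputed:,} rows")
```

Under these assumptions the direct path touches about ten times as many rows. With a single analyst the precomputed path is marginally more expensive, so the benefit depends on the same work being requested repeatedly.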

 

With your data only expected to grow, a pay-as-you-go approach will cost you much more thanks to increasing per-usage pricing, hidden fees (such as the cost of metrics discrepancies and delayed/misled decisions), and fine print (such as “performance is based on a single user on the whole cluster”).

 

None of this is to say that new solutions and approaches in the world of big data should always be avoided in favor of mature and tested technologies. Indeed, trying new things is often the only path to uncovering a better way to operate. But it’s worth keeping in mind that foundational principles and technologies like OLAP, even though they’ve been around a long time, can become new again given the right circumstances.

 

Experience What Extreme OLAP Can Offer

 

Are you ready to take the next step towards lightning-fast business intelligence with extreme OLAP analytics? Find out why the technology behind Apache Kylin and Kyligence is helping thousands of companies embrace OLAP technology for faster big data insights. We recommend continuing your OLAP-on-big-data education with this webinar: A High-Performance, High-Concurrency Architecture for Analytics on Azure.

 
