It doesn’t take long these days to find a blog or article proclaiming: OLAP is dead! Because there is unlimited scale in the cloud, so the argument goes, you no longer need to precompute results to get good performance. In fact, concepts associated with OLAP (Online Analytical Processing) such as star schema, multi-dimensional analysis, and data cubes are all painted with the broad brush of “yesterday’s analytics.”
Instead, we are promised a whole new world where everyone needs to build, maintain, and access a big data platform in their company: you don’t need to model data anymore, you can throw away ETL, your query will return instant results no matter where you store it, etc. They make it sound like magic. Ours is an industry that favors new stuff and we are never short on fresh things to learn and be excited about. It’s not cool to stick with anything “old school.”
Yet SQL persists. A recent Dataquest blog proclaimed, “Want a job in data? Learn SQL.” SQL - which first became an ANSI standard in 1986 (about a zillion years ago in the IT world) - is definitely not the cool kid that people would like to hang out with anymore.
I remember how excited some of my fellow engineers were when NoSQL became a thing many years ago. Soon after, people started saying that it was not “No SQL,” it was “Not Only SQL,” and then there came “NewSQL.”
After passing the so-called “Peak of Inflated Expectations” phase on Gartner’s Hype Cycle, people finally realized how long-lived a theoretically sound, classic approach can be, and how it can continue providing the best value to its users. Nowadays, no data platform technology can call itself complete unless it provides a SQL interface that’s as fully standardized as possible (I’m looking at you, Druid).
The Current State of Information Management
History doesn’t repeat, but it rhymes. This same thing happened to the concept of the “Data Lake,” which promised vast data collection capabilities to meet all of your data needs. It was simple: just stop by whenever you wanted and bring your collection bottle to fill with some data water to satisfy your thirst for insights. It turns out that this unfiltered water makes people sick, and the volume of the lake drowns people who don’t know how to swim in it.
An unmodeled and ungoverned data platform is hazardous to everyone. Data Lakes have their place, don’t get me wrong, but they don’t deliver the possibilities big data promises by themselves. To deliver on those promises, we need to overcome several obstacles, one of which is to ensure we’re getting true value out of this ocean of data, simply and quickly.
Evaluating Other Big Data Analytics Tools and Techniques
Most people see and experience similar pain when dealing with big data. This can include slow queries on big volumes of data, low concurrency due to competition over scarce system resources, and heterogeneous systems that have to be maintained to accommodate different data needs, both legacy and new.
People come up with different approaches to address this pain. The tendency to gravitate towards what’s new and cool in technology manifests in some of these approaches. People look down on “old” things like OLAP, OLAP Cubes, and multi-dimensional analysis, thinking they are dead like SQL was when MapReduce debuted. But is new necessarily better? How much MapReduce code is being written these days? Spark has all but killed MapReduce - with Hadoop itself teetering on the precipice. But Spark has not killed SQL, which may turn out to be the stickiest technology out there.
What Is OLAP?
If you're relatively new to the big data and business intelligence field, you may be unfamiliar with the term OLAP, and it might be helpful to define it here to provide some context for the rest of this article.
OLAP (short for OnLine Analytical Processing) is an approach designed to quickly answer analytics queries involving multiple dimensions. It does this by rolling up large, sometimes separate datasets into a multidimensional database known as an OLAP Cube. This OLAP Cube is optimized for easy analysis and enables the "slicing and dicing" of data from different viewpoints for a streamlined query experience.
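To make the idea of rolling data up into a cube concrete, here is a minimal sketch in Python. It is not how Kylin or any particular engine is implemented; the rows, the dimension names, and the `build_cube` and `query` helpers are all invented for illustration. It precomputes a sum for every combination of dimensions, so that "slicing and dicing" becomes a dictionary lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, year, sales). Invented for illustration.
ROWS = [
    ("EU", "widget", 2023, 100),
    ("EU", "gadget", 2023, 50),
    ("US", "widget", 2023, 70),
    ("US", "widget", 2024, 30),
]
DIMENSIONS = ("region", "product", "year")


def build_cube(rows, dimensions):
    """Precompute SUM(sales) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            positions = [dimensions.index(d) for d in dims]
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[p] for p in positions)
                totals[key] += row[-1]  # the last column is the measure
            cube[dims] = dict(totals)
    return cube


def query(cube, **filters):
    """Answer an aggregate query with a lookup; the raw rows are never rescanned."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


CUBE = build_cube(ROWS, DIMENSIONS)
print(query(CUBE, region="EU"))                  # 150
print(query(CUBE, product="widget", year=2023))  # 170
print(query(CUBE))                               # 250 (grand total)
```

Three dimensions give eight precomputed groupings here. The number of groupings doubles with every added dimension, which is why real engines have to be selective about which combinations they build.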
This approach has played a critical role in business intelligence analytics for years, especially with regard to big data. The data aggregation and precomputation that OLAP and OLAP Cubes enable have proven to be a great way to avoid the excessive processing times and slow query speeds that plague modern BI tools and complex big data infrastructures. If you're curious about what modern big data OLAP analytics looks like, check out our presentation and blog on the topic of augmented analytics.
Comparing OLAP Analytics to Other Modern Approaches
Before we dive into why OLAP analytics is still so relevant, let’s agree on some basic laws of physics, mathematics, and business:
Law 1 - Memory is the most expensive storage
Memory is faster and getting cheaper, but it cannot compare to disk storage on cost and is still a relatively expensive and scarce resource, even with flash drives today.
Law 2 - Tiered storage makes sense
It doesn’t make financial sense to store all your data in the fastest and most expensive storage tier. There will always be a trade-off for different tiers of storage that have different performance/price ratios.
Law 3 - More data, slower scanning
As data volumes grow, SQL operations (joins, aggregates, unions) take longer and consume more CPU resources. On top of that, network throughput always has an upper limit.
Now, let’s see how different approaches try to address the pain of big data, and why they break down as data volumes grow.
In-Memory Processing
Put intermediate results in memory with the hope that subsequent queries will reuse them. This is based on two points: 1) memory is fast, so queries will be fast, and 2) a “cache”-like mechanism can accelerate similar queries by reusing previous results. But if you agree with the basic laws above, you know that memory is limited.
Your intermediate results can get you an Out of Memory error very easily, and the cache gets stale quickly when queries are too specific and fluid. Then, you may ask, how about “materializing” some of those results in a less constraining space, like a disk? You would basically be doing modeling and OLAP already, but without the ability to define or change that model.
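Here is a rough sketch of why a memory-bound result cache helps repetitive queries but not fluid ones. The `ResultCache` class, the capacity of three entries, and the `fake_scan` stand-in for an expensive query are all assumptions made for this example, not part of any real product.

```python
from collections import OrderedDict


class ResultCache:
    """A tiny least-recently-used cache of query results with a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity  # stands in for limited memory
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query_text, compute):
        if query_text in self.entries:
            self.entries.move_to_end(query_text)  # mark as recently used
            self.hits += 1
            return self.entries[query_text]
        self.misses += 1
        result = compute(query_text)
        self.entries[query_text] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return result


def fake_scan(query_text):
    """Placeholder for an expensive scan; the return value is irrelevant here."""
    return len(query_text)


def run_workload(queries, capacity=3):
    """Run the queries through a fresh cache and return (hits, misses)."""
    cache = ResultCache(capacity)
    for q in queries:
        cache.get_or_compute(q, fake_scan)
    return cache.hits, cache.misses


repeated = ["q1", "q2", "q1", "q2", "q1", "q2"]
fluid = [f"q{i}" for i in range(6)]  # every query is different
print(run_workload(repeated))  # (4, 2)
print(run_workload(fluid))     # (0, 6)
```

With the repetitive workload most queries are served from memory. With the fluid workload every query misses, and the cache only ever holds the last three results.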
Massively Parallel Processing (MPP) Databases
Divide and conquer. Do things in parallel on smaller chunks of data. This is a very solid strategy, and how Hadoop works, but it’s limited by another set of basic laws. Concurrency suffers when data has to be moved around the system over the network, or when multiple users try to do separate things on the same system.
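The divide-and-conquer idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: threads stand in for worker nodes, the rows are invented, and `partition`, `partial_sum`, and `parallel_sum` are hypothetical helper names. Each worker aggregates its own chunk, and the partial results are then merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (region, amount) rows, invented for illustration.
ROWS = [("EU", 100), ("US", 70), ("EU", 50), ("US", 30), ("APAC", 20), ("EU", 10)]


def partition(rows, n_parts):
    """Split the rows round-robin into n_parts chunks."""
    return [rows[i::n_parts] for i in range(n_parts)]


def partial_sum(chunk):
    """The work one 'node' does: aggregate only its own chunk."""
    totals = Counter()
    for region, amount in chunk:
        totals[region] += amount
    return totals


def parallel_sum(rows, n_parts=3):
    """Aggregate the chunks in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_sum, partition(rows, n_parts)))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


totals = parallel_sum(ROWS)
print(totals["EU"], totals["US"], totals["APAC"])  # 160 100 20
```

In a real cluster, the merge step is where partial results cross the network, and every new query repeats the whole scan. Those are the two limits described above.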
What’s worse is the potential waste of repeated scans of the same detailed raw dataset incurred by similar queries. You may ask, why not pre-scan and compile the results for those similar queries to reduce the need for additional full data scans? Bingo, you got OLAP!
Data Virtualization
Data virtualization is another hot idea. In theory it’s ideal and could be your ultimate data platform, but to quote a CTO from a top New York bank, “none of them work.” The problem is that it’s not always possible to pull only small, aggregated datasets from here and there and combine them on a virtualization layer.
On top of that, moving massive amounts of data between data stores results in prohibitively slow performance.
Cloud Data Warehouses
Extract data from your data lake and then load it to an RDBMS-based data warehouse for analysis. A lot of companies are doing this, especially those who had an RDBMS-based data warehouse before adopting Hadoop and other big data technologies.
The cost of extra ETL and maintenance is significant, and to make matters worse, some create a spaghetti ETL process that will also get data from RDBMS back to the data lake to join to other datasets that are too expensive to move to RDBMS. This can become a nightmare for your data ops team.
The Future Is Augmented OLAP Analytics for Big Data
So, why is OLAP analytics still relevant in the era of big data? Just like SQL, it’s based on a sound theory that has been proven universal over time; the concept holds and can meet the increasingly demanding environment we now face.
What people are really complaining about is the older generation of OLAP technologies such as Cognos and SSAS. Those complaints are valid: rigid manual modeling that requires heavy maintenance, size limits on data cubes, and a scale-up architecture that hit the wall a long time ago.
However, open source technologies like Apache Kylin, and its commercial enterprise counterpart Kyligence, combine proven classic precomputation theory with new big data and AI technologies to create Augmented OLAP. This modern, extreme OLAP approach still relies on a model, but the model is created and maintained by an intelligent, AI-augmented engine.
Augmented OLAP is capable of virtually limitless scale-out to interactively query petabytes of data and hundreds of dimensions in a single data cube. The Kyligence AI-Augmented Engine enables fast, incremental cube building that leverages the underlying MPP architecture. To make things even better, Kyligence has introduced a Unified Semantic Layer that organizes models and metadata to support data governance and cell-level data security.
The other approaches I mentioned earlier all have their place in the big data ecosystem and serve some use cases very well, as long as data volumes are not extreme and the number of concurrent users is small.
One can consider those approaches (data virtualization, MPP databases, Cloud Data Warehouses) a pay-per-query or “pay as you go” proposition, while Augmented OLAP can be seen as a flat-rate plan. For any medium to large data volume and any user base of more than a dozen people in a company, the flat-rate plan actually costs less.
Why is that? It is because with precomputed result sets in OLAP, you query the back-end data source once and store the results in multi-dimensional cubes. So, for example, if ten analysts each run the exact same expensive join at the same time, the source database will get hammered. With Kylin/Kyligence, the join is performed once and the results can then be served up via simple, inexpensive lookups. That’s not hype, it’s arithmetic.
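The ten-analyst arithmetic is easy to check with a toy model. The sketch below is not the Kylin or Kyligence API; the `Source` class, the tables, and the helper functions are hypothetical. It simply counts how many times the expensive join runs against the source in each scenario.

```python
# Hypothetical tables, invented for illustration.
ORDERS = [(1, "c1", 100), (2, "c2", 50), (3, "c1", 70)]  # (order_id, customer_id, amount)
CUSTOMERS = {"c1": "EU", "c2": "US"}                     # customer_id -> region


class Source:
    """Stand-in for the back-end database; it counts how often the join runs."""

    def __init__(self, orders, customers):
        self.orders = orders
        self.customers = customers
        self.join_runs = 0

    def sales_by_region(self):
        """The 'expensive' query: join orders to customers, then aggregate."""
        self.join_runs += 1
        totals = {}
        for _, customer_id, amount in self.orders:
            region = self.customers[customer_id]
            totals[region] = totals.get(region, 0) + amount
        return totals


def query_source_directly(source, analysts):
    """Every analyst runs the join against the source."""
    return [source.sales_by_region() for _ in range(analysts)]


def query_precomputed(source, analysts):
    """The join runs once; every analyst is served from the stored result."""
    stored = source.sales_by_region()
    return [dict(stored) for _ in range(analysts)]


direct_source = Source(ORDERS, CUSTOMERS)
direct_results = query_source_directly(direct_source, analysts=10)

olap_source = Source(ORDERS, CUSTOMERS)
olap_results = query_precomputed(olap_source, analysts=10)

print(direct_source.join_runs, olap_source.join_runs)  # 10 1
print(direct_results == olap_results)                  # True
```

Both paths return the same answers; the only difference is how much work lands on the source.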
With your data only expected to grow, a pay-as-you-go approach will cost you much more thanks to increasing per-usage pricing, hidden fees (such as the cost of metrics discrepancies and delayed/misled decisions), and fine print (such as “performance is based on a single user on the whole cluster”).
None of this is to say that new solutions and approaches in the world of big data should always be avoided in favor of mature and tested technologies. Indeed, trying new things is often the only path to uncovering a better way to operate. But it’s worth keeping in mind that foundational principles and technologies like OLAP, even though they’ve been around a long time, can become new again given the right circumstances.
Experience What Extreme OLAP Can Offer
Are you ready to take the next step towards lightning-fast business intelligence with extreme OLAP analytics? Find out why the technology behind Apache Kylin and Kyligence is helping thousands of companies embrace OLAP technology for faster big data insights. We recommend continuing your OLAP-on-big-data education with this webinar: A High-Performance, High-Concurrency Architecture for Analytics on Azure.