Recently, I’ve read several interesting posts online talking about how old the Online Analytical Processing (OLAP) concept is and how dead Star Schema, multi-dimensional analysis, and Data Cubes are. Instead, they promise a whole new world to everyone needing to build, maintain, and access a Big Data platform in their company: you don’t need to model data anymore, you can throw away ETL, your queries will return instant results no matter where your data lives, and so on. They make it sound like magic.
IT is an industry that favors new stuff and is never short on fresh things to learn and be excited about. It’s not cool to stick with anything “old school”. Yet recently, a post has gone viral in the data community talking about how “SQL is the best skill I’ve learned in my 20+ year career”. SQL, which first became an ANSI standard in 1986 (about a zillion years ago in the IT world), is definitely not the cool kid that people would like to hang out with anymore.
I remember how excited some of my fellow engineers were when NoSQL became a thing many years ago. Soon after, people started saying that it was not “No SQL”, it was “Not Only SQL”, and then there came “NewSQL”.
After passing the so-called “Peak of Inflated Expectations” phase on Gartner’s Hype Cycle, people finally realized how long-lived a theoretically sound classic approach can be, and how it can continue providing the best value to its users. Nowadays, any data platform technology cannot call itself complete if it cannot provide a SQL interface that’s as fully standardized as possible (I’m looking at you, Druid).
The Current State of Information Management
History doesn’t repeat itself, but it rhymes. The same thing happened to the concept of the “Data Lake”, which promised vast data collection capabilities to meet all of your data needs. It was simple: just come by whenever you wanted, bringing your collection bottle to fill with some data water to satisfy your thirst for insights. It turns out that this unfiltered water makes people sick, and the sheer volume of the lake drowns those who don’t know how to swim in it.
An unmodeled and ungoverned data platform is hazardous to everyone. Data Lakes have their place, don’t get me wrong, but they don’t deliver the possibilities Big Data promises by themselves. To deliver on those promises, we need to overcome several obstacles, one of which is to ensure we’re getting true value out of this ocean of data, simply and quickly.
Evaluating Other Big Data Analytics Tools and Techniques
Most people see and experience similar pain when dealing with Big Data. This can include slow queries on big volumes of data, low concurrency due to competition over scarce system resources, and heterogeneous systems that have to be maintained to accommodate different data needs, both legacy and new.
People come up with different approaches to address this pain. The tendency to gravitate towards what’s “new” and “cool” in the IT industry manifests in some of these approaches. People look down on “old” things like OLAP, OLAP Cubes, and multi-dimensional analysis, thinking they are dead like SQL was when MapReduce debuted. But is new necessarily better?
What is OLAP (OnLine Analytical Processing)?
If you’re relatively new to the Big Data and business intelligence field, you may be unfamiliar with the term OLAP, and it might be helpful to define it here to provide some context for the rest of this article.
OLAP (short for OnLine Analytical Processing) is an approach designed to quickly answer analytics queries involving multiple dimensions. It does this by rolling up large, sometimes separate, datasets into a multidimensional database known as an OLAP Cube. This OLAP Cube is optimized for easy analysis and enables the “slicing and dicing” of data from different viewpoints for a streamlined query experience.
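The mechanics behind “slicing and dicing” are simple enough to sketch in a few lines. Below is a minimal, purely illustrative cube in plain Python (the fact table, dimension names, and numbers are all made up): every aggregate over every combination of dimensions is computed once up front, so a later slice is a dictionary lookup instead of a table scan.

```python
from itertools import combinations
from collections import defaultdict

# Toy fact table: (region, product, year, sales) -- hypothetical data.
facts = [
    ("EU", "laptop", 2023, 100),
    ("EU", "phone",  2023, 150),
    ("US", "laptop", 2023, 200),
    ("US", "laptop", 2024, 250),
]
DIMS = ("region", "product", "year")

# Build the "cube": pre-aggregate sales for every subset of dimensions.
cube = defaultdict(int)
for region, product, year, sales in facts:
    row = dict(zip(DIMS, (region, product, year)))
    for r in range(len(DIMS) + 1):
        for dims in combinations(DIMS, r):
            key = tuple(sorted((d, row[d]) for d in dims))
            cube[key] += sales

def slice_query(**filters):
    """Answer an aggregate query with a single lookup -- no scan."""
    return cube[tuple(sorted(filters.items()))]

print(slice_query(region="US"))                  # 450
print(slice_query(product="laptop", year=2023))  # 300
print(slice_query())                             # grand total: 700
```

A real OLAP engine adds hierarchies, multiple measures, and incremental builds on top of this same idea: pay the aggregation cost once, then answer many queries cheaply.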
This approach has played a critical role in business intelligence analytics for years, especially with regard to Big Data. The data aggregation and pre-calculation that OLAP and OLAP Cubes enable have proven to be a great way to avoid the excessive processing times and slow query speeds that plague modern BI tools and complex Big Data infrastructures. If you’re curious about what modern Big Data OLAP analytics looks like, check out our latest presentation on the topic:
Comparing OLAP Analytics to Other Modern Approaches
Before we dive into why OLAP analytics is still so relevant, let’s agree on some basic laws of physics, mathematics, and business:
- Memory is fast and getting cheaper, but in capacity and cost per byte it still cannot compare to disk storage; it remains a relatively expensive and scarce resource, even with today’s flash drives.
- There will always be trade-offs between tiers of storage with different performance/price ratios. It never makes business sense to use only the fastest (read: most expensive) tier everywhere.
- Scanning more data will take a longer time/more resources in total.
- Network throughput always has an upper limit.
- A product grows multiplicatively: when its multipliers grow, or a new multiplier is added, the whole product scales accordingly.
- Like a gas expanding to fill its container, data will fill up the whole cluster, no matter how big that cluster is.
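The multiplier point is where cube-size pressure comes from: the number of potential cells in a full cube is the product of the dimension cardinalities, so each added dimension multiplies the total rather than adding to it. A quick sketch with made-up cardinalities:

```python
import math

# Hypothetical per-dimension cardinalities (distinct values in each).
cardinalities = {"region": 10, "product": 1_000, "day": 365, "customer": 50_000}

# Potential cells in the full cube = product of all cardinalities.
cells = math.prod(cardinalities.values())
print(f"{cells:,} potential cells")  # 182,500,000,000 potential cells

# Adding one more 100-value dimension multiplies the total by 100,
# it does not merely add 100 cells.
print(f"{cells * 100:,}")
```

This is exactly why the older generation of cube technologies hit size limits, and why smarter engines only materialize the combinations that queries actually need.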
Now, let’s see how different approaches try to address the pain of Big Data, and why they don’t work:
- In-Memory: Put intermediate results in memory in the hope that subsequent queries will reuse them. This rests on two assumptions: 1) memory is fast, so queries will be fast, and 2) a cache-like mechanism can accelerate similar queries by reusing previous results. But as the basic laws above remind us, memory is limited. Your intermediate results can trigger an Out of Memory error very easily, and the cache goes stale quickly when queries are too specific and fluid. Then, you may ask, how about “materializing” some of those results in a less constrained space, like disk? At that point you are essentially doing modeling and OLAP already, just without the ability to define or evolve the model deliberately.
- Massively Parallel Processing (MPP): Divide and conquer: do things in parallel on smaller chunks of data. This is a very solid strategy, and it’s how Hadoop works, but it’s bounded by another set of the basic laws. Concurrency suffers when data has to be moved around the system over the network, or when multiple users try to do separate things on the same system. What’s worse is the waste from repeated scans of the same detailed raw dataset incurred by similar queries. You may ask: why not pre-scan and compile the results for those similar queries to reduce the need for additional full data scans? Bingo, you’ve got OLAP!
- Data Virtualization: Another hot idea. It’s appealing in theory and could be your ultimate data platform, but to quote a CTO from a top New York bank, “none of them work”. The problem is that it’s not always possible to pull only small aggregated datasets from here and there and combine them in a virtualization layer. On top of that, moving massive amounts of data between data stores would be prohibitively slow.
- Heterogeneous Systems: Extract data from your data lake and then load it into an RDBMS-based data warehouse for analysis. A lot of companies do this, especially those that had an RDBMS-based data warehouse before adopting Hadoop and other Big Data technologies. The cost of the extra ETL and maintenance is significant, and to make matters worse, some create a spaghetti ETL process that also moves data from the RDBMS back to the data lake to join with other datasets that are too expensive to move into the RDBMS. This can become a nightmare for your DataOps team.
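The back-of-the-envelope arithmetic behind the “pre-scan and compile” suggestion in the MPP point above is straightforward. With made-up numbers: if similar aggregate queries keep re-scanning the same raw table, the total row-level work is queries times rows; pre-aggregating once reduces it to a single pass plus cheap lookups.

```python
rows = 1_000_000_000     # raw fact rows (hypothetical)
queries_per_day = 500    # similar aggregate queries hitting that table

# Without pre-aggregation: every query re-scans the raw table.
scan_every_time = queries_per_day * rows

# With pre-aggregation: one build pass, then each query is a lookup.
pre_aggregated = rows + queries_per_day

print(f"{scan_every_time / pre_aggregated:.0f}x less row-level work")
```

The exact ratio depends on the workload, of course, but the shape of the trade-off is what matters: the build cost is paid once, while the scan cost is paid on every query.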
Besides the individual weaknesses of these approaches, there’s a common challenge they all share: data governance. Some of these approaches promise easy adoption and no modeling needed, but things can soon get out of control if users define their own queries and metrics with no one to govern the logic and version of the truth.
In a very ad-hoc startup environment where one or two “data servants” are feeding reports to the CEO and other executives, such easy ramp-up and flexibility provide a lot of value in business agility. However, once there are multiple business functions and more than a handful of data analysts, the cost of having no data governance can be fatal to the business.
The Future is Extreme OLAP Analytics for Big Data
So, why is OLAP analytics still relevant in the era of Big Data? Just like SQL, it’s based on a sound theory that has proven universal over time; the concept holds even as implementation needs grow. What people are really complaining about is the older generation of OLAP technologies such as Cognos. Those complaints are valid: rigid manual modeling that requires heavy maintenance, size limits on data cubes that force compromises on use cases, scale-up architectures that hit a wall a long time ago, and so on.
However, open source technologies like Apache Kylin, and its commercial enterprise counterpart Kyligence, combine this proven classic theory with new Big Data and AI technologies to create an OLAP v2.0. This modern extreme OLAP engine still relies on modeling, but the modeling is built and maintained with the help of AI. It’s capable of virtually limitless petabyte-level scale-out, hundreds of dimensions in a single data cube, and fast, incremental cube building that leverages the underlying Big Data MPP systems. To make things even better, Kyligence introduces a unified semantic layer that organizes models and metadata to support data governance and cell-level data security.
As for the other approaches I mentioned earlier, they all have their place in the Big Data ecosystem and serve some use cases very well, especially in a highly undefined scenario that is ramping up with a fairly small dataset. Think of those approaches as “pay as you go” and OLAP as a flat-rate plan. For any medium-to-large data volume and any user base of more than a dozen data consumers in a company, the flat-rate plan actually costs less.
With your data only expected to grow, a pay-as-you-go approach will cost you much more thanks to increasing per-usage pricing, hidden fees (such as the cost of metrics discrepancies and delayed/misled decisions), and fine print (such as “performance is based on a single user on the whole cluster”).
None of this is to say that new or “hyped” solutions and approaches in the world of Big Data should always be avoided in favor of mature and tested technologies. Indeed, trying new things is often the only path to uncovering a better way to operate. But it’s worth keeping in mind that everything old, given the right circumstances, can become new again.
Experience What Next-Generation Extreme OLAP Analytics Can Offer
Are you ready to take the next step towards lightning-fast business intelligence with extreme OLAP analytics? Find out why the technology behind Apache Kylin and Kyligence is helping thousands of companies embrace OLAP technology for faster Big Data insights. We recommend continuing your OLAP on Big Data education with this video: