Snowflake: The Good, the Bad, and the Beautiful for Interactive Analytics

Author
Kaige Liu
Sr. Solutions Architect, Kyligence
Sep. 03, 2020

Many thanks to Glenn Platkin, Head of US Enterprise Sales in the East, for his contributions to this article.


For most enterprises, the rapid growth of data volumes has left business analysts struggling to glean impactful insights on these new data riches. The limited scale, slowing response times, and increasing latencies of existing BI and data warehousing approaches led to the rise of so-called “big data” architectures that were designed from the ground up for the volume, variety, and velocity of these new datasets. Many of these technologies have been developed with the unlimited scale of the cloud in mind.

One of the most successful of these new generation technologies is the Snowflake cloud data warehouse. At a time when enterprise data warehouse titans Oracle and Teradata began to look expensive, ponderous, and tied to the data center, Snowflake struck the right balance between data warehousing capabilities and the cloud. 

Snowflake features an MPP (massively parallel processing) architecture similar to MemSQL, Vertica, and several others.


The Good

The cloud paradigm makes getting started with Snowflake fairly simple. You fill out n online form, provide your credit card, and load in your data. Much of the inherent complexity is hidden – your applications and BI access the data, and your business users are off and running. It is secure, flexible, and requires less management than traditional data warehouse solutions. As a data warehouse solution, it provides more immediacy, autonomy, and control. With a market valuation of over $12B USD, clearly, they are doing something right.

We’ll leave it to Snowflake’s website to provide a ton more information on The Good. So, now let’s look at some of the drawbacks to the Snowflake approach.


The Bad

Snowflake’s incredible rise has clearly been a response to a technology sector that has long been ripe for disruption: the enterprise data warehouse (EDW). For Oracle or Teradata warehouses, IT buyers have typically figured on a minimum $1M allocated for each new warehouse, with the corresponding support costs leaving many reeling.

Snowflake has succeeded in disrupting EDW by delivering on the promise of the cloud.

The promise of Snowflake: 

  • Data analysts can load up lots of data and get to work
  • Historical data will be available without limit 
  • You can use the BI tool of your choice 
  • Your analytics will not impact production because it is now totally isolated 

The reality of Snowflake: 

Large enterprises that use Snowflake for their big data analytics eventually run into performance issues that have also plagued their traditional on-premise counterparts. 

  • Limitations of scale when their data surpasses the typical load Snowflake is built to handle
  • Limitations on the number of concurrent users hitting the data
  • Performance issues associated with distributed MPP architecture: data skew, network latency, complexity

The blessings of a cloud offering are somewhat offset by a new set of challenges. For instance, the benefit of hiding complexity in cloud environments makes the inner workings opaque to those hoping to tune and optimize their instances. So, for issues of performance and scale, the only thing you can do to remedy these issues is to add more compute, or you have to have your DBAs limit, subset, aggregate, or index the data. 

Of course, by adding more compute and more data, you will incur higher monthly costs. So, what tends to happen is that large enterprises with large amounts of data with high concurrency and complexity don’t get the performance they need, or they pay a premium to get it.

It is a little unfair to call the Snowflake solution “ugly.” It’s a great product. But you must be aware of potential pitfalls. Here are the top three most common issues:

Many users interacting with petabytes of data

Snowflake is not purpose-built for interactive querying of many concurrent users on extremely large datasets. This will be an issu­­e for your analysts if they:

  • Wish to interact with millions, billions, or trillions of rows of detail
  • Have hundreds, thousands, or even millions of analytics users
  • Wish to get detailed intelligence or AI-assisted guidance to improve performance

Limited options to tune and scale

There are limited remedies for the above problem on Snowflake. The only options available within Snowflake are to use more compute or limit your data, which will impact your performance, limit the scope of your analytics, or increase your cost.

Potential cost overruns

If you do nothing to address these shortcomings, your monthly statement will be the consequence of that inaction.  

The pay-per-use model looks great on the surface, but what you end up doing is paying for storage, compute, and users. That’s not inherently a bad thing, but there are a few problems with this setup for Snowflake.  

  • You can’t bring your own cloud. Snowflake provides you with a cloud environment with configuration, which is great for getting set up quickly, but because you can’t choose your cloud you don’t end up with the best price for your service. 
  • If you want anything above standard support i.e. dedicated hardware, dedicated compute nodes, faster results, bigger data, higher availability – you’re going to wind up paying a premium. 

Snowflake’s purpose is to let you run your queries. They do not have a financial incentive to make sure your queries run as fast as possible.  

You will end up with out-of-control costs and mediocre results for big data and high performance. Not only will you have unhappy users, but you will also have an unhappy CFO because he’s not getting the results that he needs to make decisions about the business and he’s paying a lot of money for this service.

Happily, there is a solution available to address these challenges.


The Solution

A few years ago, eBay also grappled with this exact scenario. They had a massive Teradata environment, so doing a forklift upgrade to bring in more compute was not an option. They had many petabytes of detail data and needed to be able to have many, many users run complex queries with high concurrency via a variety of analytics and reporting tools. They also required high performance with sub-second response times.

How eBay Responded to the Limitations on Their Big Data Analytics

First, eBay looked at the commercially available options including Greenplum, Netezza, Vertica, Postgres, and some others – these were the only candidates at the time that could have potentially handled the scale of data eBay was working with. But through their search for a solution, they found that none of these options could scale, none could perform, and all of them were too complex to implement and maintain.

The only remaining option was to build. 

The tool they needed had not yet been invented, so they set up a team to create something new that built on existing technologies. First, they looked at OLAP (online analytical processing) and cube technology. The problem with OLAP cubes initially was that they could tax compute and storage. But as compute and storage costs plummeted, that issue more or less went away.

What the engineers at eBay discovered was the power of combining the best parts of this OLAP technology with big data technologies like Hadoop. Together, they used OLAP cube and precompute technology on Hadoop to provide:

  • Ultra-high performance with sub-second response times regardless of data volumes
  • High concurrency with predictable performance and virtually unlimited scale
  • Users with the ability to slice and dice data using any BI tool
  • Ease of implementation and management
  • A cost-effective solution
Apache Kylin timeline
Figure 1 Timeline of Apache Kylin’s Development

The original team that founded the Apache Kylin project at eBay created a company and a commercial product called Kyligence. Kyligence team members are still the most active contributors and committers to the Apache Kylin project.


The Big Data Analytics Solution for Snowflake Users

Using Kylin as its base, Kyligence has added a tremendous amount of innovation to the front end, back end, and everything in between to create an enterprise-ready version of this software for on-premises or any cloud environment. 

Kyligence is to Kylin what Elastic is to Elastic Search, Confluent is to Kafka, Databricks is to Spark, and Cloudera is for Hadoop. 

The big data analytics issues raised in this article are not specific to one industry, these are challenges encountered by enterprises across the board regardless of if you’re in financial services, do online auctions, or are a social media or fast food giant – you have the same problem.

You need a lot of data. You have a lot of users. You need a lot of performance. This is the problem that Kyligence solves. 


Kyligence Architecture

Kyligence architecture
Figure 2 Kyligence Architecture Diagram

Bring Your Own Cloud

Kyligence offers a bring your own cloud environment. We support any cloud environment or any on-prem environment running Hadoop or Spark. 

Bring Your Own Tool

We support both SQL and MDX so, yes, you can use Excel to build pivot tables with billions of rows of data. We also partner with Power BI, Tableau, MicroStrategy, and many others, and provide OBDC, JDBC, and REST API. We can bring in any data source (to see how this works with Snowflake, see this demo – timestamp: 23:08). 

Unified Semantic Layer

We start off with cube technology and then provide a unified semantic layer that hides all of the complexity of BI back end servers, data tables, columns, SQL joins, and calculations from the users, instead providing them with business terminology defining hierarchies and dimensionality. 

This Unified Semantic Layer is shared across all of your applications so there’s no need to redefine your data models; you just drag and drop what you want to create your dashboards and reports. 

There is also an AI front end for easy implementation.

Purpose-Built for High Performance and Concurrency

One of our biggest use cases has one hundred thousand concurrent users going against three trillion rows of data and getting sub-second response times. With any traditional data source, like Snowflake or Oracle, the more data you put into the system, the more users you put onto the system, the higher your response times will be. With precomputation and precompute technology, such as Kyligence, your response times will remain flat and stable across the board regardless of the number of users and the amount of data.

online calc vs precomputation
Figure 3 Online Calculation vs. Precomputation Graph

AI-Augmented Precomputation and Query Layer

We provide an AI-augmented data cloud that uses existing SQL queries to build your models automatically. The system takes query history into account and aids in the precomputation of “aggregate indexes – which function like an OLAP cube. Note: for convenience, we’ll use “cube” to denote aggregate indexes for the rest of this paper.

Thus, Kyligence is constantly working to improve SQL and MDX query performance regardless of the size of your data sets. This is what enables interactive, sub-second response times for complex queries even with thousands of concurrent users.

Kyligence also marries OLAP with relational OLAP (ROLAP), which means that when you ask a question, the system knows if the data is sitting in a cube or not in a cube. If the data is not in a cube, Kyligence will add the needed additional data to the cube. It is also smart enough to then go and modify the model with that new information so the next query will run sub-second. 

What this means for your analysts: they are now able to interactively work with billions of rows of data. 

Security

Kyligence provides enterprise-grade security through its Unified Semantic Layer and can integrate with all major security platforms such as LDAP, Active Directory, SAML, Kerberos, etc. Within our products, we also provide project-level, table-level, and cell-level security. 


Kyligence + Snowflake

If you currently use Snowflake and can relate to the challenges described in this article, Kyligence can take the pain out of your analytics experience by speeding up your queries, unifying your data models and business definitions, and increasing your concurrency. 

When using Kyligence with Snowflake, your data remains the same – it is stored in the same places and loaded the same way, so you don’t need to make any changes to your existing ETL processes or any changes to the defined databases or tables in Snowflake. 

The only thing you need to do is switch the entry for queries from Snowflake to Kyligence.

Fast Query Performance

We add Kyligence on top of Snowflake as an acceleration and analytical layer. Adding this complexity brings many benefits to your existing solution. Most importantly, it brings extremely fast query performance – both Kyligence and Apache Kylin are well-known for sub-second queries. It takes a long time to optimize query responses running directly on Snowflake, but these can be processed by Kyligence and sped up to sub-second. 

High Concurrency

We bring extremely high levels of concurrency. Kyligence greatly reduces the query overhead through precomputation, which helps Snowflake users get better concurrency.

Unified Business Definitions Throughout Your Organization

In addition to performance, Kyligence also provides customers with the ability to build a unified semantic layer. This means you don’t need to repeatedly define the same models and semantics in different BI tools. Kyligence can also enable different business lines to use the same definitions to ensure the accuracy and consistency of the data. 

Integration with Your BI Tools

Kyligence’s ecosystem allows you to integrate with a variety of mainstream BI tools, for example, you can even directly use an Excel pivot table to analyze the data in Snowflake. We can support most common BI tools like Tableau, Excel, MicroStrategy, PowerBI, and so on.  

Cost-Effective Solution with Fast & Easy Deployment

Shifting your query load to Kyligence will greatly reduce your query costs in terms of time, resources, and manpower. You can quickly deploy a Kyligence environment on all cloud platforms supported by Snowflake with just a few clicks. 

Kyligence can directly synchronize with the tables and databases you’ve created in Snowflake so you don’t need to recreate them, you just need to switch the query from Snowflake to Kyligence and all the queries will be captured by Kyligence. The entire process is transparent and will not impact your business beyond freeing your analysts to move faster and do more with your data. 

The Beautiful

While we have discussed some of the potential challenges of getting the most out of your Snowflake environment, we’ve also identified a great alternative: use the Unified Semantic Layer, the vastly simplified data modeling, and the AI-assisted precomputation and query optimization of Kyligence on your Snowflake data warehouse. 

Together, you can scale your Snowflake warehouse to support thousands of concurrent users hammering away at petabytes of data and getting instant results. The results will be beautiful.


Additional Resources: 

Snowflake: The Good, the Bad, and the Ugly – The Kyligence + Snowflake demo begins at 23:08

Try Kyligence Cloud for free on a fully hosted environment with zero set up required


About the Authors:

Kaige Liu

Kaige Liu is a Senior Solutions Architect at Kyligence, where he works on building the next-generation big data analytics platform. Previously, he worked on the OpenStack and Bluemix team at IBM, focusing on cloud computing and virtualization technology. Kaige loves the open source community and is an active Apache Kylin committer.

Glenn Platkin is the Head of US Enterprise Sales for the East. Having been part of many well-known business intelligence-based companies, he is an industry veteran.

Samantha Berlant Profile

Samantha Berlant is the Marketing Communications Manager at Kyligence and a big fan of AI, machine learning, and science-fiction. She spent several years leading content analytics projects at Facebook and Instagram and has been a writer and editor for over a decade. Samantha believes in the power of accessible data and her favorite Star Trek character is, coincidently, Data.