How to Accelerate Self-Service Analytics on Data Lakes

Author
Joanna He
Director of Product Management, Kyligence
Jun. 18, 2022
 
Background

 

Spending on cloud infrastructure and services is accelerating. According to a recent IDC report, worldwide “whole cloud” revenues totaled $706.6 billion in 2021 and are forecast to reach more than $1.3 trillion by 2025. Data from Synergy Research Group confirms this trend, showing growth in on-premises data center spending of a mere 2% since 2010, while cloud-based services rose 52% over the same period. Synergy believes the growth is now being fueled in part by the COVID-19 pandemic response and a shift to more remote services.

 

Large public cloud vendors like Amazon and Microsoft have contributed to the cloud trend, launching a number of X-as-a-Service products, including data lake solutions, to support the growing demand for data-centric services such as analytics. In fact, data lakes and data warehouses are the two primary options for enterprises that have adopted cloud-based tools for data analytics.

 
The Challenges of Self-Service Analytics on Data Lakes

A data lake is a centralized data repository that allows businesses to store all of their structured and unstructured data at any scale. Businesses can store data in a data lake as-is (without first structuring it), or they can normalize the data based on their needs. They can then use that data to run different types of analytics, such as decision dashboards and visualizations, big data processing, real-time analytics, and machine learning (ML) algorithms, to generate more accurate business intelligence. That flexibility is a key distinction between a data lake and a data warehouse, and is a distinct advantage for today’s data-driven enterprise.

 

Digging deeper, in Gartner's March 2022 Market Guide for Analytics Query Accelerators, analysts Merv Adrian and Adam Ronthal define the Data and Analytics Infrastructure Model in four zones: known data and known questions, known data and unknown questions, unknown data and known questions, and unknown data and unknown questions.

 

The optimization goals of the data warehouse and the data lake are different. The former is optimized for production delivery of semantically consistent, well-known data; the latter is optimized for semantic flexibility and rapid access to raw data.

 

The question then arises: "Why can't we use the data lake exclusively and retire the data warehouse?" The answer is that the data lake infrastructure, when based on a semantically flexible data store, is generally unable to optimize for the demands of production delivery (such as concurrency, latency and workload management) to the degree that the data warehouse can when built on a relational database.

 

If the data lake has already been built, a more manageable way to tackle the issue is to add an analytics query accelerator.

 

Analytics query accelerators provide a means of making data in semantically flexible data stores more accessible and performant for production and exploratory use.

 
Criteria to Consider on Your Way to Self-Service Analytics
 

An analytics query accelerator is usually a logical extension of the SQL query interface on Hadoop (SQL on Hadoop) and of the SQL query interface on cloud object storage (SQL on Data Lake). So what criteria should enterprises consider when evaluating analytics query accelerators? Gartner's Market Guide makes the following recommendations:

 

Market Recommendations

Data and analytics leaders considering analytics query accelerators to remediate data lake performance and governance concerns or as a broader logical data warehouse play should:

  • Assess where their performance line of "good enough" is by running their most complex workloads on the evaluated target platform in a POC. If a workload fails due to complexity, workload management requirements, performance requirements or other reasons, it is not suitable for the platform, and the next most complex workload should be assessed. Once you have established what percentage of your workloads can be accommodated by an analytics query accelerator, you will be able to make informed decisions about where to use it.
  • Reassess the capabilities of their strategic DBMS vendor and analytics tool(s) of choice to optimize access to the external data they are storing in their data lake. If they perform well enough, an additional product and vendor relationship may not be needed.
  • Test integration with surrounding cloud data management services and/or adjacent data management platforms by evaluating APIs and integration touchpoints.
  • Evaluate security and governance capabilities to ensure that they meet their enterprise standards and requirements by establishing clear governance and security "must haves." Avoid conflicts with existing tools by setting clear coverage assignments for each and leveraging integration where available.
  • Evaluate the degree to which an offering provides open-data access for persisted data by establishing whether the vendor uses open standards for data like Apache Parquet, ORC, Apache Avro or others. The use of a proprietary format may have undesirable consequences around vendor lock-in or impede access via other APIs.
 
Building Enterprise Self-Service Analytics with Kyligence OLAP on Data Lake
 

In the Market Guide, Gartner lists Kyligence as a representative vendor of analytics query accelerators, and with good reason. Enterprises across all industries rely on Kyligence's OLAP on Data Lake solution to accelerate analytics queries, delivering minimal latency and maximal concurrency for data teams accessing an organization’s data lake, no matter which cloud services vendor (or vendors) they choose.

 
The Kyligence OLAP on Data Lake solution provides enterprises with the following capabilities:
 
Unified SQL Interface Based on Object Storage (SQL on Object Storage)
 

With Kyligence, users can execute queries directly on their data lake using standard SQL or business intelligence (BI) tools that support SQL queries. Organizations can also serve their data lake and data warehouse queries from a single architecture, which makes the data easier to access and turn into decision intelligence.

 

Kyligence also natively integrates with data sources such as Hive and object storage, and with data warehouses through software development kits (SDKs). Furthermore, Kyligence's intelligent query routing detects common query patterns and automatically routes each query to an aggregate query index, to a detailed query index, or down to the underlying data warehouse or big data engine, making access to data more efficient.
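The routing decision described above can be illustrated with a minimal sketch. This is not Kyligence's implementation: the `Query` and `Index` classes, the `route` function, and the rule "pick the narrowest index of the matching kind that covers the query's columns" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Query:
    columns: FrozenSet[str]  # columns the query groups by, filters on, or selects
    is_aggregate: bool       # True when the query uses SUM, COUNT, and so on


@dataclass(frozen=True)
class Index:
    name: str
    kind: str                # "aggregate" or "detail" (hypothetical labels)
    columns: FrozenSet[str]


def route(query: Query, indexes: List[Index]) -> str:
    """Return the name of a covering index, or "pushdown" if none exists."""
    wanted = "aggregate" if query.is_aggregate else "detail"
    candidates = [
        ix for ix in indexes
        if ix.kind == wanted and query.columns <= ix.columns
    ]
    if not candidates:
        # No index can answer the query: hand it to the underlying engine.
        return "pushdown"
    # Prefer the narrowest covering index; break ties by name.
    return min(candidates, key=lambda ix: (len(ix.columns), ix.name)).name


indexes = [
    Index("agg_region_month", "aggregate", frozenset({"region", "month"})),
    Index("agg_region", "aggregate", frozenset({"region"})),
    Index("detail_orders", "detail",
          frozenset({"order_id", "region", "month", "amount"})),
]

by_region = route(Query(frozenset({"region"}), True), indexes)
order_rows = route(Query(frozenset({"order_id", "amount"}), False), indexes)
by_customer = route(Query(frozenset({"customer"}), True), indexes)
```

Here `by_region` resolves to the aggregate index, `order_rows` to the detail index, and `by_customer` falls through to pushdown because no index covers the `customer` column. A real router would also weigh cost estimates and data freshness, which this sketch leaves out.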

 
 
One Customer’s Results with a Unified SQL Interface
 

One customer used Kyligence to build a unified data service layer with ANSI SQL query interfaces and microservice encapsulation, encompassing multiple data sources such as Oracle, MySQL, Elasticsearch, and ClickHouse. This capability helped them achieve unified management of enterprise data assets while significantly improving the efficiency of their application development and delivery, shortening the path from data to insight.

 
 

Kyligence supports all the major cloud data lakes, such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage, and integrates with popular BI tools like Tableau, Power BI, and MicroStrategy. That makes it the ideal choice for building a self-service analytics platform with whatever tools and resources an enterprise is already using, while keeping flexibility for the future.

 
High Performance, High Concurrency, Low TCO
 

Kyligence's OLAP on Data Lake solution uses pre-computation to deliver the stable query performance that production workloads demand. This is important when working with data lakes that cannot, on their own, optimize for the demands of production delivery. Kyligence uses a cost-effective, “compute once, reuse many” approach that enables enterprises to avoid the costs associated with over-consumption of cloud computing resources.
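The idea behind “compute once, reuse many” can be shown with a small, generic pre-aggregation sketch. It is not Kyligence's engine: the row layout, `build_aggregate`, and `total_by_region` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, float]  # (region, month, amount), an assumed schema


def build_aggregate(rows: Iterable[Row]) -> Dict[Tuple[str, str], float]:
    """Scan the raw rows once and store SUM(amount) per (region, month)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for region, month, amount in rows:
        totals[(region, month)] += amount
    return dict(totals)


def total_by_region(agg: Dict[Tuple[str, str], float], region: str) -> float:
    """Answer a coarser query from the stored aggregate, not the raw rows."""
    return sum(value for (r, _), value in agg.items() if r == region)


raw_rows = [
    ("EU", "2022-01", 100.0),
    ("EU", "2022-02", 50.0),
    ("US", "2022-01", 70.0),
    ("EU", "2022-01", 25.0),
]

agg = build_aggregate(raw_rows)        # computed once
eu_total = total_by_region(agg, "EU")  # reused by any later query
us_total = total_by_region(agg, "US")
```

The raw data is scanned a single time; every subsequent query that can be expressed over region and month reads the much smaller aggregate, which is where the savings in compute come from.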

 

The potential cost savings of this approach were illustrated by the international eCommerce firm OLX Group, which shared its cost comparison between Apache Kylin, SQL Server Analysis Services (SSAS), and Amazon Redshift when selecting Apache Kylin for cloud data lakes.

 

As shown in the figure below, on the same 100 million rows of test data, the €450 monthly cost of Apache Kylin (including the cost of the underlying architecture) was less than half the cost of Microsoft SSAS (€1,232) and less than a quarter of the cost of Amazon Redshift (€2,000). What’s more, query performance can reach 2x that of Microsoft SSAS and 4x that of Amazon Redshift.
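The cost ratios follow directly from the three monthly figures quoted above. The short calculation below uses only those numbers; the variable names are illustrative.

```python
# Monthly cost in euros, as quoted from the OLX Group comparison.
monthly_cost_eur = {
    "Apache Kylin": 450,
    "Microsoft SSAS": 1232,
    "Amazon Redshift": 2000,
}

kylin_cost = monthly_cost_eur["Apache Kylin"]

# Kylin's cost as a fraction of each alternative's cost.
cost_ratio = {
    name: round(kylin_cost / cost, 3)
    for name, cost in monthly_cost_eur.items()
}

# Absolute monthly difference versus each alternative, in euros.
monthly_saving_eur = {
    name: cost - kylin_cost
    for name, cost in monthly_cost_eur.items()
    if name != "Apache Kylin"
}
```

The ratios work out to roughly 0.365 against SSAS and 0.225 against Redshift, which is a monthly difference of €782 and €1,550 respectively.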

 
 
More Efficient Data Management
 

Traditionally, organizations have relied on legacy data warehouses to support the data analysis needs of production, architecting their data warehouse with a source layer, a warehouse layer, and a data mart layer. Applying this layering to a data lake can cause data governance issues for production queries. To overcome these challenges, many organizations define metrics in views, and then use those views to solve last-mile queries. However, this approach is inefficient: it does not work in all cases, requires additional and costly preparation by data engineering teams, and is error-prone.

 

Kyligence addresses these problems with an AI engine that avoids repeated development and construction in the data mart layer. Using Kyligence, organizations can access all required data sources through a simple low-code interface that replaces complex extract-transform-load (ETL) processes, significantly reducing the time and complexity of developing at the data mart layer.

 
 

Furthermore, Kyligence's AI-augmented engine automates data collection from the business, and allows data development teams to see query histories recorded in the background log and understand query usage. Based on those query histories, the Kyligence AI-augmented engine will automatically recommend adding new, more efficient processes to existing models.
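One way a history-based recommendation could work is sketched below. It assumes, purely for illustration, that each logged query is reduced to the set of columns it groups by and that a recommendation is a proposed set of columns to pre-compute; `recommend_indexes` and `min_hits` are hypothetical names, not part of any Kyligence API.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Set


def recommend_indexes(
    history: Iterable[FrozenSet[str]],
    existing: Set[FrozenSet[str]],
    min_hits: int = 2,
) -> List[FrozenSet[str]]:
    """Suggest column sets that are queried often but not yet covered."""
    counts = Counter(history)
    return [
        columns
        for columns, hits in counts.most_common()  # most frequent first
        if hits >= min_hits
        # Skip patterns that an existing index already covers.
        and not any(columns <= index for index in existing)
    ]


query_history = [
    frozenset({"region", "month"}),
    frozenset({"region", "month"}),
    frozenset({"product"}),
    frozenset({"region"}),
    frozenset({"region"}),
    frozenset({"region", "month"}),
]
existing_indexes = {frozenset({"region"})}

recommendations = recommend_indexes(query_history, existing_indexes)
```

In this example only the `region` and `month` combination is recommended: it appears three times and is not covered, while `region` alone is already indexed and `product` is queried too rarely to justify the build cost.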

 

Kyligence also provides the following capabilities to accelerate data lake-based analytics:

  • Enterprise-grade data security: Kyligence provides enterprise-grade end-to-end data encryption, cell-level data access control, data backup/restore, and other security policies.
  • Open data formats on the cloud: Kyligence supports Apache Hudi, ORC, Apache Parquet, CSV, and other industry standard data formats.
  • Common API: Kyligence provides standardized APIs to help enterprises automate data development work such as data source access, data loading and building, and operation and maintenance monitoring.

 
Summary
 

When evaluating a solution to optimize production query delivery on your data lake infrastructure, whichever cloud resources your enterprise currently uses, refer to Gartner's March 2022 Market Guide for Analytics Query Accelerators. Then consider the cost savings and efficiency gains possible with an investment in Kyligence. Using the Kyligence OLAP on Data Lake solution, enterprises can achieve more efficient data management, dramatically lower operational costs, and maximize the value of their data lake through expanded analytics and a faster time-to-decision.

 
Reference
 

March 2022, Gartner, Market Guide for Analytics Query Accelerators, Merv Adrian, Adam Ronthal

https://www.slideshare.net/TylerWishnoff/apache-kylin-meetup-berlin-with-olx-group

 

Disclaimer:

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.