Enhance Your MapR Data Platform Experience with Kyligence

Author
Li Kang
Technical Director, Kyligence Partnerships
Mar. 07, 2019

Opportunities abound for AI and machine learning endeavors across every industry. By 2021, it’s predicted that business investment in AI will exceed $57 billion*. This has generated a global race among organizations to capture, analyze, and act on every byte of data they can get. In response, analytics teams are turning to solutions like Kyligence and MapR to keep up with the demands of Big Data analytics.

For the uninitiated, Kyligence is an extreme OLAP engine for big data platforms, built by the team behind the open source software Apache Kylin. MapR is a high-performance, highly scalable big data platform. In this article, we look at how Kyligence Enterprise and MapR enhance each other to provide a strong big data analytics solution.

Kyligence can be deployed onto, or on the edge nodes of, a Hadoop cluster. Users can define and load (pre-calculate) multi-dimensional models using Apache Spark or MapReduce jobs, with the OLAP data structures being stored on Hadoop file systems.
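
To make the idea of pre-calculation concrete, here is a minimal Python sketch of what building a cube means conceptually: the measure is aggregated ahead of time for every combination of dimensions, so that queries can later be answered by lookup. This is not Kyligence code. The rows, the dimension names, and the build_cube function are assumptions made up for illustration, and a real build runs as distributed Spark or MapReduce jobs rather than in a single process.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, year, sales)
ROWS = [
    ("EU", "laptop", 2018, 100.0),
    ("EU", "phone", 2018, 50.0),
    ("US", "laptop", 2018, 70.0),
    ("US", "laptop", 2019, 30.0),
]
DIMENSIONS = ("region", "product", "year")


def build_cube(rows, dimensions):
    """Pre-aggregate the sales measure for every subset of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in dims)
                totals[key] += row[-1]  # last column is the measure
            names = tuple(dimensions[i] for i in dims)
            cube[names] = dict(totals)
    return cube


cube = build_cube(ROWS, DIMENSIONS)

# A query such as "total sales by region" is now a dictionary lookup.
sales_by_region = cube[("region",)]
```

With three dimensions the sketch produces 2^3 = 8 pre-aggregated groupings, from the grand total up to the full three-dimension breakdown.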

MapR is an advanced distributed file system and converged data platform that supports the Hadoop Distributed File System (HDFS), HBase, a document database, and stream processing (using the Kafka API). Several MapR features distinguish it from other Hadoop distributions. Combined with Kyligence Enterprise, these features let MapR users extend the capabilities of the MapR platform and enhance their experience.

Performance Gains on MapR-FS

MapR-FS was developed from the ground up as a distributed read-write file system that is compatible with HDFS interfaces. It is written in C++ and installed directly on disks (see Diagram 1 below), so it does not have to deal with the overhead of the JVM and the Linux file system when accessing disks.

When Kyligence engineers tested MapR-FS against generic Hadoop systems with similar hardware configurations, the design of MapR-FS delivered a clear performance advantage over HDFS. As illustrated in Diagrams 2a and 2b, both cube building and cube querying jobs ran significantly faster on the MapR platform.

Diagram 1: MapR-FS does not rely on the JVM or the Linux file system
Diagram 2a: Kyligence Performance Test on MapR vs HDFS
Diagram 2b: Kyligence Performance Test on MapR vs HDFS

Complete Data Lake on MapR Platform

Today, many companies are choosing to build data lakes to address their data storage needs. Typical data lake setups focus on storing either historical data or large amounts of raw data (such as log data). Querying this data can be time-intensive. Query engines can accelerate the process in certain scenarios, but you’ll usually need to push the data to outside data marts to reduce latency.

With Kyligence Enterprise, aggregated results are pre-calculated and stored in the MapR file system. There is no need to aggregate data in another data warehouse or data mart. This greatly simplifies your data lake architecture and removes traditional data warehouse dependencies that clash with Big Data technologies.

If a user needs answers from the detailed records (not aggregated), they can still send the query to Kyligence. In this case, the query is routed to the data store in MapR-FS or MapR-DB. Now, your data lake can serve both detailed and aggregated queries with superb performance.
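
The routing decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Query class, the route function, the dimension set, and the "cube" and "detail-store" labels are hypothetical names, and the actual rules Kyligence uses to decide where a query runs are not described in this article.

```python
from dataclasses import dataclass

# Assumed set of dimensions covered by the pre-calculated cube.
CUBE_DIMENSIONS = frozenset({"region", "product", "year"})


@dataclass(frozen=True)
class Query:
    group_by: tuple = ()          # dimensions the query aggregates over
    wants_raw_rows: bool = False  # True when detailed records are requested


def route(query):
    """Return which store should answer the query (illustrative rule only)."""
    if query.wants_raw_rows:
        # Detailed records live in the underlying store (MapR-FS or MapR-DB).
        return "detail-store"
    if set(query.group_by) <= CUBE_DIMENSIONS:
        # Every requested dimension is covered by the cube.
        return "cube"
    # The cube cannot answer this aggregation, so fall back to the detail data.
    return "detail-store"
```

For example, route(Query(group_by=("region", "year"))) returns "cube", while route(Query(wants_raw_rows=True)) returns "detail-store".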

Diagram 3: Kyligence on MapR platform

Separate OLAP Cube Building and Querying Workloads

Kyligence employs two types of workloads. The first is cube building, which can take place before a query is served or during a cube update. The second is cube querying, where thousands of users and applications read data from the cubes. For Hadoop installations, Kyligence recommends separate cube building and query clusters to isolate the two workloads and get the best query performance.

Diagram 4: Separate Build and Query Cluster on HDFS

MapR supports the concepts of node topology and volume topology. With these configurations, you can place data and jobs on specific nodes in your cluster. You can also run the two types of workloads on two separate sets of nodes and change the configuration on demand (e.g., adding more nodes to support cube building at year end). Setting up and managing two clusters is no longer required; you have the flexibility to allocate resources within the same cluster.
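
As a rough mental model of how topology lets one cluster serve both workloads, the following Python sketch treats topology as a mapping from node to a path-like label and shows a node being reassigned from querying to building. It does not call any MapR API. The node names, the /data/build and /data/query paths, and both helper functions are hypothetical, and on a real cluster these assignments are made through MapR's own administration tools.

```python
# Hypothetical assignment of cluster nodes to topology paths.
topology = {
    "node1": "/data/build",
    "node2": "/data/query",
    "node3": "/data/query",
    "node4": "/data/query",
}


def nodes_under(topology, path):
    """List the nodes whose topology is the given path or sits below it."""
    return sorted(
        node
        for node, assigned in topology.items()
        if assigned == path or assigned.startswith(path + "/")
    )


def move_nodes(topology, nodes, new_path):
    """Return a new mapping with the given nodes reassigned to new_path."""
    updated = dict(topology)
    for node in nodes:
        if node not in updated:
            raise KeyError(f"unknown node: {node}")
        updated[node] = new_path
    return updated


# At year end, lend one query node to cube building.
year_end = move_nodes(topology, ["node4"], "/data/build")
```

After the move, nodes_under(year_end, "/data/build") lists node1 and node4, while the original mapping is left unchanged so the change can be reverted once the year-end build is done.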

Global Mirroring for Global Big Data Analytics

Detailed records stored in the cluster may contain sensitive data such as Personally Identifiable Information (PII). In some instances, regulations (like GDPR) may restrict you from moving this data out of the country. This presents a challenge for companies using Hadoop distributions or data virtualization tools that offer no aggregation.

MapR’s mirroring capability solves this problem by keeping data synced across clusters in different locations. Instead of mirroring raw data, MapR mirrors the pre-calculated cubes from the remote cluster. This allows access to aggregated results from the cubes at HQ while keeping the PII in the original cluster.
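
The distinction between what stays local and what is mirrored can be illustrated with a small Python sketch. It is a conceptual example only: the records, field names, and functions are invented for illustration, and it does not model how MapR mirroring or Kyligence cubes are actually implemented.

```python
# Hypothetical detailed records held in the regional cluster (contain PII).
RECORDS = [
    {"name": "Alice", "email": "alice@example.com", "country": "DE", "amount": 120.0},
    {"name": "Bob", "email": "bob@example.com", "country": "DE", "amount": 80.0},
    {"name": "Chloe", "email": "chloe@example.com", "country": "FR", "amount": 50.0},
]

PII_FIELDS = {"name", "email"}


def aggregate_by_country(records):
    """Build the aggregated view: total amount per country, no personal fields."""
    totals = {}
    for record in records:
        totals[record["country"]] = totals.get(record["country"], 0.0) + record["amount"]
    return totals


def mirror_payload(records):
    """Return what would be copied to the remote cluster: aggregates only."""
    return [
        {"country": country, "total_amount": total}
        for country, total in sorted(aggregate_by_country(records).items())
    ]


payload = mirror_payload(RECORDS)
```

In this sketch the payload holds one row per country with a total amount, and none of the fields listed in PII_FIELDS appear in it.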

Diagram 5: Only aggregated data in the cubes is mirrored to remote cluster

Start Empowering Your Analytics with Kyligence and MapR

You can see by now why these features and capabilities make MapR an ideal platform for Kyligence to run on. This joint solution enables businesses to accelerate their analytics on petabytes of data at the speed of thought while releasing IT from tedious administrative work.

If you’re ready to supercharge your MapR experience with augmented OLAP analytics or just want to know more, you can visit www.kyligence.io and www.mapr.com and get started today for free. Also, if you’re evaluating Apache Kylin and would like to know how it compares to Kyligence as an OLAP solution, we recommend you check out our Apache Kylin Comparison page.

References

*https://towardsdatascience.com/15-artificial-intelligence-ai-stats-you-need-to-know-in-2018-b6c5eac958e5