Use Python for Data Science with Apache Kylin

Author
Nikhil Jain
Senior Solution Architect, Kyligence
Jun. 24, 2019

In today’s world, Big Data, data science, and machine learning analytics are not only hot topics, they’re also an essential part of our society. Data is everywhere, and the amount of digital data is growing rapidly. According to Forbes, around 175 zettabytes of data will be generated annually by 2025.

The economy, healthcare, agriculture, energy, media, education and all other critical human activities rely more and more on the advanced processing and analysis of large quantities of collected data. However, these massive datasets pose a real challenge to data analytics, data mining, machine learning and data science.

Data scientists and analysts have often expressed frustration while trying to work with Big Data. The good news is that there is a solution: Apache Kylin. Kylin solves this Big Data dilemma by integrating with Python to help analysts and data scientists finally gain unfettered access to their large-scale (terabyte and petabyte) datasets.

Machine Learning and Data Science Challenges

One of the main challenges machine learning (ML) engineers and data scientists encounter when running computations with Big Data comes from the principle that higher volume or scale equates to greater computational complexity.

Consequently, as datasets scale up, even trivial operations can become costly. Moreover, as data volume rises, algorithm performance becomes increasingly dependent on the architecture used to store and move data. Parallel data structures, data partitioning and placement, and data reuse become more important as the amount of data one is working with grows.


What Apache Kylin Is and How It Helps

Apache Kylin is an open source distributed Big Data analytics engine designed to provide a SQL interface for multi-dimensional analysis (MOLAP) on Hadoop. It allows enterprises to rapidly analyze their massive datasets in a fraction of the time it would take using other approaches or Big Data analytics tools.

With Apache Kylin, data teams are able to dramatically cut down on analytics processing time and the associated IT and ops costs. It does this by pre-computing large datasets into one (or a very small number of) OLAP cubes and storing them in a columnar database. This allows ML engineers, data scientists, and analysts to quickly access the data and perform data mining activities to uncover hidden trends easily.
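To make the idea of pre-computation concrete, here is a minimal conceptual sketch in plain Python. It is not Kylin's implementation: the rows, the dimension names, and the helpers `build_cube` and `average` are all hypothetical. It only shows why a query can be answered from stored aggregates instead of a scan of the raw rows.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows; not taken from the article's dataset.
ROWS = [
    {"genre": "Drama", "gender": "F", "rating": 4},
    {"genre": "Drama", "gender": "M", "rating": 3},
    {"genre": "Comedy", "gender": "F", "rating": 5},
    {"genre": "Comedy", "gender": "M", "rating": 2},
    {"genre": "Drama", "gender": "F", "rating": 5},
]
DIMENSIONS = ("genre", "gender")


def build_cube(rows, dimensions, measure):
    """Pre-compute SUM and COUNT of `measure` for every combination of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(lambda: [0, 0])  # key -> [sum, count]
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key][0] += row[measure]
                cuboid[key][1] += 1
            cube[dims] = dict(cuboid)
    return cube


def average(cube, dims, key):
    """Answer an AVG query from a pre-computed cuboid without rescanning the rows."""
    total, count = cube[dims][key]
    return total / count


cube = build_cube(ROWS, DIMENSIONS, "rating")
print(average(cube, ("genre",), ("Drama",)))                # average Drama rating
print(average(cube, ("genre", "gender"), ("Comedy", "M")))  # Comedy rated by men
print(average(cube, (), ()))                                # overall average
```

The cost of scanning the rows is paid once at build time; each later query is a dictionary lookup, which is the trade-off that makes interactive response times possible on large data.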

The following diagram illustrates how machine learning and data science activities on Big Data become much easier when Apache Kylin is introduced.

[Figure: How Apache Kylin works with Big Data]

How to Integrate Python with Apache Kylin

Python has quickly risen in prominence to take its spot as one of the leading programming languages in the data analytics field (as well as outside the field). With its ease of use and extensive collection of libraries, Python has become well-positioned to take on Big Data.

Python also provides plenty of data mining tools to assist in the handling of data, offering up a variety of applications already adopted by the machine learning and data science communities. Simply put, if you’re working with Big Data, there’s probably a way Python can make your job easier.

Apache Kylin can be easily integrated with Python through Kylinpy. Kylinpy is a Python library that provides a SQLAlchemy dialect implementation, so any application that uses SQLAlchemy can query Kylin OLAP cubes. It also allows users to access data via Pandas data frames.

Sample code to access data via Pandas:

[Figure: Apache Kylin Pandas code sample]
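A minimal sketch of this access pattern might look like the following. It assumes the `kylin://user:password@host:port/project` URL form described in the Kylinpy README, so check it against your installed version. The helper names `kylin_url` and `read_kylin` are hypothetical, and the credentials, project, and table are placeholders based on Kylin's sample setup.

```python
from urllib.parse import quote


def kylin_url(user, password, host, port, project):
    """Build a SQLAlchemy URL for the Kylinpy dialect (URL form is an assumption)."""
    return (
        f"kylin://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(project, safe='')}"
    )


def read_kylin(sql, url):
    """Run `sql` against Kylin and return the result as a Pandas data frame."""
    # Imported here so the URL helper above works without these packages installed.
    import pandas as pd
    import sqlalchemy as sa

    engine = sa.create_engine(url)  # needs: pip install kylinpy sqlalchemy pandas
    return pd.read_sql(sql, engine)


# Placeholder credentials and project; replace with your own.
URL = kylin_url("ADMIN", "KYLIN", "localhost", 7070, "learn_kylin")
print(URL)

# This part needs a reachable Kylin server, so it is left commented out:
# df = read_kylin("SELECT * FROM kylin_sales LIMIT 10", URL)
# print(df.head())
```

Percent-encoding the user name and password keeps characters such as `@` or `/` from breaking the URL.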

Benefits of using Apache Kylin as a Data Source:

  • Easy Access to Massive Datasets: Interactively work with large amounts (TB/PB) of data.
  • Blazing Fast Performance: Get sub-second response times to your queries on Big Data.
  • High Scalability: With Kylin’s linear scalability, scale up your data without worrying about performance.
  • Web Scale Concurrency: Deploy to thousands of concurrent users.
  • Minimal Data Engineering: Invest time in discovering insights and leave the data engineering to Apache Kylin.

A Use Case: Data Science with Apache Kylin

Dataset

We imported an IMDB movie dataset (Source: MovieLens) into our Kylin OLAP cube and used Python to read the data and perform exploratory analysis in order to find trends in movie ratings for different genres over a given period of time.

Motivation

  • Identify top rated movies.
  • Compare Male vs Female preference for different movie genres.
  • Find correlation between Occupation & Genre.
  • Analyze trends in average movie ratings for different genres across the weeks.
  • Compare Men & Women average ratings.

Data Lifecycle

To analyze the data in Python, we used the Kylinpy library and wrote SQL queries to retrieve the data relevant to each question. The result sets were stored as Pandas data frames, which we then reshaped to suit our analysis. We used the Matplotlib and Seaborn libraries to visualize the data. The diagram below illustrates the data lifecycle through each of its stages.

[Figure: Apache Kylin data lifecycle]

Analysis

Let us first visualize the top-rated movies. Of the top 15 movies, all but the top 2 were rated by an almost equal number of viewers. This information is a starting point for correlational discovery, and we can drill down further to find the correlation between our closely rated movies.

[Figure: SQL query and chart of the top-rated movies]
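The top-rated ranking can be sketched as follows. The SQL text, the table and column names, and the `min_ratings` cutoff are assumptions made for illustration, not the article's actual query, and a small in-memory frame stands in for the result that `pd.read_sql` would return.

```python
import pandas as pd

# Hypothetical query; table and column names are placeholders.
TOP_MOVIES_SQL = """
SELECT title, COUNT(*) AS num_ratings, AVG(rating) AS avg_rating
FROM ratings
GROUP BY title
"""

# Stand-in for the frame that pd.read_sql(TOP_MOVIES_SQL, engine) would return.
movies = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "num_ratings": [950, 40, 610, 580],
        "avg_rating": [4.5, 4.9, 4.2, 4.4],
    }
)


def top_rated(frame, n, min_ratings):
    """Return the n best-rated movies that have at least `min_ratings` ratings."""
    popular = frame[frame["num_ratings"] >= min_ratings]
    return popular.sort_values("avg_rating", ascending=False).head(n)


top = top_rated(movies, n=3, min_ratings=100)
print(top[["title", "num_ratings", "avg_rating"]].to_string(index=False))
```

Filtering on a minimum number of ratings keeps a movie with a handful of perfect scores (movie "B" here) from dominating the ranking.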

Similarly, the plot below compares the number of Males vs. Females per Genre, showing gender-based inclinations across movie genres.

[Figure: SQL query and chart of Male vs. Female counts per Genre]
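One way to shape such counts for a grouped bar chart is a pivot from long to wide form. The sketch below uses made-up numbers and hypothetical column names, and the query in the comment is only an example of what might produce them.

```python
import pandas as pd

# Stand-in for rows returned by a hypothetical query such as:
#   SELECT genre, gender, COUNT(*) AS num_ratings FROM ratings GROUP BY genre, gender
counts = pd.DataFrame(
    {
        "genre": ["Action", "Action", "Drama", "Drama", "Romance", "Romance"],
        "gender": ["F", "M", "F", "M", "F", "M"],
        "num_ratings": [120, 380, 300, 310, 260, 140],
    }
)

# One row per genre and one column per gender, ready for a grouped bar chart.
by_genre = counts.pivot(index="genre", columns="gender", values="num_ratings")

# Share of each genre's ratings that came from women.
by_genre["female_share"] = by_genre["F"] / (by_genre["F"] + by_genre["M"])
print(by_genre)
```

Comparing shares rather than raw counts helps when one group contributes many more ratings overall than the other.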

The correlation matrix (heat map) below shows the relationship between Occupation and the movie Genres an individual prefers. For example, Farmers tend not to watch Mystery movies, while College Students prefer Film-Noir or Documentaries.

[Figure: SQL code and heat map of Occupation vs. Genre]
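The article does not say how the matrix behind the heat map was computed. As one plausible approach, the sketch below builds a row-normalised contingency table with `pd.crosstab`, so each cell is the share of an occupation's ratings that fall in a genre. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical rating events: one row per (user occupation, movie genre) pair.
events = pd.DataFrame(
    {
        "occupation": ["farmer", "farmer", "farmer",
                       "student", "student", "student", "student"],
        "genre": ["Western", "Western", "Drama",
                  "Film-Noir", "Documentary", "Drama", "Film-Noir"],
    }
)

# Row-normalised contingency table: every row sums to 1, so occupations
# with very different numbers of ratings remain comparable.
share = pd.crosstab(events["occupation"], events["genre"], normalize="index")
print(share.round(2))

# With Seaborn installed, the table can be drawn as a heat map:
# import seaborn as sns
# sns.heatmap(share, annot=True)
```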

The next figure shows the trends of the average ratings by users for different genres across different weeks for a given year. From the chart it can be seen that Documentary and Crime movies are amongst people’s favorites while children’s movies always had the lowest average rating.

[Figure: Trend line chart of average ratings by genre per week]
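A weekly trend table of this kind can be sketched with a pivot table over the ISO week number. The dates, ratings, and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical rating events with a timestamp.
ratings = pd.DataFrame(
    {
        "rated_at": pd.to_datetime(
            ["2000-01-03", "2000-01-05", "2000-01-04",
             "2000-01-10", "2000-01-12", "2000-01-11"]
        ),
        "genre": ["Crime", "Crime", "Children's",
                  "Crime", "Children's", "Children's"],
        "rating": [5, 4, 3, 4, 2, 3],
    }
)

# ISO week number of each rating, cast to a plain integer for easy indexing.
ratings["week"] = ratings["rated_at"].dt.isocalendar().week.astype(int)

# One row per week and one column per genre, holding the mean rating.
weekly = ratings.pivot_table(
    index="week", columns="genre", values="rating", aggfunc="mean"
)
print(weekly)

# With Matplotlib installed, weekly.plot() draws one trend line per genre.
```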

The two scatter plots below are used for a side by side comparison to infer correlation between the ratings of Men and Women.

Left Plot: The average ratings of Men and Women (all movies) show a linearly increasing trend, and the densest part of the plot is evenly distributed on both sides of the reference line. This indicates that, apart from a few movies, Men and Women tend to rate alike.

Right Plot: This scatter plot includes only those movies that have been rated more than 400 times. Here too, Men and Women give similar ratings, suggesting that our initial inference was accurate.

[Figure: Scatter plots of Men’s vs. Women’s average ratings]
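The filtering step behind the right-hand plot can be sketched as follows. The numbers and column names are hypothetical, and the sketch assumes the 400-rating threshold applies to a movie's total across both genders, which the article does not specify.

```python
import pandas as pd

# Hypothetical per-movie, per-gender aggregates, as a
# GROUP BY title, gender query might return them.
agg = pd.DataFrame(
    {
        "title": ["A", "A", "B", "B", "C", "C"],
        "gender": ["F", "M", "F", "M", "F", "M"],
        "num_ratings": [300, 500, 50, 120, 260, 240],
        "avg_rating": [4.2, 4.0, 3.0, 3.9, 3.5, 3.6],
    }
)

MIN_RATINGS = 400  # threshold quoted in the article

# Wide form: one row per movie, with F and M columns under each measure.
wide = agg.pivot(index="title", columns="gender", values=["num_ratings", "avg_rating"])

# Keep only movies whose total number of ratings exceeds the threshold.
total = wide["num_ratings"].sum(axis=1)
frequent = wide["avg_rating"].loc[total > MIN_RATINGS]
print(frequent)

# With Matplotlib installed, the scatter plot would be:
# frequent.plot.scatter(x="M", y="F")
```

Movie "B" is dropped because its 170 ratings fall below the threshold, which removes the noisiest points from the comparison.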

Get Started with Python on Apache Kylin

We discussed how Python easily integrates with Apache Kylin’s OLAP technology using the Kylinpy library, which in turn was used to run advanced analytics on our example movie dataset. We also used Pandas, Matplotlib and Seaborn libraries to manipulate and visualize the data residing in our Apache Kylin cubes.

Such analysis gave us insight into how people’s liking of different movie genres changes over time. It also told us about the strength of association between trends in different movie genres. Insights like these could be useful for movie critics.

If you or your team are facing issues in fully accessing your massive datasets and want to leverage Kylin’s OLAP on Big Data approach for your machine learning or data science activities, Apache Kylin (and its associated enterprise Big Data platform Kyligence) has you covered.

With the Kyligence Big Data analytics platform you get one-click provisioning, elasticity and flexibility on the cloud, and much more. It’s fast and easy to get started with Kyligence. Apply for a free trial now.

Furthermore, Kyligence’s extreme OLAP engine is completely integrated with Azure cloud services, which means that all the great features of Kyligence can be easily combined with Azure’s AutoML machine learning services. For more information about Kyligence’s augmented OLAP analytics solutions visit our Kyligence Enterprise and Kyligence Cloud product pages. Or get started with this great overview video.

Also, be sure to follow us on LinkedIn and Twitter for the latest Kyligence product updates and Augmented Analytics announcements.