Meet Your AI Copilot fot Data Learn More
Your AI Copilot for Data
Kyligence Zen Kyligence Zen
Kyligence Enterprise Kyligence Enterprise
Metrics Platform
OLAP Platform
Customers
Definitive Guide to Decision Intelligence
Recommended
Resources
Apache Kylin
About
Partners
In today’s world, Big Data, data science, and machine learning analytics and are not only hot topics, they’re also an essential part of our society. Data is everywhere, and the amount of digital data that exists is growing at a rapid rate. According to Forbes, around 175 Zettabytes of data will be generated annually by 2025.
The economy, healthcare, agriculture, energy, media, education and all other critical human activities rely more and more on the advanced processing and analysis of large quantities of collected data. However, these massive datasets pose a real challenge to data analytics, data mining, machine learning and data science.
Data Scientists and analysts have often expressed frustration while trying to work with Big Data. The good news is that there is a solution: Apache Kylin. Kylin solves this Big Data dilemma by integrating with Python to help analysts & data scientists finally gain unfettered access to their large-scale (terabyte and petabyte) datasets.
One of the main challenges machine learning (ML) engineers and data scientists encounter when running computations with Big Data comes from the principle that higher volume or scale equates to greater computational complexity.
Consequently, as datasets scale up, even trivial operations can become costly. Moreover, as data volume rises, algorithm performance becomes increasingly dependent on the architecture used to store and move data. Parallel data structures, data partitioning and placement, and data reuse become more important as the amount of data one is working with grows.
Apache Kylin is an open source distributed Big Data analytics engine designed to provide a SQL interface for multi-dimensional analysis (MOLAP) on Hadoop. It allows enterprises to rapidly analyze their massive datasets in a fraction of the time it would take using other approaches or Big Data analytics tools.
With Apache Kylin, data teams are able to dramatically cut down on analytics processing time and associated IT and ops costs. It’s able to do this by pre-computing large datasets into one (or another very small amount) of OLAP cubes and storing them in a columnar database. This allows ML Engineers, data scientists, and analysts to quickly access the data and perform data mining activities to uncover hidden trends easily.
The Following diagram illustrates how machine learning and data science activities on big data become much easier when Apache Kylin is introduced.
Python has quickly risen in prominence to take its spot as one of the leading programming languages in the data analytics field (as well as outside the field). With its ease of use and extensive collection of libraries, Python has become well-positioned to take on Big Data.
Python also provides plenty of data mining tools to assist in the handling of data, offering up a variety of applications already adopted by the machine learning and data science communities. Simply put, if you’re working with Big Data, there’s probably a way Python can make your job easier.
Apache Kylin can be easily integrated with Python with support from Kylinpy . Kylinpy is a python library that provides a SQLAlchemy Dialect implementation. Thus, any application that uses SQLAlchemy can now query Kylin OLAP cubes. Additionally, it also allows users to access data via Pandas data frames.
We imported an IMDB movie dataset (Source: Movielens) into our Kylin OLAP cube and used Python to read the data and perform exploratory analysis in order to find trends in movie ratings for different genres over a given period of time.
In order to analyze the data via Python, the Kylinpy library was used and SQL(s) were written to ingest relevant data for the analysis in question. The dataset(s) returned via SQL(s) were stored as Pandas data frame(s) and then data manipulation was done on the data frames to bring the data into a shape suitable for our analysis. We have leveraged the Matplotlib and Seaborn libraries for visualizing the data. The diagram below illustrates the data lifecycle through each of its stages.
Let us first visualize the top-rated movies. It can be seen that from the top 15 movies, apart from top 2, 13 movies have been rated by an almost equal number of viewers. This information is a starting point for correlational discovery and can be further drilled down into to find the correlation between our closely rated movies.
Similarly, plot graph below displays the comparison of Males vs. Females count per Genre. This describes a gender-based inclination across various movie genres.
From the below correlation matrix (Heat map), we can state the relationship between Occupation and Genres of Movies that an individual prefers. For example: Farmers do not prefer to watch Mystery based movies and College Students prefer Film-Noir or Documentaries.
The next figure shows the trends of the average ratings by users for different genres across different weeks for a given year. From the chart it can be seen that Documentary and Crime movies are amongst people’s favorites while children’s movies always had the lowest average rating.
The two scatter plots below are used for a side by side comparison to infer correlation between the ratings of Men and Women.
Left Plot: The scatter plot shows that the average rating of Men and Women (all movies) has a linearly increasing trend and the highly concentrated part of the plot is equally distributed on both sides of the reference line, which depicts that apart from a few movie ratings, Men and Women tend to think alike.
Right Plot: The scatter plot was produced by segregating only those movies which have been rated more than 400 times. In this case as well we can see that Men and Women have similar ratings, suggesting that our initial inference was accurate.
We discussed how Python easily integrates with Apache Kylin’s OLAP technology using the Kylinpy library, which in turn was used to run advanced analytics on our example movie dataset. We also used Pandas, Matplotlib and Seaborn libraries to manipulate and visualize the data residing in our Apache Kylin cubes.
Such analysis gave us insight into how people’s liking of different movie genres changes over time. It also told us about the strength of association between trends in different movie genres. Insights like these could be useful for movie critics.
If you or your team are facing issues in fully accessing your massive datasets and want to leverage Kylin’s OLAP on Big Data approach for your machine learning or data science activities, Apache Kylin (and its associated enterprise Big Data platform Kyligence) has you covered.
With the Kyligence Big Data analytics platform you get one click provisioning, elasticity and flexibility on the cloud, and much more. It’s fast and easy to get started with Kyligence. Apply for a free trial now.
Furthermore, Kyligence’s extreme OLAP engine is completely integrated with Azure cloud services, which means that all the great features of Kyligence can be easily combined with Azure’s AutoML machine learning services. For more information about Kyligence’s augmented OLAP analytics solutions visit our Kyligence Enterprise and Kyligence Cloud product pages. Or get started with this great overview video.
Also, be sure to follow us on LinkedIn and Twitter for the latest Kyligence product updates and Augmented Analytics announcements.
The driving force behind Meituan’s success is not simply a robust analytics system, but the OLAP engine that system is built upon - Apache Kylin.
Cloud Analytics News will share the important news on Apache Kylin, Kyligence Cloud, and related technologies. In this edition, we cover Apache Kylin 4.X beta, the launch of Kyligence Cloud 4, Pivot to Snowflake, and more.
UnionPay was able to consolidate the 1,200 Cognos cubes into 2 Kyligence cubes and a single ETL process. Besides extending the life of the analytics executed against this data, there was a massive improvement in operational efficiency.
A peek behind the curtain of the world's leading open source big data analytics project, Apache Kylin.
An introduction to Apache Kylin's new storage and compute architecture, Apache Parquet. This article introduces Kylin's query principles, Parquet storage, and accurate duplicate removal
Already have an account? Click here to login
You'll get
A complete product experience
A guided demo of the whole process, from data import, modeling to analysis, by our data experts.
Q&A session with industry experts
Our data experts will answer your questions about customized solutions.
Please fill in your contact information.We'll get back to you in 1-2 business days.
Industrial Scenario Demostration
Scenarios in Finance, Retail, Manufacturing industries, which best meet your business requirements.
Consulting From Experts
Talk to Senior Technical Experts, and help you quickly adopt AI applications.