Kyligence Enterprise 4: Azure Marketplace Quick Start Guide

Get going with Kyligence Enterprise 4 in the Azure Marketplace with this step-by-step quick start guide.

Deploy Kyligence Enterprise 4

  1. Deploy Kyligence Enterprise to a New HDInsight Cluster
  • In Azure Marketplace, search for and select Kyligence Enterprise,click Create to create an HDInsight cluster. You need to switch the Quick create wizard to the Custome (size, settings, apps)
  • Fill in all the required information on the Basics blade to configure the basic settings of your cluster and click Next. You can select Spark 2.3 or Hadoop with HDInsight 3.6 or HDInsight 4 as you needed.

  • Please Azure storage as Primary Storage Type and click Next.
  • Click Kyligence Enterprise and fill out the Legal terms then click Next.

  • Fill out the Cluster size, Script actions blades and click Next.
  • Review your settings in Summary and then click Create to deploy your cluster. The cluster may take twenty minutes or more to create.
  1. Log in to the Web UI of Kyligence Enterprise
  • After the cluster has been created, in the dashboard of the Azure Portal, click the name of the new cluster you created in All resources/All subscriptions. If you cannot find it, click See more to find your cluster.

  • On the left side of the console, click Applications under the heading Settings. Then, click Potal to log  in to the Web UI of Kyligence Enterprise.

  1. Request Trial License
  • A window of Update License will pop up when you open the web interface of Kyligence Enterprise. Click Apply Evaluation License in the upper right corner of the pop-up window to apply for your Kyligence Enterprise’s license. If you have already gotten the Kyligence Enterprise’s license, please drop the license file or click the dotted area in the middle of the window to upload the license, or click paste license content to paste license content.
  • Submit your Business Mail, Company Name and Your Name. A two-month trial license for you will be effective immediately.

  1. Log in to Kyligence Enterprise
  • Log in to Kyligence Enterprise with the initial username “ADMIN” and the initial password “KYLIN”.
  • At the first login, you will be asked to update your password to a stronger one. Please remember the new password for future login.

Basic Concepts

  1. Table - Data source for model. Before creating a model and loading data, synchronize the table schema to the data source (typically Hive). Table schemas include table name, columns, data types etc.
  2. Model - Defines a set of connected tables and their joining relationships. Model also defines the fact table, dimension tables, dimensions, measures, and indices to pre-calculate when the data is loaded into the model. The system supports star schema and snowflake schema as the base of multi-dimensional analysis.
  3. Index - Index is built during data load and used to accelerate query executions; this includes aggregate index and table index.
    1. Aggregate Index - A combination of multiple dimensions and measures, suitable for answering summary queries such as total sales for a given year.
    2. Table Index - a wide table with a multi-level index, suitable for answering detailed queries such as the last 100 transactions of a certain user.
  4. Load Data - Data that needs to be loaded from the source tables into a model. During the load, the data is also being pre-calculated according to the defined index. Each data load results in one Segment. Querying on a loaded model is greatly accelerated by the pre-calculation that happens during the loading process.
    1. Incremental Load - A partition column of type date/time is defined on the fact table. With the partition column, data can be loaded incrementally by time ranges.
    2. Full Load - If no partition column is defined, then all data in source table is loaded in one try.
    3. Rebuild Index - You may modify model and index definition at any time. Index on loaded data will need a rebuild to catch up with its new definition. The system may optimize model and index definition too if asked by user to accelerate certain queries.
  5. Accelerate Query - Accelerated queries by automatically optimizing model and index definition. The system can automatically learn the model and improve the index based on previous query patterns and data characteristics. When using this feature, users do not need to design models and index manually.
  6. Project Mode - The system provides two project modes.
    1. Expert Mode - In this mode, model is guided by user. User defines model and index manually. The system may also optimize the index if asked by the user to accelerate queries, however no changes will be made to model. The expert mode suits advanced users who want to have fine control of the multi-dimensional model.
    2. Smart Mode - In this mode, model is guided by the system. The system will create and optimize the model and index transparently. All the user needs to do is issue queries and ask the system to accelerate. The model concept is intentionally hidden from users in this mode. The smart mode suits citizen analysts who want big data insights with a low learning curve.

 

Smart Mode Tutorial

Kyligence Enterprise 4.0 provides two project modes. They are Expert Mode—the model is guided by user and Smart Mode—the model is guided by the system. We introduce the basic product usage of Smart Mode in this section.

 

Preparation

  1. Kyligence Enterprise 4.0 offers two project modes: Smart Mode and Expert Mode.
    1. Smart Mode: Based on your querying habits and the acceleration rules you define, the system will create the model and index transparently. It will also automatically accelerate queries. This greatly reduces the work needed for model and index design.
    2. Expert mode: You are able to design your own models to fulfill your business analysis demands and load data for your models. You can design the index manually and let the system continue to improve the index according to your query habits and data characteristics. Learn in more detail Expert Mode Guide.
  2. We will use the Kyligence Enterprise built-in sample data to introduce the smart mode project. You can find out how to import the sample data in the Quick Start
  3. Before creating a smart mode project, we recommend that you run through the built-in demo in the Web UI Help -> User Guide -> Smart Mode. Using SSB (Star Schema Benchmark) dataset, this demo shows the process of Add project -> Synchronize table schema -> Create a model -> Load data in smart mode.

Create Project

Project is the primary management unit of Kyligence Enterprise. In a project, you can use all the features of Kyligence Enterprise.

At the top left of the product click the + button to add an item (right of the item list). Select Smart Mode in the pop-up window and fill in the project name and project description. The project name is mandatory, and the project description is optional. However, a good project description will help with the maintenance of the project in the future.

At this point, you have just created a smart mode project. The interface stays on the Studio -> Source page, ready to add data sources for the next step.

Add Data Source

Once the project is created, you need to add a data source table to the project. You will use the data source table added here during the analysis phase.

When you add a data source, the metadata of the source table is synchronized. The metadata of a table is the data that describes the characteristics of the table e.g. table names, column names and types etc.

  1. Import Table Metadata

In the Studio -> Source page, you can click the Add Data Source button at the top left to add a data source table for your project. It is divided into the following two steps:

  1. Select data source type: We currently support Hive as a data source. More data sources are under development in 4.0.

Tip: If you want to connect to other data sources such MySQL or Kafka, please use Kyligence Enterprise 3.x version.

  1. Select the target data source table: Expand the database list, and select the target data source table.

For more information on data source operations, please see the Data Source

  1. Table Sampling

During the metadata synchronization process of the table, the data sampling is turned on by default. You can view the auto-launched sample table data task in the Monitor -> Job Once the task has been executed, you can view the sample data from the source table in the Studio -> Source page. Learn more in [Data Sampling](../datasource/data_sampling.en.md.

You can get a preliminary understanding of the source table data characteristics using table sampling. In general, table sampling will answer questions suca as following:

  1. How many rows are there in the table?
  2. What is the cardinality of each column? That is the amount of data that is not repeated.
  3. What are the characteristics of the column values for each column?
  1. Data Source Interface

As shown in the following diagram, we added all the tables in the sample SSB dataset in Hive. The data source area is on the left, and the source table information is on the right.

You can view the source table information on the right side. The Storage panel shows whether the source table data is loaded. The Columns panel shows the feature information of the source table field, and the Sampled Data panel shows the sample records and other statistics of the source table.

 

Load Data

Kyligence Enterprise applies pre-calculation technology to achieve sub-second query response time in the big data era. Data needs to be loaded into the system in order to speed up queries. The process of data loading is also the pre-calculation process where index is built. You can find out more about how to load data by looking at the Load Datasection.

In the smart mode, newly imported tables are set as Full Load by default. However, no data loading job will be triggered yet since no index hasd been defined on the new table. At this time, you can already run queries on the tables without acceleration. We recommend reading about Incremental Load below, and set time partition column on fact tables so the fact table can be loaded incrementally.

  1. Principle of loading data
    1. Time Partition Column: For source tables that will grow over time, we recommended selecting a suitable date column as the partition column, for example, the order create time column in an order table. We recommend setting the time partition column right after importing the source tables.
    2. Incremental Load: For tables with a time partition column, you can load new data incrementally by batch. For example, incrementally load new data daily in the order table. Incremental load eliminates the need to reprocess loaded data, this increases productivity and saves resources. Learn more in Load by Date/Time.
    3. Full Load: In the smart mode, newly imported tables are set as full load by default. Since no time partition column is defined, all data in a table is always loaded fully. Learn more in Full Load.
  2. How to load data

There are a few options to load data and build index.

    1. Incremental load: To load a table incrementally, first set the time partition column on the source table. Then go to the navigation bar Studio -> Source and select load data. Please input a time range and request the system to load the matching data. If the source table has index defined on it (created via accelerate query), the system will launch a new job to read in the data and build the index.
    2. Full load: New imported tables are full load by default. Every time a new index is created or an existing index is changed on a full load table, the system will rebuild the index content automatically. No manual operation is needed.
    3. Accelerate query: Accelerate query is another way to create and optimize the index. In the navigation bar Studio -> Acceleration, select to accelerate SQL statements. Based on the characteristics of the selected queries, the system will automatically create a new index or optimize the existing index. A new job will be launched to rebuild the index content on all loaded data that is impacted.
  1. View storage size

You can view the storage size of the index of all loaded data, in the navigation bar Studio -> Index If the storage size is 0.00 KB, then the index group has no data, which is normal when index is just created. If the storage size is greater than 0.00 KB, it means the index group has been loaded with data.

As shown in the following diagram, the index group AUTOMODELLINEORDER_1 has no data loaded yet. Then no query will be accelerated by index group AUTOMODELLINEORDER_1.

 

Data Exploration

You can submit a query to experience analyzing data in Kyligence Enterprise. In smart mode, all tables are full load by default, which means you can start analyzing immediately after importing tables. However, query acceleration is not available at this point so performance may not be optimal.

  1. Query Analysis Principles

Kyligence Enterprise supports standard SQL queries. You can run a query right after the tables are imported. Since no index is defined at the beginning, the query will be pushed down to the Hive and be executed without acceleration. When the amount of data is large and the cluster resources run short, it may take a long time to execute. You can read the Query Analysis section for a detailed explanation of SQL queries.

You can speed up the captured queries in Kyligence Enterprise. When a query is accelerated and similar queries get executed again, the system will leverage pre-calculated index accelerate the execution. We will introduce methods to speed up the query in the next section Acceleration.

Your history query will be saved in the Query -> History screen, view Query History for more information.

  1. Query Analysis Example

Let's take the first SQL query in the built-in user guide demo as an example. Navigate to Query Editor of Query -> Insight and enter the following SQL query. The data source that we use is the SSB dataset which simulates the transaction data of an online store. This SQL statement returns the sales revenue of goods with a quantity less than 25 under the specified discount range in 1993.

select sum(lo_revenue) as revenue
fromlineorder left join ssb.dates on lo_orderdate = d_datekey
where d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25

The query result is shown in the diagram below. You can see the query object is Hive in the query information, this indicates the query is pushed down to Hive and runs without acceleration. The result of the query shows the sales revenue under the specified conditions in the online store.

 

Acceleration Engine

Designing the model and index is a complex and challenging task. Analysts may just want to analyze the data and skip or minimize the work required to design the model and index.

In the smart mode, the system will create and optimize model and index transparently. All user needs to do is issuing queries and asking the system to accelerate. The system will learn from query history and data characteristics, then propose new models and index automatically behind the scene. The model concept is intentionally hidden from user in this mode. You can find out more in the Acceleration section.

You can view the queried SQL statements and ask the system to speed them up in the navigation bar Studio -> Acceleration. The acceleration process is the pre-calculation process of the data involved in the specified SQL statement. After the acceleration is completed, query the same or similar statement again and you can use the pre-calculated data to achieve fast results. The diagram below shows the interface of the acceleration engine.

The diagram below shows the page of the acceleration engine. After accelerating a SQL, try running it again (or a similar version, possibly a change in the where condition). You will find the query speed is significantly improved, and the query object is changed from Hive to the index automatically built by the system. You can view the details of the index in the Studio -> Index interface.

Job Monitor

Different jobs are triggered during the use Kyligence Enterprise, such as building index, loading data, and sampling table. You can view the job list in the navigation bar Monitor -> Job interface. For more detailed instructions, please see Monitor Job.

Job monitoring can help you effectively manage the workload of Kyligence Enterprise. You can check the status of the job to determine whether the operation is complete, whether the operating environment is stable, and so on. The following diagram shows the job monitoring interface in the built-in user guide demo when all jobs are successfully completed.

 

Want to know what Kyligence can do for you?