Kyligence Cloud – A Self-Managed Production Ready Platform

Author
Saikat Basu
Sr. Solution Architect
Dec. 15, 2021

Introduction

 

In this article, we discuss the components of the Kyligence Cloud platform that make it a production-ready, self-managed distributed computing system.

 

The core Kyligence engine is built on Apache Kylin, a query accelerator and index optimizer that offers sub-second response times for OLAP queries at petabyte scale. We will not discuss the query engine in this article. Instead, we highlight the other value-added features of Kyligence Cloud that make it a robust, self-managed, cost-effective, and production-ready platform.

 

The Kyligence Cloud platform includes not only a fail-safe query execution cluster but also many other value-added services, including an out-of-the-box monitoring and alerting system. This helps users minimize or eliminate production outages, performance degradation, and overload conditions for their business-critical applications.

 

Kyligence offers service monitoring and alerting APIs, as described here. However, users have to build their own application on top of them.

 

Another option is to use the out-of-the-box InfluxDB database and Grafana visualization server to easily build a UI-based system monitoring and alerting system. We explain this implementation in this blog.

 

Kyligence Cloud Architecture

 

The following figure shows the architecture of Kyligence Cloud including the monitoring system.

 
 

Readers can find more information about the Kyligence Cloud architecture, including details of each component, on the Kyligence website and in existing blogs.

 

Monitoring and Alerting System

 

In the following sections, we discuss one value-added service of Kyligence Cloud: out-of-the-box system health and performance monitoring with customizable alert mechanisms. It helps avoid degraded performance, outages, and system overload for mission-critical production applications.

 

The following diagram illustrates the components of Kyligence Cloud, with a built-in database for storing all events and a built-in visualization dashboard for displaying all operational metrics in real time.

 
Monitoring and alerting components
 

InfluxDB

 

As shown in the diagram above, Kyligence uses InfluxDB, an efficient, low-footprint time-series database, to record and store all transactional events occurring on the Kyligence Cloud platform.

 

The InfluxDB database server is deployed by default as an embedded component inside one of the Docker containers running on the Kyligence Cloud manager node. This type of deployment makes it modular and easily accessible from multiple endpoints, while keeping it lightweight and light on resources.

 

The following picture shows how Kyligence Cloud hosts multiple Docker instances for managing several Spark clusters, along with a dedicated Docker instance hosting the InfluxDB database.

 
Kyligence Cloud hosts multiple docker instances for managing several Spark clusters
 

Integrating InfluxDB with Kyligence Cloud

 

The default Kyligence Cloud configuration includes a connection definition for the InfluxDB server. Users can log in to the manager node and find the cloud deployment configuration in the cloud.properties file inside the /data1/kyligence_cloud/conf folder.

 
Default Kyligence Cloud configuration
 

This configuration file is pre-populated with the default IP address and port number of the InfluxDB database, as shown below.

 
pre-populated with default IP address
 

However, if users want to use their own database server as a single, integrated, company-wide central monitoring system, they have to change the two configuration parameters highlighted above to match their environment.
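As a minimal sketch of how these settings could be inspected from a script, the following Python parses a Java-style properties file and keeps the entries whose key mentions "influx". The file path comes from this article. The actual property names appear only in the screenshot, so the substring filter and the helper names used here are assumptions for illustration, not documented Kyligence settings.

```python
from pathlib import Path
from typing import Dict

# Path taken from the article; adjust it if your deployment differs.
CLOUD_PROPERTIES = Path("/data1/kyligence_cloud/conf/cloud.properties")


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key=value' lines, skipping blank lines and comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            props[key.strip()] = value.strip()
    return props


def influx_settings(props: Dict[str, str]) -> Dict[str, str]:
    """Keep only entries whose key mentions 'influx' (assumed naming)."""
    return {k: v for k, v in props.items() if "influx" in k.lower()}


def load_influx_settings(path: Path = CLOUD_PROPERTIES) -> Dict[str, str]:
    """Read the properties file from disk and return the InfluxDB entries."""
    return influx_settings(parse_properties(path.read_text()))
```

Calling `load_influx_settings()` on the manager node should list the address and port entries to review before pointing Kyligence Cloud at a different database server.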

 

Note also that Kyligence Cloud offers an HA (High Availability/Fail-Safe) deployment option. In this mode of deployment, there are two Kyligence Cloud manager nodes, each hosting a Docker instance with the InfluxDB server.

 

In this situation, the InfluxDB server on the default (active) manager node is used for recording and storing all events happening across the entire cluster network. If a failover happens and the stand-by manager node takes control, please make sure the stand-by InfluxDB server becomes active at the same time.
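One way to confirm which InfluxDB server is answering after a failover is to probe each manager node in turn. The sketch below is an illustration only: it assumes the servers expose the standard InfluxDB `/ping` HTTP endpoint, which replies with status 204 when the server is up, and the function names are our own.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def ping_influx(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/ping answers with HTTP 204."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            return resp.status == 204
    except OSError:  # covers URLError, refused connections and timeouts
        return False


def first_reachable(
    candidates: Iterable[str],
    probe: Callable[[str], bool] = ping_influx,
) -> Optional[str]:
    """Return the first candidate URL for which the probe succeeds, else None."""
    for url in candidates:
        if probe(url):
            return url
    return None
```

For example, `first_reachable(["http://manager-active:8086", "http://manager-standby:8086"])` returns the first server that responds. Both host names are hypothetical, and port 8086 is InfluxDB's usual default rather than a value confirmed for Kyligence Cloud.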

 

InfluxDB Health Check

 

If users decide to use the Kyligence out-of-the-box InfluxDB server, it is a good idea to verify that it is functioning properly: that the database server is running, that the database and tables have been created, and that the tables are populated with event data.

 

To do that, open a shell inside the InfluxDB Docker instance on the Kyligence Cloud manager node and run the following commands.

 
InfluxDB Health Check
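The same checks can also be scripted over HTTP. The following is a minimal sketch that assumes an InfluxDB 1.x-style `/query` endpoint accepting InfluxQL statements such as `SHOW DATABASES` and `SHOW MEASUREMENTS`. The InfluxDB version bundled with Kyligence Cloud is not stated in this article, so treat this as an illustration rather than the commands shown in the screenshot.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Optional


def query_url(base_url: str, influxql: str, db: Optional[str] = None) -> str:
    """Build a GET URL for the InfluxDB 1.x /query endpoint."""
    params = {"q": influxql}
    if db:
        params["db"] = db
    return f"{base_url}/query?{urllib.parse.urlencode(params)}"


def first_column(payload: dict) -> List[str]:
    """Collect the first column of every row in an InfluxDB 1.x JSON result."""
    names: List[str] = []
    for result in payload.get("results", []):
        for series in result.get("series", []):
            names.extend(row[0] for row in series.get("values", []))
    return names


def run_query(base_url: str, influxql: str, db: Optional[str] = None,
              timeout: float = 5.0) -> dict:
    """Send the statement to the server and return the decoded JSON response."""
    with urllib.request.urlopen(query_url(base_url, influxql, db), timeout=timeout) as resp:
        return json.load(resp)
```

With a reachable server, `first_column(run_query(base_url, "SHOW DATABASES"))` lists the databases, and passing `"SHOW MEASUREMENTS"` together with a `db` name lists the tables (measurements) in that database. An empty list suggests that nothing has been created yet.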
 

For configuration purposes, you can also verify the correct IP address and port number of the active InfluxDB server on the Kyligence Cloud manager node as follows.

 
InfluxDB Health Check (2)
 

Grafana Visualization Dashboard

 

For the final component of the production monitoring and alerting system, users can use either the Kyligence out-of-the-box Grafana server or their own instance. If their own instance is not available, they can easily download and install Grafana on another physical or virtual server.

 

Alternatively, it is easy to install and run another Docker instance dedicated to Grafana, as described in the Kyligence document here.

 

Please do not forget to configure the correct VPC/firewall inbound and outbound rules for the InfluxDB and Grafana servers according to the deployment platform, whether AWS or Azure. By default, the Grafana server listens on port 3000 for HTTP connections.
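A quick way to confirm that the firewall rules let traffic through is to attempt a TCP connection to each service from the machine that needs access. The sketch below is a generic reachability check. Port 3000 for Grafana comes from this article, while port 8086 for InfluxDB is the database's usual default and is an assumption for a Kyligence Cloud deployment. Host names are placeholders.

```python
import socket
from typing import Callable, Dict, Tuple


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False


def check_ports(
    targets: Dict[str, Tuple[str, int]],
    probe: Callable[[str, int], bool] = port_open,
) -> Dict[str, bool]:
    """Probe every named (host, port) target and report which ones are reachable."""
    return {name: probe(host, port) for name, (host, port) in targets.items()}
```

For instance, `check_ports({"grafana": ("grafana-host", 3000), "influxdb": ("manager-node", 8086)})` returns a dictionary of booleans. A `False` entry usually points to a missing inbound or outbound rule, or to a service that is not running.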

 
Grafana login
 

If users decide to use the Kyligence out-of-the-box Grafana instance, they can log in as the “admin” user with the password “admin”.

 

Kyligence also offers a couple of ready-to-use built-in dashboards in JSON format, covering most of the useful operational metrics available in the InfluxDB server.

 
built-in dashboards in JSON format
 

The KE dashboard offered by Kyligence includes several categories of operational metrics: cluster health monitoring, query execution monitoring, model building job monitoring, overall query latencies, model usage statistics, and so on, as shown in the picture below.

 
KE dashboard
 

Users can check system health and resource utilization at a glance, as follows.

 
system health and resource utilization at a glance
 

There are hundreds of operational metrics, covering the ZooKeeper instance, Azure, AWS, and other public cloud platforms, as well as the Kyligence Cloud application itself. All of them are stored in the InfluxDB database and are available for health checks and monitoring. The metric definitions and other information about Kyligence system monitoring are available on the Kyligence documentation website. Users can pick and choose among them to build customized dashboards.

As an example, for query cluster load monitoring, users may define a dedicated panel on the dashboard with a maximum query execution threshold configured, as shown below.

 
a dedicated panel on the dashboard
 

In this case, the user has set the maximum query threshold to 600 queries per minute (QPM). This is just an example; users can set this value to thousands of QPM according to their expected load.

It is easy to define a custom monitoring panel like this in Grafana. The user picks the corresponding metric from the drop-down list and then applies the appropriate function from the Grafana library as required.

 
Grafana library
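Behind the drop-down lists, Grafana's InfluxDB query editor produces an InfluxQL statement. The following sketch shows the general shape of such a statement, using the `$timeFilter` and `$__interval` macros that Grafana provides for InfluxDB data sources. The measurement and field names passed in are hypothetical, since the actual Kyligence metric names are not listed in this article.

```python
def panel_query(measurement: str, field: str, func: str = "mean",
                interval: str = "$__interval") -> str:
    """Build an InfluxQL statement of the kind Grafana's query editor generates."""
    return (
        f'SELECT {func}("{field}") FROM "{measurement}" '
        f"WHERE $timeFilter GROUP BY time({interval}) fill(null)"
    )
```

For a per-minute query count, one might call `panel_query("example_query_metric", "count", func="sum", interval="1m")` and compare the resulting series against the threshold line on the panel.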
 

Finally, users can define when they want to be alerted using the Kyligence metric and Grafana function library.

As an example, for cluster overload monitoring, users may decide to evaluate the rule every 30 minutes and check whether the transaction rate has stayed above the threshold (600 QPM) for 5 minutes. If that condition occurs, users are alerted automatically.

Such an alert condition can be defined on Grafana as follows.

 
alert condition on Grafana
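Grafana evaluates the alert rule itself, so no code is needed in practice. Purely to illustrate the logic of a sustained-threshold condition, here is a small sketch that assumes one QPM sample per minute. The 600 QPM threshold and the 5-minute window come from the example above, and the function name is our own.

```python
from typing import Sequence


def sustained_overload(qpm_samples: Sequence[float],
                       threshold: float = 600,
                       window: int = 5) -> bool:
    """True if the most recent `window` one-minute samples all exceed the threshold."""
    if window <= 0 or len(qpm_samples) < window:
        return False  # not enough data to decide
    return all(value > threshold for value in qpm_samples[-window:])
```

A single spike above 600 QPM does not trigger the condition. Only five consecutive above-threshold minutes do, which keeps short bursts from paging the administrator.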
 

Grafana offers dozens of channels that can be configured for alert messaging, including email and several instant-messaging options such as Slack. Users write a customized message while configuring the alert channel, as shown below.

 
Alert condition on Grafana (2)
 

Whenever the platform experiences a load greater than its expected capacity, it immediately alerts the system administrator to intervene.

The following is an example Slack alert for platform overload.

 
Slack alert for platform overload
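Grafana delivers this notification on its own once the channel is configured. For readers who want to test a Slack channel independently, the sketch below builds the kind of JSON body that a Slack incoming webhook accepts (a `text` field) and posts it. The message wording and the function names are illustrative assumptions, and the webhook URL must come from your own Slack workspace.

```python
import json
import urllib.request


def build_slack_payload(rule_name: str, current_qpm: int,
                        threshold_qpm: int = 600) -> bytes:
    """Encode an overload message as the JSON body of a Slack incoming webhook."""
    text = (f"[ALERT] {rule_name}: {current_qpm} QPM "
            f"exceeds threshold of {threshold_qpm} QPM")
    return json.dumps({"text": text}).encode("utf-8")


def send_to_slack(webhook_url: str, payload: bytes, timeout: float = 5.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling `send_to_slack(webhook_url, build_slack_payload("Cluster overload", 720))` should post a one-line message to the channel bound to that webhook.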
 

Summary

 

Users can find more detailed descriptions in the Kyligence documentation pages.

 

By following the guidelines and examples in this article, users can easily build a robust, real-time monitoring and alerting application using Kyligence Cloud’s built-in components.

 

This saves Kyligence customers the significant cost of building such an application themselves using the REST API.

 

At the same time, this type of monitoring and alerting system is extremely beneficial when a large organization uses Kyligence for its business-critical production platform.

 
