AWS Glue is a fully managed ETL (Extract, Transform, and Load) service that enables AWS users to easily and cost-effectively classify, cleanse, enrich, and move data between various data stores. AWS Glue consists of a central metastore called the AWS Glue Data Catalog, an ETL engine that can automatically generate code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there is no infrastructure to set up or manage.
At present, many users in the Kylin community use AWS EMR to run large-scale distributed data processing jobs on Hadoop, Spark, Hive, Presto, etc. Without the AWS Glue Data Catalog, tables built with one data warehouse component (such as Hive, Spark, or Presto) cannot be used by the others. Since the data warehouse must answer requirements from various business departments, these users configure the AWS Glue Data Catalog as the metadata store when creating their AWS EMR clusters, so that data sources can be shared among different components and business departments. That is, they build one data cube with data from each business department, so they can respond quickly to different business requirements.
In modern companies, data is stored on cloud object storage, and big data teams use AWS EMR for data processing, data analysis, and model training. But as data volumes explode, extracting data becomes difficult and response times grow too long. In other words, the EMR + Spark/Hive solution cannot meet the fast query requirements of data analysts, O&M staff, and sales. So some users turn to Apache Kylin as their open-source OLAP solution.
Recently, users approached us asking that Kylin 4 be able to read table metadata directly from AWS Glue. After some collaboration, Kylin 4 now supports the AWS Glue Catalog, making it possible for tables and data to be shared among Hive, Presto, Spark, and Kylin. This helps break down the metadata barrier, so different components can be combined to form a big data analysis platform.
Kylin versions that support AWS Glue
Note: the parameter hive.metastore.client.factory.class must be configured to enable AWS Glue. For details, refer to the command below.
aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark Name=ZooKeeper Name=Tez Name=Ganglia \
  --ec2-attributes ${} \
  --release-label emr-6.5.0 \
  --log-uri ${} \
  --instance-groups ${} \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]' \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --ebs-root-volume-size 100 \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --name 'Kylin4_on_EMR65_with_Glue' \
  --region cn-northwest-1
If you are using RDS or other metadata storage, you may skip this step.
An RDBMS is recommended as the metastore for Kylin 4. For testing purposes, this article uses the MariaDB instance that ships with the master node as the metastore; the hostname, account, and password of MariaDB can be found in /etc/hive/conf/hive-site.xml.
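To pull those connection values out of hive-site.xml, a small sketch can help. The extract_property helper below is a hypothetical convenience for this article; it assumes each <name> element is followed by its <value> on the next line, as in EMR's generated hive-site.xml, and that the standard javax.jdo.option property names are used.

```shell
# Pull the metastore connection values out of hive-site.xml.
# extract_property is a hypothetical helper; it assumes each <name>
# element is followed by its <value> on the next line.
HIVE_SITE=/etc/hive/conf/hive-site.xml

extract_property() {
  grep -A1 "<name>$1</name>" "$HIVE_SITE" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

if [ -f "$HIVE_SITE" ]; then
  extract_property javax.jdo.option.ConnectionURL       # JDBC URL (hostname)
  extract_property javax.jdo.option.ConnectionUserName  # account
  extract_property javax.jdo.option.ConnectionPassword  # password
fi
```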
kylin.metadata.url=kylin4_on_cloud@jdbc,url=jdbc:mysql://${HOSTNAME}:3306/hue,username=hive,password=${PASSWORD},maxActive=10,maxIdle=10,driverClassName=org.mariadb.jdbc.Driver
kylin.env.zookeeper-connect-string=${HOSTNAME}
Replace the variables with your actual values, for example, substitute the real password for ${PASSWORD}; save the file locally, as it will be used when starting Kylin.
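The substitution can also be scripted. In the sketch below, the HOSTNAME and PASSWORD values and the output file name are example placeholders, not values from the article; fill in your own MariaDB host and password.

```shell
# Render kylin.properties with real values in place of the placeholders.
# HOSTNAME and PASSWORD are example values - use your MariaDB master host
# and the password found in hive-site.xml.
HOSTNAME=emr-master.internal
PASSWORD=example-password

cat > kylin.properties <<EOF
kylin.metadata.url=kylin4_on_cloud@jdbc,url=jdbc:mysql://${HOSTNAME}:3306/hue,username=hive,password=${PASSWORD},maxActive=10,maxIdle=10,driverClassName=org.mariadb.jdbc.Driver
kylin.env.zookeeper-connect-string=${HOSTNAME}
EOF
```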
Test whether Spark SQL can access databases and table metadata through AWS Glue. On the first attempt, you will find that startup fails with an error.
Replace hive-site.xml used by Spark with the following commands.
cd /etc/spark/conf
sudo mv hive-site.xml hive-site.xml.bak
sudo cp /etc/hive/conf/hive-site.xml .
Then change the value of hive.execution.engine in /etc/spark/conf/hive-site.xml to mr, restart the Spark-SQL CLI, and verify that querying AWS Glue table data succeeds.
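That one-line edit can be done with sed. The set_engine_mr helper below is a hypothetical sketch: it assumes the <value> element sits on the line directly after the <name> element, as in EMR's generated file, so back up the file and verify the result before relying on it.

```shell
# Flip hive.execution.engine to mr in a hive-site.xml file.
# set_engine_mr is a naive sketch: it assumes <value> follows <name>
# on the next line, as in EMR's generated hive-site.xml.
set_engine_mr() {
  sed -i '/<name>hive.execution.engine<\/name>/{n;s:<value>[^<]*</value>:<value>mr</value>:;}' "$1"
}

# e.g. (run as root): set_engine_mr /etc/spark/conf/hive-site.xml
```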
This issue will be fixed in Apache Kylin 4.0.2, so you can skip this step after upgrading to 4.0.2. Users on Kylin 4.0.1 should follow these steps to replace kylin-spark-engine.jar:
Clone the Kylin git repository and execute mvn clean package -DskipTests to build a new kylin-spark-project/kylin-spark-engine/target/kylin-spark-engine-4.0.0-SNAPSHOT.jar.
git clone https://github.com/hit-lacus/kylin.git
cd kylin
git checkout KYLIN-5160
mvn clean package -DskipTests
find kylin-spark-project/kylin-spark-engine/target -name kylin-spark-engine-4.0.0-SNAPSHOT.jar
Patch link: https://github.com/apache/kylin/pull/1819
# download the package, e.g.
aws s3 cp s3://${BUCKET}/apache-kylin-4.0.1-bin-spark3.tar.gz .
# or: wget apache-kylin-4.0.1-bin-spark3.tar.gz
tar zxvf apache-kylin-4.0.1-bin-spark3.tar.gz
cd apache-kylin-4.0.1-bin-spark3
export KYLIN_HOME=/home/hadoop/apache-kylin-4.0.1-bin-spark3
If you are using other databases for metastore, please skip this step.
cd $KYLIN_HOME
mkdir ext
cp /usr/lib/hive/lib/mariadb-connector-java.jar $KYLIN_HOME/ext
AWS Spark has built-in support for AWS Glue, so you will use AWS Spark when loading table metadata and running build jobs. Kylin 4.0.1 officially supports Apache Spark, and because compatibility between Apache Spark and AWS Spark is not good, we will use Apache Spark for cube queries. To sum up, you need to switch between AWS Spark and Apache Spark according to the task at hand (query or build).
cd $KYLIN_HOME
aws s3 cp s3://${BUCKET}/spark-2.4.7-bin-hadoop2.7.tgz $KYLIN_HOME
# or download spark-2.4.7-bin-hadoop2.7.tgz from the official website
tar zxvf spark-2.4.7-bin-hadoop2.7.tgz
mv spark-2.4.7-bin-hadoop2.7 spark-apache
Kylin looks for Spark under $KYLIN_HOME/spark, which is also where SPARK_HOME points. Create a soft link named spark to the Spark distribution you want to use, for example:

ln -s spark-aws spark

To check which Spark a running SparkSQLCLIDriver is using, find the SparkSubmit process and inspect its spark.driver.extraClassPath:

jps -ml | grep SparkSubmit
jinfo ${PID} | grep "spark.driver.extraClassPath"

Because Kylin acts as the Spark driver when loading Glue table metadata, bin/kylin.sh must place the Glue client jar on the classpath: append it to KYLIN_TOMCAT_CLASSPATH and kylin_driver_classpath in kylin.sh, or copy it into $SPARK_HOME/jars.
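The kylin.sh classpath change can be sketched as follows. The jar search paths and the append_jar helper are assumptions for illustration; the actual location of the Glue client jar varies by EMR release.

```shell
# Locate the Glue client jar (search paths are an assumption for this sketch).
GLUE_JAR=$(find /usr/share/aws /usr/lib -name 'aws-glue-datacatalog-spark-client*.jar' 2>/dev/null | head -n 1)

# append_jar is a hypothetical helper: extend a colon-separated classpath.
append_jar() {
  if [ -z "$1" ]; then printf '%s' "$2"; else printf '%s:%s' "$1" "$2"; fi
}

# In bin/kylin.sh (illustrative; variable names as used in this article):
#   KYLIN_TOMCAT_CLASSPATH=$(append_jar "$KYLIN_TOMCAT_CLASSPATH" "$GLUE_JAR")
#   kylin_driver_classpath=$(append_jar "$kylin_driver_classpath" "$GLUE_JAR")
# Alternatively, copy the jar into $SPARK_HOME/jars instead of editing kylin.sh.
```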
cd $KYLIN_HOME
vim conf/kylin.properties
Start Kylin
cd $KYLIN_HOME
ln -s spark-aws spark  # skip this step if the soft link 'spark' already exists
bin/kylin.sh restart
(Optional) Replace kylin-spark-engine.jar
This step is only required for Kylin 4.0.1 users.
cd $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
mv kylin-spark-engine-4.0.1.jar kylin-spark-engine-4.0.1.jar.bak  # set aside the old jar
cp kylin-spark-engine-4.0.0-SNAPSHOT.jar .
bin/kylin.sh restart  # restart Kylin so the new jar is loaded
Load AWS Glue table and build
Switch the Spark used by Kylin and restart Kylin.
cd $KYLIN_HOME
rm spark  # 'spark' is a soft link pointing to AWS Spark
ln -s spark-apache spark  # switch from AWS Spark to Apache Spark
bin/kylin.sh restart
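Since this switch happens before every build or query phase, it can be wrapped in a small helper. The use_spark function is a hypothetical convenience; the directory names spark-aws and spark-apache follow the layout used earlier in this article.

```shell
# use_spark is a hypothetical helper that repoints the 'spark' soft link,
# switching Kylin between AWS Spark (build jobs) and Apache Spark (queries).
# Directory names spark-aws / spark-apache follow this article's layout.
use_spark() {
  # $1: Kylin home dir, $2: 'aws' or 'apache'
  rm -f "$1/spark"
  ln -s "spark-$2" "$1/spark"
}

# Usage: use_spark "$KYLIN_HOME" apache && "$KYLIN_HOME/bin/kylin.sh" restart
```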
Run a test query to confirm that queries now succeed.
Acting as the Spark driver, Kylin needs to load table metadata through aws-glue-datacatalog-spark-client.jar, so you must modify kylin.sh to add the relevant jar to the classpath of the Kylin process.
If you have any questions about using Kylin on AWS, please contact us via the mailing list (user@kylin.apache.org); for details, see https://kylin.apache.org/community/.