AWS Glue is a fully managed ETL (Extract, Transform, and Load) service that enables AWS users to easily and cost-effectively classify, cleanse, enrich, and move data between various data stores. AWS Glue consists of a central metastore called the AWS Glue Data Catalog, an ETL engine that can automatically generate code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there is no infrastructure to set up or manage.
At present, many users in the Kylin community use AWS EMR to run large-scale distributed data processing jobs on Hadoop, Spark, Hive, Presto, etc. Without the AWS Glue Data Catalog, tables built with one data warehouse component (such as Hive, Spark, or Presto) cannot be used by the others. Since the data warehouse must answer requirements from various business departments, these users configure the AWS Glue Data Catalog as the metadata store when creating their AWS EMR clusters, so that data sources can be shared among different components and business departments. That is, they build one data cube with data from each business department, so they can respond quickly to different business requirements.
In modern companies, data is stored on cloud object storage, and big data teams use AWS EMR for data processing, data analysis, and model training. But as data volumes explode, extracting data becomes difficult and response times grow too long. In other words, the EMR + Spark/Hive solution cannot meet the fast query requirements of data analysts, O&M staff, and sales. So some users turn to Apache Kylin as their open-source OLAP solution.
Recently, users approached us asking that Kylin 4 be able to read table metadata directly from AWS Glue. After some collaboration, Kylin 4 now supports the AWS Glue Catalog, making it possible for tables and data to be shared among Hive, Presto, Spark, and Kylin. This helps break down the metadata barrier, so different components can be combined to form a big data analysis platform.
Kylin versions that support AWS Glue
Note: the parameter hive.metastore.client.factory.class must be configured to enable AWS Glue. For details, refer to the command below.
aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark Name=ZooKeeper Name=Tez Name=Ganglia \
  --ec2-attributes ${} \
  --release-label emr-6.5.0 \
  --log-uri ${} \
  --instance-groups ${} \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]' \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --ebs-root-volume-size 100 \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --name 'Kylin4_on_EMR65_with_Glue' \
  --region cn-northwest-1
If you are using RDS or other metadata storage, you may skip this step.
An RDBMS is recommended as the metastore for Kylin 4. For testing purposes, this article uses the MariaDB instance that ships with the master node as the metastore; the hostname, account, and password of MariaDB can be found in /etc/hive/conf/hive-site.xml.
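To pull those connection values out of hive-site.xml, a small sketch can help. The extract_property helper below is a hypothetical convenience for this article; it assumes each <name> element is followed by its <value> on the next line, as in EMR's generated hive-site.xml, and that the standard javax.jdo.option property names are used.

```shell
# Pull the metastore connection values out of hive-site.xml.
# extract_property is a hypothetical helper; it assumes each <name>
# element is followed by its <value> on the next line.
HIVE_SITE=/etc/hive/conf/hive-site.xml

extract_property() {
  grep -A1 "<name>$1</name>" "$HIVE_SITE" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

if [ -f "$HIVE_SITE" ]; then
  extract_property javax.jdo.option.ConnectionURL       # JDBC URL (hostname)
  extract_property javax.jdo.option.ConnectionUserName  # account
  extract_property javax.jdo.option.ConnectionPassword  # password
fi
```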
kylin.metadata.url=kylin4_on_cloud@jdbc,url=jdbc:mysql://${HOSTNAME}:3306/hue,username=hive,password=${PASSWORD},maxActive=10,maxIdle=10,driverClassName=org.mariadb.jdbc.Driver
kylin.env.zookeeper-connect-string=${HOSTNAME}
Replace the variables with your actual values, for example, substitute the real password for ${PASSWORD}; save the file locally, as it will be used when starting Kylin.
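The substitution can also be scripted. In the sketch below, the HOSTNAME and PASSWORD values and the output file name are example placeholders, not values from the article; fill in your own MariaDB host and password.

```shell
# Render kylin.properties with real values in place of the placeholders.
# HOSTNAME and PASSWORD are example values - use your MariaDB master host
# and the password found in hive-site.xml.
HOSTNAME=emr-master.internal
PASSWORD=example-password

cat > kylin.properties <<EOF
kylin.metadata.url=kylin4_on_cloud@jdbc,url=jdbc:mysql://${HOSTNAME}:3306/hue,username=hive,password=${PASSWORD},maxActive=10,maxIdle=10,driverClassName=org.mariadb.jdbc.Driver
kylin.env.zookeeper-connect-string=${HOSTNAME}
EOF
```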
Test whether Spark SQL can access databases and table metadata through AWS Glue. On the first attempt, you will find that startup fails with an error.
Replace hive-site.xml used by Spark with the following commands.
cd /etc/spark/conf
sudo mv hive-site.xml hive-site.xml.bak
sudo cp /etc/hive/conf/hive-site.xml .
Then change the value of hive.execution.engine in /etc/spark/conf/hive-site.xml to mr, restart the Spark-SQL CLI, and verify that querying AWS Glue table data succeeds.
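That one-line edit can be done with sed. The set_engine_mr helper below is a hypothetical sketch: it assumes the <value> element sits on the line directly after the <name> element, as in EMR's generated file, so back up the file and verify the result before relying on it.

```shell
# Flip hive.execution.engine to mr in a hive-site.xml file.
# set_engine_mr is a naive sketch: it assumes <value> follows <name>
# on the next line, as in EMR's generated hive-site.xml.
set_engine_mr() {
  sed -i '/<name>hive.execution.engine<\/name>/{n;s:<value>[^<]*</value>:<value>mr</value>:;}' "$1"
}

# e.g. (run as root): set_engine_mr /etc/spark/conf/hive-site.xml
```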
This issue will be fixed in Apache Kylin 4.0.2, so you can skip this step after upgrading to 4.0.2. Users on Kylin 4.0.1 should follow these steps to replace kylin-spark-engine.jar:
Clone the Kylin git repository and execute mvn clean package -DskipTests to build a new kylin-spark-project/kylin-spark-engine/target/kylin-spark-engine-4.0.0-SNAPSHOT.jar.
git clone https://github.com/hit-lacus/kylin.git
cd kylin
git checkout KYLIN-5160
mvn clean package -DskipTests
find kylin-spark-project/kylin-spark-engine/target -name kylin-spark-engine-4.0.0-SNAPSHOT.jar
Patch link: https://github.com/apache/kylin/pull/1819
# download the package, e.g.
aws s3 cp s3://${BUCKET}/apache-kylin-4.0.1-bin-spark3.tar.gz .
# or: wget apache-kylin-4.0.1-bin-spark3.tar.gz
tar zxvf apache-kylin-4.0.1-bin-spark3.tar.gz
cd apache-kylin-4.0.1-bin-spark3
export KYLIN_HOME=/home/hadoop/apache-kylin-4.0.1-bin-spark3
If you are using other databases for metastore, please skip this step.
cd $KYLIN_HOME
mkdir ext
cp /usr/lib/hive/lib/mariadb-connector-java.jar $KYLIN_HOME/ext
AWS Spark has built-in support for AWS Glue, so you will use AWS Spark when loading table metadata and running build jobs. Kylin 4.0.1 officially supports Apache Spark, and because compatibility between Apache Spark and AWS Spark is not good, we will use Apache Spark for cube queries. To sum up, you need to switch between AWS Spark and Apache Spark according to the task at hand (query or build).
cd $KYLIN_HOME
aws s3 cp s3://${BUCKET}/spark-2.4.7-bin-hadoop2.7.tgz $KYLIN_HOME
# or download spark-2.4.7-bin-hadoop2.7.tgz from the official website
tar zxvf spark-2.4.7-bin-hadoop2.7.tgz
mv spark-2.4.7-bin-hadoop2.7 spark-apache
Kylin looks for Spark under $KYLIN_HOME/spark, which is also where SPARK_HOME points. Create a soft link named spark to the Spark distribution you want to use, for example:

ln -s spark-aws spark

To check which Spark a running SparkSQLCLIDriver is using, find the SparkSubmit process and inspect its spark.driver.extraClassPath:

jps -ml | grep SparkSubmit
jinfo ${PID} | grep "spark.driver.extraClassPath"

Because Kylin acts as the Spark driver when loading Glue table metadata, bin/kylin.sh must place the Glue client jar on the classpath: append it to KYLIN_TOMCAT_CLASSPATH and kylin_driver_classpath in kylin.sh, or copy it into $SPARK_HOME/jars.
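The kylin.sh classpath change can be sketched as follows. The jar search paths and the append_jar helper are assumptions for illustration; the actual location of the Glue client jar varies by EMR release.

```shell
# Locate the Glue client jar (search paths are an assumption for this sketch).
GLUE_JAR=$(find /usr/share/aws /usr/lib -name 'aws-glue-datacatalog-spark-client*.jar' 2>/dev/null | head -n 1)

# append_jar is a hypothetical helper: extend a colon-separated classpath.
append_jar() {
  if [ -z "$1" ]; then printf '%s' "$2"; else printf '%s:%s' "$1" "$2"; fi
}

# In bin/kylin.sh (illustrative; variable names as used in this article):
#   KYLIN_TOMCAT_CLASSPATH=$(append_jar "$KYLIN_TOMCAT_CLASSPATH" "$GLUE_JAR")
#   kylin_driver_classpath=$(append_jar "$kylin_driver_classpath" "$GLUE_JAR")
# Alternatively, copy the jar into $SPARK_HOME/jars instead of editing kylin.sh.
```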
cd $KYLIN_HOME
vim conf/kylin.properties
Start Kylin
cd $KYLIN_HOME
ln -s spark-aws spark  # skip this step if the soft link 'spark' already exists
bin/kylin.sh restart
(Optional) Replace kylin-spark-engine.jar
This step is only required for Kylin 4.0.1 users.
cd $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
mv kylin-spark-engine-4.0.1.jar kylin-spark-engine-4.0.1.jar.bak  # set aside the old jar
cp kylin-spark-engine-4.0.0-SNAPSHOT.jar .
bin/kylin.sh restart  # restart Kylin so the new jar is loaded
Load AWS Glue table and build
Switch the Spark used by Kylin and restart Kylin.
cd $KYLIN_HOME
rm spark  # 'spark' is a soft link pointing to AWS Spark
ln -s spark-apache spark  # switch from AWS Spark to Apache Spark
bin/kylin.sh restart
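Since this switch happens before every build or query phase, it can be wrapped in a small helper. The use_spark function is a hypothetical convenience; the directory names spark-aws and spark-apache follow the layout used earlier in this article.

```shell
# use_spark is a hypothetical helper that repoints the 'spark' soft link,
# switching Kylin between AWS Spark (build jobs) and Apache Spark (queries).
# Directory names spark-aws / spark-apache follow this article's layout.
use_spark() {
  # $1: Kylin home dir, $2: 'aws' or 'apache'
  rm -f "$1/spark"
  ln -s "spark-$2" "$1/spark"
}

# Usage: use_spark "$KYLIN_HOME" apache && "$KYLIN_HOME/bin/kylin.sh" restart
```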
Run a test query to confirm that queries now succeed.
Acting as the Spark driver, Kylin needs to load table metadata through aws-glue-datacatalog-spark-client.jar, so you must modify kylin.sh to add the relevant jar to the classpath of the Kylin process.
If you have any questions about using Kylin on AWS, please contact us via the mailing list (user@kylin.apache.org); for details, see https://kylin.apache.org/community/.