This article is part of a series of conversations with the founding members of Apache Kylin and Kyligence on the origins of Apache Kylin. You can find the first installment here: Episode One.
Episode Two: The Journey Begins
Apache Kylin is an open source distributed analytical data warehouse built with big data in mind. Via a clever combination of multi-dimensional cubes, plug-in architecture and precomputation technology, Kylin can provide near-constant, sub-second query latency no matter the size of your dataset, cutting the time and manpower adopters need for effective analysis.
This recipient of multiple industry awards has been adopted by over a thousand organizations worldwide seeking a solution to the problem of storing and analyzing big data fast enough for their insights to make an impact on their business.
This is the origin story of the unexpected hero of modern big data analytics, Apache Kylin, as told by its inventors.
As we continue our conversation on the origin of the open source, top-level Apache Software Foundation project, Kylin, with its six founding members, I’d like to learn more about the specific role each of you has played in making Apache Kylin the globally respected platform that it is today.
What roles has each of you held in Kylin over the years, and what features have you developed or contributed? I’m sure there are too many to name, so can you give us a few key highlights?
L U K E: The very first role I had on Kylin was just as a developer. I oversaw one main component. I also served as a product manager and took charge of talking to management and the clients inside eBay.
The metadata component was my initial focus. The metadata is the core of the project where we define the data models, illustrate how we combine them into dimensions and metrics, how we store and optimize that, and all the things we need to extract to make our system work. I wrote the entire model in the beginning.
When I was a developer, I just focused on my coding, but when I began to market the product within eBay, and later on with the Apache Software Foundation, a lot of things changed in the way I communicated – this is one reason I think I was picked to be Kylin’s first Project Management Committee Chair. The PMC Chair is a community role, so I did a lot of communication in that position.
The skills I developed there have helped me in a lot of ways – being able to chat with different people, bring differing opinions together and find common ground, to resolve conflict and know how to prioritize all the right stuff. This has changed me a lot from a very technical guy, sort of a geek, into a product manager and leader.
Conflict resolution is no small talent – and it’s particularly essential when you work with such a passionate team in an international community. Shaofeng, you are Kylin’s current PMC Chair. What was your journey to the PMC like?
S H A O F E N G: When I joined the project as a software engineer, I first started working with metadata and the build engine, but I quickly expanded to other areas. Later, I became a Committer and contributed regularly by working on releases, answering questions in the community, putting up new features, and, in time, I was recognized as a PMC member.
Last year, I became the PMC Chair for Apache Kylin, taking over the role from Luke. I’ve worked on many features in my time with Kylin, and I’ve published several technical blogs about them on the Apache Kylin website. For example, I developed near-real-time streaming, the Spark cubing engine, the Top-N measure, and the fast-cubing algorithm. These are the big features that I’ve worked on.
I look forward to reading your blogs, Shaofeng. Jason, you have contributed very heavily to the Kylin community. Where did you start?
J A S O N: At first, I was a front-end programmer. I wrote the front-end code and also some REST API. We only had one front-end programmer at that time – me. I’m now the Senior Director responsible for the South China Business group with Kyligence, the enterprise-ready version of Kylin. There are so many features I’ve worked on I couldn’t tell you one by one. It’s been three years since I worked on Kylin directly and I am still listed as one of the top six Contributors on Apache Kylin.
That’s incredible! Hongbin, as one of the earliest members I am sure you’ve held a lot of positions.
H O N G B I N: Yes, I have. At first, I was a Junior Software Engineer. I joined this project very early. Yang joined just a few days before me. I’ve played many roles over the past four years. First, I became Senior Software Engineer. Later, I became a team leader, then Director and now I am VP of Engineering at Kyligence, so this has been a really fast-paced career path for me.
I mainly worked on the query engine part of the project. I remember in the first few months after I joined the Apache Kylin team, I was assigned a task to complete the ODBC driver for the query engine. I was very young and passionate at the time, so I did a lot of research and finished the prototype within one week, which I felt was a very impressive accomplishment.
That level of commitment and passion is very impressive. You have all demonstrated incredible drive and vision throughout this project. Yang, could you tell us some more about the beginning of it all?
Y A N G: In the early days, it was more of a transition period between Xu Jiang, the initial technical lead, and me. In the beginning, we worked in a peer-to-peer style. We didn’t have a lot of people, so there were no politics. It was just a group of technical engineers with the passion to create something new. After Xu left, I was the most senior member, so people would come to me for decision making or for guidance if they were in doubt.
Is there anything in particular that stands out in your mind as the secret to Kylin’s success?
Y A N G: I remember, apart from building the product itself, the team put quite a lot of effort into marketing, and I would say that was quite an important thing in the early days. It was led by me and Luke mostly, but the whole team was involved. Luke and I were more media-facing. I think if we were to say why Apache Kylin was the first successful open source project from China – and successful internationally, at that – I think the major difference for this project was the marketing.
I think Luke played a very good role in this. Thanks to eBay’s connection with Silicon Valley, we participated in a lot of global conferences starting very early on in Apache Kylin’s development and we made some big-name connections, which was a very important element of our success, I think.
That makes sense. Could you tell us about some of the features you’ve developed?
Y A N G: What features did I work on? Perhaps, everything! But if I had to pick one key thing, I would say that is the plug-in architecture beneath the surface of Apache Kylin, which was mostly driven by me.
Kylin is a precalculation engine, and the platform has three major components: the build engine, the storage engine, and the query engine. The build engine is where data gets loaded and precalculated, the storage engine is where the calculated data is stored (in the early days it was HBase), and the query engine is where the SQL is executed, analyzed and optimized based on what we have precalculated and stored.
So, we have what we call a plug-in architecture where the three major pieces have very clear interfaces and can be replaced without impacting the other components too much. We created this design roughly between versions 0.6 and 1.0. It turned out to be quite useful because we did replace the three parts gradually over time.
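To make the plug-in idea concrete, here is a minimal Python sketch of three components that talk to each other only through small interfaces. It is an illustration under stated assumptions, not Kylin's actual code or API: every class and method name (StorageEngine, BuildEngine, QueryEngine, sum_where and so on) is hypothetical, the "cube" only precalculates sums, and an in-memory dictionary stands in for a real store such as HBase.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from itertools import combinations

# All names below are hypothetical; this is not Kylin's real API.


class StorageEngine(ABC):
    """Where precalculated results live. Any implementation can be swapped in."""

    @abstractmethod
    def put(self, cuboid, key, value): ...

    @abstractmethod
    def get(self, cuboid, key): ...


class InMemoryStorage(StorageEngine):
    """Toy stand-in for a real storage backend."""

    def __init__(self):
        self._data = {}

    def put(self, cuboid, key, value):
        self._data[(cuboid, key)] = value

    def get(self, cuboid, key):
        return self._data.get((cuboid, key), 0)


class BuildEngine(ABC):
    """Loads raw rows and precalculates aggregates into a StorageEngine."""

    @abstractmethod
    def build(self, rows, dimensions, measure, storage): ...


class SimpleBuildEngine(BuildEngine):
    """Precalculates SUM(measure) for every combination of dimensions."""

    def build(self, rows, dimensions, measure, storage):
        for size in range(len(dimensions) + 1):
            for cuboid in combinations(dimensions, size):
                totals = defaultdict(int)
                for row in rows:
                    key = tuple(row[d] for d in cuboid)
                    totals[key] += row[measure]
                for key, total in totals.items():
                    storage.put(cuboid, key, total)


class QueryEngine:
    """Answers SUM queries by looking up precalculated results only."""

    def __init__(self, storage, dimensions):
        self._storage = storage
        self._dimensions = dimensions

    def sum_where(self, **filters):
        # Choose the cuboid made of the filtered dimensions, in canonical order.
        cuboid = tuple(d for d in self._dimensions if d in filters)
        key = tuple(filters[d] for d in cuboid)
        return self._storage.get(cuboid, key)


rows = [
    {"country": "US", "year": 2015, "sales": 10},
    {"country": "US", "year": 2016, "sales": 5},
    {"country": "CN", "year": 2015, "sales": 7},
]
dimensions = ("country", "year")

storage = InMemoryStorage()
SimpleBuildEngine().build(rows, dimensions, "sales", storage)
query = QueryEngine(storage, dimensions)

print(query.sum_where(country="US"))  # 15
print(query.sum_where(year=2015))     # 17
print(query.sum_where())              # 22
```

Because the query engine only depends on the StorageEngine interface, a different backend or a different build engine could be substituted without touching the other two pieces, which is the property described above.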
In the early days, the build engine was mostly MapReduce and, as time passed, we knew MapReduce was going to go away and be replaced by Spark, and that did happen recently. The same thing happened in the storage part. HBase was the most balanced storage engine we could find in the early days. It had everything we needed – cache, storage, and good throughput, but it’s not a fully distributed calculation engine.
HBase is not perfect. It has a coprocessor, which is at least something, but it’s a weak form of distributed calculation. The query engine was initially based on Calcite, an open source query engine. In the enterprise version, we replaced it with Spark. We’re now working on contributing that Spark query engine back to the open source community.
So, I think that was a pretty clever move on our part in the early days. We set the foundation for these major parts to be able to evolve, which prepared us pretty well to adapt. As the technical trends move forward and develop, Kylin is able to catch up whenever we want it to.
It certainly sounds like that level of flexibility would come in handy and help Kylin evolve with related technology. Dong, could you tell us about your experience working on Kylin?
D O N G: Absolutely. My first project with Apache Kylin was to implement some additional BI integrations. Before I joined, Kylin could only be connected with Tableau for BI users. My first job was to expand Kylin to more BI tools like Excel and Power BI, and so I took on the ODBC driver component, which is a connector between Kylin and different BI tools.
Kylin started out as just a query engine. It was just a server running on a machine, so the challenge was – how can it be used and leveraged by business users and BI analysts on their desktop or monitor? They can only use Tableau or Excel to do their analytics, so we needed a connector between the BI tools and Apache Kylin.
Over the following two months, at the end of 2015, I quickly got involved in many core components of Kylin such as the query engine, job engine, and the original Spark cubing engine. I contributed a lot of enhancements, features, and bug fixes and was invited to become a Committer by the end of the year.
I also contributed a new component called Diagnose. In the beginning, a lot of users in the community ran into problems when they used Kylin. When they built cubes or wrote queries and needed help from the community, they had to write an email and attach a lot of information such as metadata, screenshots and so on.
Sometimes, even when they had attached a lot of information, it wasn’t enough for us to find the root cause of their issue and solve the problem. That system was very inefficient. To improve it, I came up with Diagnose.
With this feature, users can generate the diagnostic package from Apache Kylin, which contains all the information normally requested compressed into a single zip file that can just be attached to the email and sent to the community. The experts can then investigate the package and find the root cause of the problem, which has significantly reduced the turnaround time for solving user problems.
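The idea of a single diagnostic package can be sketched in a few lines of Python. This is only an illustration of the bundling step, not how Diagnose itself is implemented: the function names, the file names inside the archive, and the choice of contents (metadata, log lines, settings) are all assumptions made for the example.

```python
import io
import json
import zipfile

# Hypothetical helper names and archive layout; not the real Diagnose format.


def build_diagnostic_package(metadata, log_lines, config):
    """Bundle metadata, logs and settings into one in-memory zip; return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
        archive.writestr("logs/server.log", "\n".join(log_lines))
        settings = "\n".join(f"{key}={value}" for key, value in sorted(config.items()))
        archive.writestr("conf/settings.properties", settings)
    return buffer.getvalue()


def list_package(package_bytes):
    """Return the sorted file names inside a package, as a reviewer might inspect it."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        return sorted(archive.namelist())


package = build_diagnostic_package(
    metadata={"cube": "sales_cube", "status": "ERROR"},
    log_lines=["build started", "build failed at step 3"],
    config={"engine": "mapreduce", "storage": "hbase"},
)
print(list_package(package))
```

The point of the design is that the person asking for help sends one attachment, and the person investigating always finds the same kinds of information in the same places.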
After I joined Kyligence, I developed the Kyligence Robot to help Kylin users. I led the team to build an online service called Kyligence Robot to help the community users perform some self-service diagnostics and monitoring.
With this product, the community can upload their Diagnose package to the Kyligence Robot, which has an intelligence engine that is able to analyze the package and generate some health reports, do some problem analysis and provide a solution with some detailed information and metrics to help users solve the problem by themselves, which is much more efficient.
Aside from contributing my code, I was mainly an evangelist for Apache Kylin. I’ve given a lot of technical speeches on Kylin. I first spoke at several open source Meetups in China. I’ve also had a lot of chances to speak at global conferences where I was able to make international friends and attract a lot of new technologies from around the world to Apache Kylin.
For example, I had the chance to speak at the O’Reilly Strata Hadoop conference, which is the number one global conference in the big data industry. I also had the chance to visit Tokyo, Singapore and the U.S.
Even within China, I’ve had a lot of opportunities to visit Kylin users to communicate with them and learn from them, to get new ideas and new scenarios to help us better understand how the community is using Apache Kylin to make a change to, and impact on, their business. So, from this role of evangelist I have had more chances to get a broader view of this technology than I ever would have in a purely developer role.
I joined the PMC in 2016, just one year after I became a Committer. Once I became a PMC member, I had the opportunity to write releases and do some project management for Apache Kylin. That experience planted the seed for me to become a manager because now I have transformed from a developer into a product manager at Kyligence, as well. Membership with the Kylin PMC helped me to start a new career path.
You have all been exceptionally busy over the last seven years! In our next conversation, I want to hear more about Apache Kylin’s most significant milestones and key achievements from your perspectives.
Stay tuned for the next episode! If you missed the first installment in this series, catch up here to learn about The Rise of Kylin.
About the Founders
Luke Han is the Co-Founder and CEO of Kyligence and Co-Founder of Apache Kylin, the first Apache Software Foundation top-level project developed in China. He is responsible for Kylin's strategic planning, development roadmap, product design, and more, and is committed to developing the Apache Kylin global community and ecosystem. He has served as Head of Big Data Products in eBay's Global Analytics Infrastructure Division, Chief Advisor to Actuate China, and Technical Director of Power Excellence East China.
Yang Li is the Co-Founder and CTO of Kyligence, Co-Founder of Apache Kylin and member of the Project Management Committee (PMC). Previously, he was the Senior Architect of Big Data in eBay's Global Analytics Infrastructure, Vice President at Morgan Stanley, and during his time with IBM, he received the Outstanding Technology Contribution Award. Yang has more than 10 years of hands-on experience in big data analytics; he has focused on parallel computing, data indexing, relational mathematics, approximation algorithms, compression algorithms and other cutting-edge technologies. Over the past 15 years, Yang has directly driven the development of OLAP technology in the big data space.
Dong Li is the Founding Member and Senior Director of Product and Innovation at Kyligence, an Apache Kylin Core Developer (Committer) and member of the Project Management Committee (PMC) where he focuses on big data technology development. Previously, he was a Senior Engineer in eBay's Global Analytics Infrastructure Department, a Software Development Engineer for Microsoft Cloud Computing and Enterprise Products, and a core member of the Microsoft Business Products Dynamics Asia Pacific team where he participated in the development of a new generation of cloud-based ERP solutions.
Shaofeng Shi is a Partner and Chief Software Architect at Kyligence, Apache Kylin Core Developer (Committer) and Chairman of the Project Management Committee (PMC Chair) where he focuses on big data analytics and cloud computing technologies. Previously, he was a Senior Data Engineer in eBay's Global Analytics Infrastructure Department and a Cloud Computing Software Architect at IBM.
Hongbin Ma is the Vice President of Research and Development at Kyligence, an Apache Kylin Core Developer (Committer) and member of the Project Management Committee (PMC) where he focuses on big data infrastructure and platforms. He joined eBay as Apache Kylin's Chief Committer. Previously, he was a core contributor to Trinity, Microsoft Research Asia's graph database. He has contributed to Apache Kylin's storage engine, query optimization, test coverage and other areas and is currently the technical leader of Kyligence Enterprise data warehouse products.
Jason Zhong is a Partner and Senior Director at Kyligence, an Apache Kylin Core Developer (Committer) and member of the Project Management Committee (PMC). He has worked in eBay's Global Analytics Infrastructure Division and been involved in operational automation product development as well as Kylin's development. After joining Kyligence, he worked in both research and development before becoming responsible for business sales and business development transformation. He has won consecutive Kyligence sales titles and is currently the Head of Kyligence South Division.
About the Author
Samantha Berlant is the Marketing Communications Manager at Kyligence and a big fan of AI, machine learning, and science-fiction. She spent several years leading content analytics projects at Facebook and Instagram and has been a writer and editor for over a decade. Samantha believes in the power of accessible data and her favorite Star Trek character is, coincidentally, Data.