This article is part of a series of conversations with the founding members of Apache Kylin and Kyligence on the origins of Apache Kylin. You can find the first five installments here: Episode One, Episode Two, Episode Three, Episode Four, Episode Five.
Episode Six: Overcoming Obstacles
Apache Kylin is an open source distributed analytical data warehouse built with big data in mind. Via a clever combination of multi-dimensional cubes, plug-in architecture, and precomputation technology, Kylin can provide near-constant, sub-second query latency no matter the size of your dataset – cutting the time and manpower adopters need for effective analysis.
This recipient of multiple industry awards has been adopted by over a thousand organizations worldwide seeking a solution to the problem of storing and analyzing big data fast enough for their insights to make an impact on their business.
This is the origin story of the unexpected hero of modern big data analytics, Apache Kylin, as told by its inventors.
Once again, I’m delighted to sit down and speak with the six founding members of both Apache Kylin and its enterprise-ready counterpart, Kyligence. Continuing our conversation on what you’ve learned by working on Kylin, let’s talk about the obstacles you’ve faced. What has been most challenging about this project?
L U K E: A very challenging moment for us happened about three months after we joined the Apache Software Foundation. We sent our source code to the community and very quickly a lot of people were trying to use it. At the very beginning, a lot of people gave us very positive feedback, but nobody had used it yet; they just liked what we were doing.
After three months, we started getting a lot of complaints from the community. Everybody was saying “Hey, you guys suck. I can’t install it on my cluster, I can’t even compile data.” When we started out, we were just serving the eBay environment and hadn’t thought about any other possibilities – we had just made sure it worked internally. When other people took it and tried to install it themselves, we found out – “Oh my god, it doesn’t work!” Some of the users could conquer the issues themselves but most of them couldn’t.
So, we got a lot of challenges all at once and, just before the Chinese New Year in January 2015, we decided to stop developing any new features. We just wanted to make sure everyone could successfully install and launch. At that moment, we decided on a one-step installation method as our goal and we said if we couldn’t make it, we wouldn’t take our vacation. So, everyone was working very hard on it at that time. We had a very small team then, just five or six people. Finally, we made it – at the last minute before our New Year’s goal, so we were able to go on our vacation after all!
We made it so Kylin just needed to be downloaded and then one single command would help you to install it on your cluster. After that, we felt very good because just a few weeks later the community was no longer complaining. They started discussing and asking questions about how do you manage this, why are you using that algorithm, why are you doing this like that, and how can you help me build the cubes for my dataset – which are much more in line with the conversations we wanted to be having with the community.
That’s an amazing story. I’m glad to hear you were able to take your holiday! That just goes to show what an incredibly dedicated team you were, and still are. What obstacles did you encounter within eBay as you were getting the project started?
Y A N G: Getting sponsors for a project in eBay is a very challenging thing. The sponsors come in two levels, the first is the high-level sponsorship – you need a team and a budget, then you can hire more teammates to do more work. In the end, it was a manager from the local team in China that gave us a small sponsorship that allowed us to start with a small team. Debashis Sasha, who was the VP of Data Platforms, was our biggest sponsor early on in Kylin’s development. From an execution point of view, Vivian Tian, who was the local VP of the CCOE (China Center of Excellence) where we were based, gave us the manpower we needed to get started and build a team.
I don’t think we got a lot of support from eBay U.S. in the early days. I remember Luke told me a story once about when Xu Jiang (Kylin’s first technical lead) travelled to the U.S. to discuss the idea of cube precomputation on big data with an architect over there and it didn’t end up going very well – or more specifically, the story goes that Xu slapped the table and left.
Basically, nobody at that time believed this was going to work. We had cube technology and we knew the challenges of big data. Doing precomputations is basically saying “I’m going to precompute the already big data and produce even bigger data as a precomputed result.” A lot of people thought that wasn’t going to work because it was already hard enough to handle that amount of big data, and now we were talking about producing exponentially more. So, getting people to understand what we were trying to do was very difficult.
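To make the "exponentially more" concern concrete, here is a minimal sketch, assuming a full cube with no pruning: every subset of the dimensions becomes one precomputed aggregation (a cuboid), so N dimensions give 2^N cuboids. The function name and the dimension names are hypothetical and chosen only for illustration; this is not Kylin's actual code.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions; each subset is one cuboid.

    Assumption: a full cube with no pruning, so all 2^N subsets are kept.
    """
    cuboids = []
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            cuboids.append(combo)
    return cuboids


# Hypothetical dimension names, for illustration only.
dims = ["seller", "category", "country", "day"]
cuboids = enumerate_cuboids(dims)
print(len(cuboids))  # 16, i.e. 2 ** 4

# The count doubles with every added dimension.
for n in (10, 20, 30):
    print(n, 2 ** n)
```

This counts group-by combinations, not bytes. How much storage each cuboid actually takes depends on the cardinality of the data, which the sketch does not model.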
Another level of sponsorship we needed was from the internal users of eBay. In order to make a project go forward, you have to first get a lot of support, quickly create a prototype, and then sell that prototype to your potential users in eBay. Once you get their buy-in, then you get a second round of funding.
Once we got the prototype ready, we tried to make our first group of users happy, which was a challenge as well. My impression was that our early service in eBay was not very stable. To have a stable production environment, you need a QA dev environment before you put everything into production.
At that time in eBay, we were struggling with the fact that we were working with big data. In order to have a very strict separation of production and QA, you need to copy the whole production resource into QA – basically doubling the hardware resources you need, and it was a big Hadoop cluster. It’s very difficult to have that amount of free resources and we didn’t have that. So, in the early days, the QA environment and the production environment inside eBay were on the same Hadoop cluster. That can cause a lot of problems because testing can, and likely will, impact production when both are on the same cluster. We were 100% reliant on whatever Hadoop offered for resource isolation. Hadoop has some mechanisms there, but it’s not what I would call production level.
Among the early team members, everyone was an operations guy for eBay’s early Kylin deployment. We put a lot of manpower into maintaining the early Kylin cluster. We did a lot of QA testing and moved data from QA to production mostly manually and very carefully, because we were mixing two sets of Kylin services in one cluster. It took a lot of manpower to keep the whole thing stable and to have a reasonable release process, where things are tested in QA first before they move to production. To solve the maintenance and manpower problem, we ended up with a release tool that migrates data assets from QA to production automatically and safely.
That sounds like an amazing amount of work to keep things stable and moving forward. Do you encounter any challenges when you’re teaching people about Apache Kylin or do those conversations go more smoothly now that you’ve been around for several years?
S H A O F E N G: How difficult it is to explain Kylin to a person depends on their experience level. It’s easy to introduce Apache Kylin to someone if they already have domain knowledge of data warehousing, but for someone who doesn’t have experience using a data warehouse, it’s a little difficult to introduce Kylin. The learning curve for getting started on Kylin is still a little steeper than for other technologies.
J A S O N: Exactly. Most of the time it’s not easy to explain Kylin to people who aren’t familiar with the technology we use, but it depends on who you are talking to. For some people who are not in IT, we can’t just tell them what a distributed system is and what OLAP is. We have to explain what Kylin is doing using business language.
When it comes to explaining Kylin to folks in IT, the hardest thing is that many of them are still working on traditional platforms. They are not familiar with Spark, so we have to explain what a distributed system is, what Hadoop is and what Spark is. Usually, bigger companies don’t have a problem understanding Kylin since they all have a distributed system like Hadoop, so it’s easy for us to explain it, but for companies or users that have never used big data, it’s hard to explain Kylin to them.
That makes sense. The way Apache Kylin approaches big data is drastically different than traditional technologies so I can see where there is a bit of a learning curve. From the way the community has continued to grow, it seems to me that people are more than willing to put in that effort to understand what Kylin is doing and how it can help them in their business.
As we start to wrap up our conversation, in our next episode I’d like to hear what you all think is the future of Apache Kylin. Where do we go from here? Stay tuned for our next episode!
Q&A with Apache Kylin Committer, Kaige Liu – How Apache Kylin Is Rapidly Changing the Way We Approach Big Data
Roaring Elephant Podcast with Dong Li – Episode 93 – Apache Kylin: Extreme OLAP Engine for Big Data
Learn About Real-Time Streaming - What’s New with Apache Kylin 3.0?
4-Part Series on Count Distinct – Making Distinct Counting Work for Big Data
Further Reading Is Available on Our Apache Kylin Blog
About the Founders
Luke Han is the Co-Founder and CEO of Kyligence and Co-Founder of Apache Kylin, the first Apache Software Foundation top-level project developed in China. He is responsible for Kylin's strategic planning, development roadmap, product design, and more, and is committed to developing the Apache Kylin global community and ecosystem. He has served as Head of Big Data Products in eBay's Global Analytics Infrastructure Division, Chief Advisor to Actuate China, and Technical Director of Power Excellence East China.
Yang Li is the Co-Founder and CTO of Kyligence, Co-Founder of Apache Kylin, and member of the Project Management Committee (PMC). Previously, he was the Senior Architect of Big Data in eBay's Global Analytics Infrastructure, Vice President at Morgan Stanley, and during his time with IBM, he received the Outstanding Technology Contribution Award. Yang has more than 10 years of hands-on experience in big data analytics; he has focused on parallel computing, data indexing, relational mathematics, approximation algorithms, compression algorithms, and other cutting-edge technologies. Over the past 15 years, Yang has directly driven the development of OLAP technology in the big data space.
Dong Li is the Founding Member and Senior Director of Product and Innovation at Kyligence, an Apache Kylin Core Developer (Committer) and member of the Project Management Committee (PMC) where he focuses on big data technology development. Previously, he was a Senior Engineer in eBay's Global Analytics Infrastructure Department, a Software Development Engineer for Microsoft Cloud Computing and Enterprise Products, and a core member of the Microsoft Business Products Dynamics Asia Pacific team where he participated in the development of a new generation of cloud-based ERP solutions.
Shaofeng Shi is a Partner and Chief Software Architect at Kyligence, Apache Kylin Core Developer (Committer), and Chairman of the Project Management Committee (PMC Chair) where he focuses on big data analytics and cloud computing technologies. Previously, he was a Senior Data Engineer in eBay's Global Analytics Infrastructure Department and a Cloud Computing Software Architect at IBM.
Hongbin Ma is the Vice President of Research and Development at Kyligence, an Apache Kylin Core Developer (Committer) and member of the Project Management Committee (PMC) where he focuses on big data infrastructure and platforms. He joined eBay as Apache Kylin's Chief Committer. Previously, he was a core contributor to Trinity, Microsoft Research Asia's graph database. He has contributed to Apache Kylin's storage engine, query optimization, test coverage, and other areas and is currently the technical leader of Kyligence Enterprise data warehouse products.
Jason Zhong is a Partner and Senior Director at Kyligence, an Apache Kylin Core Developer (Committer), and a member of the Project Management Committee (PMC). He has worked in eBay's Global Analytics Infrastructure Division and been involved in operational automation product development as well as Kylin's development. After joining Kyligence, he worked in both research and development before becoming responsible for business sales and business development transformation. He has won consecutive Kyligence sales titles and is currently the Head of the Kyligence South Division.
About the Author
Samantha Berlant is the Marketing Communications Manager at Kyligence and a big fan of AI, machine learning, and science fiction. She spent several years leading content analytics projects at Facebook and Instagram and has been a writer and editor for over a decade. Samantha believes in the power of accessible data and her favorite Star Trek character is, coincidentally, Data.