As a complete open-source big data suite, Apache Hadoop has profoundly influenced the entire big data world over the past decade. However, with the development of various emerging technologies, the Hadoop ecosystem has undergone tremendous changes. Is Hadoop really dead? If so, what products/technologies will replace it? What is the future outlook for big data analysis?
This article will analyze:
For the past two decades, we have been living in an era of data explosion. The amount of data created in traditional business, such as orders and warehousing, has increased relatively slowly, and its proportion in the total amount of data has gradually decreased.
Instead, massive amounts of human data and machine data (logs, IoT devices, etc.) have been collected and stored in quantities far exceeding traditional business data. A huge technology gap exists between the massive amounts of data and human capabilities, which has spawned various big data technologies. In this context, what we call the era of big data has come into being.
Thanks to the rise of “big data” and the influential Apache community of open-source software projects, Hadoop rapidly became popular, and many commercial companies emerged around it.
Hadoop is a fully functional big data processing platform. It contains a variety of components to meet different functional requirements, such as:
Hadoop used clusters for parallel computing soon after its birth, breaking the sorting record held by supercomputers. It has been widely adopted by companies and various organizations with proven strength.
The top Hadoop distributors in the market are three vendors: Cloudera, Hortonworks, and MapR. In addition, public cloud vendors provide hosted Hadoop services, such as AWS EMR and Azure HDInsight, which account for the majority of Hadoop’s market share.
In 2018, nevertheless, the market experienced drastic changes. A piece of big news shocked the Hadoop ecosystem: Cloudera and Hortonworks merged.
In other words, the No. 1 and No. 2 market players embraced each other to survive in the market. Then, HPE announced the acquisition of MapR. These M&As indicated that despite Hadoop’s extreme popularity, the companies faced operational difficulties and found it hard to make money.
After merging with Hortonworks, Cloudera announced that it would charge for all product lines, including the previously open-source versions. Its products are no longer available to all users, only to paying ones.
The HDP distribution, which was available for free in the past, is no longer maintained or available for download. It will be merged into a unified CDP platform in the future.
In April 2021, the Apache Software Foundation announced the retirement of 13 big data-related projects, 10 of which were part of the Hadoop ecosystem, such as Eagle, Sentry, and Tajo.
Now Apache Ambari, born with the mission to manage Hadoop clusters, has become the first Apache project to be retired in 2022.
Will Hadoop ultimately be abandoned? I believe this will not happen anytime soon. After all, Hadoop has a large number of users, and migrating their platforms and applications would be extremely costly.
Therefore, the current users will continue to use it, but the number of new users will gradually decrease. This is what we call the “post-Hadoop era”.
As for the potential growth of Apache Hadoop, the above roadmap is taken from a meetup of the Hadoop community. After 3.0, the new features of Hadoop are clearly less compelling. They mainly concern integration with K8s and Docker, which holds little appeal for big data practitioners.
Google Trends shows that interest in Hadoop peaked between 2014 and 2017. After that, we see a clear decline in searches for Hadoop. It is not surprising that Hadoop has been gradually losing its aura. Every technology goes through a cycle of development, maturity, and decline, and none can escape it.
When looking at the current state of Apache Hadoop and its ecosystem, there are a few key factors pointing to its eventual demise.
Looking back at the development history of Hadoop, the framework emerged because of the strong demand for big data storage. Today, however, users have new demands for data management and analysis, such as fast online analysis, separation of storage and computing, and AI/ML workloads.
In those respects, Hadoop provides only limited support and cannot compete with some emerging technologies. For example, Redis, Elasticsearch, and ClickHouse, which have become very popular in recent years, can all be applied to big data analysis.
For customers, there is just no need to deploy the complex Hadoop platform if a single technology can meet their demand.
From another perspective, cloud computing has been developing rapidly in the past decade or so, not only beating traditional software vendors such as IBM, HP, etc. but also encroaching to a certain extent on the big data market of Hadoop.
In the early days, cloud vendors only deployed Hadoop on IaaS, such as AWS EMR (claimed to be the most deployed Hadoop cluster in the world). For users, the Hadoop services hosted on the cloud can be started and stopped at any time, and the data can be safely backed up on the cloud vendor’s data service platform, which is easy to use and cost-saving.
Besides that, cloud vendors offer a range of big data services for specific scenarios, forming a complete ecosystem. Examples include AWS S3 for persistent, low-cost data storage, Amazon DynamoDB for low-latency key-value storage and access, and Athena, a serverless query service for analyzing big data.
In addition to the emerging technologies and the cloud vendors that continue to offer new services, Hadoop itself has been gradually showing “fatigue”. Assembling a platform from building blocks is a sound design. However, it makes the components of the Hadoop ecosystem harder for users to work with.
As can be seen from the figure above, there are at least 13 commonly used components in Hadoop, posing a huge challenge to Hadoop users in terms of learning and O&M.
The technology vendor Cloudera/Hortonworks can no longer release a high-quality free product on the market. It turns out that its earlier two-pronged “free version + paid version” approach did not work. Cloudera will only offer the paid version of CDP in the future, marking the end of the free lunch. It is unknown whether other manufacturers are willing to offer free products. Even if one does, the stability and sophistication of its product remain unknown. After all, the core developers of Hadoop mostly work for Cloudera and Hortonworks.
Don’t forget that Hadoop is an open-source project hosted by the Apache Software Foundation. Apache software is developed for the public good and can be obtained, used, and distributed by the public for free. So if you don’t want to pay, there is an option called Apache Hadoop that is free to use. After all, a large number of Internet companies still use Apache Hadoop (at their scale, only the open-source version is practical). If they can, why can’t I?
However, the open-source version is of average quality and comes with no service and no SLA guarantees. Users have to find and solve problems on their own, or post questions in the community and wait for answers. If you’re okay with that, then hire a few engineers to try it out. Keep in mind that Hadoop development and O&M engineers are hard to find and expensive to hire.
In the post-Hadoop era, how should its users face the transition, and what options are available to them? It all depends on how much budget you have and your technical capabilities.
How should vendors in the Hadoop ecosystem respond to the new era? The evolution of Apache Kylin and Kyligence is a perfect example.
Both the Apache Kylin project and Kyligence were born in the Hadoop era. Initially, all Kyligence products ran on Hadoop. About 4 years ago, Kyligence foresaw that customer needs were slowly shifting to cloud-native and the separation of storage and computing.
Having seen these industry trends, Kyligence substantially transformed its original platform.
In 2019, Kyligence launched Kyligence Cloud and announced that it had moved off the Hadoop platform. At the bottom tier, Kyligence Cloud uses a cloud-native architecture: cloud vendors’ object storage services, such as AWS S3 and ADLS, for storage, and Spark plus containerization for computing. Its resources can be directly connected to the IaaS services and ECS on the cloud platform. Kyligence kept expanding to multiple clouds and fine-tuning the architecture, and announced in 2021 that it had integrated new technologies such as ClickHouse into the architecture.
The transformed architecture brings substantial gains in flexibility, maintainability, and TCO, and it has received very positive feedback from the market.
The Key Success Factor (KSF) for tech vendors is to be extremely fast, sensitive, and bold in both catching trends and transforming products.
The big data and analytics market, especially in North America, has become a very hot and competitive market, with a level of investor commitment that no other industry can match. Vendors can never pay too much attention to market trends: listen to users, observe their new needs, and keep iterating on your products according to these inputs.
Technology will keep progressing, startups with new missions may come and go, and major corporations remain resilient. I’ve written a blog about 7 must-know data buzzwords and discussed emerging trends in 2022. To keep this one short, I’ll just list some (not all) of the interesting trends I examined, and share with you some nice articles around them:
We cover critical questions our customers often ask about Hadoop in big data analysis.
Apache Hadoop is an open-source software framework used for distributed storage and processing of large datasets. Hadoop's primary role in Big Data Analytics involves efficiently handling vast amounts of data, offering scalability, fault-tolerance, and cost-effective solutions. Hadoop can also facilitate advanced analytics, such as predictive analytics, data mining, and machine learning.
The core components of Apache Hadoop include the Hadoop Distributed File System (HDFS) for data storage, MapReduce for processing, and YARN for resource management. Its main use cases involve processing and storing massive data sets, conducting large-scale data analysis, and supporting advanced data-driven applications in industries such as healthcare, finance, retail, and telecommunications.
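To make the MapReduce processing model mentioned above more concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a real cluster, the map and reduce steps would run in parallel on many nodes, with HDFS holding the input and YARN scheduling the work.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (illustration only)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

The three functions are kept separate because that separation is the point of the model: map and reduce are independent per record and per key, so a framework can distribute them, while the shuffle is the step that moves data between nodes.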
Ⓒ 2023 Kyligence, Inc. All rights reserved.