
Is Hadoop Dead? The Future of Big Data Analysis

Author
Shaofeng Shi & Coco Li
Partner & Chief Architect at Kyligence & Product Marketing Manager
Feb. 14, 2022
   

As a complete open source big data suite, Apache Hadoop has profoundly influenced the entire big data world over the past decade. However, with the development of various emerging technologies, the Hadoop ecosystem has undergone tremendous changes. Is Hadoop really dead? If so, what products/technologies will replace it? What is the future outlook for big data analysis?

 

This article will analyze: 

 
  1. The history of Hadoop & its open source ecosystem
  2. The emerging technology options under the cloud-native trends
  3. The future outlook of big data analysis in the next 10 years
   

A Brief History of Big Data & Hadoop

 

For the past two decades, we have been living in an era of data explosion. The amount of data created in traditional business, such as orders and warehousing, has increased relatively slowly, and its proportion in the total amount of data has gradually decreased.

 
data volume
Image by Author
 

Instead, massive amounts of human data and machine data (logs, IoT devices, etc.) have been collected and stored in quantities far exceeding traditional business data. A huge technology gap exists between the massive amounts of data and human capabilities, which has spawned various big data technologies. In this context, what we call the era of big data has come into being.

 
2006: The rise of Apache Hadoop for big data processing
 

Thanks to the rise of “big data” and the influential Apache community of open-source software projects, Hadoop rapidly became popular, and many commercial companies emerged around it.

 

Hadoop is a full-featured big data processing platform. It contains a variety of components to meet different functional requirements, such as:

 
  • HDFS for data storage
  • YARN for resource management
  • MapReduce and Spark for data calculation and processing
  • Sqoop for relational data collection
  • Kafka for real-time data pipelines
  • HBase for online data storage and access
  • Impala for online ad-hoc queries, etc.
 
hadoop data capability
Image by Author
 

Soon after its birth, Hadoop used clusters for parallel computing to break the sorting record held by supercomputers. Having proven its strength, it has been widely adopted by companies and other organizations.

 

The top three Hadoop distributors in the market are Cloudera, Hortonworks, and MapR. In addition, public cloud vendors provide hosted Hadoop services, such as AWS EMR and Azure HDInsight, which account for the majority of Hadoop’s market share.

 
2018: Cloudera & Hortonworks Merger
 

In 2018, however, the market changed drastically. A piece of big news shocked the Hadoop ecosystem: Cloudera and Hortonworks merged.

 
Cloudera and Hortonworks
News by Chirs Preinesnerger, and Daniel Newman
 

In other words, the No. 1 and No. 2 market players joined forces to survive. Then, HPE announced the acquisition of MapR. These M&As indicated that, despite Hadoop’s extreme popularity, the vendors behind it struggled to operate and found it hard to make money.

 

After merging with Hortonworks, Cloudera announced that it would charge for all product lines, including the previously open-source versions. Its open source products are no longer available to all users, only to paying users.

 
cloudera website
Image from Cloudera official website
 

The HDP distribution, which was available for free in the past, is no longer maintained or available for download. It will be merged into a unified CDP platform in the future.

 
2021: The Decline of Hadoop's Open Source Ecosystem
 

In April 2021, the Apache Software Foundation announced the retirement of 13 big data-related projects, 10 of which are part of the Hadoop ecosystem, such as Eagle, Sentry, and Tajo.

 

Now Apache Ambari, born with the mission to manage Hadoop clusters, has become the first Apache project to be retired in 2022.

 
2022 & Beyond: The Post-Hadoop era
 

Will Hadoop ultimately be abandoned? I believe this will not happen anytime soon. After all, Hadoop has a large number of users, which means exorbitant costs of platform and application migration.

 

Therefore, the current users will continue to use it, but the number of new users will gradually decrease. This is what we call the “post-Hadoop era”.

 
Hadoop 3 roadmap
Image from Hadoop community meetup
 

As for the growth potential of Apache Hadoop, consider the roadmap above, taken from a Hadoop community meetup. After 3.0, the new features are clearly less compelling. They mainly concern integration with K8s and Docker, which is not that attractive to big data practitioners.

 

Why is Hadoop Dying?

 

Google Trends shows that interest in Hadoop peaked between 2014 and 2017. After that, searches for Hadoop clearly decline. It’s not unexpected that Hadoop has been gradually losing its aura. Every technology goes through a cycle of development, maturity, and decline, and none can escape this pattern.

 
google trend hadoop

When looking at the current state of Apache Hadoop and its ecosystem, there are a few key factors pointing to its eventual demise.

 
New Market Demands for Data Analytics and Emerging Technologies

Looking back at the history of Hadoop, it is clear that the framework emerged because of the strong demand for big data storage. Today, however, users have new demands for data management and analysis, such as fast online analysis, separation of storage and computing, and support for AI/ML workloads.

 

In those respects, Hadoop can only provide limited support and cannot compete with some emerging technologies. For example, Redis, Elasticsearch, and ClickHouse, which have been very popular in recent years, can all be applied to big data analysis.

 

For customers, there is just no need to deploy the complex Hadoop platform if a single technology can meet their demand.

 
Fast-growing Cloud Vendors and Services

From another perspective, cloud computing has developed rapidly over the past decade or so, not only beating traditional software vendors such as IBM and HP, but also encroaching to a certain extent on Hadoop’s big data market.

 

In the early days, cloud vendors only deployed Hadoop on IaaS, such as AWS EMR (claimed to be the most deployed Hadoop cluster in the world). For users, the Hadoop services hosted on the cloud can be started and stopped at any time, and the data can be safely backed up on the cloud vendor’s data service platform, which is easy to use and cost-saving.

 
AWS Data Services
AWS Data Services: source
 

Besides that, cloud vendors offer a range of big data services for specific scenarios, forming a complete ecosystem. Examples include AWS S3 for persistent, low-cost data storage; Amazon DynamoDB for low-latency key-value data storage and access; and Athena, a serverless query service for analyzing big data.

 
Increasing Complexity of Hadoop Ecosystem

In addition to the emerging technologies and cloud vendors that continue to offer new services, Hadoop itself has been gradually showing “fatigue”. Assembling a platform from building blocks is a good idea in principle, but it makes the components of the Hadoop ecosystem harder for users to work with.

 
Hadoop ecosystem
Image from Source
 

As the figure above shows, Hadoop has at least 13 commonly used components, which poses a huge challenge to Hadoop users in terms of learning and O&M.

 
The Loss of Cloudera & Hortonworks Free Products
end of free hadoop offering
Image by Author

As a technology vendor, Cloudera/Hortonworks can no longer afford to release a high-quality free product. Their earlier two-pronged “free version + paid version” approach turned out not to work. Cloudera will only offer the paid version of CDP in the future, which marks the end of the free lunch. It is unknown whether other vendors are willing to offer free products. Even if one does, the stability and maturity of its product remain unknown. After all, the core developers of Hadoop mostly work for Cloudera and Hortonworks.

 
The Inconsistent Quality of Hadoop’s Open Source Ecosystem
 

Don’t forget that Hadoop is an open-source project hosted by the Apache Software Foundation. Apache software is developed for the public good and can be obtained, used, and distributed by the public for free. So if you don’t want to pay, you can use Apache Hadoop for free. After all, a large number of Internet companies still use Apache Hadoop (at their scale, only the open-source version can be used). If they can, why can’t I?

 

However, open-source software comes with uneven quality, no service, and no SLA guarantees. Users have to identify and solve problems on their own, or post questions in the community and wait for answers. If you’re okay with that, then hire a few engineers to try it out. Keep in mind that Hadoop development and O&M engineers are hard to find and expensive to hire.

 

Looking Beyond Hadoop: Alternative Solutions

 

In the post-Hadoop era, how should its users face the transition, and what options are available to them? It all depends on how much budget you have and your technical capabilities.

 
Key Success Factors for Tech Vendors
 

How should vendors in the Hadoop ecosystem respond to the new era? The evolution of Apache Kylin and Kyligence is a good example.

 

Both the Apache Kylin project and Kyligence were born in the Hadoop era. Initially, all Kyligence products ran on Hadoop. About 4 years ago, Kyligence foresaw that customer needs were slowly shifting to cloud-native and the separation of storage and computing.

 

Having seen these industry trends, Kyligence substantially transformed its original platform.

 
kyligence cloud platform
Image by Author
 

In 2019, Kyligence launched Kyligence Cloud and announced that it had moved off the Hadoop platform. Kyligence Cloud is built on a cloud-native architecture: it uses cloud vendors’ object storage services, such as AWS S3 and ADLS, for storage, and Spark plus containerization for computing. Its resources connect directly to the IaaS services and ECS on the cloud platform. Kyligence kept expanding to multiple clouds and fine-tuning the architecture, and announced in 2021 that it had integrated new technologies such as ClickHouse into the architecture.

 

The transformed architecture brings significant gains in flexibility and maintainability along with a lower TCO, and it has received very positive feedback from the market.

 

The Key Success Factor (KSF) for tech vendors is to be extremely fast, sensitive, and bold in both catching trends and transforming products.

 

The big data and analytics market, especially in North America, has become very hot and competitive, with a level of investor commitment that no other industry can match. Vendors can never pay too much attention to market trends: listen to users, observe their new needs, and keep iterating on your products according to these inputs.

 

Future trends in big data and analytics

 

Technology will keep progressing, startups with new missions may come and go, and major corporations will remain resilient. I’ve written a blog about 7 must-know data buzzwords and discussed emerging trends in 2022. To keep this one short, I’ll just list some (not all) of the interesting trends I examined and share some good articles about them:

 
