“Madam, what use is a new-born baby?” – Michael Faraday (apocryphal), when asked about the utility of electricity, then a new invention, in the 1800s.
Why Hadoop Is Thriving and Will Continue to Do So
As my readers are aware, I have been heavily involved in the Big Data space for the last two years. This time has been an amazing and transformative personal experience, as I have been relentlessly traveling the globe advising banking leaders across all continents.
Thus, it should come as no surprise that the recent KDnuggets article, somewhat provocatively titled “Hadoop is Failing – Why”, managed to get me disagreeing right from the get-go.
The author, though well-meaning from what I can tell, bases the article on several unfounded assumptions. Before we delve into those, let us consider the following background.
The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just to provide engaging visualization but also to personalize services customers care about, while working across multiple channels/modes of interaction. Mobile applications first began forcing enterprise applications to support multiple channels of interaction with their consumers. For example, Banking now requires an ability to engage consumers in a seamless experience across an average of four to five channels – Mobile, eBanking, Call Center, Kiosk etc. Healthcare is a close second, where caregivers expect patient, medication & disease data at their fingertips with a few finger swipes on an iPad app.

Big Data technology evolved to overcome the limitations of existing data approaches (RDBMS & EDW) and to keep up with the data architecture & analysis challenges inherent in the Digital application stack.
These challenges include –
- The challenge of data volume explosion – please read the blog post below for a detailed discussion.
http://www.vamsitalkstech.com/?p=247
- The amazing variety of data enterprises are now forced to deal with, traveling at high velocity –
http://www.vamsitalkstech.com/?p=75
- Hadoop certainly has its own technical constraints – the ability to support low-latency BI (Business Intelligence) queries, for one. However, the sheer inability of pre-Hadoop approaches to scale with exploding data ingest and to manage massive data volumes posed two business challenges for Digital Architectures. The first is gleaning real-time insights from the vast streams of (structured & unstructured) data flowing into enterprise architectures. The second is running advanced analytics – Predictive Analytics and Deep Learning – at high speeds (quite often tens of thousands to tens of millions of messages per second) to solve complex problems across domains. Hadoop turns these challenges into business opportunities for efficient adopters; a minimal sketch of the kind of streaming pipeline this implies follows this list.
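To make the streaming challenge concrete, here is a minimal sketch of such a pipeline using Spark Structured Streaming from the broader Hadoop/Big Data ecosystem. The Kafka broker address, topic name, event schema and the naive “scoring” rule are all illustrative assumptions on my part, not a reference implementation.

```python
# Minimal sketch: scoring a high-velocity event stream with Spark Structured
# Streaming. The Kafka endpoint, topic name, schema and paths are illustrative
# assumptions, not drawn from any real deployment.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("stream-scoring-sketch").getOrCreate()

# Hypothetical schema for incoming payment events
schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("channel", StringType()),
])

# Read raw events from an assumed Kafka topic
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed address
    .option("subscribe", "payments")                   # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

# Stand-in for a real model: flag unusually large transactions
flagged = events.filter(col("amount") > 10000.0)

# Persist flagged events to the data lake for downstream analytics
query = (flagged.writeStream
    .format("parquet")
    .option("path", "/data/lake/flagged")        # assumed lake path
    .option("checkpointLocation", "/tmp/chk")
    .start())
query.awaitTermination()
```

In a real deployment the simple filter would be replaced by a trained model, but the shape of the pipeline – ingest from a message bus, score in flight, persist to the lake – is the same.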
Why the Darwinian Open Source Ecosystem Ensures Hadoop Is a Robust and Mature Technology Platform
Big Data is backed by the open source community, with most of the Hadoop ecosystem (25+ projects) incubated, developed and maintained within the Apache ecosystem. The Open Source community is inherently Darwinian in nature. Its focus on code quality, industry adoption, a concrete roadmap and active committers means that if a project lacks any of these, it is surely headed for the graveyard. Put another way, there can be no stragglers in this ecosystem.
Let us now consider the chief assumptions made by the author in the above article.
Assumption 1 – Hadoop adoption is staying flat at best
The best part of my job is working with multiple customers daily on their business initiatives and figuring out how to apply technology to solving their complex challenges. I can attest that adoption in the largest enterprises is anything but stagnating. While my view is certainly anecdotal and confined to the four walls of one company, adoption is indeed skyrocketing in verticals like Banking, Telecom, Manufacturing & Insurance. The early corporate movers, working with the leading vendors, have more or less figured out the kinks in the technology as applied to their business challenges. The adoption patterns are maturing, and they are realizing massive business value from them. A leading vendor, Hortonworks, reached $100 million in annual revenues quicker than any other tech startup – a testament to the potential of this space. Cloudera just went public. All this growth has been accompanied by somewhat declining revenues & stock prices at the leading EDW vendors. I forecast that the first Big Data ‘startup’ will reach $1 billion in revenue within the next five to seven years, at a somewhat faster rate than the revered open source pioneer Red Hat. At a minimum, Hadoop projects cut tens of millions of dollars from costly and inflexible enterprise data warehouse projects. Nearly every large organization has begun deploying Hadoop as an Enterprise Landing Zone (ELZ) to augment an EDW.
Assumption 2 – The business value of projects created using Hadoop is unclear
The author has a point here, but let me explain why this is an organizational challenge and not really the fault of any technology stack – Middleware, Cloud or Big Data. The challenge is that figuring out the business value of Big Data projects across complex organizational structures is often a fine art. IT groups can surely start POCs as science or “one-off resume builder” projects, but the lines of business need to get involved from the get-go, sooner than with any other technology category. Big Data isn’t about the infrastructural plumber’s job of storing massive volumes of data but really about creating business analytics on the data collected and curated. Whether those analytics are simply old-school BI or Data Science oriented depends on the culture and innovativeness of an organization.
Organizations are using Big Data not only to solve existing business challenges (sell more products, detect fraud, run risk reports etc.) but also to rapidly experiment with new business models using the insights gleaned from Big Data analytics. It falls to the office of an enlightened CDO (Chief Data Officer) to own the technology, create the appropriate internal costing models and onboard lines of business (LOBs) projects into the data lake.
There are two questions every CDO needs to ask at the outset –
- What business capabilities are going to be enabled across the organization?
- What aspects of digital transformation can be enabled best by Big Data?
Assumption 3 – Big Data is only a valid technical solution for massive data volumes in the petabytes (PBs).
The author writes ‘You don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters – Hadoop technology just doesn’t shine at this scale.’
This could not be further from the observed reality for three reasons.
Firstly, most of the projects in the terabyte (TB) range exist as tenants in larger clusters. The real value of data lakes is being able to build out cross-organizational data repositories that were simply too expensive or too hard to build before. Once you have all the data in one place, you can mash it up and analyze it in ways heretofore unknown – as sketched below.
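As a hedged illustration of the “mash it up” point, assuming two hypothetical datasets already landed in the lake (CRM accounts and call-center interactions – the paths, column names and formats are my own inventions), a cross-silo analysis in PySpark might look like this:

```python
# Sketch: joining two previously siloed datasets inside a shared data lake.
# Paths, column names and formats are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-mashup-sketch").getOrCreate()

accounts = spark.read.parquet("/data/lake/crm/accounts")          # assumed path
calls = spark.read.parquet("/data/lake/callcenter/interactions")  # assumed path

# Cross-organizational view: which account segments generate the most calls?
(accounts.join(calls, on="account_id")
    .groupBy("segment")
    .count()
    .orderBy("count", ascending=False)
    .show())
```

The point is not the few lines of code but that, before a shared lake existed, this join would have required a cross-departmental ETL project of its own.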
Secondly, as I’ve covered in the post below, many players are leveraging Big Data to gain the crucial “speed” advantage while working with TBs of data.
Thirdly, I recommend that every client start ‘small’ and use a data lake to serve as an Enterprise Landing Zone (ELZ) for data produced as a result of regular business operations. Hadoop clusters not only serve as cheap storage but also perform a range of rote but compute-intensive data processing tasks (data joining, sorting, segmentation, binning etc.) that save the EDW from a range of taxing operations – see the sketch below.
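To illustrate the kind of rote offload work meant here, a minimal PySpark sketch under assumed table paths and column names – binning and pre-segmenting transactions on the Hadoop cluster so the EDW only receives a compact, curated extract:

```python
# Sketch: offloading rote preparation work (joining, sorting, binning) from
# the EDW to the Hadoop cluster. Paths and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("elz-offload-sketch").getOrCreate()

txns = spark.read.parquet("/data/elz/raw/transactions")  # assumed landing path

# Bin transaction amounts into coarse buckets the EDW can consume directly
binned = txns.withColumn(
    "amount_bucket",
    when(col("amount") < 100, "small")
    .when(col("amount") < 10000, "medium")
    .otherwise("large"))

# Sort and write a pre-segmented extract for the warehouse
(binned.orderBy("account_id", "txn_date")
    .write.mode("overwrite")
    .partitionBy("amount_bucket")
    .parquet("/data/elz/curated/transactions_binned"))
```

Work like this runs cheaply on commodity Hadoop nodes, freeing expensive EDW cycles for the analytical queries the warehouse is actually good at.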
Assumption 4 – Hadoop skills are hard to find.
In the author’s words – “..while 57% said that the skills gap was the major reason, a number that is not going to be corrected overnight. This coincides with findings from Indeed who tracked job trends with ‘Hadoop Testing’ in the title, with the term featured in a peak of 0.061% of ads in mid 2014, which then jumped to 0.087% in late 2016, an increase of around 43% in 18 months. What this may signal is that adoption hasn’t necessarily dropped to the extent that anecdotal evidence would suggest, but companies are simply finding it difficult to extract value from Hadoop from their current teams and they require greater expertise.”
The skills gap is real and exists in three primary areas – Data Scientists, Data Engineers and Hadoop Administrators.
However, this is not unique to Hadoop; it is common with every new technology. Companies need to bridge the gap by augmenting the skills of internal staff, working with the Global Systems Integrators (GSIs), who have all added Big Data practice areas, and by engaging with academia. In fact, the prospect of working on Big Data projects can attract talent to the organization.
How Should Large Organizations Proceed on Their Big Data Journey?
So what are the best practices to avoid falling into the “Big Data does not provide value” trap?
- Ensuring that discussions of Big Data and its business and technical capabilities are conducted at the highest levels. Big Data needs to be part of an organization’s DNA and should be discussed in the context of the other major technology forces driving the industry – Cloud, Mobility, DevOps, Social, APIs etc.
- Creating or constituting a team under the CDO (Chief Data Officer). Teams can be physical or virtual, and need to take organizational politics into account.
- Creating a COE (Center of Excellence) or a similar federated approach where a central team works with lines of business IT on these projects
- As part of the COE, institute a process to onboard the latest skills
- Instituting appropriate governance and project oversight
- Identifying key business metrics that will drive Big Data projects. This blog has covered many such areas but these include detailed analyses on expected growth acceleration, cost reduction, risk management and enabling competitive advantage.
- Engaging the lines of business to develop these capabilities in an iterative manner. Almost all successful Big Data projects are delivered in a DevOps fashion.
Conclusion
The Big Data ecosystem and Hadoop technology provide mature, stable and feature-rich platforms for global vertical organizations to implement complex Digital projects. However, technology maturity is only a necessary factor. An innovation-oriented organizational mindset is key to driving internal change, as is inculcating a learning mindset across business leadership, IT teams, internal domain experts and management. The universal maxim – “one gets only as much out of something as one puts into it” – is truer than ever with Big Data. While it is easy to blame a technology, a vendor or a lack of skilled personnel for perceived project failures, one should guard against a status quo-ist mindset. You can rest assured that your competition is not sitting still.
2 comments
Vamsi —
I would argue Hadoop is not failing, merely evolving out of necessity. The data analytics field is still, in many regards, in the embryonic stages of development. I also agree that key to this evolution is addressing the dearth of skilled data scientists. In my experience, there seems to be confusion still over what a data scientist is. The role is really a hybrid of domain experience, mathematics and IT.
John – I agree 100%. The skills shortage is real and more acute in Data Science – as you rightly point out.