There has been a lot going on in the world of data warehousing. At their core, data warehouses continue to provide a repository of information for business decision support and delivering insight. Facts and dimensions come together to define their schema. On the technology front, however, it has been a whirlwind of activity. First there were relational databases, which remain relatively stable workhorses today. Data warehouse appliances caught everyone's attention a few years back with the whole idea of the data warehouse in a box. Today in-memory processing is all the rage: no time to persist data on disk.
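To make the facts-and-dimensions idea concrete, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table and column names (`fact_sales`, `dim_product`, and so on) are illustrative, not from any particular warehouse:

```python
import sqlite3

# Hypothetical mini star schema: one fact table (sales measures) joined
# to one dimension table (product attributes).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,"
            " name TEXT, category TEXT)")
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, qty INTEGER,"
            " revenue REAL)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 10, 100.0), (2, 5, 75.0), (1, 3, 30.0)])

# A typical dimensional query: aggregate the fact table's measures,
# grouped by an attribute from the dimension table.
cur.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category
""")
print(cur.fetchall())  # [('Hardware', 205.0)]
```

The fact table holds the numeric measures; the dimension tables hold the descriptive attributes you slice and group by. That separation is what makes the schema a "star."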
To say that big data has been a disruptor is putting it mildly. By providing a batch-oriented system for handling big data, Hadoop and MapReduce made their mark. If we start talking about the Internet of Things and streaming big data, the complexity rapidly increases. Processing and analyzing data streams calls for the likes of Apache Spark and Apache Storm. So you now have several pieces of the puzzle: a data warehouse built on a relational database for structured data, Hadoop for semi-structured and unstructured data, tools such as Spark and Storm for streaming data, all backed by in-memory processing for rapid data crunching.
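The batch model behind Hadoop's MapReduce can be sketched in a few lines of plain Python, no cluster required. This is a toy word count under the classic map/shuffle/reduce structure; the function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, value) pair for each word in each input line."""
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insight", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'insight': 1}
```

What Hadoop adds on top of this simple structure is the hard part: distributing the map and reduce tasks across a cluster, moving data between them, and recovering from machine failures.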
Ask whether Hadoop could replace your data warehouse (and/or ETL system) and you are likely to get many animated responses, most of them vehemently arguing that this is an apples-vs.-oranges comparison and that the two need to coexist rather than compete. Look, however, at the writers putting forth this opinion, and I suspect you will find that several of them work for the large database and ETL vendors. Clearly their world will get shaken if Hadoop makes serious inroads into the area of structured data and OLTP. With billions of dollars of revenue at stake, resistance to change is understandable.
But consider again the Hadoop ecosystem (http://hadoopecosystemtable.github.io/). It has a really long list of tools, including NewSQL databases and SQL-on-Hadoop tools. One could use MapReduce to do ETL. Hadoop seems to have something in it for everyone, and it is hard to see why it cannot take over the data warehouse, given time. Hadoop creator Doug Cutting himself has predicted that OLTP workloads will move to Hadoop (http://www.datanami.com/2013/11/06/oltp_clearly_in_hadoops_future_cutting_says/). Yes, the skills bar for deploying Hadoop is high, but that should ease over time as more and more add-ons come to market. Certainly the Oracles and Teradatas of the world have reason to worry.
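The "MapReduce to do ETL" point is worth illustrating. ETL maps naturally onto the model: the map step extracts and transforms raw records (parsing, normalizing, dropping bad rows), and the reduce step loads aggregated results. Here is a plain-Python sketch; the field layout and cleaning rules are hypothetical:

```python
# Raw input rows: date, country, amount — with messy casing, stray
# whitespace, and one malformed record, as real feeds tend to have.
raw_rows = [
    "2015-01-02,US,120",
    "2015-01-02,us, 80",
    "bad record",
    "2015-01-03,DE,50",
]

def extract_transform(row):
    """Map step: parse a raw row, normalize fields, drop bad records."""
    parts = row.split(",")
    if len(parts) != 3:
        return None  # skip malformed input
    date, country, amount = (p.strip() for p in parts)
    return (date, country.upper()), int(amount)

def load(pairs):
    """Reduce step: aggregate amounts per (date, country) key."""
    totals = {}
    for key, amount in pairs:
        totals[key] = totals.get(key, 0) + amount
    return totals

mapped = (extract_transform(r) for r in raw_rows)
totals = load(p for p in mapped if p is not None)
print(totals)
# {('2015-01-02', 'US'): 200, ('2015-01-03', 'DE'): 50}
```

Run the same logic as Hadoop map and reduce tasks and the cluster handles the scale-out, which is exactly why MapReduce keeps coming up as an ETL engine.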
Once the Internet of Things goes mainstream, with big data as the name of the game, it may no longer make sense to segment data processing systems and say this one is for structured data and that one is for the rest. Combining everything on one platform appears to be the way to go, and Hadoop and its ecosystem seem to offer a complete solution. Plus open source (read: free) is hard to beat.
Of course, the skills needed to set up a cluster and write applications are another matter. Only the largest players are likely to have the resources to do it today, which is where Hadoop as a service should come in handy: big data warehouses in the cloud that ingest and persist both structured and unstructured data, with the capability to handle streaming data as well. That should extend the benefits of big data to everyone, not just the Amazons and Googles of the world. In time we should see alternative platforms to Hadoop emerge, but right now all bets are on Hadoop.
Fast-forward into the future and we should have multiple big-data-as-a-service providers who use Hadoop (or alternative platforms) to combine data at rest with data in motion. Insight from big data would then not be just for the few but a value-add for every enterprise, including small and medium ones. Data warehouses would continue to be alive and kicking, but the engine under the hood would probably be very different from what it is today.