A relatively new term is being bandied about the tech world: Hadoop. Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System papers. It is written in the Java programming language, with Yahoo being the largest contributor.

Hadoop is currently all the rage, with more than 150 enterprises, including Google and Yahoo, using it. Knowing that these large corporations use the framework makes it seem really awesome, doesn't it? Well, maybe not so fast. Hadoop is free and open source, but can just anyone use it? Before you rush out to get it, you should know that using Hadoop requires training along with a level of analytics expertise. It is true that Hadoop is available for enterprise IT departments to download and modify to fit their needs. However, those IT departments must ensure that they have the technical expertise in-house before starting on this venture. According to the Hadoop project, some commercial offerings are beginning to appear, including Amazon Web Services' Elastic MapReduce, Cloudera Enterprise, Datameer Analytics Solution, and the Hortonworks Data Platform, to name a few.

Should you buy into the Hadoop hype? It turns out that only about 1% of US enterprises are actually using Hadoop in production environments. Yes, Hadoop does have advantages over some of the more traditional database management systems, especially the ability to handle both structured data, like that found in relational databases, and unstructured information such as videos. It can also scale up with a minimum of fuss. You can run jobs of different types on the same hardware, too, which is a huge savings for many enterprises. And it takes in and processes huge quantities of data, video included, in a short amount of time.
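Since Hadoop grew out of Google's MapReduce paper, it may help to see the shape of that programming model. What follows is only a minimal sketch in plain Python, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs on one machine, whereas a real cluster would spread the map and reduce work across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence (hypothetical single-machine driver)."""
    pairs = []
    for doc in documents:
        # On a real cluster, each document could be mapped on a different node.
        pairs.extend(map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data is everywhere"]
    print(word_count(docs))
    # → {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The point of the split is that the map and reduce steps do not depend on each other's internal state, which is what lets a framework like Hadoop hand pieces of the job to whatever servers are available.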
Some of the current users of Hadoop have found that they can add servers to the cluster and that the system scales immediately. Another great benefit of Hadoop is its ability to analyze huge data sets to quickly spot trends. With ultra-large data sets, it can be a much more efficient way to find things, since it's really built for handling that.

As good as Hadoop is, there are some cautions. First, one piece of advice is "don't commit to or standardize on one vendor quite yet," because it's such a "turbulent" space right now: "The vendors are all continuing to rapidly evolve." On the other hand, the same source notes, that does create a "vibrant ecosystem." And, of course, as mentioned earlier, you will need to train your staff and invest in analytics. Hadoop is not trivial to use, and training is vital, not only for the IT staff but for the wider organization. Most companies that are using Hadoop are using it in addition to other types of software, not instead of them.

Since it's an open source environment, it can be used by anyone, and this can cause a few in-house issues, such as more than one person generating the same intermediate data sets to analyze, which is truly a waste. One suggestion to minimize this is to run common data queries once each morning and save the results in one place for anyone who needs them. By doing this you will save large amounts of processing time and related resources. It's all a learning curve, just as it is with any software, and learning and training go hand-in-hand. Users also need to learn to clean up what they don't need so that the cluster doesn't run out of disk space.

All in all, Hadoop is working, and those who are using it are loving it. What about your company? Is this something it is considering? If you deal with a lot of data and data manipulation, Hadoop might just be the right thing for you.