Apache Hadoop is a powerful open source data storage and processing project designed for sophisticated analysis of complex structured and unstructured data. Hadoop enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of servers, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Hadoop is made up of two main components – MapReduce and Hadoop Distributed File System (HDFS):
MapReduce - The framework that understands and assigns work to the nodes in a cluster.
HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
Hadoop is accompanied by an ecosystem of Apache projects that extend the value of Hadoop and improve its usability. The Hadoop ecosystem includes:
Hadoop is designed to run on a large number of servers that don’t share any memory or disks. That means you can purchase several commodity servers, put them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, the software breaks the data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. Since multiple copies are stored, data on a server that goes offline or dies can be automatically re-replicated from a known good copy.
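The replication-and-recovery behavior described above can be sketched as a toy model in plain Python (this is an illustration, not Hadoop's API; `Cluster`, `store`, and `fail` are hypothetical names, and real HDFS manages blocks via the NameNode and DataNodes):

```python
import random

REPLICATION = 3  # HDFS's default replication factor

class Cluster:
    """Toy model of block placement across shared-nothing nodes."""

    def __init__(self, nodes):
        self.nodes = {n: set() for n in nodes}  # node name -> blocks it stores

    def store(self, block):
        # Place copies of the block on REPLICATION distinct nodes.
        for node in random.sample(list(self.nodes), REPLICATION):
            self.nodes[node].add(block)

    def fail(self, node):
        # A node dies: re-replicate its blocks from surviving good copies.
        lost = self.nodes.pop(node)
        for block in lost:
            holders = [n for n, blocks in self.nodes.items() if block in blocks]
            if holders and len(holders) < REPLICATION:
                candidates = [n for n in self.nodes if block not in self.nodes[n]]
                # One new copy is enough to recover from a single failure.
                self.nodes[random.choice(candidates)].add(block)

    def copies(self, block):
        return sum(block in blocks for blocks in self.nodes.values())

cluster = Cluster(["node%d" % i for i in range(1, 6)])
cluster.store("block-A")
# Kill one of the servers holding the block; the cluster heals itself.
cluster.fail(next(n for n, b in cluster.nodes.items() if "block-A" in b))
print(cluster.copies("block-A"))  # prints 3: the replica count is restored
```

The point of the sketch is the design choice: because no node is special and every block exists in several places, losing a machine is an ordinary event handled in software rather than an outage.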
In a centralized database system, one big disk is connected to four or eight or sixteen processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every server has two, four, or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you as a unified whole. This process is known as MapReduce: you map the operation out to all of the servers and then you reduce the results back into a single result set. Architecturally, the reason you’re able to deal with lots of data is that Hadoop spreads the data out. And the reason you’re able to ask complicated computational questions is that you’ve got all the processors, working in parallel, harnessed together.
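The map-then-reduce flow described above can be sketched with a toy word count in plain Python (a sketch of the idea only; `map_phase` and `reduce_phase` are illustrative names, and a real job would be written against Hadoop's Java MapReduce API):

```python
from collections import defaultdict

def map_phase(document):
    # Each mapper runs on its own data split and emits (word, 1) pairs.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # After the shuffle groups pairs by key, reducers sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Two "splits", as if the input file were spread across two servers.
splits = ["big data big", "data hadoop"]
mapped = [pair for doc in splits for pair in map_phase(doc)]
result = reduce_phase(mapped)
print(result)  # {'big': 2, 'data': 2, 'hadoop': 1}
```

Here the two splits stand in for data spread across servers: each `map_phase` call could run on a different machine against its local piece, and only the small intermediate pairs travel over the network to be reduced into the single result set.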
Hadoop offers maximum flexibility and scalability and significant cost savings over alternative approaches. With Hadoop, enterprises can easily explore complex data using custom analyses tailored to their information, questions and needs. This technology can help reveal the value of Big Data.
The Hadoop platform was designed to solve problems involving vast amounts of data, including mixed data types (structured and unstructured) that do not fit nicely into the tables of relational databases. Hadoop is used for situations where you want to run analytics that are deep and computationally intensive, like clustering and targeting. Hadoop can be applied to problems across all vertical markets. Common use cases include: