Big Data Solutions

What is Hadoop?

Apache Hadoop is a powerful open-source data storage and processing project designed for sophisticated analysis of both structured and unstructured complex data. Hadoop enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of servers, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

Hadoop is made up of two main components – MapReduce and Hadoop Distributed File System (HDFS):

MapReduce - The framework that understands and assigns work to the nodes in a cluster.

HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.

Hadoop is accompanied by an ecosystem of Apache projects that extend the value of Hadoop and improve its usability. The Hadoop ecosystem includes:

  • Flume™: A distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Sqoop™: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

How is Hadoop architected?

Hadoop is designed to run on a large number of servers that don’t share any memory or disks. That means you can purchase several commodity servers, put them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, the software breaks the data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. Because multiple copies of each piece are stored, data held on a server that goes offline or dies can be automatically re-replicated from a known good copy.
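
The idea of breaking data into pieces, spreading copies across servers, and recovering after a failure can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function names (split_into_blocks, place_replicas, handle_failure), the round-robin placement rule, and the server names are all hypothetical and chosen for illustration. Real HDFS placement is handled by its own, more sophisticated logic.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break the input into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Toy round-robin placement (an assumption, not HDFS's actual policy):
    block i is copied to `replication` consecutive servers, wrapping around."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block_id: [servers[(block_id + k) % len(servers)]
                   for k in range(replication)]
        for block_id in range(num_blocks)
    }


def handle_failure(placement: dict[int, list[str]], failed: str,
                   servers: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Drop the failed server and re-copy each affected block from a
    surviving replica onto a live server that does not already hold it."""
    live = [s for s in servers if s != failed]
    for holders in placement.values():
        if failed in holders:
            holders.remove(failed)
            candidates = [s for s in live if s not in holders]
            if candidates and len(holders) < replication:
                holders.append(candidates[0])
    return placement
```

For example, with four servers and three replicas per block, losing one server still leaves two good copies of every block it held, and handle_failure restores the third copy on a remaining server.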

In a centralized database system, one big disk is connected to four or eight or sixteen processors, and that is as much horsepower as you can bring to bear. In a Hadoop cluster, every server has two, four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. This process is known as MapReduce: you map the operation out to all of the servers and then you reduce the results back into a single result set. Architecturally, you are able to deal with lots of data because Hadoop spreads the data out, and you are able to ask complicated computational questions because all of the processors are harnessed together, working in parallel.
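
The map-then-reduce pattern described above can be illustrated with a small word count, written here in plain Python rather than against the Hadoop API. This is a minimal single-process sketch: the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and the list of chunks stands in for pieces of data that would live on different servers in a real cluster.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: each server turns its own piece of the data into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Group the values from every mapper by key so each key can be reduced once."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: collapse the grouped values into a single result per key."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(chunks: list[str]) -> dict[str, int]:
    # In a cluster, each map_phase call would run on the server holding that chunk;
    # here they simply run one after another in a single process.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))
```

Calling run_job on two chunks such as "big data big" and "data cluster" yields one combined count per word, which is the "unified whole" the paragraph refers to.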

Hadoop offers flexibility, scalability, and significant cost savings compared with alternative approaches. With Hadoop, enterprises can explore complex data using custom analyses tailored to their information, questions and needs. This technology can help reveal the value of Big Data.

What problems can Hadoop solve?

The Hadoop platform was designed to solve problems concerning vast amounts of data, including mixed data types (structured and unstructured) that do not fit nicely into the tables of relational databases. Hadoop is used for situations where you want to run analytics that are deep and computationally intensive, like clustering and targeting. Hadoop can be applied to problems across all vertical markets. Common use cases include:

  • Risk Modeling - How can companies better understand customers and markets?
  • Customer Churn Analysis - Why do companies really lose customers?
  • Recommendation Engine - How can companies predict customer preferences?
  • Ad Targeting - How can companies increase campaign efficiency?
  • Point-of-Sale Transaction Analysis - How do retailers target promotions guaranteed to make you buy?
  • Analyzing Network Data To Predict Failure - How can companies use machine generated data to identify potential trouble?
  • Threat Analysis - How can companies detect threats and fraudulent activity?
  • Trade Surveillance - How can companies spot the rogue trader?
  • Search Quality - What’s in your search?
  • Data Sandbox - What can you do with new data?
