Thursday, January 17, 2013

Hadoop

Hadoop consists of a data storage service and a data processing service. The former is HDFS, short for Hadoop Distributed File System; the latter is MapReduce, a high-performance parallel data processing technique. On top of this framework sit a database named HBase, a data warehouse named Hive, and a query language named Pig. Hadoop scales horizontally: additional commodity hardware can be added without interruption, and node failures are compensated for by redundant copies of the data. It does not guarantee ACID properties, and data is read sequentially (forward only) rather than by random access. Hadoop is typically used to store unstructured data in delimited flat files, where the column names, column count and column data types don't matter.

Data is retrieved with code in two steps: a Map function and a Reduce function. The Map function selects a key from each line along with the value to hold for it, producing what is in effect a big hashtable; the Reduce function then aggregates the values grouped under each key. Together these operations yield a blob of mapped and reduced data (see the word-count sketch below). Writing this MapReduce code by hand can be tedious, so Pig provides a higher-level query language that compiles into MapReduce jobs.

The resulting key-value data can be kept in the HBase store, which is the NoSQL (read as "not only SQL") piece of the stack. HBase stores key-values as columns grouped into column families, each row can have more than one column family, and rows need not all have the same number of columns (see the HBase sketch below).

Hive is a data warehouse system. It exposes HiveQL, a SQL-like query language that supports joins, including over tables mapped from HBase (a small JDBC example is given at the end of this post). SQL Server has a Sqoop connector to Hadoop which makes data transfer between HDFS and the RDBMS easy. SCOM, AD and BI tools are also being integrated with Hadoop. Hadoop on Windows uses a user account named Isotope on all the Windows nodes for running jobs.
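The two-step retrieval described above is easiest to see in the classic word-count job. Below is a minimal sketch against the Hadoop Java MapReduce API; the class name and the input and output paths are placeholders. The Mapper emits a (word, 1) pair for every token on every input line, and the Reducer sums the values that the framework has grouped under each key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token on every input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word; the framework has already grouped values by key.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job would typically be packaged into a jar and launched with the hadoop jar command, with the two arguments pointing at HDFS input and output directories.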
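To illustrate the column-family layout, here is a small sketch against the older HBase Java client API. The table name "users" and its column families "profile" and "activity" are hypothetical and are assumed to exist already; note that the two rows written below carry different sets of columns.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table "users" with column families "profile" and "activity".
    HTable table = new HTable(conf, "users");

    // Each cell is addressed by row key + column family + column qualifier.
    Put put = new Put(Bytes.toBytes("user42"));
    put.add(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
    put.add(Bytes.toBytes("profile"), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));
    put.add(Bytes.toBytes("activity"), Bytes.toBytes("lastLogin"), Bytes.toBytes("2013-01-17"));
    table.put(put);

    // A second row in the same table can carry a different set of columns.
    Put sparse = new Put(Bytes.toBytes("user43"));
    sparse.add(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Bob"));
    table.put(sparse);

    // Read one cell back by row key, family and qualifier.
    Result result = table.get(new Get(Bytes.toBytes("user42")));
    byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(name));

    table.close();
  }
}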
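Hive queries can also be issued from Java over JDBC. The sketch below assumes a HiveServer2 endpoint; the host name and the "orders" and "customers" tables are placeholders. The HiveQL string looks like ordinary SQL, but Hive compiles it into MapReduce jobs over files in HDFS.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host, database and table names are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "", "");

    // HiveQL join and aggregation, compiled by Hive into MapReduce jobs.
    String hql = "SELECT o.customer_id, COUNT(*) AS order_count "
               + "FROM orders o JOIN customers c ON o.customer_id = c.id "
               + "GROUP BY o.customer_id";

    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(hql);
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}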
