Hadoop – 1: What is it?

Hadoop is an answer to the problem of distributed computing over massive data.

BRIEF HISTORY

When Google started its operations and got good results out of them [1] [2], others began investigating how to do the same thing, because working with data as big as the web required radical methods, and the existing ones fell short.

In 2002, Doug Cutting and his colleagues were developing a web search engine called Nutch [3] in their spare time. They got it working on a couple of machines, but to scale to the web it would have to run on thousands of computers. They soon realized this was not a project to be done with so few resources.

In 2003, Google, already successful at this kind of work, released a paper about GFS (the Google File System) [1]. Their system ran on thousands of commodity servers and could be scaled easily, which made managing massive data straightforward. Within the next year, Doug Cutting and his team had a working file system like GFS. They called it NDFS (the Nutch Distributed File System).

In 2004, Google published another paper, this time on MapReduce [2], its implementation of distributed computing. By 2005, Doug Cutting and his team already had a MapReduce implementation of their own, and most of Nutch's routines had been converted to NDFS and MapReduce.

Later on, Doug Cutting joined Yahoo!, and the resources provided for the project gave the team its big leaps. In 2008, Yahoo! announced that it had 10,000-core Hadoop clusters [5].

Some time after that, the big hype around Hadoop started.

With Java… Really?

Aside from the fact that Java has made massive progress, both in systems integration and in performance, some still would not believe that massive data of this kind could be processed with it. As it turned out, though, the bottleneck in massive data processing is seek time, not the speed of the computers or the amount of RAM available. As [5] puts it, the seek time of hard drives is improving more slowly than their transfer rate. Since the disk, not the language runtime, is the limiting factor, Java makes perfect sense:

  • First, it is an open-source project, so contributions can be made from all over the world.
  • Companies other than Microsoft helped shape it early on.
  • Java is one of the most advanced general-purpose programming languages.
  • Java is also platform independent, without any of the suspicion that surrounds the Mono system.
  • Java has a wide pool of programmers, and Hadoop would need a gigantic number of map and reduce classes to be implemented.

For these and many other reasons, I think, Java helped make Hadoop what it is now.

Why not use databases (RDBMS: relational database management systems)?

It is misleading to think of Hadoop as an RDBMS-like system. It is true that the Hadoop ecosystem is equipped with database-like systems, but what Hadoop and its ecosystem really are is a distributed processing engine. There are some similarities, though, as you can query both kinds of systems with SQL-like languages; I will talk about the differences in the sections that follow.

Seek Time

As previously mentioned, seek time is improving more slowly than transfer rate. That is to say, the more seeks you make, the slower you go; the fewer seeks you make and the more you transfer in between, the more work you get done. This is the reason Hadoop is more capable than an RDBMS on big data. In fact, seeking is the design philosophy of B-tree structures: index the records in a tree, keep the disk pointers in RAM, and when a record is needed, seek to its place on the hard drive and read it. As can be seen, seek time is central to RDBMS performance.
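
To see how lopsided the trade-off is, here is a back-of-the-envelope sketch in Java. The hardware numbers (10 ms per seek, 100 MB/s transfer rate) and the record counts are illustrative assumptions, not measurements:

    // Compare fetching 10 million records one seek at a time
    // versus streaming them in a single sequential read.
    public class SeekVsScan {
        public static void main(String[] args) {
            long records = 10_000_000L;        // 10 million records
            long recordSize = 100L;            // assumed: 100 bytes each
            double seekMs = 10.0;              // assumed: ~10 ms per random seek
            double bytesPerMs = 100_000.0;     // assumed: 100 MB/s transfer rate

            // Random access: one seek per record; transfer time is negligible.
            double randomMs = records * seekMs;
            // Sequential scan: one seek, then one long streaming read.
            double scanMs = seekMs + (records * recordSize) / bytesPerMs;

            System.out.printf("random access: ~%.1f hours%n", randomMs / 3_600_000);
            System.out.printf("sequential scan: ~%.1f seconds%n", scanMs / 1_000);
        }
    }

With these assumed numbers, ten million random reads cost roughly 28 hours of seeking, while streaming the same data sequentially takes on the order of ten seconds. That gap is what Hadoop's scan-everything model exploits.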

Update Process

RDBMSs are more capable at updating small sets of data; at that they are very fast, but as the data set grows they fall short. This is how an update is done in an RDBMS:

  1. Seek to the record.
  2. Update the record in place,
  • or mark the record as deleted and insert a new one:
    • for that, go to another place on disk (another seek),
    • insert the record there, take its new location, and update the index.

As can be seen, there can be one seek or many. That depends on the operation, of course, but the general idea is as described.
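
As a rough sketch of the contrast, here are the two access patterns in Java; the file names and record layout are hypothetical, just to show where the seeks happen:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class UpdateStyles {
        // RDBMS-style update in place: position the disk head, then overwrite.
        static void updateInPlace(String file, long offset, byte[] record) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                raf.seek(offset);   // the seek described in the list above
                raf.write(record);  // overwrite the old record bytes
            }
        }

        // Hadoop-style write: never update in place, just append sequentially.
        static void appendOnly(String file, byte[] record) throws IOException {
            try (FileOutputStream out = new FileOutputStream(file, true)) {
                out.write(record);  // sequential write at the end of the file
            }
        }
    }

The first style pays a seek for every record it touches; the second only ever writes sequentially, which is the pattern HDFS is built around (HDFS files are write-once rather than updatable in place).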

Structure-Based Discrimination

An RDBMS requires a rigid structure on the data in order to process it. Data duplication is also strictly forbidden; normalization was born from this. But as the data grows, normalization stops helping performance: it adds more seeks as the related data is traversed. So denormalization has to be undertaken to make processing perform better, though this is considered a last resort in RDBMSs [6].

Hadoop, on the other hand, is more relaxed about the structure of the data. It can take any format, any type of file, and process it. Big log files, for example, are a typical kind of data to be processed in Hadoop.
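
As an illustration, here is a minimal sketch of a MapReduce job against the standard Hadoop Java API that counts log lines per severity level. The assumption that each line starts with its level (ERROR, WARN, and so on) is mine, for the example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogLevelCount {
        public static class LevelMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text level = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assumed layout: each log line starts with its level, e.g. "ERROR ...".
                String[] parts = value.toString().split("\\s+", 2);
                if (parts.length > 0 && !parts[0].isEmpty()) {
                    level.set(parts[0]);
                    context.write(level, ONE);   // emit (level, 1) for each line
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();  // total count per level
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "log level count");
            job.setJarByClass(LogLevelCount.class);
            job.setMapperClass(LevelMapper.class);
            job.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would be launched with something like hadoop jar loglevels.jar LogLevelCount /logs/in /logs/out, with both paths living on HDFS. Note that no schema is declared anywhere: the mapper simply receives raw lines.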

Scaling

When more power is needed in an RDBMS, the answer is a more powerful computer to do the job. In Hadoop, it is just a matter of adding more commodity servers, which are sometimes a hundred times cheaper, to the cluster.

RDBMS vs. Hadoop

None of this means the two systems are incapable of integrating each other's ideas. As Tom White indicates [7], the gap between them will slowly disappear; Oracle and many other companies have already started working on it [8].

What about Grid Computing or Volunteer Computing? Don’t they do the same job?

Grid and volunteer computing are more about compute-intensive jobs than disk-intensive ones. For disk-bound jobs, Hadoop is the answer.

What about network delays?

Hadoop simply runs on the locality principle: the data and the processing parts of the system should be on the same nodes, or near each other. That way the bandwidth of the local area network can be exploited. This simple principle, together with the ideas about disk seeking above, is what gives Hadoop its power.

What about the Cloud: is it the same or different?

Everybody (or at least anybody who has taken a networking class or read about network structures) has seen the cloud picture representing the internet. The cloud is the internet, and Hadoop brings the ability to process huge amounts of data on the cloud, meaning with servers all over the world, while keeping the principles mentioned above, of course. Systems like Amazon S3 are cloud-based systems.

I want to write in languages other than Java; is it possible?

Of course. Hadoop provides means of writing map and reduce functions in languages other than Java, for example through Hadoop Streaming, which lets any program that reads standard input and writes standard output serve as a mapper or reducer. Functional languages like OCaml or F# are well suited to the jobs at hand [10] [11] [12]. I will be mentioning them in the coming articles.

What about ETL (Extract Transform Load) possibilities in Hadoop?

There exist many ETL processing engines, and many more implemented by private companies for in-house use. As a person who has spent most of his professional life on projects involving ETL, if I had to say one thing about it, it would be: "Hadoop is the future of ETL." With small data the others will prevail, or at least stay alive, but in processing gigantic data they will slowly fade away.

The Hadoop Ecosystem

Hadoop is an Apache top-level project, under which a dozen other projects are collectively called the Apache Hadoop project. In the next article, I will go over each of them briefly.

 

References:

 

1- GFS: The Google File System.
Abstract: http://research.google.com/archive/gfs.html
PDF: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/gfs-sosp2003.pdf

2- MapReduce.
Abstract: http://research.google.com/archive/mapreduce.html
PDF: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
Slides: http://research.google.com/archive/mapreduce-osdi04-slides/index.html

3- Nutch: http://nutch.apache.org/

4- Tom White, Hadoop: The Definitive Guide, 2nd Edition, O'Reilly, Chapter 1 "Meet Hadoop", p. 5.

5- Tom White, Hadoop: The Definitive Guide, 2nd Edition, O'Reilly, Chapter 1 "Meet Hadoop", p. 10.

6- C.J. Date, Database in Depth: Relational Theory for Practitioners, O'Reilly (2005), p. 152.

7- Tom White, Hadoop: The Definitive Guide, 2nd Edition, O'Reilly, Chapter 1 "Meet Hadoop", p. 6.

8- Oracle Big Data, http://www.oracle.com/us/technologies/big-data/index.html

9- Apache Hadoop, http://hadoop.apache.org/

10- Functional languages and Hadoop, http://stackoverflow.com/questions/6087834/how-scalable-is-mapreduce-in-the-original-functional-languages

11- Making Hadoop more "functional" with Scala, http://heuristic-fencepost.blogspot.com/2010/11/making-hadoop-more-functional-with.html

12- Francisco Fernández de Vega, Parallel and Distributed Computational Intelligence, p. 14: "Hadoop built upon map and reduce primitives present in functional languages".

 
