Hadoop – 2 : Map/Reduce

M/R (Map/Reduce) model is the name of two distinct feature mostly found in the functional programming languages. Map is the process of applying a mass function to each member of a list. After that, resulting list contains the processed items which may or may not have the same number of elements. Reduce on the other hand takes this map input and combines them with recursive means.

The most interesting part of these methods is they are very much suitable for parallel operations. Let us dive into why they are….

The Best Way of Synchronization

Today most of the computation revolves around parallelization techniques. From the time when the processing power stopped increasing, the manufacturers started putting two of the same processing engine to the same box. After that, forcibly, streamlined parallel processing age has begun. Whether this parallel processing was in a single box or be distributed across the planet, what it brings is the change of mentality in programming. Previously, all a programmer had to do was to design the software with a single processing engine in mind. Why not? Even the compilers until that time were optimizing their code in this way. But when more power is needed you had to dive into the parallel processing paradigm and, man, it was hard. When doing parallel processing, you must be very careful synchronizing your code. Whether the problem at hand was the producer/consumer problem or it was the dining philosophers’ problem, synchronizing was the most important part. Also grasping the structures was not an easy job for everyone.

Now, let us say, one knows everything about the synchronization and applies to the design, but this time, he realizes that, synchronization comes with a price: it is slow. The whole idea of parallelization is to make things fast, but when doing parallelization, synchronization puts burdens and makes your job harder. Hardware supported techniques also did not help too much. It was time to think the best way of doing it and it was clear: To get the most out of hardware, you should avoid using synchronization. And there it was, the best way of parallel computation was not using synchronization techniques at all.

So, is it possible? Yes, it is possible, but it will depend on your problem.

Let us think a producer/consumer problem. Let us have an object reading a file line by line and then another object processing those lines and outputting to another file.  When doing this, we have two ways:

The first way: the consumer is greedy and is taking a line whenever it is ready. In this method, the consumer object can use a list-like structure the producer provides and checks its count property and whenever an item is ready, it will take it from there. The list itself must be synchronized for parallel processing.

The second way: The consumer waits until the number of lines reaches to a certain count and then start taking it without draining it. If the count is lower than this count than it will sleep. If the producer indicates that it has finished (this happens once in the object lifetime so no way to mess up), the consumer could drain the source and then finish.

In the first way, synchronizing makes things slow. To that, if the objects reside in different processes or in different machines it will make things worse. Because inter-process communication is slow, dangerous and more expensive.

In the second way, no synchronization is required and apart from the initial time, no waiting required; perfect for the long running operations.

M/R works in much the same manner of the second way. It does not use any synchronization techniques (of course, when you have many threads working at the same time, for other purposes you can use them, but for the main job, processing the data, this will be in a minimum level). First it divides the problem with mapping stages and then “reduce” stage takes these mapped values and process it whenever some input is ready. Initial waiting times put some burden on the operation but the overall time will always be better for long running operations.

Leave a Reply

Your email address will not be published. Required fields are marked *


You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>