01.18.08
What mapreduce is
I knew it was only a matter of time before a pedantic database supporter would chime in that mapreduce is not the greatest thing since sliced bread.
Mapreduce is very cool, and the last people likely to understand this are database programmers.
What is mapreduce?
Mapreduce stems from functional programming languages such as Scheme. At the University of Illinois, I thought the CS program did students a disservice by teaching them Scheme. Apparently not. The simplest mapreduce function I can think of is counting words. Take this sentence:
every good boy does good
Let’s create a map function that simply places a one (1) next to each word and sorts the resulting pairs by word.
(boy, 1)
(does, 1)
(every, 1)
(good, 1)
(good, 1)
Next, let’s create a reduce function that adds up the numbers for any key that appears more than once.
(boy, 1)
(does, 1)
(every, 1)
(good, 2)
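The two steps above can be sketched in a few lines of Python. This is only a single-machine illustration, not how Hadoop or any real framework implements it, and the function names map_words and reduce_counts are made up for this example.

```python
from itertools import groupby
from operator import itemgetter

def map_words(text):
    """Map step: emit a (word, 1) pair for every word, sorted by word."""
    return sorted((word, 1) for word in text.split())

def reduce_counts(pairs):
    """Reduce step: sum the values of all pairs that share a key.

    Assumes the pairs arrive sorted by key, as map_words returns them.
    """
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    ]

pairs = map_words("every good boy does good")
counts = reduce_counts(pairs)
print(pairs)
print(counts)
```

The sort matters: it puts identical keys next to each other, so the reduce step only ever has to look at one run of equal keys at a time.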
That’s mapreduce. If you had 1 trillion words to count, mapreduce becomes more useful. The data can start out spread across many nodes and come back counted and sorted, again spread across many nodes. Even better, numbers are not necessary here. URLs can serve as both keys and values, such as (“www.amazon.com”, “ebay.com”). Using this model, reverse links can be counted, Monte Carlo simulations can run, and the result is PageRank. There are many uses. This does not cover every coding problem, but it opens many doors for formerly difficult problems.
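Here is a rough sketch of the reverse-link idea, with URLs as both keys and values. The link data is invented for illustration, the helper names are hypothetical, and this stops well short of PageRank; it only groups and counts inbound links.

```python
from collections import defaultdict

# Hypothetical (source, target) link pairs, made up for this example.
links = [
    ("www.amazon.com", "ebay.com"),
    ("example.org", "ebay.com"),
    ("ebay.com", "www.amazon.com"),
]

def map_reverse(link_pairs):
    """Map step: flip each (source, target) pair into (target, source)."""
    return [(target, source) for source, target in link_pairs]

def reduce_inbound(pairs):
    """Reduce step: collect every source that links to the same target."""
    grouped = defaultdict(list)
    for target, source in pairs:
        grouped[target].append(source)
    return {target: sorted(sources) for target, sources in grouped.items()}

inbound = reduce_inbound(map_reverse(links))
inbound_counts = {target: len(sources) for target, sources in inbound.items()}
print(inbound)
print(inbound_counts)
```

Note that nothing here is a number until the very last line. The keys and values are plain strings all the way through the map and reduce steps.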
Why would database people not understand this?
Database folks appear to be in a rut. They are overly concerned with optimization. They create the index, continue to add to the index, and retrieve data very quickly using the index. What mapreduce does appears wasteful. It creates the index once, cannot add to the index, and throws away all of the work each time it is run. I think someone in the Hadoop world should solve this problem of throwing away the index, but that’s only an optimization. Optimizations do not count when a paradigm shift has occurred. Now, from the article:
The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.
* Schemas are good.
* Separation of the schema from the application is good.
* High-level access languages are good.
Are schemas good? This implies that data should be strongly typed. Now we are getting into the strong-typing argument that never seems to be won between Java and C vs. Perl and PHP. My guess, by extension, is that Perl and PHP are bad and Hadoop is bad as well.
Schemas and applications could be separated, but that makes sense only in a database world. In mapreduce, the programmer is given control over his data. It’s called freedom. This also sounds too much like MVC arguments. Yes, databases and MVC save money in some cases, but in other cases, they just hold back creativity and development.
High-level languages are good. I agree. In a way, mapreduce programs are written in a high-level language in all cases. The loop that you think you write in Java or C is actually torn apart and run on many nodes. It only appears that the loop runs on one machine, yet it runs on many machines in tiny pieces.
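To make the torn-apart loop concrete, here is a toy simulation. It is a sketch under loud assumptions: threads on one machine stand in for nodes, the chunking scheme is arbitrary, and the names count_chunk and distributed_count are invented for this example. A real framework handles the splitting, scheduling, and merging for you.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(words):
    """The loop as the programmer writes it: count the words in one chunk."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

def distributed_count(words, workers=3):
    """Run the same loop on several chunks at once, then merge the results."""
    # Deal the words out round-robin, one chunk per simulated node.
    chunks = [words[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Merge the partial counts from every simulated node.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

words = "every good boy does good".split()
total = distributed_count(words)
print(total)
```

The loop body in count_chunk never knows it is seeing only a slice of the data, which is the point: you write one loop, and it runs in tiny pieces.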
Some of you out there should try a little mapreduce programming and see how it screws with your mind. It’s wonderful to feel different about a loop. I feel just as good doing this as when I first learned SQL.