Eureka, I Can Mongo-Doop With Dumbo, Oops?

OK, so thanks to the "klaasy" guy working on dumbo and some uncomfortable netbook power coding on the train, I was able to keep using my two recent favorite tools, Python and MongoDB. I merged the current mongo-hadoop repo with a fork that had implemented typedbytes Mongo input and output formats (and cleaned it up a teensy bit), and voila, you can do a simple dumbo wordcount as follows:

import ast
import dumbo
from dumbo.lib import sumreducer  # needed for job.additer(Mapper, sumreducer) below

class Mapper:
    def __call__(self, key, value):
        # Each value arrives as the string form of a dict. Since we are not storing
        # binary data, ast.literal_eval can safely convert the string back to a dict.
        value = ast.literal_eval(value)
        wordkey = value["key"]
        text = value["text"]
        # split() with no argument splits on newlines and runs of spaces alike,
        # so empty strings are never counted as words
        for word in text.split():
            yield (str(wordkey), str(word)), 1

if __name__ == "__main__":
    job = dumbo.Job()
    job.additer(Mapper, sumreducer)
    job.run()
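
If you want to sanity-check the mapper logic before involving Hadoop at all, something like the following works as a rough sketch. It does not import dumbo; `map_record` and `sum_reduce` are hypothetical stand-ins for the mapper body and for a sum reducer, and the sample record is made up, assuming input documents carry a "key" and a "text" field as in the mapper above.

```python
import ast
from collections import defaultdict


def map_record(value):
    """Mirror the mapper body: parse the dict string, emit ((key, word), 1)."""
    record = ast.literal_eval(value)
    # Split on any whitespace, so newlines and repeated spaces are handled
    for word in record["text"].split():
        yield (str(record["key"]), str(word)), 1


def sum_reduce(pairs):
    """Group the emitted pairs by key and sum the counts."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Hypothetical record: the string form of a dict, as the mapper receives it
sample = repr({"key": "doc1", "text": "to be or\nnot to be"})
counts = sum_reduce(map_record(sample))

for (doc, word), total in sorted(counts.items()):
    print(doc, word, total)
```

Each output line is a (document key, word) pair with its total, which is the same shape of result the real job would write out.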

dumbo rm test.out -hadoop /usr/lib/hadoop
dumbo start ~/scratchbox/dumbo/wordcount.py \
    -hadoop /usr/lib/hadoop \
    -libjar core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar \
    -libjar streaming/target/mongo-hadoop-streaming-1.0-SNAPSHOT.jar \
    -inputformat com.mongodb.hadoop.typedbytes.MongoInputFormat \
    -input mongodb://127.0.0.1/texts.in \
    -outputFormat com.mongodb.hadoop.typedbytes.MongoOutputFormat \
    -output test.out \
    -conf examples/dumbo/mongo-wordcount.xml


The conf file contains the actual MongoDB config options, such as the destination collection and the output key type.
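
To give a feel for what such a file looks like, here is a small sketch that generates a Hadoop-style configuration XML from Python. The property names (`mongo.output.uri`, `mongo.job.output.key`) and their values are assumptions for illustration only, and `build_conf` is a hypothetical helper; check examples/dumbo/mongo-wordcount.xml in the repo for the options the job really uses.

```python
import xml.etree.ElementTree as ET


def build_conf(properties):
    """Render a dict of settings as Hadoop <configuration> XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed property names and values, for illustration only
conf_xml = build_conf({
    "mongo.output.uri": "mongodb://127.0.0.1/texts.out",
    "mongo.job.output.key": "org.apache.hadoop.io.Text",
})
print(conf_xml)
```

Writing the returned string to a file gives you something you can pass with -conf.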

Contact me for more info, and here's the repo: http://bit.ly/wyPA5g

