Eureka, I Can Mongo-Doop With Dumbo, Oops?
OK, so thanks to the "klaasy" guy working on dumbo, and some uncomfortable netbook power coding on the train, I was able to keep using my two recent favorite tools, Python and MongoDB. I merged the current mongo-hadoop repo with a fork that had implemented typedbytes Mongo input and output formats (and cleaned it up a teensy bit), and voila, you can do a simple dumbo wordcount as follows:
import ast
import dumbo

class Mapper:
    def __call__(self, key, value):
        # Values arrive as the string form of a dict. Since we are not storing
        # binary data, ast.literal_eval can safely turn the string back into a dict.
        value = ast.literal_eval(value)
        wordkey = value["key"]
        text = value["text"]
        # Flatten newlines so the text splits on single spaces only.
        text = " ".join(text.split("\n"))
        for word in text.split(" "):
            yield (str(wordkey), str(word)), 1

if __name__ == "__main__":
    job = dumbo.Job()
    job.additer(Mapper, dumbo.sumreducer)
    job.run()
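If you want to see what the mapper emits without spinning up Hadoop, here is a rough local sketch. It assumes records arrive as the string form of a dict with "key" and "text" fields, as in the job above. The helpers map_record and sum_counts and the sample document are made up for illustration; sum_counts is only a stand-in for what a summing reducer does, not dumbo's own code.

```python
import ast

def map_record(value):
    """Mirror the mapper body for one record (hypothetical helper)."""
    record = ast.literal_eval(value)
    text = " ".join(record["text"].split("\n"))
    for word in text.split(" "):
        yield (str(record["key"]), str(word)), 1

def sum_counts(pairs):
    """Stand-in for a summing reducer: add up the counts per key."""
    totals = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    return totals

# Made-up sample document, shaped like the records the mapper expects.
sample = str({"key": "doc1", "text": "to be or\nnot to be"})
counts = sum_counts(map_record(sample))
print(counts[("doc1", "to")])
```

Each output key is a (document key, word) pair, so counts are kept per document rather than across the whole collection.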
To run the job on Hadoop:
dumbo rm test.out -hadoop /usr/lib/hadoop
dumbo start ~/scratchbox/dumbo/wordcount.py \
    -hadoop /usr/lib/hadoop \
    -libjar core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar \
    -libjar streaming/target/mongo-hadoop-streaming-1.0-SNAPSHOT.jar \
    -inputformat com.mongodb.hadoop.typedbytes.MongoInputFormat \
    -input mongodb://127.0.0.1/texts.in \
    -outputformat com.mongodb.hadoop.typedbytes.MongoOutputFormat \
    -output test.out \
    -conf examples/dumbo/mongo-wordcount.xml
The conf file contains the actual MongoDB config options, such as the destination collection and the output key type.
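For a feel of what such a file looks like, here is a small sketch that writes a Hadoop-style configuration from a Python dict. The property names and values are assumptions for illustration, modelled on the usual mongo-hadoop settings, and build_conf is a hypothetical helper. Check the mongo-wordcount.xml in the repo for the options the job really uses.

```python
import xml.etree.ElementTree as ET

def build_conf(options):
    """Render a dict of options as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in sorted(options.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Assumed property names and values, for illustration only.
options = {
    "mongo.output.uri": "mongodb://127.0.0.1/test.out",
    "mongo.job.output.key": "org.apache.hadoop.io.Text",
}
conf_xml = build_conf(options)
print(conf_xml)
```

Hadoop configuration files are just name/value property pairs inside a configuration element, so a new option is one more dict entry.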
Contact me for more info. Here's the repo: http://bit.ly/wyPA5g