Distributed Vision Processing - Part 1

It’s been a while since I voiced any opinions or tips on the interwebs. Let’s just say I’ve been on a magical journey of figuring out what the hell I’m going to do in the coming years.
After all the smoke cleared, I saw the path. The path back to the past, that is: computer vision processing, my concentration in college.

What vision projects?

I’m currently working on two vision-processing-related projects: one has to do with TV, the other with traffic. Both projects share a similar platform for processing live image streams at 24 fps. It was fun to see my prototype work by just spawning some processes on my dev laptop and watching everything automate itself; this is how programmers entertain themselves. Unfortunately, that euphoric feeling was cut short when I started to think about how this tiny thing would scale. I was off to my six-foot whiteboard to discuss scalability with the team. My team consists of me the programmer, me the devops, and me the R&D guy; surprisingly, we never agree on anything.

Ok, scalability - It’s all fun and games until you exhaust all your server resources.

These vision projects require a lot of bandwidth and storage. Our R&D guy decided that it would be too expensive to run these projects in the cloud, since bandwidth costs in the country where they would be deployed are ridiculous. For example, each vision sensor node would generate 25 FPS at a size of 14 KB per frame; in an hour we would generate ~1.2 GB of data, and we don’t have enough money for this project to make it rain yet. We would need to build a server farm that can process 24 FPS * any number of TV or traffic video capture nodes. Each node in the farm or sub-farm would need to be aware of other nodes. The idea was to create a system that allows a CV programmer to create services that operate on any frame captured, e.g. a service that counts the number of heads in a frame and publishes/saves the result to the cluster/farm for any other service that may use this info. Each service can then be scaled automatically depending on load.
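To make that arithmetic explicit (assuming the 14 KB figure means kilobytes per frame, which is what the ~1.2 GB/hour number suggests), here is a quick back-of-the-envelope script; the `farm_gb_per_day` helper is just my own illustration:

```python
# Back-of-the-envelope bandwidth/storage estimate for one vision sensor node,
# using the figures from the post: 25 frames/sec at ~14 KB per frame.
FPS = 25
FRAME_KB = 14
SECONDS_PER_HOUR = 3600

kb_per_hour = FPS * FRAME_KB * SECONDS_PER_HOUR
gb_per_hour = kb_per_hour / (1024.0 * 1024.0)

print("%.2f GB per node per hour" % gb_per_hour)  # ~1.2 GB

def farm_gb_per_day(nodes):
    """Daily data volume for a farm with the given number of capture nodes."""
    return gb_per_hour * 24 * nodes
```

At those rates even a modest farm generates hundreds of gigabytes a day, which is where the cloud bandwidth bill stops being funny.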

These reasons led to the creation of a private vision processing cluster based on open source technologies like Riak, Consul, Gnatsd, Openframeworks and WeedFs.

I intend to explain how I used each service to build my distributed CV server in upcoming blog posts.

Thanks for reading

Django Frontend Edit

Django Frontend Edit was created to add frontend “add” and “delete” functions for Mezzanine apps, but it works with any type of model.

It provides a cleaner, neater way of adding frontend add and delete functionality. It uses Django’s permissions model to determine whether a user can perform the frontend actions. It is especially good for quickly prototyping an app.

You can quickly provide add and delete functions to a todo app, for example, using the template code below:

    {# This will render a nice ajax form for adding new items #}
    {% can_add object_list text %}
        {% for todoitem in object_list %}
            {{ todoitem.text }}
            {% can_delete todoitem %}
        {% endfor %}
    {% endcan_add %}
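Under the hood, tags like these lean on Django’s standard permission codenames (`app_label.add_model`, `app_label.delete_model`). Here is a minimal, dependency-free sketch of that check; the helper names and the `FakeUser` stand-in are mine for illustration, not the library’s actual API:

```python
# Sketch of the permission check the {% can_add %} / {% can_delete %} tags
# rely on. Django builds permission codenames as "<app_label>.add_<model>"
# and "<app_label>.delete_<model>", and the tag only renders its form when
# user.has_perm(...) returns True.

def perm_codename(action, app_label, model_name):
    """Build a Django-style permission codename, e.g. 'todo.add_todoitem'."""
    return "%s.%s_%s" % (app_label, action, model_name.lower())

class FakeUser:
    """Stand-in for django.contrib.auth's User, for this sketch only."""
    def __init__(self, perms):
        self.perms = set(perms)

    def has_perm(self, codename):
        return codename in self.perms

def can_add(user, app_label, model_name):
    return user.has_perm(perm_codename("add", app_label, model_name))

def can_delete(user, app_label, model_name):
    return user.has_perm(perm_codename("delete", app_label, model_name))
```

Because the check is plain Django permissions, you can grant or revoke the frontend actions from the admin like any other permission.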

Check out the example in the repo http://t.co/uvxlgnd3

Eureka, I can mongo-doop with dumbo, oops?

Ok, so thanks to the "klaasy" guy working on Dumbo and some uncomfortable netbook power-coding on the train, I was able to keep using my two recent favorite tools, Python and MongoDB. I merged the current mongo-hadoop repo with a fork which had implemented typedbytes Mongo input and output formats (and cleaned it up a tiny bit), and voilà, you can do a simple Dumbo wordcount as follows:

import ast
import dumbo
from dumbo import sumreducer

class Mapper:
    def __call__(self, key, value):
        # Values come in dict form; since we are not storing binary data,
        # it is safe to convert the string back to a dict.
        value = ast.literal_eval(value)
        wordkey = value["key"]
        text = value["text"]
        text = " ".join(text.split("\n"))
        for word in text.split(" "):
            yield (str(wordkey), str(word)), 1

if __name__ == "__main__":
    job = dumbo.Job()
    job.additer(Mapper, sumreducer)
    job.run()

dumbo rm test.out -hadoop /usr/lib/hadoop; dumbo start ~/scratchbox/dumbo/wordcount.py -hadoop /usr/lib/hadoop -libjar core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar -libjar streaming/target/mongo-hadoop-streaming-1.0-SNAPSHOT.jar -inputformat com.mongodb.hadoop.typedbytes.MongoInputFormat -input mongodb:// -outputformat com.mongodb.hadoop.typedbytes.MongoOutputFormat -output test.out -conf examples/dumbo/mongo-wordcount.xml

The conf file used contains the actual mongo config options like the destination collection, output key type, etc.

Contact me for more info; here’s the repo: http://bit.ly/wyPA5g

Ramble Rumble for Languages

I’m tired of listening to programmers ramble about which framework or programming language is better. So let me get this straight: you want me to sacrifice speed (C++) for programming convenience and smoothness (Python)? No, I don’t. If you believe your program needs to be faster than the speed of light, then by all means code in a fast language; you can write machine code for all I care. I like being able to quickly prototype programs on my netbook while on a train, and as a result I have tied the knot with Python. That is not to say I don’t understand the kind of relationship I’m getting into. Obviously, I cannot create the next Far Cry using Python (I mean, I could, but it would not be for mass production). If I wanted to do that, I would use C/C++.

This logic scales up to higher-level platform disputes, like the usual “which one is better: Ruby on Rails, Django, or just plain PHP?” It’s all about preference. I like big butts; my preference.

Verbose TFxIDF (Weighting) Example with Dumbo, The Beginning

Recently, I ventured into the world of information retrieval and data mining, because it’s cool to learn something new and it is the future of the “InterWebbs”. Over the past few weeks, I have buried my head in research papers, books, and source code, with my trusty netbook as my sidekick. One of the concepts I have picked up is the infamous TFxIDF, the super smart weighting algorithm that everyone seems to rave about. It’s a nifty algorithm which adequately weighs a term in a text according to its relevance. I will not go on about what exactly it is, because Google does a good job of explaining things.
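For a taste of the idea before the MapReduce version, here is a toy in-memory sketch of my own (not the Dumbo pipeline): tf is a term’s share of its document, idf is log(N/df), and the weight is their product.

```python
from math import log

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf  = term count / document length
    idf = log(number of docs / number of docs containing the term)
    """
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            tf = doc.count(term) / float(len(doc))
            w[term] = tf * log(float(n) / df[term])
        weights.append(w)
    return weights
```

Note how a term that appears in every document gets idf = log(1) = 0, i.e. no weight at all; that is the whole point of the idf factor.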

I also came across this method called MapReduce, developed by my heroes at Google. As a fan of structured programming and methodologies, I like adopting tested, well-structured ways of solving problems. MapReduce helps you break large problems into simple tasks which can then be split across a cluster. Hadoop seems to be the best framework for executing MapReduce jobs.
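The model is easy to emulate in plain Python before any cluster is involved. Here is a toy word count with an explicit shuffle step between map and reduce; all the names are mine, not Hadoop’s:

```python
from itertools import groupby
from operator import itemgetter

def mapper(doc):
    # Map phase: emit a (word, 1) pair for every word in the document.
    for word in doc.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all the counts emitted for one key.
    return (word, sum(counts))

def map_reduce(docs):
    # Shuffle/sort: group mapper output by key, as Hadoop does between phases.
    pairs = sorted(kv for doc in docs for kv in mapper(doc))
    return [reducer(key, [count for _, count in group])
            for key, group in groupby(pairs, key=itemgetter(0))]
```

The value of the pattern is that the mapper and reducer never see global state, so the framework is free to run thousands of them in parallel.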

Ok, so I know what TFxIDF and Hadoop are used for; how can I implement this in my own project written in Python? [Enter Dumbo.] In layman’s terms, Dumbo lets you write MapReduce code in Python and run it on a Hadoop cluster.

Great, let’s start coding. I went through the Dumbo TFxIDF example, the short tutorial, and the IRBook (a great introductory information retrieval book). I could not really follow the example because it seemed like the Dumbo creator writes his example code for experts and not dummies (pun? intended?). There is mapperA, mapperB, reducerC, reducerD, etc. So I give a more “verbose” example here. It also calculates the Euclidean length for each document/text (to be used in calculating the Euclidean distance between two points, e.g. a query string and a document).
Over the coming weeks, I will explain parts of the code and also clean it up a little.

from dumbo import opt, sumreducer
from math import pow, log, sqrt

# Assumes a tokenize() helper is defined or imported elsewhere.

@opt("addpath", "yes")
class TokenCountMapper:
    def __call__(self, doc, line):
        """Generate a tokenized list which may contain repeated words.
        Tokens are grouped and counted in the reducer."""
        tokens = tokenize(line.lower())
        for token in tokens:
            yield (doc[1], token), 1

# TokenCountReducer is skipped; the Dumbo sumreducer helper is used instead.

class DocumentTokenCountMapper:
    def __call__(self, key, tokenCount):
        """Receives the total count of a token in each document."""
        doc, token = key
        yield doc, (token, tokenCount)

class DocumentTokenCountReducer:
    def __call__(self, doc, value):
        """Sum the number of tokens in a document."""
        values = list(value)
        # total number of tokens in doc n
        totalNumberOfTokens = sum(tokenCount for token, tokenCount in values)
        # yield token info for the current doc
        for token, tokenCount in values:
            yield (token, doc), (tokenCount, totalNumberOfTokens)

class TokenCountDocumentCountMapper:
    def __call__(self, key, value):
        token, doc = key
        tokenCount, totalNumberOfTokens = value
        # this token has this info and appears in this document
        yield token, (doc, tokenCount, totalNumberOfTokens, 1)

class TokenCountDocumentCountReducer:
    def __call__(self, token, value):
        values = list(value)
        # count the number of documents this token appears in
        df = sum(docCount for doc, tokenCount, docTokCount, docCount in values)
        for doc, tokenCount, docTokCount in (value[:3] for value in values):
            yield (doc, token), (tokenCount, docTokCount, df,
                                 float(tokenCount) / docTokCount)

class EuclideanLengthMapper:
    def __call__(self, key, value):
        doc, token = key
        tokenCount, docTokCount, df, tf = value
        yield token, (doc, tokenCount, docTokCount, df, tf)

class EuclideanLengthReducer:
    def __call__(self, token, value):
        values = list(value)
        for doc, tokenCount, docTokCount, df, tf in values:
            yield (doc, token), (tokenCount, docTokCount, df, tf,
                                 pow(float(tokenCount), 2))

class EuclideanLengthSummerMapper:
    def __call__(self, key, value):
        doc, token = key
        tokenCount, docTokCount, df, tf, poww = value
        yield doc, (token, tokenCount, docTokCount, df, tf, poww)

class EuclideanLengthSummerReducer:
    def __call__(self, doc, value):
        values = list(value)
        totalDistances = sum(v[5] for v in values)
        for token, tokenCount, docTokCount, df, tf, poww in values:
            yield (doc, token), (docTokCount, df, tf, poww,
                                 sqrt(totalDistances))

if __name__ == "__main__":
    import dumbo
    job = dumbo.Job()
    job.additer(TokenCountMapper, sumreducer, sumreducer)
    job.additer(DocumentTokenCountMapper, DocumentTokenCountReducer)
    job.additer(TokenCountDocumentCountMapper, TokenCountDocumentCountReducer)
    job.additer(EuclideanLengthMapper, EuclideanLengthReducer)
    job.additer(EuclideanLengthSummerMapper, EuclideanLengthSummerReducer)
    job.run()
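If you want to sanity-check the last two iterations without a cluster: per document they boil down to the Euclidean length of its term-count vector, i.e. the square root of the sum of squared token counts. A direct one-function sketch (my own, outside the MapReduce flow):

```python
from math import sqrt

def euclidean_length(token_counts):
    """Euclidean length of a document's term-count vector:
    sqrt(sum of squared counts), the value the summer reducer emits."""
    return sqrt(sum(count * count for count in token_counts.values()))
```

Dividing each term weight by this length normalizes documents to unit vectors, which is what makes distances between a query and documents of different sizes comparable.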

Recipe for the Semantic Web - mongodb, hadoop, nltk, scrapy, django

Recently, I have been working on my dream project (5 years and counting), which I came up with during the first few months of my freshman year back in ’06. It was supposed to be the best thing to happen to the internet (in my head), but I was never able to complete it.