IMRU

What is IMRU?

IMRU is an open-source implementation of Iterative Map-Reduce-Update, built on top of Hyracks.

IMRU provides a general framework for parallelizing a large class of machine learning algorithms, including batch learning and Expectation Maximization. These algorithms can be modeled as a sequence of iterations, where each iteration comprises three sequential steps: (a) all training data are evaluated using the current model; (b) the evaluations are aggregated; and (c) the model is revised based on the aggregated evaluation. The iterations continue until the model converges.

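Concretely, an IMRU job supplies these three steps as callbacks. The sketch below is inferred from the Batch Gradient Descent example that follows, not copied from the authoritative interface; the exact signatures, generic bounds, and additional callbacks (such as input parsing and a convergence test) are documented in the manuals linked under Releases.

import java.io.IOException;
import java.util.Iterator;

// Sketch of the callback structure implied by the example below.
// IMRUContext and IMRUDataException come from the IMRU API.
public interface IIMRUJob<Model, Data, Result> {
    // (a) evaluate one partition of the training data against the current model
    Result map(IMRUContext ctx, Iterator<Data> input, Model model)
            throws IOException;

    // (b) aggregate the per-partition evaluations
    Result reduce(IMRUContext ctx, Iterator<Result> input)
            throws IMRUDataException;

    // (c) revise the model based on the aggregated evaluation
    Model update(IMRUContext ctx, Iterator<Result> input, Model model)
            throws IMRUDataException;
}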

Batch Gradient Descent example:

public class BGDJob implements IIMRUJob<Model, Data, Gradient> {
    // ...

    // (a) map: compute a partial gradient over one partition of the
    // training data, using the current model.
    @Override
    public Gradient map(IMRUContext ctx, Iterator<Data> input, Model model)
            throws IOException {
        Gradient g = new Gradient(model.numFeatures);
        while (input.hasNext()) {
            Data data = input.next();
            // ...
            for (int i = 0; i < data.fieldIds.length; i++)
                g.gradient[data.fieldIds[i]] += data.label * data.values[i];
            // ...
        }
        return g;
    }

    // (b) reduce: sum the partial gradients produced by map.
    @Override
    public Gradient reduce(IMRUContext ctx, Iterator<Gradient> input)
            throws IMRUDataException {
        // 'features' is a field of BGDJob holding the feature count
        // (its declaration is elided above).
        Gradient g = new Gradient(features);
        while (input.hasNext()) {
            Gradient buf = input.next();
            // ...
            for (int i = 0; i < g.gradient.length; i++)
                g.gradient[i] += buf.gradient[i];
        }
        return g;
    }

    // (c) update: apply the averaged gradient, scaled by the step size,
    // to the model weights.
    @Override
    public Model update(IMRUContext ctx, Iterator<Gradient> input, Model model)
            throws IMRUDataException {
        Gradient g = reduce(ctx, input);
        // ...
        for (int i = 0; i < model.weights.length; i++)
            model.weights[i] += g.gradient[i] / g.total * model.stepSize;
        // ...
        return model;
    }
    // ...
}

The above code implements Batch Gradient Descent on IMRU: map computes a partial gradient over one partition of the training data, reduce sums the partial gradients, and update applies the aggregated gradient to the model.
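
For reference, here is a minimal sketch of the Data, Gradient, and Model helper classes, reconstructed from the fields the example touches. Everything beyond those field names, including the Serializable bound (assumed because IMRU presumably ships these objects between map, reduce, and update), is an assumption.

import java.io.Serializable;

// Reconstructed from the fields used in BGDJob; the real definitions may differ.
class Data implements Serializable {
    int[] fieldIds;    // indices of the non-zero features (sparse record)
    double[] values;   // values of those features
    double label;      // training label
}

class Gradient implements Serializable {
    double[] gradient; // one partial derivative per feature
    long total;        // number of examples accumulated (assumed; updated in elided code)

    Gradient(int numFeatures) {
        gradient = new double[numFeatures];
    }
}

class Model implements Serializable {
    int numFeatures;   // dimensionality of the feature space
    double[] weights;  // one weight per feature
    double stepSize;   // learning rate applied in update()
}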

Performance:

[Figure: performance of Batch Gradient Descent on IMRU]

The figure above plots the performance of one iteration of Batch Gradient Descent on a Yahoo! News dataset (16,557,921 records). The curves show the performance of Spark (Zaharia et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, NSDI 2012) and two variants of IMRU: one optimized for time and one optimized for cost.

Releases:

Release 0.2.3 (User Manual | Dev Manual)

Contact:

Usage questions: imru-users mailing list (imru-users@googlegroups.com)

IMRU Publications:

Scaling Datalog for Machine Learning on Big Data. UCI, UCSC, and Yahoo! Research technical report (not officially published). This system is described in the IMRU portion of the report.