MongoDB-Hadoop connector on Azure HDInsight

Posted on 18 Dec 2013

Recently, MongoDB brought out a connector that lets you use Pig, Hive and core MapReduce in Hadoop to operate on Mongo-sourced data.

Someone on Stack Overflow was asking about connecting Mongo data to HDInsight, so I thought I’d make it work.

First off, make the sbt build of mongo-hadoop work. Out of the box you can’t build it for HDInsight, since the Hadoop version HDInsight runs (1.2) is not one of the options the Mongo build supports. This is not as serious as it sounds; we just need to adjust the build script.

The changes required are in my GitHub repo, pending a PR to Mongo.

All you need to do then is change the build.sbt file (not in my PR, for obvious reasons):

hadoopRelease in ThisBuild := "1.2"

Then you can run the build. I’m doing that on my Mac, but you can do it from a Windows command line in much the same way.

./sbt package

This will produce a few jar files.

./core/target/mongo-hadoop-core_1.2.0-1.2.0.jar
./flume/target/mongo-flume-1.2.0.jar
./hive/target/mongo-hadoop-hive_1.2.0-1.2.0.jar
./pig/target/mongo-hadoop-pig_1.2.0-1.2.0.jar

Upload these jars to the Azure storage container associated with your cluster; you will then need to point to this location whenever you want to use the Mongo functionality. Alternatively, you could put them in the magic location /user/hdp/share/lib/ (though I’ve not personally tested that). You will also need to download the Mongo Java Driver and put it in the same place.
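If you have a session on the cluster’s head node, the Grunt shell’s fs passthrough is one way to do the copy. A minimal sketch, assuming the jars sit in your current directory and BLOB_LOCATION stands in for your container path as in the scripts below:

fs -copyFromLocal mongo-java-driver-2.9.3.jar wasb://BLOB_LOCATION/mongo-connector/mongo-java-driver-2.9.3.jar
fs -copyFromLocal mongo-hadoop-core_1.2.0-1.2.0.jar wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-core_1.2.0-1.2.0.jar
fs -copyFromLocal mongo-hadoop-pig_1.2.0-1.2.0.jar wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-pig_1.2.0-1.2.0.jar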

If building these is a bit of a pain, I’ve put them all in a zip for you.

Now let’s try running a Pig script over Mongo data. Here’s an example that just reads a BSON file produced by mongodump.

REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-java-driver-2.9.3.jar';
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-core_1.2.0-1.2.0.jar';
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-pig_1.2.0-1.2.0.jar';

raw = LOAD 'wasb://BLOB_LOCATION/yield_historical_in.bson' using com.mongodb.hadoop.pig.BSONLoader;
raw_limited = LIMIT raw 3;
DUMP raw_limited;
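Without a schema, BSONLoader hands each document back as a single Pig map, so you can pull individual fields out with the # dereference operator. A minimal sketch, assuming the dump carries a bc10Year field, as Mongo’s treasury-yield sample data does:

REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-java-driver-2.9.3.jar';
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-core_1.2.0-1.2.0.jar';
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-pig_1.2.0-1.2.0.jar';

raw = LOAD 'wasb://BLOB_LOCATION/yield_historical_in.bson' using com.mongodb.hadoop.pig.BSONLoader;
-- $0 is the whole document as a map; 'bc10Year' is an assumed field name
ten_year = FOREACH raw GENERATE $0#'bc10Year';
ten_year_limited = LIMIT ten_year 3;
DUMP ten_year_limited;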

And here’s a similar example using a live Mongo connection:

-- First, register the jar dependencies
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-java-driver-2.9.3.jar';        -- mongodb java driver
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-core_1.2.0-1.2.0.jar';  -- mongo-hadoop core lib
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-pig_1.2.0-1.2.0.jar';   -- mongo-hadoop pig lib

raw = LOAD 'mongodb://mongohost/Db.collection' using com.mongodb.hadoop.pig.MongoLoader;
raw_limited = LIMIT raw 3;
DUMP raw_limited;
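Used without arguments like that, MongoLoader also returns each document as a map. It can instead be given a Pig schema (and optionally an id alias) so fields come back named and typed; a minimal sketch, where symbol and price are purely illustrative field names:

REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-java-driver-2.9.3.jar';
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-core_1.2.0-1.2.0.jar';
REGISTER 'wasb://BLOB_LOCATION/mongo-connector/mongo-hadoop-pig_1.2.0-1.2.0.jar';

-- 'symbol' and 'price' are hypothetical fields; swap in your collection's own
quotes = LOAD 'mongodb://mongohost/Db.collection'
         using com.mongodb.hadoop.pig.MongoLoader('symbol:chararray, price:double');
expensive = FILTER quotes BY price > 100.0;
expensive_limited = LIMIT expensive 3;
DUMP expensive_limited;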

Similar methods ought to work for Hive and MapReduce jobs, but that’s a blog for another day.
