Adventures in NoSQL, part 1

You've deployed and set up a private cloud platform, but now what? You need an application!

I've been experimenting with a number of technologies to generate workloads and give some demos to prospective Eucalyptus customers. A NoSQL database makes a great demo use case: the technology is designed for scale-out workloads, which happens to be exactly what an IaaS cloud does best.

There is an abundance of NoSQL implementations (Cassandra, MongoDB, Couchbase, Neo4j...), written in different programming languages, with slightly different takes on which two of the three CAP theorem guarantees they choose to prioritise and on how they store and present data.

For this post I'm going to use MongoDB, which is in the "CP" camp: it favours Consistency and Partition tolerance over Availability (not every request is guaranteed a response), although MongoDB still provides some great availability options.

MongoDB is supported by 10gen, seems fairly mature and has a large community of users with modules for a ton of different programming languages. Cassandra also interests me and I'll tackle that in a later post.

We also need a bunch of data. There are large datasets available on the internet, but last week I read a post on using the Twitter streaming API with Ruby and storing that data in MongoDB, and thought it would be cool to do the same, albeit with Python instead of Ruby.

Creating an ssh keypair and application security group

To start, let's set up a keypair and security group for MongoDB so that we can ensure it is not accessed by anyone else:

# Ensure we have our Eucalyptus or Amazon credentials in the environment
source ~/eucarc
# Create an ssh keypair
euca-add-keypair mongodb > ~/mongodb.key
chmod 400 ~/mongodb.key
# Add SSH, MongoDB and MongoDB admin interface ports to mongodb security group
euca-create-group mongodb -d "MongoDB databases"
# Replace 0.0.0.0/0 with your IP e.g. 1.2.3.4/32 to restrict it to just your system
euca-authorize -P tcp -p 22 -s 0.0.0.0/0 mongodb
euca-authorize -P tcp -p 27017 -s 0.0.0.0/0 mongodb
euca-authorize -P tcp -p 28017 -s 0.0.0.0/0 mongodb
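
The comment above suggests replacing 0.0.0.0/0 with your own address. As a minimal sketch (Python 3, standard library only), the helper below builds the three euca-authorize command lines for a given source address or network. The function name authorize_commands and its defaults are my own for illustration, and it only returns the commands as strings rather than running them.

```python
import ipaddress

# Ports opened above: SSH, MongoDB, and the MongoDB admin interface.
PORTS = (22, 27017, 28017)


def authorize_commands(source, group="mongodb", ports=PORTS):
    """Return euca-authorize command lines limited to one source network.

    `source` may be a bare address ("1.2.3.4") or a CIDR ("10.0.0.0/24").
    """
    # strict=False tolerates host bits in a CIDR; a bare IPv4 address becomes /32.
    network = ipaddress.ip_network(source, strict=False)
    return [
        "euca-authorize -P tcp -p {0} -s {1} {2}".format(port, network, group)
        for port in ports
    ]


if __name__ == "__main__":
    for command in authorize_commands("1.2.3.4"):
        print(command)
```

Passing an invalid address raises ValueError before any command is built, which catches typos before they reach the security group.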

Run an instance

We can now spin up an instance running Ubuntu 12.04 LTS x86_64 and install MongoDB on our private cloud:

euca-run-instances -k mongodb -g mongodb -t c1.xlarge emi-87F63CE5

If you are using AWS or your own cloud, you'll need to substitute the EMI ID I've used with the ID of an Ubuntu AMI or your own image. You will also need to use your own keypair.

After a few moments our instance should show as 'running':

$ euca-describe-instances 
RESERVATION r-AB3F4645  985725263417    mongodb
INSTANCE    i-D89D40E2  emi-87F63CE5    1.2.3.4    4.3.2.1    running mongodb 0       c1.xlarge   2013-02-03T22:40:26.743Z    cluster1   eki-222540D6    eri-A5753DBE        monitoring-disabled 1.2.3.4    14.3.2.1            instance-store
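
Rather than checking the state by eye, the INSTANCE line can be picked apart in a few lines of Python. This is a rough sketch: the field positions (instance ID second, public address fourth, state sixth) are assumed from the sample output above and may differ between versions of the tools, and parse_instances is a name I made up for the example.

```python
def parse_instances(output):
    """Map instance ID -> (state, public address) from euca-describe-instances text."""
    instances = {}
    for line in output.splitlines():
        fields = line.split()
        # Only INSTANCE rows carry per-instance details; skip RESERVATION rows.
        if len(fields) < 6 or fields[0] != "INSTANCE":
            continue
        instance_id, public_ip, state = fields[1], fields[3], fields[5]
        instances[instance_id] = (state, public_ip)
    return instances


if __name__ == "__main__":
    # Shortened copy of the sample output shown above.
    sample = (
        "RESERVATION r-AB3F4645 985725263417 mongodb\n"
        "INSTANCE i-D89D40E2 emi-87F63CE5 1.2.3.4 4.3.2.1 running mongodb 0 c1.xlarge\n"
    )
    for instance_id, (state, public_ip) in parse_instances(sample).items():
        print(instance_id, state, public_ip)
```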

Install MongoDB

Let's connect to the instance and install MongoDB. The MongoDB documentation goes into the installation in more detail.

Ubuntu 12.04 LTS has version 2.0.4 of MongoDB in its repositories; 2.2.3 is the current stable version upstream, so we'll use the repository from 10gen to install the latest package:

ssh -i ~/mongodb.key ubuntu@1.2.3.4  # replace 1.2.3.4 with your instance IP!
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
echo "deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen"| sudo tee -a /etc/apt/sources.list.d/10gen.list
sudo apt-get update
sudo apt-get install -y mongodb-10gen

At this point we have an instance with MongoDB installed and running. You should be able to navigate to the MongoDB admin interface in your web browser:

http://1.2.3.4:28017

Installing Tweetstream

Now that we have MongoDB running, we need to import some Twitter data. Twitter has a streaming API that is publicly accessible (as long as you have a Twitter account!) and there are a number of modules for the programming language of your choice.

Tweetstream is a Python module that provides easy access to the streaming API and we can use it in combination with pymongo to store tweets into MongoDB.

Tweetstream isn't packaged for Ubuntu, so I'll use the source:

sudo apt-get install -y python-setuptools
wget -c http://pypi.python.org/packages/source/t/tweetstream/tweetstream-1.1.1.tar.gz
tar -zxvf tweetstream-1.1.1.tar.gz
cd tweetstream-1.1.1 && sudo python setup.py install

Installing PyMongo

PyMongo is the official MongoDB Python driver and is available from the Ubuntu archive.

sudo apt-get install -y python-pymongo

Writing a Python script to save tweets into MongoDB

There are a number of articles detailing how to do this via curl or tweetstream, and it's surprisingly simple to do.

The following script is based on some of those examples. It connects to MongoDB and stores tweets in a collection called 'tweets' within a database called 'twitterstream'. It stores the whole tweet, which includes a lot of metadata; this metadata might be useful later to sort tweets or to index particular fields we are interested in querying. It's important to note that the streaming API does not give us all tweets on Twitter; it's merely a small percentage, as the "Firehose" API that contains all tweets is not public.

import tweetstream
import pymongo

username = "TWITTER_USERNAME"
password = "TWITTER_PASSWORD"
mongohost = "localhost"

connection = pymongo.Connection(mongohost, 27017)
db = connection.twitterstream

with tweetstream.TweetStream(username, password) as stream:
    for tweet in stream:
        try:
            # Save the whole tweet but only show certain fields on screen
            db.tweets.save(tweet)
            print tweet['created_at'], tweet['id'], "Username: ", tweet['user']['screen_name'],':', tweet['text'].encode('utf-8')
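        except Exception:
            # Skip messages that lack these fields or fail to save;
            # unlike a bare except, Ctrl-C still stops the script
            pass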

Save the script as tweet2mongo.py and run it; you should see a stream of tweets printed out, with the whole tweets stored in MongoDB:

python tweet2mongo.py
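
A message on the stream may not carry every field that the print statement reads, which is why the loop wraps it in a try block. One alternative, sketched below in Python 3, is to format the line in a small helper that returns None when a field is missing. The name summarize is hypothetical and the sample dictionaries are invented, not real tweets.

```python
def summarize(tweet):
    """Return a one-line summary of a tweet dict, or None if fields are missing."""
    try:
        return "{0} {1} Username: {2}: {3}".format(
            tweet["created_at"],
            tweet["id"],
            tweet["user"]["screen_name"],
            tweet["text"],
        )
    except (KeyError, TypeError):
        # The message lacks one of the fields, or "user" is not a dictionary.
        return None


if __name__ == "__main__":
    # Invented sample data for illustration only.
    full = {
        "created_at": "Sun Feb 03 22:40:26 +0000 2013",
        "id": 1,
        "user": {"screen_name": "example"},
        "text": "hello",
    }
    print(summarize(full))
    print(summarize({"delete": {"status": {"id": 1}}}))
```

Catching only KeyError and TypeError keeps genuine problems, such as a lost database connection, visible instead of silently swallowing them.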

Use the mongo shell to see if there are entries in the database:

$ mongo
MongoDB shell version: 2.2.3
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
        http://docs.mongodb.org/
Questions? Try the support group
        http://groups.google.com/group/mongodb-user
>
> show dbs
admin   (empty)
local   (empty)
twitterstream   0.203125GB
> use twitterstream
switched to db twitterstream
> show collections
system.indexes
tweets
> db.tweets.find()

The final command should output a portion of the tweets in the JSON document format that MongoDB uses to display query results.
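
The same data can be read back from Python. As a hedged sketch, the function below counts stored tweets per screen name by iterating over find() with a projection on user.screen_name. It accepts any object with a PyMongo-style find(filter, projection) method, so it is written with db.tweets from the script above in mind, but it is shown here with a stand-in class (FakeCollection, invented for this example) so that it runs without a database.

```python
from collections import Counter


def top_users(collection, limit=5):
    """Return the `limit` most frequent screen names as (name, count) pairs."""
    counts = Counter()
    # Ask only for the nested screen_name field to keep returned documents small.
    for doc in collection.find({}, {"user.screen_name": 1}):
        name = (doc.get("user") or {}).get("screen_name")
        if name is not None:
            counts[name] += 1
    return counts.most_common(limit)


class FakeCollection(object):
    """Stand-in for a PyMongo collection; ignores the filter and projection."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, spec=None, fields=None):
        return iter(self.docs)


if __name__ == "__main__":
    # Invented documents shaped like the stored tweets.
    sample = FakeCollection([
        {"user": {"screen_name": "alice"}},
        {"user": {"screen_name": "bob"}},
        {"user": {"screen_name": "alice"}},
        {"delete": {"status": {"id": 1}}},
    ])
    print(top_users(sample))
```

With a live connection this would be called as top_users(db.tweets). Counting in Python is fine for a demo-sized collection; for anything larger, an index on user.screen_name or a server-side aggregation would be the thing to look at.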

That's it, we're now streaming tweets into MongoDB via Python tweetstream!

In part 2, I'll investigate scaling out the MongoDB database by spinning up new Eucalyptus instances and configuring replication and sharding.

TomEllis.io