You’ve deployed and set up a private Cloud platform, but now what? You need an application!
I’ve been experimenting with a number of technologies to generate workloads and give some demos to prospective Eucalyptus customers. A NoSQL database seems like a great use-case to demo as the technology benefits from being designed for scale-out workloads and this happens to be exactly what an IaaS Cloud does best.
There is an abundance of NoSQL implementations (Cassandra, MongoDB, Couchbase, Neo4j…), written in different programming languages and with slightly different takes on which two properties of the CAP theorem they prioritise and how they store and present data.
For this post I’m going to be using MongoDB, which is in the “CP” camp: it favours Consistency and Partition Tolerance over Availability (not every request is guaranteed a response), although MongoDB still provides some great availability options.
MongoDB is supported by 10gen, seems fairly mature and has a large community of users with modules for a ton of different programming languages. Cassandra also interests me and I’ll tackle that in a later post.
We also need a bunch of data and whilst there are large datasets available on the internet, last week I read a post on using the Twitter streaming API with Ruby and storing that data in MongoDB and thought it would be cool to use it, albeit with Python instead of Ruby.
Creating an ssh keypair and application security group
To start, let’s set up a keypair and security group for MongoDB so that we can ensure it is not going to be accessed by anyone else:
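A minimal sketch of this step, assuming euca2ools is installed and your cloud credentials are already sourced. The Python helper below only builds the `euca-add-keypair`, `euca-add-group` and `euca-authorize` command lines so you can review them before running anything. The group name `mongodb`, the description and the `203.0.113.0/24` source range are placeholders rather than values from my setup, and the ports are the MongoDB defaults (27017 for `mongod`, 28017 for the 2.x HTTP interface) plus ssh.

```python
import shlex


def security_commands(name="mongodb", admin_cidr="203.0.113.0/24"):
    """Return euca2ools commands for a keypair and a restricted security group.

    `name` and `admin_cidr` are placeholders; use your own values.
    """
    commands = [
        # Prints the private key on stdout; redirect it to a .pem file.
        ["euca-add-keypair", name],
        # A security group dedicated to the MongoDB instance.
        ["euca-add-group", "-d", "MongoDB demo", name],
    ]
    # 22 = ssh, 27017 = mongod, 28017 = MongoDB 2.x HTTP interface.
    for port in (22, 27017, 28017):
        commands.append(
            ["euca-authorize", "-P", "tcp", "-p", str(port), "-s", admin_cidr, name]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for command in security_commands():
        print(" ".join(shlex.quote(arg) for arg in command))
```

Limiting the source range to your own network is what keeps everyone else out; opening the ports to `0.0.0.0/0` would defeat the point of the group.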
Run an instance
We can now spin up an instance running Ubuntu 12.04 LTS x86_64 and install MongoDB on our private cloud:
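As a hedged illustration, launching the instance with euca2ools could look like the sketch below. The image ID `emi-XXXXXXXX` is a deliberate placeholder, and the keypair name, group name and `m1.large` instance type are assumptions you should swap for your own.

```python
import shlex


def run_instance_command(image_id, keypair="mongodb", group="mongodb",
                         instance_type="m1.large"):
    """Build the euca-run-instances command line (all defaults are assumptions)."""
    return ["euca-run-instances", "-k", keypair, "-g", group,
            "-t", instance_type, image_id]


if __name__ == "__main__":
    # emi-XXXXXXXX is a placeholder, not a real image ID.
    command = run_instance_command("emi-XXXXXXXX")
    print(" ".join(shlex.quote(arg) for arg in command))
```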
If you are using AWS or your own cloud, you’ll need to substitute the EMI ID I’ve used with an Ubuntu AMI or your own image ID. You will also need to use your own keypair.
After a few moments our instance should show as ‘running’:
Install MongoDB
Let’s connect to the instance and install MongoDB:
The MongoDB documentation goes into the installation of MongoDB in more detail.
Ubuntu 12.04 LTS has version 2.0.4 of MongoDB in its repositories. The current stable version upstream is 2.2.3, so we’ll use the repository from 10gen to install the latest package.
At this point we have an instance running that has MongoDB installed and running. You should be able to navigate to the MongoDB admin interface in your web browser:
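If the page does not load, a quick way to tell a MongoDB problem from a security group problem is to check whether the ports accept TCP connections. This is a small sketch using only the standard library. It assumes the default ports (27017 for `mongod` and 28017 for the HTTP interface in MongoDB 2.x) and uses `localhost` as a stand-in for the host, so run it on the instance itself or replace the host with your instance’s address.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # "localhost" is an assumption; use the instance address when checking remotely.
    for port in (27017, 28017):
        print(port, port_is_open("localhost", port))
```

If both ports are open locally but closed from your workstation, the security group rules are the likely culprit.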
Installing Tweetstream
Now that we have MongoDB running, we need to import some Twitter data. Twitter has a streaming API that is publicly accessible (as long as you have a Twitter account!) and there are a number of modules for the programming language of your choice.
Tweetstream is a Python module that provides easy access to the streaming API and we can use it in combination with pymongo to store tweets into MongoDB.
Tweetstream isn’t packaged for Ubuntu, so I’ll use the source:
Installing PyMongo
PyMongo is the official MongoDB Python driver and is available from the Ubuntu archive.
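Before writing the script it can be worth confirming that both modules actually import. The helper below is a small sketch that reports which of the named modules Python cannot find. The module names `tweetstream` and `pymongo` are the import names of the two packages above.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(["tweetstream", "pymongo"])
    print("missing: " + ", ".join(missing) if missing else "all modules found")
```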
Writing a Python script to save tweets into MongoDB
There are a number of articles detailing how to do this via curl or tweetstream, and it’s surprisingly simple.
The following script is based on some of those examples. It connects to MongoDB and stores tweets in a collection called ‘twitterstream’. It stores the whole tweet, which includes a lot of metadata. That metadata might be useful later for sorting tweets or for indexing the particular fields we are interested in querying. It’s important to note that the streaming API does not give us all tweets on Twitter. It provides only a small percentage, as the “Firehose” API that contains all tweets is not public.
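A minimal sketch of such a script, under a few loudly stated assumptions: it uses `tweetstream.SampleStream` with a Twitter username and password, the `pymongo.Connection` class and `Collection.insert` method from PyMongo 2.x (newer PyMongo releases use `MongoClient` and `insert_one` instead), and a database named `twitter`, which is a name picked for illustration. Only the collection name ‘twitterstream’ comes from the description above. The storing logic is kept in its own function so that it works with any iterable of tweet dictionaries.

```python
import sys


def store_tweets(stream, collection, limit=None):
    """Insert every message from `stream` into `collection`; return the count."""
    stored = 0
    for tweet in stream:
        # Store the whole document, metadata included.
        collection.insert(tweet)
        stored += 1
        # Some stream messages (e.g. deletion notices) have no "text" field.
        if "text" in tweet:
            print(tweet["text"])
        if limit is not None and stored >= limit:
            break
    return stored


def main(username, password):
    # Imported here so store_tweets() can be used without either library.
    import pymongo
    import tweetstream

    connection = pymongo.Connection("localhost", 27017)  # PyMongo 2.x API
    # "twitter" is a hypothetical database name; "twitterstream" is the collection.
    collection = connection["twitter"]["twitterstream"]
    with tweetstream.SampleStream(username, password) as stream:
        store_tweets(stream, collection)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print("usage: python store_tweets.py USERNAME PASSWORD")
```

Passing a `limit` is handy for a quick test run, since the sample stream otherwise keeps going until you interrupt it.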
If we run this, we should see a stream of tweets printed out and the whole tweets stored within MongoDB:
Use the mongo shell to see if there are entries in the database:
The final command should output a portion of the tweets in the JSON document format that MongoDB uses to display query results.
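Since each stored document carries the full tweet metadata, the raw output is quite noisy. As a small sketch, the helper below reduces a stored document to the author’s screen name and the tweet text. It assumes the standard tweet layout with a top-level `text` field and a nested `user.screen_name`, and the function name `summarize` is just for illustration.

```python
def summarize(tweet):
    """Return (screen_name, text) for a tweet document, or None for other messages."""
    if "text" not in tweet:
        # Deletion notices and similar stream messages are not tweets.
        return None
    user = tweet.get("user") or {}
    return (user.get("screen_name"), tweet["text"])


if __name__ == "__main__":
    # A hand-written sample document, not real Twitter data.
    sample = {"text": "hello MongoDB", "user": {"screen_name": "example_user"}}
    print(summarize(sample))
```

From Python, the same function can be applied to the documents returned by PyMongo, for example by looping over `collection.find().limit(5)`.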
That’s it, we’re now streaming tweets into MongoDB via Python tweetstream!
In part 2, I’ll investigate scaling out the MongoDB database by spinning up new Eucalyptus instances and configuring replication and sharding.