Yesterday I was pulling down some stock data from Yahoo, with the goal of building out a machine learning training set using Spark and Cassandra. If you haven’t tried Cassandra yet, it’s a database built for high availability and linear scalability. I’ve got an intro talk up here. Spark is another Apache project that kicks Cassandra into overdrive by providing a framework for batch analytics, streaming, and machine learning. Support for graph operations is on the way, which makes me giddy.

One thing at a time though. Before I get too deep into it, I just wanted to pull down and store the raw data on disk. Looking through the Pandas documentation, I came across HDF5. HDF5 lets me treat a local file as a hash and work directly with DataFrames. Very cool.

Let’s install the requirements. I use virtual environments for everything, and I suggest you do too. To learn more about that and other useful Python topics, please read Python for Programmers.

  • cython

Cython is required for building extensions:

pip install cython

  • HDF5

I grabbed the source from the HDF5 site and did:

./configure --prefix=/usr/local
sudo make install

You’ll need the Python library for HDF5 as well:

pip install h5py

  • pytables

This isn’t installed by default with Pandas, so you’ll have to install it separately:

pip install tables

Once you’ve installed the requirements, this is about as simple as it gets:

import pandas
from pandas import HDFStore

# .... some helper functions here, omitted for brevity .....

store = HDFStore("stocks.h5") # filename of the hdf5 file

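for stock in stocks():
    print("Looking up {}".format(stock))
    try:
        # get_history (omitted helper) fetches the data and returns a Pandas DataFrame
        response = get_history(stock)
        store[stock] = response  # treat the HDF5 handle like a hash
    except Exception:
        # report the failed lookup and move on to the next symbol
        print("FAIL {}".format(stock))

store.close()  # flush pending writes and release the file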

You can see that working with HDF5 is about as trivial as working with a hash. What I really like is how easy it is to read and write DataFrames from this file using Pandas. It’s nice to have some local persistence for small datasets that doesn’t require running a server.
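To make the read side concrete, here is a minimal round-trip sketch. The sample prices, the "AAPL" key, and the temporary file are all made up for illustration; they are not my real data or helpers. It only assumes Pandas and PyTables are installed as described above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for one stock's price history.
history = pd.DataFrame(
    {"close": [10.0, 10.5, 10.2]},
    index=pd.to_datetime(["2014-01-02", "2014-01-03", "2014-01-06"]),
)

# Write to a throwaway file so the sketch doesn't touch stocks.h5.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write: the context manager closes the file even if something goes wrong.
with pd.HDFStore(path) as store:
    store["AAPL"] = history

# Read it back later, opening the file read-only.
with pd.HDFStore(path, mode="r") as store:
    keys = store.keys()        # stored keys come back with a leading slash
    restored = store["AAPL"]   # a regular DataFrame again

print(keys)
print(restored)
```

Using the store as a context manager is a small design choice on my part: it avoids leaving the file open if a fetch blows up halfway through a loop.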

Next I’ll be playing with scikit-learn and Spark to try to extract some useful information from this dataset.