Getting Started With Pandas and HDF5
Yesterday I was pulling down some stock data from Yahoo, with the goal of building out a machine learning training set using Spark and Cassandra. If you haven’t tried Cassandra yet, it’s a database built for high availability and linear scalability. I’ve got an intro talk up here. Spark is another Apache project that kicks Cassandra into overdrive by providing a framework for batch analytics, streaming, and machine learning. Support for graph operations is on the way, which makes me giddy.
One thing at a time though. Before I get too deep into it, I just wanted to pull down and store the raw data on disk. Looking through the Pandas documentation, I came across HDF5. HDF5 lets me treat a local file as a hash and work directly with DataFrames. Very cool.
Let’s install the requirements. I use virtual environments for everything, and I suggest you do too. To learn more about that and other useful Python topics, please read Python for Programmers.
- Cython
Cython is required for building extensions:
pip install cython
- HDF5
I grabbed the source from the HDF5 site and did:
./configure --prefix=/usr/local
make
sudo make install
You’ll need the Python library for HDF5 as well:
pip install h5py
- PyTables
PyTables isn’t installed by default with Pandas; you’ll have to install it separately:
pip install tables
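With everything installed, it’s worth a quick sanity check before moving on. This is just a minimal sketch confirming that h5py and PyTables can both see the HDF5 library; your version numbers will differ:

# sanity check: both libraries should import and report versions
import h5py
import tables

print h5py.version.hdf5_version  # version of the HDF5 C library h5py was built against
print tables.__version__         # PyTables version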
Once you’ve installed the requirements, this is about as simple as it gets:
import pandas
from pandas import HDFStore
# .... some helper functions here, omitted for brevity .....
store = HDFStore("stocks.h5") # filename of the hdf5 file
for stock in stocks():
    try:
        print "Looking up {}".format(stock)
        response = get_history(stock)  # i do fetching and use Pandas to turn the result into a DataFrame in here
        store[stock] = response  # treat the hdf5 handle like a hash
    except Exception:
        print "FAIL {}".format(stock)
You can see that working with HDF5 is about as simple as working with a hash. What I really like is how trivial it is to read and write from this file using Pandas. It’s nice to have some local persistence for small datasets that doesn’t require running a server.
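For example, reading a DataFrame back out of the file is just a dictionary-style lookup. A minimal sketch; "AAPL" is a placeholder for whatever ticker symbols you actually stored:

from pandas import HDFStore

store = HDFStore("stocks.h5")
print store.keys()     # every stored key, e.g. ['/AAPL', ...]
df = store["AAPL"]     # assuming AAPL was one of the stocks fetched above
print df.head()
store.close()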
Next I’ll be playing with scikit-learn and Spark to try to extract some useful information from this dataset.
If you found this post helpful, please consider sharing it with your network. I'm also available to help you succeed with your distributed systems! Please reach out if you're interested in working with me, and I'll be happy to schedule a free one-hour consultation.