Cassandra is a BigTable-inspired database created at Facebook. It was open sourced several years ago and is now an Apache project.

In Cassandra, a row can be very wide and is identified by a key. Think of a row as something like a giant array. The data is stored on disk sorted by the key you pick, meaning that if you pick the right sort option and key, you can have some really fast queries. Here we’ll go over a time series.

A time series is a naturally sorted list, since things happen over time. Sensor readings or live chat are good examples. In older versions of Cassandra, you’d use a timestamp as your column name, and the value would be the actual data. This would give you your list of data, sorted in order. The benefit is that your queries would likely be looking at slices of time, and with the data stored sequentially on disk you’ll get very fast reads, since there only needs to be one seek (if the data isn’t already in memory).
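
To make the "slice of time" idea concrete, here is a minimal Python sketch. It is only a toy model of a single wide row, not how Cassandra is implemented: the `timestamps` and `readings` lists and the `time_slice` helper are made up for illustration. Because the column names are kept sorted, a time range maps to one contiguous run of columns.

```python
import bisect

# A toy model of one wide row: column names are timestamps (seconds),
# kept in sorted order, with the readings stored alongside them.
# These values are invented for the example.
timestamps = [100, 105, 110, 115, 120]
readings = ["r0", "r1", "r2", "r3", "r4"]

def time_slice(start, end):
    """Return (timestamp, reading) pairs with start <= timestamp < end."""
    # Binary search finds both ends of the range; everything in
    # between is contiguous because the columns are sorted.
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_left(timestamps, end)
    return list(zip(timestamps[lo:hi], readings[lo:hi]))

print(time_slice(105, 120))
```

The point of the sketch is that finding the start of the range is the only "seek"; the rest of the read is sequential.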

To make it insanely unlikely that two timestamps would ever conflict, the column name would actually be a uuid1 (a version 1 UUID), which has an embedded timestamp. DataStax gave a good example of a table definition from back in Cassandra 0.8:

[default@demo] CREATE COLUMN FAMILY blog_entry
WITH comparator = TimeUUIDType
AND key_validation_class=UTF8Type
AND default_validation_class = UTF8Type;

As Cassandra has matured, it’s evolved really nice schema definition options giving you the choice of some additional structure if you want it. CQL is a SQL-ish language for defining tables, where you specify the column names beforehand. This makes using our time series data a little challenging since you can’t possibly know all the timestamps you’re going to be using. The upcoming version of the language is CQL3. Here’s a great DataStax blog post on some of the CQL3 features.

In particular, the Cassandra team has introduced two important items. One is the timeuuid type, and the other is the ability to specify compound primary keys with compact storage. Together, these cause the data to be stored sequentially by the timeuuid column, exactly like a really wide row. Starting cqlsh with the -3 option gives us a CQL3 console. Here we define our schema:

haddad-pro:apache-cassandra-1.1.5  jhaddad$ bin/cqlsh -3
Connected to Test Cluster at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.5 | CQL spec 3.0.0
Use HELP for help.

cqlsh> CREATE KEYSPACE sensor WITH strategy_class = 'SimpleStrategy' 
    AND strategy_options:replication_factor = 1;
cqlsh> use sensor;
cqlsh:sensor> create table sensor_entries ( 
    sensorid uuid, 
    time_taken timeuuid, reading text,  
    primary key(sensorid, time_taken)) with compact storage;
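
One way to picture what this compound primary key gives you is the rough Python sketch below. It is an in-memory analogy only, under the assumption that the first key component (sensorid) picks the row and the second (time_taken) orders the columns inside it. The `partitions` dict and `insert` helper are hypothetical names, and sorting on the UUID’s embedded time is a simplification of how Cassandra compares timeuuid values.

```python
import uuid
from collections import defaultdict

# Toy model: sensorid -> list of (time_taken, reading) pairs,
# kept ordered by time_taken within each sensor's row.
partitions = defaultdict(list)

def insert(sensorid, time_taken, reading):
    """Add a reading to the sensor's row, keeping it in time order."""
    row = partitions[sensorid]
    row.append((time_taken, reading))
    # Order by the timestamp embedded in the uuid1 (a simplification).
    row.sort(key=lambda col: col[0].time)

sensor = uuid.uuid4()
for i in range(3):
    insert(sensor, uuid.uuid1(), "random sample {}".format(i))

# All readings for one sensor come back together, in time order.
for time_taken, reading in partitions[sensor]:
    print(time_taken, reading)
```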

Here’s a little Python script to put in some example data:

import uuid
import cql

# connect to a local node and use the sensor keyspace with CQL3
conn = cql.connect('localhost')
conn.set_cql_version('3.0.0')
conn.set_initial_keyspace('sensor')

# put in 5 sensor reads right now

# one random sensor id, so every reading lands in the same wide row
key = str(uuid.uuid4())
query = """INSERT INTO sensor_entries
            (sensorid, time_taken, reading)
            VALUES (:sensorid, :time_taken, :reading)"""
cur = conn.cursor()
for i in range(5):
    # uuid1 embeds the current time, so the readings sort chronologically
    ts = str(uuid.uuid1())
    values = {'sensorid': key,
              'time_taken': ts,
              'reading': "random sample {}".format(i)}
    cur.execute(query, values)

And the result (edited to fit on one line):

cqlsh:sensor> select * from sensor_entries;
 sensorid                | time_taken               | reading
-------------------------+--------------------------+-----------------
 060e1156-1d7d-46b7-87f3 | 2012-10-02 07:57:43-0700 | random sample 0
 060e1156-1d7d-46b7-87f3 | 2012-10-02 07:57:43-0700 | random sample 1
 060e1156-1d7d-46b7-87f3 | 2012-10-02 07:57:43-0700 | random sample 2
 060e1156-1d7d-46b7-87f3 | 2012-10-02 07:57:43-0700 | random sample 3
 060e1156-1d7d-46b7-87f3 | 2012-10-02 07:57:43-0700 | random sample 4

You can see that even though I was generating a uuid1, Cassandra shows us a timestamp.
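
If you want to see that embedded timestamp for yourself, here is a small Python sketch that pulls it back out of a uuid1. The `uuid1_to_datetime` helper and the `UUID_EPOCH_OFFSET` constant are names I made up for this example. The sketch relies on the version 1 UUID layout, where the time field counts 100-nanosecond intervals since 1582-10-15.

```python
import datetime
import uuid

# Number of 100-nanosecond intervals between the UUID epoch
# (1582-10-15) and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def uuid1_to_datetime(u):
    """Return the UTC datetime embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    # Shift to the Unix epoch, then convert 100ns ticks to microseconds.
    microseconds = (u.time - UUID_EPOCH_OFFSET) // 10
    return datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=microseconds)

ts = uuid.uuid1()
print(ts, "->", uuid1_to_datetime(ts))
```

A uuid4 has no time component, which is why the helper refuses anything that isn’t version 1.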

Huge thanks to everyone who’s worked on Cassandra to get it to this point. It’s an absolutely amazing piece of software.