CASSANDRA 3.2 OVERVIEW

The 3.0 release of Apache Cassandra marked an important milestone. One of the biggest updates was CASSANDRA-8099, the JIRA ticket to modernize the storage engine. It was also the first release in the new Tick-Tock cycle, which lands a new release of Cassandra every month. Even .x numbers (such as 3.2) are feature releases, and odd .x numbers (such as 3.1) are bug fix releases. Cassandra 3.2, released about a week ago, is the first feature release following 3.0.

FRANKDUX RPC PREVIEW #1

In my previous post, I briefly mentioned FrankDux, a new project I'm working on. FrankDux is a framework for quickly building RPC microservices in Python. This is a preview of its functionality and subject to change. A goal of FrankDux is to provide a means of building stateless microservices that's as easy as working with Flask or Bottle, while also offering the conveniences of Cap'n Proto, of which I'm a huge fan.
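
To give a taste of the ergonomics I'm aiming for, here's a sketch of what a service could look like. To be clear, this is purely illustrative: the project is early, and every name below (the FrankDux class, the method decorator, the bind address) is hypothetical and subject to change.

    # Purely hypothetical sketch: none of these names are final FrankDux API.
    from frankdux import FrankDux  # hypothetical import

    app = FrankDux("calculator")

    @app.method()  # register an RPC method, Flask-style
    def add(x, y):
        return x + y

    if __name__ == "__main__":
        app.run("tcp://127.0.0.1:5555")  # bind address is a placeholder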

KILLRANSWERS STATUS UPDATE, AND INTRODUCING FRANKDUX

In a previous post, I introduced a new project, KillrAnswers. I had originally planned on writing KillrAnswers in Rust, leveraging the Cap'n Proto library for RPC and object serialization. I've had some time to think about this, and decided to switch back to Python. I also started my own RPC project, FrankDux, built on ZeroMQ for transport and MessagePack for object serialization in place of Cap'n Proto. Let's get the obvious question out of the way: why not use Rust?
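
To make the moving parts concrete, here's a minimal sketch of the pattern FrankDux builds on: a ZeroMQ REQ/REP socket pair for transport, with MessagePack doing the serialization. This isn't FrankDux code, just the raw libraries (pyzmq and msgpack-python); the single-method "dispatch" is a stand-in.

    import msgpack
    import zmq

    ctx = zmq.Context()

    # Server side: unpack a request, compute, send back a packed reply.
    server = ctx.socket(zmq.REP)
    server.bind("tcp://127.0.0.1:5555")

    def handle_one():
        request = msgpack.unpackb(server.recv(), raw=False)
        reply = {"result": request["x"] + request["y"]}  # stand-in dispatch
        server.send(msgpack.packb(reply))

    # Client side: pack the call, send it, unpack the reply.
    client = ctx.socket(zmq.REQ)
    client.connect("tcp://127.0.0.1:5555")
    client.send(msgpack.packb({"method": "add", "x": 1, "y": 2}))
    handle_one()
    print(msgpack.unpackb(client.recv(), raw=False))  # {'result': 3}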

USELESSDB

I have built a completely useless database. I had a couple of flights across the country this week, so I decided to test some ideas in Rust. If you're not yet familiar with Rust, it's a systems language focused on performance, safety, and concurrency. I've really enjoyed using it so far, and it feels more natural every day. I've been thinking about database internals a lot recently, and decided to see what it would be like to implement a typed database in Rust using byte buffers.
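
The experiments themselves are in Rust, but the core trick translates to any language: pick a fixed, typed layout for a row and pack it into a byte buffer. Here's the same idea sketched with Python's struct module (the field layout is invented for illustration).

    import struct

    # One row: an i64 id, an f64 score, and a name truncated/padded to 16 bytes.
    ROW = struct.Struct("<q d 16s")

    def encode(row_id, score, name):
        return ROW.pack(row_id, score, name.encode("utf-8"))

    def decode(buf):
        row_id, score, raw_name = ROW.unpack(buf)
        return row_id, score, raw_name.rstrip(b"\x00").decode("utf-8")

    buf = encode(1, 99.5, "jon")   # 32 bytes, fixed layout
    print(decode(buf))             # (1, 99.5, 'jon')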

RAMP MADE EASY - PART 2

In my previous post I introduced RAMP, a family of algorithms for preserving atomicity on reads across distributed database partitions. The first algorithm discussed was RAMP-Fast, which is designed to use as few network round trips as possible at the cost of storing a significant amount of metadata. I suggest reading the first post if you aren't familiar with RAMP, as I'll be referring to some of its concepts throughout this post.
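
As a quick refresher on that tradeoff (the field names below are mine, not the paper's notation): in RAMP-Fast, every write is stamped with a transaction-unique timestamp and carries the set of sibling keys written by the same transaction. That sibling set is exactly the metadata overhead mentioned above.

    from collections import namedtuple

    # One stored version of an item: its value, the transaction timestamp,
    # and the other keys written by the same transaction.
    Version = namedtuple("Version", ["value", "ts", "siblings"])

    def prepare(ts, writes):
        """Attach RAMP-Fast metadata to every write in one transaction."""
        return {
            key: Version(value, ts, set(writes) - {key})
            for key, value in writes.items()
        }

    versions = prepare(ts=17, writes={"x": 1, "y": 2})
    # versions["x"].siblings == {"y"}: any reader holding this version of x
    # can prove that a y written at timestamp 17 must also exist.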

RAMP MADE EASY

In this post I'll introduce RAMP, a family of algorithms for performing atomic reads across partitions when working with distributed databases. The original paper, Scalable Atomic Visibility with RAMP Transactions, was written by Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica, of UC Berkeley and the University of Sydney. Peter has graciously reviewed this blog post to ensure its accuracy. As part of the overview, I'll explain the RAMP-Fast algorithm, the first of the three algorithms covered in the paper.
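
To preview the shape of RAMP-Fast, here's a simplified sketch of its read path, using versions shaped like the Version records sketched earlier (value, timestamp, sibling keys). store.latest() and store.at() are assumed helpers standing in for the parallel per-partition reads the real algorithm issues.

    def ramp_fast_read(store, keys):
        # Round one: grab the latest committed version of every key.
        latest = {k: store.latest(k) for k in keys}

        # Use the metadata to find the highest timestamp each key must have:
        # if x's metadata says y was written with it at ts 17, y must be >= 17.
        required = {k: latest[k].ts for k in keys}
        for version in latest.values():
            for sibling in version.siblings:
                if sibling in required:
                    required[sibling] = max(required[sibling], version.ts)

        # Round two: re-fetch only the keys that round one read too early.
        return {
            k: (store.at(k, required[k]) if required[k] > latest[k].ts
                else latest[k])
            for k in keys
        }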

INTRODUCING KILLRANSWERS

The last few months have been a non-stop whirlwind of traveling and speaking. I've been very fortunate to have spoken at Strata New York, given a couple of sessions at the Cassandra Summit, and even had a few minutes on stage for the Cassandra Summit keynote (I'm at minute 22 with Luke Tillman). When I have time, I end up hacking on random projects. For example, a couple of months ago I was working on a recommendation engine for KillrVideo.

MIGRATING FROM MYSQL TO CASSANDRA USING SPARK

MySQL is a popular choice for new projects. It's a flexible database that's easy to set up and start querying. There are loads of documentation, examples, and frameworks it works with, such as WordPress, Pandas, Ruby on Rails, and Django. From that description it sounds like a pretty fantastic database, and at small scale it can be great. The problem arises when you need to scale past a single server or have high availability needs.
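
The migration itself boils down to two DataFrame operations. Here's a minimal sketch in Spark 1.x style, assuming the spark-cassandra-connector and a MySQL JDBC driver are on the classpath; the hostnames, keyspace, and table names are placeholders, not from the post.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="mysql-to-cassandra")
    sqlContext = SQLContext(sc)

    # Read the MySQL table into a DataFrame over JDBC.
    users = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://mysql-host:3306/appdb",
        dbtable="users",
        driver="com.mysql.jdbc.Driver",
    ).load()

    # Write the same rows straight into a Cassandra table.
    users.write.format("org.apache.spark.sql.cassandra").options(
        keyspace="app", table="users",
    ).save(mode="append")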

CASSANDRA + PYSPARK DATAFRAMES REVISITED

A little while back I wrote a post on working with DataFrames from PySpark, using Cassandra as a data source. DataFrames are, in my opinion, a fantastic, flexible API that makes Spark roughly 14 orders of magnitude nicer to work with than RDDs. When I wrote the original blog post, the only way to work with DataFrames from PySpark was to get an RDD and call toDF(). Sounds freaking amazing - what's the problem?
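
For contrast with the old RDD-then-toDF() dance, here's a sketch of loading a Cassandra table directly as a DataFrame through the spark-cassandra-connector (Spark 1.x style; the keyspace, table, and column names are placeholders).

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="cassandra-dataframes")
    sqlContext = SQLContext(sc)

    # No RDD detour: read the Cassandra table straight into a DataFrame.
    videos = sqlContext.read.format("org.apache.spark.sql.cassandra").options(
        keyspace="killrvideo", table="videos",
    ).load()

    videos.filter(videos.views > 1000).show()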

JOINING DATAFRAMES WITH PANDAS

In this post I’ll walk through the process of reading in various plain text database files using Pandas, and then joining together the different DataFrames. All my work was done through an IPython notebook. I decided to mess around with the labor statistics database that’s up on Amazon. My end goal was to save all the relevant information into Cassandra for future analysis with PySpark. If the files were bigger, I’d do all the initial loading with PySpark, but they’re pretty small and Pandas has a lot of functionality that’s still missing on the Spark side.