| RustyRazorblade Consulting

RAMP MADE EASY - PART 2

Introduction In my previous post I introduced RAMP, a family of algorithms for managing atomicity on reads across distributed database partitions. The first algorithm discussed was RAMP-Fast, which is designed to perform with as few network round trips as possible at the cost of storing significant amounts of metadata. I suggest reading my first post if you aren’t familiar with RAMP as...

Tue, Nov 24, 2015 distributed databases algorithms

RAMP MADE EASY

Introduction In this post I’ll introduce RAMP, a family of algorithms for performing atomic reads across partitions when working with distributed databases. The original paper, Scalable Atomic Visibility with RAMP Transactions, was written by Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein and Ion Stoica, of UC Berkeley and University of Sydney. Peter has graciously reviewed...

Fri, Nov 20, 2015 db cassandra algorithms distributed databases

INTRODUCING KILLRANSWERS

The last few months have been a non stop whirlwind of traveling and speaking. I’ve been very fortunate to have spoken at Strata New York, give a couple sessions at the Cassandra Summit, and even had a few minutes on stage for the Cassandra Summit keynote (I’m at minute 22 with Luke Tillman). When I have time, I end up hacking on random projects. For example, a couple months ago I was...

Mon, Oct 19, 2015 spark cassandra python pyspark rust capnproto microservices

MIGRATING FROM MYSQL TO CASSANDRA USING SPARK

MySQL is a popular choice for new projects. It’s a flexible database that’s easy to set up and start querying. There’s loads of documentation, examples and frameworks it works with, such as Wordpress, Pandas, Ruby on Rails, and Django. From the above paragraph it reads like a pretty fantastic database, and at small scale it can be great. The problem arises when you need to scale...

Mon, Aug 10, 2015 mysql cassandra python spark

CASSANDRA + PYSPARK DATAFRAMES REVISTED

A little while back I wrote a post on working with DataFrames from PySpark, using Cassandra as a data source. DataFrames are, in my opinion, a fantastic, flexible api that makes Spark roughly 14 orders of magnitude nicer to work with as opposed to RDDs. When I wrote the original blog post, the only way to work with DataFrames from PySpark was to get an RDD and call toDF(). Sound freaking amazing -...

Thu, Jul 30, 2015 spark cassandra python pyspark

JOINING DATAFRAMES WITH PANDAS

In this post I’ll walk through the process of reading in various plain text database files using Pandas, and then joining together the different DataFrames. All my work was done through an IPython notebook. I decided to mess around with the labor statistics database that’s up on Amazon. My end goal was to save all the relevant information into Cassandra for future analysis with...

Mon, Jun 8, 2015 python dataframes pandas cassandra

YOU'RE ALREADY EVENTUALLY CONSISTENT

New people to Apache Cassandra are often concerned about the phrase “eventual consistency.” It’s one of those things that seems so foreign, especially if you’re coming from a relational database. When I am with with my RDBMS I get wrapped in the sweet cocoon of ACID transactions! Is the entire system really safe though? Are we perfectly ACID throughout our entire...

Tue, Jun 2, 2015 cassandra rdbms

SPARK STREAMING WITH PYTHON AND KAFKA

Last week I wrote about using PySpark with Cassandra, showing how we can take tables out of Cassandra and easily apply arbitrary filters using DataFrames. This is great if you want to do exploratory work or operate on large datasets. What if you’re interested in ingesting lots of data and getting near real time feedback into your application? Enter Spark Streaming. Spark streaming is the...

Thu, May 7, 2015 spark pyspark cassandra python kafka

ON THE BLEEDING EDGE - PYSPARK, DATAFRAMES, AND CASSANDRA

A few months ago I wrote a post on Getting Started with Cassandra and Spark. I’ve worked with Pandas for some small personal projects and found it very useful. The key feature is the data frame, which comes from R. Data Frames are new in Spark 1.3 and was covered in this blog post. Till now I’ve had to write Scala in order to use Spark. This has resulted in me spending a lot of time...

Fri, May 1, 2015 spark pyspark cassandra data science

HANGOUT ANNOUNCEMENT - PYTHON PERFORMANCE PROFILING

Just wanted to let everyone know I’m going to be doing a Google Hangout on Air on Thursday, 2pm PT / 5PM ET on Python Performance Profiling. I’m going to be covering several tools and exposing a variety of ways of understanding your applications. You can RSVP on the event page. I’ll be answering Q&A along the way so be sure to have your questions ready and upvote the ones you...

Tue, Jan 20, 2015 hangouts python