Introduction: Building a web app that relies on database calls with CPython (the standard Python distribution) is pretty easy, but it can suffer from performance problems. Python itself isn’t particularly fast, and in 2.x, its concurrency story is especially weak. For starters, there’s the dreaded GIL (global interpreter lock). The GIL prevents us from taking advantage of multi-core systems, so even if we try to use threads we’re missing out on their main performance benefit, which is parallel computation.
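To make the GIL point concrete, here is a minimal sketch (not from the original post) that runs the same CPU-bound job sequentially and on a thread pool. The function names and workload size are made up for illustration, and it is written for Python 3 rather than the 2.x the post discusses. On a standard GIL-enabled CPython build, the threaded run should take roughly as long as the sequential one, since only one thread executes Python bytecode at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_sequential(limits):
    """Run every job one after another on the calling thread."""
    return [count_primes(limit) for limit in limits]


def run_threaded(limits, workers=4):
    """Run the same jobs on a thread pool; the GIL still serializes them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    limits = [20000] * 4  # hypothetical workload size, tune for your machine
    seq, seq_t = timed(run_sequential, limits)
    thr, thr_t = timed(run_threaded, limits)
    print(f"sequential: {seq_t:.2f}s  threaded: {thr_t:.2f}s  same results: {seq == thr}")
```

Threads still help when the work is I/O-bound, such as waiting on database calls, because the GIL is released while a thread blocks on I/O. The limitation shown here applies to computation done in Python itself.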
- In my previous post, I briefly mentioned FrankDux, a new project I’m working on. FrankDux is a framework for quickly building RPC microservices in Python. This is a preview of its functionality and is subject to change. A goal of FrankDux is to provide a means of building stateless microservices that’s as easy as working with Flask or Bottle, but with the conveniences of Cap’n Proto, of which I’m a huge fan.
- In a previous post, I introduced a new project, KillrAnswers. I had originally planned on writing KillrAnswers using Rust, leveraging the Cap’n Proto library for RPC and object serialization. I’ve had some time to think about this, and decided to switch back to Python. I also started my own RPC project, FrankDux, based on ZeroMQ and MessagePack for object serialization instead of Cap’n Proto. Let’s get the obvious question out of the way: why not use Rust?
- The last few months have been a nonstop whirlwind of traveling and speaking. I’ve been very fortunate to have spoken at Strata New York, given a couple of sessions at the Cassandra Summit, and even had a few minutes on stage for the Cassandra Summit keynote (I’m at minute 22 with Luke Tillman). When I have time, I end up hacking on random projects. For example, a couple of months ago I was working on a recommendation engine for KillrVideo.
- MySQL is a popular choice for new projects. It’s a flexible database that’s easy to set up and start querying. There’s loads of documentation and examples, and it works with plenty of tools and frameworks, such as WordPress, Pandas, Ruby on Rails, and Django. From the above paragraph it reads like a pretty fantastic database, and at small scale it can be great. The problem arises when you need to scale past a single server or have high availability needs.
- A little while back I wrote a post on working with DataFrames from PySpark, using Cassandra as a data source. DataFrames are, in my opinion, a fantastic, flexible API that makes Spark roughly 14 orders of magnitude nicer to work with than RDDs. When I wrote the original blog post, the only way to work with DataFrames from PySpark was to get an RDD and call toDF().
- In this post I’ll walk through the process of reading in various plain text database files using Pandas, and then joining together the different DataFrames. All my work was done through an IPython notebook. I decided to mess around with the labor statistics database that’s up on Amazon. My end goal was to save all the relevant information into Cassandra for future analysis with PySpark. If the files were bigger, I’d do all the initial loading with PySpark, but they’re pretty small and Pandas has a lot of functionality that’s still missing on the Spark side.
- Last week I wrote about using PySpark with Cassandra, showing how we can take tables out of Cassandra and easily apply arbitrary filters using DataFrames. This is great if you want to do exploratory work or operate on large datasets. What if you’re interested in ingesting lots of data and getting near-real-time feedback into your application? Enter Spark Streaming. Spark Streaming ingests and operates on data in microbatches, which are generated repeatedly at a fixed time interval.
- Just wanted to let everyone know I’m going to be doing a Google Hangout on Air on Thursday, 2 PM PT / 5 PM ET, on Python Performance Profiling. I’m going to be covering several tools and showing a variety of ways of understanding your applications. You can RSVP on the event page. I’ll be answering questions along the way, so be sure to have yours ready and upvote the ones you find useful!
- Yesterday I was pulling down some stock data from Yahoo, with the goal of building out a machine learning training set using Spark and Cassandra. If you haven’t tried Cassandra yet, it’s a database built for high availability and linear scalability. I’ve got an intro talk up here. Spark is another Apache project that kicks Cassandra into overdrive by providing a framework for batch analytics, streaming, and machine learning. Support for graph operations is on the way, which makes me giddy.