ON THE BLEEDING EDGE - PYSPARK, DATAFRAMES, AND CASSANDRA

A few months ago I wrote a post on Getting Started with Cassandra and Spark. I’ve worked with Pandas for some small personal projects and found it very useful. Its key feature is the data frame, a concept that comes from R. Data Frames are new in Spark 1.3 and were covered in this blog post. Until now I’ve had to write Scala in order to use Spark. Since I don’t know Scala very well, I’ve spent a lot of time hunting for libraries whose Python equivalents I could recall in less than a second (JSON being an example).

HANGOUT ANNOUNCEMENT - PYTHON PERFORMANCE PROFILING

Just wanted to let everyone know I’m going to be doing a Google Hangout on Air on Thursday, 2pm PT / 5pm ET, on Python Performance Profiling. I’m going to be covering several tools and showing a variety of ways to understand your applications. You can RSVP on the event page. I’ll be answering questions along the way, so be sure to have yours ready and upvote the ones you find useful!

INTRODUCTION TO SPARK & CASSANDRA

I’ve been messing with Apache Spark quite a bit lately. If you aren’t familiar, Spark is a general-purpose engine for large-scale data processing. Initially it comes across as simply a replacement for Hadoop, but that would be selling it short. Big time. In addition to bulk processing (goodbye MapReduce!), Spark includes a SQL engine; stream processing via Kafka, Flume, and ZeroMQ; machine learning; and graph processing. Sounds awesome, right? That’s because it is, babaganoush.

DIAGNOSING PROBLEMS IN PRODUCTION WEBINAR POSTED

The webinar from Nov 18, Diagnosing Problems in Production, has been posted to YouTube. I’ve embedded it at the bottom of this post. The webinar is an extended version of the talk I gave at the Cassandra Summit with Blake Eggleston, which I recapped on my blog as well. I had almost double the time to talk in the webinar, so I was able to go into more detail

GETTING STARTED WITH PANDAS AND HDF5

Yesterday I was pulling down some stock data from Yahoo, with the goal of building out a machine learning training set using Spark and Cassandra. If you haven’t tried Cassandra yet, it’s a database built for high availability and linear scalability. I’ve got an intro talk up here. Spark is another Apache project that kicks Cassandra into overdrive by providing a framework for batch analytics, streaming, and machine learning. On the way is support for graph operations, which makes me giddy.

NEW HOUSE, NEW DESK

When I moved out of my last place I decided it was time for a grown-up desk. I left behind a beat-down Ikea desk that I had used for close to a decade; I think it has more than served its purpose. Since I had recently gotten into woodworking, I figured this would be the perfect opportunity to build something awesome. I wish I had thought to take a picture of the old desk in all its (lack of) glory, but alas, you’ll just have to imagine a desk that’s on the verge of collapse.

21 WAYS TO MINIMIZE EMPLOYEE RETENTION

It’s important to be able to maximize turnover and confusion while minimizing employee retention. This is by no means an exhaustive list, but it will, without a doubt, be successful, unlike your business. Eliminate all privacy. Employees should feel like they’re being watched at all times. Ideally, use an open floor plan, which can maximize distractions. If an open room isn’t available, cram as many people into small rooms as humanly possible.

CASSANDRA SUMMIT RECAP: DIAGNOSING PROBLEMS IN PRODUCTION

Last week at the Cassandra Summit I gave a talk with Blake Eggleston on diagnosing performance problems in production. We spoke to about 300 people for about 25 minutes, followed by a healthy Q&A session. I’ve expanded on our presentation to include a few extra tools, screenshots, and more clarity on our talking points. There’s finally a lot of material available for someone looking to get started with Cassandra. There are several introductory videos on YouTube by both me and Patrick McFadin, as well as videos on time series data modeling.

SAY HELLO TO MEATBOT

What is Meatbot? Meatbot is a HipChat bot for managing status updates for our growing team of Evangelists at DataStax. It’s built in Python 2.7, utilizing the Will library. The status updates are stored in Cassandra using cqlengine. Yep, it’s up on GitHub. There are a few simple commands. First, you tell Meatbot about each project you work on. Once you’ve got your projects, you can list them with lsproject or delete them with rmproject.

PYTHON FOR PROGRAMMERS

When I started learning Python, there were a few things I wish I had known about. It took a while to learn them all. This is my attempt to compile the highlights into a single post. This post is targeted towards experienced programmers just getting started with Python who want to skip the first few months of researching the Python equivalents of tools they are already used to. The sections on package management and standard tools will be helpful to beginners as well.