Open Source

Web interest in Apache Storm, Kafka, Spark in the Python community

Thursday, November 27th, 2014

Apache Storm, Kafka, and Spark are gaining a lot of momentum in the data analysis and processing communities. I was curious whether the interest in using these technologies with Python, in particular, is growing. Based on these Google Trends reports, it seems like it is.

Read the rest of this entry »

Clojonic: Pythonic Clojure

Sunday, November 2nd, 2014

In June 2012, I promised myself that I’d learn Clojure “as a mind expander”. As a long-time Python programmer who uses the language full-time in my work at Parse.ly, I wanted to explore. I wrote then:

I don’t know whether Clojure programs will be better or worse than equivalent Python programs. But I know they will be different.

It took me a while, but in January of this year, I started teaching myself the language.

Rich Hickey, and the “Cult of Personality”

My approach was to first learn the underpinnings of the language from books and online videos. If you embark on this for Clojure, you will inevitably run into the copious publicly-available material from the language’s creator, Rich Hickey.

In stark contrast to Guido van Rossum in the Python community, Rich Hickey is undeniably not just the Clojure language’s creator, but also a kind of spokesperson for a functional programming renaissance. Guido van Rossum generally lies low, lets the Python language and community speak for themselves, and tries to avoid controversy. To him, Python is just a popular tool he happened to create, and it doesn’t represent any major paradigm shift in programming. It’s a positive evolutionary improvement supported by a great open source ecosystem and community. To Hickey, however, “traditional” programming languages — especially popular ones with an object-oriented focus, such as Java and C++ — are just plain wrong. He proposes Clojure as an antidote of sorts.

You can get the gist of this from his motivating videos, such as Hammock-Driven Development, Are We There Yet?, and Simple Made Easy. For a thorough overview of Clojure as a language, you can also get a walkthrough by Hickey, given to a room full of Java developers, in Clojure for Java Programmers Part I and Part II.

Here is a summary of the viewpoint. Most languages are missing some important attributes that can help us tackle the most complex issues in programming projects:

Read the rest of this entry »

Python annotations and type-checking

Saturday, August 16th, 2014

Back in 2006, the Python core team wrote PEP 3107, which introduced function annotations for Python 3.x.

Nearly 4 years ago, I wrote this response to the PEP, but I published it to a discussion site (Clusterify) that has since become defunct. Recently, I saw that interest in using function annotations for type-checking had been revived by GvR, and I thought I might resurrect this discussion.

Background

There is a huge flaw in the design of Python annotations, IMO: a lack of composability.

The problem only arises when you consider that at some point in the future, there may be more than one use case for function annotations (as the PEP suggests). For example, let’s say that in my code, I use function annotations both for documentation and for optional run-time type checking. If I have a framework that expects all the annotations on my function definition to be docstrings, and another framework that expects all the annotations to be classes, how do I annotate my function with both documentation and type checks?
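
To make the collision concrete, here is a small sketch. Both frameworks are hypothetical, and the annotations are purely illustrative:

    # One hypothetical framework wants annotations to be documentation strings:
    def process(record: "a dict of raw event data") -> "a cleaned-up dict":
        ...

    # Another hypothetical framework wants annotations to be classes/types:
    def process(record: dict) -> dict:
        ...

    # There is no standard way to attach *both* meanings to the same
    # parameter, so the two frameworks cannot coexist on one function.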

This amounts to a lack of a standard for layering function annotations. Is this really a problem?

It’s true that some standard for this could organically form in the community. For example, one could imagine tuples being used for this: if an annotation expression is a tuple, then every framework should iterate through the items of the tuple until it finds an item of the matching type. However, this won’t always work: what if two frameworks are both expecting strings, or both expecting classes, with different semantics?
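
Here is a sketch of that imagined tuple convention. To be clear, this is purely hypothetical, not an actual standard:

    # Each framework scans the tuple for the item type it understands: a
    # documentation tool picks out the str, a type-checker picks out the class.
    def process(record: ("a dict of raw event data", dict)) -> ("a cleaned-up dict", dict):
        ...

    # But if two frameworks both look for a str (or both look for a class),
    # each with different semantics, the convention breaks down.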

Read the rest of this entry »

5 years ago, I was bored

Friday, July 4th, 2014

I wrote this to a friend five years ago, a few weeks after I had quit my job to embark on the crazy ride that has been Parse.ly’s founding story.

You said to me, “I am glad that you left because you sounded unhappy there.”

But you know, I wasn’t exactly unhappy.

I was just bored.

I’m eager to work on my own stuff. I had a good work environment and I learned a lot. I was making money, had flexibility about hours and work from home, and was respected on my team.

But I had a couple of realizations. First, I didn’t see a future for myself in financial firms. I just don’t like their core business enough; in fact, I think their core business is somewhat superfluous and that financial firms should be way, way smaller than they are. They should make less money, have less power, etc.

Second, my specific project had this split personality. On the one hand, it wanted to be this cutting edge framework to really empower application developers throughout the company. On the other, it was a lost project — lots of code, lots of ideas, but no solid product and no real customer.

Read the rest of this entry »

streamparse: Python + Apache Storm for real-time stream processing

Sunday, May 4th, 2014

Parse.ly released streamparse today, which lets you run Python code against real-time streams of data by integrating with Apache Storm.

We released it for our talk, “Real-time streams & logs with Apache Kafka and Storm” at PyData Silicon Valley 2014.

We made an initial release (0.0.5). It includes a command-line tool, sparse, which can set up and run local Storm-friendly Python projects.
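
To give a flavor of the Python side, here is a minimal word-counting Bolt in the style of the streamparse examples. Treat it as a sketch; the exact module layout and class names may differ between versions:

    from collections import Counter

    from streamparse.bolt import Bolt


    class WordCounter(Bolt):
        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            # Each incoming tuple carries a single word; count it and emit
            # the running total downstream.
            word = tup.values[0]
            self.counts[word] += 1
            self.emit([word, self.counts[word]])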

Read the rest of this entry »

The Log: a building block for large-scale data systems

Monday, December 16th, 2013

A software engineer at LinkedIn has written a monster of a blog post about “The Log”, a building block for large-scale data systems. The concepts in this post are near and dear to my heart due to my work on precisely these kinds of problems at Parse.ly.

What is “a log”?

The log is similar to the list of all credits and debits and bank processes; a table is all the current account balances. If you have a log of changes, you can apply these changes in order to create the table capturing the current state. This table will record the latest state for each key (as of a particular log time). There is a sense in which the log is the more fundamental data structure: in addition to creating the original table you can also transform it to create all kinds of derived tables.
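
Here is a toy sketch of that duality. The account names and amounts are made up; the point is just to show a log being replayed into a table:

    # An append-only log of changes (credits and debits), in order.
    log = [
        ("alice", +100),
        ("bob", +50),
        ("alice", -30),
    ]

    # Replaying the log in order yields the "table": the latest state per key.
    balances = {}
    for account, change in log:
        balances[account] = balances.get(account, 0) + change

    print(balances)  # {'alice': 70, 'bob': 50}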

At Parse.ly, we just adopted Kafka widely in our backend to address exactly these use cases: data integration and real-time/historical analysis for large-scale web analytics. Before that, we were using ZeroMQ, which is good, but Kafka is better suited to this use case.

We have always had a log-centric infrastructure, not born out of any understanding of theory, but simply of requirements. We knew that as a data analysis company, we needed to keep data as raw as possible in order to do derived analysis, and we knew that we needed to harden our data collection services and make it easy to prototype data aggregates atop them.

I also recently read the book by Nathan Marz (creator of Apache Storm), which proposes a similar “log-centric” architecture, though Marz calls it a “master dataset” and uses the fanciful term “Lambda Architecture”. In his case, he describes how, atop a “timestamped set of facts” (essentially, a log), you can build any historical / real-time aggregates of your data via dedicated “batch” and “speed” layers. There is a lot of overlap in thinking between that book and this article.

[Image: LinkedIn’s log-centric stack, visualized.]

Read the rest of this entry »

Functional dynamic dispatch with Python’s new singledispatch decorator in functools

Sunday, October 20th, 2013

I just read through Python 3.4’s release notes and found a nice little gem.

I didn’t know what “Single Dispatch Functions” were all about. Sounded very abstract. But it’s actually pretty cool, and covered in PEP 443.

What’s going on here is that Python has added support for another kind of polymorphism known as “single dispatch”. This allows you to write a function with several implementations, each associated with one or more types for the function’s first argument. The “dispatcher” (called singledispatch and implemented as a Python function decorator) figures out which implementation to choose based on the type of that first argument. It also maintains a registry mapping types to function implementations.
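
Here is a quick sketch of the mechanism; the describe function is just a made-up example:

    from functools import singledispatch


    @singledispatch
    def describe(arg):
        # Fallback implementation, used when no registered type matches.
        return "something else: {!r}".format(arg)


    @describe.register(int)
    def _(arg):
        return "an integer: {}".format(arg)


    @describe.register(list)
    def _(arg):
        return "a list of {} items".format(len(arg))


    print(describe(42))      # an integer: 42
    print(describe([1, 2]))  # a list of 2 items
    print(describe("hi"))    # something else: 'hi'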

This is not technically “multimethods” — which can also be implemented as a decorator, as GvR did in 2005 — but it’s related. See the Wikipedia article on Dynamic Dispatch for more information.

The other interesting thing about this change is that the library is already on Bitbucket and PyPI, and it has been tested to work as a backport on Python 2.6+. So you can start using this today, even if you’re not on 3.x!

Read the rest of this entry »

Python double-under, double-wonder

Thursday, April 11th, 2013

Python has a number of protocols that classes can opt into by implementing one or more “dunder methods”, aka double-underscore methods. Examples include __call__ (make an object behave like a function) and __iter__ (make an object iterable).
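
For example, here is a tiny made-up class that opts into both of those protocols:

    class Countdown:
        def __init__(self, start):
            self.start = start

        def __call__(self, step=1):
            # __call__ makes instances behave like functions: countdown(1)
            return Countdown(self.start - step)

        def __iter__(self):
            # __iter__ makes instances iterable: for n in countdown: ...
            return iter(range(self.start, 0, -1))


    countdown = Countdown(3)
    print(list(countdown))     # [3, 2, 1]
    print(list(countdown(1)))  # [2, 1]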

The choice of wrapping these method names with double underscores on either side was really just a way of keeping the language simple. The Python creators didn’t want to steal perfectly good method names from you (such as “call” or “iter”), but they also didn’t want to introduce new syntax just to declare certain methods “special”. The dunders achieve the dual goals of calling attention to these methods and keeping them exactly like other plain methods in every respect except naming convention.

Read the rest of this entry »

PyCon 2013: The Debrief

Sunday, March 17th, 2013

PyCon US 2013 is over! It was a lot of fun — and super informative.

[Image: PyCon panorama.]

The People

For me, it was great to finally meet in person such friends and collaborators as
@__get__, @nvie, @jessejiryudavis, and @japerk.

It was of course a pleasure to see again such Python super-stars as
@adrianholovaty, @wesmckinn, @dabeaz, @raymondh, @brandon_rhodes, @alex_gaynor, and @fperez_org.

(Want to follow them all? I made a Twitter list.)

I also met a whole lot of other Python developers from across the US and even the world, and the entire conference had a great energy. The discussions over beers ranged from how to use Tornado effectively, to how to hack a Python shell into your vim editor, to how to scale a Python-based software team, to how to grow the community around an open source project.

In stark contrast to the events I’ve been typically going to in the last year (namely: ‘trade conferences’ and ‘startup events’), PyCon is unbelievably pure in its purpose and feel. This is where a community of bright, talented developers who share a common framework and language can push their collective skills to new heights.

And push them, we did.

Read the rest of this entry »

Rapid Web Prototyping with Lightweight Tools

Wednesday, March 13th, 2013

Today, I am teaching a tutorial at PyCon called “Rapid Web Prototyping with Lightweight Tools.” I’ll update this post with how it went, but here are the materials people are using for the course.

Read the rest of this entry »