Programming

The Log: a building block for large-scale data systems

Monday, December 16th, 2013

A software engineer at LinkedIn has written a monster of a blog post about “The Log”, a building block for large-scale data systems. The concepts in this post are near and dear to my heart due to my work on precisely these kinds of problems at Parse.ly.

What is “a log”?

The log is similar to the list of all credits and debits and bank processes; a table is all the current account balances. If you have a log of changes, you can apply these changes in order to create the table capturing the current state. This table will record the latest state for each key (as of a particular log time). There is a sense in which the log is the more fundamental data structure: in addition to creating the original table you can also transform it to create all kinds of derived tables.
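
To make the relationship concrete, here is a toy Python sketch (my own illustration, not code from the LinkedIn post): replaying an append-only log of key/value changes, in order, materializes the table of current state.

    # An append-only log of (key, value) change records.
    log = [
        ("alice", 100),
        ("bob", 50),
        ("alice", 75),   # a later change to the same key supersedes the earlier one
    ]

    # Replaying the log in order yields the "table": the latest state for each key.
    table = {}
    for key, value in log:
        table[key] = value

    print(table)  # {'alice': 75, 'bob': 50}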

At Parse.ly, we recently adopted Kafka widely in our backend to address exactly these needs: data integration and real-time/historical analysis for large-scale web analytics. Before that, we were using ZeroMQ, which is good, but Kafka is a better fit for this use case.

We have always had a log-centric infrastructure, born not out of any understanding of the theory, but simply out of our requirements. We knew that, as a data analysis company, we needed to keep data as raw as possible in order to do derived analysis, and we knew that we needed to harden our data collection services and make it easy to prototype data aggregates atop them.

I also recently read the book by Nathan Marz (creator of Apache Storm), which proposes a similar “log-centric” architecture, though Marz calls it a “master dataset” and uses the fanciful term “Lambda Architecture”. He describes how, atop a “timestamped set of facts” (essentially, a log), you can build any historical or real-time aggregates of your data via dedicated “batch” and “speed” layers. There is a lot of overlap in thinking between that book and this article.
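
As a rough sketch of that idea (my own toy example, not code from the book): the batch layer recomputes a view over all facts up to a cutoff, the speed layer covers only the facts that arrived since, and a query merges the two.

    from collections import Counter

    # Toy "master dataset": immutable, timestamped facts (here, pageview events).
    facts = [
        (1, "/home"), (2, "/about"), (3, "/home"),
        (4, "/home"),   # arrived after the last batch run
    ]

    BATCH_CUTOFF = 3

    # Batch layer: recomputed from scratch over everything up to the cutoff.
    batch_view = Counter(url for ts, url in facts if ts <= BATCH_CUTOFF)

    # Speed layer: covers only the facts newer than the cutoff.
    speed_view = Counter(url for ts, url in facts if ts > BATCH_CUTOFF)

    # Serving a query merges the two views.
    merged = batch_view + speed_view
    print(merged["/home"])  # 3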

[Figure: LinkedIn’s log-centric stack, visualized.]

Functional dynamic dispatch with Python’s new singledispatch decorator in functools

Sunday, October 20th, 2013

I just read through Python 3.4’s release notes and found a nice little gem.

I didn’t know what “Single Dispatch Functions” were all about. Sounded very abstract. But it’s actually pretty cool, and covered in PEP 443.

What’s going on here is that Python has added support for another kind of polymorphism known as “single dispatch”. This allows you to write a function with several implementations, each associated with one or more types of input arguments. The “dispatcher” (called singledispatch and implemented as a Python function decorator) figures out which implementation to choose based on the type of the first argument, and it maintains a registry mapping types to function implementations.
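
Here is a minimal sketch of how it is used (the describe function and its implementations are my own made-up example):

    from functools import singledispatch  # Python 3.4+; also on PyPI as "singledispatch"

    @singledispatch
    def describe(arg):
        """Fallback used when no more specific implementation is registered."""
        return "something else: {!r}".format(arg)

    @describe.register(int)
    def _(arg):
        return "an integer: {}".format(arg)

    @describe.register(list)
    def _(arg):
        return "a list of {} items".format(len(arg))

    print(describe(42))      # an integer: 42
    print(describe([1, 2]))  # a list of 2 items
    print(describe("hi"))    # something else: 'hi'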

This is not technically “multimethods” — which can also be implemented as a decorator, as GvR did in 2005 — but it’s related. See the Wikipedia article on Dynamic Dispatch for more information.

The other interesting thing about this change is that the library is already available on Bitbucket and PyPI as a backport that has been tested to work with Python 2.6+. So you can start using this today, even if you’re not on 3.x!

Python double-under, double-wonder

Thursday, April 11th, 2013

Python has a number of protocols that classes can opt into by implementing one or more “dunder methods”, aka double-underscore methods. Examples include __call__ (make an object behave like a function) and __iter__ (make an object iterable).
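
For instance, a toy class (my own example, not from any particular library) can opt into both of those protocols:

    class Countdown:
        """Toy example that opts into the callable and iterable protocols."""

        def __init__(self, start):
            self.start = start

        def __call__(self, start):
            # Calling an instance like a function returns a new Countdown.
            return Countdown(start)

        def __iter__(self):
            # Iterating counts down from start to 1.
            return iter(range(self.start, 0, -1))

    c = Countdown(3)
    print(list(c))     # [3, 2, 1]        (thanks to __iter__)
    print(list(c(5)))  # [5, 4, 3, 2, 1]  (thanks to __call__)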

The choice of wrapping these methods with double underscores on either side was really just a way of keeping the language simple. The Python creators didn’t want to steal perfectly good method names from you (such as “call” or “iter”), but they also didn’t want to introduce new syntax just to declare certain methods “special”. The dunders achieve the dual goals of calling attention to these methods and keeping them exactly like other plain methods in every respect except naming convention.

PyCon 2013: The Debrief

Sunday, March 17th, 2013

PyCon US 2013 is over! It was a lot of fun — and super informative.

The People

For me, it was great to finally meet in person such friends and collaborators as
@__get__, @nvie, @jessejiryudavis, and @japerk.

It was of course a pleasure to see again such Python super-stars as
@adrianholovaty, @wesmckinn, @dabeaz, @raymondh, @brandon_rhodes, @alex_gaynor, and @fperez_org.

(Want to follow them all? I made a Twitter list.)

I also met a whole lot of other Python developers from across the US and even the world, and the entire conference had a great energy. The discussions over beers ranged from how to use Tornado effectively, to how to hack a Python shell into your vim editor, to how to scale a Python-based software team, to how to grow the community around an open source project.

In stark contrast to the events I’ve typically been going to in the last year (namely, “trade conferences” and “startup events”), PyCon is unbelievably pure in its purpose and feel. This is where a community of bright, talented developers who share a common framework and language can push their collective skills to new heights.

And push them, we did.

Rapid Web Prototyping with Lightweight Tools

Wednesday, March 13th, 2013

Today, I am teaching a tutorial at PyCon called “Rapid Web Prototyping with Lightweight Tools.” I’ll update this post with how it went, but here are the materials people are using for the course.

Solidify your Python web skills in two days at PyCon US 2013

Friday, February 8th, 2013

PyCon US 2013 is coming up in March. It is in beautiful Santa Clara, right outside of Palo Alto / San Francisco.

The main conference is sold out, but there are still a few spots open for the tutorial sessions.

(Here’s a secret: the tutorials are where I’ve always learned the most at PyCon.)

Most of PyCon’s attendees are Python experts and practitioners. However, Python is one of the world’s greatest programming languages because it is among the most teachable and learnable. Attending PyCon is a great way to rapidly move yourself from the “novice” column to the “expert” column in Python programming skill.

This year, there is an excellent slate of tutorial sessions available before the conference starts. They cost $150 each, which is a tremendous value for a three-hour, in-depth session on a Python topic. I know a lot of people who are getting into Python as a way to build web applications, and there is a great “novice web developer” track in this year’s tutorials, which I’ll outline in this post.

Fully distributed teams: in lists

Thursday, September 20th, 2012

Things fully distributed teams need:

  • real-time chat
  • hosted code repos and code review
  • async updates
  • email groups
  • basic project management
  • bug / issue tracking
  • customer support tools
  • easy way to share files
  • standard way to collaborate on documents and drawings
  • personal task lists
  • personal equipment budgets
  • wiki
  • team calendar
  • webcams (caution: use sparingly)

Things fully distributed teams are happy to live without:

  • constant interruptions
  • long commutes
  • “brainstorming sessions”
  • all-hands meetings
  • equipment fragmentation
  • slow, shared internet
  • 9-to-5
  • “that guy”

Things fully distributed teams do miss out on:

  • face time
  • a good, group laugh
  • after-work beers
  • office serendipity

My old backpack

Thursday, August 30th, 2012

Ten years ago today, I bought myself a birthday present. It was a Brenthaven Backpack.

At the tender age of 18, I coveted few things. But among the web designers and programmers whose blogs I read regularly and whom I looked up to, this backpack was the ultimate in durability and functionality.

It featured a padded, hardened laptop sleeve that could sustain even a dead drop from ten or fifteen feet. It had padded, adjustable shoulder straps. It was made from a seemingly indestructible material. It had hidden pockets everywhere.

At the time, I didn’t have a laptop — just a desktop computer. It ran Windows and Linux, and I used it mostly for web design and Macromedia Flash programming. Adobe hadn’t bought Macromedia yet.

Notebook computers were generally clunky and underpowered devices — not meant for doing “real work”. But my Dad purchased me a used Titanium PowerBook from a friend of his — and I knew this was a true luxury.

Progress Tiers: Epic, Story, Task, Step

Saturday, August 25th, 2012

I realize that for about 10 years now, I’ve been doing project-oriented work — generally, writing software, with the working product taking shape over the course of months and even years.

I have developed a theory of “progress tiers” in how this work is optimally managed.

Epics are high-level themes of functionality that manifest in software. For example, “E-mail Notifications”. This is too vague to express a specific user feature, but it does express an area of strategic importance to the product. For example, it may be that the product is used primarily via the web, that it lacks engagement from some users, and that all users of the system are also active e-mail users. Therefore, it makes sense that the application would generate some e-mail notifications — but it’s not yet clear which ones are the right ones, how they should look, how frequently they should arrive, etc.

Understanding the priority of Epics helps the team understand its product roadmap and vision, and the strategic context for the functionality they deliver.

The Debian Manifesto

Thursday, August 16th, 2012

The Debian design process is open to ensure that the system is of the highest quality and that it reflects the needs of the user community. By involving others with a wide range of abilities and backgrounds, Debian is able to be developed in a modular fashion. Its components are of high quality because those with expertise in a certain area are given the opportunity to construct or maintain the individual components of Debian involving that area. Involving others also ensures that valuable suggestions for improvement can be incorporated into the distribution during its development; thus, a distribution is created based on the needs and wants of the users rather than the needs and wants of the constructor. It is very difficult for one individual or small group to anticipate these needs and wants in advance without direct input from others.

This amazing quote from 1994 (!!!) actually models the way I think about software engineering at Parse.ly.

A nice piece of nostalgia on Debian’s 19th birthday.

See also: A Brief History of Debian, The Debian Policy Manual, & The Debian Developers Map.