The Log: a building block for large-scale data systems

Monday, December 16th, 2013

A software engineer at LinkedIn has written a monster of a blog post about “The Log”, a building block for large-scale data systems. The concepts in this post are near and dear to my heart due to my work on precisely these kinds of problems at

What is “a log”?

The log is similar to the list of all credits and debits a bank processes; a table is all the current account balances. If you have a log of changes, you can apply these changes in order to create the table capturing the current state. This table will record the latest state for each key (as of a particular log time). There is a sense in which the log is the more fundamental data structure: in addition to creating the original table you can also transform it to create all kinds of derived tables.
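
To make the bank analogy concrete, here is a minimal sketch in Python. The `Entry` record, the account names, and the `replay` helper are all made up for illustration; nothing here comes from Kafka or from the original post.

```python
from collections import namedtuple

# Hypothetical record shape: one log entry per credit or debit.
Entry = namedtuple("Entry", ["offset", "account", "delta"])

log = [
    Entry(0, "alice", 100),
    Entry(1, "bob", 50),
    Entry(2, "alice", -30),
    Entry(3, "bob", 25),
]

def replay(log, up_to=None):
    """Apply log entries in order and return the table of balances.

    If up_to is given, stop after that offset, which yields the
    table "as of" that point in log time.
    """
    table = {}
    for entry in log:
        if up_to is not None and entry.offset > up_to:
            break
        table[entry.account] = table.get(entry.account, 0) + entry.delta
    return table

current = replay(log)           # latest state for each key
as_of_1 = replay(log, up_to=1)  # state as of an earlier log time
```

Replaying the same log with a different fold (say, counting entries per account instead of summing them) is how you would get a derived table without touching the source data.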

At, we just adopted Kafka widely in our backend to address just these use cases for data integration and real-time/historical analysis for the large-scale web analytics use case. Prior, we were using ZeroMQ, which is good, but Kafka is better for this use case.

We have always had a log-centric infrastructure, not born out of any understanding of theory, but simply of requirements. We knew that as a data analysis company, we needed to keep data as raw as possible in order to do derived analysis, and we knew that we needed to harden our data collection services and make it easy to prototype data aggregates atop them.

I also recently read Nathan Marz’s book (creator of Apache Storm), which proposes a similar “log-centric” architecture, though Marz calls it a “master dataset” and uses the fanciful term, “Lambda Architecture”. In his case, he describes that atop a “timestamped set of facts” (essentially, a log) you can build any historical / real-time aggregates of your data via dedicated “batch” and “speed” layers. There is a lot of overlap of thinking in that book and in this article.
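
A toy sketch of the batch/speed split, in Python. This is my own simplification and is not taken from the book: the page-view facts, the `BATCH_CUTOFF` value, and the function names are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical timestamped facts: (timestamp, url) page-view events.
facts = [
    (1, "/home"), (2, "/about"), (3, "/home"),
    (4, "/home"), (5, "/pricing"),
]

# Assumed: the last batch job covered facts with timestamp <= 3.
BATCH_CUTOFF = 3

def batch_view(facts, cutoff):
    """Recompute counts from scratch over all facts up to the cutoff."""
    return Counter(url for ts, url in facts if ts <= cutoff)

def speed_view(facts, cutoff):
    """Count only the recent facts the batch job has not seen yet."""
    return Counter(url for ts, url in facts if ts > cutoff)

def query(url, batch, speed):
    """Answer a query by merging the batch and speed views."""
    return batch[url] + speed[url]

batch = batch_view(facts, BATCH_CUTOFF)
speed = speed_view(facts, BATCH_CUTOFF)
home_views = query("/home", batch, speed)
```

Because both views are derived from the same immutable facts, either one can be thrown away and rebuilt.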


LinkedIn’s log-centric stack, visualized.

Functional dynamic dispatch with Python’s new singledispatch decorator in functools

Sunday, October 20th, 2013

I just read Python 3.4’s release notes. I found a nice little gem.

I didn’t know what “Single Dispatch Functions” were all about. Sounded very abstract. But it’s actually pretty cool, and covered in PEP 443.

What’s going on here is that Python has added support for another kind of polymorphism known as “single dispatch”. This allows you to write a function with several implementations, each associated with one or more types of the function’s first argument. The “dispatcher” (called singledispatch and implemented as a Python function decorator) figures out which implementation to choose based on the type of that single argument, which is where the “single” comes from. It also maintains a registry of types to function implementations.
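
Here is a small sketch of what that looks like with `functools.singledispatch` on Python 3.4+. The `describe` function and its implementations are made-up examples, not code from the PEP.

```python
from functools import singledispatch

@singledispatch
def describe(value):
    """Fallback implementation, used when no registered type matches."""
    return "object: {!r}".format(value)

@describe.register(int)
def _describe_int(value):
    return "integer: {}".format(value)

# One implementation can be registered for several types
# by stacking register() calls.
@describe.register(list)
@describe.register(tuple)
def _describe_sequence(value):
    return "sequence of {} items".format(len(value))

results = [describe(3), describe((1, 2, 3)), describe("hi")]
```

The registry mentioned above is exposed as `describe.registry`, and `describe.dispatch(some_type)` tells you which implementation would be chosen for a given type.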

This is not technically “multimethods” — which can also be implemented as a decorator, as GvR did in 2005 — but it’s related. See the Wikipedia article on Dynamic Dispatch for more information.

Also, the other interesting thing about this change is that the library is already on Bitbucket and PyPI and has been tested to work as a backport with Python 2.6+. So you can start using this today, even if you’re not on 3.x!

Python double-under, double-wonder

Thursday, April 11th, 2013

Python has a number of protocols that classes can opt into by implementing one or more “dunder methods”, aka double-underscore methods. Examples include __call__ (make an object behave like a function) or __iter__ (make an object iterable).

The choice of wrapping these functions with double-underscores on either side was really just a way of keeping the language simple. The Python creators didn’t want to steal perfectly good method names from you (such as “call” or “iter”), but they also did not want to introduce some new syntax just to declare certain methods “special”. The dunders achieve the dual goal of calling attention to these methods while also making them just the same as other plain methods in every aspect except naming convention.
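
A quick illustration, using a made-up `Countdown` class that opts into both protocols mentioned above:

```python
class Countdown:
    """Hypothetical example class that opts into two protocols."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Makes instances iterable: for n in Countdown(3): ...
        return iter(range(self.start, 0, -1))

    def __call__(self, step):
        # Makes instances callable: Countdown(3)(1)
        return Countdown(self.start - step)

c = Countdown(3)
numbers = list(c)     # uses __iter__
shorter = list(c(1))  # uses __call__, then __iter__

# Dunders are still plain methods, so calling one by name works too.
same_numbers = list(Countdown.__iter__(c))
```

Nothing about the class declares these methods as special; the names alone are what the interpreter looks for.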

PyCon 2013: The Debrief

Sunday, March 17th, 2013

PyCon US 2013 is over! It was a lot of fun — and super informative.


The People

For me, it was great to finally meet in person such friends and collaborators as
@__get__, @nvie, @jessejiryudavis, and @japerk.

It was of course a pleasure to see again such Python super-stars as
@adrianholivaty, @wesmckinn, @dabeaz, @raymondh, @brandon_rhodes, @alex_gaynor, and @fperez_org.

(Want to follow them all? I made a Twitter list.)

I also met a whole lot of other Python developers from across the US and even the world, and the entire conference had a great energy. The discussions over beers ranged from how to use Tornado effectively to how to hack a Python shell into your vim editor to how to scale a Python-based software team to how to grow the community around an open source project.

In stark contrast to the events I’ve been typically going to in the last year (namely: ‘trade conferences’ and ‘startup events’), PyCon is unbelievably pure in its purpose and feel. This is where a community of bright, talented developers who share a common framework and language can push their collective skills to new heights.

And push them, we did.

Rapid Web Prototyping with Lightweight Tools

Wednesday, March 13th, 2013

Today, I am teaching a tutorial at PyCon called “Rapid Web Prototyping with Lightweight Tools.” I’ll update this post with how it went, but here are the materials people are using for the course.

Solidify your Python web skills in two days at PyCon US 2013

Friday, February 8th, 2013

PyCon US 2013 is coming up in March. It is in beautiful Santa Clara, right outside of Palo Alto / San Francisco.

The main conference is sold out, but there are still a few spots open for the tutorial sessions.

(Here’s a secret: the tutorials are where I’ve always learned the most at PyCon.)

Most of PyCon’s attendees are Python experts and practitioners. However, Python is one of the world’s greatest programming languages because it is one of the most teachable and learnable. Attending PyCon is a great way to rapidly move yourself from the “novice” to “expert” column in Python programming skills.

This year, there is an excellent slate of tutorial sessions available before the conference starts. These cost $150 each, which is a tremendous value for a 3-hour, in-depth session on a Python topic. I know of a lot of people who are getting into Python as a way to build web applications. There is actually a great “novice web developer” track in this year’s tutorials, which I’ll outline on this page.

Information fanaticism

Tuesday, September 25th, 2012

On finding alternative sources of news in the pre-web era (this quote comes from ~1992):

The information is there, but it’s there to a fanatic, you know, somebody wants to spend a substantial part of their time and energy exploring it and comparing today’s lies with yesterday’s leaks and so on. That’s a research job and it just simply doesn’t make sense to ask the general population to dedicate themselves to this task on every issue.


Very few people are going to have the time or the energy or the commitment to carry out the constant battle that’s required to get outside of MacNeil/Lehrer or Dan Rather or somebody like that. The easy thing to do, you know — you come home from work, you’re tired, you’ve had a busy day, you’re not going to spend the evening carrying on a research project, so you turn on the tube and say, “it’s probably right”, or you look at the headlines in the paper, and then you watch the sports or something.

That’s basically the way the system of indoctrination works. Sure, the other stuff is there, but you’re going to have to work to find it.

Cloud GNU: where are you?

Saturday, August 18th, 2012

This continues an article I wrote nearly three years ago, Common Criticisms of Linux, parsed and analyzed.

In the three years since I wrote that original piece, Linux has grown, albeit slowly, in desktop usage. After nearly 2 years of no growth (2008-2010, lingering around 1% of market), in 2011 Linux saw a significant uptick in desktop adoption (+64% from May 2011 to January 2012). However, Linux’s desktop share is still about 1/5 of the share of Apple OS X and 1/50 the share of Microsoft Windows. This despite the fact that Linux continues to dominate Microsoft in the server market.

The proprietary software industry may be filled with vaporware, mediocre software, and heavyweight kludges, but there is certainly also a lot of good stuff that keeps users coming back.

However, I believe the 2011/2012 up-tick in Linux desktop usage reflects a different trend: the increasingly commoditized role that desktop operating systems (and by extension, desktop software) play in an omni-connected world of cloud software.

Why doesn’t Linux run software application X or Y?

For end users, the above was a core complaint for many years (approx. 2000-2009) when evaluating Linux. However, this complaint has faded in the last two years. Let’s reflect on the most common and useful pieces of software on desktop operating systems these days:

The Debian Manifesto

Thursday, August 16th, 2012

The Debian design process is open to ensure that the system is of the highest quality and that it reflects the needs of the user community. By involving others with a wide range of abilities and backgrounds, Debian is able to be developed in a modular fashion. Its components are of high quality because those with expertise in a certain area are given the opportunity to construct or maintain the individual components of Debian involving that area. Involving others also ensures that valuable suggestions for improvement can be incorporated into the distribution during its development; thus, a distribution is created based on the needs and wants of the users rather than the needs and wants of the constructor. It is very difficult for one individual or small group to anticipate these needs and wants in advance without direct input from others.

This amazing quote from 1994 (!!!) actually models the way I think about software engineering at

A nice piece of nostalgia on Debian’s 19th birthday.

See also: A Brief History of Debian, The Debian Policy Manual, & The Debian Developers Map.

Build a web app fast: Python, HTML & JavaScript resources

Thursday, June 14th, 2012

Wanna build a web app fast? Know a little bit about programming but want to build a modern web app using two well-supported, well-documented, and universally accessible languages? You’ll love these Python, HTML/CSS, and JavaScript resources.

I’ve been sharing these documents with friends who ask me, “I want to start programming and build a web app, where do I start?”. These resources have also been useful to existing programmers who know C, C++ or Java, but who want to embrace dynamic and web-based programming.

Python Resources

Python is the core programming language used at It also happens to be a quickly-growing language with wide adoption in the open source community, and it is a very popular choice for web startups.

I’ve written a blog post with some original materials for learning Python, import this — learning the Zen of Python with code and slides.

This is a good starting point, but you may also find these resources very helpful:

  • For absolute beginners, “Learn Python the Hard Way”. This teaches Python using a series of programming examples, but it really assumes you have no programming background whatsoever. After going through the examples in LPTHW, it may be a good idea to supplement your understanding with Think Python.
  • For existing programmers, “Dive into Python 3”. This teaches Python from the starting point that you have already programmed in a mainstream language like C or Java, and want to know what makes Python really cool/good. Similar audience to my “Zen of Python” slides. Note that this tutorial teaches Python 3, but most people still use Python 2.7. See Python2orPython3 on the Python wiki to see the differences.
  • For advanced programmers, “Python Essential Reference, 4th Edition”. Unfortunately, this book costs money, but it’s basically the best book on Python on the market, and it’s very up-to-date. It’s very dense and weighs in at 717 pages, so this is only for those who want to go deep on Python.
  • For cheap advanced programmers, “Official Python Tutorial”. Though the Python tutorial doesn’t have the best narrative style nor the best real-world examples, for advanced programmers, it will teach the reality of the language in a comprehensible way. And, it’s free.

HTML/CSS Resources

In order to build up web applications, you’ll need to write your front-ends in HTML and CSS. These technologies have evolved over the years, but the basic principles remain from when they emerged nearly two decades ago. HTML is the markup language of the web, and you’ll see a lot of tutorials refer to HTML4, which is basically the markup standard all web browsers and websites work off. Don’t be confused by the HTML5 moniker, which often refers to much more than simply the markup; usually, it refers to a set of JavaScript APIs that are becoming standard in browsers, along with enhanced audio/video support and a few new “semantic markup” tags that have been added.

Since HTML is basically useless without CSS, you can get by with a short tutorial on HTML and then more advanced tutorials on CSS styling. Here’s what I recommend.

Learn the basics of HTML from MDC’s Introduction to HTML and Wikipedia’s page on HTML. This is a rare case where using Wikipedia is actually a perfect way to get the right background because half the battle with understanding HTML is understanding its history.

An excellent new guide to HTML & CSS together was published by Shay Howe in 2013.

These look like a great first stop.

You can also use these dedicated resources for CSS specifically:

  • For absolute beginners: Use W3C’s official tutorial on Starting with HTML + CSS. This was written all the way back in 2004, but provides the basics with screenshots and real code examples, so is a great way to get started.
  • For existing programmers: Mozilla has done a great job putting together a quick and readable tutorial that gives you the basics at a glance.
  • For advanced programmers: You’ll want to buy the best book on the subject, CSS Mastery. It has the best explanation of the box model and browser rendering engines that I’ve seen, and covers all the edge cases nicely.
  • For cheap advanced programmers: You’ll need to look over the MDC (Mozilla) CSS Reference. Pay particularly close attention to the articles on the Box Model and the Visual Formatting Model.

JavaScript Resources

Aside from Python, every engineer also knows JavaScript, even if it is only begrudgingly. For better or for worse, JavaScript has become the world’s most popular programming language.

JavaScript is definitely the language of the web. It is also a language that has, over the last few years, developed a nice bit of great documentation for learning the language. Here are some resources you can use to get up to speed:

  • For absolute beginners: “Eloquent JavaScript” introduces you to both modern programming techniques and JavaScript at the same time. It is thus a great book for beginners. There is also a print version available.
  • For existing programmers: The Mozilla Developer Network (MDN) contains the web’s best and most official documentation of HTML, CSS, and JavaScript. This guide, “A Re-Introduction to JavaScript”, presents the language to an audience that already knows how to program, and focuses specifically on the “gotcha” parts of the language.
  • For advanced programmers: A must-read is the short (but costly) “JavaScript: The Good Parts”. Douglas Crockford basically reintroduced the world to JavaScript as a modern programming language. He is a bit of a curmudgeon when it comes to programming style, but this makes sense since he is also the author of JSLint, an important tool used in JS development for static code checking.
  • For cheap advanced programmers: Douglas Crockford, author of the above “Good Parts” book, has also given a series of public video lectures on JavaScript at Yahoo! headquarters. These are freely available online and actually present much of the same content as “Good Parts”, just in a condensed form. Warning for the cheap: though the videos are very good, the book goes into more depth and spends less time on the history of the language. Also, Matt Might’s JavaScript, Warts and workarounds is an excellent summary of some of the most important “bad parts” of JavaScript.

JavaScript “frameworks”

Though knowing JS is important to do anything web-facing, you can also leverage some frameworks to help you out. The ones I recommend are the venerable jQuery JavaScript library and the Twitter Bootstrap HTML/CSS/JavaScript components.

jQuery adds common utilities for DOM manipulation, server requests, basic animations and dynamic CSS. Bootstrap builds on jQuery and adds a common, simple UI component library using pure HTML, CSS and JavaScript. It provides a grid system for layout; nicely-designed stylesheets for typography, tables, lists, and buttons; and JavaScript components that add dynamic behavior such as tabs, dropdowns, modal dialogs, navigation bars, and more.

