Computer Science

Simple Lego Blocks for Big Data

Monday, November 30th, 2015

Data engineers should abstract their code in the most lightweight way possible to facilitate downstream integration in a large-scale data system.

You want lego blocks, not puzzle pieces.

The creators of the C programming language once famously said, “first make it work, then make it right, and, finally, make it fast.” This adage still applies today.

The difference is, we have tools to take working code and validate that it is right against reams of data. Many of these tools can also be used to make the working, right code run really fast across a cluster of machines, possibly even in real-time, as the data comes in.

But, making code work, then right, then fast, requires some discipline.
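To make the “lego block” idea a bit more concrete, here is a minimal sketch of my own (the function names are purely illustrative): a pure Python function that knows nothing about any framework can be exercised against a plain list today, and wrapped by a Storm bolt or a Spark transformation tomorrow, without modification.

    from collections import Counter

    def parse_pageview(raw_line):
        """Turn one raw, tab-separated log line into a (url, count) pair."""
        url, _, count = raw_line.partition("\t")
        return url, int(count or 0)

    def top_urls(pairs, n=10):
        """Aggregate (url, count) pairs and return the n most-viewed URLs."""
        totals = Counter()
        for url, count in pairs:
            totals[url] += count
        return totals.most_common(n)

    # Works against a plain in-memory list today...
    sample = ["/home\t3", "/about\t1", "/home\t2"]
    print(top_urls(parse_pageview(line) for line in sample))
    # [('/home', 5), ('/about', 1)]

    # ...and, because neither function assumes anything about where the data
    # comes from, the same code can later be wrapped by a Storm bolt or a
    # Spark map/reduce without changes.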

Read the rest of this entry »

Programming: it’s weird

Sunday, June 14th, 2015

I read the Bloomberg piece, What Is Code?, an explanation of code artistry and programmer/hacker culture in 2015. I love this paragraph about “languages as liquid infrastructure”:

The point is that things are fluid in the world of programming, fluid in a way that other industries don’t seem to be. Languages are liquid infrastructure. You download a few programs and, whoa, suddenly you have a working Clojure environment. Which is actually the Java Runtime Environment. You grab an old PC that’s outlived its usefulness, put Linux on it, and suddenly you have a powerful Web server. Now you can participate in whole new cultures. There are meetups, gatherings, conferences, blogs, and people chatting on Twitter. And you are welcomed. They are glad for the new blood.

Java was supposed to supplant C and run on smart jewelry. Now it runs application servers, hosts Lisplike languages, and is the core language of the Android operating system. It runs on billions of things. It won. C and C++, which it was designed to supplant, also won. A lot of things keep winning because computers keep getting more plentiful. It’s weird.


Worse is better, is worse, is better, is worse, is better…

Web interest in Apache Storm, Kafka, Spark in the Python community

Thursday, November 27th, 2014

Apache Storm, Kafka, and Spark are gaining a lot of momentum in the data analysis and processing communities. I was curious whether the interest in using these technologies with Python, in particular, is growing. Based on these Google Trends reports, it seems like it is.

Read the rest of this entry »

Clojonic: Pythonic Clojure

Sunday, November 2nd, 2014

In June 2012, I promised myself that I’d learn Clojure “as a mind expander”. As a long-time Python programmer who has been using Python full-time in my work at Parse.ly, I wanted to explore. I wrote then:

I don’t know whether Clojure programs will be better or worse than equivalent Python programs. But I know they will be different.

It took me a while, but in January of this year, I started teaching myself the language.

Rich Hickey, and the “Cult of Personality”

My approach was to first learn the underpinnings of the language from books and online videos. If you embark on this for Clojure, you will inevitably run into the copious publicly-available material from the language’s creator, Rich Hickey.

In stark contrast to Guido van Rossum in the Python community, Rich Hickey is undeniably not just the Clojure language’s creator, but also a kind of spokesperson for a functional programming renaissance. Guido van Rossum generally lies low, lets the Python language and community speak for themselves, and tries to avoid controversy. To him, Python is just a popular tool he happened to create; it doesn’t represent any major paradigm shift in programming, but rather a positive evolutionary improvement supported by a great open source ecosystem and community. To Hickey, however, “traditional” programming languages — especially popular ones with an object-oriented focus, such as Java and C++ — are just plain wrong. He proposes Clojure as an antidote of sorts.

You can get the gist of this from his motivating videos, such as Hammock-Driven Development, Are We There Yet?, and Simple Made Easy. For a thorough overview of Clojure as a language, you can also get a walkthrough by Hickey, given to a room full of Java developers, in Clojure for Java Programmers Part I and Part II.

Here is a summary of the viewpoint. Most languages are missing some important attributes that can help us tackle the most complex issues in programming projects:

Read the rest of this entry »

“So, you work in IT?”

Monday, August 18th, 2014

For many years, IT as a field was dominated by people who could not write code.

This is because computer technology was so mystifying and befuddling to most people that anyone who knew merely how to use computers with any level of comfort could demand a tax from those who didn’t.

During that same period (the late ’90s and early 2000s), programming itself was being commoditized by offshore outsourcing, so the same IT people were positioning themselves for management roles. This is how MIS (Management Information Systems) became a popular career path among the IT elite, and why, when I was in college from 2002 to 2006, Comp Sci enrollment was at a major low.

Read the rest of this entry »

streamparse: Python + Apache Storm for real-time stream processing

Sunday, May 4th, 2014

Parse.ly released streamparse today, which lets you run Python code against real-time streams of data by integrating with Apache Storm.

We released it for our talk, “Real-time streams & logs with Apache Kafka and Storm” at PyData Silicon Valley 2014.

The initial release (0.0.5) includes a command-line tool, sparse, which can set up and run local Storm-friendly Python projects.
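As a rough sketch of what the Python side looks like (hedged: the import path and method names here follow later streamparse releases, and the 0.0.5 API may differ), a bolt is just a class with a process method:

    from streamparse import Bolt

    class WordCountBolt(Bolt):
        """Counts words arriving on the stream and emits running totals."""

        def initialize(self, conf, ctx):
            self.counts = {}

        def process(self, tup):
            word = tup.values[0]
            self.counts[word] = self.counts.get(word, 0) + 1
            # Emit the word along with its updated count downstream.
            self.emit([word, self.counts[word]])

From there, sparse quickstart scaffolds a project and sparse run runs the topology against a local Storm cluster (again, command names as in later releases).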

Read the rest of this entry »

The Log: a building block for large-scale data systems

Monday, December 16th, 2013

A software engineer at LinkedIn has written a monster of a blog post about “The Log”, a building block for large-scale data systems. The concepts in this post are near and dear to my heart due to my work on precisely these kinds of problems at Parse.ly.

What is “a log”?

The log is similar to the list of all credits and debits and bank processes; a table is all the current account balances. If you have a log of changes, you can apply these changes in order to create the table capturing the current state. This table will record the latest state for each key (as of a particular log time). There is a sense in which the log is the more fundamental data structure: in addition to creating the original table you can also transform it to create all kinds of derived tables.
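To make the quote concrete, here is a tiny, purely illustrative sketch: replaying a log of changes, in order, produces the table of current state.

    # The "log": an ordered, append-only sequence of changes (credits/debits).
    log = [
        ("alice", +100),
        ("bob",   +50),
        ("alice", -30),
    ]

    # The "table": the current balance per account, derived by replaying
    # the log in order.
    table = {}
    for account, change in log:
        table[account] = table.get(account, 0) + change

    print(table)  # {'alice': 70, 'bob': 50}

    # Any number of other derived tables (per-day totals, per-account
    # histories, and so on) can be built by re-reading the same log.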

At Parse.ly, we recently adopted Kafka widely in our backend to address exactly these needs: data integration and real-time/historical analysis for large-scale web analytics. Before that, we were using ZeroMQ, which is good, but Kafka is better suited to this use case.

We have always had a log-centric infrastructure, born not out of any understanding of theory, but simply out of requirements. We knew that, as a data analysis company, we needed to keep data as raw as possible in order to do derived analysis, and we knew that we needed to harden our data collection services and make it easy to prototype data aggregates atop them.

I also recently read the book by Nathan Marz (creator of Apache Storm), which proposes a similar “log-centric” architecture, though Marz calls it a “master dataset” and uses the fanciful term “Lambda Architecture”. In his case, he describes how, atop a “timestamped set of facts” (essentially, a log), you can build any historical or real-time aggregates of your data via dedicated “batch” and “speed” layers. There is a lot of overlap in thinking between that book and this article.
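As a loose, hypothetical sketch of that idea (not code from the book): the batch layer periodically recomputes aggregates from the full log of facts, the speed layer incrementally covers only the most recent facts, and a query merges the two.

    from collections import Counter

    # The immutable, timestamped "master dataset" (essentially, a log).
    facts = [
        {"ts": 1, "url": "/home"},
        {"ts": 2, "url": "/about"},
        {"ts": 3, "url": "/home"},
    ]

    def batch_view(facts, up_to_ts):
        """Recomputed from scratch, periodically, over all facts <= up_to_ts."""
        return Counter(f["url"] for f in facts if f["ts"] <= up_to_ts)

    def speed_view(facts, since_ts):
        """Maintained incrementally over facts newer than the last batch run."""
        return Counter(f["url"] for f in facts if f["ts"] > since_ts)

    def query(url, batch, speed):
        """Merge the two layers at read time."""
        return batch[url] + speed[url]

    batch = batch_view(facts, up_to_ts=2)
    speed = speed_view(facts, since_ts=2)
    print(query("/home", batch, speed))  # 2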

LinkedIn’s log-centric stack, visualized.

Read the rest of this entry »

PyCon 2013: The Debrief

Sunday, March 17th, 2013

PyCon US 2013 is over! It was a lot of fun — and super informative.

The People

For me, it was great to finally meet in person such friends and collaborators as
@__get__, @nvie, @jessejiryudavis, and @japerk.

It was of course a pleasure to see again such Python super-stars as
@adrianholovaty, @wesmckinn, @dabeaz, @raymondh, @brandon_rhodes, @alex_gaynor, and @fperez_org.

(Want to follow them all? I made a Twitter list.)

I also met a whole lot of other Python developers from across the US and even the world, and the entire conference had a great energy. The discussions over beers ranged from how to use Tornado effectively to how to hack a Python shell into your vim editor to how to scale a Python-based software team to how to grow the community around an open source project.

In stark contrast to the events I’ve typically been going to in the last year (namely, ‘trade conferences’ and ‘startup events’), PyCon is unbelievably pure in its purpose and feel. This is where a community of bright, talented developers who share a common framework and language can push their collective skills to new heights.

And push them, we did.

Read the rest of this entry »

Computer Science and “soft” skills

Thursday, March 29th, 2012

My friend Jennifer Anyaegbunam (@JenniferAdaeze) has published a new piece on the Huffington Post about the role of the humanities in medical education.

Matriculating into medical school, we were proud of our humanities roots and felt it made us uniquely poised to become great clinicians. Yet, we have often found that we have had to defend our educational choices to interviewers, advisors and even our peers — something science majors rarely, if ever, have to do. This is because the medical humanities is often regarded as a “second tier” or an extracurricular interest and not something that is fundamental to the practice of medicine.

She finds that the humanities are derided in a classroom setting, as well:

Courses on ethics and social science are few and far between. To make matters worse, students often do not take these exercises seriously, and these courses are often the ones with the poorest attendance, for example.

Here, I’ll offer a parallel from a different field: computer science.

As a computer science major at NYU, I too encountered hostility and a dismissive attitude toward the humanities and other “softer” fields from my peers.

A traditional computer science curriculum consists of mathematics, algorithms, and theory. These are important areas of academic interest, and provide a good foundation for thinking about the deepest problems surrounding computation. But the vast majority of computer science majors don’t go on to research computation. They go on to practice it — by becoming software professionals (programmers), writing applications used by real people.

It turns out that to be a successful software professional, you need much more than a computer science background. Indeed, many of the world’s most successful programmers have no computer science background at all. My father was a software professional, but when he graduated from college, computer science did not even exist as a field of study!

You need software design skills, which are often taught only trivially, if at all, in traditional curricula; they are considered “vocational”. You need communication, management, and product design skills; these are treated as too “soft” to be taken seriously.

The industry suffers from a widespread lack of these skills.

Read the rest of this entry »

XDDs: stay healthily skeptical and don’t drink the kool-aid

Sunday, February 12th, 2012

On my LinkedIn profile, I list one of my skills as “thought-driven development”. This is a little tongue-in-cheek; software engineering over the last few years has accumulated a lot of “XDDs”: test-driven development, behavior-driven development, model-driven development, and so on.

“Thought-driven development” doesn’t actually exist, but by it, I simply mean: perhaps we should think about what we’re doing, rather than reaching for a nearby methodology du jour.

In my last job, a colleague of mine used to also joke about “design-driven design” — perhaps the ultimate play on the XDDs since it is also a strange loop.

All this is not to say the XDDs aren’t useful — they definitely are. A lot of them have spawned entire groups of cross-platform open source projects. I am all for anything that makes the adoption of XYZ best practice easier for my team. But these techniques often require some lateral thinking to get to any real benefit.

When evaluating technologies like this, you have to take each little community with a grain of salt. Almost every programming framework or methodology that exists purports to offer some order-of-magnitude increase in software reliability, developer productivity, or whatever else. And almost all, if not all, fail to do so in practice.

Here is one anecdote to illustrate the point. From 2006-2008 at Morgan Stanley, the entire corporation was obsessed with Java’s Spring framework and its core “architectural pattern”, Inversion of Control. I can’t even begin to explain to you the number of man-hours that were wasted re-architecting existing, working software to meet this chimerical conception of component decoupling. I even contributed to this, urged by the zealots and their blind faithful. All of the reasons seemed great: decoupling code, using more interfaces, allowing for easier unit testing, being able to “rewire dependencies” and use fancy technologies like “aspect-oriented programming”.

Even Google got swept up in the madness and developed their own, competing framework called Guice. And in the end — after all that work — my diagnosis is that IoC is basically a non-starter, a complete waste of time.

It is a complicated framework that morphed into a programming methodology, developed exclusively to work around some annoying limitations of the Java language. Because it was applied without thinking, everyone’s Java code now has to suffer, and you can hardly pick up a Java web application today without being crushed by the weight of its IoC container’s XML configuration files. (Never mind that most other communities, such as Python’s and Ruby’s, have hardly a clue what IoC is all about — a good enough indication that it is a waste of time.)
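To be fair to the underlying idea, the useful kernel of IoC is simply “pass your dependencies in rather than constructing them yourself”, which a dynamic language gives you without any container. A hypothetical Python sketch (all names invented for illustration):

    class ReportService:
        def __init__(self, storage, clock):
            # Dependencies are injected by the caller: tests pass in fakes,
            # production code passes in real implementations. No XML required.
            self.storage = storage
            self.clock = clock

        def daily_report(self):
            return self.storage.fetch(day=self.clock())

    class FakeStorage:
        def fetch(self, day):
            return ["row-1", "row-2"]

    # "Rewiring dependencies" is just passing different arguments.
    service = ReportService(storage=FakeStorage(), clock=lambda: "2012-02-12")
    print(service.daily_report())  # ['row-1', 'row-2']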

Every framework and approach should be judged on its true merits: the actual cost/benefit of applying that particular technology. Will it save us time? Will it simplify — not complicate — our code? Will it make our code more flexible and adaptable? Will it let us serve our users and customers better?

I regularly go back to old classics like The Mythical Man-Month to remember that nothing we do in software is truly new. I highly recommend you read it, along with its most famous essay, “No Silver Bullet”.

tl;dr: stay healthily skeptical, and don’t drink the kool-aid.