One Big Fluke

mirrored from the original


from One Big Fluke

Alembic worked well for my MySQL migration. How does it compare to Percona Toolkit? Which is better?

I loathe schemas


I thoroughly enjoyed Gary Bernhardt’s talk from PyCon entitled The Birth & Death of JavaScript.

Maintaining open source projects is hard

I gave dpxdt, my perceptual diff tool, some much needed love today. Feels good to be building momentum again. After my final commit of the night I searched around for “perceptual diffs” as I do. I came across a similar tool called Diffux that was released by Causes back in February. Somehow I totally missed it! In their announcement post they wrote this:

Before deciding to build Diffux, we scanned the open source market for some alternatives. Dpxdt looked promising, so we gave it a spin. It got the job done, but the project looked abandoned (6 month old PRs hadn’t received any attention, last commit was in August 2013) and we couldn’t get the test suite to run locally. Plus, Dpxdt is written in Python, and we are no Python experts. So there was a bit of a hurdle in debugging and adding functionality.

This is the kind of thing that bums me out. I wish they had sent me an email or something.

For a year I wrote a lot of code. I added features and fixed bugs. I merged contributions from others. I made it easy to deploy to production. And it debuted at Velocity. But what Causes wrote is true. I didn’t enhance the project for 6 months after last summer. What can I say? I’ve been busy. I’d like to blame GitHub for never sending me notification emails. But that’s lame.

The truth is I am completely responsible for not making forward progress. I can’t be mad. I just wish I had done a better job of maintaining the project.

This is one of the frustrating parts of open source. It’s hard to team up with others across perceived boundaries. Yet another example I saw recently is Chef vs. SaltStack. What’s the difference? They both do automation. Sure, they have different customers and different architectures. But the obvious difference is Chef is for Ruby people and SaltStack is for Python people. That’s all there is to it sometimes.

Anyways, I’m happy to see more perceptual diff tools out there! I look forward to when we all take it for granted.


Adam Langley explains why you shouldn’t enable revocation checking in Chrome.

When you get the Travis CI build to pass and they still don’t merge your pull request.

Don’t reply to email

This week I helped a friend of a friend understand the reality of managing a team for the first time. I mentioned a few things about productivity offhand they found useful. Reproduced here is how I handle the onslaught of incoming email:

1. All communication must be on mailing lists to create a body of searchable knowledge and overcome the bus factor*.

2. Never reply to an email if anyone else on the thread also knows the answer.

3. Always reply when you have information that nobody else does.

4. If something is important they will email you repeatedly, IM, call, show up in person, etc.

5. Worst-case: Wait a day (or week, or month) and finally reply to an email yourself.

If I didn’t do these things I would never find time to design, review, and write code.

* Direct emails for sensitive things are fine, but that’s the only exception.

Wonderful description of what to expect from good product managers.


Data fusion has no error bounds

Over the years I’ve seen attempts to solve what is called the “data fusion” problem. What is that? You have one useful dataset. You have another useful dataset. The goal is to somehow merge them together to create one larger, unified, and more powerful dataset. Sounds awesome! The problem is the two datasets are disjoint and thus have no overlapping sources. There is no simple key with which to join them together.

Companies have been built and busted trying to accomplish this, often in the advertising space. The same idea applies to likely-voter modeling and more.

So can it be done? Let me show you with a simple example.

Imagine you have 3 variables and you want to measure their correlation: X, Y, and Z. For this example let’s say X is owning a car, Y is playing golf, and Z is traveling every week for work. Our hypothesis is these have a high correlation.

Say you send 3 separate surveys, A, B, and C, to different groups of random people to measure these variables.

Survey A is like this (X and Y):

Q1. Do you drive a car? [Yes | No]
Q2. Do you play golf? [Yes | No]

Survey B is like this (Y and Z):

Q1. Do you play golf? [Yes | No]
Q2. Do you travel weekly for work? [Yes | No]

Survey C is like this (X and Z):

Q1. Do you drive a car? [Yes | No]
Q2. Do you travel weekly for work? [Yes | No]

Survey A gives you the correlation of Car and Golf, variables X and Y. Survey B gives you the correlation of Golf and Travel, variables Y and Z. Survey C gives you the correlation of Car and Travel, variables X and Z. That leads to this question:

With datasets for the correlation of XY, the correlation of YZ, and the correlation of XZ, can you calculate the correlation of XYZ? This is exactly the data fusion problem. The answer is:

No, you can’t. Here’s why:

You haven’t measured XYZ. How do you calculate it? How can you put boundaries on its size? There are actually 8 set memberships you’re trying to determine, one for each Yes/No combination of X, Y, and Z (the regions of a three-circle Venn diagram).

You know none of these.

You could assume a uniform distribution of Z in the set XY. Assuming Z (Traveling) is split Yes/No as 40/60 in the general population (the red circle), then also assume it’s split 40/60 in the Car & Golf population set (the green section, XY). That sounds reasonable, but there is no way to actually calculate an error boundary on that assumption. You have no idea what the interior of XYZ looks like. It could be a “rogue wave” of correlation, where the distribution of Z (Traveling) in the set XY (Car/Golf) is perfect and the correlation of XYZ is 100%. It could just as easily be the opposite, where the correlation of XYZ is 0%. You have no way of knowing. All of the data measurements you have collected cannot reveal any pieces of the XYZ interior.
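Here is a small sketch of that point in Python. The three joint distributions below are made up for illustration (they are not survey data), and the names `independent`, `odd_parity`, and `even_parity` are my own. Each one assigns a probability to every Yes/No combination of X, Y, and Z. All three produce exactly the same answers for Survey A, Survey B, and Survey C, yet the share of people who answer Yes to all three questions is different in each.

```python
from itertools import product

# Every (X, Y, Z) combination, with 1 = Yes and 0 = No.
outcomes = list(product([0, 1], repeat=3))

# Three hypothetical populations (illustrative numbers only).
independent = {o: 1 / 8 for o in outcomes}
odd_parity = {o: (1 / 4 if sum(o) % 2 == 1 else 0.0) for o in outcomes}
even_parity = {o: (1 / 4 if sum(o) % 2 == 0 else 0.0) for o in outcomes}


def pairwise_table(joint, i, j):
    """Collapse a three-way joint into the 2x2 table a two-question survey sees."""
    table = {}
    for outcome, p in joint.items():
        key = (outcome[i], outcome[j])
        table[key] = table.get(key, 0.0) + p
    return table


def all_pairwise(joint):
    """Tables for Survey A (X, Y), Survey B (Y, Z), and Survey C (X, Z)."""
    return [pairwise_table(joint, i, j) for i, j in [(0, 1), (1, 2), (0, 2)]]


for name, joint in [
    ("independent", independent),
    ("odd_parity", odd_parity),
    ("even_parity", even_parity),
]:
    print(name)
    print("  survey tables:", all_pairwise(joint))
    print("  share answering Yes to X, Y, and Z:", joint[(1, 1, 1)])
```

Running it prints identical survey tables for all three populations, while the all-Yes share changes from one to the next. The pairwise surveys alone cannot tell these populations apart.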

Thus, you must assume the error boundary on XYZ is 100%. There’s no way to calculate otherwise. If you want to calculate XYZ, you must measure XYZ. No modeling or bias correction can compensate for this. There are two outcomes in data fusion: you measure so you can calculate the error bars, or you make a wild guess.