Fault Tolerance

It’s been a long week, already.  I need a little break, and this piece from 3 years ago is still very relevant.  

You pay the insurance bills every month.  Car, home, life – they’re all about the same: a bet against yourself that you actually hope is money wasted.  But when things go wrong, like a drunk driver smacking into you one sunny day, it’ll be there.  If you listen to the commercials, what you get for your money is peace of mind.  It should help you sleep at night without anxiety.

Insurance is just the ultimate form of taking care of things once they have gone bad.  Building fault tolerance into a system so that it never gets that far is a far more complicated and thoughtful process.  Anyone who designs a system of some kind – a physical thing or a process that involves checks and balances – is probably going to be proud enough of their achievement to not want to think about what happens when things go horribly wrong and the whole thing breaks.  But that’s exactly what they need to think about for it to be truly robust.  It’s also something that a culture or society ultimately has to think about, painful as it may be.

Investabots Amok!

You are walking down the street, texting a friend, when suddenly everything freezes.  These things happen all the time, you reason, so as annoying as it is you reboot and carry on.  A short time later your friend calls, desperate, asking you to please stop bombarding them!  What went wrong?  You have no idea.  You reboot again and keep walking.

Things like this happen to everyone these days and we’re all used to it.  Software glitches.  Bugs.  Crappy software runs amok in the hands of appliance users.

Now imagine that you are a Wall Street trading firm that handles orders for thousands of clients and this happens to you.  Except that this costs $440M in bum trades by the time anyone catches it.  That’s exactly what happened to Knight Capital, the company that used to handle 11% of all trading on Wall Street.  It’s something that was inevitable in a system that is too big to be useful – and the world is starting to realize how dangerous this is.
