Failure recovery

I've been categorizing distributed system designs into four groups, according to how they recover from the loss of a single critical element (e.g. a piece of server hardware). Recently I realized that there's a fifth category, perhaps more popular than the other four.

Fault Tolerance

Element deaths and slow responses are expected and tolerated 100% of the time, with no noticeable degradation of service when a single failure happens. Examples: ADIRS, RPC hedging.
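To make the RPC hedging example concrete, here is a minimal sketch in Python's asyncio. It assumes each replica is a hypothetical async callable, and the names `hedged_call` and `hedge_delay` are illustrative, not taken from any particular RPC library. The request goes to one replica first; if no reply arrives within the hedge delay (or the replica fails outright), the same request is also sent to the next replica, and the first successful reply wins.

```python
import asyncio


async def hedged_call(replicas, request, hedge_delay=0.05):
    """Return the first successful reply, hedging to the next replica on delay or error.

    `replicas` is a list of hypothetical async callables, one per element.
    """
    remaining = list(replicas)
    pending = set()
    errors = []
    try:
        while remaining or pending:
            if remaining:
                # Start (or hedge to) the next replica.
                pending.add(asyncio.ensure_future(remaining.pop(0)(request)))
            # Only apply the hedge timer while there is another replica to try.
            timeout = hedge_delay if remaining else None
            done, pending = await asyncio.wait(
                pending, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
                errors.append(task.exception())
        raise RuntimeError(f"all replicas failed: {errors!r}")
    finally:
        # Abandon whichever requests are still outstanding.
        for task in pending:
            task.cancel()
```

Because the caller only ever sees the first good answer, a dead or slow element costs at most one hedge delay. The price is extra load from duplicate requests, and the technique is only safe as written for requests that can be executed more than once.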

High Availability

Loss of an element provokes the automatic withdrawal of the dead element from service. Clients which were talking to the now-dead element automatically recover (either by fast keepalive timeout or asynchronous notification), and they replay the lost requests to other in-service elements. Non-idempotent requests are handled correctly, though it takes extra time to ensure that they are not committed twice. The loss of a single service element cannot, on its own, cause the loss of any request. However, significant performance degradation can attend the recovery. Examples: DNS, Alertmanager.
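One common way to replay a non-idempotent request without committing it twice is to tag it with a request ID and have the shared store remember which IDs it has already committed. The sketch below assumes that approach; the classes `Store` and `Element` and the function `deposit_with_replay` are hypothetical, and the systems named above may do it quite differently.

```python
import uuid


class Store:
    """Shared durable state: a balance plus a record of committed request IDs."""

    def __init__(self):
        self.balance = 0
        self.committed = {}  # request_id -> result returned at commit time

    def deposit(self, request_id, amount):
        # A replayed request returns the recorded result instead of committing again.
        if request_id in self.committed:
            return self.committed[request_id]
        self.balance += amount
        self.committed[request_id] = self.balance
        return self.balance


class Element:
    """One service element in front of the shared store."""

    def __init__(self, store, crash_after_commit=False):
        self.store = store
        self.crash_after_commit = crash_after_commit

    def deposit(self, request_id, amount):
        result = self.store.deposit(request_id, amount)
        if self.crash_after_commit:
            # Simulate the worst case: committed, but died before replying.
            raise ConnectionError("element died before replying")
        return result


def deposit_with_replay(elements, amount):
    """Try each in-service element in turn, reusing one request ID for every attempt."""
    request_id = str(uuid.uuid4())
    for element in elements:
        try:
            return element.deposit(request_id, amount)
        except ConnectionError:
            continue  # withdraw the dead element and replay elsewhere
    raise RuntimeError("no in-service element accepted the request")
```

With `elements = [Element(store, crash_after_commit=True), Element(store)]`, a deposit of 10 is committed by the first element, replayed to the second, and still leaves the balance at 10. The extra lookup on every request is the "extra time" that correctness costs.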


Loss of an element automatically triggers the withdrawal of the dead element from service, including the promotion of hot standby elements to serving where necessary to restore service. In-flight requests are lost, and some clients may experience full timeouts and errors. Examples: MariaDB/PostgreSQL "high availability", NGINX Plus "high availability".

Disaster Recovery

Loss of an element leads to an urgent automated alert, but no recovery of service happens until a human approves it. The service is partially unavailable until the recovery happens. Examples: NFS, DRBD.

Dunning-Kruger Mode

Loss of an element leads to an urgent automated alert, but no recovery of service happens until a human figures out how to rebuild the system from scratch. The service is partially unavailable for the next couple of weeks, as service users gradually ask what happened to functionality they had come to rely on. Examples: your email server, your source code repository, your SSO server...


Vernor v. Autodesk (or This Car Is Licensed Not Sold)

Vernor v. Autodesk is an interesting case, but it probably won't be as catastrophic as the EFF makes it seem, unless the US Supreme Court somehow makes it worse.

For many years, software publishers have been trying to impose licence terms on unwilling users. A relatively recent wheeze is to include a term stating that the user gives up ownership of the software copy entirely and becomes a mere licensee. The publisher includes this term in the hope of annulling the user's rights as a lawful owner of the copy. Those rights include the right to run the software (the essential step defence) and the right to sell the software (the first sale doctrine).

But how can publishers force users to agree to licence terms that take away their rights and give nothing in return? In the US, the judgement in ProCD v. Zeidenberg created a legal rule which says that you automatically agree to the shrink-wrap licence terms whenever you don't attempt to return the copy for a refund. Since that piece of dodgy reasoning is still the law in the US, I recommend attempting to return every piece of software that you buy in the US (after opening it to ensure that the attempt will fail, of course). This is a good time to point out that I am not a lawyer.

The case of Vernor v. Autodesk hinges on the question of whether Autodesk's customer (CTA) was a copy owner or a mere licensee. However, it's completely clear to me that CTA agreed to a licensing term that attempted to deprive CTA of any ownership of the software copies. The US Court of Appeals said that CTA positively agreed to the licence (as part of a settlement with Autodesk), and later agreed to destroy the software copy (as part of a discounted upgrade deal), and that these facts are not in dispute.

So, it seems to me that the Vernor case is distinguishable from the usual shrink-wrap scenario, where you buy the software and then install and run it using your authority as property owner, without agreeing to any licence terms.

But I could be wrong. Somewhat disturbingly, the judgement of the Court of Appeals concerns itself mainly with the criteria for deciding whether a licence successfully denies ownership to the purchaser. This makes it arguable that the real precedent set by the case is validating the publisher's trick of using magic words to revoke the rights of users, if they can only convince those users to actually agree to the licence terms.

And so to the car industry. Car manufacturers would love to kill off the second-hand market, for the simple reason that the supply of second-hand cars drives down the price of new cars. If they simply switch to a model where car buyers must enter a restrictive contract before taking possession, then the manufacturers can retain ownership of their cars and rely on the Vernor judgement to prohibit second-hand sales. If it won't work for the cars themselves, then it will work for the software embedded in the cars.

I sincerely hope the car companies try this. It would outrage enough people that Congress might actually legislate to fix the problem.


Loophole Watch: Remove Battery, Defeat Clampers

Giant disclaimer: I Am Not A Lawyer.

If you park illegally in Ireland, you're likely to be clamped by people who are authorised persons under section 101B of the Road Traffic Act, 1961.

That section says (emphasis is mine):

  1. In this section [...] ‘vehicle’ means a mechanically propelled vehicle.
  2. Where an authorised person finds on a public road a vehicle that is parked in contravention of [parking by-laws], he or a person acting under his direction may [...] fix an immobilisation device to the vehicle while it remains in the place where he finds it, or [move it and then clamp it].

However, the phrase mechanically propelled vehicle (which normally includes your car) has a special exception given in section 3(2) (inserted by s. 72 of the Road Traffic Act, 2010):

Where a vehicle, which, apart from this subsection, would be a mechanically propelled vehicle, stands so substantially disabled (either through collision, breakdown or the removal of the engine or other such vital part) as to be no longer capable of being propelled mechanically, it shall be regarded—

  1. for the purposes of the Road Traffic Acts 1961 to 2010, if it is disabled through collision, as continuing to be a mechanically propelled vehicle, and
  2. for all other purposes of this Act as not being a mechanically propelled vehicle.

When you put this together, it seems to mean that if you take the battery out of your car, it no longer counts as a mechanically propelled vehicle for the purposes of the Road Traffic Act, 1961, and so it can't be legally clamped.

But I'm not going to try it with my car.


What I Did On World IPv6 Day

Mostly, I cursed Vodafone (my mobile Internet provider).

  1. They blackholed 6to4 traffic, so the default strategy used by Microsoft Windows Vista reliably timed out.
  2. They suffered 100% packet loss on IPv4 packets through their network. Actually, they did appear to work on this. Traffic to and came back first, then traffic to www.google.com. At 1600Z (two-thirds of the way through World IPv6 Day) IPv4 service was restored.
  3. They use 192.168/16 addresses for their network routers, which should have been a big clue about why IPv6 deployment should be a priority.
  4. They drop ICMP, making ping and traceroute useless for customers.
  5. They failed to communicate any of this to their customers. Their user forum is the closest thing they have to a dialogue with their customers, and there's nothing that says "we know about this, don't call".
  6. They don't answer the phone when you call. I think they might be busy dealing with other unhappy users.

That's 6 ways to fail at IPv6. Thanks, Vodafone. If you can't be a good example, you'll just have to serve as a horrible warning (as Catherine Aird said).
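For readers who haven't met 6to4: it builds an IPv6 prefix by embedding a public IPv4 address under 2002::/16 (RFC 3056), which is why it has nothing to work with when only private addresses such as 192.168/16 are available. The sketch below shows only that address mapping, using Python's standard `ipaddress` module. The function name `sixtofour_prefix` and the example address 8.8.8.8 are illustrative choices, and it says nothing about what Vodafone's network actually did to the packets.

```python
import ipaddress


def sixtofour_prefix(public_ipv4):
    """Return the 2002:V4ADDR::/48 prefix that 6to4 derives from an IPv4 address."""
    v4 = ipaddress.IPv4Address(public_ipv4)
    if v4.is_private:
        raise ValueError(f"{v4} is private; 6to4 needs a globally routable address")
    # 16 bits of 0x2002, then the 32-bit IPv4 address, then zeros.
    return ipaddress.IPv6Network(((0x2002 << 112) | (int(v4) << 80), 48))


print(sixtofour_prefix("8.8.8.8"))                            # 2002:808:808::/48
print(ipaddress.IPv6Address("2002:808:808::1").sixtofour)     # 8.8.8.8
print(ipaddress.ip_address("192.168.1.1").is_private)         # True
```

The `sixtofour` property does the reverse mapping, recovering the embedded IPv4 address from any address inside 2002::/16.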


Notes on Lessons from Chile

I attended this morning's Grattan Lecture on the Chilean Fiscal Framework, delivered by Dr. Andrés Velasco. The lecture started at 8:30 in the morning and they weren't kidding, so I missed the first half hour.

Chile has privatised pensions, but there is a public safety net to supplement very low pension payments. This means there is no "pay as you go" dynamic [usually referred to here as the "pensions time bomb"].

In what Chile calls the Structural Balance Approach, a fiscal council constructs a long-term model for the economy that can be used to predict future GDP growth trends. The council examines trends in basic economic drivers (such as the price of copper) to come up with that model. They then apply cyclical adjustment methodology close to the OECD procedure.

The council needs to be independent of the political government.

The council is divided into groups: one group produces estimates of copper futures, another group does GDP growth. The output of each group goes into "the blender" to produce estimates of state revenues.

X% of GDP is subtracted as a safety buffer. X was 1 initially, but then government debt hit zero and the government was still accumulating assets, so X was revised to 0.
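Read literally, the rule as noted here amounts to a spending ceiling: take the blended estimate of structural revenue and subtract X% of GDP. The sketch below is only that reading, with made-up numbers; the function name `spending_ceiling` and the figures are assumptions, and the real methodology is certainly more involved.

```python
def spending_ceiling(structural_revenue, gdp, buffer_pct):
    """Permitted spending: estimated structural revenue minus a buffer of X% of GDP.

    All amounts are in the same currency units; buffer_pct is X.
    """
    return structural_revenue - gdp * buffer_pct / 100.0


# Hypothetical economy: GDP of 100, structural revenue estimated at 25.
print(spending_ceiling(25.0, 100.0, buffer_pct=1))  # 24.0 (the initial X = 1)
print(spending_ceiling(25.0, 100.0, buffer_pct=0))  # 25.0 (after X was revised to 0)
```

The point of using structural rather than actual revenue is that the ceiling doesn't move when copper has a good or bad year, so windfalls show up as surpluses instead of spending.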

In 2001, the government adopted this arrangement as policy without any legal obligation. In 2006, the Fiscal Responsibility Law gave a statutory basis to it, but it didn't nail down the predictive methodology or the economic targets used.

Initially, copper prices were low and so the council recommended deficits, which were warmly welcomed by the politicians. But soon copper prices rose and the fiscal council started demanding surpluses (i.e. spending cuts), which wasn't so popular at all.

The Fiscal Responsibility Law said that surplus funds must be divided as follows:

  • Between 0.2% and 0.5% of GDP goes into pension reserves;
  • 0.5% of GDP goes into recapitalising the Central Bank; and
  • The rest goes into the Stabilization Fund (explained later).
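The split above can be sketched as a small function working in percentages of GDP. Two things here are assumptions rather than anything stated in the lecture: how the pension share is chosen within its 0.2% to 0.5% band (left as a parameter), and what happens when the surplus is too small to cover everything (the sketch simply fills the buckets in the order listed). The name `allocate_surplus` is illustrative.

```python
def allocate_surplus(surplus_pct, pension_pct=0.5):
    """Split a fiscal surplus (in % of GDP) into the three destinations listed above."""
    # Keep the pension contribution inside the 0.2%-0.5% band.
    pension_pct = min(max(pension_pct, 0.2), 0.5)
    pension = min(surplus_pct, pension_pct)
    # Up to 0.5% of GDP recapitalises the Central Bank.
    central_bank = min(surplus_pct - pension, 0.5)
    # Whatever is left goes to the stabilisation fund.
    fund = surplus_pct - pension - central_bank
    return {"pension": pension, "central_bank": central_bank, "fund": fund}


print(allocate_surplus(8.0))
# {'pension': 0.5, 'central_bank': 0.5, 'fund': 7.0}
```

With a surplus of 8% of GDP, almost all of it ends up in the fund, which is how the fund grew so quickly.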

It's important to prepare the politicians and the public for the large surpluses that this scheme can produce. They reached 8% of GDP in Chile and they threatened to go higher. The Minister for Finance (Dr. Velasco) was "the most widely hated person in the country". Effigies of him were frequently burned. He often appeared on morning TV, taking 30-second slots between the aerobics and the cookery, to explain it. He learned to explain it like this: "We're doing what you do at home: saving money aside for a rainy day." The operation of the Fiscal Responsibility Law was "very controversial stuff".

The 2009 budget followed the crash, and it turned an 8% surplus into a 4% deficit. However, the 2009 budget was successful at turning the economy around. The 2010 budget was balanced.

Chile's net public debt was 40% of GDP in 1991. By 2006 it had been reduced to zero.

The government must be willing to live with the political pressure to increase spending during the boom years. The Stabilization Fund reached its maximum size (US$20 billion, or 11% of GDP) in January 2009. Dealing with the crisis involved drawing US$8 billion from the fund.
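A quick back-of-the-envelope check puts those figures on a common scale. It uses only the numbers quoted above (US$20 billion, 11% of GDP, US$8 billion drawn); the GDP figure is merely what those two numbers imply, not an official statistic, and the function name `fund_drawdown` is illustrative.

```python
def fund_drawdown(fund_usd_bn=20.0, fund_share_of_gdp=0.11, drawn_usd_bn=8.0):
    """Express the crisis drawdown as a share of the fund and of implied GDP."""
    implied_gdp_usd_bn = fund_usd_bn / fund_share_of_gdp
    return {
        "implied_gdp_usd_bn": round(implied_gdp_usd_bn, 1),
        "drawn_pct_of_gdp": round(100 * drawn_usd_bn / implied_gdp_usd_bn, 1),
        "drawn_pct_of_fund": round(100 * drawn_usd_bn / fund_usd_bn, 1),
    }


print(fund_drawdown())
# {'implied_gdp_usd_bn': 181.8, 'drawn_pct_of_gdp': 4.4, 'drawn_pct_of_fund': 40.0}
```

So the crisis response consumed about 40% of the fund, or roughly 4.4% of GDP, leaving well over half of the reserves untouched.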

Chile's output stability (roughly the standard deviation of the economic output, reckoned over a reference period of about a decade) fell dramatically over the years. Chile's ability to cushion changes in the "real exchange rate" [something to do with trade imbalances, I think] was the subject of a graph. The real exchange rate was a damped oscillation, converging on its long-term average.

The January 2009 stimulus package amounted to 2.8% of GDP and it consisted of infrastructure investment, extra support for poorer households, and temporary tax cuts.

An odd scatterplot rated several countries on two axes: the size of their interest rate adjustments versus the size (in US$) of their fiscal stimulus packages. Most countries were bunched together in the low-size area, and a few countries with small fiscal adjustments had high interest rate adjustments, but Chile was the only country on the graph with both a high fiscal response and a high monetary response. [I suppose this indicates that Chile's Stabilization Fund gave it the freedom to deal with the crisis by virtue of having more than enough resources on standby.]

To make the rules optimal, there are four main questions:

  1. What to correct for?

    Chile's two criteria were GDP growth and copper. Dr. Velasco would have liked to include the real exchange rate and the stock of government assets, but they were excluded in order to keep the criteria simple to understand. "You want the rule to be something a taxi driver can understand." Also excluded were: expenditure-led activity, sectoral booms, and movements in asset prices. Essentially, you should distinguish permanent from temporary income. For Ireland, you should also exclude revenue driven by the cycle, such as VAT returns.

  2. Cyclical adjustments should be getting you close to Milton Friedman's Permanent Income Hypothesis (PIH).

    There were big fights over how the adjustments' effects should be accounted for in the fiscal rules. The adjustments must be temporary. Chile's law dictated that the stimulus tax cuts must be temporary. Otherwise there would have been huge pressure to keep the tax low after the crisis was over.

  3. Degree of counter-cyclicality

    Engel, Neilson and Valdés (2010) studied this. You need a "switching regime" to decide when to switch from the counter-boom strategy to the counter-bust strategy and vice versa. The challenges here are simplicity (the taxi-driver standard) and legitimacy (meaning free from political interference).

  4. Ex ante versus ex post conflict

    Fiscal targets (ex ante) are never going to exactly match the actual outcomes (ex post). There are too many significant variables to be able to predict things exactly. It's necessary to fudge the predictions just like central banks do with inflation figures. That is, specify a range of values for the target, and a range of time in which the target can be met. Alternatively, you can let a (non-political) fiscal council decide to activate an escape clause in order to meet unexpected external crises.

Finally, he offered two caveats about the whole approach. Legislating it is not enough; it must be seen as politically legitimate. There are a lot of variables in how to do it, and we could really use experience from trying the approach in more countries.

In answer to questions from the floor, he pointed out that he was appointed to the Minister for Finance position from outside the electoral system (it's a presidential system), so he didn't have to face angry voters on election day. He also said that the biggest fight was in September 2006 when the proposed budget contained a surplus of 5%.

So there we go. This was a very interesting lecture on its own merits, but I was struck by one thing. Here was a politician from a far-away, non-English-speaking country who didn't have to collect popular votes to be elected; and he was far more eloquent, more relaxed and more organised in his address than any of the 165 recently-elected TDs.

3D Secure coming to Bank of Ireland credit cards

My credit card statement warned me today that

3D Secure is launching at the end of March. This free, automatic online security service will make spending online safer than ever!

The promotional insert contained more reassuring messages:

  • as secure as possible
  • verify your identity by answering four questions [name, CVV2, date of birth and mother's maiden name]
  • we will also display your personal greeting giving you added comfort that it is Bank of Ireland who are asking you to enter your 3D Secure Password

Of course, from Murdoch and Anderson's paper we know that 3D Secure is worse than ineffective. So I have questions for Bank of Ireland:

  1. Do the terms and conditions move the burden of losses by fraud onto the cardholder?
  2. Is the Access Control Server outsourced? If so, to whom? What are their practical incentives to maintain high security standards?
  3. What is the official policy on selecting a CA for the ACS SSL certificate? If there isn't one, how can cardholders protect themselves against compelled certificate creation attacks?
  4. What will happen if a fraudster with my card details uses the forgot password procedure in an attempt to negate the benefit of 3D Secure? Will I still be stuck with the cost of the fraud?
  5. Can I be authenticated by something better than a password, for example a DDA card reader?
  6. Can I get an automatic notification every time there is an authentication attempt on my card number?

I couldn't find any information about this on www.bankofireland.com, so I'll phone them tomorrow and post the result. I'm sure it will be comforting.


IPv4 breathes its last

The good news: every member of RIPE can get a /22. The bad news: you'll never get any more IPv4 allocations, ever.
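To put a number on how small that is, a /22 can be inspected with Python's standard `ipaddress` module. The 10.0.0.0/22 block below is just a stand-in for illustration, not an actual RIPE allocation.

```python
import ipaddress

# A stand-in /22; any correctly aligned /22 gives the same counts.
final_allocation = ipaddress.ip_network("10.0.0.0/22")

print(final_allocation.num_addresses)                        # 1024
print(len(list(final_allocation.subnets(new_prefix=24))))    # 4
```

That's 1,024 addresses, or four /24s, to last each member for the rest of time.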
