In any infrastructure beyond the tiny/small stage, there are always pieces that ...

mattbee · on April 6, 2014

As an infrastructure provider, I disagree. Customers expect to find the answer to the question "did I do something wrong, or are you having problems?" on the status page.

While we've listed every outage affecting >1 customer since 2004 on our status page, the issue is always expressing each outage in a way that allows a customer to identify that _their_ server is affected by a particular entry.

That sometimes involves knowing that their server is in a particular data centre, or that it is connected to a particular switch, etc. We do our best to make sure that people can identify their problem, if by nothing else than timing, i.e. making sure we list something ASAP.

But status pages are still very useful once people start calling in - support can positively identify that yes, you are affected by this problem and you can track progress at that URL. If we keep updating as we promise, bang, that's one support call (at most) per affected customer.

So even a minor problem - a broken switch, a failed VM host machine, a power failure in a rack - is worth listing from my point of view.

As I said above I'm looking to tie our databases and some basic network monitoring into it this year. That way we can proactively notify people affected by a particular problem, as well as continuing to list even small problems publicly.
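A minimal sketch of what tying an inventory database to outage notices could look like. Everything here is an assumption made for illustration: the `Server` fields, the `INVENTORY` contents and the `affected_customers` helper are hypothetical, not a description of any provider's actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """One customer server and the shared components it depends on (hypothetical schema)."""
    name: str
    customer: str
    data_centre: str
    switch: str
    rack: str


def affected_customers(servers, component_kind, component_id):
    """Return {customer: [server names]} for servers attached to the failed component.

    component_kind is one of the Server field names: "data_centre", "switch" or "rack".
    """
    affected = {}
    for server in servers:
        if getattr(server, component_kind) == component_id:
            affected.setdefault(server.customer, []).append(server.name)
    return affected


# Hypothetical inventory rows, standing in for a real database query.
INVENTORY = [
    Server("web1", "alice", "dc-east", "sw-12", "rack-3"),
    Server("db1", "alice", "dc-east", "sw-14", "rack-5"),
    Server("app1", "bob", "dc-east", "sw-12", "rack-3"),
    Server("mail1", "carol", "dc-west", "sw-02", "rack-1"),
]

if __name__ == "__main__":
    # A broken switch: list who to notify and which of their servers are hit.
    for customer, names in affected_customers(INVENTORY, "switch", "sw-12").items():
        print(f"notify {customer}: {', '.join(names)} affected by switch sw-12")
```

The same lookup could drive both the public status entry and a targeted e-mail, so a customer never has to work out for themselves which switch or rack their server sits behind.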

nixgeek · on April 6, 2014

At the scale of DigitalOcean (~10k physical nodes), Amazon (>250k physical nodes) or Google this seems wholly unreasonable. There's a definite signal-to-noise issue, because there will be hundreds of failures per day in any large infrastructure, some of which will have customer impact (ranging from minor to major). It's a statistical reality.
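A rough back-of-envelope version of that argument. The node counts come from the comment above; the annual per-node failure rate is a purely hypothetical placeholder, since the thread gives no rates, so treat the numbers as an illustration of scaling rather than a measurement.

```python
def expected_daily_failures(nodes: int, annual_failure_rate: float) -> float:
    """Expected failures per day, assuming independent failures spread evenly over a year."""
    return nodes * annual_failure_rate / 365


# Hypothetical: on average 0.3 failure events of any kind per node per year.
HYPOTHETICAL_RATE = 0.30

if __name__ == "__main__":
    for name, nodes in [("DigitalOcean", 10_000), ("Amazon", 250_000)]:
        per_day = expected_daily_failures(nodes, HYPOTHETICAL_RATE)
        print(f"{name}: ~{per_day:.0f} failure events/day")
```

The expected count grows linearly with fleet size, which is why a policy of listing every failure publicly gets noisier as a provider scales.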

A lot of the other suggestions seem to centre around extending your status page to be more personal to the end user. This is (to some extent) the route Amazon AWS has taken in allowing you to see which instances are scheduled for retirement (because they lack live migration capability, a la GCE), etc.

I note that Amazon now sends out maintenance e-mails advising when certain IPsec connections will go down so that they can perform upgrades. This is also great, and others should copy it.

This leaves the 'globally visible' bit (i.e. http://status.aws.amazon.com/) for critical outages affecting a large proportion of your customers.

neom · on April 7, 2014

I was just discussing how we could provide this level of granularity without overwhelming the status page as we scale. What if we provided an api endpoint for you to check the health of the node?
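A minimal sketch of what such an endpoint might return. The URL shape `/v1/nodes/<id>/health`, the `NODE_HEALTH` table and the status values are all hypothetical, invented here for illustration; this is not DigitalOcean's API.

```python
import json

# Hypothetical health table, standing in for whatever monitoring feeds the endpoint.
NODE_HEALTH = {
    "node-0042": {"status": "degraded", "detail": "hypervisor unreachable"},
    "node-0043": {"status": "ok", "detail": ""},
}


def node_health_response(path, table=NODE_HEALTH):
    """Map a request path like /v1/nodes/<id>/health to (HTTP status code, JSON body)."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[:2] != ["v1", "nodes"] or parts[3] != "health":
        return 404, json.dumps({"error": "not found"})
    node_id = parts[2]
    health = table.get(node_id)
    if health is None:
        return 404, json.dumps({"error": "unknown node"})
    return 200, json.dumps({"node": node_id, **health})


if __name__ == "__main__":
    print(node_health_response("/v1/nodes/node-0042/health"))
```

A customer script could poll this for the node hosting their VM and only look at the public status page when the answer is something other than "ok".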

count · on April 7, 2014

Status needs to be hosted somewhere other than the infrastructure being reported on. If the issue is an API endpoint outage, having an API endpoint for status reporting is...counterproductive :)

neom · on April 7, 2014

If there was an API endpoint outage we would report it on our status page. The blog post is about the health of individual physical nodes on our network that bring down clusters of VMs. :)

pekk · on April 6, 2014

You also cost 5-10x what DigitalOcean costs.

Is it wrong to provide a lower-cost service?

Sovietaced · on April 6, 2014

You got it. Not sure if the blog post guy knows anything about infrastructure.

jonny_eh · on April 6, 2014

It's a miracle I can even use a computer!

wclax04 · on April 6, 2014

A personalized status page (per customer/droplet) would be nice.