In any infrastructure beyond the tiny/small stage, there are always pieces that are failing or have failed. In some cases these failures are observed by customers. In other cases they are not.
It would be unreasonable to have a company-wide status page that constantly lists "some customers are experiencing some problems". That's not the point of the status page - the status page, as the author suggested, is there to highlight issues that are affecting a significant section of the customer base.
The right thing for DigitalOcean to do in cases like this is to allow you, in your private dashboard, to see the problem and follow up on a master ticket for escalation and resolution.
As an infrastructure provider, I disagree. Customers expect to find the answer to the question "did I do something wrong, or are you having problems?" on the status page.
While we've listed every outage affecting >1 customer since 2004 on our status page, the issue is always expressing each outage in a way that allows a customer to identify that _their_ server is affected by a particular entry.
That sometimes involves knowing that their server is in a particular data centre, or that it is connected to a particular switch, and so on. We do our best to make sure that people can identify their problem, if by nothing else than timing, i.e. by making sure we list something ASAP.
But status pages are still very useful once people start calling in - support can positively identify that yes, you are affected by this problem and you can track progress at that URL. If we keep updating as we promise, bang, that's one support call (at most) per affected customer.
So even a minor problem (a broken switch, a failed VM host machine, a power failure in a rack) is worth listing from my point of view.
As I said above I'm looking to tie our databases and some basic network monitoring into it this year. That way we can proactively notify people affected by a particular problem, as well as continuing to list even small problems publicly.
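To make the "tie our databases into it" idea concrete, here is a rough sketch of matching an outage entry against an inventory of customer servers. Everything in it is an assumption for illustration: the `Server` and `Outage` records, their field names and the `affected_customers` helper are all hypothetical, not anything we actually run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # Hypothetical inventory record; field names are assumptions.
    customer: str
    name: str
    data_centre: str
    switch: str
    host: str


@dataclass(frozen=True)
class Outage:
    # Hypothetical status-page entry.
    summary: str
    scope: str   # which Server field the outage applies to: "data_centre", "switch" or "host"
    target: str  # the failed component, e.g. "sw-12"


def affected_customers(outage, servers):
    """Return {customer: [server names]} for servers behind the failed component."""
    hit = {}
    for server in servers:
        if getattr(server, outage.scope) == outage.target:
            hit.setdefault(server.customer, []).append(server.name)
    return hit


servers = [
    Server("alice", "web1", "lon1", "sw-12", "vmhost-3"),
    Server("alice", "db1", "lon1", "sw-14", "vmhost-9"),
    Server("bob", "app1", "lon1", "sw-12", "vmhost-4"),
]
outage = Outage("Switch sw-12 has failed", "switch", "sw-12")
to_notify = affected_customers(outage, servers)
```

The same lookup could serve both purposes: the keys of `to_notify` are the people to e-mail proactively, and support can use it to confirm whether a caller is covered by a given status entry.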
At the scale of DigitalOcean (~10k physical nodes), Amazon (>250k physical nodes) or Google, this seems wholly unreasonable. There's a definite signal-to-noise issue, because there will be hundreds of failures per day in any large infrastructure, some of which will have customer impact (ranging from minor to major). It's a statistical reality.
A lot of the other suggestions seem to centre around extending your status page to be more personal to the end user. This is (to some extent) the route AWS has taken in allowing you to see which instances are scheduled for retirement (because they lack live migration capability, a la GCE), etc.
I note that Amazon now sends out maintenance e-mails advising when certain IPsec connections will go down so that they can perform upgrades. This is also great, and others should copy it.
This leaves the 'globally visible' bit (i.e. http://status.aws.amazon.com/) for critical outages affecting a large proportion of your customers.
I was just discussing how we could provide this level of granularity without overwhelming the status page as we scale. What if we provided an API endpoint for you to check the health of the node?
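Roughly what I have in mind, as a sketch only: a handler that looks up one node in a monitoring snapshot and returns a small JSON document. The node IDs, the `state` and `ticket` fields and the `node_health` function are all made up for illustration, and nothing here is an existing API.

```python
import json

# Hypothetical monitoring snapshot, keyed by physical node ID.
NODE_HEALTH = {
    "node-a1": {"state": "ok", "ticket": None},
    "node-b7": {"state": "degraded", "ticket": "T-1001"},
}


def node_health(node_id, snapshot):
    """Return (HTTP status code, JSON body) describing one node's health."""
    entry = snapshot.get(node_id)
    if entry is None:
        # Unknown node: report an error rather than guessing a state.
        return 404, json.dumps({"node": node_id, "error": "unknown node"})
    return 200, json.dumps({"node": node_id, **entry})


status, body = node_health("node-b7", NODE_HEALTH)
```

A customer script could poll this for the node hosting their VM and only look at the public status page when the state is not `ok`.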
Status needs to be hosted not on the infrastructure being reported on. If the issue is an API endpoint outage, having an API endpoint for status reporting is...counterproductive :)
If there was an API endpoint outage, we would report it on our status page. The blog post is about the health of individual physical nodes on our network that bring down clusters of VMs. :)