I was at Etsy when we made statsd, and I'm currently working on fixing the same problems at Stripe. This is super interesting and relevant to me personally, thanks for writing it all down!
There are some details of how Etsy uses statsd that are not well-communicated. Etsy samples metrics aggressively to limit the amount of total traffic. And they monitor the packet error rates on the statsd boxes like hawks to keep the loss rate in check. Back when I was working there, if you added a high-volume counter without sampling it, the alarms would sound and you'd have an ops person tapping your shoulder pretty quickly. If you use statsd and skip either of these steps, the 40% loss that github experienced is what you get.
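For anyone unfamiliar with how statsd sampling keeps traffic down, here's a rough sketch (hypothetical function names, not Etsy's actual client code) of what client-side sampling looks like on the wire: the client drops most events locally and tags the rest with the sample rate so the server can scale counts back up.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampledIncr emits a statsd counter line with probability rate.
// The "|@rate" suffix tells the server to scale the count back up
// (e.g. an event sampled at 0.1 is counted as 10).
func sampledIncr(metric string, rate float64) (string, bool) {
	if rand.Float64() >= rate {
		return "", false // dropped client-side; no packet is sent
	}
	return fmt.Sprintf("%s:1|c|@%g", metric, rate), true
}

func main() {
	// At rate 0.1, roughly 9 out of 10 calls send nothing at all.
	if line, ok := sampledIncr("web.hits", 0.1); ok {
		fmt.Println(line) // "web.hits:1|c|@0.1"
	}
}
```

The trade-off is statistical: sampled counters are estimates, which is why a high-volume counter added *without* a sample rate floods the network with one packet per event.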
AIUI Etsy's moved to a consistent hashing scheme that's at least vaguely similar to this.
Node was not then, nor is it now, Etsy's area of expertise. We were going through an adolescent "let's just use every language" phase when we built statsd. I think the problems outlined here are solid supporting evidence that you should use a smallish set of tools and master them (a point of view which is very on-brand for Etsy engineering as it exists today).
Have you considered pushing the aggregation into the applications, rather than doing it across the network?
Having whatever library is currently sending data to statsd instead keep counters/gauges in memory, and then expose them on a regular basis, would greatly reduce the data volumes involved: the traffic becomes O(timeseries × frequency) rather than O(events).
This is the approach we take with Prometheus, and based on the statsd setups of some people who've come talking to us there's scope for a reduction in network load of at least an order of magnitude without having to do any downsampling.
Isn't the answer to the problem of network load to run statsd locally on each server? I thought that was how it was normally deployed. Then you can have the local statsd write to graphite/carbon directly, or to a second layer of statsd if you want to do additional aggregation.
That'd still have you going through the kernel on every event to handle the UDP packet; keeping it in userspace is more efficient (a syscall round trip costs on the order of a microsecond of CPU, versus nanoseconds for an in-process counter update).
Yeah this has come up and it's reasonable, although it's a tricky/laborious migration in practice given a wide variety of things emitting stats.
The statsd design choices here are mostly explained by the fact that Etsy uses it to collect from PHP. PHP doesn't afford a great way to aggregate in the client. (These are design choices that serve PHP well systemically, although it's limiting here.)
Not suggesting it's a good idea, but the PHP standard library gives you enough tools (shmop) to allow a straight port of something like https://github.com/schmichael/mmstats
This is awesome, and more choice is definitely good, but I'm curious if Github tried out statsite[1] at all in this process? My guess is that these two tools were built at the exact same time solving the same problems.
We did not try out statsite, but it also looks great! Brubeck has been under development internally since October 2012. We're really excited to be able to share it with everyone.
switched from statsd to statsite a few years ago and haven't looked back. highly recommend. same high quality code you'd expect from armon and the hashicorp crew (though statsite predates hashicorp, I believe).
This. We recently had to upgrade to a c4 instance in AWS because the single core on a c3 couldn't take what we were sending. We needed something like Brubeck because eventually the c4 will run out of headroom too.
To clarify, statsite does spawn threads to do some of the aggregation and flushing at the end of the collection interval, but yes it is possible to saturate the main loop at a very high ingest rate.
I am really happy to see another project, one that will probably reach more people than my web framework, honor the same wonderful musician I was trying to honor.
Presumably because they didn't exist (or weren't production-ready) three years ago:
"After three years running in our production servers (with very good results both in performance and reliability), today we're finally releasing Brubeck as open-source."
Are you sure? My middle name is Brubeck as well, and there is some overlap in features:
* Brubeck only runs on Linux. It won't even build on Mac OS X.
…
Brubeck has the following dependencies:
* A Turing-complete computing device running a modern version of the Linux kernel
…
There are several ways to interact with a running Brubeck daemon.
I know bitly have an implementation of statsd in golang [0]. I've not used it in production but have contributed to it. Go is fast to a point; it is not as fast as C, but it is easier to write fast Go than fast C.
I'm not sure what you are asking about specifically, but in general it's important to remember that channels are a pretty thin abstraction around a mutex (and a buffer). There is nothing about that abstraction that would make you think it would be better for this use case than spinlocks and a lock-free table, and quite a bit that would make you think it is much worse.
One of the first things people abandon when scaling golang servers is the channel abstraction and I expect the same would be true in this case.
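To make the comparison concrete, here's a toy sketch (not taken from any real server) of the same shared-counter work done both ways. Both serialize access to the map; the channel version additionally pays the channel's internal locking and buffer management on every send.

```go
package main

import (
	"fmt"
	"sync"
)

// mutexCount guards the map with a plain mutex: one lock per event.
func mutexCount(events []string) map[string]int {
	var mu sync.Mutex
	counts := make(map[string]int)
	var wg sync.WaitGroup
	for _, e := range events {
		wg.Add(1)
		go func(e string) {
			defer wg.Done()
			mu.Lock()
			counts[e]++
			mu.Unlock()
		}(e)
	}
	wg.Wait()
	return counts
}

// channelCount funnels every event through a channel to a single
// consumer goroutine: the same serialization, but each send also
// goes through the channel's own internal lock and buffer.
func channelCount(events []string) map[string]int {
	ch := make(chan string, 128)
	done := make(chan map[string]int)
	go func() {
		counts := make(map[string]int)
		for e := range ch {
			counts[e]++
		}
		done <- counts
	}()
	for _, e := range events {
		ch <- e
	}
	close(ch)
	return <-done
}

func main() {
	events := []string{"a", "b", "a"}
	fmt.Println(mutexCount(events)["a"], channelCount(events)["a"]) // 2 2
}
```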
I'm as keen to see memory safe systems languages become commonplace as anyone but some 90% of our stack is C - Linux, MySQL, Git, MRI, nginx, haproxy, memcached, redis, and many other internal components that we'll be talking about on the engineering blog soon. We like C. We're going to need a few more years of research, real production experience, and language/library maturity before betting critical infrastructure on something else by default.
> We like C. We're going to need a few more years of
> research, real production experience, and
> language/library maturity before betting critical
> infrastructure on something else by default.
That's a really bizarre opinion to hold in the year 2015. Don't get me wrong: I love C, too. But certainly not for home-grown infrastructure at a web company — the incentives just don't align. And hiding behind the "we just don't know" bugbear doesn't parse, either. Go, for example, powers gargantuan-scale infrastructure at Google, and has for half a decade. And there's whole fleets of organizations as big or bigger than GitHub that report the same experience.
Google also has a gargantuan-scale dev team that includes the people behind Go. It's ridiculous to compare. If Github does not yet themselves have sufficient people with sufficient experience supporting large services in Go, betting critical infrastructure on it would be irresponsible.
Note that he is not saying they're not open to alternatives, nor that they're not experimenting with alternatives, but that they need more experience first before "betting critical infrastructure on something else by default".
> Google also has a gargantuan-scale dev team that
> includes the people behind Go. It's ridiculous to
> compare.
It's not ridiculous. There are probably reasonable arguments against using Go for things like this, but "insufficient developer capacity" isn't one of them. Becoming a Go expert is a task measured in weeks.
What is the alternative? Rust isn't there yet when it comes to IO and there are certainly no other languages that qualify and allow a similar level of control.
What about Nim? It has facilities to run an evented loop and to listen on a UDP socket, can be tuned for more realtime capabilities, and compiles to C...
Nim doesn't have the same popularity. There aren't a lot of people who know the language, which makes hiring them difficult; it also means there may be nobody at GitHub who does, and due to its lack of popularity people will probably be less willing to learn it than Rust or Go.
Go is more high-level and doesn't offer a similar level of control. Go is an obvious candidate that should be evaluated, but it's not an obvious choice in the way that Rust is going to be and C already is.
Looks great! Question for the author(s) of Brubeck: are you planning to add an InfluxDB backend, or is that something you'd want the community to pick up?