Brubeck – A statsd-compatible metrics aggregator

mcfunley · on June 15, 2015

I was at Etsy when we made statsd, and I'm currently working on fixing the same problems at Stripe. This is super interesting and relevant to me personally, thanks for writing it all down!

There are some details of how Etsy uses statsd that are not well-communicated. Etsy samples metrics aggressively to limit the amount of total traffic. And they monitor the packet error rates on the statsd boxes like hawks to keep the loss rate in check. Back when I was working there, if you added a high-volume counter without sampling it, the alarms would sound and you'd have an ops person tapping your shoulder pretty quickly. If you use statsd and skip either of these steps, the 40% loss that github experienced is what you get.

AIUI Etsy's moved to a consistent hashing scheme that's at least vaguely similar to this.

Node was not then, nor is now Etsy's area of expertise. We were going through an adolescent "let's just use every language" phase when we built statsd. I think the problems outlined here are solid supporting evidence that you should use a smallish set of tools and master them (a point of view which is very on-brand for Etsy engineering as it exists today).

bbrazil · on June 15, 2015

Have you considered pushing the aggregation into the applications, rather than doing it across the network?

Having whatever library is sending data to statsd, instead keep counters/gauges in memory and then expose that on a regular basis would greatly reduce the data volumes involved as it's O(timeseries*frequency) rather than O(events).

This is the approach we take with Prometheus, and based on the statsd setups of some people who've come talking to us there's scope for a reduction in network load of at least an order of magnitude without having to do any downsampling.

sciurus · on June 16, 2015

Isn't the answer to the problem of network load to run statsd locally on each server? I thought that was how it was normally deployed. Then you can have the local statsd write to graphite/carbon directly, or to a second layer of statsd if you want to do additional aggregation.

bbrazil · on June 16, 2015

That'd still have you going through the kernel on every event to handle the UDP packet, keeping in userspace is more efficient (on the order of 10ns of CPU).

mcfunley · on June 15, 2015

Yeah this has come up and it's reasonable, although it's a tricky/laborious migration in practice given a wide variety of things emitting stats.

The statsd design choices here are mostly explained by the fact that Etsy uses it to collect from PHP. PHP doesn't afford a great way to aggregate in the client. (These are design choices that serve PHP well systemically, although it's limiting here.)

bbrazil · on June 15, 2015

I hadn't realised the PHP link, things make more sense now.

You're getting into IPC then, which is a fun topic alright e.g. https://github.com/prometheus/client_ruby/issues/9 and https://github.com/prometheus/client_python/issues/30

jrv · on June 15, 2015

This is how I scaled StatsD at SoundCloud (without having to change client code) before we switched to Prometheus:

http://stackoverflow.com/questions/12871642/scaling-statsd-w...

(the first answer)

_wmd · on June 16, 2015

Not suggesting it's a good idea, but the PHP standard library gives you enough tools (shmop) to allow a straight port of something like https://github.com/schmichael/mmstats

jameskilton · on June 15, 2015

This is awesome, and more choice is definitely good, but I'm curious if Github tried out statsite[1] at all in this process? My guess is that these two tools were built at the exact same time solving the same problems.

[1] https://github.com/armon/statsite

jssjr · on June 15, 2015

We did not try out statsite, but it also looks great! Brubeck has been under development internally since October 2012. We're really excited to be able to share it with everyone.

dpkp · on June 16, 2015

switched from statsd to statsite a few years ago and haven't looked back. highly recommend. same high quality code you'd expect from armon and the hashicorp crew (though statsite predates hashicorp, I believe).

lpgauth · on June 15, 2015

statsite is also based on a event-loop and is limited to a single core (no smp).

touchingvirus · on June 16, 2015

This. We recently had to upgrade to a c4 instance in AWS because the single core on a c3 couldn't take what we were sending. We needed something like Brubeck because eventually the c4 will run out of headroom too.

armon · on June 18, 2015

To clarify, statsite does spawn threads to do some of the aggregation and flushing at the end of the collection interval, but yes it is possible to saturate the main loop at a very high ingest rate.

jmsdnns · on June 15, 2015

I once built a web framework named Brubeck (https://github.com/j2labs/brubeck). I stopped developing it a while ago, though.

I am really happy to see another project, one that will probably reach more people than my web framework, honor the same wonderful musician I was trying to honor.

dberg · on June 15, 2015

Funny i think i remember going to a brubeck meetup here in NY once. My old co-woker Patrick mentioned it and we ended up seeing a talk on it.

jmsdnns · on June 16, 2015

That was me. BrubeckNYC. :)

zallarak · on June 15, 2015

I came here to mention this! I saw you present this at the Recurse Center a long time ago.

jmsdnns · on June 15, 2015

Yep! I built a significant amount of it while I was at Recurse Center.

Proud to say that it all happened in RC's inaugural batch too. :)

anton_gogolev · on June 15, 2015

Can't help but plug our MIT-licensed port of Graphite/StatsD to .NET: [1].

[1]: https://bitbucket.org/aeroclub-it/statsify

piotrp · on June 15, 2015

There already exist many statsd implementations from which two in C and four in golang. Why the need for another one?

https://github.com/etsy/statsd/wiki#server-implementations

simonw · on June 15, 2015

Presumably because they didn't exist (or weren't production-ready) three years ago:

"After three years running in our production servers (with very good results both in performance and reliability), today we're finally releasing Brubeck as open-source."

moogly · on June 15, 2015

Perchance named after Dave Brubeck?

_pvxk · on June 16, 2015

Are you sure? My middle name is Brubeck as well, and there is some overlap in features:

    * Brubeck only runs on Linux. It won't even build on Mac OS X.
    …
    Brubeck has the following dependencies:
    * A Turing-complete computing device running a modern version of the Linux kernel
    …
    The are several ways to interact with a running Brubeck daemon.

Though I guess that jazz guy is more famous.

jssjr · on June 15, 2015

Indeed!

robhanlon · on June 15, 2015

Complete with a subtle joke about odd meter!

agumonkey · on June 15, 2015

Waiting for a prng named Ornette.

rodionos · on June 15, 2015

>A Turing-complete computing device running a modern version of the Linux kernel... I liked that!

Any plans for pluggable backends in addition to BRUBECK_BACKEND_CARBON? Pls consider making it agnostic, i.e. using a wire protocol.

leetrout · on June 15, 2015

I don't really "know" golang but this seems like an area where it would have been useful.

Am I think about this the right way? Would channels have helped handle some of the load effectively?

Any gophers care to comment? :D

Zariel · on June 15, 2015

I know bitly have an implementation of statsd in golang [0]. Ive not used it in production but have contributed to it. Go is fast to a point, it is not as fast as C but it is easier to write fast Go than fast C.

[0] https://github.com/bitly/statsdaemon

kasey_junk · on June 16, 2015

I'm not sure what you are asking about specifically but in general it's important to remember that channels are a pretty thin abstraction around a mutex (and a buffer). There is nothing about that abstraction that would make you think it would be better for this use case than spinlocks and a lock free table, and quite a bit that would make you think it is much worse.

One of the first things people abandon when scaling golang servers is the channel abstraction and I expect the same would be true in this case.

sagichmal · on June 15, 2015

Choosing C for this project is definitely... strange. Keith Rarick calls it "downright irresponsible" [1] — I'm not sure I disagree.

[1] https://twitter.com/krarick/status/610502413007503360

rtomayko · on June 16, 2015

I'm as keen to see memory safe systems languages become commonplace as anyone but some 90% of our stack is C - Linux, MySQL, Git, MRI, nginx, haproxy, memcached, redis, and many other internal components that we'll be talking about on the engineering blog soon. We like C. We're going to need a few more years of research, real production experience, and language/library maturity before betting critical infrastructure on something else by default.

sagichmal · on June 16, 2015

    > We like C. We're going to need a few more years of 
    > research, real production experience, and   
    > language/library maturity before betting critical 
    > infrastructure on something else by default.

That's a really bizarre opinion to hold in the year 2015. Don't get me wrong: I love C, too. But certainly not for home-grown infrastructure at a web company — the incentives just don't align. And hiding behind the "we just don't know" bugbear doesn't parse, either. Go, for example, powers gargantuan-scale infrastructure at Google, and has for half a decade. And there's whole fleets of organizations as big or bigger than GitHub that report the same experience.

vidarh · on June 16, 2015

Google also has a gargantuan-scale dev team that includes the people behind Go. It's ridiculous to compare. If Github does not yet themselves have sufficient people with sufficient experience supporting large services in Go, betting critical infrastructure on it would be irresponsible.

Note that he is not saying they're not open to alternatives, nor that they're not experimenting with alternatives, but that they need more experience first before "betting critical infrastructure on something else by default".

sagichmal · on June 16, 2015

    > Google also has a gargantuan-scale dev team that 
    > includes the people behind Go. It's ridiculous to 
    > compare.

It's not ridiculous. There are probably reasonable arguments against using Go for things like this, but "insufficient developer capacity" isn't one of them. Becoming a Go expert is a task measured in weeks.

jeremyjh · on June 16, 2015

Does he think there were no Garbage Collected languages available in 2007 when he started beanstalkd ? The arrogance is breathtaking.

brazzledazzle · on June 16, 2015

He actually replied to a tweet that brought that up:

"I was young(er) and stupid(er) then.."

DasIch · on June 16, 2015

What is the alternative? Rust isn't there yet when it comes to IO and there are certainly no other languages that qualify and allow a similar level of control.

rakoo · on June 16, 2015

What about nim ? There are facilities to run an evented loop, to listen on an udp socket, can be tuned for more realtime capabilities, is compiled to C...

DasIch · on June 17, 2015

Nim doesn't have the same popularity. There aren't a lot of people that know the language, that makes hiring people that do difficult, it means there may be nobody at Github that does either and due to it's lack of popularity people will probably be less willing to learn it than Rust or Go.

sagichmal · on June 16, 2015

Go is the obvious choice.

DasIch · on June 17, 2015

Go is more high level and doesn't offer a similar level of control. Go is an obvious candidate that should be evaluated but it's not an obvious choice the same way that Rust is going to be and C is.

marceldegraaf · on June 15, 2015

Looks great! Question for the author(s) of Brubeck: are you planning to add an InfluxDB backend, or is that something you'd want the community to pick up?

whost49 · on June 16, 2015

Is there a reason why you didn't just use collectd, which is written in C and has lots of plugins, including statsd, InfluxDB, Graphite, etc.?

jssjr · on June 18, 2015

We use collectd extensively and it is wonderful software. Brubeck and collectd do very different jobs.

whost49 · on June 18, 2015

What does Brubeck do that the collectd statsd plugin can't?

https://collectd.org/wiki/index.php/Plugin:StatsD