Programmers Need To Learn Statistics Or I Will Kill Them All

smanek · on Jan 9, 2010

I strongly disagree with Zed on this one (at least part of his diatribe - he makes some good points about sample size/ramp-up/etc).

I don't care what your mean or standard deviation are. If your performance isn't symmetrically distributed about the mean (and it probably isn't), standard deviation probably doesn't mean what you expect.

I only look at performance in 99th percentile (or sometimes, 99.9th percentile). If 99% of responses are being served in 100ms on my webapp, I'm happy. I don't care if the mean is 5ms with a 20ms standard deviation. As long as the performance at the 99th (or 99.9th) percentile meets a level of 'acceptability' that I have predetermined, I'm satisfied. And you should be too.

This is much easier to deal with than trying to do real math. Just make sure that no more than 1% (or 0.1%, as the case may be) of requests take longer than X.

Pre-emptive answer: I usually get someone asking me why I don't demand acceptable performance at the 99.99999th percentile or at the slowest request. It's simply an issue of resources. To get that kind of performance, you can't use garbage collection, context switches become an issue, disk buffering even comes into play. You have to basically write a hard real-time (or pseudo real-time) app, and that level of effort isn't worth it for a consumer facing webapp. If I was writing code for a pacemaker or something (and please, dear god, no one ever let me do that) it damn sure better be hard real-time though.

viraptor · on Jan 9, 2010

I can't really agree with you. If you care only about 99.9% of your requests being on time, it means that one in a 1000 can time-out and die. Not that bad? Ok - let's have an example:

Facebook - one random user-page view == 63 requests (reported by firebug). That means 16 people generate >1000 requests. Even with 99.9% requests within your "acceptable range", you get every 16'th person either missing an element on a page due to some DB timeout (for example), or seeing an incomplete page for a not "acceptable" time (with 1/63'th probability of it being the main document, if we don't know which element fails).

Is that really an acceptable behaviour?

I don't believe that a requirement saying "the complete website loads in less than 1s" is affected by GC or context switches at all, tbh. If you actually tested the website with load of 10x the average expected one (over a couple of hours of random scripted walk with a natural write/read balance - testing you'd do on a service you really care about) and you still get random glitches with the expected load, then there's most likely something wrong on a higher level (disk seeks failing? shared hosting overloaded? network? etc. etc.)

Retric · on Jan 9, 2010

Taking more than 100ms is not the same things as "time-out and die". Performance is focused on how long something takes and it normally assumes that the system is not broken. It would take a horribly broken system to timeout 1% of requests. Assuming that 1% of your requests taking long enough the user might notice, but still working is generally acceptable performance for a webapp. So a reasonable goal might be 99% within 150ms and 99.95% don't fail.

viraptor · on Jan 9, 2010

smanek wrote: "I don't care what your mean or standard deviation are.", which if taken seriously, means exactly that 1/1000th request can take an hour to finish and he doesn't care. Anything over 10s in real-life means a time-out in connection, or user refreshing, or going to the next page, or...

Going to 99.95% not-failing connections doesn't help you much IMHO. It means that you've gone from every 16'th person having a problem to every 32'th person. It's not acceptable at all.

Retric · on Jan 9, 2010

Your also assuming that he / we measure per element not per page, which is just silly. Also, stating a minimum accepted value is not the same as ignoring the rest. It just places minimum reasonable performance that a context.

viraptor · on Jan 9, 2010

I read his "Just make sure that no more than 1% [...] of requests" as actual requests. If it meant whole-page-load-time then fair enough.

"stating a minimum accepted value is not the same as ignoring the rest" - then what is it if you don't care about mean and stddev? If you do care about other requests you care about the mean (or stddev, or median, or whatever other parameter) but just don't put it in a scientific way. If you say reasonable, it might mean - it doesn't time-out. That means it's below X seconds. That means with the other maximum load time constraints, you expect a mean time for Y requests to be less than Z. Saying "Assuming that 1% of your requests taking long enough the user might notice" you just put a constraint on that parameter.

(just to explain why I keep arguing this point - because that's what the post is about (kind-of), people say "reasonable time", but don't want to set the actual mean/stddev - which is what they do care about)

Retric · on Jan 10, 2010

No, because taking less than the threshold is effectively meaningless. So, when you calculate the mean and standard deviation the calculation is altered by meaningless trivia. When you add in the latency and rendering times 1ms and 30ms might as well be the same number. What you really want is to round everything under your thresholds up to that threshold and then look at the breakout as x% take 100ms or less, y% take 1 second or less, and z% take less than 3 seconds etc.

You then look at that breakout based on what the page is, or the time of day ect. But, from a performance standpoint the the only really important number is the first number, once enough of your site is fast enough having a few slow areas, or some issues at peak times is reasonable, and once it's slow enough for a user to notice it's a problem.

alain94040 · on Jan 10, 2010

If you care only about 99.9% of your requests being on time, it means that one in a 1000 can time-out and die

Your argument is mathematically correct, but from an engineering point of view, doesn't make sense.

Of course in theory a system which responds 99.9% of the time means that it's possible for 1 of 1000 requests to die horribly. But we are engineers. If I achieve 99.99% of my target, it means that it's highly likely that that one request will also behave with pretty good performance, but I can't be sure that it will be 100ms. It might be 150ms. Or 200ms. But as an engineer, I claim that I am not concerned and don't care about knowing precisely that number.

I think this is what the author of the original article misses completely: he is applying advanced math and refuses to consider that engineering tradeoffs are practical. Of course in theory they aren't. But in practice they are.

kelnos · on Jan 10, 2010

You're arguing about the wrong thing: I still think the parent's position is reasonable and that's it's fine to only care about performance up to a certain percentile.

Maybe you don't like 99% and think it should be 99.99999%. Fine, doesn't matter. Do what makes you happy for your web app.

And besides, this "measurement" has nothing to do with the functional reliability of the app. 99.99% of requests completing within 100ms doesn't mean the other 0.01% of requests fail (as another poster noted); it just means they take longer. If you're worried about requests failing (as we all should be!), then that's a separate issue to deal with. So then you say something like, "99.99% of all requests must complete within 100ms, and no more than 0.0000001% of requests are allowed to fail on average." Or something like that.

zedshaw · on Jan 9, 2010

That'll work, if you have access to 100% of the timings for 100% of the requests and can get that information without having your measurements also modify the results.

Otherwise, it's just easier to do samples and then collect meta-statistics (which are always normal). In that case you'll need to know error rates which require things like variance and std. deviation.

Another technique is to go with statistical process control theory which mostly rejects confidence intervals and instead focuses on live sampling of processes to watch for outliers that need justification. With those you're looking at std. deviation as a measurement of the range of what's been happening vs. what's currently happening.

DaniFong · on Jan 10, 2010

Meta-statistics for website are typically normal?

I think for one way flow through systems this has a tendency to be true (and so is likely to arise on most test harnesses), but I recall that multi-way routing systems with feedback effects (like, say, IP networks or highways) tend to have substantially non-gaussian congestion statistics: typically worse that gaussian, actually.

Consider B. Huberman et. al.

www.hpl.hp.com/research/scl/papers/InternetCongestion/InternetCongestion.pdf

One problem with relying upon standard deviations when you are trying to design a reliable system is that the majority of the variance in many distributions is concentrated in rare events. This is particularly true with long-tails (e.g. a Pareto distribution with alpha <= 2). If you happen to have a system conforming nearly to such statistics, and you keep your eyes on the mean and the standard deviation alone, you'll end up severely underestimating the standard deviation (which might not even converge) and you could be designing a faulty system.

Of course, the real problem, how can one gain confidence in one's statistical analysis. Not so easily, really. So if there's two things people who know about statistics, it's that statistics is critical, and statistics is hard.

gambling8nt · on Jan 10, 2010

What Zed is saying when he notes that meta-statistics are normal is that, thanks to the central limit theorem, the average and standard deviation of data sets collected from the same underlying probability distribution (with convergent average and standard deviation) will tend to be normally distributed (in the limit approaching infinite sample size), even if the underlying system behavior is far from a normal distribution. In practice you work with finite sample sizes, so an underlying distribution sufficiently far from normal will result in a non-normal distribution of meta-statistics--but in most applications, these sort of pathological distributions are largely irrelevant.

Take our example of looking at response time for loading a web page. There is some finite point (say, 10 sec) beyond which we no longer care how much longer it takes. So instead of considering the distribution of response times t, we consider the distribution of min(t, 10 sec). This distribution only has support over a finite interval, so its meta-statistics normalize rapidly as you increase the number of trials.

Using this will under-report the actual standard deviation in the response time (which might, as you say, not even converge), since we've eliminated extremely low probability events with very high response time, but as a practical matter this is largely irrelevant--if these events are high enough probability for us to care we'll notice them anyway. The point of this exercise is not to perfectly ascertain the underlying distribution of t, it is to develop useful predictions for system behavior in practice.

DaniFong · on Jan 10, 2010

The calculated standard deviation from any finite sample size of a long tailed distribution (e.g. Pareto with alpha <= 2) will be off by a factor of infinity.The point is that, not only is the standard deviation irrecoverable in this case, but it's hardly the figure of merit if you do know it.

gambling8nt · on Jan 10, 2010

Except that in real life there are no distributions with support outside of a finite interval in space or time; there's always some point when you stop running the system...if some packets don't arrive by that point, you generally don't care how much longer they would have taken.

hadley · on Jan 10, 2010

Except that the sampling distribution of the standard deviation is a scaled chi-square, not a normal. The central limit theorem is only for the mean, not any statistic that you might dream up. It's trivial to think of many that would not converge even with a windsorised response time.

gambling8nt · on Jan 10, 2010

Chi-squared distributions are well approximated by normal distributions close to the mean.

The point is not that arbitrary statistics will necessarily always be perfectly behaved (or even well behaved) on sampling data--it's that to make reasonably accurate predictions of system behavior, under certain practical conditions, these statistics are well-behaved, and an inexperienced statistician (as most people are) is less likely to make a gross error.

DaniFong · on Jan 10, 2010

Practical conditions not including routing networks and the stock market, you may wish to add...

gambling8nt · on Jan 10, 2010

Real life situations have finite cutoffs in behavior that remove many pathological problems with certain statistical models.

hadley · on Jan 10, 2010

The mean is asymptotically normal and the standard deviation is asymptotically scaled chi-square; the distribution of every "meta"-statistic will be different.

tel · on Jan 10, 2010

Totally agree. This was obviously one of Zed's older, asshole dramatic posts and it doesn't read anywhere more clearly than him yelling about people failing to do statistics correctly and then blanket assuming normality on something pretty unlikely to be normal.

Your method is a safe bet as long as you can collect "enough" data to be sure that you're covering all cases you claim to be covering. That being said, power analysis and data interpretation (ramp up, confounders, etc.) are all more difficult and important than this rant conveys. One R function is not your cure-all.

dspeyer · on Jan 10, 2010

This works if your request stream is homogeneous. If it isn't (and they usually aren't), you need to break it up by type and check the 99.9th percentile of each of them. Otherwise you could find that 99.9% of HTTP GETs are asynchronous AJAX polling and the requests that matter are slower.

Locke1689 · on Jan 9, 2010

I'm pretty sure your system is useless without a statistically significant sample size.

jerf · on Jan 9, 2010

Sampling is not relevant to smanek's point, since he can look at all events. You don't need to sample when you have all the data.

On the other hand, I guess this argues against smanek's points being relevant to Zed's point, since statistics aren't really involved for the same reason. You don't need stats to analyze your performance if you have all the performance data; just read it off.

(Edit: I suppose his point that you shouldn't be using averages but just thresholds is still a statistical argument, against using the wrong tool like "average" on the grounds that you don't actually want the "average".)

lmkg · on Jan 10, 2010

Even if you have all the data of your current run, that's still a sample (broadly speaking) of all the possible runs in the future. This is one of the core aspects of statistics, but it's so fundamental that it gets glossed over rather than getting drilled into people's head like they need. As a result, I have to tell people that when running a survey, total responses are more important than response rate.

thaumaturgy · on Jan 9, 2010

> I strongly disagree with Zed on this one...

I'm unsurprised.

Programmers have a remarkable belief that they are experts in every subject, and if you tell them that there is something wrong with the way they argue, they will argue with you about it.

nailer · on Jan 9, 2010

This recalled to mind recently meeting a physicist who had some experience of programming at a party.

The physicist repeatedly insisted (and explained to me with an authoritative tone) that Python 'has no VM as it is a scripting language'.

scott_s · on Jan 9, 2010

I'd rather discussions here avoid such broad and inflammatory generalizations.

manvsmachine · on Jan 9, 2010

In case you're interested in the (lengthy) original discussion: http://news.ycombinator.com/item?id=626771

EDIT: that wasn't the original, but it was the longest. This is the original (I think): http://news.ycombinator.com/item?id=48006

jmonegro · on Jan 9, 2010

It's a really good article - I came across it today on my twitter feed, and it surprised me that it still has the same impact it had in 2005.

Perceval · on Jan 9, 2010

I'm trying to teach myself statistics (for social sciences) right now. I got two books (Agresti & Finlay along with Knoke, Bohrnstedt & Mee) but both seem to be written in jargon instead of English.

Does anyone have recommendations for statistics books that are actually written in English?

kqr2 · on Jan 9, 2010

Zed actually recommends quite a few books in his post:

    * Statistics; by Freedman, Pisani, Purves, and Adhikari. Norton publishers.
    * Introductory Statistics with R; by Dalgaard. Springer publishers.
    * Statistical Computing: An Introduction to Data Analysis using S-Plus; by Crawley. Wiley publishers.
    * Statistical Process Control; by Grant, Leavenworth. McGraw-Hill publishers.
    * Statistical Methods for the Social Sciences; by Agresti, Finlay. Prentice-Hall publishers.
    * Methods of Social Research; by Baily. Free Press publishers.
    * Modern Applied Statistics with S-PLUS; by Venables, Ripley. Springer publishers.

thomaspaine · on Jan 9, 2010

Funny you should ask, I just ordered a copy of a stats textbook I used back in undergrad because I remember thinking that it was pretty clearly written with lots of examples. It's "Probability and Statistics for Engineers and Scientists" by walpole . If you get a used 7th edition instead of the 8th, it's $10 instead of like $100. http://www.amazon.com/Probability-Statistics-Engineers-Scien...

Introductory Econometrics by Woolridge is a pretty readable econometrics text you should also check out, since it'll be especially applicable for social science: http://www.amazon.com/Introductory-Econometrics-Approach-App...

bokonist · on Jan 9, 2010

For a long time, I wanted to teach myself statistics for the social sciences too. I decided the best way to do this was to find an interesting social science issue, and then learn statistics as part of tackling the issue. But there was one problem with this. There are almost no problems in the social science where it makes sense to use advanced statistics to analyze them. Using very simple regressions can be illuminating and illustrative (though not conclusive). But when you try to get more advanced, data quality, data coding, issues with choosing the right controls, too many variables, sensitivity to variable selection, inability to separate causation from correlation, etc. render advanced statistics utterly useless. Do you have a particular social science problem that you think would be a good fit for statistics?

thomaspaine · on Jan 9, 2010

I guess it depends by what you mean by "advanced statistics". Is correcting for heteroskedasticity, a fixed-effects regression, instrumental variable regression, etc advanced? Because if so, all of that is pretty common practice in say, economics. Simple linear regression is easy and often illuminating, but good luck getting a paper published if that's all you've done.

I agree that in practice, data munging is 90% of the work and can be pretty tedious and discouraging when working with real world datasets, but if you don't know about issues like the ones I mentioned above, how do you even know where to start?

For examples of how statistics can be used to solve interesting problems in the social sciences, just browse through Freakonomics.

bokonist · on Jan 9, 2010

Yes, that's what I mean by advanced statistics. I've read many papers in economics using such techniques. They are 99.9% utter garbage. The issues I mentioned above (data quality, data coding, issues with choosing the right controls, too many variables, sensitivity to variable selection) doom the project before you can even get to using the statistics. Garbage in, garbage out. If you data munge and get a result, you can never determine if it's a real result or a data mining effect.

And yes, I understand you cannot get published in the social sciences without using advanced statistics. All this means is that the economics academia is following into the trap of Schoclastism, and insular world that uses complicated and absurd methods of research, increasingly divorced from reality.

Social science needs to get over its science envy. As I've written here before, the proper tools to use as a student of policy and sociology are the tools of product management: http://news.ycombinator.com/item?id=836196

I have not read Freakonomics the book, but I read their blog occasionally. It's interesting when they find a simple statistic that raises an interesting point.

It might be easier to explain the problems with reference to a specific example. If there is some social science paper that uses the techniques you mentioned above, and you think is particularly good, send me a link, and I can explain why I think it's likely garbage.

Perceval · on Jan 10, 2010

I agree with your assessment of the 'scientism' of social sciences, and the rather absurd uses of statistical methods (that should be inapplicable because of data collective and basic assumption issues).

What I am to do is mostly descriptive statistics, but with a few simple tests (e.g. a test of variance with covariance). I've got three populations of civil war data and would like to compare features of those populations against each other (e.g. instance of civil war in period one vs. period two vs. period three). Controlling for # of states, or # of new states (say, within five years of their founding).

Most of the complicated stuff really can't be applied to stuff like data on civil wars, since the data doesn't meet basic assumptions of the models, but I would like to be able to say something more than just 'the data from period one looks different than the data from the other two periods.'

thomaspaine · on Jan 10, 2010

> Most of the complicated stuff really can't be applied to stuff like data on civil wars, since the data doesn't meet basic assumptions of the models

I think this is precisely why we have more advanced statistical techniques. There are ways to correct for or at least detect serial correlation, homoskedasticity, etc, all of which are probably in your data. All real world data is fucked up in some way, having a large toolbox of statistical techniques helps you cope with this fact. Maybe not perfectly, but at least you'll know where/why your model is wrong.

Perceval · on Jan 11, 2010

Can you recommend a book on advanced statistical techniques or non-parametric statistics that's written mostly in English with a minimum of jargon?

foldr · on Jan 10, 2010

> All this means is that the economics academia is following into the trap of Schoclastism, and insular world that uses complicated and absurd methods of research, increasingly divorced from reality.

That's being a little unfair on the scholastics. They did a lot of important work (particularly on logic), and the renaissance didn't just spring into existence out of nothing.

hadley · on Jan 10, 2010

Sorry, that's rubbish. It's the "advanced" statistical techniques that help you discover and control for these problems. Sure, a lot of people apply them incorrectly and without really thinking about what they're doing and what their results mean (Sod's law: 99% of everything is crap), but that doesn't invalidate their use.

bokonist · on Jan 11, 2010

As I've asked for twice already, give me an example.

jbl · on Jan 9, 2010

The Cartoon Guide to Statistics by Gonick and Smith is actually quite good for dipping your toes in. It should give you an okay conceptual foundation from which you can tackle the textbooks you've already got.

silentbicycle · on Jan 10, 2010

I was just going to recommend this, too. It's hokey as hell, but it's actually a great intro -- After reading it, you'll know the vocabulary and fundamentals well enough to ask meaningful questions. It's also cheap, and a quick read. I just loaned my copy to my little brother for his stats class, and I'm betting he'll find it more useful than the dreadful (and expensive) intro textbook his professor chose.

hadley · on Jan 10, 2010

I'd second the recommendation. But avoid the manga guide to statistics - it's nowhere near as good.

pibefision · on Jan 9, 2010

Mario Triola's book is excellent. Covers all the topics with very clear explanations. Elementary Statics 10th Edition.

tel · on Jan 10, 2010

I can't advise books, but I want to add that beyond the vanilla statistics often taught you'll also probably want to read a book on probability theory and/or combinatorics. These along with some basic calculus are the important underpinning maths that let you build statistics.

korch · on Jan 10, 2010

If you're of the mathematical school that thinks the best way to learn any math field is to walk through the history of how we got here, then I'd recommend the recent book "The Lady Tasting Tea". It traces through the history of the great mathematical minds who created the field of statistics, explains the problems they were solving, how and why they were able to solve them, and how future problems and solutions built up over time into the field of statistics.

It is an academic book, and is more of a history book that explains the math, rather than a math text book; which is why I found it note worthy. It's also worth reading solely for the detailed citations, from which you could completely learn the core of the field by "studying the master's" original papers.

http://www.amazon.com/Lady-Tasting-Tea-Statistics-Revolution...

amoeba · on Jan 9, 2010

Statistics will save us all.

And for the HN folks that like to Lisp it up:

http://incanter.org/

thisisnotmyname · on Jan 9, 2010

This is a great resource if you're studying / reviewing on your own:

http://spartan.ac.brocku.ca/~jvrbik/MATH2P82/Statistics.PDF

jrockway · on Jan 9, 2010

Hacker News readers need to stop submitting linkbait titles, or I will kill them all.

hyperbovine · on Jan 9, 2010

Quoi? That is the verbatim title of the post.

jrockway · on Jan 9, 2010

That means you took the linkbait. Why reward articles like this with free views?

raganwald · on Jan 9, 2010

If the article delivers on the hype, why not reward the author for having the walk to match the talk?

jrockway · on Jan 10, 2010

Does this article deliver on the hype?

What I got out of it was: "Some people are dumber than me." This is a well-known problem both inside and outside the programming world, and unfortunately killing the people dumber than you is not the solution to the problem.

raganwald · on Jan 10, 2010

I took it as "some people are ignorant of basic statistics," which isn't quite as daunting as "some people are dumber than me." The cure is motivation to learn, and one can take the view that the hyperbole of the title broadens the post's reach and simultaneously provides a little negative motivation. I didn't take the title quite that literally.

I personally take the view that you catch more flies with honey than vinegar, but I suggest the article as a whole including the title provides a general good.

zedshaw · on Jan 9, 2010

Don't hate the player, hate the game. :-)

scott_s · on Jan 9, 2010

And blame the player for playing the game.

stcredzero · on Jan 10, 2010

This article is my call for all programmers to finally learn enough about statistics to at least know they don’t know shit.

His call can be generalized: "This article is my call for all programmers to finally learn enough about [insert here] to at least know they don’t know shit."

The power that we programmers have as little gods inside systems of our own making, tends to have us over-estimate how much we know. Thus our penchant for posting stuff that makes experts in various technical and scientific fields roll their eyes.

sireat · on Jan 11, 2010

Two great books for those who need to re/freshen up their statistics:

http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/039...

http://www.amazon.com/Cartoon-Guide-Statistics-Larry-Gonick/...

I haven't had a chance to check out the manga guide to statistics but that might be a decent introduction, as well.

dylanz · on Jan 9, 2010

It's a great post, albeit, probably about 3 years old.

asb · on Jan 9, 2010

Just thought I'd share an example where the graphs really help you to understand the performance profile.

http://s3.amazonaws.com/four.livejournal/20090911/benchmark....

MikeTLive · on Jan 10, 2010

once again slashdot posts hackernews as "new" news. would be nice if they linked HERE to gain from the new comments not the 2.5yr old comments. oh well.

http://developers.slashdot.org/story/10/01/09/2154224/Why-Pr...

ez77 · on Jan 9, 2010

P = "Programmers Need To Learn Statistics"

Q = "Zed Will Kill All Programmers"

If P || Q must hold, go with P!

AlphaMonkey · on Jan 9, 2010

Zed Shaw should stick to his area of expertise. His angry, opinionated rants on topics he clearly lacks deep knowledge of are perfect examples of the pseudo-intellectual junk that litters HN from time to time.

zedshaw · on Jan 9, 2010

Alright, let's see you do better. I see in the comments you've come around to realizing that what statistics needs is more explanations that work for regular folks who have to apply them. Why not write your own blog post that's better than mine?

Also, just FYI: I studied statistics in business school both BS and MS, so most of my background is in applied simplified stats. This includes several graduate courses in stats in the sociology department. In addition I have extensive experience since about 2000 (so 10 years) applying statistics to problems using the R language.

Now, when you write this blog post, I'll expect to see your full Ph.D. level CV laying out all of your experience, publications, and applications of statistics just to be fair.

AlphaMonkey · on Jan 10, 2010

Fair enough. Please do note:

1) I find your writing style juvenile and distasteful. Maybe some of you yanks actually like that kind of writing. Personally, such writing makes the reading intolerable, even if the content is interesting.

2) Still regarding 1): if you already have the fame, why do you keep writing like a disgruntled teenager desperate for attention?

3) Interestingly, I don't disagree with the points you made in the article, I just don't think that insulting your audience is the best way to make your point. Maybe some people like it. To me, it just sounds vulgar.

4) Statistics taught in undergrad and business school is pseudo-Statistics. I don't mean to be pedantic, but the truth is that only graduate-level, rigorous, Mathematical Statistics counts as "true Statistics". One does not need to know the entire field (that takes at least one lifetime), but one needs to know the fundamentals very well in order to apply them.

5) Mere mortals can only be experts in one or two things. You have programming and guitar. Sorry, but you still are no Statistics expert on my book. 10 years of R is impressive and valuable, but if you don't know the foundations, you're a technician, not a statistician.

6) My field is not Statistics. I am no expert in Statistics, and I am not qualified to write about it. I can write a blog post on Algebraic Geometry, but you would not understand it, so there's no point.

7) Unlike you, I have no interest in revealing my true identity. Unfortunately, there are evil people in this world, and whatever I would write on HN under my real name could be used against me later on. If you're not in that position and can write what the fuck you want on your blog, then I must say that I envy your freedom.

To recap: writing like a teenager does alienate some readers, and calling a non-expert an expert is something to be avoided.

richardw · on Jan 10, 2010

The guy puts himself out there, so I respect him for that. His vibe also put me off, but he's got some seriously good points if you read past it. Besides, he's chilled out a lot since the earlier posts.

In this and a few other cases, the angry tone is a lightning rod that got more attention than a quiet "please guys, will you learn some stats?" If you have a better way to do it, then do it.

Raphael_Amiard · on Jan 9, 2010

thanks for that argumentated non opinionated answer !

mariorz · on Jan 9, 2010

the comment requires no further argumentation. what could anyone take away from this article? "most programmers are idiots, zed shaw is fucking awesome" and "SD is important yo!". this isn't HN material.

AlphaMonkey · on Jan 9, 2010

At least I don't write like a slightly retarded teenager, and I don't pretend to be an expert on Statistics. Thanks for the "kudos" anyways.

Raphael_Amiard · on Jan 9, 2010

Sorry for the sarcastic answer, never a good thing. My point was, despite Zed's tone , that i found irritating, he does actually provide content in this article, that a newbie like me can use and put at profit (wich i did when the article first came out).

On the other hand, your post sounds angry, gives no insight as to why / how he lacks knowledge (which would have been interresting) and use condescending expressions like "pseudo intellectual garbage"

mariorz · on Jan 9, 2010

> he does actually provide content in this article, that a newbie like me can use and put at profit (wich i did when the article first came out).

what could you possibly use from this article? if you're really interested in learning stats concepts and tools you can be better served by blogs with actual content.

here's a good one: http://incanter-blog.org/

Pistos2 · on Jan 10, 2010

Vb > Va does not imply that Va == 0. We cannot conclude that A has no value just because B has more value than A.

AlphaMonkey · on Jan 9, 2010

Look, I know it's none of my business, but if you're a newbie and you want to learn some Statistics, wouldn't you be better off checking the MIT OCW pages, or the countless lecture notes in PDF format that can be found on the web? This is an honest question.

Contrary to what many short-sighted "pure" mathematicians think, Statistics is hard. You can't explain statistics in a bunch of blog posts, not even if you are a true expert. You can't condense all that knowledge in a blog post, one must read the books, though painful that may seem.

loumf · on Jan 9, 2010

To learn statistics, yes, of course. To get psyched up about learning statistics? That's what this article did for him.

Sometimes it's just fun to read someone go nuts about something like statistics, and it may give you motivation to go learn more about it so that you can understand the whole thing.

tobtoh · on Jan 9, 2010

I agree. The last time I dabbled with statistics was at university and whilst I could see it was useful, it was boring as anything. I couldn't see any really application in my day to day job - after all, besides understanding the difference between mean and medium, what else would I need?

But Zed's article, juvenile as it was in tone, had undeniable passion about statistics. And it was enough to keep me reading through the whole article and pay attention to the examples. As it is, I will be doing further reading on statistics and so Zed's article has been incredibly useful to me.

"Maybe it's just me, but Statistics is a beautiful field, and such intellectual beauty should be all the "psyching up" one needed."

It's just you. Nine out of ten people think statistics are nothing more than boring numbers made up on the spot.

AlphaMonkey · on Jan 9, 2010

Nine out of ten people learned Statistics from teachers who knew jack of Statistics. The difference between Statistics and Math is that Statistics is data-oriented and somewhat "experimental". Yeah, I hated Statistics in high-school, but when I learned Information Theory and Statistical Signal Processing, then I understood what one could do with knowledge of Statistics, and what the field is all about.

If you can't see the beauty in it, you've probably been taught pseudo-Statistics. Don't feel bad. All the Math that engineering students learn is kind of pseudo-Math. All one learns in high-school is BS. If you want to learn something, here's my advice:

i) Don't use modern, over-designed textbooks.. you know... the thick expensive ones with many colors and boxes highlighting the formulas one must memorize.

ii) Instead read the classic books from the 1960s, many of which are published by Dover. They look boring at 1st sight, but their content is rich. Another option is to get the old Soviet books from the 1960s, which tend to be forgotten gems. The Russians are the best at marrying theory with application.

on Jan 9, 2010

[deleted]

skorgu · on Jan 9, 2010

So how is one to ever know that stats is beautiful if all they see is impenetrable jargon? Where is a good, free, up-to-date resource for doing what Zed covers?

We're all busy, accessibility and immediate relevance are common and useful heuristics for deciding what to investigate.

AlphaMonkey · on Jan 9, 2010

I find MIT OCW courses on Statistics a good source of references. Reading books takes time, but you can find some lecture slides and lecture notes there.

I, too, found Statistics impenetrable. One solution I found was to learn Statistics from the Electrical Engineering and Computer Science perspective. Instead of weird abstractions, in EE and CS one thinks of more concrete things, like random signals, or data sets. The best way to learn Statistics is to apply it.

One problem, though, is that knowing only a little Statistics might be worse than knowing nothing.

skorgu · on Jan 9, 2010

Considering the article is a concrete introduction to statistics with applications and is passionate about your last point you seem to agree with Zed.

ErrantX · on Jan 9, 2010

> You can't condense all that knowledge in a blog post, one must read the books, though painful that may seem.

No, that's a silly approach. If your, say, launching a web app advice like this [sic] is useful. You dont need or want a deep understanding, you just need the pointers to analyse stuff right.

(as it stands I never mind Zed's tone and when I first read this article it provided some useful info to me in a simple format :)

Also r.e. your first post - from what I recall Zed can be considered some sort of an expert on this stuff.

AlphaMonkey · on Jan 9, 2010

Ruby ninja, guitar-player extraordinaire, self-taught Statistics expert... wow!!! Zed's propaganda machine rivals that of Kim Yong-il.

Get this: you need deep understanding of the fundamentals, otherwise you're not doing Statistics, you're doing Voodoo Magic. If you learned some Statistics from Zed's post, then you must know even less than he does. But then, one of the many symptoms of ignorance is the false belief that one knows. Do you even realize how many years one needs to become an expert on any field, especially a tough field like Statistics?

ErrantX · on Jan 9, 2010

Look, I dont want to get into a protracted argument... but.

It seems your definition of "expert" is a lot higher than mine; which is fine. Lets use a different word: experienced.

However with that said what Zed wrote in that post stood up to extra research by myself and I actually understood what he had to say.

If you wish to claim it is wrong and he has no clue, fine; but can you please - for our benefit - explain why and specifically where he is wrong?

AlphaMonkey · on Jan 9, 2010

I am not saying Zed is wrong, I am merely saying his post is not really that insightful... at least when compared to other essays on Statistics one could read instead. For example, what Cosma Shalizi writes on Statistics is much deeper than what Zed did. Another example: Andrew Gelman has a great blog on Statistics, and he's a true expert.

For me an "expert" is someone who has at least 10 years of experience in one field. An expert on Statistics must have carried out extensive data-analysis work, or must have published original papers. Reading books gives one some understanding, but only when one must solve a problem, does one realize that one's knowledge is utterly superficial.

Zed is no superman. His area of expertise is not Statistics, it's something else. I am not saying he's stupid. All I am saying is that he has not invested the many years of effort into studying Statistics to earn the title of "expert". By calling him an expert, one insults all the true experts out there, the one who do not write blog posts using juvenile language because, you know, they have better things to do... like spending time with their friends and family...

ErrantX · on Jan 9, 2010

Cheers, Ill look those guys up.

I do think 2 things are happening here though. I think firstly Zed's style doesn't really appeal to you - which is fine, I can see how he wouldn't gel with lots of people. As a result the second thing happens, his content seems irrelevant.

As someone who quite enjoys his approach :) I found quite a bit of useful content. I dont think the point is really to infuse a deep understanding of statistics - it's to correct some common mistakes hackers like us make when playing with stats :)

For example I scanned the blogs of those 2 names you mention and their stuff certainly seems interesting; but frankly I didnt notice anything massively relevant to things I might need to use day to day in my "startup". Zed's stuff I've already taken to heart and corrected some of my "work practices" when dealing with stats.

As to the rest I think it is just a case of differing definitions of expert :) as I said "experienced" is a better word to use - assuming our definitions match :P.

AlphaMonkey · on Jan 9, 2010

The world changed faster than Statistics. These days, anyone with a computer can analyze data sets, which is awesome. The problem is that the Statistics books are not designed for the masses, and the experts are concerned with technicalities that programmers, for instance, couldn't care less. If Zed helps brings Statistics to the programmers community, I guess that can only be a good thing...

The good news is that there seems to be an opportunity here: write Statistics for the non-statistician. Not everyone who needs Statistics can take 2 years off to learn it. Back in the 1970s, Digital Signal Processing was an advanced grad course... now it's a basic undergrad course. Maybe Statistics will go the same way, becoming more and more prevalent, and less and less ivory tower.

ellyagg · on Jan 9, 2010

Zed, is that you?

zedshaw · on Jan 9, 2010

Nope, that's not me. AlphaMonkey is definitely not me.

AlphaMonkey · on Jan 9, 2010

OK, I must admit that one made me laugh. I assume it's a rhetorical question, so I won't even answer...