
I don't buy this take, and this rebuttal does a better job than I could of explaining why: https://andersource.dev/2022/04/19/dk-autocorrelation.html

Basically, this autocorrelation take shows that if performance and evaluation of performance were random and independent, you would get a graph like the D-K one, and therefore it states that the effect is just autocorrelation. But in reality, it would be very surprising if performance and evaluation of performance were independent. We expect people to be able to accurately rate their own ability. And D-K did indeed show a correlation between the two, just not as strong of one as we would expect. Rather, they showed a consistent bias. That's the interesting result. They then posit reasons for this. One could certainly debate those reasons. But to say the whole effect is just a statistical artifact because random, independent variables would act in a similar way ignores the fact that these variables aren't expected to be independent.
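The premise described above, that random and independent performance and self-evaluation produce a D-K-looking graph, is easy to reproduce. Here's a minimal sketch (all names and numbers are mine, purely illustrative):

```python
import random
import statistics

random.seed(0)
n = 10_000

# Two independent uniform draws: actual skill and self-assessment
# have no relationship at all.
actual = [random.random() for _ in range(n)]
perceived = [random.random() for _ in range(n)]

# Sort by actual skill, split into quartiles, and average the perceived
# percentile within each quartile.
ranked = sorted(zip(actual, perceived))
mean_perceived = [
    100 * statistics.mean(p for _, p in ranked[i * n // 4:(i + 1) * n // 4])
    for i in range(4)
]
# Each quartile's mean self-assessment sits near the 50th percentile, so the
# bottom quartile appears to overestimate and the top to underestimate:
# a D-K-shaped plot from pure noise.
```

As the parent notes, the actual D-K data does not look like this: the real curve is centered near 65 and slopes upward, which is exactly what pure independence fails to reproduce.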



Yup. Assuming the sample sizes are statistically significant, the original paper clearly shows:

- On average, people estimate their ability around the 65th percentile (actual results) rather than the 50th (simulated random results) -- a significant difference

- That people's self-estimation increases with their actual ability, but only by a surprisingly small degree (actual results show a slight upwards trend, simulated random results are flat) -- another significant difference

The author's entire discussion of "autocorrelation" is a red herring that has nothing to do with anything. Their randomly-generated results do not match what the original paper shows.

None of this really sheds much light on to what degree the results can be or have been robustly replicated, of course. But there's nothing inherently problematic whatsoever about the way it's visualized. (It would be nice to see bars for variance, though.)


The autocorrelation is important to show that the transformation to a D-K plot will always give you the D-K effect for independent variables.

However, the focus on autocorrelation is not very illuminating. We can explain the behaviors found quite easily:

- If everyone's self-assessment scores are (uniformly) random guesses, then the average self-assessment score for any quantile is 50%. Then of course those in lower quantiles (the less skilled) are overestimating.

- If self-assessment score and actual score are proportionally dependent, then the average of each quantile is always at least its quantile value. This is the D-K effect, which weakens as the correlation grows.

- The opposite is true for an inversely proportional relation.

So, the D-K plot is extremely sensitive to correlations and can easily over-exaggerate the weakest of correlations.
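That sensitivity is easy to demonstrate with a toy model of my own (perceived skill as a weight-c blend of actual skill and pure noise); the bottom-quartile gap shrinks as the correlation strengthens but stays large even for modest correlations:

```python
import random
import statistics

random.seed(0)
n = 10_000

def to_percentiles(xs):
    """Rank each value as a 0-100 percentile within the sample."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    pct = [0.0] * len(xs)
    for rank, i in enumerate(order):
        pct[i] = 100 * rank / (len(xs) - 1)
    return pct

def bottom_quartile_gap(c):
    """Mean (perceived - actual) percentile gap in the bottom actual quartile,
    when perceived skill is c * actual + (1 - c) * noise."""
    actual = [random.random() for _ in range(n)]
    perceived = [c * a + (1 - c) * random.random() for a in actual]
    a_pct, p_pct = to_percentiles(actual), to_percentiles(perceived)
    return statistics.mean(
        p_pct[i] - a_pct[i] for i in range(n) if a_pct[i] < 25
    )

# Gap at zero correlation, moderate correlation, perfect correlation.
gaps = [bottom_quartile_gap(c) for c in (0.0, 0.5, 1.0)]
```

The apparent overestimation only vanishes at perfect correlation, which is why even a weakly correlated self-assessment yields a dramatic-looking D-K plot.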


> "On average, people estimate their ability around the 65th percentile (actual results) rather than the 50th (simulated random results) -- a significant difference"

This is a different issue than D-K. The D-K hypothesis is that self assessment and actual performance are less correlated for weaker than higher performing individuals. People think they're better than average is a different (and much less controversial) bias.

---

[DK-Effect] : I totally know I scored at least a 30% on that test, and that's certainly way better than average (it's not). [Actually scored 10%]

[No DK-Effect] : I totally know I scored at least a 30% on that test, and that's certainly way better than average (it's not). [Actually scored 30%]

---


> The D-K hypothesis is that self assessment and actual performance are less correlated for weaker than higher performing individuals.

Isn't that what the graph shows? The bottom quartile group is guessing almost 50 percentile points higher than their actual performance, whereas the top quartile is at most 15 points off.

They're all guessing somewhere between the 60th and 75th percentiles (i.e. "I'm a bit better than average") - with some upwards trend, since the high performers seem to at least know they have some skill, although not very accurately. It's just that for the poor performers, a guess of the 60th percentile is wayyy off the mark.


EDIT: Something important for the rest of this post. In case it's not clear, the graph is showing your percentile ranking within the group - not your actual score.

Nope, because there's an interesting statistical trick in play. Imagine you take 100 highly skilled physicists and give them some lengthy series of otherwise relatively basic physics questions. Everybody is going to rate their predicted performance as high. But some people will miss some questions simply due to silly mistakes or whatever. And those people would end up on the bottom 10% of this group, even if the difference between #1 and #100 was e.g. 0.5 points. Graph it as D-K did, and you'd show a huge Dunning Kruger effect, even when there is obviously nothing of the sort.

In fact the fewer differences in ability within a group, and the greater the relative ease of a task, the bigger the Dunning-Kruger effect you'd show. Because everybody will rate themselves relatively high, but you will always have a bottom 10%, even if they are practically identical to the top 10%.
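The physicist thought experiment above is easy to simulate. A sketch with illustrative numbers of my own (a tightly clustered score distribution, self-ratings around the 70th percentile):

```python
import random
import statistics

random.seed(1)
n = 100

# 100 near-identical experts: true scores differ only by tiny slips.
scores = [95 + random.gauss(0, 0.3) for _ in range(n)]

# Everyone sensibly rates themselves as strong, say around the 70th
# percentile with some spread (these numbers are illustrative).
perceived_pct = [min(99, max(1, random.gauss(70, 10))) for _ in range(n)]

# Rank the nearly identical scores anyway: someone has to land at the bottom.
order = sorted(range(n), key=lambda i: scores[i])
actual_pct = [0.0] * n
for rank, i in enumerate(order):
    actual_pct[i] = 100 * rank / (n - 1)

bottom_gap = statistics.mean(
    perceived_pct[i] - actual_pct[i] for i in range(n) if actual_pct[i] < 25
)
top_gap = statistics.mean(
    perceived_pct[i] - actual_pct[i] for i in range(n) if actual_pct[i] >= 75
)
# The "bottom" quartile overestimates hugely and the "top" underestimates,
# though the skill difference between them is a rounding error.
```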

You can see this most clearly in the original paper. They carried out 4 experiments. The one that was most objective and least subject to confounding variables was #2, where they asked people a series of LSAT based logic questions, and assessed their predicted vs actual results. And there was very little difference. Quoting the paper, "Participants did not, however, overestimate how many questions they answered correctly, M = 13.3 (perceived) vs. 12.9 (actual), t < 1. As in Study 1, perceptions of ability were positively related to actual ability, although in this case, not to a significant degree." Yet look at the graph for it, and again it shows some seemingly large D-K effect.

And there's even more issues with D-K, and especially experiment #1 (which is the one with the prettiest graph by far), but that's outside the scope of this post. I'm happy to get into it, if you are though. I find this all just kind of shocking and exceptionally interesting! I've referenced the D-K effect countless times in the past, never again after today!

[1] - https://sci-hub.se/10.1037/0022-3514.77.6.1121


Yes yes yes! I’m in the very same boat, and came to an epiphany that the ranking trick here, combined with some subjective questions (ability to appreciate humor - seriously!?), hides almost everything about actual skill. Not only does it amplify mistakes, it also forces the participants to have to know something about their cohort. Having to guess your ranking fully explains the less-than-perfect correlation. It also undermines all claims about competence and incompetence. They’re not testing skill, they’re only testing ability to randomly guess the skill of others.

What about the slight bias upwards? Well, what exactly was the question they asked? It’s not given in the paper. They were polling only Cornell undergrads looking for extra credit. What if the question somehow accidentally or subtly implied they were asking about the ranking against the general population, and then they turned around and tested the answers against a small Cornell cohort? I just went and looked at the paper again and noticed that the descriptions of the ranking question changed between the various “studies” with the first one comparing to the “average Cornell student” (not their experiment cohort!). The others suggest they’re asking a question about ranking relative to the class in which they’re receiving extra credit. Curiously study 4 refers to the ranking method of study 2 specifically, and not 3. The class used in study 4 was a different subject than 2 & 3. How they asked this question could have an enormous influence on the result, and they didn’t say what they actually asked.

Cornell undergrads are a group of kids that got accepted to an elite school and were raised to believe they’re better than average. Whether or not all people believe they’re better than average, this group was primed for it, and also has at least one piece of actual evidence that they really are better than average. If these were majority freshmen undergrads, they might be especially poorly calibrated to the skills of their classmates.

In short, the sample population is definitely biased, and the potential for the study to amplify that bias is enormous. The paper uses suggestions and jumps to hyperbolic conclusions throughout. I’m really surprised that evidence and methodology this weak claims to show something about all of humanity and got so much attention.


> The D-K hypothesis is that self assessment and actual performance are less correlated for weaker than higher performing individuals.

I’m not sure that’s an accurate summary. The correlation of the perceived ability is effectively the slope of the line, and the slope is more or less constant. The paper suggests that the bias of the bottom quartile is higher than the bias of the upper quartile, not that the correlation is any different.

But it’s strange that the DK paper makes an example of the lower performers, since the bias of the scores appears to be constant; it appears the high performers have pretty much the same bias as the low performers — it’s a straightish line that goes through 65% in the middle rather than the expected straight line that goes through 50% in the middle. If the ‘high performers’ had a different bias, then the line wouldn’t be so straight.


Yeah my understanding is

1. the slope of self-perceived ability is lower than actual ability

2. The y intercept is dependent on difficulty of test

Therefore with an easier test the better testees are more accurate, and with a very difficult test the worse testees are more accurate, because of where the lines intersect. Meaning DK is an artifact of test difficulty.

This also means that if the test were difficult enough you could create a bizarro-DK effect where the better testees were less accurate.
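Points 1 and 2 can be written down as a toy two-line model (the slope and intercepts here are illustrative choices of mine, not fitted to the paper):

```python
# Self-perceived percentile as a line in actual percentile with slope < 1;
# the intercept rises as the test gets easier.
SLOPE = 0.3  # assumed slope of the self-perception line

def perceived(actual_pct, intercept):
    return intercept + SLOPE * actual_pct

def crossover(intercept):
    """Actual percentile where perceived == actual, i.e. where people
    are exactly accurate: intercept / (1 - slope)."""
    return intercept / (1 - SLOPE)

easy_test = crossover(60)  # crossover high up: the best testees are accurate
hard_test = crossover(10)  # crossover low down: the worst testees are accurate
```

With the easy-test intercept the lines cross near the top of the range (the better testees are the accurate ones); with the hard-test intercept they cross near the bottom, which is the bizarro-DK reversal.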


For 1, the data is based on guessing, so it’s zero surprise that self-perceived ability doesn’t correlate perfectly with actual ability. It would be extremely surprising and unbelievable if the slopes were the same, right?

For 2, the DK paper shows one thing, but the replication attempts have shown this effect doesn't even exist for very complex tasks, like being an engineer or lawyer. The DK effect doesn't generalize, and doesn't even measure exactly what it claims to measure, which is why we don't need to speculate about the bizarro-DK reversal effect - we already have evidence that it doesn't happen, and we already have a big enough problem with people mistakenly believing that DK showed an inverse correlation between confidence and competence, when they did no such thing.


> This is a different issue than D-K.

No, it's literally the D-K finding.

> The D-K hypothesis is that self assessment and actual performance are less correlated for weaker than higher performing individuals

That may have been a hypothesis Dunning and Kruger had at some point; it's not the effect they actually identified from their research. But I don't think it's even that: it's an “effect” people have associated with D-K because they heard discussion of the D-K research that got distorted at multiple steps from the original work, and then that misunderstanding, because it made a nice taunt, replicated widely and became popular.


To be fair, the paper itself uses hyperbolic language that completely distorts its own data. It heavily pushes and leads the reader into one possible dramatic explanation for their results, while downplaying and ignoring a bunch of other less dramatic explanations. Using a word like “incompetent” is almost completely unfounded based on what they actually did. Section headings like “competence begets calibration”, “it takes one to know one”, and “the burden of expertise” are uncurious platitudes and jumping to conclusions. I’m kind-of stunned at the popular longevity of this paper given how unscientific it is and how often replication with better methodology has shown conflicting results.


This is straight from their paper [1]:

"Perhaps more controversial is the third point, the one that is the focus of this article. We argue that when people are incompetent in the strategies they adopt to achieve success and satisfaction, they suffer a dual burden: Not only do they reach erroneous conclusions and make unfortunate choices, but their incompetence robs them of the ability to realize it."

[1] - https://sci-hub.se/10.1037/0022-3514.77.6.1121


> That people's self-estimation increases with their actual ability, but only by a surprisingly small degree (actual results show a slight upwards trend, simulated random results are flat) -- another significant difference

If everyone thinks they are slightly above average, isn't this inevitable? If everyone thinks they are slightly above average, people who are slightly above average are going to be the most accurate at predicting where they land?


> If everyone thinks they are slightly above average, isn't this inevitable? If everyone thinks they are slightly above average, people who are slightly above average are going to be the most accurate at predicting where they land?

Yes, it’s inevitable. And this study only asked Cornell undergrads what they think of themselves - people who were taught to believe they are above average, and also people who got into a selective school and probably all had higher than average scores on standardized tests. Is it surprising in any way that this group estimated their ability at above average?


Even if "people tend to slightly overrate their own ability," was the only takeaway, it would still refute the author's conclusion that DK has nothing to do with human psychology.


Have you not just summarized the Dunning-Kruger effect in other words?

That essentially follows from everyone assuming they are slightly above average. That's also the crux of the refutation, and why the whole autocorrelation thing is a red herring: even if we all self-assessed completely randomly, that would actually confirm the Dunning-Kruger effect is real (because if we self-assess randomly, worse performers are more likely to overestimate).

We could argue that this is not surprising, but the "surprising" bit is that the curves show that better performers are actually more skilled at assessing their performance, which incidentally was also confirmed by the followup studies.


Is it though? Everyone overestimating their ability a bit isn't the DK effect. It's when people with less knowledge and ability vastly overestimate their ability (because they don't know how little they know, while others do), and the opposite for those who are truly more able and knowledgeable (again because they understand how vast the topic is: though they know more and are more capable than the average person, they also understand how little they truly know compared to what they don't know).


I'll give it a stab.

There are those that don't know, and don't know that they don't know. They evaluate themselves the highest.

There are those that know, and don't know that they don't know. They evaluate themselves a bit better than those before.

There are those that know, and know that they don't know. They evaluate themselves worse than those before them. This is the d-k valley, imposter syndrome, confidence issues.

There are those that know, and know that they know. They are much better at evaluating themselves than those before them. They have the experience to know what they know and what they don't know, and they still continue to underrate themselves vs the first bunch, but they are more accurate and closer to the truth.

Anyway, that's how I always understood d-k.


Yes but then you'd see a flat line for people's estimates, which wasn't the result.


> assuming the sample sizes are statistically significant

Nitpick: should read "assuming the sample sizes provide sufficient statistical power"


> And D-K did indeed show a correlation between the two, just not as strong of one as we would expect. Rather, they showed a consistent bias. That's the interesting result.

"D-K effect in its original form" vs "D-K effect in pop culture" is the biggest D-K effect live example. Of course I mean D-K effect in pop culture here.

Interestingly, the "interesting" part of the original result is that the correlation between actual performance and perceived performance is less than people intuitively think.

But as the "D-K effect in pop culture" spreads, people's collective intuition changes. Today if you explained the original D-K effect to a random person on the internet, they might find it interesting because the correlation is greater than they thought: they thought the correlation would be negative!


D-K effect effect is almost as entertaining as the Butterfly effect effect[1].

[1]: Which is the far-away effect attributed to having watched the movie The Butterfly Effect.


> And D-K did indeed show a correlation between the two, just not as strong of one as we would expect. Rather, they showed a consistent bias. That's the interesting result.

Right, so:

1. If the data were truly random, with no correlation, we'd expect the line to be straight across the middle, with the first quartile at 50% and the last quartile also at 50%

2. If the data were 100% accurate and precise [1], we'd expect the line to be diagonal, with the first quartile at 12.5% and the last quartile at 87.5%.

3. If the data were accurate but not precise (i.e., basically right but with some randomness built in), we'd expect the line to be in between #1 and #2 -- basically, changing from #2 into #1 as the randomness increases, but with the intersection at 50%.

That's because someone in the 2nd percentile can't underestimate themselves as much as they can overestimate themselves, and someone in the 98th percentile can't overestimate themselves as much as they can underestimate themselves. But in any case, the "0 bias" case looks symmetric.

4. But what we actually see is none of the above: we see the 1st quartile being at (eyeballing the chart) 60%, and the last quartile at 75%.

That shows that there is indeed some ability for self-evaluation, but it's off. The fourth quartile could indeed just be random, the effect of clipping at the top meaning that the upper quartile cannot overestimate themselves as much as they underestimate themselves. But there's no getting around the fact that the bottom quartile are overestimating themselves.

[1] https://en.wikipedia.org/wiki/Accuracy_and_precision
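The unbiased-but-imprecise case (#3) and the clipping argument can be checked with a quick simulation (the noise level here is an arbitrary choice of mine):

```python
import random
import statistics

random.seed(2)
n = 10_000

# Unbiased but imprecise self-assessment: true percentile plus symmetric
# Gaussian noise, clipped to the possible 0..100 range.
actual = [100 * i / (n - 1) for i in range(n)]
perceived = [min(100, max(0, a + random.gauss(0, 25))) for a in actual]

quartile_means = [
    statistics.mean(perceived[i * n // 4:(i + 1) * n // 4]) for i in range(4)
]
# Clipping alone pulls both ends toward the middle, but symmetrically:
# the curve still passes through about 50 at its centre, unlike the
# D-K data, which is shifted up to roughly 60-75 everywhere.
```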


> But there's no getting around the fact that the bottom quartile are overestimating themselves.

It's because higher competence goes along with more accurate self-assessment but not less bias. So the high performers underestimate with less magnitude than the low performers overestimate, but they both under and over estimate themselves with the same frequency.


The author of this assumes the conclusion in order to decide how to analyze his data.

He cannot reasonably say both:

> we have a decision to make: what are we going to assume? How are we going to quantify our surprise from the results?

> The first option is, as in the case of the state census, to assume dependence between X and Y. I.e. to assume that, generally, people are capable of self-assessing their performance.

> The second option conforms with the Research Methods 101 rule-of-thumb “always assume independence.” Until proven otherwise, we should assume people have no ability to self-assess their performance.

> It seems to me glaringly obvious that the first option is much, much more reasonable than the second.

— and -

> most notably the claim that the more skilled people are, the better they are at self-assessing their performance. This result is supported by their plot, but in any case, my issue is not with objections to this claim

and then expect to carry any credibility.

The author of this piece both suggests that a key variable is fixed and later admits it varies within the same dataset.

I guess at least they admit it, but this lacks basic self-consistency.


I'm utterly confused. The latter statement is just the author explaining which parts they didn't discuss in their article; it has no bearing whatsoever on the section before it.


It discloses the cognitive dissonance in his position. He seems to be saying both “skill at assessing ability is random and mathematically bounded only” while admitting “skill at assessing ability changes with ability.”


> The author of this piece both suggests that a key variable is fixed and later admits it varies within the same dataset.

I don't see how that variable changes, here is an example how the error variable can be exactly the same for everyone and reproduce the results:

Let's say the overconfidence is always that you feel 50% of those better than you are actually worse than you. So everyone is equally overconfident, just that the top won't move their own placings as much as the bottom, since there are far fewer people they can mistake as being worse than them. Then apply noise to this and you get the graph Dunning-Kruger got.

You could say "But they are better at estimating their rank!", but that is just a mathematical artefact, it isn't a psychological result. Even if everyone always guessed that they are number 1, the better you are the better your guess will be, but in that case it is easy to see that everyone overestimates their skill in the same way instead of the better people having a fundamentally different way of evaluating themselves.
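The model described above can be written down directly (a noise-free sketch, with my own variable names):

```python
n = 100

def perceived_percentile(true_rank):
    """Everyone applies the same rule: believe that half of the people who
    are actually better than you are in fact worse than you."""
    above = n - 1 - true_rank
    believed_below = true_rank + 0.5 * above
    return 100 * believed_below / (n - 1)

# Identical bias rule for everyone, yet the error shrinks with rank simply
# because better performers have fewer people above them to be wrong about.
errors = [perceived_percentile(r) - 100 * r / (n - 1) for r in range(n)]
# errors[0] is 50 points, errors[-1] is 0 points.
```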


Both analyses seem to agree on one finding: people’s skill at estimating their own ability increases with that ability. It can’t be a purely mathematical artifact, because then you would see a tapering at either end, or a narrowing distribution of errors at the bottom end, not just a narrowing toward the top end.

This should be unsurprising for anyone who has become sufficiently skilled at something. Beginners can’t even discern the differences the experts are discussing, and frequently make errors in classes they don’t even understand.


Beginners, by definition, are guessing 100%. Some will guess high, others low, and the rest in between. But they are all guessing. Perhaps there's a cultural bias to overestimate their skill? Perhaps there's a nudge in the process of the study that led them to overestimate?

The lede isn't that people overestimate their skill level. The lede is: why would that be, when they have nothing else to go on? What is the trigger, or triggers? And to say the more experienced estimate better? Well, duh.


> Lets say the overconfidence is always that you feel 50% of those better than you are actually worse than you. So everyone is equally overconfident, just that the top wont move their own placings as much as the bottom since there are much fewer people that they can mistake being worse than them. Then apply noise to this and you get the graph Dunning-Kruger got.

But the data of original D-K paper shows that the top 25% people underestimate their placings. So this whole paragraph, while logically true, has little to do with the original D-K effect.

> You could say "But they are better at estimating their rank!", but that is just a mathematical artefact, it isn't a psychological result. Even if everyone always guessed that they are number 1...

If everyone always guessed that they are number 1, it's a huge psychological result: it means people are extremely irrational when it comes to self-evaluation.


> But the data of original D-K paper shows that the top 25% people underestimate their placings. So this whole paragraph, while logically true, has little to do with the original D-K effect.

That is what you would expect under my model, due to the randomness being limited upwards for the high placings while still going downwards. That is the effect the article we are talking about refers to when it says "Autocorrelation".


I found two very interesting things in the original D-K paper [1] that challenge your otherwise reasonable point. The first is that the graph everybody associates with D-K, the one showing the beautifully perfect linear result, is one of 4. The other 3 graphs are far messier, and indeed the paper discusses the fact that the correlations tend to be weaker and in some cases nonexistent.

The second thing is that that beautiful perfectly linear graph everybody references, was measuring 'humor'!!! Humor is going to be something that's all but guaranteed to create near complete noise between self evaluation and 'expert' (professional comedians in this case) evaluation. And if everybody is essentially randomly guessing on their performance, then it will always show an extremely strong D-K effect with the top performers underestimating themselves, and the bottom performers overestimating themselves.

The experiment that most simply and directly measured 'intelligence', without complicating matters in a potentially confounding fashion, is #2. It was based on logic problems from the LSAT. And the resultant graph is just all over the place. Quoting the paper's evaluation of this study:

---

"Participants did not, however, overestimate how many questions they answered correctly, M = 13.3 (perceived) vs. 12.9 (actual), t < 1. As in Study 1, perceptions of ability were positively related to actual ability, although in this case, not to a significant degree."

---

This is really looking like another Zimbardo.

[1] - https://sci-hub.se/10.1037/0022-3514.77.6.1121


Yes, D-K is another one of those "classic" psychology studies that everyone knows about but is actually rubbish and shouldn't be cited for anything. You're not the first to notice this, I pointed it out on HN last year:

https://news.ycombinator.com/item?id=31119836

At some point I should write up a proper blog post on the D-K paper in the hope that it eventually surfaces in search results, because it's past time for this paper to be put to bed. The problems you cite aren't even the full set. The whole thing was (of course) a study on a handful of psych undergrads, their selection method for expert comedians has circular logic in it and it all goes downhill from there.


But again isn't the fact that, "perceptions of ability were positively related to actual ability, although in this case, not to a significant degree" an interesting result? Not the fact that they were related, but the fact that they mostly weren't! That does demonstrate the core result as I understand it, that people are little better than random at evaluating their own performance, which was a surprising finding.


Nope, because I think D-K played a neat little trick. Whether it was intentional or not is another topic. They were using a largely homogenous group of people - Cornell undergrads taking psychology classes, and querying them on things where all performance would fall close to a similar mean.

Imagine you take 100 literal clones of somebody and query them on something, and then ask them to predict their performance. Assuming these clones are smart, they'd all estimate their performance as being at 50%, which is what would be expected. But due to natural variance, not everybody will score identically (in the same way even identical twins do not perform identically). And so you'd end up seeing some huge D-K effect with literal clones! Those at the bottom would be greatly overestimating their performance, while those at the top would be greatly underestimating it. Now step away from clones into the regular world of students, where everybody is going to think they're a bit better than average, and you get people predicting a score of about 60%. Now suddenly you see the same thing, except the lower performers would be overestimating their performance by a greater degree than the top performers were underestimating theirs.
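The clone thought experiment is straightforward to simulate (numbers illustrative: identical underlying skill plus chance, and everyone predicting the 60th percentile, as in the second scenario above):

```python
import random
import statistics

random.seed(3)
n = 100

# 100 "clones": identical true skill, scores differing only by chance.
scores = [random.gauss(50, 2) for _ in range(n)]
# Everyone predicts "a bit better than average": the 60th percentile.
PREDICTED = 60.0

# Rank within the group, as the D-K plots do.
order = sorted(range(n), key=lambda i: scores[i])
actual_pct = [0.0] * n
for rank, i in enumerate(order):
    actual_pct[i] = 100 * rank / (n - 1)

bottom_err = statistics.mean(
    PREDICTED - actual_pct[i] for i in range(n) if actual_pct[i] < 25
)
top_err = statistics.mean(
    PREDICTED - actual_pct[i] for i in range(n) if actual_pct[i] >= 75
)
# bottom_err is about +48, top_err about -28: the low scorers "overestimate"
# far more than the high scorers underestimate, among identical people.
```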

To truly measure D-K, you'd need an extremely heterogenous group of people, and you'd also need questions with perceived and real domain expertise. Would a farmer with a 5th grade education evaluate his performance on a differential equations test as above average? Would a professor of diff eq evaluate his performance on a test of optimal growth strategies for buckwheat and corn, as above-average? Of course not, but then you can't make a shocking claim, don't get published, and don't become famous.


The issue is people have differing personal definitions of Dunning Kruger. The generally demonstrated effect in the sample of people Dunning and Kruger analyzed was "people tend to estimate the percentile of their own skill as closer to the average than it really is, with a slight bias towards an above-average mean. This leads to overestimation of relative ability by those in lower percentiles, and the opposite for those in higher percentiles"

However when people cite Dunning Kruger in popular culture they mean "below average people think they're above average, and above average people assume they're below average", which was not shown in the original study, and wouldn't show up in an analysis attempting to justify it via a misunderstanding of autocorrelation.

The general point in the rebuttal is correct. A completely noisy graph of people's estimations of their own ability would show a Dunning-Kruger resembling residual graph (x-y vs x). However, one wouldn't expect people in the 1st percentile to have an equal distribution of perceived skill as people in the 50th or 99th percentile. If that were true, it would be worth reporting.


> "below average people think they're above average, and above average people assume they're below average"

There’s no way to know if you’re wrong, but when I see it used it seems to be pointing out - “some (not all) under qualified people tend to defer to their own beliefs rather than the views/statements from experts, even when that is demonstrably silly.”

^ Referring to the pop-sci interpretation, not in disagreement with the general point.


Which also has nothing at all to do with this study by Dunning and Kruger. So you agree with the general point of parent.


Yes. Just clarifying a small disagreement about the pop-sci interpretation of the phrase.


The rebuttal by Daniel (andersource.dev) is useful, generally. However, when he writes ...

> The history of statistics is well out of scope for this post, but very succinctly, my answer is that statistics is an attempt to objectively quantify surprise.

... I cannot agree. Statistics is not this; it is much broader. One may or may not be surprised by particular statistics, sure, but there are _specific_ concepts that map more directly to surprise, such as entropy from information theory.


If entropy is defined as statistical disorder then I think the definition of "quantifying surprise" is great.


You aren't suggesting that statistics as a field defined a notion of "order", prior to thermodynamic entropy or Shannon entropy, are you? To me, that would be circular.

Based on my knowledge, it seems likely the first published quantification of disorder arose in the study of thermodynamic entropy. Later, Shannon defined entropy in information-theoretic terms, independent of physics. It can be interpreted as a notion of 'surprise' or what he called information.

My claims:

First, the field of statistics is _not_ historically rooted around concepts such as: "order/ordering" or "information/surprise".

Second, the field of statistics, as a directed graph of abstractions, is not rooted in ordering nor surprise.

Third, in teaching statistics, practically or conceptually, the concept of surprise isn't foundational. The idea of _variation_, on the other hand, is central.

I'll add a few more comments. To talk meaningfully about 'surprise', there has to be a stated or assumed baseline or 'expectation' about what is _not_ surprising. For Shannon, if the probability of an event is certain, there is no surprise. Probability and statistics work together, but they are conceptually separable. This is particularly clear when you compare descriptive statistics with, say, probabilities over combinatorics problems.


> The field of statistics is not organized around concepts relating to "order" or "ordering".

Sure but reduced to the simplest form, statistics are used to predict things, the most basic thing in the Universe being "is this particle gonna stay put or move a little in a given direction", which is related to entropy, so to me intuitively these two things seem very related. The fact that in statistics we don't use the words "order" and "disorder" doesn't mean it doesn't reduce to that.

Btw I'm an electrical engineer that isn't amazing at statistics or thermodynamics so beware I might just be talking nonsense.


> ... reduced to the simplest form, statistics are used to predict things

Inferential statistics is not the simplest kind of statistics. Descriptive statistics are both simpler and foundational for inference.

P.S. I should say that I am a bit of a stickler regarding discussions along the lines of e.g. "these things are related". Yes, many things are related, but it is really nice when we can clearly tease things apart and specify what depends on what.


I was surprised by the figure from the original article, imho that's the strongest rebuttal: perceived ability grows strictly monotonically with actual ability, no sign of the famous non-monotonic U-curve. Yeah, the slope is less than one, and it grows a bit faster from the second to the third quartile than from the first to the second, but none of that changes the fact that people tend to slot themselves correctly. The chart is interesting in that it confirms that everyone perceives themselves to be slightly above average in terms of ability, which of course can't be true in practice. But what it also shows is that when they think they'll be below or above that (false) baseline, they're actually correct about it. So pretty much the exact opposite of what the Dunning-Kruger effect claims.


> The chart is interesting in that it confirms that everyone perceives themselves to be slightly above average in terms of ability, which of course can't be true in practice.

No, everyone biases their self-assessments toward a point slightly above the mean. That's not the same as saying everyone thinks they're slightly above average, nor that people's self-assessments have no predictive power whatsoever. The lowest performers still think they're below average, just not as much as they should. The highest performers still think they're considerably above average. But they all have a bias toward (slightly above) the middle.

So yes, people are generally correct in the direction that they deviate from that median self-assessment, but that just shows that people's self-assessments aren't completely without basis. Which D-K certainly didn't claim.


D-K claim a non-monotonic relationship, which simply isn't supported by that data, as you yourself point out: people rank themselves correctly (ordinally). I didn't mean to say that all self-assessments are the same, if that was the misunderstanding. My point is that the self-assessments indeed are meaningful, even more so than D-K claim.


Check the original paper by D-K. You only focused on the first plot, which has a monotonically increasing trend. The later plots show varying degrees of non-monotonicity, though sadly they don't include error bars to indicate how statistically significant the differences between groups are.


But we don’t know their true ability, only the results on one test. It could be they accurately predicted their ability but because of random chance they did better/worse than their guess. Then you would get the exact data that is observed.


I thought they were estimating their performance on the test relative to others. There was no “real world” element.


The slope will be less than one if there's e.g. any random guessing in the test, even if the self-assessment is perfect (apart from whether they know if their guess is right or wrong, of course) [1].

I think this is the effect that the post is dancing around, but doesn't seem to really understand (and the way "autocorrelation" and independence are discussed is, to be charitable, very nonstandard).

[1] https://en.m.wikipedia.org/wiki/Regression_dilution
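A minimal simulation of regression dilution, the effect the comment points at. All parameters here are made up for illustration: give everyone *perfect* self-knowledge of their true ability, add noise only to the measured test score, and the regression slope of self-assessment on score still comes out well below one.

```python
import random

random.seed(0)
n = 10_000

# True ability and a noisy test score; self-assessment is *perfect*:
# each person reports exactly their true ability.
ability = [random.gauss(0, 1) for _ in range(n)]
score = [a + random.gauss(0, 1) for a in ability]  # test noise, sd 1
self_assessment = ability  # perfect self-knowledge

# OLS slope of self-assessment regressed on measured score.
mean_s = sum(score) / n
mean_a = sum(self_assessment) / n
cov = sum((s - mean_s) * (a - mean_a)
          for s, a in zip(score, self_assessment)) / n
var = sum((s - mean_s) ** 2 for s in score) / n
slope = cov / var
print(round(slope, 2))  # attenuated well below 1 (theoretically 0.5 here)
```

With equal ability and noise variances the expected slope is var(ability) / (var(ability) + var(noise)) = 0.5, even though nobody in this toy world misjudges themselves at all.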


I agree, the statistical analysis in the original post makes me very uneasy. I think it could be a case where the conclusion is correct, even though the argument isn't necessarily.

And yes, the fact that the slope is less than one is fairly uninteresting.

The real problem here is that the Dunning-Kruger effect, as it's classically stated, claims that if you asked four people to rank themselves in terms of ability, the result would be 1-3-2-4, i.e. the people who know a little would put themselves above the people who know a lot but aren't quite experts. The problem is the data shows that they'd actually rank themselves correctly, 1-2-3-4. But such a boring finding probably wouldn't have made the authors quite as famous, which might be why they tried a bit of data mangling, and they found this really cool story that everyone would secretly love to be true.

Which is a shame, because I think the fact that the mean of perceived ability is too high (and the variance too low) is really interesting too, and perfectly supported by the raw data.


Yes. The methodology in the original D&K is quite shoddy, and vulnerable to e.g. good old regression to the mean, and the interpretations are too strong. This is sadly very common in psychology (and many other fields I'd guess) and even researchers don't care so much if the story is juicy enough.

The pop version of the DK effect seems to be something like a 4-3-2-1 ranking, which is obviously not supported by the data.


But they wouldn't. They'd rank themselves something like 1,2,2,3. We're not dealing with a population collaborating to all rank themselves in order, but rather each person individually estimating where their abilities lie in the population.

The point is that if you ask someone in the, say, 5th percentile of ability what their ability is compared to the population, they might say 25th percentile. Ask someone at the 25th, and they might say 40th. At the 40th they could say 55th. And at the 90th, maybe they'll say 80th. So yes, if you order their guesses, they will be in roughly the correct order. But, crucially, that doesn't mean that they are ranking themselves correctly!
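A tiny sanity check on the numbers above (they are the comment's own hypothetical figures, not data): the four guesses are perfectly ordered, yet every single one is pulled toward the middle.

```python
# Hypothetical percentiles from the comment: true rank vs. self-guess.
true_pct = [5, 25, 40, 90]
guessed = [25, 40, 55, 80]

# The guesses are strictly increasing, so the ordinal ranking is correct.
in_order = all(a < b for a, b in zip(guessed, guessed[1:]))

# ...but each individual estimate is badly calibrated, biased centerward.
errors = [g - t for t, g in zip(true_pct, guessed)]
print(in_order)  # True
print(errors)    # [20, 15, 15, -10]
```

Ordinal correctness and calibration are separate questions, which is exactly the distinction the comment is drawing.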


I really appreciate that he points out that the original article's use of the term "autocorrelation" is nonstandard. It is nonstandard, but pointing that out alone is a rather flippant way to dismiss the rest of the article.


This rebuttal seems weak because it’s using unbounded datasets (population). A big issue with the DK research is using bounded data (test scores). For example if I get 100% right it’s mathematically impossible to have overestimated.


I agree. Using the author's terminology, the DK paper was trying to show that dy/dx < 1 (where dx/dx = 1), rather than the correlation of y - x vs. x.


I have to agree. You cannot separate the statistical analysis from the meaning of the study. In the article, the author's random data is exactly an extreme replication of Dunning-Kruger. Why? Because, in his random data, people with low test scores almost always overestimate their ability, while people with high test scores almost always underestimate.

That is precisely the premise of the Dunning-Kruger effect. The fact that the original Dunning-Kruger paper shows a less extreme effect? That just shows that people are slightly better than random at estimating their own abilities -- but still nowhere near accurate.


So that’s what the Dunning-Kruger effect basically boils down to, right? That people in general are just bad at assessing their skills.


> But in reality, it would be very surprising if performance and evaluation of performance were independent. We expect people to be able to accurately rate their own ability.

This seems to be attacking an irrelevant point in the analysis. The argument goes like this: a researcher carries out all the studies needed to prove the Dunning-Kruger effect, then trips and drops all the results into a vat of acid. But he's ashamed and quickly generates random numbers for the results, and somehow the data still proves the Dunning-Kruger effect. Not just that: repeating the same exercise again and again with completely random data leads to the same result; the effect is always present. So is the Dunning-Kruger effect so powerful that it exists in the very fabric of the universe, devoid of any human interaction, or is something amiss?

In this situation we are forced to look at the test we have that concluded from the data that the Dunning-Kruger effect exists and conclude that it's a bad test, we need something different.

You seem to be arguing "oh no, you can't look at random data, because we wouldn't expect the experiment to yield random data!". But that doesn't work as an argument for why the test should still be considered good. If it's supposed to have any worth, then the test has to be able to come to one of two conclusions: the Dunning-Kruger effect exists, or the Dunning-Kruger effect doesn't exist. And if the test is set up so that it comes out positive both for genuine experimental results and for pure random noise, and comes out negative only in a narrow, extremely unlikely band of the possible outcome space, then the test is bad.

If we want to rephrase everything a bit to make the issue much clearer, let's set up a coin-toss competition between ChatGPT and a group of 100 people. Each participant goes 1:1 against ChatGPT, where both parties toss a coin and whoever has the most heads wins; on draws, toss again; in case a pair goes into an infinite loop that doesn't end before our allotted trial time, they get removed from the study. A human assistant tosses on behalf of ChatGPT on account of it not having arms yet.

Now we ask each person how they would rate their ability vs. ChatGPT in a coin-toss, everyone answers 50/50, for obvious reasons.

So we run the experiment. The line for "ability plotted against ability" is a straight diagonal line. The line for estimated ability vs. actual ability is a straight flat line at 50%.

Eureka! To the presses! We have just proven the Dunning-Coin-Kruger effect! People who are worse at tossing coins tend to overestimate their ability, and people who are better at tossing coins underestimate their ability! What a marvelous bit of psychological insight; it really tells us something about how the human mind works, and has broader insights about our society! But naturally we always expected this outcome: people who are bad at tossing coins are dumb and of course they are overconfident, unlike people who are good at tossing coins, who have a remarkable intellect and are therefore humble in their self-estimation... and so on and on about preconceived biases that have nothing to do with the actual test we performed.
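The thought experiment above can be sketched numerically (all numbers invented for illustration): generate fully independent random scores and self-estimates, bin people by score quartile, and the classic D-K picture appears by construction, with actual quartile averages sloping upward while estimates sit flat near 50.

```python
import random

random.seed(1)
n = 4_000

# Score and self-estimate are independent uniform percentiles:
# nobody has any self-knowledge at all in this toy world.
score = [random.uniform(0, 100) for _ in range(n)]
estimate = [random.uniform(0, 100) for _ in range(n)]

# Sort by score, split into quartiles, average each series per quartile.
order = sorted(range(n), key=lambda i: score[i])
q = n // 4
for k in range(4):
    idx = order[k * q:(k + 1) * q]
    avg_score = sum(score[i] for i in idx) / q
    avg_est = sum(estimate[i] for i in idx) / q
    print(f"Q{k + 1}: actual ~{avg_score:5.1f}, estimated ~{avg_est:5.1f}")

# Actual quartile averages climb from roughly 12.5 to 87.5, while the
# estimates hover near 50 -- so the bottom quartile "overestimates" and
# the top quartile "underestimates" purely as a binning artifact.
```

This is the flat-line null result; the point of contention upthread is that the real D-K data is *not* flat, which is exactly what makes the random-data comparison informative.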


But we would not expect the coin toss to have a correlation. Whereas we might expect a correlation between actual and perceived ability.

So yes, both are null results, but only one is interesting.

For instance, we would probably expect there to be a correlation between height and ability to dunk a basketball. If someone were to show that there is not a correlation, that would be an interesting result. Just because random data would match my result doesn't mean my result is nonsensical. Getting results that look like random data is still a result--it just means there isn't a correlation.


Thanks, this is a great way to rephrase the OP to bring out the salient parts.


thinking about this more (i'm replying to myself!) -- i guess what the experiments for D/K show is exactly that performance on a test is uncorrelated with your idea of the performance on a test.

yes, it's kind of surprising that, having dropped the "real" results in a vat of acid, our hapless researcher replaces the missing data with random numbers and gets the same result -- but that's only because we didn't expect random numbers to model the outcome.

instead, we would have expected that, towards the bottom of the distribution of test-takers, those folks would rate themselves lower, while towards the top they would rate themselves higher. at the extreme of perfect self-awareness, the line for subjective results would exactly match the line for objectively-scored results.

this is the exact argument that is made in the post linked in the top comment:

> by using random data to argue that the Dunning-Kruger effect is not real, the author is arguing to default to the base assumption. But which base assumption do they make? One even more radical than what's proposed by Dunning-Kruger. In the author's world, the Dunning-Kruger study should be interpreted in the reverse direction, claiming that there is at least some self-awareness in the way people self-assess.

source: https://andersource.dev/2022/04/19/dk-autocorrelation.html


That was a very nice read!


Yeah this must be some high end satire where the guy Dunning-Krugers up an explanation of Dunning-Kruger. Since even an economist is supposed to understand ANOVA I have to conclude that this article is a joke.


The incorrect usage of "autocorrelation" made me double take and wonder if this was satire the first time it was posted.


So what we have here is some scientists trying to prove that the Dunning-Kruger effect doesn’t exist and instead they give us a perfect example of the Dunning-Kruger effect.


> The irony is that the situation is actually reversed. In their seminal paper, Dunning and Kruger are the ones broadcasting their (statistical) incompetence by conflating autocorrelation for a psychological effect. In this light, the paper’s title may still be appropriate. It’s just that it was the authors (not the test subjects) who were ‘unskilled and unaware of it’.


The effect that the worst overestimate their skill was known from before; that wasn't the main result of Dunning-Kruger. The effect that the best underestimate their skill can be chalked up to auto-correlation.


The best don't tend to overestimate their skill; they underestimate it. The D-K results show a consistent bias in estimates toward (somewhere near) the mean. Hence an overestimate at the bottom and an underestimate at the top.


> The best don't tend to overestimate their skill; they underestimate

I wrote the wrong word, I fixed it. The best can't overestimate their rank, so of course that wasn't what I meant.


Dunning-Kruger posits this as a psychological effect, yes? On the top half psychological effects such as imposter syndrome could come in to play.

Have sociological factors such as being kind or big fish little pond been considered as likely causes of the misestimates?


I have the same question...why do some get it so wrong? Was there a nudge in the process of the study that caused some to answer what they did?

Heck, I'm wondering if "Honestly, I can't say" was an allowed response. Or were they forced to pick a number? If so, then I'd want to know what happens when you ask 100 ppl to pick a number between 0 and 100. I bet it's not evenly distributed. Maybe the beginners give a "discounted" version of the distribution?

Even if the autocorrelation explanation is off, there do now seem to be flaws in DK, at least from the perspective of pure and proper science.



