30% of Google's Emotions Dataset Is Mislabeled (surgehq.ai)
334 points by echen on July 14, 2022 | 144 comments


Anyone who's dealt with any kind of human-annotated dataset will be familiar with these kinds of errors. It's hard enough to get good, clean labels from motivated, native English-speaking annotators. Farm it out to low-paid non-native speakers, and these kinds of issues are inevitable.

Annotation isn't a low-skill/low-cost exercise. It needs serious commitment and attention to detail, and ideally it's not something you outsource (or if you do, you need an additional in-house validation pipeline to identify dirty labels).


For a sentiment / emotion classification project, we (2 founders) just ended up doing most of the labeling ourselves. It was a big grind, but given how abysmal the performance of “crowd-sourced” solutions is (eg Amazon Mechanical Turk), and how incredibly important the quality of these labels is for training a model, it made the most sense.

I wonder how others do this kind of thing. Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

I’m still shocked at how low the quality of Mechanical Turk was for just sentiment (positive/negative/neutral/unsure); 99% of the classifications were just random, even after we narrowed our selection of workers to higher-qualified ones.

What a giant waste of money and time that was, because supposedly it’s the canonical use case for it.


> I wonder how others do this kind of thing. Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

Others who successfully do it have exactly the same secret sauce that you do: they assign it to someone who is well-compensated and competent.

The one time I needed anything remotely like this I just took the old "nobody said programming was gonna be glamorous" adage and ran with it for two weeks. Money-wise, two weeks of programmer time sorting data manually sure beats twelve weeks of programmer time debugging models to cope with inaccurate data, and the results are orders of magnitude better.

Given the widespread understanding of how critical training data is, it's mind-boggling to me that businesses in this field try to outsource it to the lowest-paid external company they can find, thus offering the lowest possible performance incentives and pretty much losing any control they have over quality assurance. Then they proceed to spend humongous amounts of money on clean-up and further refinement of the training data sets, money with which they could've hired English Lit majors in the first place, who would've given them a perfectly classified data set from the very beginning.


But then STEM would have to admit that English Lit Majors didn't waste their money /s


One of my university professors liked to quip: next time someone tells you that you're arguing semantics, ask them what semantics means. That'll teach 'em.


English Lit Majors can't find work? Try MTurk!


I was using MTurk for labeling about 10 years ago.

To see the other side, I also did a one-month stint as an MTurk worker, earning about $300.

It is absolutely horrible work. I used the MTurk subreddit to find the "decent" jobs, and I had a special Firefox extension that ranked the requesters, etc.

All jobs paid below first-world minimum wage and were incredibly depressing. I think the adult-content ones were the worst.

"Jian Yang: No! That's very boring work"

The worst thing was that you could not stop to think; you had to keep going if you wanted to make at least a few bucks an hour.

There are two solutions to the labeling issue. The first: pay well (at least $20 an hour, no matter the location).

If the workers feel exploited you do not get good results no matter the sanctions.

The even better solution is to find people who give a shit about your task.

That is how I've done 50k lines of labeling, training our OCR on a rare font for Tesseract. We have a few volunteers and they know this is important work (preserving cultural heritage, non-profit, national library, etc).

Old reCaptcha had a similar "feel good" element.

Compare it to newest Google reCaptcha - you know those labels are going to be used for evil at some point in the future.


I'm eager to see what nefarious things Google will do once they've mastered the art of identifying all the buses in a photo!


Should I start worrying if they get very good at identifying paperclips?



their motorcycle, hills, signal light, and traffic light AI will break the world when it's over!


> To see the other side, I also did a one-month stint as an MTurk worker, earning about $300.

are you me?

I did the same thing (worked as a mTurk labeler for 2 weeks) which convinced me to never use mTurk for anything even remotely important.

I've been able to use semi-supervised approaches with actual domain experts reviewing outputs.


There are ways to detect problems with results due to things like fatigue or flagging attention. And there are ways to deal with those problems (like forced break time and switching up tasks).

It's a wonder that none of that is built in. But maybe things like MTurk aren't built to maximize worker effectiveness because it costs too much. Are there better quality crowd-sourcing options?


> Compare it to newest Google reCaptcha - you know those labels are going to be used for evil at some point in the future.

I for one always try at least once to get something wrong, often several times if it doesn't go through immediately, depending on urgency of my task. I hate being made to work for someone else like this.


> I wonder how others do this kind of thing.

We did the exact same for a text classification project.

The multi-week grind was awful, but it meant (1) we had a really good understanding of our data, and (2) we discovered surprising edge cases that we would have missed otherwise.

There is a very large fixed overhead you need to pay when you start outsourcing that work, so doing it yourself is cheaper at scales beyond what you'd normally expect.


> I’m still shocked how low the quality of Mechanical Turk was

I don't know about Mechanical Turk, but there is a crowdsourcing platform by Yandex. The pay is so low that the only reasonable way to earn something is to find a task that is not properly validated and submit random answers from multiple accounts (because there are speed limits). Usually those are tasks from naive foreign companies that don't know about validation.

So if you want high quality, you need to implement proper validation, triple-check every label with different people, and not expect that someone is going to do it for $5/hour. And maybe you should learn how a crowdsourcing service looks from the worker's side, for example by registering and trying to do some tasks yourself, or by reading forums for workers.


Why shouldn’t someone do it for five dollars per hour, and do it properly lest they get fired? Seems like a very easy job and easy to supervise.


Money isn't what gets people motivated most of the time. This job is boring and unfulfilling, you'll get poor results unless you're offering a life changing amount of money.


If it's such an easy job, why outsource it instead of doing it yourself?


Because there’s other stuff you need to do that you can’t outsource.


Surely if it's such an easy job, you can fit it in alongside your other tasks.

If you can't then perhaps it's worth re-examining if it's as easy as you think it is.


It’s easy but requires time. Like turning a page is easy but turning a million pages takes time.


That's what I'm saying. Classifying one data point is very straightforward and brings negligible value to a company. Reliably classifying hundreds of thousands of them is very complicated and not at all easily supervised. And if your company's business model is based on applying trained models to $real_world_problem, it doesn't just bring a lot of value to your company, it's literally critical for its success, just like a solid CI/CD pipeline or having a good security process.

It's attractive to think that this is just like classifying one data point over and over again. It's nothing like that, just like crossing the Atlantic from Galway to New York is nothing like kayaking around Mutton Island over and over again.


Are you trolling us here? The act of classifying is easy. It needs to be repeated many times. So you scale up by hiring many people to do it.


No, I am not trolling you here.

First of all, real-life data sets have hundreds of thousands of data points, and it's easy for a single person to classify a few hundred, maybe a few thousand. So scaling it up to a point where it's easy for every person involved requires hiring a team of dozens, or even 100+ people. That is absolutely not easy, especially not on short notice, and not when it's 100% a dead-end job, so it's difficult to convince people to come do it in the first place. I've yet to meet a single company whose idea of scaling it up involved something more elaborate than "we're gonna hire three freelancers". At 30k data points per person that's about an order of magnitude away from "easy", and transferring that order of magnitude to the hiring process ("we're gonna hire thirty freelancers") isn't trivial at all.

Second, it is absolutely not at all easy to supervise. QC for classification problems is comparable to QC for other easily-replicable but difficult to automate industrial processes, like semi-automatic manufacturing processes. There is ample literature on the topic, starting with the late period of the industrial revolution and going all the way to present times, and all of it suggests that it's a very hairy problem even without taking into account the human part. Perfect verification requires replicating the classification process. Verification by sampling makes it very hard to guarantee the accuracy requirements of the model. Checking accuracy post-factum poses the same problems.

This idea that labeling training data is a simple job that you can just outsource somewhere cheap is the foundation of a bad strategy. Training data accuracy is absolutely critical. If you optimize for time and cost, you get exactly what you pay for: a rushed, cheap model.


Design a machine, possibly including humans as "parts," to automatically turn those million pages. Maybe the machine is just a single human, serially, turning pages, but that's slow. There must be something better!

How easy is it to design a reliable system, considering human/machine interaction and everything we know about human behavior and constraints like human attention, potential injury, dry fingers, etc.?

How simple is the design?


Because they can earn more rolling carts around at Walmart.


I mean for places where $5 per hour is an attractive wage.


> Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

The way I think Google does it with reCAPTCHA is to request classification for each item multiple times. If they differ, keep sending them out until you get a consensus on those items. It weeds out those responses that were just basically random clicks.
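Something like this, roughly (a minimal sketch with a made-up helper name and thresholds, not Google's actual scheme):

    from collections import Counter

    def consensus_label(labels, min_votes=3, min_agreement=0.7):
        """Return a label once enough raters agree; None means request another rating."""
        if len(labels) < min_votes:
            return None
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) >= min_agreement:
            return top_label
        return None  # no consensus yet, so the item goes back out to another rater

    # consensus_label(["joy", "joy", "anger"])        -> None (2/3 is below the 0.7 bar)
    # consensus_label(["joy", "joy", "joy", "anger"]) -> "joy"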


I haven't worked in this space specifically, but next time you are thinking of outsourcing this type of work I would suggest giving some Filipino VAs a shot. You can hire fluent, sometimes native English speakers who are motivated at $4-7/hr. Actually even less but I stick to the top of that range personally. (I use OnlineJobs.ph to find people)


I'd love to chat. Want to reach out to the email in my profile? I'm the founder of a startup solving this exact problem (https://www.surgehq.ai), and previously built the human computation platforms at a couple FAANGs (precisely because this was a huge issue I always faced internally).

We work with a lot of the top AI/NLP companies and research labs, and do both the "typical" data labeling work (sentiment analysis, text categorization, etc), but also a lot more advanced stuff (e.g., search evaluation, training the new wave of large language models, adversarial labeling, etc -- so not just distinguishing cats and dogs, but rather making full use of the power of the human mind!).


Good news for you: being your target audience, we actually did have you guys on our radar.

For the scale of our project, however, the price point was prohibitive.

We ended up building a small CLI tool that interactively trained the model and allowed us to focus on the most important messages (eg those where the positive/negative sentiment scores were closest, the labels with the smallest volume, etc).

EDIT: Looking at your website now, it seems like you also provide good tooling for doing these types of things yourself? If that’s the case, I wouldn’t have minded paying $50-$100 for a week of access to such a tool. But $20/hr to hire someone to classify data which we would still need to audit afterwards was too much for us.


$20/hour to classify data sounds reasonable though?

If you have more time than money it might not make sense, but at that price point I could save myself a lot of time by just working a few extra hours doing SE and let someone else do 3x that amount of labelling.


I fully agree $20/hr is reasonable, it just was too expensive for us at that time.

So in the end the whole problem boils down to “quality is (more) expensive”; but MTurk is a special case since they’re so heavily positioning themselves as “the” solution for this and they’re terrible.


Why don't they label using the same method as captcha

Show the same image to 10 people, and keep only those who have a high confidence


Because that literally costs 10x


> I wonder how others do this kind of thing. Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

Fwiw, in a similar situation I did about half myself and farmed out the other half to my retired parents. Using trusted people was the only way I found I could get high accuracy without spending thousands of dollars. But, as I frequently tell people, the hard part of ML isn’t the model/code, it’s the training data.


I was going to recommend https://gengo.com/sentiment-analysis/

But it seems they were acquired a few years ago, so I have no idea if the quality is still the same.


You have experience with this, so you're probably the perfect person to ask: why didn't you just use the old (inaccurate) labels, perform some clustering-based operation, and re-label the clusters? Does that even make sense?


The old (inaccurate) labels were such complete and utter shit (pardon my French) that they may as well have been random.

I honestly believe most MTurkers just clicked random BS in order to complete the tasks as soon as possible.

What I ended up doing was making a Python CLI-based tool that made it extremely fast for us to classify messages; after seeding it with about 1000 classifications, we would then focus on messages along certain dimensions: eg “contradictions” (“positive” and “negative” scores as close as possible, or “angry” and “happy”), “least” (it was surprisingly difficult to find positive and uplifting tweets, and you don’t want a dataset with 90% negative messages!), etc.
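Roughly, the selection logic looked something like this (a simplified sketch with hypothetical names, not our actual tool; it assumes you already have per-label probabilities from the seed model and running counts of how often each label has been assigned so far):

    import numpy as np

    def pick_next_batch(texts, probs, label_counts, batch_size=50):
        """texts: unlabeled messages; probs: (n_samples, n_labels) predicted probabilities."""
        # "contradictions": the top two label probabilities are nearly tied
        sorted_probs = np.sort(probs, axis=1)
        margin = sorted_probs[:, -1] - sorted_probs[:, -2]  # small margin = model is torn
        # "least": upweight items whose predicted label is under-represented so far
        predicted = np.argmax(probs, axis=1)
        rarity = 1.0 / (1.0 + np.array([label_counts[i] for i in predicted]))
        score = rarity - margin  # low margin and rare label -> high priority
        order = np.argsort(-score)[:batch_size]
        return [texts[i] for i in order]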

That way we worked our way through the dataset and were able to get a pretty decent dataset in about a week's time.

No idea how others approach this type of problem, but it’s what I came up with.


>I honestly believe most MTurkers just clicked random BS in order to complete the tasks as soon as possible.

This is correct based on my (limited) mechanical turk experience. Most tasks pay peanuts (minimum payout can be as low as $0.01) so the only reasonable way to make an income is to complete as many tasks as humanly possible, and doing anything but clicking random buttons would slow them down. I doubt paying more could overcome that because so many people engage with the platform in bad faith.


You can filter out bad workers by preparing an additional well-labeled dataset and removing those who make even a single mistake on it. You can also give the same task to several workers and check whether they give the same label. However, this won't protect against a bot using multiple accounts and giving answers based on a hash of the question, so that the same question gets the same answer in every account.
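A sketch of that gold-question filter (hypothetical data shapes and function name; in practice you'd probably allow a small error budget rather than zero mistakes):

    def trustworthy_workers(answers, gold, max_errors=0):
        """answers: {worker_id: {item_id: label}}; gold: {item_id: true_label}."""
        trusted = set()
        for worker, worker_answers in answers.items():
            errors = sum(
                1 for item, true_label in gold.items()
                if item in worker_answers and worker_answers[item] != true_label
            )
            if errors <= max_errors:
                trusted.add(worker)
        return trusted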


"removing those who made even a single mistake" means you don't want humans working on this lol. Humans will always make mistakes.


Doesn't this bias the labels?


probably does, but idk how much... if it's mostly okay it's gonna be easier to correct, I'd assume


> I wonder how others do this kind of thing. Assuming I have 100k text blurbs I want to classify for sentiment and/or emotion and want at least 95% accuracy, what are the options?

Use the consensus of 3 or more annotators (or median).


But is paying 3x $5/h getting better results than hiring a local collage student for $15/h?


Won't you just wind up with a collage from either source?

Answer is yes, because those $5/h workers are likely just as educated, but from a less affluent part of the world.


> I’m still shocked how low the quality of Mechanical Turk was for just sentiment

Given that MTurk workers earn based on how quickly they complete a task, not how accurately, I'm struggling to understand how anyone would expect quality.


Completely agree on the need for serious commitment and attention!

Funnily enough, though, many ML engineers and data scientists I know (even those at Google, etc., who depend on human-annotated datasets) aren't familiar with these kinds of errors. At least in my experience, many people rarely inspect their datasets -- they run their black box ML pipelines and compute their confusion matrices, but rarely look at their false positives/negatives to understand more viscerally where and why their models might be failing.
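Even a tiny helper for eyeballing errors goes a long way, something like this sketch (made-up name, assuming you have the raw texts plus scikit-learn-style y_true/y_pred arrays):

    import pandas as pd

    def inspect_errors(texts, y_true, y_pred, label, n=20):
        """Print a sample of false positives and false negatives for one label."""
        df = pd.DataFrame({"text": texts, "true": y_true, "pred": y_pred})
        false_pos = df[(df.pred == label) & (df.true != label)]
        false_neg = df[(df.pred != label) & (df.true == label)]
        print(f"--- false positives for {label} ---")
        print(false_pos.sample(min(n, len(false_pos)))[["text", "true"]].to_string())
        print(f"--- false negatives for {label} ---")
        print(false_neg.sample(min(n, len(false_neg)))[["text", "pred"]].to_string())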

Or when they do see labeling errors, many people chalk it up to "oh, it's just because emotions are subjective, overall I'm sure the labels are fine" without realizing the extent of the problem, or realizing that it's fixable and their data could actually be so much better.

One of my biggest frustrations actually is when great engineers do notice the errors and care, and try to fix them by improving guidelines -- but often the problem isn't the guidelines themselves (in this case, for example, it's not like people don't know what JOY and ANGER are! creating 30 pages of guidelines isn't going to help), but rather that the labeling infrastructure is broken or nonexistent from the beginning. Hence why Surge AI exists, and we're building what we're building :)


This kind of out-sourcing is quite common, and at the prices I heard some time ago, you could run each utterance by 3 to 5 people, which allows you to get an idea of the reliability.

But the core of the problem is 27 emotions. That's really asking for trouble.


> But the core of the problem is 27 emotions. That's really asking for trouble.

I'm not sure I agree. The examples highlighted in the blog post aren't cases of slight mislabels (like mislabeling frustration as anger, for example). They are often labeled polar opposite to what they should be.

Though perhaps what you are saying here is that low-paid workers won't bother to look through a list of 27 emotions to find the right one, and thus they are more likely to label at random.


I always wonder how much of this should actually be done to the "coding" [1] standards used in the social sciences. Social scientists working with qualitative data start their analysis by applying codes to the data in various specialized ways. In more rigorous studies, those codes are assigned simultaneously by different researchers and then cross-checked. There is a lot of literature on how to come up with codes (e.g., Grounded Theory) and how to go further. I keep thinking we need to bridge the gap between engineers and social scientists working on the same problems.

[1] https://en.wikipedia.org/wiki/Coding_(social_sciences)


In my first few months as programmer in the 90's, I realized that human inputs were sketchy. It was from a form field for which state someone was from (USA). Fifty possible states, 2500 different entries. Sure there was a bit of garbage, but 95+% were recognizable states...but why random capital letters? How do you get a space in Wyoming?

It was a good lesson at the start of my career that I see playing out over and over. When I see some cool demo of statistics across the country or globe, I'm more impressed by the effort of cleaning the data than the stats behind it.


Why can’t you run the dataset past several different labellers, building up a statistical probability for a label rather than a pure guarantee? It’s worth noting that different cultures have different interpretations of emotions too (famously, Russians don’t smile much even though they are extremely helpful in my experience; I’m still not quite sure what the head wobble in India actually means, let alone some of the finer misinterpretations that were going on when I lived in Japan).


Or pay better like other people are suggesting. If you have to "average over" 3 data sets why not just pay $15 instead of $5 and save the computation if $15 or so seemed to be the threshold for getting humans to be good data labelers?


>Farm it out to low-paid non-native speakers

The paper claims:

>“All raters are native English speakers from India.”


US English != Indian English especially if you have to actually know the cultural details behind some sentences. I bet US native speakers would have similar failure rates at labeling English sentences from Indian media, because they belong to different cultures.


The blog post highlights this specific point - "US English" and "Indian English" really aren't the same English (in fact, I'd probably go even further and state that "Reddit English" and "US English" probably aren't the same English either).

Likewise, the Common Voice English dataset isn't great for ASR training outside India, either. There's a huge proportion of Indian speakers, and their data doesn't really help train ASR systems for non-Indian accents.


Is the "right answer" a classification based on US English with background knowledge of US cultural background, or is the goal to build a global sentiment data set?

You and OP ("the Indian labelers don't know how to do it correctly") seem to want the former, so it would be good to state that goal upfront.


I think it’s valid to question US-centrism broadly in datasets like this, but presumably that choice was made in selecting the Reddit sample, which is dominated by North American English.

A model is more useful if it approximates the speaker’s intent, not the listener’s interpretation. “Right” is complicated in language, but it’s hard to see how you’d use a model full of cross-cultural misunderstandings.


https://www.heritagexperiential.org/language-policy-in-india...

>English, due to its ‘lingua franca’ status, is an aspiration language for most Indians – for learning English is viewed as a ticket to economic prosperity and social status. Thus almost all private schools in India are English medium. Many public schools, due to political compulsions, have the state’s official languages as the primary school language. English is introduced as a second language from grade 5 onwards.


Perfectly logical choice if you're building a machine to replace call centers.


Can a human validate a label with less effort than it took to create it? Or maybe validating statistically is enough?


Depends what it is. I've had reasonable success with "validating" ASR transcripts by loading up the annotator's transcript, running the audio at 2x speed and just clicking "yes" or "no" to keep the good ones and bin the bad ones. It's roughly 5x faster to do this than to come up with transcripts from scratch, so if the annotators are 5x cheaper, then you come out ahead. You can go even further and pre-filter labels to discard any where inter-annotator agreement falls below some threshold (i.e. 3 people label the same piece of data, and you only include a sample when at least 2 annotators give the same label). You can also use that to discard all annotations from annotators who regularly disagree with the majority.
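The agreement filtering can be as simple as this sketch (hypothetical shapes: annotations maps each item to its {annotator: label} votes):

    from collections import Counter, defaultdict

    def filter_by_agreement(annotations, min_agree=2, max_disagree_rate=0.3):
        """Keep items where >= min_agree annotators match; flag habitual dissenters."""
        kept = {}
        disagreements, totals = defaultdict(int), defaultdict(int)
        for item, votes in annotations.items():
            majority_label, count = Counter(votes.values()).most_common(1)[0]
            for annotator, vote in votes.items():
                totals[annotator] += 1
                if vote != majority_label:
                    disagreements[annotator] += 1
            if count >= min_agree:
                kept[item] = majority_label
        noisy = {a for a in totals if disagreements[a] / totals[a] > max_disagree_rate}
        # optionally re-run the pass above with the noisy annotators' votes removed
        return kept, noisy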

This is just the reality of outsourced data labelling. One thing I think is really important is to structure the compensation well, so that labellers get paid more when they do a better job. Paying per sample is a terrible idea, and even I was guilty of this - back in university I was paid $20 or so to hand-write 500 words on a resistive touchscreen to train a handwriting recognition model. I won't say I half-assed it, but I remember trying to get through it as quickly as possible to get my money and go for beer (I think I also justified it to myself on the basis that sloppy samples would help make the bounds of the dataset distribution more robust!).


I mean, humans aren't great at reading emotions either. 30% error is probably human level.


The film director Alfred Hitchcock once commented that in a tense scene, all he needed was a character showing a more or less neutral face, and viewers would read what he needed into it.


that's a really interesting quote! I've thought a lot in the past about specific things in older movies that I enjoy and that some other people can't stand. I think this might be a big part of it.


> Hi dying, I'm dad! – mislabeled as NEUTRAL, likely because labelers don’t understand dad jokes

In their defense, what could be more True Neutral alignment than dad jokes? Nothing to gain but the quiet enjoyment of making the room groan and roll their eyes.

Really though, the issue here is context, but also the complexity of human communication. The sensitivity and tone highly depends on the situation. Clearly the preceding moment is someone stating "I'm dying". But that itself is contextual. Are they literally facing mortality, merely inconvenienced and being hyperbolic, or laughing? If the former, is "Hi Dying, I'm Dad" being glib, to soften the blow of a dire confession, or being highly insensitive and poking fun in a serious moment? Is it in the context of a longer joke, which subverts the meanings yet again?

A lot of these comments are worse than useless without context. Reddit really likes improv-banter style humor in comment chains. One comment builds on another builds on another, all referencing in-jokes, and usually slathered in sarcasm.

Honestly Reddit comments are probably one of the worst sources to try to build a sentiment model from, from an engineering perspective.


True, we’re trying to produce bots that reliably do things (make people laugh) that humans can’t even do reliably. People who can feel out a room and use the right joke, or the right reassurance or whatever, are not even very common.


I dunno if the intent of this dataset is to produce bots that can make people laugh. I think (intentional) comedy is the ultimate Turing test. I say intentional because there are things like https://inspirobot.me/ which are essentially glorified Markov models and it's downright hilarious the stuff it comes up with, but I think that's primarily due to absurdist humor and subversion of expectation (and unintentional ironic pairing with the picture). That's very different than communicating something, intended to be a joke, having it land, having it be funny, and deliberately so, not just because it was silly or non-sequitur.

I think it's still valuable to be able to detect when something is joke/satire/sarcasm/irony/slang, especially in the context of content moderation, because quite often it totally flips the sentiment valence. A perfect example is "I'm literally dying" - "literally" meaning in the exact or truest sense, "dying" meaning sloughing off the mortal coil (very bad)- vs "literally" meaning "figuratively, but in an extreme sense" and "dying" from laughter (very good).


I was born and bred in the land of the Bard and yet I also mislabelled roughly the same 30% that they did. In my opinion that was mostly caused by the lack of context (eg, 'Traps')

As an example of the above, I assumed the 'traps' one meant "his mouth is so big that it shuts out the sun" (aka an insult), since to 'shut your trap' means to shut your mouth / stop talking. Once there was a body-building context, I worked out that it was a reference to a person's trapezius muscles and therefore the sentiment was (most likely) Positive rather than the Negative/Confrontational/Sarcasm label that I would have first assigned it.

There are similar examples but that gives a rough idea about why context is important for sentiment data-set labelling.

But: #1, in a Mechanical Turk setup, who has time to scan through paragraphs? And #2, how far back do you go to get the full picture?

I don't think you can, so why not do it by hiring a temp for an in-house two-week gig? Cheaper, and you can directly monitor their performance. Win-win.


Let's say you can label 2 comments a minute, you'd have to spend 3,625 work-hours to label comments, or about five people working full-time for a month. How much money did they save by using cheaper labour from India? Basically bugger all, and the money is wasted, too. Penny wise, pound foolish.


Maybe they already have a pool of workers in India and they use it for all sorts of tasks. If that's the case, they might have had to start a new process to get people in the US to label those sentences. Starting processes costs time, money, and executives' political capital. Using an existing one is nearly free.


Heaven forbid that Google could start a process! Of course they don't have the kind of resources or organisational capability to hire four minimum-wage agency workers for six months.

NB: For any bots labelling my comment, please label this as SARCASM.


And this is even full-time in the sense of 24/7!

If you assume that people only work 8 hours a day, it would take 22 people to do it within a month.

Though, on second thought, with current inflation rates people probably need to work 720 hours a month ;P


It was supposed to be 8 hour days ... maybe I messed up the calculation. Either way, even if it's ten-fold: it's really not all that much money for a company Google's size, especially considering the "cheap" alternative gives poor results (which also translate to monetary costs).


I assumed I was quite fluent in English, even in slang, having seen a fair share of both American and British movies.

Now that I see the examples given, I think I would have mislabeled most of them too, even if I were highly motivated to label them.

Though it's normal for any language, it's very interesting how variable English slang is between dialects and time periods. There is so much regional slang whose nuances I can't fully understand.

A few examples from this dataset, that I would not have labeled correctly:

- daaaaaamn girl! – mislabeled as ANGER

- [NAME] wept. – mislabeled as SADNESS

- [NAME] is bae, how dare you. – mislabeled as ANGER

And don't get me started on Australian/NZ slang. It's a completely different world.


I think that in many of these cases the deciding factor is not only fluency in the language, but also the harder problem of context. Labelling individual sentences without context is hard enough, but what makes it worse is that it then spreads to the mistaken assumption that sentences can be analysed without context based on the initial training.

I would argue that the very idea of "sentiment analysis" as applied to individual tweets is flawed... and that's even before we get into the much, much harder problems of sarcasm and irony.


Now all we need is for social media companies to have users do the tagging. Then the data they sell will be even more valuable!

Oh. Shoot me now.

Note to future taggers: I am not suicidal.


Some of these would confuse native British English speakers too, for what it's worth. The first and third are African-American vernacular, at least originally. If you haven't seen a US movie where a character literally exclaims "daaaaamn girl!" in an approving voice, you're going to pretty reliably mislabel that one regardless of where or how you learned English. The second is a meme reference and how you label it is going to be dependent on how much time you spend on reddit, not your level of English skill.


Movies don't use the language mislabeled here. Youtube and Twitch do, sometimes excessively so.

I'm pretty sure you would be able to find half of the "mislabeled as negative" sentences shown in a single 15-minute "Among Us" video.

There are others I would have mislabeled too. I think it shows how you need a grasp of the subculture the comment is coming from to reliably label it. Most 15-year olds would probably ace those examples, no cap.


Really curious what makes Aussie and Kiwi slang so different. I didn't think we were that different.


Reminds me of the stat I heard that humans are only 70% accurate at sentiment analysis, because different people will not agree on the appropriate sentiment label. That sets a theoretical limit on the effectiveness of machine-learning algorithms, because if humans can't agree, then any product that needs to take an opinion is going to be wrong 30% of the time. (This is probably also why Big Tech companies are leaning so heavily into personalization.)

Also reminds me of when I asked a veteran therapist what the most surprising part of his job was, and it was:

1.) The variety of ways that different people perceive a given situation, and just how much neurodiversity is out there.

2.) How everybody expects that everyone else will see the situation exactly the same way they do.


You definitely need more labels when categorizing sentiment than just emotion/valence. Possibly even degrees/confidence by the annotator. Heck, have several annotators label it, and derive a "Controversial" feature based on the spread of the ratings.
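One way to derive that "Controversial" feature (just a sketch, not something from the article) is the normalized entropy of the rating distribution:

    import math
    from collections import Counter

    def controversy_score(ratings):
        """Normalized entropy of annotator labels: 0 = unanimous, 1 = maximally split."""
        counts = Counter(ratings)
        n = len(ratings)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
        return entropy / max_entropy

    # controversy_score(["joy", "joy", "joy"])        -> 0.0
    # controversy_score(["joy", "anger", "neutral"])  -> 1.0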

Only tangentially related, is your username a pun on "Nostradamus", "Nostril", and "Nasal Demons"? If so, that is very witty!

http://www.catb.org/jargon/html/N/nasal-demons.html


That sets the theoretical limit for an algorithm trained on a dataset labeled by outsiders. Most people should be able to label the sentiment of their own statements with much higher accuracy.

Making such a dataset is much harder than letting Mechanical Turk workers label reddit comments, and you somehow have to set up a situation where people are honest about their labels, but the rewards might be worth it.


That's the personalization angle. A lot of effort's being expended on transfer learning + training at the edge, where you start with a general model trained on humanity and then it gradually learns about the specific human(s) it's interacting with.


My idea is more along the lines of asking each redditor to label the sentiment of 20 of their own (recent) comments, building a dataset and model from that, instead of having unrelated people guess what they meant.

Personalizing to the actual subculture the interaction takes place in would be another step. You probably need both.


This is a genuinely great read. The author does a great job providing examples where context is critical, and explains that the dataset doesn't just have labeling errors; an even deeper problem is how it models language in general.

Since words only have meaning within a context, your model should reflect that somehow.

What wasn't really explored in this article was to what quantitative degree context sensitivity matters. The counter-examples are great, but how can we measure the relationship between the amount of context and labelling accuracy?


Context is key.

My dog has a better understanding of context than any AI I've ever met. (I mean that sincerely - since becoming a pet owner it's something I've really marveled at).


Seriously. Half the time, I feel like the command I'm giving is just being interpreted as "do the needful", and she just figures it out, based mostly on tone, context, and memory.


Great question! I'd love to measure that more rigorously too.

Although from what we've seen, how much context sensitivity matters really depends on the labeling task / application.

For example, when you're trying to label a tweet that's a reply, context matters even more than when you're labeling a parent tweet: it's often hard to understand what the reply tweet is talking about when you can't see the full thread, it can be hard to tell whether something is a joke or an insult when you can't tell whether the replier and original tweeter follow each other or not, etc. This is important because sometimes our customers don't realize this, and will send us tweet text by itself instead of a full tweet link.

It's also important because even if your models are using text alone (and not a richer set of context/features), there may be patterns in the text itself that an ML could pick up on that a human wouldn't without that extra context.

We also have another post on context sensitivity if you're curious: https://www.surgehq.ai/blog/why-context-aware-datasets-are-c...


“To what quantitative degree does context sensitivity matter” sounds like the notable phrase here; I’m guessing such measurements do not exist yet.

As an end user I face similar problems in UI translations: a lot of failed translations are made on context-deprived text, under the false notion that additional context is only required in nuanced edge cases. In reality it is almost always necessary, especially in UI text where words are more loaded and supplemented by visual indications.

“Context”, as often used, might need to be defined with more depth. As it stands it is used almost as a post-hoc explanation for why a particular output of an arbitrary language-related task is considered incorrect and should be discarded.


Language is hard! Even I, a seasoned native internet dork, have trouble knowing if someone's comment is sarcasm, irony, or something in between. Also, new phrases emerge all the time that turn a phrase on its head, and it has a different emotion.

How many feelings can you evoke with a simple, FUCK!


And this is probably going to get even worse the more automatic classification is used to promote or silence content.

A pretty interesting result of this is what I'd call "TikTok speak", where words are replaced, either by similar sounding ones ("porn" => "corn", often times just the corn emoji) or by neologisms ("to kill" => "to unalive"), in the hope of getting around the filters.

This turns natural language on the internet into even more of a moving target than it already used to be.


The most interesting thing, imo, is that people often say one thing but put a similar-sounding word or a homophone in the subtitles, and the filter seems to trust the user-supplied subtitles.

I hope nobody trains speech-to-text systems on a TikTok dataset.


The core of the problem is unsolvable, since any automated system can be defeated by a sufficiently motivated human.


I feel like this is basically a subset of the translation problem, which you probably need actual artificial general intelligence for (because you need to be able to model another human mind to a certain degree). Here's a neat video covering the topic [1].

1: https://www.youtube.com/watch?v=GAgp7nXdkLU


Further compounded by the fact that often if I say something and get asked, "was that sarcasm, irony, or something in between?" the best I can probably do is "Eh, more or less".


The last person I'd expect to be good at detecting sarcasm, irony or any intonation change is an internet dork.


> let’s look at the labeling methodology described in the paper. To quote Section 3.3:

> “Reddit comments were presented [to labelers] with no additional metadata (such as the author or subreddit).”

> “All raters are native English speakers from India.”

This does not look good even on paper. No wonder the errors were abundant.

Also a labeling system that has no entry for sarcasm is totally going to work guys!!1 /s


Isn't knowing the subreddit valuable information for determining sentiment?


Yep. And familiarity with Reddit/subreddit memes and inside jokes. And entire subreddits devoted to parodying styles of comment which are utterly earnest on other subreddits, which I'm not sure how you'd even begin to classify...


But it also applies to the model. If you label data taking into account the context (which subreddit this was posted in), your model must also take this into account, or it will be wrong as well. If the same sentence can be labeled two different ways depending on the source, then the model must also know the source or it won't know what to do. But then you didn't create a general language model, you created a Reddit language model.


Excellent point.


Correct, the article has some good examples of this.

But even without that, the context or just the topic of the discussion should help


Strange - GPT3 works well. I wrote a prompt:

Write an emotion that is expressed in a given image label.

Label: "[label]"

Emotion: [filled by GPT3]

Then, for "you almost blew my fucking mind there." -> "Surprise", for "hell yeah my brother" -> "Pride", "Nobody has the money to. What a joke" -> "Anger". Though, to be fair, for "Yay, cold McDonald's. My favorite." it was "Happiness". Still better than the crowdsourced human baseline.
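For reference, the call looked roughly like this (a sketch using the pre-1.0 openai Python client as it existed in 2022; the exact engine name is an assumption, swap in whichever davinci variant you have access to):

    import openai  # older (pre-1.0) client
    # openai.api_key = "..."  # set your API key first

    PROMPT = (
        'Write an emotion that is expressed in a given image label.\n\n'
        'Label: "{label}"\n\n'
        'Emotion:'
    )

    def gpt3_emotion(text):
        response = openai.Completion.create(
            engine="text-davinci-002",  # assumed engine name
            prompt=PROMPT.format(label=text),
            max_tokens=5,
            temperature=0,
        )
        return response["choices"][0]["text"].strip()

    # gpt3_emotion("hell yeah my brother") -> something like "Pride"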



Wow, this explains a lot. I wonder if they're as inept when it comes to their search tech. The search result quality these days certainly speaks volumes.


Someone needs to write a sci-fi short story where in the future, a google AI is trained to maximize human happiness, but its ability to predict human happiness relies on this dataset of mislabeled human emotions farmed out to underpaid Indian workers.


Searching and labeling are vastly different areas. Google already proved their expertise in search years ago - what they do now is expand and adapt to changes.


When I was small, I decided that our neatly ordered little drawers of Lego would be much better if they were jumbled up - every drawer would then contain a sort of average collection so I would be able to just open one at random to get the part I needed. It seems that Google have applied that philosophy to search results.


Did you design Amazon's warehousing system?


I have three questions now:

* How much (per comment) are these "native speakers from India" paid?

* How many comments do they have to label in an hour (or in a minute)? I guess it's more than 2 comments in a minute.

* What if the comment is sarcastic and this can only be understood from its context?


Based on anecdotal knowledge of wages here, I would be surprised if they receive more than $300-$350 per month total.


> LETS FUCKING GOOOOO

Could be either anger (let’s fight) or enthusiasm (let’s do it!). Hard problem.


I don't see anger in "LETS F**ING GOOOOO". It's just a comment that says "let's do X" in impatient and enthusiastic manner.


How about, "LETS F*ING GOOOOO YOU DINGBAT"? Now, if that's a comment between friends, the addition of the insult might be said in jest and still be impatient/enthusiastic; or does adding the insult to the end automatically make it combative?

I realise this wasn't part of the dataset; I'm more making the point that written language without context (and sometimes even with it) is subject to huge amounts of reader interpretation.


Agree that context is often needed! (Which is why it was strange to us that raters weren't presented with any context besides the comment text itself -- not even the subreddit, much less the original Reddit post.)

One interesting question, though: if "LETS FUCKING GOOOOO YOU DINGBAT" were meant to be a combative insult, would someone still add a bunch of O's ("GOOOOO" instead of merely "go")? My intuition is that if combativeness were intended, "let's fucking go, you dingbat" would be more likely than "LETS FUCKING GOOOOO YOU DINGBAT", but of course it's a bit hard to say without that context.


This one is indeed hard to label.

PS Hope they will never begin solving philosophy problems with ML.


As a native speaker of Reddit English, "LET'S GOOOOO" is a slang expression that means "yay". People who write it are usually not literally suggesting that the reader go somewhere or do something.


Yeah, fight.


That is 100% enthusiasm, not ambiguous to me at all.

But I'm also certain both my parents would read that as antagonistic.


It could also be 100% irony. Without context it's hard to tell.


Which applies to a lot of online content, especially on Reddit

Kamala 2020!!!! could be a ringing endorsement, a lighthearted parody of ringing endorsement or utter derision depending on context so I'm not even sure the commenter classifying it as "neutral" was wrong.


Let's say you are bored out of your mind at a party or while waiting in line at the bank and tell your friend "LETS FUCKING GOOOOO". It might not be anger but definitely not enthusiasm either.


Ok, but now this person has done a bunch of unpaid work for them, just to publish an article, and now they can write some easy scripts to label any occurrences of 'daa+amn girl' as approval (etc. etc.) and in the end only 28% of the dataset will be mislabeled. The system works!


I have some familiarity with sentiment / intent detection in context-heavy environments (gaming and VR) and absolutely agree that labeling is both a fundamental and very nuanced problem. An ML PhD was hired to work on toxicity detection, and a primary activity in his first several months was manually watching and labeling game replays - what a use of all that education!

There's something to be said for utilizing community-based reporting as a form of expert labeling for integrity issues specifically, but that's not a silver bullet and has its own baggage.


Gaming is a really fun and interesting labeling domain, given the community jargon (I'm actually a big Twitch user, but still couldn't tell you what many common emotes mean... took me years to understand "poggers") and context (is "i'm going to kill Garen" a death threat or in-game action?).


Indian English is just as valid as American English. The problem here is they used Indian English speakers to rate Reddit comments, most of which are using American English idioms.


I don't even know how well American English generalizes within itself. I've found the internet diaspora has their own language patterns which often differ quite a bit from "normies". You also have a tremendous amount of code-switching based on the platform, by the same individuals. I would actually suspect that groups from the same platform but different native language might cluster more closely, than same-native-language but culturally/socioeconomically/regionally different.


At our university we do the classification ourselves, usually with 3-5 labelers per item. And it is surprising how high the rate of disagreement between labelers is (even in binary or 4-class classification). People on the internet don't always understand sarcasm, so maybe we need to benchmark humans at this task. But yeah, this and similar datasets (in Spanish, for example) have tons of misclassified texts.


No shock there, really. It's a complete waste of time to have a data set that requires good English fluency (and even an understanding of the culture and memes of Reddit) labeled by people without that background. You're literally burning money by getting anyone other than actual redditors to label it.


The author "previously led AI, Data Science, and Human Computation orgs at Google, Facebook, and Twitter."

And is now writing an ad critical of the company he worked at, about an area he was involved in leading.

This is an interesting route to take in a career. Work for a company, make mistakes, move to another company, and use your old mistakes as a selling point for the new company.

I know this is a harsh take, but it doesn't instill any confidence in the results here. What happens when mistakes happen at Surge? Are the people who made the mistakes going to be around to fix them, or are they going to jet off to another position where they once again talk about their previous failures.


I work at a company that focuses on automating the data labelling process for computer vision. It is clear that generating massive amounts of labels, either by hand or automatically, without the ability to ensure a consistent level of quality across the dataset, is a problem. Which is why we are investing in automating the QA process for training data so mistakes like these don't happen:

https://blog.encord.com/post/automating-the-assessment-of-tr...


That's interesting, I took the opposite approach when realizing how bad label quality can be and built a site where everything is manually checked by moderators.


Anyone interested in this application of machine learning might be interested to read How Emotions are Made by Dr. Lisa Feldman Barrett. She makes a compelling case that emotions cannot be reliably understood through facial expressions alone, and that context must be included to improve our own human accuracy at the task, let alone machine accuracy. While this article is about a textual dataset and so not an exact parallel, I think some of the same principles apply — namely that greater context is often needed to interpret an emotion from a message.


I've worked on a project with more difficult labeling, and we were able to get fairly accurate results. There are tons of standard practices that produce better results, so why did Google ignore them?


Could you point me to resources about best practices in this domain? I've struggled with this and it would potentially help.


I wish I could. I know what I wrote was kind of a tease.

We hired domain experts to build our protocol. I know there were best practices they adhered to, because we were one of several groups all of whose experts independently generated similar protocols, and because of the language they used to discuss it with each other.

Mostly we hired people with the best academic research credentials in the field we were researching that we could afford, to build the labeling protocol. Which surprised me, because it was expensive, but it wasn't "I have Google money behind me" expensive.


This points out the much more ubiquitous problem of people simply misunderstanding one another. It's very, very common for someone to post a comment intending to emphasize or convey one idea, but it gets interpreted as emphasizing or meaning something different, just because it's read by a person different than who wrote it.

It's not limited to Reddit comments, or even to written communication either. "You're ignoring me!" "No, I'm trying to give you space."


I can see how this issue will happen frequently, in diverse domains and abundantly.

In other words, the prediction is that the most likely outcome will be a lot of AI systems trained to be quite imbecilic, and optimally so.

The danger is that real people might be assumed to be guilty of things due to AI trained and automated imbecility.

It's an ethical problem for the AI community and product designers.


You get what you paid for!


Step 1: train a model to deduce emotion from vocal tone. Step 2: use a model to transcribe the text. Step 3: use both as inputs for a sentiment model.

Use some intelligence, people.


> (Who said you can't be a professional memelord?)

Ah, so there's hope for the kids after all!



