Visualizing language usage in New York Times news coverage (nytlabs.com)
49 points by lelf on July 26, 2014 | 45 comments


I love how strikingly editorial changes show up:

"Theater" vs. "Theatre": http://chronicle.nytlabs.com/?keyword=theater.theatre

As well as how quickly culturally accepted terms are replaced:

"Mrs." vs. "Ms.": http://chronicle.nytlabs.com/?keyword=mrs..ms.


The trends for "teenager" and "teen" were surprising:

http://chronicle.nytlabs.com/?keyword=teenager.teen.adolesce...




Between 1975 and 1976, they seem to switch from "per cent" to "percent".

http://chronicle.nytlabs.com/?keyword=percent.per%20cent


Mrs. vs Ms. is even better with "Miss" as well: http://chronicle.nytlabs.com/?keyword=mrs.miss.ms


The OCR anomalies are very interesting...for example, it seems that the word "fuck" occasionally found its way into the NYT pages throughout history:

http://chronicle.nytlabs.com/?keyword=fuck&format=count

But if you look at the pre-digital-text archive period (pre-1990?), you won't see "fuck" in the actual articles, such as the ones in the 1880s...I tried other cusswords and saw that they would appear in the very-old stories but no other time in NYT history...So I'm guessing that there is some fuzzy matching going on to compensate for the text in the scanned archives.


There are several hits for 'Internet' in 1853, and hundreds for 'email' in the 1890s. To avoid that, maybe they could weight the OCR recognition based on a word's first occurrence in dictionaries (backdated a decade or two).
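A rough sketch of that idea in Python, using a hard cutoff rather than a true weighting. The `FIRST_ATTESTED` table, its years, and the function names are all hypothetical placeholders for illustration; real first-occurrence dates would have to come from an actual dictionary, and nothing here reflects how Chronicle works internally.

```python
# Hypothetical first-attestation years, for illustration only.
# Real values would need to be looked up in a dictionary.
FIRST_ATTESTED = {"internet": 1974, "email": 1979, "telegraph": 1794}


def is_plausible(word, year, slack_years=20):
    """Return True if an OCR hit for `word` in `year` is not anachronistic.

    Words missing from the table are accepted, since nothing is known about them.
    `slack_years` backdates the dictionary date "a decade or two".
    """
    first = FIRST_ATTESTED.get(word.lower())
    if first is None:
        return True
    return year >= first - slack_years


def filter_hits(hits, slack_years=20):
    """Keep only the (word, year) hits that pass the plausibility check."""
    return [(w, y) for (w, y) in hits if is_plausible(w, y, slack_years)]


hits = [("Internet", 1853), ("email", 1895), ("email", 1995), ("telegraph", 1853)]
print(filter_hits(hits))  # [('email', 1995), ('telegraph', 1853)]
```

A softer version could lower the OCR confidence for early hits instead of dropping them outright, which would be closer to "weighting".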


I was about to ask why the word "equation" has been trending upwards since 1960 even though the related word "equations" (plural) has not. It also doesn't exactly track the word "mathematics". But after I added the word "music", it reminded me that the Y axis shows "equation" at less than 1%. I suppose that's within the realm of random noise as far as word count statistics go. Maybe the NYT had an odd mix of staff writers who happened to use "equation" more often, in journalism unrelated to any larger cultural context.

http://chronicle.nytlabs.com/?keyword=equation.mathematics.m...

(unclick "music")


Perhaps non-mathematical usage of "equation" in phrases like "entered the equation" and "wasn't part of the equation" has been increasing?

I certainly hear it in conversation often enough, but I've never taken note of how frequently it's written.


Yes, that seems possible.

The Google Books Ngram Viewer doesn't have any hits for "enter the equation". However, the words "equation" and "equations" have the opposite trend from the NYT.[1] But to emphasize again, the hits are less than .01%, so calling it a "trend" is potentially an abuse of the word.

[1]https://books.google.com/ngrams/graph?content=equation%2Cequ...


actually, both "equation" and "equations" trend up, just not at the same rate. Look at them independently.


Love it.

Just a simple one of past presidents (last names) http://chronicle.nytlabs.com/?keyword=clinton.bush.obama.rea... (didn't want to try ford because of the automotive bailout, but that might have given better results)

http://chronicle.nytlabs.com/?keyword=social%20security.bail...

Seems like "jobs" wasn't thrown around during the Great Depression like it is today.

These match up pretty well...

http://chronicle.nytlabs.com/?keyword=politics.supreme%20cou...


try this one for presidents: http://chronicle.nytlabs.com/?keyword=president%20nixon.pres...

Seems like "jobs" has somewhat replaced "employment" and "unemployed": http://chronicle.nytlabs.com/?keyword=jobs.unemployed.unempl...


I'd be careful with "jobs" since case appears to be ignored, so it's likely picking up references also to Steve Jobs (or anyone with that last name)
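To illustrate the difference, here is a small Python sketch comparing case-insensitive and case-sensitive whole-word counts. The `count_word` helper and the sample sentence are made up for the example; how Chronicle actually tokenizes and matches text is not known from this thread.

```python
import re


def count_word(text, word, case_sensitive=False):
    """Count whole-word occurrences of `word` in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags))


sample = "Steve Jobs unveiled the phone. The plan would add jobs. Jobs said little."
print(count_word(sample, "jobs"))                       # 3: includes the surname
print(count_word(sample, "jobs", case_sensitive=True))  # 1: lowercase only
```

Case-sensitive matching is not a complete fix either: it would also miss the ordinary noun at the start of a sentence ("Jobs are scarce").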



I think his vs her is more interesting than men vs women.

http://chronicle.nytlabs.com/?keyword=his.her

Surprisingly, the difference seems to be fairly constant.


I would expect that to be more constant when individuals are addressed. My interest was towards addressing men and women as a group, which - to some degree - would reflect the 'relevance' of the group for the given time in media.


It took me a long time to realize that articles related to a topic can be seen by clicking on the year. This instruction is hidden below the screen fold. It'd also be nice to see actual numbers (along with the %) for each year during hover.

It'd also be useful to see a curated list of topics to select from, instead of just randomly picking some related topics on page load. Hopefully they harvest some interesting suggestions they receive through @nyt. There are some fun topic-suggestions in this thread already.

Overall, a nifty tool that I wish existed for all news sources in the world and worked across all languages.

Thanks to the creator(s)!


Wow, nice. And look, maybe it can even predict a likely future:

http://chronicle.nytlabs.com/?keyword=vietnam.korea.iraq.rus...


This surprised me; even at the height of the phrase "war on terror", it was still matched by "cold war", which was going down, then bounced up: http://chronicle.nytlabs.com/?keyword=cold%20war.war%20on%20...

Also, a nice way of visualizing just how far back this data reaches: http://chronicle.nytlabs.com/?keyword=reconstruction



I love how "Netscape", "Loudcloud", and "Opsware" were mere flickering events for the NYT, but "Andreessen" got traction and has been on an upwards trend for the last ten years. Also - a unique spelling, while probably annoying when he was growing up, certainly makes it easier to track his presence in media.


Does anyone have an idea why the occurrence of "New York" drastically drops (62,000 articles vs. 29,000 articles) between 1980 and 1981?

http://chronicle.nytlabs.com/?keyword=new%20york&format=coun...


Perhaps they stopped using datelines for stories?

It was probably a change in style or format, not a reduction in the number of stories about New York, because the percentages of stories about other cities drop in a similar way at the same time.

http://chronicle.nytlabs.com/?keyword=new%20york.washington....

Bonus: at the end of that chart, I think you can see when the NYT really did begin to expand its proportion of non-local news.


Aaah, that makes sense. Thanks!



What if more articles were published in a particular year? I mean, I assume the data would be skewed if more words existed in a particular year. I figured they'd measure how the language of, say, a random bunch of 5000 words has changed over the course of the life of the NYT.
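For what it's worth, plotting the share of articles rather than the raw count is the usual way to control for this. A minimal Python sketch, with made-up numbers and a hypothetical `share_of_articles` helper (this is not Chronicle's actual code):

```python
def share_of_articles(matches_by_year, totals_by_year):
    """Return {year: percent of that year's articles containing the term}."""
    shares = {}
    for year, matches in matches_by_year.items():
        total = totals_by_year.get(year, 0)
        if total > 0:  # skip years with no articles to avoid dividing by zero
            shares[year] = 100.0 * matches / total
    return shares


# Made-up numbers: the raw count doubles, but so does the number of articles.
matches = {1900: 500, 1950: 1000}
totals = {1900: 50_000, 1950: 100_000}
print(share_of_articles(matches, totals))  # {1900: 1.0, 1950: 1.0}
```

The raw counts would suggest the term doubled in popularity, while the percentage shows no change at all.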


Search Query: ipad. This article looks interesting: http://www.nytimes.com/1999/11/11/technology/state-of-the-ar...


These plots would be even more interesting with a logarithmic scale option.

In Chronicle-golf, the most common word I can find is "new" at 74% of articles in 1978 (which beats out "who", "what", "when", "why", and "how"...).


"I" gets ~79% in 1869.


Whoa - "can" scores 100% in only one year.


The obvious one, some religious denominations: http://chronicle.nytlabs.com/?keyword=jewish.christian.musli...


http://chronicle.nytlabs.com/?keyword=nigger.chink.chinaman

I wonder how much of the change is terms becoming slurs and how much is NYT racism.


How expensive food items are "created": truffles, caviar: http://chronicle.nytlabs.com/?keyword=truffles.caviar


Civil War, World War II, and more!

http://chronicle.nytlabs.com/?keyword=assault.attack

Ok, what happened in 1871?



The largest percentages I've found so far are for the words "president" and "war".


The trends for "russia" and "military" worry me.


An interesting one: try "code". There's an odd spike in 1934


It's very likely to be related to the National Recovery Administration, which was formed in 1933, and set price codes and codes of fair practice. It must have generated a huge amount of public debate and lots of news stories.

http://en.wikipedia.org/wiki/National_Recovery_Administratio...


c# was way more popular in the 1800's than today.


Interesting that "republican" is much more common than "democrat" and has been for all the decades covered by the graph.


You're comparing different things there: Republican is both the noun and adjective form, while Democrat is only the noun form. "Clinton is a Democrat" and "Bush is a Republican", but "Obama was the 2008 Democratic candidate" vs. "Dole was the 1996 Republican candidate".

You could try to adjust for that by comparing Republican vs. the sum of Democratic+Democrat, but that also pulls in unrelated uses of both terms: "democratic reforms in $countryname" and "Irish republicans", especially since it isn't case-sensitive. Which then probably overcounts "democratic", because it's used in that non-US-party sense more than "republican" is.

In general this kind of thing makes it very tricky to conclude things from pure word or n-gram frequency counts, since without more semantic annotation there are a ton of confounding issues.
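A toy Python sketch of the Democratic+Democrat adjustment, and of how it pulls in the unrelated sense. The sample text and the `count_forms` helper are invented for illustration only.

```python
from collections import Counter
import re


def count_forms(text, forms):
    """Count case-insensitive whole-word occurrences of each form in `forms`."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {form: tokens[form] for form in forms}


sample = (
    "The Democratic candidate spoke. A Republican senator replied. "
    "She is a Democrat. Democratic reforms abroad continued. "
    "The Republican candidate and Irish republicans were mentioned."
)
counts = count_forms(sample, ["democrat", "democratic", "republican"])
democratic_total = counts["democrat"] + counts["democratic"]
print(counts)            # {'democrat': 1, 'democratic': 2, 'republican': 2}
print(democratic_total)  # 3
```

The summed total of 3 includes "Democratic reforms abroad", which has nothing to do with the US party, so the adjusted comparison overcounts in exactly the way described above.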


If you graph "republicans" and "democrats", the graphs are almost identical.



