Visualizing language usage in New York Times news coverage

But if you look at the pre-digital-text archive period (pre-1990?), you won't see "fuck" in the actual articles, such as the ones in the 1880s...I tried other cusswords and saw that they would appear in the very-old stories but no other time in NYT history...So I'm guessing that there is some fuzzy matching going on to compensate for the text in the scanned archives.

FatalLogic · on July 26, 2014

There are several hits for 'Internet' in 1853, and hundreds for 'email' in the 1890s. To avoid that, maybe they could weight the OCR recognition based on a word's first occurrence in dictionaries (backdated a decade or two).
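
A minimal sketch of that date-gating idea, assuming a lookup table of first-attestation years is available. The years in `FIRST_ATTESTED`, the function names, and the 20-year slack are placeholders for illustration and not anything Chronicle is known to use:

```python
# Placeholder first-attestation years (illustrative only, not checked against a dictionary).
FIRST_ATTESTED = {"internet": 1974, "email": 1979, "telegram": 1852}


def plausible_hit(word, article_year, first_attested=FIRST_ATTESTED, slack_years=20):
    """Return True if an OCR match for `word` is plausible in `article_year`.

    A match is rejected only when the article predates the word's first
    attestation by more than `slack_years` (the "backdated a decade or two" part).
    """
    year = first_attested.get(word.lower())
    if year is None:
        return True  # unknown word: no basis for rejecting the match
    return article_year >= year - slack_years


def filter_hits(hits):
    """Keep only the (word, article_year) pairs that pass the date gate."""
    return [(word, year) for word, year in hits if plausible_hit(word, year)]


hits = [("internet", 1853), ("email", 1895), ("telegram", 1880)]
print(filter_hits(hits))
```

A hard cutoff like this would also throw away genuine early uses, so lowering the OCR confidence score instead of dropping the hit may be the gentler option.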

jasode · on July 26, 2014

I was about to ask why the word "equation" has been trending upwards since 1960 even though the related word "equations" (plural) has not. It also doesn't exactly track the word "mathematics". But after I added the word "music", it reminded me that the Y axis shows "equation" at less than 1%. I suppose that's within the realm of random noise as far as word count statistics go. Maybe the NYT had an odd mix of staff writers who happened to use "equation" more often, in journalism that's unrelated to any larger cultural context.

http://chronicle.nytlabs.com/?keyword=equation.mathematics.m...

(unclick "music")

discostrings · on July 26, 2014

Perhaps non-mathematical usage of "equation" in phrases like "entered the equation" and "wasn't part of the equation" has been increasing?

I certainly hear it in conversation often enough, but I've never taken note of how frequently it's written.

jasode · on July 26, 2014

Yes, that seems possible.

The Google Books Ngram Viewer doesn't have any hits for "enter the equation". However, the words "equation" and "equations" have the opposite trend from the NYT.[1] But to emphasize again, the hits are less than .01%, so calling this a "trend" is potentially abusing the word.

[1]https://books.google.com/ngrams/graph?content=equation%2Cequ...

eevilspock · on July 26, 2014

Actually, both "equation" and "equations" trend up, just not at the same rate. Look at them independently.

Glide · on July 26, 2014

Love it.

Just a simple one of past presidents (last names) http://chronicle.nytlabs.com/?keyword=clinton.bush.obama.rea... (didn't want to try ford because of the automotive bailout, but that might have given better results)

http://chronicle.nytlabs.com/?keyword=social%20security.bail...

Seems like "jobs" wasn't thrown around during the Great Depression like it is today.

These match up pretty well...

http://chronicle.nytlabs.com/?keyword=politics.supreme%20cou...

eevilspock · on July 26, 2014

try this one for presidents: http://chronicle.nytlabs.com/?keyword=president%20nixon.pres...

Seems like "jobs" has somewhat replaced "employment" and "unemployed": http://chronicle.nytlabs.com/?keyword=jobs.unemployed.unempl...

jojohack · on July 27, 2014

I'd be careful with "jobs": case appears to be ignored, so it's likely also picking up references to Steve Jobs (or anyone with that last name).
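
To illustrate the concern, here is a small sketch of how case-insensitive matching merges the common noun with the surname. The sample sentence and the `count_word` helper are made up for this example and say nothing about how Chronicle actually tokenizes articles:

```python
import re


def count_word(text, word, case_sensitive=False):
    """Count whole-word occurrences of `word` in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags))


sample = "Steve Jobs spoke about jobs. New jobs were added, Jobs said."

print(count_word(sample, "jobs"))                       # case ignored: noun and surname together
print(count_word(sample, "jobs", case_sensitive=True))  # lowercase noun only
print(count_word(sample, "Jobs", case_sensitive=True))  # capitalized form only
```

Even the case-sensitive count is imperfect, since a sentence that starts with "Jobs" as a noun would still be counted with the surname.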

eamsen · on July 26, 2014

Some interesting symbolic stats:

[1] http://chronicle.nytlabs.com/?keyword=computer.smartphone.mo...

[2] http://chronicle.nytlabs.com/?keyword=security.freedom

[3] http://chronicle.nytlabs.com/?keyword=men.women

[4] http://chronicle.nytlabs.com/?keyword=homophobic.homosexual....

[5] http://chronicle.nytlabs.com/?keyword=love.money

pavanky · on July 26, 2014

I think his vs her is more interesting than men vs women.

http://chronicle.nytlabs.com/?keyword=his.her

Surprisingly, the difference seems to be fairly constant.

eamsen · on July 27, 2014

I would expect that to be more constant when individuals are addressed. My interest was towards addressing men and women as a group, which - to some degree - would reflect the 'relevance' of the group for the given time in media.

pumainmotion · on July 27, 2014

It took me a long time to realize that articles related to a topic can be seen by clicking on the year. This instruction is hidden below the screen fold. It'd also be nice to see actual numbers (along with the %) for each year during hover.

It'd also be useful to see a curated list of topics to select from, instead of just randomly picking some related topics on page load. Hopefully they harvest some interesting suggestions they receive through @nyt. There are some fun topic-suggestions in this thread already.

Overall, a nifty tool that I wish existed for all news sources in the world and worked across all languages.

Thanks to the creator(s)!

bellerocky · on July 26, 2014

Wow, nice. And look, maybe it can even predict a likely future:

http://chronicle.nytlabs.com/?keyword=vietnam.korea.iraq.rus...

scott_s · on July 26, 2014

This surprised me; even at the height of the phrase "war on terror", it was still matched by "cold war", which was going down, then bounced up: http://chronicle.nytlabs.com/?keyword=cold%20war.war%20on%20...

Also, a nice way of visualizing just how far back this data reaches: http://chronicle.nytlabs.com/?keyword=reconstruction

chatmasta · on July 26, 2014

Here's an encouraging one:

http://chronicle.nytlabs.com/?keyword=beatles.bieber

ghshephard · on July 26, 2014

I love how "Netscape", "Loudcloud", and "Opsware" were mere flickering events for the NYT, but "Andreessen" got traction and has been on an upwards trend for the last ten years. Also - a unique spelling, while probably annoying when he was growing up, certainly makes it easier to track his presence in media.

davidbarker · on July 26, 2014

Does anyone have an idea why the occurrence of "New York" drastically drops (62,000 articles vs. 29,000 articles) between 1980 and 1981?

http://chronicle.nytlabs.com/?keyword=new%20york&format=coun...

FatalLogic · on July 26, 2014

Perhaps they stopped using datelines for stories?

It was probably a change in style or format, not a reduction in the number of stories about New York, because the percentages of stories about other cities drop in a similar way at the same time.

http://chronicle.nytlabs.com/?keyword=new%20york.washington....

Bonus: at the end of that chart, I think you can see when the NYT really did begin to expand its proportion of non-local news.

davidbarker · on July 26, 2014

Aaah, that makes sense. Thanks!

pdevr · on July 26, 2014

Note: Don't take the titles seriously.

Publishing schedule changes? http://chronicle.nytlabs.com/?keyword=saturday.sunday.monday...

Strong correlation: http://chronicle.nytlabs.com/?keyword=moon.mars

Each peak higher than the previous leader: http://chronicle.nytlabs.com/?keyword=microsoft.google.faceb...

Resilience: http://chronicle.nytlabs.com/?keyword=computer.phone

And then there was one: http://chronicle.nytlabs.com/?keyword=fax.iphone

Exponential decrease of area: http://chronicle.nytlabs.com/?keyword=telegram.fax.iphone

There's still hope: http://chronicle.nytlabs.com/?keyword=science.pop%20music

We meet again, after almost a century: http://chronicle.nytlabs.com/?keyword=america.russia

The baby boomers had it so good: http://chronicle.nytlabs.com/?keyword=marriage.divorce

Didn't last: http://chronicle.nytlabs.com/?keyword=blessings

Interesting: http://chronicle.nytlabs.com/?keyword=luck

Added:

Increasing aspirations or inflation? http://chronicle.nytlabs.com/?keyword=millionaire.billionair...

Obligatory: http://chronicle.nytlabs.com/?keyword=hacker

debt · on July 27, 2014

What if more articles were published in a particular year? I mean, I assume the data would be skewed if more words existed in a particular year. I figured they'd measure how the language of, say, a random bunch of 5000 words has changed over the course of the life of the NYT.
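
One common way to handle that skew is to report the share of a year's articles that mention a term instead of the raw count, which appears to be what the percentage view in this thread shows. A rough sketch, with entirely hypothetical counts and an assumed definition of the percentage (matching articles divided by all articles that year):

```python
def share_of_articles(matching, total):
    """Percentage of a year's articles that mention the term."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matching / total


# Hypothetical (matching articles, total articles) per year, for illustration only.
counts = {1950: (500, 100_000), 2000: (400, 50_000)}

shares = {year: share_of_articles(m, t) for year, (m, t) in counts.items()}
for year, pct in shares.items():
    print(year, f"{pct:.2f}%")
```

In this made-up data the raw count falls from 500 to 400 while the share rises, so the two views can point in opposite directions when the yearly article volume changes.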

kasperset · on July 26, 2014

Search Query: ipad. This article looks interesting: http://www.nytimes.com/1999/11/11/technology/state-of-the-ar...

ISL · on July 26, 2014

These plots would be even more interesting with a logarithmic scale option.

In Chronicle-golf, the most common word I can find is "new" at 74% of articles in 1978 (which beats out "who", "what", "when", "why", and "how"...).

andrioni · on July 26, 2014

"I" gets ~79% in 1869.

ISL · on July 26, 2014

Whoa - "can" scores 100% in only one year.

closetnerd · on July 26, 2014

The obvious one, some religious denominations: http://chronicle.nytlabs.com/?keyword=jewish.christian.musli...

mnarayan01 · on July 26, 2014

http://chronicle.nytlabs.com/?keyword=nigger.chink.chinaman

I wonder how much of the change is terms becoming slurs and how much is NYT racism.

pdevr · on July 26, 2014

How expensive food items are "created": truffles, caviar: http://chronicle.nytlabs.com/?keyword=truffles.caviar

pravda · on July 26, 2014

Civil War, World War II, and more!

http://chronicle.nytlabs.com/?keyword=assault.attack

Ok, what happened in 1871?

johncoltrane · on July 26, 2014

http://chronicle.nytlabs.com/?keyword=negro

officialjunk · on July 26, 2014

The largest percentages I've found so far are for the words "president" and "war".

draugadrotten · on July 26, 2014

The trends for "russia" and "military" worry me.

BTurkE · on July 26, 2014

An interesting one: try "code". There's an odd spike in 1934

FatalLogic · on July 26, 2014

It's very likely to be related to the National Recovery Administration, which was formed in 1933, and set price codes and codes of fair practice. It must have generated a huge amount of public debate and lots of news stories.

http://en.wikipedia.org/wiki/National_Recovery_Administratio...

c0achmcguirk · on July 26, 2014

c# was way more popular in the 1800's than today.

lkrubner · on July 26, 2014

Interesting that "republican" is much more common than "democrat" and has been for all the decades covered by the graph.

_delirium · on July 26, 2014

You're comparing different things there: Republican is both the noun and adjective form, while Democrat is only the noun form. "Clinton is a Democrat" and "Bush is a Republican", but "Obama was the 2008 Democratic candidate" vs. "Dole was the 1996 Republican candidate".

You could try to adjust for that by comparing Republican vs. the sum of Democratic+Democrat, but that also pulls in unrelated uses of both terms: "democratic reforms in $countryname" and "Irish republicans", especially since it isn't case-sensitive. Which then probably overcounts "democratic", because it's used in that non-US-party sense more than "republican" is.

In general this kind of thing makes it very tricky to conclude things from pure word or n-gram frequency counts, since without more semantic annotation there are a ton of confounding issues.
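
A small sketch of the adjustment described above: sum "Democrat" and "Democratic" and compare against "Republican", once with case respected and once with case ignored. The sample text and the `count_terms` helper are invented for illustration:

```python
import re
from collections import Counter


def count_terms(text, terms, case_sensitive=True):
    """Count whole-word occurrences of each term in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    counts = Counter()
    for term in terms:
        counts[term] = len(re.findall(rf"\b{re.escape(term)}\b", text, flags))
    return counts


sample = (
    "Clinton is a Democrat. Dole was the Republican candidate. "
    "The Democratic candidate spoke about democratic reforms abroad. "
    "Irish republicans met."
)
terms = ["Democrat", "Democratic", "Republican"]

strict = count_terms(sample, terms, case_sensitive=True)
loose = count_terms(sample, terms, case_sensitive=False)

print("case-sensitive:  ", strict["Democrat"] + strict["Democratic"], "vs", strict["Republican"])
print("case-insensitive:", loose["Democrat"] + loose["Democratic"], "vs", loose["Republican"])
```

With case ignored, "democratic reforms" gets folded into the party total, which is the overcounting problem mentioned above. Respecting case helps but does not solve it, because capitalized headlines and sentence-initial words still blur the two senses.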

samirmenon · on July 26, 2014

If you graph "republicans" and "democrats", the graphs are almost identical.