That was probably me. We have two sides to the webspam team at Google: engineering and manual. We definitely prefer to write algorithms so that we avoid dealing with individual websites--the idea is that you strive to fix the root cause of an issue, not to tackle specific sites. However, if we see a website that violates our guidelines and that gets past the algorithms, we are willing to take manual action. Where possible, we use the output of the manual team not only to reduce spam itself, but to train the next iteration of algorithms.
For example, one of the big issues in blackhat spam this past year was illegally hacked sites. Our algorithms weren't doing the best job on hacked sites, so the manual team kept an eye out for hacked sites to remove them (and often to alert the website owners that they'd been hacked). The data generated by the manual team helped us build and deploy multiple new algorithms to detect hacked sites, leading to a 90% reduction in the number of hacked sites showing up in Google's search results in the past few months. That decrease in hacked spam in turn frees up the manual team to tackle the next bleeding-edge technique the spammers use.
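To make that manual-labels-to-algorithm loop concrete, here's a toy sketch. Everything here is a hypothetical illustration--the feature names, the frequency-based weighting, and the scoring rule are stand-ins, not Google's actual signals; a real system would use a proper trained classifier.

```python
# Toy sketch of the loop described above: pages the manual team flagged
# become labeled data, and per-feature weights learned from those labels
# drive the next automated pass. All features here are hypothetical.

from dataclasses import dataclass


@dataclass
class Page:
    url: str
    features: dict  # e.g. {"obfuscated_js": 1, "hidden_links": 1}


def train_weights(labeled_pages):
    """Learn per-feature weights from manually flagged pages.

    labeled_pages: list of (Page, is_hacked) pairs. Weight = fraction of
    hacked pages showing the feature (a trivial stand-in for a real model).
    """
    counts = {}
    hacked_total = 0
    for page, is_hacked in labeled_pages:
        if not is_hacked:
            continue
        hacked_total += 1
        for feat, val in page.features.items():
            if val:
                counts[feat] = counts.get(feat, 0) + 1
    return {f: c / hacked_total for f, c in counts.items()}


def score(page, weights):
    """Higher score = more suspicious, per the learned weights."""
    return sum(w for f, w in weights.items() if page.features.get(f))


# The manual team's flagged examples feed the next algorithm iteration:
labeled = [
    (Page("a.example", {"obfuscated_js": 1, "hidden_links": 1}), True),
    (Page("b.example", {"obfuscated_js": 1}), True),
    (Page("c.example", {"hidden_links": 0}), False),
]
weights = train_weights(labeled)
suspect = Page("d.example", {"obfuscated_js": 1})
print(score(suspect, weights) > 0.5)  # flags the page for human review
```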
I suspect every major search engine uses similar approaches: try to stop the majority of spam with algorithms, but be willing to take manual action in the meantime while engineers work to improve the algorithms.
Great to know. Out of curiosity, in this particular case, did you save supposed violations for each site, or did you blacklist all of them based on a few?
It varies for different cases depending on a lot of factors like severity, impact on users, etc. In the particular case from above, to find out the history of what might have happened, I just picked a domain at random and dug into its history, and found the autogenerated pages with tons of typos.
I could post more examples from the other domains, but my point is that this is the sort of thing that users dislike and complain about. If you were a blogger and saw pages like this ranking for your name or your site's name, you probably wouldn't be happy either. From looking at a few domains, I don't think that we overgeneralized from a few pages in this case.
I know that you've moved on and the domains are shut down now. And I'm not trying to be cantankerous. I'm just trying to say that from our point of view there are good reasons to take action on sites like this so that users don't complain to us.
So, basically what you're saying is I went wrong with the typos? I got really excited by my algo and was overzealous with adding it. I believe I did take it off the sites I issued re-inclusion requests for, but they never got re-included and I never got any messages back (to my knowledge). Also, the typos were not on every one of those domains.
Each site actually took a long time to make. They either involved generating a data set from scratch or piecing together and parsing other large data sets. For this one in particular, I was crawling the Web for feed discovery and was planning on adding things like grouping the best posts by category, etc.
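For what it's worth, feed discovery of the kind mentioned above usually means scanning a page's `<head>` for `<link rel="alternate">` tags pointing at RSS/Atom feeds. A minimal sketch using only the Python standard library (the sample HTML is made up):

```python
# Feed autodiscovery sketch: collect hrefs from <link rel="alternate">
# tags whose type is an RSS or Atom MIME type.

from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel") == "alternate" and a.get("type") in self.FEED_TYPES:
            self.feeds.append(a.get("href"))


page = """<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
<link rel="stylesheet" href="/style.css">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(page)
print(finder.feeds)
```

A real crawler would also resolve relative hrefs against the page URL and handle pages with multiple feeds.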
Yeah, would love to know about some others, e.g. japanese2englishdictionary.com, idnscan.com, serverslist.com. Also, did you actually get any complaints about this or was it triggered by some other threshold/thing? On a side note, I still get requests about exposing some of this data, i.e. sites behind ip addresses or lists of domains matching some criteria. In any case, thx for the info!
I can understand the need to take action. I just think it could have been handled better. If typos were the problem, I would have removed them immediately if someone told me, and that could have been automated. In retrospect, it seems pretty obvious, but it wasn't at the time.
The typos were definitely going overboard. I can understand the appeal of "I've got this great tool--what can I do with it?" But we get a lot of complaints about typo spam, so that's a sensitive issue. I definitely would have done less of that.
There's also a class of folks we call navigation spammers who try to show up for tons of domain name queries. I can give you some history to provide context. In the old days, when you searched for [myspace.com] we'd show a single result as if someone had done the query [info:myspace.com]. The problem is that people would misspell it and do the query [mypsace.com], and then we'd end up either showing no result or (usually) showing a low-quality typo-squatting url. So we made url queries be a string search, so [myspace.com] would return 10 results. That way if someone misspelled the query, they might get the exact-match bad url at #1, but they'd probably get the right answer somewhere else in the top 10. Overall, the change was a big win, because 10% of our queries are misspelled. But if you're showing 10 results for url queries, now there's an opportunity for spammers to SEO for url queries and get dregs of traffic from the #2 to #10 positions. Now we're getting closer to present-day, so I'll just say we've made algorithmic changes to reduce the impact of that.
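The string-search behavior described above can be sketched as a toy: a domain-shaped query returns up to 10 results, so even when a typo-squat exact match ranks #1, near-matches of the query surface the intended site further down. The index, the domain regex, and the similarity threshold here are all invented stand-ins, not anything resembling a real ranking system.

```python
# Toy string-search for url queries: exact matches first, then pages
# for queries that are close misspellings, capped at 10 results.

import difflib
import re

# Hypothetical index: query string -> ranked pages. The typo-squat entry
# mirrors the low-quality exact-match url mentioned above.
INDEX = {
    "myspace.com": ["myspace.com", "en.wikipedia.org/wiki/Myspace"],
    "mypsace.com": ["mypsace-typosquat.example"],
}


def is_url_query(q):
    """Very rough check for domain-shaped queries like [myspace.com]."""
    return re.fullmatch(r"[\w-]+(\.[\w-]+)+", q) is not None


def search(q, index=INDEX):
    """Return up to 10 results, treating the url query as a string."""
    results = list(index.get(q, []))  # the exact-match url may rank #1
    # Near-miss keys (toy similarity rule) pull in the intended site:
    for key, pages in index.items():
        if key != q and difflib.SequenceMatcher(None, key, q).ratio() > 0.8:
            results.extend(pages)
    return results[:10]


hits = search("mypsace.com")
# The typo-squat may sit at #1, but the real site appears further down:
print(hits)
```

With a single exact-match result the misspelling would return only the typo-squat; returning 10 string-matched results is what gives the right answer a chance to appear.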
But you were hitting a bunch of different factors: tons of typos, specifically for misspelled url queries, autogenerated content, lots of different domain names that looked to have a fair amount of overlap (expireddomainscan.com, registereddomainscan.com, refundeddomainscan.com, etc.). If you were doing this again, I'd recommend fewer domain names and putting more UI/value-add work on the individual domains.