What about something like http://tubes.io?

freshhawk · on Dec 9, 2012

We're fine with scrapers and scraping infrastructure, although tubes.io is a very interesting idea.

I'm more interested in what I can do to write fewer scrapers, since the content is, at a high level, relatively similar. I've just started experimenting with "generic" scrapers that try to extract the data without depending on markup. It will eventually work well enough, but getting the error rate down to an acceptable level is going to take a lot of tweaking and trial and error.
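To make "without depending on markup" concrete, here is a rough sketch of one common heuristic, not the actual code from these experiments: split the page into blocks of text, penalize blocks that are mostly link text, and keep the highest-scoring block. The names `BlockCollector` and `extract_main_text`, the tag sets, and the scoring formula are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose text is never treated as content (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries; text between two boundaries is one block.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br",
              "article", "section", "body", "h1", "h2", "h3"}


class BlockCollector(HTMLParser):
    """Collects (text, link_chars) pairs, ignoring ids and class names."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, normalizing runs of whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html):
    """Return the block with the most non-link text, or "" if none."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    best, best_score = "", 0.0
    for text, link_chars in parser.blocks:
        link_density = min(1.0, link_chars / len(text))
        score = len(text) * (1.0 - link_density)
        if score > best_score:
            best, best_score = text, score
    return best
```

The same function can be pointed at a div-based page and a table-based page, since it never looks at selectors. A heuristic like this will misfire on short articles and link-heavy content, which is where the tweaking comes in.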

There are a few papers on this, but not much out there. That's why I was interested in someone else working on the same problem in a different space.