It’s not copyright violation to train ML on content. So the license doesn’t matter unless there’s some “can’t use this for ML training” license that I don’t know about (and such a restriction doesn’t seem to be legally enforceable).
> It’s not copyright violation to train ML on content.
The training itself is not a copyright violation; that seems to be settled case law. Whether the verbatim copying that results from that training is a copyright violation is, I think, less tested.
Let’s flip the domains. Say we had an ML algorithm that could auto-generate news stories, and at some point (not all the time) it copied a Wall Street Journal article verbatim and posted it to a blog. Copyright violation?
With Copilot, we’re sometimes seeing “paragraphs” of source lines copied verbatim, so this analogy is not much of a stretch.
I think we need to think about how much our sharing culture in programming has tinted our view of the legality of this enterprise.
> It’s not copyright violation to train ML on content.
I agree. It'd be a nice gesture to reach out to the creators of the training data, like is usual with web scrapers. But collecting and analyzing data publicly available on the web is ok.
> So the license doesn’t matter unless there’s some “can’t use this for ML training” license that I don’t know about (and doesn’t seem to be legal).
I disagree. While Copilot is, at heart, an ML model, the copyright trouble comes from its usage. It consumes copyrighted code (ok), analyzes copyrighted code (still ok), and then produces code that is sometimes a copy of copyrighted code (not ok). The only way it'd be ok is if Copilot followed all licensing requirements when it produced copies of other works.
Personally, I won't touch it for work until either Copilot abides by the licenses or there's robust case law.
I've contacted websites about scraping when it'd be a repeat thing and they didn't have a robots.txt file available. Also if their stance on enforcing copyright was hazy (e.g. medical coding created by a non-profit). Sometimes, they pointed me toward an API I didn't know about.
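As an aside, checking a site's robots.txt before scraping is cheap to do programmatically. A minimal sketch using Python's standard-library `urllib.robotparser` (the robots.txt content, the `MyBot` user agent, and the example.com URLs here are hypothetical):

```python
from urllib import robotparser

# Hypothetical robots.txt content, as a site might serve it.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

# Parse the rules locally; in practice you'd call rp.set_url(...)
# and rp.read() to fetch the live file instead.
rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask whether a scraper identified as "MyBot" may fetch each path.
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # True
```

Of course, robots.txt is only a convention, not a license, which is why reaching out directly still matters for hazier cases.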
> I don’t think this is practical.
I don't like people ignoring things just because they're impractical for ML. That leads to crap like automated account banning with no possibility of talking to a live customer service representative.