> Just like how people are allowed to read websites, but scraping is often disallowed.
Hosting code on GitHub explicitly allows this type of usage (scraping) according to their TOS, so I have to ask again: why the sudden complaints?
Are we still talking about a shortcoming of the ML model, which very occasionally spits out a few lines of copied code, or should we include search engines in this, since they do exactly the same thing by design?
robots.txt, for example, is likewise non-binding and purely advisory, and Common Crawl [0] (also used for training GPT-3) publishes a dataset that by definition contains GPL'ed code, no matter where it's hosted. So is that off-limits now, too?
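
For what it's worth, here is roughly what "purely advisory" means in practice: a crawler only learns what robots.txt asks for if it bothers to check, and nothing technically prevents it from fetching the URL anyway. A minimal Python sketch using the standard library (the domain and user agent below are made-up placeholders):

    # Sketch: honoring robots.txt is entirely opt-in on the crawler's side.
    # "example.com" and "MyCrawler" are hypothetical placeholders.
    import urllib.robotparser
    import urllib.request

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the (advisory) rules

    url = "https://example.com/some/page"
    if rp.can_fetch("MyCrawler", url):
        page = urllib.request.urlopen(url)  # the "polite" path
    else:
        # Nothing stops a crawler from fetching the page regardless;
        # compliance is a convention, which is the whole point above.
        pass
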
Why would those be the same thing? It's a matter of scale. Just like how people are allowed to read websites, but scraping is often disallowed.