> But in this case it is obvious that the AI isn't writing the code - at least not all the time, it is instead choosing what to copy - verbatim.
I still don't see any problem with that. If it's larger sections (e.g. entire NON-TRIVIAL function bodies), those can be filtered or correctly attributed after inference. So that's just a technicality.
Smaller snippets and trivial or mechanical implementations (generated code, API calls, API access patterns) aren't subject to any kind of protection anyway.
int main(int argc, char* argv[]) {
Lines like that hold no intellectual value and can be found in GPL'ed code. It can be argued that that's a verbatim reproduction, yet it's not a violation of any kind in any reasonable context.
Where do you draw the line and how would you be able to - automatically even! - decide what does and does not represent a significant verbatim reproduction?
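To make the question concrete, here is a minimal sketch of the kind of automatic check people seem to have in mind. Everything in it is an assumption of mine: the function names, the crude tokenizer, and above all the threshold are hypothetical and have nothing to do with how Copilot works or with any legal standard. It just measures the longest run of consecutive tokens that a generated snippet shares with a known source.

```python
import re

def tokens(code):
    """Split source text into identifiers, numbers and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def longest_shared_run(generated, known):
    """Length of the longest run of consecutive tokens that appears in both inputs."""
    a, b = tokens(generated), tokens(known)
    best = 0
    prev = [0] * (len(b) + 1)  # run lengths for the previous token of `a`
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # extend the run that ended at the previous token of both inputs
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

# Arbitrary cutoff, purely an assumption: picking this number IS the hard part.
THRESHOLD = 30

def looks_copied(generated, known, threshold=THRESHOLD):
    """Flag a snippet whose longest shared token run reaches the threshold."""
    return longest_shared_run(generated, known) >= threshold
```

The main() line above is a verbatim match of 13 tokens against any file that contains it, so it stays under this cutoff, but nothing about 30 is principled. Set it lower and you flag boilerplate; set it higher and a short but genuinely clever function slips through.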
> but those aren't only mathematics. There's the choice of variable names, the order in which things are called (maybe to optimize the performance on some CPU, we don't know), etc.
Your original argument is based on the false premise that the amount of time or effort matters -- it doesn't. Not all human activity can or should be subject to copyright -- this is the dangerous slippery slope of "intellectual property" -- and we are dangling by the edge these days.
> I still don't see any problem with that. If it's larger sections (e.g. entire NON-TRIVIAL function bodies), those can be filtered or correctly attributed after inference. So that's just a technicality.
Today Copilot does what it does.
I've never heard Microsoft defend anyone running afoul of some of their licensing details with "they can fix it later, it is just a technicality".
I think this should go both ways? No?
> Smaller snippets and trivial or mechanical implementations (generated code, API calls, API access patterns) aren't subject to any kind of protection anyway.
int main(int argc, char* argv[]) {
> Lines like that hold no intellectual value and can be found in GPL'ed code. It can be argued that that's a verbatim reproduction, yet it's not a violation of any kind in any reasonable context.
Totally agree. Edit: otherwise we'd all be in serious trouble.
> Where do you draw the line and how would you be able to - automatically even! - decide what does and does not represent a significant verbatim reproduction?
I am not a lawyer, but I guess many can agree that there is a line somewhere before the point of copying functions verbatim, with the comments literally copied as well for good measure.
On the other hand: if there was significant evidence that the AI was doing creative work, not just (or partially just) copying, then I think I would say it was OK even if it arrived at that knowledge by reading copyrighted works.
Edit: how could we know if it was doing creative work? First, because it wouldn't be literally the same. Literal copying is literal copying regardless of whether it is done using Xerox, paid writers, infinite monkeys on infinite typewriters, "AI" or actual strong AI.
After that it becomes a bit more fuzzy as more possibilities open up:
- for student works I look at how well adapted it is to the question at hand: a good answer from Stack Overflow, attributed properly and adapted to the coding style of the code base? Absolutely OK. Copying together a bunch of stuff from examples on the framework's website? Fine. Reading through all the docs, looking at how a number of high-profile projects have done it in their open source solutions, and updating the README.md with info on why this solution was chosen? Now you are looking for a top grade in my class.
(Of course, IBM will probably not want you to work on their compiler if you admit that you've studied OpenJDK's, or so I have heard.)
It's also not a commercially released product yet, but a technical preview, so uncovering and addressing issues like that is exactly what pre-release versions are for.
I'd say it succeeded greatly in sparking a discussion about these issues.
> ... will you defend it just because I claim it is a tech preview?
That's a straw man argument and you know it.
Code snippets are in no way, shape or form comparable to entire software products, and Copilot neither installs anything nor is intended to knowingly violate licenses or copyright law.
Disingenuous straw manning like this doesn't help the discussion and only serves to distract from actual issues.
In my opinion it is absolutely not a straw man, and that particular idea did not cross my mind at all, so the claim that I knew it was one is doubly false.
But let me try to be constructive here and be even more precise:
Would it be OK if I launched a tech preview of my AI poem writer companion that would copy lines but also complete stanzas from famous poets, rock bands and singer-songwriters?
> Would it be OK if I launched a tech preview of my AI poem writer companion that would copy lines but also complete stanzas from famous poets, rock bands and singer-songwriters?
Yes it would be if it only happened ~0.1% of the time and if quoting verbatim wasn't the intended function of the system but merely a side-effect. In fact, that's what artists sometimes do deliberately.
It's what happens with other generative models as well, and all that needs to happen is to educate users about the possibility of this. As long as you don't take ownership of the output produced by your AI (and neither does Microsoft), it's at the discretion of the user what they use the generated content for and in which context.
It has been demonstrated that training data can be extracted from any large NLP model [0] so this wouldn't come as a surprise either.