
Out of curiosity, how do we define license violation in that case? I, as a human being, have trained by reading code, much of which is covered by licenses that are somehow not compatible with code I'm writing. Am I violating licenses?

Asking seriously. It's really unclear to me where law and/or ethics put the boundaries. Also, I'd guess it's probably country dependent.



> I, as a human being, have trained by reading code, much of which is covered by licenses that are somehow not compatible with code I'm writing. Am I violating licenses?

As someone who has taught students in ICT, my quick rule of thumb was to pick a piece of text I suspected, wrap it in double quotes, and put it into a search engine.

9 times out of 10 - possibly more often - when I had that feeling, it was true. 17-year-olds don't write like seasoned reporters most of the time.

Obviously there needs to be some independent thought in there as well, but for teenagers I drew the line at not copying verbatim and at citing sources.

As we've seen demonstrated again and again, Copilot breaks both of my minimum-standard rules for teenagers: it copies verbatim and it doesn't cite sources.

I say that is pretty bad.

If the system had actually learned the structure and applied what it had learned to recreate the same functionality, it would be a whole different story.

But in this case it is obvious that the AI isn't writing the code - at least not all the time, it is instead choosing what to copy - verbatim.


> But in this case it is obvious that the AI isn't writing the code - at least not all the time, it is instead choosing what to copy - verbatim.

I still don't see any problem with that. If it's larger sections (e.g. entire NON-TRIVIAL function bodies), those can be filtered or correctly attributed after inference. So that's just a technicality.

Smaller snippets and trivial or mechanical implementations (generated code, API calls, API access patterns) aren't subject to any kind of protection anyway.

  int main(int argc, char* argv[]) {
Lines like that hold no intellectual value and can be found in GPL'ed code. It can be argued that that's a verbatim reproduction, yet it's not a violation of any kind in any reasonable context.

Where do you draw the line and how would you be able to - automatically even! - decide what does and does not represent a significant verbatim reproduction?


what about lines such as

    Idxs[i] += (Imm >> ((i * HalfLaneElts) % 8)) & ((1 << HalfLaneElts) - 1);

    double r2 = fma(u*v, fma(v, fma(v, fma(v, ca_4, ca_3), ca_2), ca_1), -correction);

    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

    qint32 val = d + (((fromX << 8) + 0xff - lx) * dd >> 8);
Even if it's one line, it likely took some non-negligible thinking time from the programmer.


What about E = mc^2 ?

Mathematics and physics equations are not copyrightable.


But those aren't only mathematics. There's the choice of variable names, the order in which things are called (maybe to optimize performance on some CPU, we don't know), etc.


Your original argument is based on the false premise that the amount of time or effort matters -- it doesn't. Not all human activity can or should be subject to copyright -- this is the dangerous slippery slope of "intellectual property" -- and we are dangling over the edge these days.


>I still don't see any problem with that. If it's larger sections (e.g. entire NON-TRIVIAL function bodies), those can be filtered or correctly attributed after inference. So that's just a technicality.

Today copilot does what it does.

I've never heard Microsoft defend anyone running afoul of some of their licensing details with "they can fix it later, it is just a technicality".

I think this should go both ways? No?

> Smaller snippets and trivial or mechanical implementations (generated code, API calls, API access patterns) aren't subject to any kind of protection anyway.

  int main(int argc, char* argv[]) {
> Lines like that hold no intellectual value and can be found in GPL'ed code. It can be argued that that's a verbatim reproduction, yet it's not a violation of any kind in any reasonable context.

Totally agree. Edit: otherwise we'd all be in serious trouble.

> Where do you draw the line and how would you be able to - automatically even! - decide what does and does not represent a significant verbatim reproduction?

I am not a lawyer, but I guess many can agree that somewhere before copying entire functions verbatim - comments literally copied as well, for good measure - there is a line.

On the other hand: if there was significant evidence that the AI was doing creative work, not just (or partially just) copying then I think I would say it was OK even if it arrived at that knowledge by reading copyrighted works.

Edit: how could we know if it was doing creative work? First, because the output wouldn't be literally the same. Literal copying is literal copying regardless of whether it is done using a Xerox machine, paid writers, infinite monkeys on infinite typewriters, "AI", or actual strong AI.

After that it becomes a bit more fuzzy as more possibilities open up:

- for student works I look at how well adapted it is to the question at hand: a good answer from Stack Overflow, attributed properly and adapted to the coding style of the code base? Absolutely OK. Piecing together a bunch of stuff from the examples on the framework's website? Fine. Reading through all the docs, looking at how a number of high-profile projects have done it in their open source solutions, and updating the README.md with info on why this solution was chosen? Now you are looking at a top grade in my class.

(Of course, IBM will probably not want you to work on their compiler if you admit that you've studied OpenJDK's, or so I have heard.)


> Today copilot does what it does.

It's also not a commercially released product yet, but a technical preview, so uncovering and addressing issues like that is exactly what pre-release versions are for.

I'd say it succeeded greatly in sparking a discussion about these issues.


If I release a piece of software today that installs Microsoft products but is stripped of all attributions and doesn't pay for any licenses,

... will you defend it just because I claim it is a tech preview?


> ... will you defend it just because I claim it is a tech preview?

That's a straw man argument and you know it.

Code snippets are in no way, shape, or form comparable to entire software products, and Copilot neither installs anything nor is it intended to knowingly violate licences or copyright law.

Disingenuous straw manning like this doesn't help the discussion and only serves to distract from actual issues.


> That's a straw man argument and you know it.

It is absolutely not, in my opinion, and that particular idea did not cross my mind at all, so the claim that I knew it is doubly false.

But let me try to be constructive here and be even more precise:

Would it be OK if I launched a tech preview of my AI poem writer companion that would copy lines but also complete stanzas from famous poets, rock bands and singer-songwriters?


> Would it be OK if I launched a tech preview of my AI poem writer companion that would copy lines but also complete stanzas from famous poets, rock bands and singer-songwriters?

Yes, it would be - if it only happened ~0.1% of the time and if quoting verbatim wasn't the intended function of the system but merely a side effect. In fact, that's what artists sometimes do deliberately.

It's what happens with other GANs as well and all that needs to happen is to educate users about the possibility of this. As long as you don't take ownership of the output produced by your AI (and neither do Microsoft), it's at the discretion of the user what they use the generated content for and in which context.

It has been demonstrated that training data can be extracted from any large NLP model [0] so this wouldn't come as a surprise either.

[0] https://arxiv.org/abs/2012.07805

https://towardsdatascience.com/openai-gpt-leaking-your-data-...


It’s not AI, it’s ML. GPT-3 is a very large ML model. It does not reason. It’s a statistical machine.


ML is a subset of AI in any definition that I've seen. And both are needlessly anthropomorphizing what are currently simple statistical or rule-based deduction engines.

GPT-3 is no more 'intelligent' in the human sense than it is 'learning' in the human sense.


By this logic there is no such thing as AI.


There's no such thing as AI.


Can you expand on this? Clearly the term exists. I have a degree in AI; do the concepts I learned at university not exist? What do you mean when you say AI does not exist?

Do you mean that the terms, algorithms, concepts, and applications found in the field labelled "Artificial Intelligence" should not be called as such?

I have a feeling you are simply playing a semantic game, though, in which case we are likely to talk past each other.

Edit: I suspect you may be conflating artificial general intelligence[0] with AI

[0]: https://en.wikipedia.org/wiki/Artificial_general_intelligenc...


> Out of curiosity, how do we define license violation in that case? I, as a human being, have trained by reading code, much of which is covered by licenses that are somehow not compatible with code I'm writing. Am I violating licenses?

That depends: if you end up writing copies of the code you've studied, then yes, you are on thin ice. Plagiarism is definitely something you can do with computer code. There have been several high-profile cases around this in the arts. As far as I can see, it usually ends up being a question of how much of the work is similar, how similar it is, and how unique the similar part is. An added wrinkle in programming is that some things can be done in only one way, or at least any reasonable programmer will do them in only one way. For example, a swap(var1, var2) function can usually be written in only one way, and therefore you would not get in trouble if your swap function and someone else's are the same.

I've been following the discussion about Copilot, and one issue that comes up again and again is that people seem to think that since Copilot is new, the law will treat it, and the code it writes, differently than it would treat you or a copy machine. I think that is naive; my impression is that courts care more about what you did than how you did it. If you think Copilot can be used to do an end run around the law, prepare to be disappointed.

So if Copilot memorizes code and spits out copies of that code, then it is at best skating on thin ice, and at worst committing a license violation. If the code it is copying is unique, then it is definitely heading into problematic territory. I'm fairly sure someone in legal at GitHub is very unhappy about the Quake fast inverse square root function.


My guess is that many people will use it on the backend where a copyright violation is hard to spot and even more difficult to prove.

As for frontend/open source etc. - sure, if you don't care about copyright and licensing, use it.


> swap(var1, var2)

Well, there's also the xor way to be pedantic :)

   var1 = var1 ^ var2
   var2 = var2 ^ var1
   var1 = var1 ^ var2
But yeah, not too much wiggle room there.


Another variation (assuming no overflows):

    var1 += var2;
    var2 = var1 - var2;
    var1 -= var2;
And another:

    var1 ^= var2 ^= var1 ^= var2;
Assembly even has an instruction for it:

    xchg eax, ecx


The training question seems much more difficult.

The main problem that has been the topic is a simpler one - about the produced work. If you exactly reproduce someone's existing code (doesn't matter if you copy by flipping bits one by one or which technology you use), isn't it a copyright violation?

I'm kind of imagining a Rube Goldberg machine that spells out the quake invsqrt function in the sand, now...


Yes, if you play a video from Netflix while recording your screen, transcode that video to MPEG2 and use a red laser to write a complex encoding of that MPEG2 bitstream onto a plastic disk, then send that by mail to your friend, a court won't care about the complexity of that Rube Goldberg machine. They will just say it's a clear copyright violation since you distributed a Netflix movie by DVD.

With programming, there's the further complication of what constitutes a work. But Quake's invsqrt certainly qualifies, just like that one function from the Oracle vs. Google case.


None of our laws were created under the assumption that computers would do so much of our jobs and affect so much of our lives - from robotic automation to social media to, now, computer programming. I think it's a real mistake to ask what the letter of the law currently means in this evolving context. Laws should serve us and need to be adapted.


Who is "us" that are being served?

I'm not the biggest fan of copyright law as currently written, but I wouldn't say that MS's desire to file off the serial numbers on every piece of public code for their own profit is a good impetus to rewrite the law.


> I, as a human being, have trained by reading code, much of which is covered by licenses that are somehow not compatible with code I'm writing. Am I violating licenses?

There are many good answers from the legal side. I would also attack it from this angle: the way human beings learn is entirely different from the way ML models are trained. We don't do gradient descent to find the slope of data points and find the most likely next bit of code.

We humans create rational models of the code and of the world, and use deduction from those models to create code. This is extremely visible in the way we can explain the reason behind our code, and in the way we are aware of the difference between copying code we've seen before vs writing new code. It's also visible in that we can be told rules and produce code that obeys those rules that doesn't resemble any code ever written before.

The difference is also easily quantifiable: humans learn to program after seeing vastly fewer code examples than Copilot needed, and we are much better at it.

One day, we will design an AI that does learn more similarly to how humans learn, and that day your question will be far more interesting. But we are far from such problems.


I'm not sure this is actually true. We can explain code, but the fact that we can explain code is not necessarily related to the way we actually end up writing it. Have you ever written a function "on autopilot"? Your brain has selected what you wanted it to do, and now you're just typing without thought? I don't think we're as dissimilar to this model as we'd like.


The feeling of being "on autopilot" when doing a task has to do with your, let's call it, supervisory process being otherwise occupied. It doesn't suggest that the other mental processes which are responsible for figuring out the actions have changed their character or mode of operation.

"You" are just not paying attention to it in that moment.


The fact remains that, even on autopilot, I'm not writing code based on similarity with other code I've seen; I'm writing code to solve a task. In general, the code I'm writing is entirely novel - you could search all of the code ever written and you wouldn't find anything identical, or even similar, much of the time. This is not a brag - I work on fairly standard CRUD stuff most of the time - but just an observation about how human writing works, confirmed by code-scanning tools such as Black Duck.


If you were to write large swaths of copyrighted code from memory then yes you'd be committing a copyright violation.

Most humans don't do so unintentionally though.


I’m not so sure Copilot is doing so “unintentionally” either...


Just as an example, this is very widespread in music though.


If the whole 'Dark Horse' debacle proved anything, it's that this can still be considered copyright infringement. Sure, that particular example was (rightly, IMHO) deemed not to be a copyright violation, but they still had to show their version was original enough; they couldn't just claim such copying was never an infringement.


I am not a lawyer but I am sure that any legal standard for ML has to be different than "isn't it just doing what humans do, but faster?"

GitHub scanning billions of code files to build commercial software is different from you learning at human pace, even if they're both "learning" and in the end they both produce commercial software.


> isn't it just doing what humans do, but faster?

The human activity most like training an ML system is memorizing a text by reciting from memory, checking against the original, adjusting, and repeating until there are acceptably few mistakes.

And if a human did so for thousands of texts then publicly repeated those texts, they would be violating copyright too.


It does not have to be different, but it certainly can be: a difference in quantity can certainly be a difference in quality. People watching other people walk by versus a camera - maybe with face detection - doing the same differ not only in quantity but also in quality.


That is exactly what needs some careful consideration. As a start, two people can write the exact same code independently, therefore having identical code is not sufficient. On the other hand I can copy some code and slightly modify it, maybe only the spacing or maybe changing some variable names, and it could reasonably be a license violation, therefore having identical code is also not necessary.

Does the code even matter at all? If I start with a copy of some existing code, how much do I have to change it to no longer constitute a license violation? Can I ever reach this point or would the violation already be in the fact that I started with a copy no matter what happens later? Does intention matter? Can I unintentionally violate a license?

But I think we don't have to do all the work; I am pretty sure this has already been considered at length by philosophers and jurists.


The boundaries are not set in stone, and so the answer is the old theme of "it depends". To provide a slightly different situation which was discussed a few years ago: can you train an AI on pictures of human faces without getting permission? Human painters have created images of faces for a very long time, so is it any different in terms of law and/or ethics if an AI does it?

Yes, a bit? It depends. Using such things for advertisement would likely cause anger if people started to recognize images from the training set the AI was trained on.


My opinion would be that if the training set for the face generator was made up of photos whose creators had asked you to credit them if you re-used their work, then, yes, the generator is ethically in the wrong if it's skipping that attribution. Regardless of copyright. (And I feel the same way about Copilot.)


https://en.wikipedia.org/wiki/Clean_room_design

Sometimes? It's enough of an issue that companies explicitly avoid it by having two teams.


Clean room design is a technique to avoid the appearance of copyright infringement. If the courts were omniscient and could see into your mind that you didn't copy, there would be no need for it. This is relevant because we can see into the mind of Copilot. Whether what it does is considered infringement will, I think, come out in the details.

If the ML model is essentially just a very sophisticated search that helps you choose what to copy and helps you modify it to fit your code, then it's 100% infringement. If it is actually writing code, then maybe not.



