All public GitHub code was used in training Copilot (twitter.com/noradotcodes)
1017 points by fredley on July 8, 2021 | 707 comments


To me, the particular use case, and whether it is fair use or not, is of minor interest. A far more pressing matter is at hand: AI centralization and monopolization.

Take Google as an example, which ran Google Photos for free for several years. Now that this has sucked in a trillion photos, the AI job is done, and they likely have the best image recognition AI in existence.

Which is of course still peanuts compared to training a super AI on the entire web.

My point here is that only companies the size of Google and Microsoft have the resources to do this type of planetary scale AI. They can afford the super expensive AI engineers, have the computing power and own the data or will forcefully get access to it. We will even freely give it to them.

Any "lesser" AI produced from smaller companies trying to compete are obsolete, and the better one accelerates away. There is no second-best in AI, only winners.

If we predict that ultimately AI will change virtually every aspect of society, these companies will become omnipresent, "everything companies". God companies.

As per usual, it will be packaged as an extra convenience for you. And you will embrace it and actively help realize this scenario.


I have about 300,000 photos that haven't been scanned by AI (unless someone at Backblaze did it without permission). I'm sure there are lots of other photographers out there who miss Picasa, which Google killed off to push everyone's data to their service. (It did really well at matching faces, even across ages, but the last version has a bug where, when there are multiple faces in a picture, it sometimes swaps the labels.)

If there were offline image recognition we could train on our own data privately, could the results of those trainings be merged to come up with better recognition on average than any one person could do themselves with their own photos?

In other words, would it be possible for us to share the results of training, and build better models, without sharing the photos themselves?


Absolutely possible.

What I'm building into PhotoStructure is typically called "transfer learning."

https://en.wikipedia.org/wiki/Transfer_learning

PhotoStructure is entirely self-hosted, including model training and application: the public domain base models (trained on huge datasets) are fetched and cached locally.

By design, none of your data (or even metadata) leaves your server.

(I expect to ship this in an upcoming beta next month.)
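
Roughly, local fine-tuning looks like this (a minimal PyTorch sketch, not PhotoStructure's actual code; the "my_photos/" folder and the ten-label head are made up):

    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    # Fetch the public pretrained base model (cached locally after first run).
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False                      # freeze pretrained features
    model.fc = nn.Linear(model.fc.in_features, 10)   # new head for your own labels

    # "my_photos/" stands in for a local folder of labeled images.
    data = datasets.ImageFolder("my_photos/", transform=transforms.Compose(
        [transforms.Resize((224, 224)), transforms.ToTensor()]))
    loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

    opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:                              # everything stays on-device
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()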


I want to label all the faces in the photos I've taken since 1997, and save them in the metadata. I'll be glad to run it against my photos. Windows 10, WSL, and/or Virtual Machine with Linux of your choice.


I've got desktop builds for macOS, Windows, and Linux, as well as "headless" builds for Docker and even "directly" via Node.js. Instructions here: https://photostructure.com/install


Nice! Will try this out. Are you planning on taking advantage of in-built neural engines like that in Apple M1 for speeding up object/facial recognition?


I'd like to, but practically speaking, I'm at the mercy of native support in the libraries I'm using. If support is added, though, it's trivial for me to add the switch as a user-definable setting.


Yes, you're talking about federated learning.

https://en.wikipedia.org/wiki/Federated_learning
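
The core loop is simple: each participant trains on their own photos locally and shares only model weights, which a coordinator averages. A toy NumPy sketch (the "local training" step is a stand-in for real gradient descent; all sizes are arbitrary):

    import numpy as np

    def local_update(weights, private_photos, lr=0.5):
        # Stand-in for real local training on one person's photos:
        # nudge the shared weights toward this user's local optimum.
        return weights + lr * (private_photos.mean(axis=0) - weights)

    rng = np.random.default_rng(42)
    # Ten participants, each with a private dataset that never leaves home.
    participants = [rng.standard_normal((1000, 8)) + i for i in range(10)]

    server_weights = np.zeros(8)
    for _ in range(5):
        # Each participant trains locally; only weight vectors cross the wire.
        updates = [local_update(server_weights, d) for d in participants]
        server_weights = np.mean(updates, axis=0)    # federated averaging

The photos themselves never leave each participant's machine; only the weight vectors do.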


> If there were offline image recognition we could train on our own data privately...

Apple does all face recognition and image processing stuff on the edge. On your iPhone or Mac.

I wondered why my phone sometimes got frighteningly hot while charging. Then, after manually adding some faces for it to recognize, I saw a note along the lines of "Your phone will update faces while it is charging". All my photos are backed up to iCloud, btw.


While possible, only tech-savvy people would take part in this "collective", which is of course a minor fraction of the data Google has access to. This is the same argument as saying that if you care about privacy, "just" don't use Google; easier said than done for the vast majority of people on earth.


I am not an expert in the field, but my hope was that this could be facilitated by transfer learning. I still don't know how the economies of scale could be achieved. Maybe just through the sweat and networks of passionate people, as in the case of open source.


I work in the field. Transfer learning helps you get decent-to-good models, but the best models remain the ones trained on large amounts of data. You may be able to get away with good-but-not-great performance on your task. In some areas you care so much about long-tail performance (like self-driving) that you will need a massive dataset. In other areas, if your goal is to be the best relative to other large companies, you will also need a massive dataset.

Transfer learning's best use cases are fast prototypes, or ML tasks that do not need state-of-the-art performance.


If there are hundreds of people with 100,000 photos each, that collectively is a massive training database, with a lot more labels and diversity of subjects.

By keeping the training data itself private, distributed and outsourced, you might be able to get otherwise unachievable levels of performance.


This isn't going to solve the data ownership issues though, since the photos contaminate the program trained on them (and its black-box nature only makes it worse)... though I guess that, specifically for copyright, it's going to depend on the final usage of the tool?


While interesting, we don't know enough about how models learn to be able to seriously consider doing this.


There would still need to be a central model (and centralized management thereof) if I understand correctly.


PhotoPrism, digiKam, and Shotwell all have image recognition features, with varying levels of sophistication.


You don't encrypt your data before uploading to backblaze?


Oh heck no, I never encrypt data.

I run Windows. It can't ever be secure; anyone who wanted to hack me could.

Scrambling the data really makes things worse as any accident requiring recovery of my data is also probably going to lose the encryption key.

The only time I ever lost any significant chunk of data (a person's lifetime set of photos!) was because Windows encrypted data at rest, and thus it couldn't be recovered after a disk crash.

Unless there is some corporate or legal requirement to do so, I'll never encrypt a whole disk, or backup.


> any accident requiring recovery of my data is also probably going to lose the encryption key.

... why?

I'd hate encrypting too if I threw away all best practices regarding it -- losing a key along with the failed system is a "problem exists between chair and keyboard" type of issue.

Encryption protects your data from yourself, from your adversaries, from serendipitous grey-moral types, and from the prying eyes of over-zealous data-collection conglomerates.

You seem experienced in the field, so I won't presume what your best practices are -- but to be enthusiastic against encryption is a form of cheer-leading that I think I cannot ethically support; the longer I live and the more pervasive companies get to be with their data collection policies then the more powerful and required tools like encryption seem to become.


Agreed. Everybody talks about encrypting backups like it's common sense, but almost nobody talks about the risks involved with failing to back up the encryption key itself properly. The entire integrity of the backup then depends on that sensitive piece of data, and it's not something that can be openly shared by its nature, or included in the encrypted backup itself. It's even deceptive if your measure for success is restoring the backup to make sure it works properly, because there is now an implicit assumption that the encryption key is still valid and undamaged the next time you restore.

I wish backup tools like Duplicity would warn you about the risks of encrypting backups instead of warning the user if they disable encryption, because encryption has the possibility of rendering all those backups useless when the moment to use them finally comes.

I have a similar feeling that large swathes of my digital life would be rendered permanently inaccessible if 2FA were enabled and my device were rendered inoperable. (That's why I keep meticulous physical backups of emergency keys.) I think 2FA and the like should be considered a tradeoff with its own inherent risks and benefits, instead of a universally better option than randomly generated 80-character passwords alone.


The thought that data monopolization will be a moat against competitors is actually argued against by VC firms specializing in AI companies, who claim that after a certain amount of data (which is accessible to most people) the additional data isn't going to improve the model much.

https://a16z.com/2019/05/09/data-network-effects-moats/

https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-i...


And in case of Copilot, the training data isn’t a moat anyhow. Last I looked, everyone could freely access GitHub public repositories.


> If we predict that ultimately AI will change virtually every aspect of society, these companies will become omnipresent, "everything companies". God companies.

What we currently call AI is very far from AGI, and it's not clear that sitting on piles of proprietary data gives an edge towards AGI. If the goal is human-level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system. :)

Current DL systems need huge amounts of data because they are very primitive: they work with immediate associations, so they require seeing data very similar to all possible inputs to generalize well.

As we develop more sophisticated systems, I expect that the leverage from data will tip over to engineering finesse, and nothing is better at fostering great engineering than the permissionless tinkering environment of open source.


> If the goal is human-level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system.

Pretending that the scientifically managed public school system, which attempts to manufacture uniform educated humans on a conveyor belt, is responsible for human education is fairly ridiculous.

Children have a remarkable capacity to learn, and do so automatically through free play and exploration until public education wrings that curiosity out of them and turns education into a job.

Humans get educated despite the public education system, not because of it.


> Pretending that the scientifically managed public school system,

Say what now? There may be places on Earth that practice scientific management, there are definitely some that pretend to, but IME public school systems are neither.


Schools (at least American public schools) are one of the last bastions of Taylorism in the west. They treat students like uniform widgets on an assembly line.

You can read for yourself: https://files.eric.ed.gov/fulltext/ED566616.pdf https://radicalpedagogy.icaap.org/content/issue3_2/rees.html


“Treating X as uniform widgets” (where X are not uniform widgets) and “scientific management” are not only not the same thing, they are anticorrelated.


Yeah, everyone knows children will innately learn calculus from flinging mud at each other.


> If the goal is human level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system.

Seems unlikely human education costs less than AI education in total.


For years we thought Google Translate was the best machine translation we would ever get. Then DeepL popped up out of nowhere, and to this day other services haven't managed to catch up.

Every now and then someone thinks about an old problem on a clean sheet of paper, and you might get a better result with less training data / investment.


Google doesn’t have the best (publicly) available reverse image search AI. That would be Yandex.

Google is actually pretty crappy at reverse image searches.

https://www.bellingcat.com/resources/how-tos/2019/12/26/guid...


The point still stands though: Yandex is also a behemoth with access to a massive amount of data.


It's pretty clearly intentionally hobbled for various reasons (e.g. privacy, obscenity, etc). It used to work a lot better.


Which would explain why Yandex, specifically, is the best in category: being based in a country whose government enjoys trolling the developed world's ideas of decency and responsibility can have its advantages.

(Until, of course, they force you out and give the company to some crony oligarch. But that idea is also not unknown to Yandex, I believe?)


On the other hand, DeepL (made by a small German company) is better than Google Translate.


Makes it sound like DeepL exists in isolation. It's good because the company behind it has the largest hyperlocal (small phrases with confirmed usages) translation data set.


There is a second-best though. Apple offers image AI which is worse than Google's but wins because it works offline.


I've got 70,000 photos in my library, with AI search and recognition, all done on my device. Thanks Apple.

In fairness it's not quite as good, but it's good enough for the searches I've wanted to do so far, and it gets better all the time. And they're adding searching for text in photos this release. I'm happy to wait a little for this better implementation.


I largely agree. But there are still some fun opportunities around. One is things Google would never touch for PR reasons (e.g. state-of-the-art scalable face identification). Another is just silly, out-of-the-box creative uses of AI which wouldn't fit well with Google's brand.


If Google makes an amazing model that no one can beat, it will only dominate as long as others can access it freely. If there are restrictions on access, or if it's too expensive, other options will appear, and even if they're not as perfect, they'll still be very usable. Imagine a coalition of companies all feeding in data; that could compete just as well.


Google has all the data of all the users though. I'd wager that they won't just let AI companies scrape it.


I don't think Google uses user photos to train their photo search algorithm.

They use photos from the web for training, and then user photos are only used for the actual indexing.


I think it's an innate quality of technology.

Yes sophisticated AI tech concentrates power for those who already have power.

And the technology we all (presumably readers of HN) create can enhance the impact of the user. And this can result in unfair circumstances, in reality.

Law and force can prevent disproportionate use of power. Of course one must define the law, which may be done AFTER the offense has been committed. Further, if those who make the laws are corrupted by those with e.g. this AI tech power, then no effective law may be enacted and the hypothetical abuse will continue.


The final step is to break down these monopolies. The government can do that and has done it before.


Interesting, given that HN thinks it is Yandex who has state-of-the-art image search, not Google https://news.ycombinator.com/item?id=23976172 which kinda counters your logic.

It is Yandex who now collects massive amounts of data to improve their image search, while Google apparently doesn't.

Yandex is a giant, for sure, but Google is, like, 10 times bigger and still doesn't provide the best service.


They are not hoarding the latest results, except for a few cases where the general public is a year behind their secret sauce. Take a look at the huge zoo of planetary-scale models that are published by the big companies and universities (HuggingFace, https://modelzoo.co/, ...)

The problem with huge models like GPT-3 is that they are too expensive for regular people even to run, never mind training them.


Regular people yes, but no problem for decently funded startups.


This seems to be inevitable. An individual doesn't have horizontal scalability, you know... So, unless we get some kind of brain extension capabilities, there is no other choice but to build such technologies collectively.

Also, I think you are overdramatizing this. Governments used to be omnipresent (maybe still are), in a different way, more threatening to individuals and probably as threatening to societies as "everything companies" could be.


We can decide to stop using some (or most) of Google's services. It's hard, but it's not as if they are pointing a gun at us to make us use their services, right? Sure, for the cases where one cannot escape Google, use it; but for the rest of the scenarios? It's all about tradeoffs: Can I live without YouTube? Can I live with DuckDuckGo (Google Search is "better" but I don't mind)? etc.


Google search would be hard to replace. In fact, if Google search was turned off over night, the world would probably see a major economic downturn, caused by a sudden drop in productivity.


You, a person knowledgeable in this field, may choose to stop using Google services, but that won’t have any societal impact if you can’t also convince the “average” user to do the same.


And I have yet to see a single life-changing AI application. I haven't tested Copilot yet, but I'll bet it is so precariously useful that a lot of people will feel more productive without it. (BTW, the last time I opened VSCode, it could not even autocomplete NumPy, so I am not holding my breath for AI autocomplete.)


Well of course only the huge companies can develop products that require enormous resources.

But I'm not too worried here because everyone gets access to larger datasets every year, and it gets cheaper to process every year, so whatever Microsoft or Google is capable of doing now, smaller companies will be capable of doing in a few years.


It's also a huge call for innovation. When a student learns to code, (s)he doesn't need to analyse millions of Git repositories to get good at it. Throughout their entire career, most developers will probably only see comparatively little code. Perhaps the equivalent of the Linux kernel, if that. And yet, we're able to learn from the little we see and get reasonably good at coding. It's even debatable how much better one gets by reading more code (most of which is pretty crappy anyway).


I believe this is actually powered by OpenAI, which while large (now), is nowhere near the behemoth that Microsoft or Google is.

This suggests that seeing the future a bit ahead of the rest of the world, and then assembling a motivated all-star team is (perhaps in the short term at least) one way of out-competing the "super AI" of the giants.


Not only did Microsoft basically buy OpenAI a couple of years ago, they also made GPT-3 a closed thing that you can only access via API.

Don't let the name fool you, OpenAI is anything but Open.


Last I checked, Microsoft pretty much owns OpenAI?


I don't think that your premise concerning "planetary-scale AI" (and the ability to pull it off) holds up. If Google and Microsoft are so dominant and had such an insurmountable head start, why are we seeing such an enormous number of AI startups? In fact, there are countless startups busy figuring out how to make AI work for software development. I'd even argue that Copilot was not that expensive to build. I very much doubt that GitHub (or Microsoft for that matter) had a huge team working on this or spent such a vast amount on hardware resources that they'd outcompete the rest of the market by virtue of their cash reserves. Any decently funded startup should be able to finance such an effort. Especially since in this case, the training data is cheap (and legal) for anyone to access.

Where Microsoft does have an “unfair advantage” is in their marketing and sales firepower. Replicating their B2B and B2C sales channels is indeed very expensive. GitHub will be able to monetise Copilot by some upselling campaign. Then again, startups regularly manage to break into markets that are supposedly locked down by the likes of Microsoft.


If the training set contains verbatim (A)GPL code, does this mean that Copilot should also be distributed by Microsoft under the GPL? Since Copilot (as it is distributed by Microsoft) couldn't have been built without that code, wouldn't that make it a derivative work of the GPL'd code (and obviously of code under every other license)?

I see a lot of people comparing human learning to machine learning in the comments, but there is a huge difference: we don't distribute copies of humans.


No, see Authors Guild v. Google. Even without a license or permission, fair use permits the mass scanning of books, the storage of the content of those books, and rendering verbatim snippets of those books. The Google Books site is not a derivative work of the millions of authors they copied from, and if they did copy any coincidentally GPL, AGPL, or creative commons copyleft work, the fair use exception applies before we reach the question of whether Google is obligated to provide anything beyond what it is doing.

By comparison, Copilot is even more obviously fair use.

I've had this conversation quite a few times lately, and the non-obvious thing for many developers is that fair use is an exception to copyright itself.

A license is a grant of permission (with some terms) to use a copyrighted work.

This snippet from the Linux kernel doesn't make my comment here or the website Hacker News a GPL derivative work:

    ret = vmbus_sendpacket(dev->channel, init_pkt,
        sizeof(struct nvsp_message),
        (unsigned long)init_pkt, VM_PKT_DATA_INBAND,
        VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
This snippet from an AGPL licensed project, Bitwarden, does not compel dang or pg to release the Hacker News source code:

    await _sendRepository.ReplaceAsync(send);
    await _pushService.PushSyncSendUpdateAsync(send);
    return (await _sendFileStorageService.GetSendFileDownloadUrlAsync(send, fileId), false, false);
Fair use is an exception to copyright itself. A license cannot remove your right to fair use.

The Free Software Foundation agrees (https://www.gnu.org/licenses/gpl-faq.en.html#GPLFairUse)

> Yes, you do. “Fair use” is use that is allowed without any special permission. Since you don't need the developers' permission for such use, you can do it regardless of what the developers said about it—in the license or elsewhere, whether that license be the GNU GPL or any other free software license.

> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.

(And even this verbatim copying from FSF.org for the purpose of education is... Fair use!)


You're strongly and incorrectly implying that "Fair Use" is a clear (and relatively immutable) concept within copyright law, which couldn't be further from the truth. Even if this or that particular case sets out what appears to be solid grounds, one shouldn't take that as gospel by any means.

This mostly has to do with the wishy-washy nature of the 4-part Fair Use test, which, unlike decent legal tests, doesn't actually have discrete answers. The judge looks at the 4 questions, talks about them while waving her hands, and makes a decision.

Compare that to, e.g., patents, where you actually do have yes-or-no questions. Clean Booleans. Is it Novel? Is it Non-Obvious? Is it Useful? If any of the above is "No", then no patent for you.

As for the execution of Fair Use, while I haven't gone too deep into software, I can assure you that for music, the thing is just a silly holy-hell mess; confirmed most recently by the "Blurred Lines" case, where NO DIRECT COPYING (e.g. sampling or melody taking) was alleged, merely that the song sounded really similar to "Got to Give It Up", and that was enough.

So then, I'd say everything either is, or should be, up in the air, when it comes to Fair Use and software.


Most law is wishy-washy. There are very few cut-and-dried answers in the law (if there were, we wouldn't need lawyers and a court system based on deciphering the law).

All that said, the one thing I'd add about fair use is that it isn't permission to use anything you like, but rather a defense in a legal proceeding about copyright. It's pretty much all about being able to reference copyrighted material, with the law later coming in and making final decisions on whether or not that reference went too far. (I.e., copying all of a Disney movie and saying "What's up with this!" vs copying one scene and saying "This is totally messed up and here's why".)

That was a big part of the Google/Oracle lawsuit.


> Is it Novel? Is it Non-Obvious?

Those questions for patents are barely more clear-cut than copyright fair use tests, there is lots of room for disagreement.

It's definitely true that a fair use defense against copyright infringement varies a lot by the field of work and norms can develop which are relevant to court cases. The music field is a mess, the "Blurred Lines" judgement was total bullshit. But the software field is not without its own copyright history and norms so there's no reason to expect everything to go to hell.


But there's no reason not to either - I suppose my point is, don't take too much as gospel and think about everybody's best "end-goals" and push or pull with or against the law as needed.


There’s also an aspect of this that varies by size, budget, political clout, etc etc, of the individual or organisation.

The big guns like Microsoft, Google, Oracle, do this sort of thing as a matter of course in their business activities, they have the lawyers, the money, and the ear of members of parliaments, senators etc.

Whereas an individual or small business probably wants to conduct themselves within a more narrow set of adherences.


Unanswered question, as far as I know: is a trained model a derivative work? If the model accidentally retains a copy of the work, is that an unauthorized copy?


In my opinion, the model would not be an unauthorized copy, given that its primary purpose was some other task and the inclusion of the work was merely incidental.

The unauthorized copy arises when someone gets the work out of the model.

Of course if you make a model explicitly for the purpose of evading copyright then the courts will see through that ploy.


I think it would be pretty easy to stake opinions on those "boolean questions."

Is (was?) a swipe gesture novel? Is it non-obvious?


I think what the parent is stating is that even though the patent questions can have debate, once you settle the question "Is it Novel?" as yes or no you can determine if the item is patentable... whereas for fair use, the questions themselves aren't yes/no questions, and further, they are just used as balancing factors, so even if everyone agrees on "the effect of the use upon the potential market for or value of the copyrighted work" it's only weighed as a factor for how fair the use is, and broadly left up to the hand-waving of the particular judge.


Oh, absolutely. Kind of furthers my point. Patent is a silly mess in a lot of ways, but at least there's something like Booleans in it. "Fair use" doesn't even have THAT.


Yes to all this.

I think the factor most at risk in a fair use test with Copilot is whether it ever suggests, verbatim, code that could be considered the "heart" of the original work. The John Carmack example that's popped up here at least gets closer to this question; it was a relatively small amount of code, but it was doing something very clever and important.

One can imagine a project that has thousands of lines of code to create a GUI, handle error conditions, etc. that's built around a relatively small function; if Copilot spat out that function in my code, it might not be fair use because it's the "heart" of the original work. Additionally, its inclusion in another project could affect the potential market for the original, another fair use test.

But Copilot suggesting a "heart" is unlikely, something that would have to be ruled on in a case-by-case basis and not a reason to shut it down entirely. Companies that are risk-averse could forbid developers from using Copilot.


This is an excellent comment because it captures some important nuance missing from other analysis on HN.

I agree with you that the relative importance of the copied code to the end product would be (or should be) the crux of the issue for the courts in determining infringement.

This overall interpretation most closely adheres to the spirit and intent of Fair Use as I understand it.


For any discussion on copyright and fair use, we should distinguish between the implications to Copilot the software itself and the implications to users of Copilot.

For Copilot itself, I do see the case for fair use, though it gets fuzzy should Microsoft ever start commercializing the feature. Nevertheless, it remains to be seen whether ML training serves the same public-policy benefits that public libraries and free debate leverage to enable the fair use defense.

For Copilot users, I don't see an easy defense. In your hypothetical, this would be akin to me going on Google Books and copying snippets of copyrighted works for my own book. In the case of Google Books, they explicitly call out the limits on how the material they publish can be used. In contrast, Copilot seems to be designed to encourage such copying, making it more worrisome in comparison.


>In your hypothetical, this would be akin to me going on Google books and copying snippets of copyrighted works for my own book.

A book completely written by pasting passages of other books would actually be a pretty interesting transformative work.


Yeah, but a book like this would be an artistic work.

While software is in this limbo between copyrights and patents...


The world is global. That's a US court ruling from one court of appeals. Most countries have narrower fair use rights than the US. Even if Copilot would fall within that legal precedent (far from guaranteed), a legal challenge in any jurisdiction worldwide outside the US states covered by that particular court of appeals, or which reaches the US Supreme Court, or which goes through the Federal Circuit Court of Appeals due to the initial complaint including a patent claim, would not be bound by that result and (especially in a different country) could very plausibly find otherwise.

What's more, if any of the code implements a patent, fair use does not cover patent law, and relying on fair use rather than a copyright license does not benefit from any patent use grant that may be included in the copyright license. If a codebase infringes a patent due to Copilot automatically adding the code, I can easily imagine GitHub being attributed shared contributory liability for the infringement by a court.

Not a lawyer, just a former law student and law feel layman who has paid attention to these subjects.


> law feel layman

What a weird autocorrect typo. This should have read "law geek layman." (And it initially autocorrected again as I was typing this paragraph.)


> No, see Authors Guild v. Google.

That case required that the output be transformative, in that "words in books are being used in a way they have not been used before".

Copilot only fits the transformative aspect if it is not directly reciting code that already exists in the form it is redistributing. So long as it does so, it fails to meet the criteria.


I think you might be considering two different acts here:

1. The act of training Copilot on public code

2. The resulting use of Copilot to generate presumably new code

#1 is arguably close to the Authors Guild v. Google case. You are literally transforming the input code into an entirely new thing: a series of statistical parameters determining what functioning code "looks like". You can use this information to generate a whole bunch of novel and useful code sequences, not just by feeding it parts of its training data and acting shocked that it remembered what it saw. That smells like fair use to me.

#2 is where things get more dicey - just because it's legal to train an ML system on copyrighted data wouldn't mean that its resulting output is non-infringing. The network itself is fair use, but the code it generates would be used in an ordinary commercial context, so you wouldn't be able to make a fair use argument here. This is the difference between scanning a bunch of books into a search engine, versus copying a paragraph out of the search engine and into your own work.

(More generally: Fair use is non-transitive. Each reuse triggers a new fair use analysis of every prior work in the chain, because each fair reuse creates a new copyright around what you added, but the original copyright also still remains.)


Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

Absent this, I don't think there's a case. The courts have given extraordinarily wide latitude to fair use and ML algorithms are routinely trained on copyrighted works, photos, etc. without a license.

I understand that this feels more personal because it involves our field, but artists and authors have expressed the same sentiment when neural nets began making pictures and sentences.

The question here is no different than "Is GPT-3 an unlicensed, unlawfully created derivative work of millions, if not billions of people?"

No, I'm quite confident it is not.


> Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

It doesn't need to be substantial. In Google v. Oracle a 9-line function was found to be infringing.


If I recall correctly, the nine-line question wasn't decided by the Supreme Court, but the API question was.

The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf


> The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

Yes, because it was _transformative_, in a clear way. Because an API is only an interface. Which makes that part of that decision largely irrelevant to the topic at hand.

> Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language. Google’s purpose was to create a different task-related system for a different computing environment (smartphones) and to create a platform—the Android platform—that would help achieve and popularize that objective.

> If I recall correctly, the nine line question wasn't decided by the supreme court, but the API question was.

It was already decided earlier, and Google did not contest it, choosing instead to negotiate a zero payment settlement with Oracle over the rangeCheck function. There was no need for the Supreme Court to hear it.


A $0 settlement means there is no binding precedent, and signals to me that Oracle's attorneys felt they didn't have a strong argument or the potential for more.

If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.


> A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

That's not the case. It wasn't an out-of-court-settlement, but an agreement about the damages being sought, the court had already found it to be infringing, and that was part of the ruling.

But none of that changes that 9-lines is substantial enough to be infringing. It isn't necessary to be a large body of work.

> If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.

No... It means the rangeCheck function was infringing. The implication you seem to have drawn here wouldn't hold in any kind of plagiarism case.


I think we agree then, and appreciate the correction on the lower court settlement.

If Copilot is infringing, I suspect it's correctable (by GitHub) by adding a bloom filter or something like it to filter out verbatim snippets of GPL or other copyleft code. (And this actually sounds like something corporate users would want even if it was entirely fair use because of their intense aversion to the GPL, anyhow.)
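
A rough sketch of what such a filter could look like (purely illustrative: the corpus lines, sizes, and threshold are all made up):

    import hashlib

    class BloomFilter:
        def __init__(self, n_bits=1 << 24, n_hashes=4):
            self.n_bits, self.n_hashes = n_bits, n_hashes
            self.bits = bytearray(n_bits // 8)

        def _positions(self, item):
            for i in range(self.n_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.n_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # Index every normalized line of the copyleft corpus (stand-in lines here).
    copyleft = BloomFilter()
    for line in ["ret = vmbus_sendpacket(dev->channel, init_pkt,", "return ret;"]:
        copyleft.add(line.strip())

    def looks_verbatim(suggestion, threshold=0.8):
        # Veto a suggestion when most of its lines already appear in the corpus.
        lines = [l.strip() for l in suggestion.splitlines() if l.strip()]
        hits = sum(l in copyleft for l in lines)
        return bool(lines) and hits / len(lines) >= threshold

The appeal of a Bloom filter here is that it never gives a false negative: if it says a line is not in the corpus, it definitely isn't, and the occasional false positive only suppresses a harmless suggestion.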


It may be correctable... It doesn't change that Copilot is probably infringing today, which may mean that damages against GitHub may be sought.


The point of Copilot -- its entire value as a product -- is to produce code that matches the intent and semantics of code that was in the input. In other words, very deliberately not transformative in purpose.


Why did you choose the standard of "substantial" = "100s of lines"? Especially since we've already seen examples of verbatim output in the dozens of lines range, that choice of standard is rather conveniently just outside what exists so far. If we find a case with 200 lines of verbatim output will you say the only reasonable standard is 1000s of lines?

I don't think your argument is as strong as you're making it out to be.


Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines, and that's "obviously" fair use. I would be surprised if many of us haven't inadvertently "copied" some GPL code in this way!

This goes to the "substantial" test for fair use. Clips from a film can contain core plot points, quotes from a book can contain vital passages to understanding a character, screen captures and scrapes of a website can contain huge amounts of textual detail, but depending on the four factors for fair use, still be fair use. (There have been exceptions though.)

The reaction on Hacker News to a machine producing code trained on their works is no different than the reactions artists and writers have had to other ML models. I suspect many of us are biased because it strikes at what we do and we think that our copyrights (because we have so many neat licenses) are special. They are not.

I think it would need to get to that level of "Copilot will emit a kernel module" before it's not obviously fair use.

After all, Google Books will happily convey to me whole pages from copyrighted works, page after page after page.

https://www.google.com/books/edition/Capital_in_the_Twenty_F...


> Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use.

it's anything but obvious. https://www.copyright.gov/fair-use/

> there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.

9 lines of very run-of-the-mill code in Oracle / Google weren't considered fair use.


A big difference is that software both is and isn't an artistic work.


It's not possible to get copilot to output a transformed version of the input?


Transformed output _may_ fall under fair use.

However - Copilot directly recites code. That is _very unlikely_ to fall under fair use.

Redistributing the exact same code, in the same form, for the same purpose, probably means that Copilot, and thus the people responsible for it, are infringing.


> However - Copilot directly recites code.

You make that statement as an absolute, but in the interests of clarity, all evidence so far shows that it directly recites code very rarely indeed. Even the Quake example had to be prompted by the specific variable names used in the original code.

In practice, the output code is heavily influenced by your own context — the comments you include, the variable names you use, even the name of the file you are editing — and with use it’s obvious that the code is almost certainly not a direct recitation of any existing code.


> all evidence so far shows that it directly recites code very rarely indeed.

_Once_ is enough for it to be infringing. The law is not very forgiving when you try and handwave it away.


You sound quite sure that the outlying instances of direct copying wouldn't be covered by the Fair Use copyright exemption. Any particular reason for that?

I tend to think it would be covered (provided there were relatively small snippets and not entire functions).


I'm not the person you're replying to, but one strong reason is that the global reach and standardization of copyright law is far broader than the global reach and standardization of the fair use exception. A single non-US country in which GitHub Copilot is used in a way that would be infringing without the US fair use exception, and outside the scope of any such exception in that law, would be enough to cause GitHub/MS a legal hassle. There could well be more than one such country.


Oh, absolutely.

I'm not American, but like others around here — I was just restricting the discussion to American law for simplicity's sake.


Fair, but GitHub/MS (same company now) can't afford to ignore other countries' law in their internal evaluations of whether globally* available products like Copilot are legal.

* Minus a few countries/regions targeted by US sanctions, I assume, though they've gradually broadened their services in sanctioned countries with the necessary licenses from OFAC.


Precedent. Google v. Oracle found 9 lines of an "obvious" implementation to be infringing.


Right, but would 3-4 lines in the middle of a 50 line function also be infringing? What about 2 lines?

I don't know the answer. I was only surprised that the commenter seemed dead sure that any and all copying (no matter how small) would be infringing.

That just doesn't correlate with my understanding of how Fair Use works: The "amount" of the infringement is one (of several) factors in determining if something falls under Fair Use:

>The third factor assesses the amount and substantiality of the copyrighted work that has been used. In general, the less that is used in relation to the whole, the more likely the use will be considered fair.

From https://en.wikipedia.org/wiki/Fair_use


So if a foreign company pilfers the source code to Windows, can they add it to a training set and then 'prompt' the machine learning algorithm to spit out a new 'copyright free' Windows, just by transforming the variable names?


I think that's my question regarding this whole thing:

If it's so fair use, why not train it on all Microsoft code, regardless of license (in addition to GitHub.com) ? Would Microsoft employees be fine with Copilot re-creating "from memory" portions of Windows to use in WINE ?


Well no, because only GitHub has access to the training set. But more importantly this misunderstands how Copilot even works -- even if Windows was in the training set, you couldn't get Copilot to reproduce it. It only generates a few lines of code at a time, and even then it's almost certainly entirely novel code.

Now, if you knew the code you wanted Copilot to generate you could certainly type it character by character and you might save yourself a few keystrokes with the TAB key, but it's going to be much MUCH easier to simply copy the whole codebase as files, and now you're right back where you started.


GPT-3 is still Microsoft licensed, but a similar model can be put together with the freely available GPT-2 and source code -- especially if your intent is copyright transfer.

As Francois Chollet points out in this talk, ultimately deep neural network models are locality-sensitive hash tables, so the examples of people pulling out source code are an inherent shortcoming of deep learning models in general. Given the right 'key', you can 'recall' the value you are looking for.

https://www.youtube.com/watch?v=J0p_thJJnoo
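
To make the hash-table analogy concrete, here is a toy random-projection LSH sketch (all parameters are arbitrary): similar 'keys' land in the same bucket, so a near-copy of a memorized input 'recalls' the stored value.

    import numpy as np

    rng = np.random.default_rng(0)
    planes = rng.standard_normal((16, 64))     # 16 random hyperplanes

    def lsh_bucket(vec):
        # The sign pattern against the hyperplanes is the bucket id.
        return tuple((planes @ vec > 0).astype(int))

    table = {}
    stored = rng.standard_normal(64)           # a "memorized" training example
    table[lsh_bucket(stored)] = "verbatim training snippet"

    # A slightly perturbed key will usually land in the same bucket...
    query = stored + 0.01 * rng.standard_normal(64)
    print(table.get(lsh_bucket(query)))        # ...and recall the stored value

Which is essentially what the Copilot recitation examples demonstrate: prompt with something close enough to a training input, and the model falls back to the memorized output.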


> "However - Copilot directly recites code."

Sounds like that wouldn't be difficult to fix? Transform the code to an intermediate representation (https://en.wikipedia.org/wiki/Intermediate_representation) as a pre-processing stage, which ditches any non-essential structure of the code and eliminates comments, variable names, etc., before running the learning algorithms on it. Et voila, much like a human learning something and reimplementing it, only essential code is generated without any possibility of accidentally regurgitating verbatim snippets of the source data.
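
For a toy illustration of that preprocessing idea, here is a sketch using Python's own tokenizer (purely illustrative; a real pipeline would presumably work on a language-agnostic IR):

    import io
    import keyword
    import token
    import tokenize

    def normalize(source):
        # Drop comments and map identifiers to canonical names, keeping only
        # the structural skeleton of the code.
        names, out = {}, []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                continue
            text = tok.string
            if tok.type == token.NAME and not keyword.iskeyword(text):
                text = names.setdefault(text, f"v{len(names)}")
            out.append((tok.type, text))
        return tokenize.untokenize(out)  # output spacing may differ slightly

    print(normalize("total_price = unit_cost * qty  # compute the bill\n"))
    # prints a skeleton like "v0 = v1 * v2": identifiers renamed, comment gone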


At that point, can we all just agree IP is the stupidest concept to ever be layered on top of math (which programming is) and move on with non-copyrightable code?


Only if you agree that copyleft licenses are also stupid; without copyright, there's no way to prevent companies from making closed-source forks of code you wrote and intended to stay open.


The whole point of copyleft was as a stepping stone to get to RMS's four freedoms (https://www.gnu.org/philosophy/free-sw.en.html) which effectively eliminates copyright for software.


Freedom 1: “Access to the source code is a precondition”

With no copyright/copyleft, how do you enforce the rule that derived works must provide access to the source code? I’ve never heard that copyleft was a stepping stone—rather, it’s the stick that fully realizes the four freedoms.


Correct. Copyleft is idiocy as well. You don't really need to pay for a proprietary fork of a tool when no one can keep you out of the free one, and the proprietary stuff diffuses into the free option.


Yes, sure. Without copyright there's no need for copyleft left, right?


No...? Not unless that closed-source project's source code is leaked?


You don't care about attribution and other moral rights?

(I guess these are going to depend a LOT on the jurisdiction that you're in?)


I care, but in the long run, I care more about our descendants not having tools locked out of their hands. Facilitated information asymmetry is the root of far too many evils.

Where is your ego when you're dead and gone? Where could we be if the majority of human advancement were not tightly clutched as trade secrets?

As someone who has done paid software engineering (yes, you can feel free to call me a hack or sellout if you wish), I've come to find that the salary I've pulled over the years has not gone to me, but to keeping a roof over those I love, helping other people's projects grow, giving people a shot, etc.

My time on the other hand, gets dumped into implementing the same handful of processes doing the same damn thing, but different this time, because you can't just bloody make "Here ya go, here's your Enterprise-in-a-box".

I'd like more people to be able to solve novel problems than to necessarily retread the same path over and over. Some degree of that will always have to be done to keep the skills fresh in the population, but we could do way better at marshaling that split, and I'm convinced part of what necessitates it is creating artificial barriers through things like enforced implementation monopolization. Yes, it ensures a minimum level of novelty and variance across populations, but it also does terribly at not consuming the finite amount of human capacity for truly novel thought and innovation.

It may make societies that function based on greed and economic/fiscal measures work, but I'm not convinced that other incentive structures couldn't also keep the rolling stone of innovation free of moss.


I don't understand what you're talking about; I'm talking about the non-commercial parts of the monopoly rights that are copyrights and patents. The non-commercial parts arguably aren't going to restrict users much, and the commercial parts are temporary by design.

(Copyright has IMHO gone overboard with its duration; we should scale it back to the original 14 years, renewable once, just like patents. But copyright doesn't apply to processes anyway, and so arguably it shouldn't apply to software that can't claim to have any artistic merit.)


> By comparison, Copilot is even more obviously fair use.

Not sure I see it that way.

If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

Copying and storing a book isn't recreating another book from it. Copilot is creating new stuff from the contents of the "books" in this case.

Edit: I misunderstood fair use as it turns out...


Google did not scan those books and use it to build new books with different titles. The comparison doesn't hold up at all.


> Google did not scan those books and use it to build new books with different titles. The comparison doesn't hold up at all.

Not sure if you meant to reply to me but I agree with you: you can't compare what Google did to what Copilot does.


Copilot just suggests code.


And someone accepts it. Even if suggesting derivatives of licensed code is not a license infringement, then Copilot sure is a vector for mass license infringement by the people clicking "Accept suggestion". And those people are unable to know (without doing extensive investigation that completely nullifies the point of the tool) whether that suggestion is potentially a verbatim copy of some existing work in an incompatible license.


If I suggest whole lines of dialogue to you, the screenwriter, did I write those lines or you? If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?

Suggesting code is generating code


> did I write those lines or you

Neither. Someone else did, and published it. Copilot copied the dialog and suggested it.

> If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?

It depends. Talking generalities isn't productive or interesting. Can you give an example and we can discuss specifics?

> Suggesting code is generating code

This isn't even superficially true


There are situations where the question is whether the mishmashes from Copilot are 'fair use'.

But the other, more direct question is... what about the instances where Copilot doesn't come up with a learned mishmash result? What happens when Copilot just gives you a straight-up answer from its training data, verbatim?

Then you, as a dev, end up with a bunch of code that is effectively copied, via a 'copying tool', which is GPL'd?

It's that specific case that to me sticks out as the 'most concerning part'.

Please correct me if I'm wrong.


For your specific case, “take your hard work that you clearly marked with a GPL license and then make money from it”, you don’t even need to rely on fair use. As long as you comply with the terms of the GPL, making money with the code is perfectly acceptable, and the FSF even endorses the practice. [1] Red Hat is but one billion-dollar example.

[1] https://www.gnu.org/licenses/gpl-faq.en.html#DoesTheGPLAllow...


But the person making money from the GPL code has to follow the terms of the license. Attribution, sharing modifications, etc.


Correct. That's why I said "As long as you comply with the terms of the GPL".


I've edited my comment with examples and a clarification.

Fair use is an exception to copyright and, by definition, copyright licenses.


I understand the concept of fair use (I think) but I can't see how it applies to Copilot.

Google didn't create new books from the contents of existing ones (whether you agree that they should have been allowed to store the books or not) but Copilot is creating new code/apps from existing ones.

Edit: I guess my understanding of fair use was wrong. I stand corrected.


If Google Books were creating new books, that would only help their argument. Transformativeness is one of the four parts of the fair use test.

Copilot producing new, novel works (which may contain short verbatim snippets of GPL works) is a strong argument for transformativeness.


It would help the transformativeness, but it would substantially change the effect upon the market. By creating competing products with the copyrighted material, there is a higher degree of transformativeness, but you also end up disrupting the marketplace.

I don't know how a court would decide this, but I do think the facts in future GPT-3 cases are sufficiently different from Author's Guild that I could see it going any way. Plus, I think the prevalence of GPT-3 and the ramifications of the ruling one way or another could lead some future case to be heard by the Supreme Court. A similar case could come up in California, or another state where the 2nd Circuit Artist Guild case isn't precedent.


> short verbatim snippets of GPL works

Define short


[flagged]


Yeah, I realise that now.

However, where does one draw the line between fair use and derivative works?

Creating something based on other stuff (Google creating AI books from the existing ones, for example) would possibly be fair use I think, but would it not also be a derivative work?


There's no clear line and there can never be because the world is too complex. We leave up determination to the court system.

Google Books is considered fair use because they got sued and successfully used fair use as a defense. Until someone sues over Copilot, everyone is an armchair lawyer.


I don’t disagree with your point but was it necessary to make it in such a snarky way?


[flagged]


Would you please stop breaking the site guidelines? You've been doing it repeatedly and it's not cool. Please just be kind.

https://news.ycombinator.com/newsguidelines.html


This is the clearest display yet that moderation on HN has absolutely nothing to do with your purported values like constructive criticism, and has everything to do with whether dang agrees with you or not.


I actually have no idea what you were arguing about, nor which side you were on, nor what your argument was. I haven't paid enough attention to know those things, because (a) I don't want to, (b) I don't need to, and (c) not doing it leaves me in the desirable state of being incapable of agreeing or disagreeing.

It's a happy fact that figuring out people's arguments is often unnecessary for moderating the threads, especially in cases where people are breaking the site guidelines. Everyone needs to follow the site guidelines regardless of what the topic is, what their argument is, and how right they are or feel they are. Please stick to the rules when posting here.

https://news.ycombinator.com/newsguidelines.html


I don't think that's an accurate description...

Fair use is a defense for cases of copyright infringement, which means you're starting off from a case of copyright infringement, which sort of mucks up the whole "innocent until proven guilty" thing. And considering it's a weighted test, it's hardly very cut-and-dried at that.


If you view GPL code with your browser would that mean that your browser now has to be GPL as well? In the sense that copilot is not much different than a browser for Stack Overflow with some automation, why would it need to be GPLed? Your own code on the other hand…


For sake of discussion, it would be clearer to split copilot code (not derived from GPL'd works) and the actual weights of the neural network at the heart of copilot (derived from GPL'd works via algorithmic means).

For your browser analogy, that would mean that the "browser" is the copilot code, while the weights would be some data derived from GPL'd works, perhaps a screenshot of the browser showing the code.

I'd think that the weights/screenshot in this analogy would have to abide by the GPL license. In a vacuum, I would not think that the copilot code had to be licensed under GPL, but it might be different in this case since the copilot code is necessary to make use of the weights.

But then again, the weights are sitting on some server, so GPL might not apply anyway. Not sure about AGPL and other licenses though. There is likely some illegal incompatibility between licenses in there.


As I understand it, the thing Copilot tries to do is automate the loop of "Google your problem, find a Stack Overflow answer, paste the code from there into my editor". In that sense, the burden of respecting the license of the code being copy-pasted is on the person who answered the SO question and on me. If this literally was what Copilot did, nobody would bat an eye that some code it produced was GPL or under any other license, because it wouldn't be Copilot's problem.

Now let's substitute a different database for the code, one that isn't SO. It doesn't really matter if that database is a literal RDBMS, a giant git repo, or is encoded as a neural net. All Copilot is going to do is perform a search in that database, find a result, and paste it in. The burden of licensing is still on me not to use GPL code, and possibly on the person hosting the database.

The gotcha here is that copilot’s database is a neural network. If you take GPL code, along with non-GPL code, and feed it as training data to a neural network to create essentially a lookup table, did you just create a derived work? It is unclear to me whether you did or not. In particular, can the neural network itself be considered “source code”?
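
(For discussion, here is a toy sketch in Python of that "lookup table" framing; the snippet store and queries are invented for illustration, not anything Copilot actually contains. If completion is just fuzzy retrieval over a snippet store, the licensing question reduces to the licensing of whatever gets retrieved.)

    # Toy sketch: completion as fuzzy retrieval over a snippet store.
    # The store is hypothetical; a neural net would replace the dict and
    # difflib with learned weights, but the overall shape is similar.
    import difflib

    snippet_db = {
        "sort pairs by second field": "sorted(pairs, key=lambda p: p[1])",
        "read a file into a string": "open(path).read()",
    }

    def complete(query):
        # fuzzy-match the prompt against stored descriptions, then "paste"
        best = difflib.get_close_matches(query, snippet_db, n=1, cutoff=0.0)
        return snippet_db[best[0]]

    print(complete("sort a list of pairs by the 2nd field"))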


> If you view GPL code with your browser would that mean that your browser now has to be GPL as well?

Some good responses in sibling comments already, but I don't see the narrow answer here, which is: No, because no distribution of the browser took place.

If you created a weird version of the browser in which a specific URL is hardcoded to show the GPL'd code instead of the result of an HTTP request, and you then distributed that browser to others, then I believe that yes, you'd have to do so under the GPL. (You might get away with it under fair use if the amount of GPL'd code is small, etc.)


If you use your browser to copy some GPL code into your project, your project must now be GPL as well.

So following your own argument, even if Copilot is allowed, using it still risks your code falling under the GPL.


My point exactly. Copilot is innocent in that case just like the browser.


Or if you simply read GPL code and learn something from it - or bits of the code are retained verbatim in your memory, are you (as a person) now GPL'd? Obviously not.


That probably depends on how large and how significant the bits you remember are. Otherwise one could take a person with photographic memory and circumvent all GPL licenses easily, by making that person type what they remember.


> Or if you simply read GPL code and learn something from it - or bits of the code are retained verbatim in your memory, are you (as a person) now GPL'd? Obviously not.

I do not find that to be obvious at all.


You do not find it obvious that a human being would not become a GPL'd work?


To build a browser you don't need verbatim GPL code, so it's not a derivative work in the same sense copilot is.

Stack Overflow, on the other hand, is a much trickier question...


SO clearly doesn’t need GPL code to be useful. The wider SE network is evidence of that.


> If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

If I'm Google, and I scan your code and return a link to it when people ask to find code like that (but show an ad next to that link for someone else's code that might solve their problem too), that's fair use and legal. My search engine has probably stored your code in a partial format, and that's fine.


It's fine because a search engine is a generic tool the main purpose of which is not to replicate the code verbatim to be used as code.


>If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

You can wipe your ass with the GPL license if your use of the product falls within Fair Use.

You can actually take snippets from commercial movies and post them onto YouTube if your YouTube video is transformative enough for your usage to be considered fair use. Well, theoretically at least - in reality YouTube might automatically copyright strike it.

>Copying and storing a book isn't recreating another book from it.

That doesn't mean that GitHub has to redistribute Copilot under GPL. However, the end user could potentially have to if they use Copilot to generate new code that happens to copy GPL code verbatim.


> You can wipe your ass with the GPL license if your use of the product falls within Fair Use.

Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?

> That doesn't mean that GitHub has to redistribute Copilot under GPL

I wasn't saying that was the case: some of the code that Copilot used may not allow redistribution under GPL.

But let's say that all of the code it scanned was GPL for the sake of argument. Why would they not have to distribute their Copilot source, yet, if I use it to generate some code, I'd have to distribute mine?

My spidey-sense is tingling at that one!


> Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?

Again, fair use is an exception to copyright protection. If something is fair use, the license does not apply. The fact that Copilot does not release its source code is related only to a specific term of a specific license, which does not apply if Copilot is indeed fair use.


Making money is irrelevant to fair use



Irrelevant to GPL maybe.


> By comparison, Copilot is even more obviously fair use.

You are correct about the (US-specific) fair use exception, but it is in no way as clear as you suggest that what copilot is doing entirely falls under fair use. Fair use is always constrained.

I suspect some variant of this sort of thing will have to be tested in court before the arguments are really clear.


> ...the non-obvious thing for many developers is that fair use is an exception to copyright itself.

More precisely, fair use is an affirmative defense to a claim of copyright infringement. A fair use defense basically says, "Yes, I am copying your copyrighted material and I don't have a license (or am exceeding a licensed use), but my usage is allowed under the fair use doctrine (codified in 17 USC 107 in US law)."


Thanks for this, but can you answer the question:

Would it be 'fair use' for the developers to simply copy code from those repos - even just 10 lines - and claim 'fair use', i.e. circumventing Copilot?

Even if Copilot is 'fair use' ... does that mean the results are 'fair use' on the part of Copilot users?

And a bigger question: is your interpretation of those statutes and case law enough to make the answer unambiguous?

I don't have a legal background, but I do have an operating background with lawyers and tech ... and my 'gut' says that anyone using Copilot is opening themselves up to lawsuits.

If the code you put in your software comes via Copilot, but that code is verbatim from some GPL'd (or worse, proprietary) source ... there's a good chance you could get sued if someone gets the inclination.

Maybe it's because of my personal experience, but I can just see corporate lawyers banning Copilot straight up, as the risks are simply not worth the upside. That's not what we like to hear in the classically liberal sense, i.e. 'share and innovate' ... but gosh, it doesn't feel like a happy legal situation to me.

Looking forward to people with more insight sharing on this important topic.


> Would it be 'fair use' for the developers to simply copy code from those repos - even just 10 lines - and claim 'fair use', i.e. circumventing Copilot?

Only a lawyer (and truly, only a court) could answer that question.

If you copy 100 lines of code that amounts to no more than a trivial implementation in a popular language of how to invert a binary tree, it's likely fair use.

If you copy 10 lines of code that are highly novel, have never been written before, and solve a problem no one outside the authors has solved... it may not be fair use to copy that.

Other people who have replied have mentioned "the heart" of a work. The US Supreme Court has held that even de minimis - "minimal", to be brief - copying can sometimes be infringement if you copied the "heart" of a work.


If this issue is eventually litigated, we will see. The law in the Second Circuit (where the final judgment was rendered before the case was eventually settled) may well be different than the law in a different circuit. If there is a split in the circuit courts, then the Supreme Court may have to weigh in on this issue.

When fair use is an issue, the courts look at the facts in context each time. These are obviously different facts than scanning books for populating a search index and rendering previews; and each side is going to argue that the facts are similar or that they are dissimilar. How the court sees it is going to be the key question.


This could either be:

1. a fascinating Supreme Court opinion.

2. a frustrating ruling because SCOTUS doesn't understand software and code.

3. the type of anticlimactically(?) narrow ruling typical of the Roberts court.

While our Congresspersons can't seem to wrap their minds around technology/social media, I think SCOTUS would understand this one enough to avoid (2).


Fair use cases tend to produce narrowly-written law because the outcomes hinge on how the court judges the facts against the list of factors codified in the Copyright Act (17 U.S.C. section 107). The courts don't really have breathing room to use a different test. I don't recall any cases in which the courts have set binding guidelines for interpretation of these factors.


The Google vs Oracle case showed that SCOTUS can handle technical topics


Next up, Copilot for college papers! Who needs to pay a professional paper-writer (ahem, I mean write the paper) when you can have an AI write your paper for you! It's fair use, so you're entitled to claim ownership to it, right?


I think you are confusing legal protections for intellectual property with plagiarism. (At least that's what I think you're doing if I read your comment as sarcasm and guess what you're trying to say non-sarcastically?) But they are entirely different things.

You can be violating copyright without plagiarizing: so long as you cite your source, it's not plagiarism, but you can still be copying a copyright-protected work in an illegal way when doing so.

And you can be plagiarizing without violating copyright, if you have the permission of the copyright holder to use their content, or if the content is in the public domain and not protected by copyright, or if it's legal under fair use -- but you pass it off as your own work.

Two entirely separate things. You can get expelled from school for plagiarism without violating anyone's copyright, or prosecuted for copyright infringement without committing any academic dishonesty.

You can indeed have the legal right to make use of content, under fair use or anything else, but it can still be plagiarism. That you have a fair use right does not mean "Oh so that means you are allowed to turn it in to your professor and get an A and the law says you must be allowed to do this and nobody can say otherwise!" -- no.


Yeah, I was being sarcastic. But you make a good point about the legality of plagiarism.


Copilot is not doing what your example does.

If Github had a service that automatically mirrored public repositories on Gitlab, that would be equivalent to the example you gave.

But Github is taking content under specific licenses to build something new for commercial use.

I'm not sure if what Github does falls under Fair Use, but I don't know that it matters. I can read fifty books and then write my own, which would certainly rely—consciously or not—on what I had read. Is that a copyright violation? It doesn't seem like it is but maybe it is and until now has been impossible to prosecute?


GitHub isn’t building anything.

The end user is.

By this logic any and all neural nets that draw pictures are copyright infringing as well.


If they create exact copies of copyrighted pictures, then yes, they do.


> Fair use is an exception to copyright itself. A license cannot remove your right to fair use.

...and if you're outside the USA?


Read the Authors Guild v Google dismissal. The court considered it fair use because Google's project was built explicitly to let users find and purchase books, giving revenue to the copyright holders. Copilot does not do that.


> ... giving revenue to the copyright holders.

That's a reference to factor four of the fair use test, "the effect of the use upon the potential market for or value of the copyrighted work." (17 USC 107).

None of the factors are dispositive, however. For example, a scathing book review that quotes a passage to show how bad the writing is might eviscerate sales of the book, but such a use is usually protected. For a counter-example, see Harper & Row v. Nation Enterprises 471 U.S. 539 (1985).


> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.

Exactly the point I came to make.

The Authors’ Guild is a US entity, and so is Google, so only US law applies. And thus, we have the Fair Use exception.

But developers sharing code on GitHub come from and live all over the world.

Now, Github’s ToS do include the usual provision stating that US & California law applies, et cætera, et cætera [1], but… as even they acknowledge may be the case, such provisions usually aren’t considered legal outside of the US.

So… developers from outside the US, in countries with less lenient exceptions to copyright, definitely could sue them.

Identifying these countries and finding those developers, however, is a different matter altogether.

[1]: https://docs.github.com/en/github/site-policy/github-terms-o...


This was a good point. Really enjoying this discussion. Interesting stuff.

I'm really out of my depth in giving my own opinion here, but I'm not sure that either the "distribution != derivative" characterization, or the "parsing GPL => derivative of GPL" one, really locks this thing down. The bit that I can't follow with the "distribution != derivative" argument is the claim that copilot is actually performing distribution rather than "design". I would have said that copilot's core function is generating implementations, which to me does not seem like distribution. This isn't a "search" product, and it's not trying to be one. It is attempting to do design work, and I could see a case where that distinction matters.


I buy the argument about copilot itself and this comment. But when someone goes to release software that uses the output of Copilot, I fail to see how it wouldn't be a GPL derivative work if enough source was used. Copilot is essentially really fancy copy/paste in that context.


I think this is the correct answer. IANAL but the copilot code vs the copilot training data are different things and licensing for one shouldn’t affect the other, right? And the fact that training data happens to also be code is incidental.


One view would be that copilot the app distributes GPL'd code, in a weird encoding. Training the model is a compilation step to that encoding.
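
(A toy illustration of that "encoding" view, making no assumptions about Copilot's internals: even a trivial character-trigram model, standing in here for neural weights, stores its training text in a form it can replay nearly verbatim.)

    # Toy sketch: "train" a character-trigram table on one snippet.
    # The resulting table is a lossy encoding of the training data,
    # yet it can emit long verbatim stretches of it on demand.
    from collections import defaultdict
    import random

    training_code = "int add(int a, int b) { return a + b; }"
    model = defaultdict(list)
    for i in range(len(training_code) - 3):
        model[training_code[i:i+3]].append(training_code[i+3])

    out = training_code[:3]
    while len(out) < 120 and model[out[-3:]]:
        out += random.choice(model[out[-3:]])
    print(out)  # typically replays long verbatim spans of the input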


I assume the code is a derivative work of the training data, because given different data the code would also be different (different neuron weights).


If I read a GPL implementation of a linked list and then write my own linked list implementation, was my neural network in my brain a derivative work of the GPL code?


Sure it is; your brain is not software, though.


So as long as I read GPL code, then rewrite it from memory and feed it to copilot to train it, I can un-GPL anything?


Is it fair use to memorise whole source code byte-by-byte, storing it as some not-quite-lossless compression, for subsequent retrieval of arbitrary-size snippets?


If copilot was trained using the entirety of the Linux kernel, wouldn't the neural network itself need to be GPLed, if not its output?


> Even without a license or permission, fair use permits the mass scanning of books, the storage of the content of those books, and rendering verbatim snippets of those books.

For commercial use and derivative works?

Authors won't incorporate snippets of books into new works unless they're reviews. Copilot is different.


Google Books is a commercial site which incorporated the snippets of millions of copyrighted works. And of course, sitting in thousands of Google servers/databases are full copies of each of those books, photos of each page, the OCRed text of each page, and indexes to search them. Even that egregious copying without a license or permission was considered fair use.

If anything, the ways in which Copilot is different aid Microsoft/GitHub's argument for fair use. Because Copilot creates novel new works, that gives them a strong argument their system is more transformative than Google Books, which just presents verbatim copies of books.


The Google Books example really misses the point: one of the reasons why the judges considered it fair use was that it pointed back to the original sources (and thus potentially increased publishers' earnings).

Copilot does none of that. If all the ML companies are so sure this is fair use, I encourage them to train an AI on Disney movies to generate short cartoon snippets based on some description. There sure would be a court case.


The main issue here is less about doing it than about getting sufficiently nice results. I've done work in generative AI before, and right now the state of the art is passable on single images, with some but not enough control, and is still weak on videos without heavy structure requirements. I expect in 5-10 years we will have good enough models (or hardware) to do short video generation, and the question will get tested then. I also think a meaningful, good video requires audio, and good luck aligning the text (for dialogue), the audio of that text, and the video frames. Aligning all that generation together is still challenging today.


> Authors won't incorporate snippets of books into new works

Of course they do, previous works are quoted all the time.


But that's another thing - co-pilot doesn't quote; it encourages something more akin to plagiarism, doesn't it?


Plagiarism, pretending you made a work entirely yourself when you didn't, is rarely a matter for a court to decide, and the standards for what constitutes plagiarism can vary a lot. When I turn in projects for a course, I cite sources in the comments a lot, even if what I turn in is substantially modified. An employer generally doesn't care if you copied and pasted code from StackOverflow or wherever, so long as you don't expose them to a suit and you don't lie if asked "Did you write this 100% yourself?"

Citing your source is not a get-out-of-jail-free card for copyright infringement; it doesn't really matter.


> Citing your source is not a get-out-of-jail-free card

No, but it's a requirement of the license stackoverflow.com uses, which is unfortunate, for code (as opposed to text, where a quote can be easily attributed).


...with attribution.


And without. Attribution isn't a "copyright escape clause", copying a work without permission is still infringement - unless it's fair use.

Plagiarism is not the same as infringement.


Can you still apply Fair Use if they make Copilot a paid service?


Does intent not matter? Pasting code for explanatory reasons and citing the source seems different than silently incorporating it directly into a commercial work product.


> Fair use is an exception to copyright itself.

And copyright itself is an exception to the normal state of things: the public domain, copyright being only a temporary monopoly.


Assuming that Copilot's use of GPL'd code to provide snippets to a developer is fair use, what rights does the developer have to use that snippet?


Can you copy 10 lines of code from an open source project into your software? Yes you can; it's considered fair use. Nobody will ever sue for that. If it weren't, websites like Stack Overflow, where developers post code probably taken from projects with restrictive licenses and other developers copy it into their own projects, would not exist.

Copilot will not write an entire software module; it will provide you with snippets. I see using GPL code for training as fair use. If a developer reads the source code of a project to take inspiration and possibly copy some small parts, does it violate the license?


When the recent Github v. youtube-dl fiasco happened, I remember reading similarly strongly-worded but dismissive comments regarding fair use, stating how it is quite obvious that youtube-dl's test code could never be fair use and how fair use itself is a vague, shaky, underspecified provision of the copyright law which cannot ever be relied on.

To me, seeing youtube-dl's case as fair use is so much easier than using hundreds of thousands source code files without permission in order to build a proprietary product.


How would you feel about a paid-for search engine using hundreds of millions of web pages without permission in order to build a proprietary product?


There is a crucial difference though: the search engine links back to the content. If Google just displayed the content on their site verbatim, it would definitely not be considered fair use. Even as it is, several countries have restricted what Google can do when displaying, e.g., news.


Somehow building a list of pointers to original content does simply not have the same ring to me as a product that rehashes all of the content. A rehashing of content sounds to me much more like, for example, publishing a sequel to my favourite book. After all, a sequel is just a rehashing of the same characters in new adventures. If we can't do that, why should Copilot be fine?

My point was however that I'm just utterly failing to see how the youtube-dl test thing could be more of a copyright problem than this entire thing based on millions of others' works that is Copilot.


You mean like a search engine?


This is a thoughtful and insightful reply. Thank you.


Books (mostly) are not distributed under the GPL.


True. But Pretty Good Privacy might be worth considering in this context - it was at one point published as a book, after all...

https://philzimmermann.com/EN/essays/BookPreface.html


The GPL only gives you additional permissions relative to what you would have by default. The books included in that suit were more strongly restricted, since there was no license at all.


There are certainly some interesting additional conditions the GPL creates by taking the license away if you violate certain clauses. Regardless, the interesting part of this is that this looks different from the user's point of view and Microsoft's. Sure, 5 lines out of 10,000 is probably fair use. For Microsoft, their system is using the whole code base and copying it a few lines at a time to different people, eventually adding up to potentially lots more than fair use.

The question on this one will be about the difference between Microsoft/Github's product and a programmer using copilot's code:

"If I feed the entire code base to a machine, and it copies small snippets to different people, do we add the copies up, or just look at the final product?"


Does the GPL forbid fair use? Why don't book publishers use a license that forbids fair use?


Because fair use is an exception to copyright itself. A copyright license can't take away your legal right to fair use.


> Why don't book publishers use a license that forbids fair use?

They couldn't do it with a license, which only imposes conditions for the license to be valid. Fair use applies even if the copier has no license at all.

Potentially they could do it with a contract. A license is not a contract and imposes no covenants on the parties.


While I agree you are correct about fair use (in the US, anyway) being an exemption from copyright that thus supersedes licensing,

I disagree that Copilot is "more obviously fair use"; some parts might be, but we have seen clear examples (i.e. verbatim code reproduction) that would not be.

I don't believe the question of "is this fair use" is as clear as you believe it to be.


Just for reference, the Hacker News source is public.


Not the current version? AFAIK there's some security-by-obscurity in the measures against spam, voting rings, etc.?


I think the bigger issue is that use of Copilot puts the end user at risk of using copyrighted code without knowing it.

Sure one could argue that Copilot learned in the way a human does. There is nothing that prevents one from learning from copyrighted work, but snippets delivered verbatim from such works are surely a copyright violation.


More interestingly, if we can trick it into regurgitating a leaked copy of the windows source code, Microsoft apparently says that’s fair use.


This is pretty interesting for AI in general. Should you be able to train with material you don't own? Can your training benefit from material that has specific usage licenses attached to it? What about stuff like GameGAN?


> ...Should you be able to train with material you don't own?

If relating this to how humans learn, books and other sources are used to inform understanding and human knowledge. One can purchase or borrow a book without actually owning the copyright to it. Indeed, a given passage may be later quoted verbatim, provided it is accompanied with a reference to its source.

Otherwise, a verbatim use without attribution in authored context is considered plagiarism.

So, sure one can use a multitude of material for the training. Yet, once it gets to the use of the acquired "knowledge" - proper attribution is due for any "authentic enough" pieces.

What is authentic enough in this case is not easy to define, however.


"If relating this to how humans learn" seems like a big IF though right? Are we going to treat computer neural nets as human from a legal standpoint?

At some point neural nets like GameGAN might be good enough to duplicate (and optimize) a commercial game. Can you then release your version of the game? Do you just need to make a few tweaks? Are we going to get a double standard because commercial interests are opposed depending on the use case?

It would be pretty funny if Microsoft as a game publisher lobbies to prevent their IP being used w/ something like GameGAN, but then takes the opposing stand point for something like their CoPilot! Although I'm sure it'll be spun as "These things are completely different!".


This is the key question. In school I was taught to be careful to always cite even paraphrased works. If Copilot regurgitates copyrighted fragments without citation or informing acceptors of licenses involved then it's facilitating infringement.


> Are we going to treat computer neural nets as human from a legal standpoint?

Maybe we will some day, but for now this isn't the case, where the law is concerned :

https://ilr.law.uiowa.edu/print/volume-101-issue-2/copyright...


Assuming that copilot is a violation of copyright on GPL works, it would also be a violation of non-GPL copyrighted works, including public, but fully copyrighted, works. Therefore relicensing others' source code under GPL would violate even more copyright.


So in that case, of course copilot would have to give license info for every. single. snippet. Case solved. Only they will probably not do that.


They'll probably get away with it, but it definitely seems against the spirit of the GPL, just as closed-source GitHub existing because of open source software seems quite hypocritical.


IANAL, but as I understand it, the ruling in the US is that machines cannot produce "derived works" of copyrighted works. If it replicates (A)GPL code verbatim, it's up to the user to comply with its license.

Of course the interesting part is that the user not only has no idea what that license is but also where the code came from and if it is in fact copied verbatim. It's unlikely a court would agree that putting licensed code through a machine strips the licensing requirements of the code, of course, but that doesn't seem to be Microsoft's problem.

I think Microsoft's use of public code hosted on GitHub is covered by the terms of service but if this use includes granting a license more permissive than the license indicated on the code itself, this would probably put every GitHub user who ever committed less permissively licensed code to GitHub that they didn't control in violation of those licenses.

There's really only three ways this can go:

1) Machine learning does legally become a license-stripping black box, which would allow creating a machine generated commons by feeding arbitrary copyrighted works into sloppy AIs that mostly just replicate their input without changes.

2) Copyright law is extended to consider the output of machine learning as derived works from its inputs, massively extending the reach of copyright and creating massive headaches for everyone (e.g. depending on the exact ruling this would effectively make it impossible to reproduce a digital artwork as merely rendering it on a screen would create a derived work).

3) The original licenses are upheld and remain in effect, rendering the output of Copilot useless by creating a massive legal headache for anyone trying not to violate copyright.

I think outcome 2 is unlikely but 1 and 3 aren't mutually exclusive.


If Github hosts AGPL code, does that mean that github's own code must be AGPL? Obviously not. What's the difference?

There's no point to copilot without training data; some but not all of the training data was (A)GPL. There's no point to github without hosting code; some but not all of the code it hosts is (A)GPL.

The code in either case is data or content; it has not actually been incorporated into the copilot or github product.


> If Github hosts AGPL code, does that mean that github's own code must be AGPL? Obviously not. What's the difference?

GitHub's TOS include granting them a separate license (i.e., not the GPL) to reproduce the GPL code in limited ways that are necessary for providing the hosting service. This means commonsense things like displaying the source text on a webpage, copying the data between servers, and so on.


Code isn't to GitHub what training data is to this model; or at least, even if you could argue that it is within the current framework, it shouldn't be.


> we don’t distribute copies of humans

A bit of a tangent and it’s fictional, but I really have to recommend the tale of MMAcevedo. https://qntm.org/mmacevedo


This is a great argument.


copilot isn't distributing copies of itself either.


I am really confused by HN's response to copilot. It seems like before the twitter thread on it went viral, the only people who cared about programmers copying (verbatim!) short snippets of code like this would be lawyers and executives. Suddenly everyone is coming out of the woodworks as copyright maximalists?

I know HN loves a good "well actually" and Microsoft is always suspect, but let's leave the idea of code laundering to the Oracle lawyers. Let hackers continue to play and solve interesting problems.

Copilot should be inspiring people to figure out how to do better than it, not making hackers get up in arms trying to slap it down.


> I am really confused by HN's response to copilot.

If you're asking about the moral reaction here, I think it depends on how one views Copilot. Does Copilot create basically original code that just happens to include a few small snippets? Or does Copilot actually generate a large portion of lightly changed code when it's not spitting out verbatim copies of the code? I mean, if you tell Copilot, "make me a QT-compatible, cross-platform windowing library" and it spits out a slightly modified version of the QT source code, and someone started distributing that with a very cheap commercial license, that would be a problem for the QT company, which licenses their code commercially or under the GPL (and as QT is a library, the QT GPL forces users to also release their code GPL if they release it, so it's a big restriction). So in the worst case scenario, you get something ethically dubious as well as legally dubious.

> Copilot should be inspiring people to figure out how to do better than it, not making hackers get up in arms trying to slap it down.

Why can't we do both? I mean, I am quite interested in AI and its progress, and I also think it's important to note the way that AI "launders" a lot of things (launders bias, launders source code, etc). AI scanning of job applications has all sorts of unfortunate effects, etc. etc. But my critique of the applications doesn't make me uninterested in the theory; they're two different things.


A naive developer thinks that they are the source code they write (you're not), and their source code leaking to the world makes them worthless. (Which isn't true, but being that invalidated explains a lot of the fear. Which, welcome to the club, programmers. Automation's here for your job too.)

Still, some of the moral outrage here has to do with it coming from Github, and thus Microsoft. Software startup Kite has largely gone under the radar so far, but they launched this back in 2016. Github's late to the game. But look at the difference (and similarities) in responses to their product launch posts here.

https://news.ycombinator.com/item?id=11497111 and https://news.ycombinator.com/item?id=19018037


> A naive developer thinks that they are the source code they write (you're not), and their source code leaking to the world makes them worthless.

Maybe Github isn't violating the licenses of the programmers who host on them. Maybe Copilot doesn't just spit out code that belongs to other people. Those are matters of interpretation and debate.

But if Github was doing this with Copilot, virtually every open source programmer would have a reason to be upset. Open source programmers don't give their code out for free; they license it. This is a legal position, not a feeling. "Intellectual property" may be a pox on the world, but asking open source developers to abandon their licenses to ... closed source developers, is legitimately a violation.

And before the spitting-out-source-code problem appeared, I recall quite a few positive responses to Copilot. Lots of people still seem excited. And yeah, people are looking at the downside given Microsoft's long abusive history, but hey, MS did those things.


You've answered your own question. They went under the radar and nobody cared about them. They're not the multibillion-dollar company that sued Mike Rowe and keeps ReactOS developers awake at night.


Try doing any type of deal (fundraising, M&A) where you can't point to the provenance of your application's code. This isn't good for programmers; programmers WANT clean and knowable copyrights. This is good for lawyers, who'll now have another way to extract thousands of $$ from companies to launder their code.


If you do get sued, the Copilot page is written in a way that would make Github legally responsible for it, not you. "Just like with a compiler, the output of your use of GitHub Copilot belongs to you."


Yeah, right... This isn't going to fly in court any more than if the Pirate Bay page was written in a way that says that it's solely responsible for what you do with the magnet links that they share.


The pirate bay is very clear to not claim any responsibility for what people post on their site. That's how they get away with it.


I know, it's a hypothetical.


On many ML posts, you get arguments about IP, and there's a long history of IP wars on this forum, especially when licensing comes up. Then you add the popular Big Tech Is Evil arguments you see. I think it's a variety of factors coming together for people to be upset about someone else profiting from their own work in ways they didn't mean to allow.

I expect that we'll need new copyright law to protect creators from this kind of thing (specifically, to give creators an option to make their work public without allowing arbitrary ML to be trained on it). Otherwise the formula for ML based fair use is "$$$ + my things = your things" which is always a recipe for tension.


I think the real issue is less about the "copying short snippets", and more about how it was done, i.e. zero transparency, default opt-in without any regard to licensing (with no way to opt out??) and, last but not least, planning to charge money for it.


I've always cared but never talked about it. Someone copy and pasting code from a source that is clearly forbidden (free software, reverse engineered code, leaked source code, etc) isn't an interesting thing to talk about. It's obviously wrong.

Also people rarely do it; I've caught maybe a couple instances of it in my career and I never really thought too much about them again. This tool helps make it a lot easier and more common. I have a feeling other people chiming in are also in the camp of "Oh, this is going to be a thing now, huh?"

I also can't help but think that my negative opinion of it isn't solely based on this provenance issue. While it's cool, it seems questionable how practical it is. If the value were clearer I think I could stomach the risk a bit better.


Firstly it's important to remember that HN is not a single person with a single opinion, but many people with conflicting opinions. Personally I'm just interested in the copyright discussion for the sake of it because I find it interesting. Though, I imagine there's also an amount of feelings of unfairness.


As a mature, skilled engineer, you wouldn’t mind sharing your knowledge—but you’d really prefer to do this on your own terms.

First, you might choose to distribute your code under a copyleft license to advance the OSS ecosystem. Second, the older you get, the more experience you accumulate, paradoxically the harder it is for you to find a job or advance your career in this industry—so, to maintain at least some source of motivation for tech companies to hire you, you may choose to make some of the source available, but reserve all the rights to it.

You’re fine making the source of your tool or library open for anyone to pass through the lens of their own consciousness and learn from it, but not to use as-is for their own benefit.

Now with GitHub Copilot, suddenly you see the results of the labour you’ve previously made public (under the above assumptions) being passed through some black box, magically stripped of your license’s protections, and used to provide ready-made solutions to everyone from kids cheating at college tests to well-paid senior engineers simply lacking your expertise.

I hope it’s easy to spot how the engineer’s interests in the above example are not necessarily aligned with GitHub’s, how this may be perceived as an unfair move disadvantaging veteran rank-and-file software engineers while benefitting corporate elites and investors, and how it subsequently has the potential to disincentivize source code sharing and deal a blow to the OSS ecosystem as a whole.


Perhaps people on HN start sensing that successors of Github Copilot will take their programming job. Rightly so.

Personally, I think that in the age of AI programming any notion of code licensing should be abolished. There is no copyright for genes in nature or memes in culture; similarly, there shouldn't be copyright for code.


> Perhaps people on HN start sensing that successors of Github Copilot will take their programming job. Rightly so.

I still think we're a long way from that. Copilot will help write code quicker, but it's not doing anything you couldn't do with a Google search and copy/paste. Once developers move beyond the jr. level, writing code tends to become the least of their worries.

Writing the code is easy, understanding how that code will affect the rest of the system is hard.


Based on the responses I've seen, people have it in their heads that Copilot is a system where you describe what kind of software you want and it finds it on Github and slaps your own license on it.

It's just a smarter tab-completion.


Depends on your definition of "a long way". Some of the GPT-3-based code generation demos (which, explicitly, are just that - demos - we aren't shown the limitations of the system during the demo) suggest that's closer than I think.

https://analyticsindiamag.com/open-ai-gpt-3-code-generator-a... has a bunch of videos of this in action.


That's because the training set had that specific demo, not because copilot imagined up a demo.


> Perhaps people on HN start sensing that successors of Github Copilot will take their programming job. Rightly so.

I feel like this comment misunderstands what a software developer is doing. Copilot isn't going to understand the underlying problem to be solved. It's not going to know about the specific domain and what makes sense and what doesn't.

We're not going to see developers replaced in our lifetime. For that you need actual intelligence - which is very different from the monkey see monkey do AI of today.


The thing is that understanding the domain and thinking out a fairly efficient or elegant solution is something a lot of industry specialists and scientists can do, and it's only part of programming. Another part is dealing with all the language syntax and specialist lego bits/glue code, and that's something domain specialists tend to be less good at and not enjoy spending time on; it's its own craft.

Having a semi-intelligent monkey that can fetch obvious things off the shelf, build very basic control structures, and do the boring little housekeeping tasks is bad for the craft of programming but very good for the good-enough-solution situation. I can see it having the same impact as cheap and widely available digital cameras; anyone can be a kinda decent photographer now, but if you want to be a professional you're probably going to have to work a lot harder to stand out, whether that's by development of craft, development of narrow technical expertise and fancy equipment, or development of excellent business skills.


The funny thing with "good enough" solutions is that at some point it becomes unmanageable. I've basically spent a good part of my career cleaning up these solutions to make way for scalable, maintainable solutions that don't introduce security holes.

Photography is a good analogy - with everyone having fancy cameras you could think that a photographer is now not necessary. But yes there are still photographers about - they see things that the average person doesn't. The camera doesn't tell them what type of photos to take, what composition the photo should have or what poses a model should have.


You have excellently described the job of business analysts and system architects, but this is not the job of 90% of programmers today, including senior-level. Part of this is already done by other people and doesn't require specific programming skills, hence, at the very least, programmers will lose their privileged position. Another part of it is actually too hard for most people who are currently employed as programmers to do on a decent level (such as meaningfully hacking on Linux kernel).


Memes are absolutely copyrightable, heard of Grumpy Cat?

New genetic sequences are patentable, not copyrightable, but that's because of the process involved in creating new genetic sequences more than the genes themselves.

Sure naturally occurring genes aren't patentable, but it's not like we have code growing on trees. So that's a terrible comparison.


The problem with Copilot is that, so far, it doesn't seem to be much of an AI and more of a copy-bot. If you are just copying code, you quickly run into copyright issues with your sources. A true AI based on training on open source software would be something different.


Patents on genes actually are a thing, so that example is pretty false. Whether they should be a thing is a separate question, but right now the discovery of a gene and its usefulness can be patented, and this is done for medical patents.


People aren't happy because Microsoft is exploiting open source. They're training it on open source code and keeping the service for themselves.

If they made the trained model public (and also trained it on private code) the response would be completely different.


>There is no copyright for genes in nature

Since when are humans not a part of nature?


You don't have to be a copyright maximalist to worry about a company taking snippets of code that used to be under an open license and using them in a closed-source app.


In addition, this is extremely hard to enforce. I think the amount of code running in closed systems that does not exactly respect the original license is shocking. What was the last case you know where this was a "scandal"?

It only happens at boss level when tech giants litigate IP issues.


I don't know about HN in general but my impression has been that anyone copying random code off the internet or adding dependencies without understanding the license (e.g. just blindly adding AGPL code) would be very much frowned upon in any remotely professional setting because a basic understanding of copyright and open source licensing is expected of even junior developers.

"Hackers" "playing" and ignoring copyright is fine, but Copilot isn't promoted as a toy, it's promoted as a tool for professional software development. And in that framing it is about as dangerous as an untrained intern with access to the production server.


I'm more surprised that people don't care about the telemetry aspect. It's an extension that sends your code to an MS service, and MS promises access is on a need-to-know basis.

I don't care if MS copies my hobby projects exactly, but I'm not sure my employer (defense contractor) would even be allowed to use a tool like this.

I think it looks cool though. I will probably try it out if it is ever available for free and works for the languages I use.


It's quite possible to do this on-prem and even on-device. TabNine, a very similar system with a smaller model (based on GPT-2 rather than 3), has existed for years and works on-device.
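
(For the curious, this is roughly what fully on-device completion looks like with an off-the-shelf small model, assuming the Hugging Face transformers package is installed; the model choice and prompt are illustrative, not what TabNine actually ships.)

    # Sketch of local code completion: the model runs on-device,
    # so no source code is sent to any third-party service.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "def fibonacci(n):"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0]))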


The difference between copilot and copy pasting from stackoverflow is consent


It's a pretty standard "big company releases new thing" reaction. HN is usually negative on everything.


Is it really confusing? It's a rich company using the fruits of our labor, provided free TO OTHER DEVELOPERS. I have never okayed "use my code to train AIs that nobody else could". It's backhanded and unfair.


Programmers love to pretend that they're lawyers, especially when it comes to copyright law. Something about the law really appeals to hackers!


Copyleft licenses are generally liked by developers; this flies very directly against that, since it suggests circumvention of those types of licenses.


If very powerful companies are appropriating and reproducing code in contravention of copyright then that is something that should be called out.


If copilot were open source I wouldn't have an issue with it. However, it is closed source and a later version is intended to be sold.


It is a large corporation eroding the integrity of open source licenses. It is perfectly reasonable to be pissed off about this.


This isn't true at all. There are stories concerning code stealing that regularly lead the front page on HN and rouse a pretty intense reaction from the community. Saying that HNers have never before cared about this issue seems pretty inaccurate or disingenuous.


Copilot violates the assumptions many people made when they open sourced their code. Moving from manual to automated use feels like a privacy violation because it dramatically changes the amount of effort it takes to leverage the work in an unintended context.


idk, I don't quite enjoy the idea of having my code stolen without any respect for its licence or even attribution

but then again I migrated away from github as soon as MS bought it

still, it's a matter of principle


> Copilot should be inspiring people to figure out how to do better than it, not making hackers get up in arms trying to slap it down.

One of the (many) problems is that GitHub/Microsoft already benefit from runaway network effects so it’s difficult to “do better”. Where will you get all of that training code if not off GitHub?

The real answer to this is to yank your projects from GitHub now while you search for alternatives.


Even if you do that, what's to stop them from using open source software from all over the web and not just what's on GitHub? The only way to stop them then is to go closed source.


I mean stop them at a larger level by threatening their success as an organization. If developers stop publishing to GitHub they have bigger problems than training ML models.

Whether or not this move is “legal”, it should serve as a wake up call that GH is not actually a service we should be empowering. This incident is just one example of why that’s a bad idea.


They make you give up some of your monopoly rights when you put stuff on Github (some parts of those ToS might or might not be legal).

You would have a much stronger case if they had taken your code from elsewhere.


Copyright defends us from some of the abuse by large corporations in the form of the GPL.

Want Linux to run on your thing? You must publish driver source then, or you're violating copyright law. This was less of a big deal before device vendors ratcheted the pathological behavior up to 11 with smartphones, and that's why far more people seem to react far more strongly now.


Hacker News hates everything, especially if it seems to work. Don't read into it.


"Please don't sneer, including at the rest of the community."

https://news.ycombinator.com/newsguidelines.html


Ok, my curiosity has been fired here...

I have conjured up two scenarios here:

Let's say I use copilot to generate a bunch of code for an app, something substantial, and it regurgitates a load of bits and pieces from many sources it got from GitHub. I'd assume there won't be any attribution in it... it will be as if Copilot made the code itself (I know it sort of does, but let's not split hairs!). I'm guessing the prevailing theory (from GitHub anyway) is that I'm legitimately allowed to do this.

Now, let's say I generated all that code by manually copying and pasting chunks of code from a whole bunch of repos, whether they are open source, unlicensed, whatever. Would I not be ripe for legal issues? I could potentially find all the code that copilot generated and just copy and paste it from each of the sources and not mention that in my license. What if I told everyone "yeah, I just copied and pasted this from loads of Github repos and didn't put any attribution in my code". I'd assume that (morality aside) I'd be asking for trouble!

Am I missing something? Am I misunderstanding the situation, or the capabilities of copilot?


There's a decent bit of caselaw indicating that computers reading and using a copyrighted work simply "don't count" in terms of copyright infringement -- only humans can infringe copyright. This article[0] does a pretty good job of summarizing the rationale that the courts have provided. My (non-lawyer) take is that GitHub is pushing this just half a step farther -- if computers can consume copyrighted material, and use it to answer questions like "was this essay plagiarized", then in GitHub's view they can also use it to train an AI model (even if it occasionally spits back out snippets of the copyrighted training data). Microsoft has enough lawyers on staff that I'm sure they have analyzed this in depth and believe they at least have a defensible position.

[0]: https://slate.com/technology/2016/08/in-copyright-law-comput...


Makes me wonder what would happen if a similar thing was done with books. If I train an AI on all the texts of Tom Clancy, or Stephen King, or every Star Wars novel, and the books it generates every so often produce paragraphs verbatim from one of those sources, would copyright owners be up in arms? What would the distinction be between the code case and the text case?


I am not a lawyer. I do photography and have a more than passing interest in copyright as it applies to the photographs I take and the material I photograph.

Copyright on art gets more interesting / fuzzier. The key part is substantial similarity - https://en.wikipedia.org/wiki/Substantial_similarity and https://www.photoattorney.com/copyright-infringement-for-sub...

Rather than text, my AI copyright hypothetical... consider a model created based on sunset photographs. You take a regular photograph, pass it through the model, and it transforms it into a sunset. The model was trained on copyrighted works but the model is considered fair use.

Now, I go and take a photograph from some location during the day and then pass it through the transformer and get a sunset. Yea me! Unbeknownst to me, that location is a favorite location for photographers and there were sunsets from that location used in the training data. My photograph, transformed to look like a sunset is now similar to one of them in the training data.

Is my transformed photograph a derivative work of the one in the training data to which it bears similarity? How would a judge feel about it? How does the photographer whose photograph was used in the training data feel?


What would be interesting in that case would be how the transformed image would look if photos from that location were removed from the training set. That would help reveal whether it was just copying what it had seen or it actually remembered what sunsets looked like and transformed the image using its memory of sunsets in general.


This will surely happen within the next few years; but if the "new work" contains a full paragraph from an existing novel the copyright hammer would come down hard.

Maybe it needs to be paired with another network / hunk of code that checks for verbatim copying?
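
(A crude sketch of what that verbatim check could look like, just to make the idea concrete; the corpus and window size are placeholders, and a real system would need proper tokenization and far more scale.)

    # Toy verbatim-copy detector: flag every n-token window of the
    # generated text that appears word-for-word in the training corpus.
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def flag_verbatim(generated, corpus_docs, n=8):
        seen = set()
        for doc in corpus_docs:
            seen |= ngrams(doc.split(), n)
        return [g for g in ngrams(generated.split(), n) if g in seen]

    corpus = ["int add(int a, int b) { return a + b; }"]  # placeholder
    print(flag_verbatim("int add(int a, int b) { return a + b; }", corpus))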


> There's a decent bit of caselaw indicating that computers reading and using a copyrighted work simply "don't count" in terms of copyright infringement -- only humans can infringe copyright.

I have read variations of "computers don't commit copyright" more times than I can count in the past few days.

How is Copilot different from a compiler? (Please give me the legal answer, not the technical answer. I know the difference between Copilot and a compiler, technically.)

Isn't a compiler a computer program? How is its output covered by copyright?

Am I fundamentally misunderstanding something here?


What if I made a few tweaks to Copilot so that it is very likely to reproduce large chunks of verbatim code that I would like to use without attribution, such as the Linux kernel. Do you really think you can write a computer program that magically "launders" IP?

A compiler is run on original sources. I don't see any analogy here at all.


* They both process source code as input.

* They both produce software as output.

* They both transform their input.

* They both can combine different works to create a derivative work of each work. (Compilers do this with optimizations, especially inlining with link-time optimization.)

They really do the same things, and yet, we say that the output of compilers is still under the license that the source code had. Why not Copilot?
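
(A small Python analogue of the inlining point, with made-up names: a "build" step that splices a dependency's source verbatim into the output artifact, much the way link-time optimization splices a library function's body into your binary.)

    # Toy "inliner": copy a helper's source verbatim into the output,
    # the way LTO inlines a library function's body into the caller.
    import inspect

    def helper(x):  # imagine this came from a GPL'd library
        return x * x + 1

    def build(entry_source, dep):
        # the artifact now contains the dependency's text verbatim
        return inspect.getsource(dep) + "\n" + entry_source

    artifact = build("print(helper(3))", helper)
    exec(artifact)  # the combined work runs: prints 10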


> Why not Copilot?

Because the sources used for input do not belong to the person operating the tool.

If you say that doesn't matter, then you are saying open source licenses don't matter because the same thing applies - I could just run a tool (compiler) on someone else's code, and ignore the terms of their license when I redistribute the binary.


No, I think that’s the point.

If I take some code I don’t have a license for, feed it to a compiler (perhaps with some -O4 option that uses deep learning because buzzwords), then is the resulting binary covered under fair use, and therefore free of all license restrictions?

If not, then how is what Copilot is doing any different?


> If I take some code I don’t have a license for, feed it to a compiler (perhaps with some -O4 option that uses deep learning because buzzwords), then is the resulting binary covered under fair use

No, the binary is not free of license restrictions. Read any open source license - there are terms under which you can redistribute a binary made from the code. For the GPL you have to make all your sources available under the same terms, for example. For MIT you have to include attribution. For Apache you have to attribute and agree not to assert patent claims over the work in the Apache-licensed project you use. This has been upheld in many court cases - though it is not always easy to find litigants who can fund the cases, the licenses are sound.


I think you have what I am saying backwards. I am saying that the licenses should apply to the output of Copilot, like they apply to the output of compilers.


Oh sorry, my mistake! Thank you.


That only makes it worse.


You just blew my mind with that analogy. I can only imagine some hair-splitting logic to rationalize a distinction.


The analogy goes even further if you consider compiler optimizations: https://gavinhoward.com/2021/07/poisoning-github-copilot-and... .


"Computers don't commit copyright" is a complete misreading or misunderstanding of another proposition, that "computers cannot author a work".

Authoring is the act that causes a work to be copyrightable. In most jurisdictions, authoring a work automatically causes copyright to subsist in the work to some degree. The purpose of the copyright system is to encourage people to author new, original works, by rewarding those who do with exclusive rights. It is well-known that only humans can author a work. Computers simply cannot do it. If your computer (by some kind of integer overflow UB miracle) accidentally prints out a beautiful artwork, NOBODY has exclusive copyright over it, and anyone may reproduce it without limitation. Same goes for that monkey who took a selfie.

What a compiler does, on the other hand, is adapt a work. Adapting a work is not authoring it. Sometimes when you adapt a work, you also author some original work yourself, like when you translate a book into another language. When a compiler (not a linker) transforms source code, it absolutely, 100% definitely does NOT add any original work; the executable or .so/.a/.dylib/.dll file is simply an adaptation of the original work. The copyright-holder of the source code is the copyright-holder of the machine code. An adaptation is also known as a "derivative work".

(Side note; copyleft licenses boil down to some variation of "if you adapt this, you have to share everything in the derivative work, not just the bits you copied.")

Adaptation is a form of reproduction. It's copying. "Distribution" also often involves copying, at least on the internet. (Selling or giving away a book you have purchased does not constitute copying.) Copying is one of the exclusive rights you have when you own the copyright in a work, that you may then license out.

It gets more complicated when the computer uses fancy ML methods to produce images/text out of things it has seen/read. You can't simplify the law around that to a simple adage digestible enough to share memetically on HN and Twitter. One thing is certain: if the computer did it, by itself, then no original work was authored in the process. That poses a problem for people who write the name of a function and get CoPilot to write the rest; if you do that, you are not the author of that part of the program. If you use it more interactively that's a different story.

There is, however, always a question of whether the copyright in the original works the computer used still subsists in the output.

My rough framing of the licensing issues around CoPilot is therefore as follows:

1. The source code to CoPilot is an original work, and the copyright is owned by GitHub.

2. When GH trained CoPilot's models on other people's works, was that copying? (This one is partially answered. It can spit out verbatim fragments, so it must be copying to some extent, rather than e.g. actually learning how to code from first principles by reading.) If it was not all copying, how much of it was copying and how much of it was something else? What else was it?

3. If GH adapted the originals, what is the derivative work? (I.e. where does the copyright subsist now? Is it a blob of random fragments of code with some neural network weights?)

4. Which works is it an adaptation of? You might think "all of them, and for each one, all of the code" but I'm not so sure. For example, imagine the ML blob contains many fragments, but some are shorter than others. If your program has "int x;" in it, and CoPilot can name a variable "x", you can hardly claim that as your own. I'm most interested in whether the mere fact of CoPilot having digested ALL of it, having fed this into the mix and producing a ML blob based on all that information, means that the ML blob is a derivative work of all of them. Or whether there is some question of degree.

5. Fair use. Was it fair use to train the model? Is it, separately or not, fair use to create a commercial product from the model and sell it? Fair use cares about commercial use, nature of the copied work, amount of copying in relation to the whole, and the effect on the market for / value of the copied work. Massive question.

6. If not fair use, then GH is subject to the licenses and how they regulate use of the works. What license conditions must GH comply with when they deal with the derivative work, and how? Many will be tempted to jump straight to this question and say GH must release the source code to CoPilot. I'm not yet convinced that e.g. GPL would require this. I can't believe I'm writing this, but is the ML blob statically or dynamically linked? Lol.

7. Final question, is there some way to separate out works which were copied under fair use (or not copied at all), from works which were copied with no fair use? People are worried about code laundering, e.g. typing the preamble to a kernel function and reproducing it in full. In that situation, it is fairly obvious that the end user has ultimately copied code from the kernel and needs to abide by GPL 2.0; moreover if they're using CoPilot to write out large swathes of text they will naturally be alert to this possibility and wary of using its output. But think of the converse: if there is no way to get CoPilot to reproduce something you wrote, what's the substance of your complaint? Is CoPilot's model really a derivative of your work, any more than me, having read your code, being better at coding now? Strategically, if you wanted to get GH to distribute the model in full, you might only need one copyleft-licensed, verbatim-reproducible work's owner to complain. But then they would just remove the complainant's code. You might be looking at forcing them to have a "do not use in CoPilot" button or something.


I think this is more cogent analysis than anything else I've seen yet on this topic. You should consider submitting a blog post so this can become a top-level topic.

Also, I loved this quote:

> Copying is one of the exclusive rights you have when you own the copyright in a work, that you may then license out.

I've been paying attention to software copyright topics for more than twenty years and never thought of it in exactly these terms. It's right there in the name - the right to copy it, and to determine the terms under which others can copy it, is exactly what a copyright is!


I don't doubt that an army of lawyers has pored over this, but they have size on their side: the cost of litigation vs potential revenue will be a massive factor.

Edit: > There's a decent bit of caselaw indicating that computers reading and using a copyrighted work simply "don't count" in terms of copyright infringement.

That means their computer can read any code it wants, do whatever it wants with the code, then they can monetise that by giving YOU the code. Would they then be indemnified by saying "no Microsoft human read or used this code"?

However, if you then use the code and look at it, does that make you liable?


Again, not a lawyer, just a guy who likes reading this stuff. The devil is usually in the details of copyright cases. The Turnitin case hinged substantially on whether Turnitin's use of copyrighted essays was "fair use". There are four factors[0] which determine fair use; the two more relevant factors here are "the purpose and character of your use" and "the effect of the use upon the potential market". The court found that Turnitin's use was highly "transformative" (meaning they didn't just e.g. republish essays; they transformed the copyrighted material into a black-box plagiarism detection service) and also found that Turnitin's use had minimal effect on the market (this is where "computers don't count" comes in -- computers reading copyrighted material don't affect the market much because a computer wasn't ever going to buy an essay).

I would be shocked if GitHub's lawyers didn't argue that using copyrighted material as training data for an AI model is highly transformative. There may be snippets available from the original but they are completely divorced from their original context and virtually unrecognizable unless they happen to be famous like the Quake inverse square root algorithm. And I think GitHub's lawyers would also argue that Copilot's use does not affect the _original_ market -- e.g. it does not hurt Quake's sales if their algorithm is anonymously used in a probably totally unrelated codebase.

Your counterexample would probably fail both tests -- it's not transformative use if your software hands out complete pieces of copyrighted software, and it would definitely affect the market if Copilot gave me the entire source code of Quake for my own game.

[0]: https://fairuse.stanford.edu/overview/fair-use/four-factors


I thought I understood fair use but turns out I was wrong...

That being said, creating a transformative work from something else is considered fair use. So, for example, if I read a whole bunch of books and then, heavily influenced by them, create my own, similar book, that would be fair use I suppose... that makes sense.

But, where does the derivative works come in? Where do you draw the line?

If I am heavily influenced by billions of lines of other people's GPL code (ala Copilot!), then I create my own tool from it and keep my code hidden, does that not mean I am abusing the GPL license?


That's what I meant by the devil being in the details -- these gray area questions hinge on the specific facts. Lawyers on both sides will argue which factors apply based on past caselaw and available evidence, and the court renders a decision. For example, from the Stanford webpage I previously linked: "the creation of a Harry Potter encyclopedia was determined to be “slightly transformative” (because it made the Harry Potter terms and lexicons available in one volume), but this transformative quality was not enough to justify a fair use defense in light of the extensive verbatim use of text from the Harry Potter books". So you might be okay creating a Harry Potter encyclopedia in general, but not if your definitions are copy/pasted from the books, but you might still be okay quoting key lines from the books if the quotes are a small portion of your encyclopedia. The caselaw just doesn't lend itself to firm lines in the sand.


If you read a bunch of books and then create a similar book, that isn't transformative; transformative is like, you read a bunch of books and then create a machine translation service. The point of transformative is like "isn't going to conflict with the market or compete in any way with the original thing".


That’s funny, because the bedrock of copyright - insofar as software is concerned - is entirely predicated on the idea that a computer copying code into RAM to execute it is indeed a copyright violation outside of a license to do so.


I think you're right. Especially given that Copilot can reproduce significant blocks of code: https://twitter.com/mitsuhiko/status/1410886329924194309

Famous code: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...


I see this held up as an example a lot, but the fast inverse square root algorithm didn't originate from Quake and is in hundreds of repositories - many with permissive licenses like WTFPL and many including the same comments.

GitHub claims they didn't find any "recitations" that appeared fewer than 10 times in the training data. That doesn't mean it's a completely solved issue (some code may be repeated in many repositories but always GPL, and there are limitations to how they detect recitations), but from rare cases of generating already-common solutions people seem to be concluding that all it does is copy paste.


That may be true, although even GitHub doesn't know for sure. But the problem remains: they're reproducing other people's code without regard to license status.


Copilot is a commercial paid service that generates money for Microsoft


Yeah, that bit I realise but the point I was getting at is this: if I take someone else's code, use chunks of it in my app, say that it's mine and make money from it is that not illegal? Or, at least in violation of the license?

Superficially at least, Copilot (from my understanding) is "copying" code, letting me use it in my app, and making money from it.

I'm just trying to wrap my head around it.

Let's be clear, I am not a lawyer, but it seems... strange!


Also NAL, but I think there's far more of a case that users of Copilot might violate copyright rather than Copilot itself:

- Only a very small proportion of Copilot generated code is reproduced verbatim, so if you specifically built a product just from copied-verbatim code, your act of selecting and combining those pieces of copyrighted code would be creating a derivative work.

- GitHub is not selling the copyrighted code, they are selling the tool itself. Google is literally the same thing: you could theoretically create a product by googling for prefixes of copyrighted code and then copying the remainder straight out of the search results. It's you who would be violating copyright, not Google.


I think there is an argument to be made that Copilot is producing derivative code, though. It may produce copies verbatim, and that's a violation, but far more often, it produces a mixture of things it was trained on, most of which probably have some sort of license requiring attribution at the very least.


Both the copy machine and the VCR were found to be legal because they had substantial non-infringing uses. As is, I don't see how Copilot does. It could, if trained on public domain or attribution-free code only; unfortunately there probably isn't enough code out there to train the model adequately under such rules.


Does copilot seem strange, or maybe the concept of intellectual property does?


Copilot isn't strange from a technical perspective.

The strange bit is how they are allowed to use other people's code to create derivative works (this is how I see it from my non-legal perspective anyway).

Even if it's legal (to the letter of the law, not the spirit) it leaves a sour taste.


Suppose Copilot was Composer and it generated personalized songs for you after being trained on Spotify's library. If you started performing the resulting song and it contained recognizable clips of others, I guarantee you'd have lawyers coming after you.

I don't see this as fundamentally different. It's unlikely that the Free Software Foundation is going to track you down for including some GNU code in your single-user repo. If you used their stuff in a popular commercial project and they got wind of it, you might expect to receive a cease and desist at best.


Copying/pasting code from open source projects it's considered fair use. Come on, who doesn't do that?

I mean, sure, you don't copy an entire file, but you tend to copy a snippet, or in the end you look at how it's done and do it the exact same way (which is the same as copying it!)

I would say there is not a problem in there.


If you are copy and pasting code from open source projects into your own project, then I think that is more likely to be considered copyright infringement than fair use. Fair use is generally for things like criticism, parody, teaching etc. Obviously this kind of thing would need to be judged on a case-by-case basis, but I think you are on shaky ground here.


Copilot is just a tool, legally it cannot "make code", you're the one making it.

See also: Napster, including how it was condemned for facilitating copyright infringement (what Microsoft is risking here, though the offense is likely to be much milder, of course).


"I'm guessing the prevailing theory (from GiitHub anyway) is that I'm legitimately allowed to do this."

No. Copilot is a technical preview. In the final release, if it reproduces code verbatim, it'll tell you and present the correct license.


Doesn't matter that it's a technical preview; people are using it now, GitHub has already used it internally. So if it infringes now, there is already code out there being used that does infringe.


GitHub appears to be tracking every snippet that they're generating during their trials:

https://docs.github.com/en/github/copilot/research-recitatio...

Are you doing that? If not, then I wouldn't use GitHub's use as justification to engage in copyright infringement.


Oh, I am not using Copilot. But other people not part of GitHub are. And those are still violations.


How will it find the “correct” license?

Will it check the LICENSE file? Simply having a LICENSE file is not a declaration that all the code in that repo is under that LICENSE.

What if specific lines/files are specified to be under different licenses?

What if the publisher of the repo is publishing it under an incorrect license in bad faith?

Will github be responsible if it tells me the wrong license?


Copilot isn't a retrieval model. It's a generative model. It learns coding techniques rather than retrieving snippets. Only 0.1% of the code it generates is regurgitated, and even that is usually pretty common code.


Calling it “public” code feels like doublespeak. It’s most definitely NOT public domain code — it only happens to be hosted on GitHub and browsable (but not copyable) by people. “Source available for viewing” is very different from “public property” as the phrase is commonly understood: https://en.m.wikipedia.org/wiki/Public_property


Not copyable by people, but we can go through the code, learn from it and then use that knowledge to improve our coding skills.

Isn't that what Copilot is doing here? The system is merely learning how to code, and then applying its learnings to other programming problems. It's not like it's writing software to specifically compete with other programs.


Not when it outputs large sections of unique code verbatim, as it's been shown to do.


If it's large sections, that can be fixed by either licence attribution or result filtering.

That's at best a technical issue. What way too many people claim, however, is that the machine isn't even allowed to look at GPL'ed code for some reason, while humans are.

I'd like to learn the reasoning behind that.


> What way too many people claim, however, is that the machine isn't even allowed to look at GPL'ed code for some reason, while humans are.

Why would those be the same thing? It's a matter of scale. Just like how people are allowed to read websites, but scraping is often disallowed.


> Just like how people are allowed to read websites, but scraping is often disallowed.

Hosting code on GitHub explicitly allows this type of usage (scraping) according to their TOS, so I have to ask again - why the sudden complaints?

Are we still talking about a shortcoming of the ML model, which very occasionally spits out a few lines of copied code, or should we include search engines in this, because they do the exact same thing by design?

robots.txt, for example, has a non-binding, purely advisory character as well and Common Crawl [0] (also used for training GPT-3) publishes a dataset that by definition contains GPL'ed code as well, no matter where it's hosted. So is that off-limits now, too?

[0] http://commoncrawl.org


I think result-filtering (based on license of search results) is gnarly enough, and likely computationally intensive, so as to break the whole feature. But it would be interesting to see if that can be crafted to fix the shortcomings of the ML model.


There's a really philosophical question here about whether Copilot is learning or imitating.

For instance, a parrot doesn't learn to speak, it learns to imitate speech.


The word they're actually referring to here is "source available", and trying to use "public" is just to confuse people into thinking they're referring to public domain only.


Maybe they mean code that is in public (versus private) repos? And then use the word to make it seem like it's stuff in the public domain?


Movies are “public” too. That does not mean you are allowed to use them for any purpose. The term “Public” does not have specific legal consequences in copyright law outside of something being “public domain” as you say.


The question is: are you allowed to train a neural network on movies (e.g. For an automated color grading algorithm) and then sell that as a service?


The correct analogy would not be a color grading service but instead a service that produces supplementary content for movies their subscribers make.


You are allowed to watch them. Many movies take ideas from other movies, which took ideas from myths and earlier stories. In fact, I find modern movies highly, highly derivative.


>it only happens to be hosted on GitHub and browsable (but not copyable) by people.

So would you say that it's publicly visible?


Publicly visible, yes. Publicly available, yes. Public code, no.


"Public code" is not a defined term. It's not short for "public domain code".


If you post your code to the public, I wouldn't be shocked if people copy it verbatim without regard to license. I'm not suggesting that is a proper thing to do, just accepting that it can happen when I post code.


I guess leaked copies of the NT kernel source on github are now "public" in the eyes of MS?


Interestingly, it is copyable... but only on GitHub! ("forkable")

Those are some nasty walled-garden terms... I wonder to what extent these kinds of ToS are actually legal?


Pressing that "fork" button might be illegal. It's certainly illegal to push after pressing it in many cases.


Public and public domain are not the same thing. This code is public in the same way that Google indexes publicly available information on the internet.


I think it’s pretty easy to defeat MS in court.

We just need to bring the music industry into this!

For example: Let’s train a network on Beatles music to generate new Beatles songs. I’m pretty sure music lawyers will find a way to prove that the trained network is violating the label’s copyright, as they always manage to do that.

And then we just need to use the precedent and argue that music is the same thing as code.


Potentially dumb question from a guy who isn't a lawyer:

Does Copilot infringe Google's patent(s) on the Transformer architecture? If so, then Google could potentially sue them for royalties, at least.

Further, couldn't this Copilot thing backfire for Github because customer trust is more valuable than AI training data right now? If folks don't feel they can trust Github, seems like they could move their work to other version control systems like Gitlab or Bitbucket...


Doesn't really matter, because if Google sued Microsoft, Microsoft would immediately hit back with a countersuit, since they would have little trouble finding something in their 90,000+ patent warchest that Google is infringing on. Software patents have become a matter of mutually-assured destruction for the big players. The only winning move is not to play.


> For example: Let’s train a network on Beatles music to generate new Beatles songs. I’m pretty sure music lawyers will find a way to prove that the trained network is violating the label’s copyright, as they always manage to do that.

The people making the machine that learned (and recites) Beatles songs aren't infringing though (most likely). It's those that use the machine to create and distribute the new works that are.

Same here. No one will be able to say that Copilot itself is a "derived work" or somehow uses the code in a way similar to a computer program (although such claims have already been made - I highly doubt that's the case). But those that produce a whole file full of GPL code verbatim (which will be rare, but WILL happen) are at risk of violating the license terms if they distribute it under the wrong license.


There is an absolutely enormous archive of fan-taped Grateful Dead shows out there, someone with much more time and money than me needs to train a network on that!


username checks out lol


Wouldn’t a more accurate metaphor be “let’s train a network on all music, to generate new music”, which includes Beatles, and may generate songs that contain the same chords as the Beatles used?


Yes, but it may also use the same chord progressions, lyrics, or melodies. Could even say it contains snippets of the actual recordings, depending on how you look at it.


Sure but then it’ll definitely be harder to prove it’s actual copyright infringement, especially when only a very small part of the song may have some snippets of the Beatles. Could it then, perhaps, be considered fair use?


Yes you'd have exactly the same kind of lawsuits and arguments that already exist today around fair use. It doesn't matter if a tool creates the new work or a person creates it (without tools?!!?) because ultimately it is a person who claims the new work as their own and distributes it, and that is the person who will get sued by the record companies if their work is too derivative. Establishing the line for "too derivative" in any particular case is a very lucrative field already I'm sure.


Or contain new chords that it synthesized from its knowledge of the ones the Beatles used.


In ancient Rome they didn't have a police force. What they had was essentially muscle for hire, mercenary bands paid by rich and powerful folks to do their bidding. As a regular person, the only thing that could have protected you from one of these groups was another such group.

Same today with the licensing system.


They had cohortes vigilum and cohortes urbanae in Ancient Rome. Why don’t they count as police?


Or why not find some leaked Windows/Office source code and try to train a model to reconstruct Microsoft software, then open source it? This surely must be legal, they're doing it themselves after all :D

(Maybe bring oracle into this :D)


It would be legal! But it wouldn't "reconstruct" Microsoft software. The way Copilot works is just that, a copilot. It's not the pilot. What you do with it is on you; it's just giving you some help along the way.


So long as the "copilot" is a black box that no one can inspect, how is it substantially different from me creating a website with a link to download a license-stripped version of Microsoft Office, except it only gives you a verbatim copy 1 in 10 times you try it?


There's an entire academic paper detailing exactly how it works. https://arxiv.org/abs/2107.03374


That already exists though? SongSmith and other similar tools are used by musicians a lot.


At what point is it not a derivative work?


afaik, chord progressions aren't copyrightable, and even some lyrical things aren't. Melodies are the main thing, I believe. (I could be wrong, this is just what I have been told in the past)


Note: that already exists, it's called Jukebox! https://openai.com/blog/jukebox/


Out of curiosity, how do we define license violation in that case? I, as a human being, have trained by reading code, much of which is covered by licenses that are somehow not compatible with code I'm writing. Am I violating licenses?

Asking seriously. It's really unclear to me where law and/or ethics put the boundaries. Also, I'd guess it's probably country dependent.


> I, as a human being, have trained by reading code, much of which is covered by licenses that are somehow not compatible with code I'm writing. Am I violating licenses?

As someone who has taught students in ICT, a quick rule of thumb was to pick a piece of text that I suspected, wrap it in double quotes, and put it into a search engine.

9 times out of 10 - possibly more - when I had that feeling, it was true. 17-year-olds don't write like seasoned reporters most of the time.
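(In code, that trick is just exact-phrase quoting plus URL encoding; the endpoint below is Google's ordinary search URL, everything else is a sketch.)

    # The double-quote trick as a one-liner: exact-phrase search.
    from urllib.parse import quote_plus

    def plagiarism_query(passage):
        return "https://www.google.com/search?q=" + quote_plus('"%s"' % passage)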

Obviously there needs to be some independent thought in there as well, but for teenagers I put the line at not copying verbatim, and at citing sources.

As we've seen demonstrated again and again, Copilot breaks both of my minimum-standard rules for teenagers: it copies verbatim and it doesn't cite sources.

I say that is pretty bad.

If the system had actually learned the structure and applied what it had learned to recreate the same it would be a whole different story.

But in this case it is obvious that the AI isn't writing the code - at least not all the time, it is instead choosing what to copy - verbatim.


> But in this case it is obvious that the AI isn't writing the code - at least not all the time, it is instead choosing what to copy - verbatim.

I still don't see any problem with that. If it's larger sections (e.g. entire NON-TRIVIAL function bodies), those can be filtered or correctly attributed after inference. So that's just a technicality.

Smaller snippets and trivial or mechanical implementations (generated code, API calls, API access patterns) aren't subject to any kind of protection anyway.

  int main(int argc, char* argv[]) {
Lines like that hold no intellectual value and can be found in GPL'ed code. It can be argued that that's a verbatim reproduction, yet it's not a violation of any kind in any reasonable context.

Where do you draw the line and how would you be able to - automatically even! - decide what does and does not represent a significant verbatim reproduction?
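For what it's worth, here is one naive way you could try to draw that line automatically (a sketch only - the thresholds and the stop-list are invented, and where the line legally sits is for lawyers, not code):

    # Naive "significance" heuristic: flag a reproduced snippet only if it
    # is long enough AND contains distinctive identifiers, so that
    # `int main(int argc, char* argv[]) {` never trips it while a whole
    # function body with unusual names does. Thresholds are invented.
    import re

    MIN_TOKENS = 25
    MIN_DISTINCT = 3
    BOILERPLATE = {"int", "main", "argc", "argv", "char", "void", "return",
                   "if", "else", "for", "while", "include", "define"}

    def is_significant(snippet):
        toks = re.findall(r"[A-Za-z_]\w*", snippet)
        distinctive = {t for t in toks if t.lower() not in BOILERPLATE}
        return len(toks) >= MIN_TOKENS and len(distinctive) >= MIN_DISTINCT

    print(is_significant("int main(int argc, char* argv[]) { return 0; }"))  # False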


what about lines such as

    Idxs[i] += (Imm >> ((i * HalfLaneElts) % 8)) & ((1 << HalfLaneElts) - 1);

    double r2 = fma(u*v, fma(v, fma(v, fma(v, ca_4, ca_3), ca_2), ca_1), -correction);

    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

    qint32 val = d + (((fromX << 8) + 0xff - lx) * dd >> 8);
even if it's one line, it likely took some non-negligible thinking time from the programmer


What about E = mc^2?

Mathematics and physics equations are not copyrightable.


but those aren't only mathematics. There's the choice of variable names, the order in which things are called (maybe to optimize performance on some CPU, we don't know), etc.


Your original argument is based on the false premise that the amount of time or effort matters -- it doesn't. Not all human activity can or should be subject to copyright -- this is the dangerous slippery slope of "intellectual property" -- and we are dangling by the edge these days.


>I still don't see any problem with that. If it's larger sections (e.g. entire NON-TRIVIAL function bodies), those can be filtered or correctly attributed after inference. So that's just a technicality.

Today copilot does what it does.

I've never heard Microsoft defend anyone running afoul of some of their licensing details with "they can fix it later, it is just a technicality".

I think this should go both ways? No?

> Smaller snippets and trivial or mechanical implementations (generated code, API calls, API access patterns) aren't subject to any kind of protection anyway.

  int main(int argc, char* argv[]) {
> Lines like that hold no intellectual value and can be found in GPL'ed code. It can be argued that that's a verbatim reproduction, yet it's not a violation of any kind in any reasonable context.

Totally agree. Edit: otherwise we'd all be in serious trouble.

> Where do you draw the line and how would you be able to - automatically even! - decide what does and does not represent a significant verbatim reproduction?

I am not a lawyer, but I guess many can agree that somewhere before copying functions verbatim - comments literally copied as well, for good measure - there is a line.

On the other hand: if there was significant evidence that the AI was doing creative work, not just (or partially just) copying, then I think I would say it was OK even if it arrived at that knowledge by reading copyrighted works.

Edit: how could we know if it was doing creative work? First, because it wouldn't be literally the same. Literal copying is literal copying regardless of whether it is done using a Xerox machine, paid writers, infinite monkeys on infinite typewriters, "AI" or actual strong AI.

After that it becomes a bit more fuzzy as more possibilities open up:

- for student works I look at how well adapted it is to the question at hand: a good answer from Stack Overflow, attributed properly and adapted to the coding style of the code base? Absolutely OK. Copying together a bunch of stuff from examples on the framework's website? Fine. Reading through all the docs and looking at how a number of high-profile projects have done it in their open-source solutions, updating the README.md with info on why this solution was chosen? Now you are looking at a top grade in my class.

(Of course, IBM will probably not want you to work on their compiler if you admit that you've studied OpenJDK's, or so I have heard.)


> Today copilot does what it does.

It's also not a commercially released product yet, but a technical preview, so uncovering and addressing issues like that is exactly what pre-release versions are for.

I'd say it succeeded greatly in sparking a discussion about these issues.


If I release a piece of software today that installs Microsoft products stripped of all attributions and without paying any license fees,

... will you defend it just because I claim it is a tech preview?


> ... will you defend it just because I claim it is a tech preview?

That's a straw man argument and you know it.

Code snippets are in no way shape or form comparable to entire software products and CoPilot neither installs anything nor is its intention to knowingly violate licences or copyright law.

Disingenuous straw manning like this doesn't help the discussion and only serves to distract from actual issues.


> That's a straw man argument and you know it.

It absolutely is not, in my opinion, and that particular idea did not cross my mind at all, so the claim that I knew it is doubly false.

But let me try to be constructive here and be even more precise:

Would it be OK if I launched a tech preview of my AI poem writer companion that would copy lines but also complete stanzas from famous poets, rock bands and singer-songwriters?


> Would it be OK if I launched a tech preview of my AI poem writer companion that would copy lines but also complete stanzas from famous poets, rock bands and singer-songwriters?

Yes it would be if it only happened ~0.1% of the time and if quoting verbatim wasn't the intended function of the system but merely a side-effect. In fact, that's what artists sometimes do deliberately.

It's what happens with other GANs as well and all that needs to happen is to educate users about the possibility of this. As long as you don't take ownership of the output produced by your AI (and neither do Microsoft), it's at the discretion of the user what they use the generated content for and in which context.

It has been demonstrated that training data can be extracted from any large NLP model [0] so this wouldn't come as a surprise either.

[0] https://arxiv.org/abs/2012.07805

https://towardsdatascience.com/openai-gpt-leaking-your-data-...


It's not AI, it is ML. GPT-3 is a very large ML model. It does not reason. It's a statistical machine.


ML is a subset of AI, in any definition that I've seen. And both are needlessly anthropomorphizing what are currently simple statistical or rule-based deduction engines.

GPT-3 is no more 'intelligent' in the human sense than it is 'learning' in the human sense.


By this logic there is no such thing as AI.


There's no such thing as AI.


Can you expand on this? Clearly the term exists. I have a degree in AI; do the concepts I learned at university not exist? What do you mean when you say AI does not exist?

Do you mean that the terms, algorithms, concepts, and applications found in the field labelled "Artificial Intelligence" should not be called as such?

I have a feeling you are simply playing a semantic game, though, in which case we are likely to talk past each other.

Edit: I suspect you may be conflating artificial general intelligence[0] with AI

[0]: https://en.wikipedia.org/wiki/Artificial_general_intelligenc...


> Out of curiosity, how do we define license violation in that case? I, as a human being, have trained by reading code, much of which is covered by licenses that are somehow not compatible with code I'm writing. Am I violating licenses?

That depends: if you end up writing copies of the code you've studied, then yes, you are on thin ice. Plagiarism is definitely something that you can do with computer code. There have been several high-profile cases around this in the arts. As far as I can see it usually ends up being a question of how much of the work is similar, how similar it is, and how unique the similar parts are. An added wrinkle in programming is that some things can be done in only one way, or at least any reasonable programmer will do them in only one way. So, for example, a swap(var1, var2) function can usually only be done in one way, and therefore you would not get in trouble if your swap function and someone else's are the same.

I've been following the discussion about Copilot, and one issue that comes up again and again is that people seem to think that since Copilot is new, the law will treat it, and the code it writes, differently than it would treat you or a copy machine. I think that is naive; my impression is that courts care more about what you did than how you did it, and if you think Copilot can be used to do an end run around the law, prepare to be disappointed.

So if Copilot memorizes code and spits out copies of that code, then it is at best skating on thin ice, or at worst committing a license violation. If the code it is copying is unique, then it definitely is heading into problematic territory. I'm fairly sure someone in legal at GitHub is very unhappy about the Quake fast inverse square root function.


My guess is that many people will use it on the backend where a copyright violation is hard to spot and even more difficult to prove.

As for frontend/open source etc... sure, if you don't care about copyright and licensing, use it.


> swap(var1, var2)

Well, there's also the xor way to be pedantic :)

   var1 = var1 ^ var2
   var2 = var2 ^ var1
   var1 = var1 ^ var2
But yeah, not too much wiggle room there.


Another variation (assuming no overflows):

    var1 += var2;
    var2 = var1 - var2;
    var1 -= var2;
And another:

    var1 ^= var2 ^= var1 ^= var2;
Assembly even has an instruction for it:

    xchg eax, ecx


The training question seems much more difficult.

The main problem that has been the topic is a simpler one - about the produced work. If you exactly reproduce someone's existing code (doesn't matter if you copy by flipping bits one by one or which technology you use), isn't it a copyright violation?

I'm kind of imagining a Rube Goldberg machine that spells out the quake invsqrt function in the sand, now...


Yes, if you play a video from Netflix while recording your screen, transcode that video to MPEG2 and use a red laser to write a complex encoding of that MPEG2 bitstream onto a plastic disk, then send that by mail to your friend, a court won't care about the complexity of that Rube Goldberg machine. They will just say it's a clear copyright violation since you distributed a Netflix movie by DVD.

With programming, there's the further complication of what constitutes a work. But Quake's invsqrt certainly qualifies, just like that one function from the Oracle vs Google case.


None of our laws were created under the assumption that computers would do so much of our jobs and affect so much of our lives. From robotic automation to social media to, now, computer programming. I think it's really a mistake to ask what the letter of the law currently means in the evolving context. Laws should serve us and need to be adapted.


Who is "us" that are being served?

I'm not the biggest fan of copyright law as currently written, but I wouldn't say that MS's desire to file off the serial numbers on every piece of public code for their own profit is a good impetus to rewrite the law.


> I, as a human being, have trained by reading code, much of which is covered by licenses that are somehow not compatible with code I'm writing. Am I violating licenses?

There are many good answers from the legal side. I would also attack this side: the way human beings learn is entirely different from the way ML models are trained. We don't do gradient descent to find the slope of data points and find the most likely next bit of code.
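To make the contrast concrete, the statistical view amounts to something like this toy bigram model (nothing remotely like Copilot's transformer in scale or architecture - just the "most likely next token" principle being described):

    # Toy "most likely next token" model: count which token follows which,
    # then greedily emit the most common continuation. Note that with a
    # tiny corpus it can only regurgitate its training data verbatim -
    # which is the memorization concern, writ small.
    from collections import Counter, defaultdict

    def train(corpus):
        toks = corpus.split()
        model = defaultdict(Counter)
        for prev, nxt in zip(toks, toks[1:]):
            model[prev][nxt] += 1
        return model

    def complete(model, start, length=8):
        out = [start]
        for _ in range(length):
            followers = model.get(out[-1])
            if not followers:
                break
            out.append(followers.most_common(1)[0][0])
        return " ".join(out)

    model = train("for i in range ( n ) : total += i")
    print(complete(model, "for"))  # -> "for i in range ( n ) : total"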

We humans create rational models of the code and of the world, and use deduction from those models to create code. This is extremely visible in the way we can explain the reason behind our code, and in the way we are aware of the difference between copying code we've seen before vs writing new code. It's also visible in that we can be told rules and produce code that obeys those rules that doesn't resemble any code ever written before.

The difference is also easily quantifiable: humans learn to program after seeing vastly fewer code examples than Co-pilot needed, and we are much better at it.

One day, we will design an AI that does learn more similarly to how humans learn, and that day your question will be far more interesting. But we are far from such problems.


I'm not sure this is actually true. We can explain code, but the fact that we can explain code is not necessarily related to the way we actually end up writing it. Have you ever written a function "on autopilot"? Your brain has selected what you wanted it to do, and now you're just typing without thought? I don't think we're as dissimilar to this model as we'd like.


The feeling of being "on autopilot" when doing a task has to do with your, let's call it, supervisory process being otherwise occupied. It doesn't suggest that the other mental processes which are responsible for figuring out the actions have changed their character or mode of operation.

"You" are just not paying attention to it in that moment.


The fact remains that, even on autopilot, I'm not writing code based on similarity with other code I've seen; I'm writing code to solve a task. In general, the code I'm writing is entirely novel - you could search all of the code ever written and you wouldn't find anything identical, or even similar much of the time. This isn't a brag - I work on fairly standard CRUD stuff most of the time - but just an observation about how human writing works, confirmed by code-scanning tools such as Black Duck.


If you were to write large swaths of copyrighted code from memory then yes you'd be committing a copyright violation.

Most humans don't do so unintentionally though.


I’m not so sure Copilot is doing so “unintentionally” either...


Just as an example, this is very widespread in music though.


If the whole 'Dark Horse' debacle proved anything it would be that that can still be considered a copyright infringement. Sure that particular example was (rightly IMHO) deemed to not be a copyright violation, but they still had to show their version was original enough, they couldn't just claim such copying wasn't ever an infringement.


I am not a lawyer but I am sure that any legal standard for ML has to be different than "isn't it just doing what humans do, but faster?"

GitHub scanning billions of code files to build commercial software is different than you learning at human pace, even if they're both "learning" and in the end they both produce commercial software.


> isn't it just doing what humans do, but faster?

The human activity most like training an ML system is memorizing a text by reciting from memory, checking against the original, adjusting, and repeating until there are acceptably few mistakes.

And if a human did so for thousands of texts then publicly repeated those texts, they would be violating copyright too.


It does not have to be different, but it certainly can be: a difference in quantity can become a difference in quality. People watching other people walk by, versus a camera - maybe with face detection - doing the same, differ not only in quantity but also in quality.


That is exactly what needs some careful consideration. As a start, two people can write the exact same code independently, therefore having identical code is not sufficient. On the other hand I can copy some code and slightly modify it, maybe only the spacing or maybe changing some variable names, and it could reasonably be a license violation, therefore having identical code is also not necessary.

Does the code even matter at all? If I start with a copy of some existing code, how much do I have to change it to no longer constitute a license violation? Can I ever reach this point or would the violation already be in the fact that I started with a copy no matter what happens later? Does intention matter? Can I unintentionally violate a license?

But I think we don't have to do all the work; I am pretty sure this has already been considered at length by philosophers and jurists.


The boundaries are not set in stone, and so the answer is the old theme of "it depends". To provide a slightly different situation which was discussed a few years ago: can you train an AI on pictures of human faces without getting permission? Human painters have created images of faces for a very long time, so is it any different in terms of law and/or ethics if an AI does it?

Yes, a bit? It depends. Using such things for advertisement would likely cause anger if people started to recognize images from the training set the AI was trained on.


My opinion would be that if the training set for the face generator was made up of photos whose creators had asked you to credit them if you re-used their work, then, yes, the generator is ethically in the wrong if it's skipping that attribution. Regardless of copyright. (And I feel the same way about Copilot.)


https://en.wikipedia.org/wiki/Clean_room_design

Sometimes? It's enough of an issue that companies explicitly avoid it by having two teams.


Clean room design is a technique to avoid the appearance of copyright infringement. If the courts were omniscient and could see into your mind that you didn't copy then there would be no need. Why this is relevant is because we can see into the mind of copilot. Whether what it does it considered infringement I think will come out in the details.

If the ML model essentially is just a very sophisticated search and helps you choose what to copy and helps you modify it to fit your code then it's 100% infringement. If it is actually writing code then maybe not.


Does this mean that all the illegally leaked code from the Apple, CDPR, Intel, NSA and Microsoft leaks is used in the models too? iBoot? Witcher 3? Gwent? NSA backdoors?

Does Copilot still learn from new repos? Can I post GitHub Enterprise code publicly to let it learn from it?

Serious answers only please


IANAL but the serious answer -- I think -- is that you always use things at your own risk, even purchased tools, and are protected only via indemnity agreements. If there is no indemnity agreement (as is the case here), you assume the risk.

That said, if enough people are bitten by this, I'm not sure what happens -- does anyone know of a relevant case? One somewhat relevant case that caused mass pain was the SCO Linux Dispute

https://en.wikipedia.org/wiki/SCO%E2%80%93Linux_disputes


If you're thinking about the liability waivers found in many licenses, contracts, EULAs and the like, those are often void, depending on the jurisdiction.

The official answer from Github that they take all input on purpose doesn't play in their favor.


I'm speaking specifically about the indemnity agreement that you get as part of a purchased license. It is the opposite of the liability waiver -- it is saying that the software publisher will take on responsibility in certain cases and with certain limits.

For example, if I purchase certain corporate Linux licenses, I'm protected against being sued if something in the distribution ends up having misappropriated code.

Check out the SCO Linux Dispute for how bad things can get for corporations: https://en.wikipedia.org/wiki/SCO%E2%80%93Linux_disputes


Having an indemnity clause doesn't mean that a company will automatically defend you or cover your expenses. You may have to sue them to enforce the contract and agree on the costs.

The answer to what will happen when companies are bitten, is that there will be series of lawsuits involving various parties (including GitHub), dragging on for a while and costing a fortune. The court will decide everything in the end (who's responsible for what, who cover the fees, who own the IP, etc...).

The SCO case was rather frivolous; I don't think there is much to take from it, except that if a US company is determined to sue and has a billion dollars to go on forever, there's nothing stopping them, and it's a lot of trouble.

Which is relevant, I suppose. It's only a matter of time until there's a major case putting GitHub Copilot in the spotlight, brought by an aggressive company with deep pockets (think Oracle and the like). We will certainly be reading about it everywhere the day it starts.


I assume that this is a yes to most of those ?

Of course using code generated by Copilot from those would still be illegal.

See also: Napster (and other p2p), the bitcoin blockchain allegedly containing illegal numbers...


So copyright doesn't apply unless copyright applies.


Has Microsoft just killed source code copyright? That would definitely be a win.


it would be a win for Microsoft, who don't distribute their source code

whereas for open source it's a disaster


There's nothing to stop the employees from distributing it at that point, and even with copyright it gets distributed anyway; it's just not allowed to be used for anything serious.


Which seems very much aligned with what Microsoft has been trying to do for decades now.


It would be quite impressive if this was a long-time planned "Embrace, extend, extinguish" move against Copyleft, with a casual acquisition of Github to make it work.

Finally, it beat the "cancer that attaches itself in an intellectual property sense to everything it touches" after all those years, with its own tools!

Now it's safe to touch.


Interesting idea, considering Microsoft's copyright dependence has reached an all-time low now that they move as much as they can into their SaaS and PaaS offerings. Nothing left to copy, except for employees, but you don't need copyright to bash their heads in, legally speaking.


But who says the code has to be available to anyone but Microsoft?

Remember that Amazon won off the back of open source. Now all the open source servers and databases are Amazon products.


Why would you think letting copilot scan the code would absolve you of liability for posting it?


I'm not asking about legality of posting the code, but reuse of this by the AI and users of the AI. "All public repositories" is a wide net full of surprises.


No, it's not trained on all public code as the title suggests; it's trained on all GitHub public code (so public repos hosted on GH), and none of the things you enumerate are hosted on GH.


Just found Intel leaks and Gwent on github without any effort. Intel has a few repositories in different formats, plain copy of .svn directory or converted to git. TF2/Portal leak is there as well. All but 2 I found were made by throwaway accounts.


Now, was the leaked NT kernel source ever published on GitHub?



I wonder if co-pilot will cough up stuff like these useful macros? Seems like a reasonable hack...

https://github.com/PubDom/Windows-Server-2003/blob/master/co...

  #ifdef _MAC
  # include <string.h>
  # pragma segment ClipBrd

  // On the Macintosh, the clipboard is always open.  We define a macro for
  // OpenClipboard that returns TRUE.  When this is used for error checking,
  // the compiler should optimize away any code that depends on testing this,
  // since it is a constant.
  # define OpenClipboard(x) TRUE

  // On the Macintosh, the clipboard is not closed.  To make all code behave
  // as if everything is OK, we define a macro for CloseClipboard that returns
  // TRUE.  When this is used for error checking, the compiler should optimize
  // away any code that depends on testing this, since it is a constant.
  # define CloseClipboard() TRUE

  #endif // _MAC
Just the kind of trick co-pilot should help us with?


There have been leaks of copyrighted code that were hosted on Github before they were taken down. There is also a lot of public code on Github without any license at all, which is not public domain but actually unlicensed for all purposes.


>it's trained on all GitHub public code (so public repos hosted on GH)

This is exactly what I meant.

>none of the things you enumerate are hosted on GH.

Plenty of them on GH, if not src then magnet links


GitHub's Copilot looks like a "code laundering" machine to me.


Developers have lost the plot here. The number of people browsing Stack Exchange and copying code is huge. The number of people who have read GPL'ed code to learn from (from the kernel to others) is huge. The number of people who learned from code they had to maintain -> huge.

This idea that a snippet of code is a work seems crazy to me. I thought we went through this with SCO already.


Stack Exchange code is explicitly permissively licensed.


Unfortunately it's wrongly licensed for its perceived use case - it'd be better if SO used MIT or BSD :/

https://stackoverflow.com/help/licensing

I'm guessing most uses of Stack Overflow snippets are violating the license (no attribution, no share-alike of the "remix" - which would probably be the entire program).


It is, but it has a GPL-style share-alike clause.

ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

The idea that programmers taking snippets from Stack Exchange or Copilot etc. thereby create a derivative work seems like total insanity.


There's a phrase I never wanted to hear. But that sounds exactly like what it is.


Why and how? I'm honestly interested in an answer here.

What exactly is the difference between a machine learning patterns and techniques from looking at code and people doing it?

Is every programmer who ever gazed at GPL'ed code guilty of plagiarism and licensing violations, because everything they write has to be considered derivative work now?


I can think of certain things here. As human beings we have limitations. We get tired of gazing at code, GPL'ed or not. GitHub's clusters don't. It puts fair use of copyrighted content into question. The next concern I have is what happens when Copilot produces certain code verbatim? I saw the other day on HN that it produced some Quake code verbatim. See https://news.ycombinator.com/item?id=27710287


> As human beings we have limitations.

That's a fair point. ML models don't seem to memorise all the code they've seen, either. Plus, while the argument of human limitations applies to the vast majority of people, what about those with eidetic memory?

> what happens when Copilot produces certain code verbatim?

There are several options: suppress the result, annotate it with a proper reference, or mark the snippet as GPL'ed.
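As a sketch of those three options (every name here is hypothetical; assume `lookup` maps a known-verbatim snippet to its source and license, or None for novel output):

    # Sketch of the three options above: suppress, annotate, or tag.
    COPYLEFT = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}

    def handle_suggestion(snippet, lookup):
        match = lookup(snippet)
        if match is None:
            return snippet  # novel output: pass it through
        url, license_id = match
        if license_id in COPYLEFT:
            return None  # suppress, or let the user explicitly opt in
        return snippet + "\n// source: %s (%s)" % (url, license_id)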

There are technical solutions to this question, but it's also important to ask to which degree this is necessary.

Is a search engine that returns code snippets regardless of license also a tool that needs to be discussed the same way? After all, code samples from StackOverflow or RosettaCode are copied on a regular basis and not every example provides a proper reference as to where it's been taken from.

So maybe a hint like "may contain results based on GPL'ed code" suffices? I don't know, but that's a question best deferred to software copyright law experts.


Guys, please read the Terms of Use of GitHub, section D.4.

We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.


If I upload somebody else's GPL code to GitHub, I also can't grant to GitHub the (implicit) legal rights to use that code in Copilot, because they are not mine to give.

I could previously mirror GPL code, because the GPL granted me the rights I need to grant GitHub as part of their ToS; but if they change their ToS, or if the meaning is changed by them adding vastly different features to their Service, this becomes a problem.


Can you explain what limitation in GPL would prevent someone from using it as training data? Also, if you are not allowed to upload GPL to GitHub, seems like the right answer is don't.


GPL does not prevent someone from using it as part of something else, so long as that other thing abides by the terms too. In particular, GPL and many open-source licenses require attribution. The fact that Copilot spits out code from other places without attribution clashes with that requirement.

Whether you're allowed to upload GPL code to GitHub or not depends on whatever their Service is at the moment, since the terms say you grant them all the rights "necessary to provide the Service".


If Copilot requires separate payment or signup I[1] fail to see how it can be part of ”the Service” as defined therein, and since the rights to do ”things” to the provided code only go as far ”as necessary to provide the Service” the ToS can’t[2] be used to argue that it gives explicit permission to use provided code for this purpose. Or am I misinterpreting something?

[1] I’m not a lawyer.

[2] Still not a lawyer.


Still depends on how they defined "the Service". Can't be bothered to read the full terms myself because I don't use GitHub, but I can't imagine "the Service" is defined as including an AI copy-paster.


> The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.

So it wouldn't include just any AI copy pasters. Only the ones that are provided by GitHub.


GitHub's T&Cs don't override licensing terms.


Like it or not, it seems like:

* most people here are unhappy

* most lawyers will say it's fine (it very probably passed MS's lawyers)

I can understand that. Copyright was not created with AI/ML in mind, even as a random stray thought. Those were not even words at the time.

So the question is: if we change the law and require trained algorithms to only work on licenses that permit this, and to output the "minimum common license" somehow, what are the repercussions on other applications of copyright?

Because the consensus here seems to be that this looks a lot like a de-licensor with extra steps.


Standard caveat that I'm not a lawyer by any stretch, but this seems settled by the existence of text-generation assistants trained on the full corpus of human writing ever digitized, much of which is also copyrighted or licensed in some way. That is clearly fine, as training text generation programs on existing text has been standard for decades. Selling a product based on GPT-3 is fine and the law has not come after anyone trying to do that.

The more questionable line is if someone happens to inadvertently reproduce entire paragraphs of Twilight: Breaking Dawn word-for-word using GPT-3 and then sells it, that might be a violation even if they didn't realize they were doing it.

Copilot is the same thing. Creating a product that makes suggestions that it learned from reading other people's work is fine. Now if you write code using Copilot and happen to reproduce some part of glibc down to the variable names, and don't release it under GPL, you might be in trouble. But Copilot won't be.


I don't know if even copying small pieces of code verbatim should mean anything.

Another example is the photo generation ML algorithms that exist. They generate photos of random "people" (imaginary AI-generated people) by using actual photos of real people. If one eye or nose is verbatim copied from the actual photo to the generated photo, is the entire output now illegal or plagiarism? One might argue it's just an eye, the rest of the picture is completely different, the original photographer doesn't need to grant permission for that use.

Any analogies we make with this, be it text generation, image generation, even video generation, seems like it falls under the same conclusion: so far we've thought all of this was perfectly fine. I don't see why code-generation is any different. A function is just a tiny part of a project. It's not necessarily more important than the composition of a photograph, or a phrase in a book. We as programmers assign meaning to it, we know it takes time to craft it and it might be unique, but likewise a novelist may have spent weeks on a specific 10 word phrase that was reproduced verbatim, in a text of 500 pages.

The more I look at this the more it seems copyright, and IP law in general, is the main problem. Copyleft and OS licenses wouldn't be needed if it wasn't for the aggressive nature of IP law. I don't see the need to defend far more strict interpretations of it because it has now touched our field.


There is nothing intelligent about this. What they did is a context-aware search, and they're trying to claim that's not what it is. If it were just used as a search engine, and people using the results followed the license of the original source, then it would be fine. There has been so much hype around machine learning that people likely have a false impression of what it is.


I've seen this claim that Copilot is "just a search engine" repeated in multiple places now. It's wrong, as anyone familiar with any of the GPT variants or other similar autoregressive language models can attest.

Copilot isn't a search engine any more than any other language model is. It can sometimes output data from the training set verbatim as most AI models do from time to time, but that is the exception not the rule.

Whether modern autoregressive language models can be called "intelligent" is debatable, but they're certainly far beyond what you'd get from a simple search engine.


First off, I said it was a context-aware search, which it is. It uses past training data to predict what you would type next based on the context, i.e. the code around it. It's no more intelligent than AlphaGo. Intelligent AI is generally taken to mean a general AI, which no one is even close to building yet.

Since neural networks are pattern matching based on the training input, the output is a derivative work of the training set. The very first description of autoregressive language models you'll find says as much: they use the training input plus context to predict what the next word will be.

Now here's where the fun begins if they try this in court. If you claim it's generating new work, then who owns the copyright? You may not realize how big a deal this is, but there was a court case you can look up where a monkey took a selfie, and the person whose camera the monkey used tried to claim copyright and lost.


I see a lot of people trying to compare its "machine learning" to human learning.

Let's use this thought experiment: Imagine that Github's Copilot was just a massive array of all the lines of code from every github project, with some (magical automated whatever) tagging and indexing on each function, and a search engine on top of that.

Now imagine that copilot simply finds the closest search result, and then when you press a button, it inserts the line from the array, and press it again and you get the next line, etc.
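
To make the thought experiment concrete, here is a toy version of that lookup (everything below is invented for illustration):

    import difflib

    # corpus: a list of (line_of_code, source_repo) pairs scraped from every repo.
    def next_line(context, corpus):
        lines = [line for line, _ in corpus]
        # Find the stored line closest to what the user just typed...
        match = difflib.get_close_matches(context, lines, n=1, cutoff=0.0)
        # ...and return it verbatim, with no attribution whatsoever.
        return match[0] if match else ""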

Now hopefully nobody here thinks such a system would fulfil either the spirit or the law of any half-restrictive license. Yet that is a perfectly valid implementation of Copilot's aim - and it sounds like it's not that far from what actually happens, maybe with a bit of variable name munging.

So my question is this: imagine a line between the system I describe above and human learning, where a human learns the patterns and can genuinely produce novel structures and patterns and even programming languages they have never seen before.

At what point along that line would you say that Copilot is close enough to human to not be violating licenses that require attribution?


I don't think it matters where Copilot is on that line. A skilled human programmer at the far end of that line, fully capable of producing novel programs that they haven't seen before, would still be violating copyright if they reproduced a program they have seen before.


I mean it answers the question pretty quickly if your agent isn't sophisticated enough to actually produce novel programs in the first place.


I'm a programmer and also studied law for some time. These stories make me - once more - realize the old adage: "Possession is nine tenths of the law." Don't host that code in the cloud (or a better term, someone else's dirty bucket). What happened to developers hosting stuff on their own website!?


GitHub's argument isn't that you hosted your code on GitHub and therefore gave them a license to use it to train their model. GitHub's argument is they don't need a license to train their model because it's fair use. Hosting your code somewhere else doesn't prevent fair use. If you don't want your code used to train ML models, don't host it anywhere.


I get it, but that's already a legal argument. I was trying to zoom out from the unavoidable legal argumentative deadlock: if GH does not have your code hosted on their servers, it becomes way harder for 'them' to grab it and exploit it. Your own domain is - of course - also out in the open, but at least you can have more control.


> What happened to developers hosting stuff on their own website!?

Devs were hoping for stars and network effects rather than listening to those of us who felt uncomfortable sending all traffic to GH. Something like Copilot, or even a coding bot, was predicted two years ago already.


It doesn't matter where the code is hosted, just that it is publicly accessible. If developers hosted code on their own sites, someone could still scrape them and use that to train models.

(The question of whether this is sufficiently transformative to count as fair use is still wide open)


> It doesn't matter where the code is hosted, just that it is publicly accessible. If developers hosted code on their own sites, someone could still scrape them and use that to train models.

I'd suggest it makes it more interesting. If it's self-hosted, then the hoster can choose to impose restrictions on server access, including no automated scraping, rather than trying to impose licensing on the code itself.


Anecdata: I’m a lawyer and programmer and my clients (large financial institutions) are increasingly insisting on hosting as much on-site as possible. It costs more, it can make it difficult to select vendors/service providers, and it’s not without business continuity risks which they take steps to mitigate.

But I think more and more companies, particularly those in highly regulated industries, are deciding that the benefit of controlling the data — access, security, privacy, and understanding who, exactly, it’s being shared with — outweighs the risks of someone else having that control.


So some professionals will have the chance to migrate systems back to on-premises, after having migrated them from on-premises to the cloud? Interesting.


This is why I have now moved my code off of GitHub.


ML novice question: is this atypical when training models? Wasn't GPT-3 trained on a lot of copyrighted data? My gut instinct, which is based on very low-information, is that it would be pretty hard to train models if you could only use open-licensed material.


It would be pretty concerning if people used GPT-3 while writing a novel and it assisted them in plagiarizing a Stephen King novel.

We already have examples of Copilot blatantly plagiarizing code.


Right, but that sounds like the bigger issue here is that the model might spit out copyrighted material, not just that it scrapes it. The former seems like a technology problem that Microsoft can solve.


The issue is that not only might the model spit out copyrighted material verbatim (which it is) but that it might also spit out non-obvious derivative works that will get you in legal hot water years down the road.


It is pretty concerning that copyright exists


Yes, it would stifle NLP research immensely, and we likely wouldn't see anything better than GPT-3 for years if such restrictions were put in place.


You're basically seeing how some people would have had open source play out. You can look at and use the code but not to make money or in any other way that I personally disapprove of. This is a world where open source would have ended up being pretty much irrelevant.


Are we not seeing now why people would want to do that? A multi-billion-dollar company using people's work to make more profit without paying them.

I definitely understand why people pick a license that disallows uses they don't agree with. Imagine baking cookies for your friends, and one of them reselling them. The material effect is the same to you - you gave away your cookies - but sometimes you make/do something for a certain group of people, and not for others to make a profit off your work.


People can do whatever they want with their work, including not sharing it at all.

But a great deal of the value that's come from open source generally has been that open source licenses haven't imposed the sort of usage-based restrictions (e.g. free for educational use only) that were fairly common in the PC world.

And, to your example, in the case of software the incremental copy that your friend sold cost you absolutely nothing. So it comes down to a purely emotional response to someone else making money off something you made.


>So it comes down to a purely emotional response to someone else making money off something you made.

Exactly, as I said, the material situation is the same. But we all are emotional beings, you would do certain things for your family you wouldn't for strangers. I don't think this case is any different.

I personally don't work for free for a company, but I do charity work for free. Working for a company in the time I work for a charity would "cost me absolutely nothing" if I already spend the time anyway, but everyone understands the difference.


There is a difference between a model that achieves "fair use" of copyrighted work and one that regurgitates copyrighted work without attribution.


You’re free to privately research with this data but commercializing other people’s work using ML is theft.

Edit: commercializing of the derived work is one explicit consideration used by US law in making a fair use determination. That said, even if it weren’t commercialized it may still be infringement and I believe it is.


Commercializing isn't really the issue, it's still copyright infringement even if you release it for free (i.e. piracy) -- it's unauthorized redistribution (i.e. copying).


Even if we accept that (which many wouldn't, as most licenses say little about research), the research would never be very useful if you can never make a comparable dataset to use in the real world.


I get that the problem is commercializing, but the theories around copyright that are being deployed here would prevent even free, open-source NLP research from becoming a reality.


I am not a lawyer but I do believe GPT-3 as a commercial product trained using copyrighted data constitutes infringement. I also think GPT-2 does not because it is for research purposes, which made it fair use.


Yes, training data is very valuable. Producing quality training data is an industry in itself. GitHub is trying to get it for free; it doesn't work that way.


I am not surprised given who the owner of GitHub is. Now, let's assume for a while that a private repo is left marked as public by mistake and Copilot regurgitates it... Lawyers are going to have fun with that one.


The worse scenario for GitHub is when a leak is published on GitHub. It's not like it hasn't happened before.

https://www.theverge.com/2018/2/8/16992626/apple-github-dmca...


There are actually tons of unlicensed and wrongly licensed code on GitHub right now that has been accidentally leaked by employees of many companies.


Who cares? Seriously? Copilot has ripped away the absurd charade around licensing and code.

It isn't any kind of copyright infringement. The AI is not copying and pasting code that it has found; it is rewriting the code from scratch on its own.

We keep trying to take old ways and meld them onto the internet, and it's just not appropriate and it doesn't work.


> not copying and pasting code

https://news.ycombinator.com/item?id=27710287


So playing devil's advocate. What if the courts just don't care, and rule that copying code verbatim is not a crime because you didn't copy it, and copilot is not a human so it can't commit crimes. What's the net effect of a system that draws upon all public code repos? It sounds... net beneficial to society?

On the plus side, a large body of work effectively becomes public domain. On the negative side, copyleft licenses lose their teeth. You probably see more power shift to those with big budgets. You probably see fewer things made source available, because you either have the public license or the private license now. This feels like a bad path but I'm not convinced the end result isn't better still.


>copilot is not a human so it can't commit crimes

I can set up my drone to detect me and attempt to crash into me. The AI would be quite poor; it would probably attempt to crash into any human. Would it be my fault if it didn't crash into me and someone else lost their eyes?

Can I set up a torrent box that automatically downloads and seeds all detected links from public trackers? Would I be responsible for it?


Both of these examples include you creating something and then using it. I don't know how copilot works, but using the second example, if you wrote a script to download and seed trackers, and someone else used it, I don't think you would be held under any liability, especially if you don't profit off of it.

Not a lawyer or even particularly well informed

edit: I am reminded of the monkey selfie, in which it was ruled that a non-human cannot create copyrightable works. https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...


Did copilot spring from the aether? Or was it built and trained on licensed code by github? Someone did something.


It's not a violation of copyright to train a model. There are three questions at play though:

1) Can you be liable for violating copyright if you have never seen the work?

2) Can a non-human be held accountable for violating copyright?

3) Can github be held liable for an end user using their tool to violate copyright?

https://en.wikipedia.org/wiki/Substantial_similarity

wikipedia states: Generally, copying cannot be proven without some evidence of access; however, in the seminal case on striking similarity, Arnstein v. Porter, the Second Circuit stated that even absent a finding of access, copying can be established when the similarities between two works are "so striking as to preclude the possibility that the plaintiff and defendant independently arrived at the same result."

This is a different situation, in which exact replication can reasonably occur without access to the original.

Secondly, can you actually claim Github has violated copyright if it doesn't have any claims to the work in question?

I think it's totally plausible that they win this in the long run.


1) So you are saying that if I get a disk duplication machine, I can freely copy and distribute Blu-ray discs as long as I don't watch the movie on the disc?

2,3) Seems pretty settled at this point, look at the cases around the VCR and copy machine. In general the one using the machine is liable. The creator of the machine can be held liable if there aren't substantial non infringing uses.


1) No. But you can freely distribute the disk duplication machine.

2) Someone using a copy machine is knowingly copying a specific work.


> It's not a violation of copyright to train a model.

Many people on HN assert this based on the Authors Guild vs. Google case, but it's quite important to keep in mind that that case was about Google creating a search algorithm, which is not generating "new" output.

We are talking about a very different kind of system here and in many other cases. Claiming the Authors Guild case sets precedent for these very different systems seems unfounded to me.


> It's not a violation of copyright to train a model.

This is a very bold assumption, one that I assume will not hold in the court of law in all cases. I think the nuanced question is: to train a model that does what, exactly.

Let's say distributing meth recipes is illegal[1], can one legally side-step that by training a model that spits out the meth recipe instead? No court will bother with the distinction, causation is well-trod ground.

1. As an example - not sure if its illegal. You may replace with classified nuclear weapon schematics if you like.


It's not illegal to train a model to spit out classified nuclear weapon schematics. Possessing the original data might be. Releasing software that does this might be illegal, but not for copyright reasons, which is the issue at hand.


It sounds like you're arguing that Github isn't liable for people using copyrighted code through Copilot.

I think most people are more concerned about whether the user of Copilot would be liable for using copyrighted code generated by Copilot.


Could be. But I could also see the courts ruling an individual can't be liable for copyright violations if they never accessed the original work, which is generally required.


The really nice thing is that this basically creates a library of industry methods and practices. It'd be really nice to be able to destroy patent trolls because what their patent "covers" is already a known and established industry method, or prior art.


Would that mean I can start sampling songs if they get fed through a neural network? It'll be fine if I train it on whatever is playing on the radio, right? Doing the same for poems?


I would expect the legal argument to get into the intentions of the user and their relationship to the tool. I would also expect perspectives of art and code to diverge.


OP’s rhetoric, and most discussion I see, asserts that training a model on copyrighted data is a copyright violation. Personally I don’t find this to be so obviously the case. Think back to when we were listening to AI generated pop music, for instance. I don’t recall any concern in HN comments about the copyright holders’ music being used for learning.


Did you miss the bit where Copilot reproduced a function exactly, including the comments? That's not some mashup or reinterpretation or inspiration; it meets the definition of plagiarism in universities and is just copying.


I didn't miss that; this still doesn't make the answer obvious to me. I'm pretty sure I've unknowingly replicated licensed code as well during my time as an engineer, and I've written way less code over my 8 years than Copilot has.


Then if you were discovered using it in a commercial project you can fairly be sued for it. Unless you're trying to argue that you should for some reason get an exemption?


Would I be found guilty if I could prove that I didn’t explicitly copy that code but rather just happened to write the same code by arriving at the same solution as the original one I had seen years before?


Nobody can answer this because it depends on the code and the resources of the entity suing you, but in general yes. This is why clean room design is a well-defined strategy: depending on the code and company, you would indeed not be allowed to work on the project because of the fact you'd seen a competitors solution previously.


You mean like https://www.theburnin.com/technology/artificial-intelligence... ?

If one of the three largest record labels uses their own catalog to train an AI, copyright seems less important to discuss. I suspect the discussion would be a bit different if a company scraped youtube and used that as a training set for AI music and successfully sold it.


> Think back to when we were listening to AI generated pop music, for instance. I don’t recall any concern in HN comments about the copyright holders’ music being used for learning.

Were those products sold to help people write commercial pop music faster? If not, I don't think your point is valid.


I'd be surprised if nobody brought up those 'what-if' scenarios at the time.


Curious what the consensus is on how GH should have approached this to avoid such blowback.

Best case scenario, they explained in advance on the GH blog they're going to be doing some work on ML and coding, and they'd like people to opt into their profile being read via a flag setting/or put a file in the repo that gives permission like robots.txt? Second best case scenario, same as first but opt out vs opt in, and least ideal would be something like not doing the first two, however, when they announced it, explained in detail how the model was trained and what was used, why, and when- kinda thing?
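
For the opt-in file, I'm imagining something robots.txt-like; the file name and directives below are pure invention on my part:

    # .ml-training (hypothetical, robots.txt-style)
    User-agent: copilot
    Training: disallow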

Is that generally about right, or..?


Code (co)created with Copilot has to follow all the licenses of the source (heh) code. This generally means, at the very least, automatically including in projects that get help from Copilot a copy of all the licenses involved, and attribution for all the people whose code Copilot has been trained on.

(Not sure about the cases where there is no license and therefore normal copyright applies, but AFAIK this isn't the case for any code on Github, which automatically gets an open source licence?

EDIT: Code in public repositories seems to be "forkable" on Github itself but not copyable (to elsewhere). That's some nasty walled garden stuff right there; I wonder how legal that ToS is? I could see how this could make them incentivize people to stop using other licenses on Github, to not have to deal with this license mess... EEE yet again?)


So I guess then, the first thing they should have done, is trained it to understand licenses, and used that as a first principle for how they built the system?


Is it a derivative work of GPL licensed work if it is trained on the license? Is the GPL license text under GPL?


> GNU GENERAL PUBLIC LICENSE

> Version 3, 29 June 2007

> Copyright © 2007 Free Software Foundation, Inc. <https://fsf.org/>

> Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.


Seems to be too much effort (is it even possible to link the source to the end result?), and might not be admissible, so just include a database with all of the relevant licenses and authors?


> Second best case scenario

Not really, consider for example repositories mirrored to Github.

It seems unclear who has the rights to grant this permission anyways (with free software licenses). Probably the copyright holder? Who that is might also be complicated.


In that hypothetical I wouldn’t think GitHub is responsible for determining if a repository is mirrored and what the implications of that are. They just need to look at what license is on the repo in GitHub.


Good point, I would have thought GH requires you to agree in some TOS that you have permission to put the code on GH (but I don't know)? If so, could that point be put aside? (I'm not a software engineer so sorry if that made no sense. Super curious about the whole codepilot thing from a business and community perspective)


> that you have permission to put the code on GH

This is the complicated bit: All open-source licenses grant you permission to redistribute the code (usually with stipulations like having to include the license), so you are almost always allowed to upload the code to Github.

What it doesn't mean however is that you're the copyright holder of that code, you're merely redistributing work that somebody else has ownership of.

So who gets to decide what Github is allowed to do with it?

I expect this will end up in courts and we won't get a definite answer before that.


If you'll entertain me on a hypothetical for a moment. Suppose then the copious amount of intelligent folks over at GH know this will eventually end up in the courts, and expected that from the start. Would you suggest they messaged/rolled it out any differently? Did they do exactly what they needed to do so that it did end up in the courts? Should they have done anything differently to not piss folks off so much? Sorry for the million questions, you seem to know/have thought a bit about this. Thanks! :)


They should have only used code from projects whose license allows commercial use, or made their model openly available and/or free to use.


How does attribution work then?


Wouldn't it be the people publishing code written with Copilot that (potentially) violate any licenses? It doesn't seem to be that the tool violates anything, though it may put the _user_ at risk of violating something.

Like, don't use it if you're worried about violating licenses, but I don't see how Microsoft could get in trouble for the tool. It doesn't write and publish code by itself.


Sorry, we built this tool for you that auto-violates licenses. Sure, we're owned by a huge megacorp with billions of dollars, but it's your responsibility to confirm - and yes, we recognize it's impossible to confirm - that what you release using our tool isn't violating the license.

In short, github gets to make the license violator bot and push the violations off onto the small fry who actually use it? No thanks.


Isn't that sort of the justification behind BitTorrent and trackers?


I see the difference as BitTorrent being an ignorant tool that just processes the data it receives. If you point BitTorrent at copyrighted data, it emits copyrighted data; the fault is on the users. Copilot was built with, and "contains", copyrighted data, which it can produce with non-copyrighted input.


Microsoft are violating the licenses already when they initially show you the generated code without attribution and ignoring other license restrictions. How you use it yourself is separate from that.


Unless that falls under their Terms of services license grant, which would bypass the code public license...


To the people arguing it's "fair use" to use this for training an ML network. Where do you draw the line? What if you train an "ML network" with one or two inputs... so that they almost always "generate" exact copies of the inputs? Five inputs..? Ten? A thousand? A million?


There obviously is no sharp line (say the line is 37 inputs; immediate question: why not 36?), but that does not matter at all.

We already have the same fuzzy line for writing. Am I forbidden from ever reading other author's books because I might accidentally "generate exact copies" of some of the sentences? Clearly not, that is how people learn a language. Does that mean I am allowed to copy the whole book? Also clearly not.

Where do you draw the line? Somewhere.


And somewhere is determined for your particular case in court. And tomorrow, a similar case may be determined differently.


Not really, no.


I can imagine a requirement of the sort 'generated code needs to match at most X% to snippets of the training data as shown over Y amount of sampling' but I am not sure if you can get a much better requirement than that.

Forbidding the training of AI on public code would definitely be a step too far though.

Edit: I'd also like it if they provided a tool for checking whether your code matches copyrighted code too closely, so you can confirm whether you are violating anything when you use Copilot.


The line is exactly the same line that's always been drawn in fair use cases.

There's absolutely nothing different whether the creator is ML or a human.

Generally, if you train an ML network to generate an almost exact copy of a thousand lines, it's obviously not fair use. If it's five simple lines, it obviously is fair use. If it's somewhere in between, there are a lot of different factors that need to be weighed in a fair use decision, which you can easily look up.


> Where do you draw the line?

My simplistic view is that the following are legally equivalent:

input -> ai network -> output

input -> huffman coding -> output

So, whilst:

* compressing and decompressing a copyrighted work is permissible; and

* outputs and weights are deterministic transformations of the inputs;

it follows that the outputs:

* are not eligible for copyright of their own (lacking creativity); and

* are derivative works of the inputs.


> output and weights are deterministic transformations of the inputs;

That may be true, but I fail to see how any process that produces the same content that was input into it somehow strips the license. If the generated code is novel, then there is no copyright issue and it is just the output of the tool. If the code is a copy but non-creative (for example, a trivial function), then it isn't covered by copyright in the source anyway, so the output is not protected by copyright either. However, if the output is a copy and creative, I don't think it matters how complicated your copying process was. What matters is that the code was copied, and you need to obey copyright.

Again, I don't think that novel code generated from being trained on copyrighted code is the problem. I think it is just the verbatim (or minimally transformed) copying that is the issue.


But at the same time, a compiler does a deterministic transformation of its inputs, and we still count its output as under copyright and license.

copyrighted input -> compiler -> copyrighted output


Perhaps I wasn't clear enough on this point: the copyright of a derivative work is distinct from (but not independent of) the copyright of the original work.

So portions of a derivative work are covered by the original copyright, and other portions may be under a distinct copyright as a derivative work, and several copyrights may apply to a work as a whole.

In the case of a Huffman transform, the transformed work does not meet the "creativity" requirements to be eligible for copyright, over that of the original works.


So...

Putting the (imho) big licensing problems aside, what about the software patents?

Apache and GPL have patent protection clauses.

Does this mean that anyone using Copilot might somehow get code that implements something patented, normally covered by those licenses' patent grants, except without having received proper permission through the Apache/GPL license?

...I kind of hate myself for saying this, but... Patent trolls to the rescue?


To be fair, this could just be a mistaken interpretation from the support staffer that answered the question - they didn't sound sure ("apparently"). It certainly needs an official response from GitHub senior management but I wouldn't call the foul yet (not that it's even clear that it is a foul).


At least to a first approximation this is irrelevant, because reading code is not subject to any license. What if a human reads some restrictively licensed code and years later uses some idea he noticed in that code, maybe no longer even being aware of where this idea comes from?

But what if the system memorizes entire functions? What if a human does so? What if you change all the variable names? What if you rearrange the control flow a bit? What if you just change the spacing? What if two humans write the exact same code independently? Is every for loop with i from 0 to n a license violation?

I am not picking any side, but the problem is certainly much more nuanced than either side of the argument wants to paint it.


I agree that it's nuanced and it's difficult to draw the line. But where Copilot sits is way over on the plagiarizing side of the spectrum. Wherever we agree to draw the line, Copilot should definitely fall on the wrong side of it.

Copilot will replicate entire functions, including comments, from licensed code.


> but where copilot sits is way over on the plagiarizing side of the spectrum

I think it is important to point out that not all Copilot output is on the plagiarizing side of the spectrum. However it does on occasion produce plagiarized code. And most importantly there is no indication when this occurs.


> What if a human reads some restrictively licensed code and years later uses some idea he noticed in that code, maybe even no longer being aware from where this idea comes?

In general, using the idea is fine, whether it is AI- or human-written. I think the major concern here is when the code is copied verbatim, or near verbatim (i.e. the produced code is not "transformative" of the original).

> But what if the system memorizes entire functions? What if a human does so?

In both of these cases I believe it would be a copyright concern. It is not strictly defined, and it depends on the complexity of the function. If you memorized (|a| a + 1), I doubt any court would call that copying a creative work. But if you memorized the Quake fast inverse square root, it is likely protected under copyright, even if you changed the variable names and formatting.

It seems clear to me that GitHub Copilot is capable of producing code that is copyrighted and needs to be used according to the copyright owner's license. Worse still, it doesn't appear capable of knowing when it is doing that, or what the source is.


The problem is that humans are limited in retention and rate of learning. An AI/ML is not, which makes (or should make) a difference.


Sure, it might certainly be the case that different rules should be applied to humans and machines, but this only makes the discussion even more nuanced. But I don't think this could reasonably be used to ban machines from ingesting code with certain licenses, even though it might restrict what they can do with this information.


Open source developers need a new kind of license with a ML model training clause, so there is no more ambiguity if they don't want their code to be used in this way.


People have been suggesting this ever since Copilot was announced, and it doesn't work on any level. They're using all code on GitHub, even repositories with no license at all, which you can't normally use for any purpose; the reasoning is that they see it as fair use, which supersedes any licenses and copyrights in the US.


They only claimed that training the model was fair use. What about its output? I argue that its output is still affected by the copyright of its inputs, the same way the output of a compiler is affected by the copyright of its inputs.


That doesn't work: your suggestion applies at too late a stage in the flowchart. It looks like:

1. Do you need a license to use materials for training, or to use the output model?

2. If so, does the code's license allow this?

GitHub is claiming 'no' for #1, that they do not need any sort of license to the training materials. This is reasonably standard in ML; it's also how GPT-3 etc were trained.

Now, whether a court will agree with their interpretation is an interesting question, but if they are correct then #2 doesn't come into play.


If the answer is 'no' for #1, then the GPL might as well not exist, because now we can just launder it through Copilot and close it off - a rather distorted interpretation of "fair use" if you ask me.

"Dear copilot, I'm writing a Unix-like operating system...."


I don't think that's right; I wrote a response above: https://news.ycombinator.com/item?id=27779155


I made new licenses [1] [2] that attempt this. The problem with adding a clause against ML training is that that is (supposedly) fair use. What my licenses do is concede that but claim that the output of those algorithms is still under the copyright and license.

I hope that even if it wouldn't work, it puts enough doubt in companies' minds that they wouldn't want to use a model trained by code under those licenses.

[1]: https://gavinhoward.com/2021/07/poisoning-github-copilot-and...

[2]: https://yzena.com/licenses/


Suppose you had some kind of AI Deepfake program operating off a large database of copyrighted photos and you asked it to "make a picture of a handsome man on a horse" and the man's head was an exact duplicate of George Clooney's head from a specific magazine cover, would that be infringement? Would selling the services of an AI that took copyrighted photos of celebrities and edited them into porn movies be infringement? I don't know the answers to those questions but I find it very weird that people think large blocks of typed text are less worthy of copyright protection than other forms of media.


That would potentially be an infringement of the copyright of the photographer but in any case it’s an infringement of the personality rights of George Clooney.

You aren’t allowed to sell someone’s likeness without their permission. You don’t need an AI for this if you create a portrait of Clooney and sell it or make any use that isn’t covered by fair use he can sue you.

Depending on the composition of the picture (for example, if Clooney is naked and, say, Putin is riding in the "bitch seat" of the saddle), you are also quite likely open to a libel suit as well.


Satire does not usually fall under libel/defamation, though, right?

>For example, in Hustler Magazine v. Falwell (1988), Chief Justice William H. Rehnquist, writing for a unanimous court, stated that a parody depicting the Reverend Jerry Falwell as a drunken, incestuous son could not be defamation since it was an obvious parody, not intended as a statement of fact. To find otherwise, the Court said, was to endanger First Amendment protection for every artist, political cartoonist, and comedian who used satire to criticize public figures.

https://www.mtsu.edu/first-amendment/article/1015/satire


Depends on the legal system in question and the intent and usage.

The US system isn't the only one on the planet, you know; the UK still has political cartoonists despite a very different definition of defamation, one the example above could fall under.


A direct confirmation from GitHub itself. This is problematic because Copilot sometimes outputs code that was present in its training set.

https://fossbytes.com/github-copilot-generating-functional-a...


This is the crux of Copilot. When I saw it copying RSA keys, I knew it was overtrained.

Most of the comments are waxing philosophical about the possibilities of Copilot copying GPL code.

The reality of this case is clear: it's copy-pasting thousands of characters of GPL code with no modifications. A copyright violation, clear as day.


So why did GitHub choose to exclude private repositories? Why not include everything, including the code for Windows?


In training on publicly accessible repositories, GitHub did something anybody could have done. If they also used private repositories, though, I would see that as abusing their position.


Additionally, if they had trained on private repositories then they risk leaking code, and accidentally making it public. Even if that was within fair use it would still be a violation of the trust people put in them.


The outrage-bait approach of this post detracts from the discussion. Yes, they trained it on everything. No, it's not clear whether that's legal (it probably is) or whether it's much of a problem.


Indeed; the question is if copyright should apply at all. Harping on about licenses, GPL, and whatnot is a detraction from the actual issue at hand.

Also, given that the author of this tweet called me a "bootlicker" last year in response to a somewhat lengthy nuanced post about GitHub, I'm gonna go out on a limb and say that they're not all that interested in a meaningful conversation on this in the first place but are rather on a quest to "prove" GitHub is evil.


The possibility of GPL violation does show (one of) enormous ramifications of the question though. I think it's not a detraction as long as the question itself is also mentioned.


There isn't any of this here though: it just operates on the assumption that the GPL applies.


It's not outrage bait. The thing reproduces GPL licensed code verbatim.


I'm talking about how it's presented. It starts with

>oh my gods. they literally have no shame about this.

Then continues with

>it's official, obeying copyright is only for the plebs and proles, rich people and big companies can do whatever they want

and

> GitHub, and by extension @Microsoft , knows that copyright is essentially worthless for individuals and small community projects. THAT is why they're all buddy-buddy with free software types; they never intended to respect our rights in the first place

At any rate, it's not even clear to me whether publishing code written with Copilot (or even with a random tool that wgets from GitHub) puts the blame on the toolmaker or on me. This post, however, doesn't attempt to look at that; it uses language that paints GH/MS as doing something illegal (and evil) that others wouldn't even get away with, and as not caring about it.


It seems that GitHub did make a legal consideration when choosing to include public projects but exclude private ones, since many big companies keep private projects for proprietary code bases. Users of public repositories are less likely to be able to fight GitHub on the issue.


Is that not true? Google and Oracle had a 10-year, multi-billion-dollar legal fight over ~20 lines of code identical between Android and the JVM.

A non-rich individual has basically zero chance of challenging GitHub on these blatant violations, and they know it.

> At any rate, it's not even clear to me if me publishing code written with copilot (or even with a random tool that will wget from github) puts the blame on the toolmaker or on me.

It really depends on the license, which GitHub apparently doesn't care about at all.


Just a reminder: reproducing GPL-licensed code verbatim is not illegal per se.

The legality lies on what the user does with the code.


But is that reproduced code "substantial"?

I'm sure there's a "for i in range(0, n):" somewhere in a GPL repo, and yet having that in my code doesn't make it GPL.


The somewhat frustrating solution is to simply realize that “copyright” is one of the worst abominations humanity has ever conceived…

It’s literally being used to stifle research as we speak, but for some completely insane reason we are protecting a handful of publishers as a cartel…

It really is so simple: “don’t be a bigoted fascist”, but just have a glance at the fascists decrying my stance as a degenerate liberal, in the answering comments


Public facing open-source code & media is going to be learned by language models because they're exposed to them. That's the simple truth. Nothing can stop that, not unless all public repos are made private. Everyone has access to the ability to create their own GPT, thanks to open-source. OpenAI is not actually very far ahead of open source anymore.

The US seems well enough informed. As mentioned in the following report "AI tools are diffusing broadly and rapidly" and "AI is the quintessential “dual use” technology—it can be used for civilian and military purposes.".

https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report...

I'm fully expecting that if I begin a story and put it on my blog or on github, and if I go away for a couple years, I'll see it completed for me when I return. I can use foresight to my advantage or I can pretend like it's still the 1990s as if placing some text at the top of the code I exposed publicly is going to prevent people from training on it.

One thing for sure though, I don't think a large company such as Microsoft should be profiting from training their language model on open-source code.

The best way to release Copilot, in my opinion, would be to make the entire thing open source and have separate models, even a private paid-for model so long as it's trained on their own code.

An open source model trained on code for specific licenses sounds fine, but then the model should also follow that same license as the code it was trained on.

There's just something deeply unsettling about having a computer complete your thoughts for you without being able to question how or why.


If a company built a tool like Copilot to help students write essays, is that considered plagiarism? Probably yes, and the reason is that regurgitating blobs of text without actually thinking like a human and writing them anew doesn't feel like actual work, just direct re-use.

Same thinking probably applies to GitHub Copilot and copyright


It’s already fairly commonplace for news agencies to generate articles using ML solutions such as https://ai-writer.com/

So by your logic ABC, CBS, Fox, and NBC have all been plagiarizing and violating copyright for doing so? I’m not sure if there’s been a legal challenge/precedent set in that case yet, but that seems like a more apples to apples comparison than the Google Books metaphor being used.

Disclosure: I work at GitHub but am not involved in CoPilot


The big question here is: On what data was the model trained? Presumably the news stations trained theirs on public-domain works and their own backlog of news articles, so even with manual copying there would be no infringement. In contrast, Copilot was trained on other people's code with active copyright.


That’s quite a big presumption IMO. Training sets need to be quite large in order to produce reasonable output. My understanding is that these companies provide the model themselves, which seems like it’d be trained on more than one company’s publications. But I get your point, and understand both sides of the argument here.

I think this will end up in a large class-action lawsuit for sure, though I really think it's a toss-up as to who would win it. This conversation was bound to happen eventually and we're in uncharted territory here.

I think it’s going to hinge on whether machine learning is considered equivalent in abstraction to human learning, which will be quite an interesting legal, technological, and philosophical precedent to set if it goes that way.


I mean, if it's considered "fair use" legally (which is surely their position), then why wouldn't they?

Why would they distinguish between licenses if there's no legal need to?

Licenses are only restrictions on top of fair use. Licenses can't restrict fair use.

It would be interesting if someone takes them to court and a judge definitively rules on fair use in this particular case. Or I don't know if there's enough precedent here that the case would never even make it to trial. But with a team of top-paid Microsoft lawyers that gave this the green light, I'm pretty sure they're quite confident of the legality of it.


My guess is that it is fair use, but...

The model is said to spit out code verbatim 0.1% of the time, a low number, but if Copilot is used a lot it means you are going to find a lot of copied code in people's projects, and those project owners may be breaching copyright. I don't think "but, Copilot..." will be an excuse.
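
To put that 0.1% in perspective: a team accepting on the order of 1,000 suggestions a day would, on average, end up with about one verbatim training-set snippet per day (1,000 × 0.001 = 1).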

Here is a (probably unrealistic) scenario illustrating it. I am playing a copyright troll here:

- Release plenty of generic code and put it on GitHub under a restrictive license

- Have the copilot bot scan it

- wait some time

- scan public codebases and do an exact match for my code

- sue project owner that contain my code

I see the use of Copilot more as a minefield for me than as a liability for Microsoft.


The solution here seems simple. If you don't want your code used for AI/ML like Copilot, then place a license in your code that explicitly forbids it. Looking at the MIT License as-is, which is used by many maintainers on GitHub, there is nothing that forbids Copilot. It's easy to add a few sentences that explicitly forbid the code being used by AI, ML, code generation or other code automation, and to call the result something like the Free For Human Use License (see the sketch below).
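
Something like the following, say (wording invented on the spot, and obviously not legal advice):

    The Software, in whole or in part, may not be used as training data
    for any machine learning model, nor as input to any automated code
    generation system, without separate written permission from the
    copyright holder.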

The sticky part may be: does GitHub T/C overrule these licenses?


Using copyright works to train an AI may qualify as fair use, meaning the terms of any copyright licence can be ignored, as argued in this blog post by reference to the Google Books litigation: https://juliareda.eu/2021/07/github-copilot-is-not-infringin...


This reminded me of those Facebook status updates that people used to post back in the day in which they forbid Facebook from using their data lmao


I can't wait for machine learning models that, given the right input, nearly perfectly reproduce feature-length movies or music. It's not copyright infringement, it was generated by a computer!


So would a way to do this be to train multiple models on each different code license (perhaps allowing compatible licenses to cohabit) and then have Copilot identify the license of the target project and use the appropriate model?

It might have an interesting feedback effect that some licenses which are more popular would presumably have better Copilot recommendations, which would produce better and thus more popular code for those licenses. Although maybe this happens already.


This is why I relicensed my code [1] yesterday to a license I wrote [2], which is designed to poison the well for machine learning.

[1]: https://gavinhoward.com/2021/07/poisoning-github-copilot-and...

[2]: https://yzena.com/yzena-network-license/


If it's allowed by fair use, your license is irrelevant. If it's not, then ordinary licenses already forbid it, and your special one adds nothing.


In my blog post, I talk about how training is fair use, but we don't know about distributing the output. These licenses, even if they don't work, are designed to poison the well by putting enough doubt into companies' minds that they would not want to use Copilot if it has been trained with my relicensed code.


Do the GitHub Terms of Service give them the necessary permissions for Copilot, independently of the license? (I honestly don't know the answer; this is a straight question.)


> The licenses you grant to us will end when you remove Your Content from our servers, unless other Users have forked it. [0]

I don't see how they can keep this clause, and then have a service that recites/redistributes code, based on a model that has already ingested said code.

> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program. [1]

Copilot is distributing verbatim code when it regurgitates, which seems a pretty clear violation of this clause. (If it weren't regurgitating, they'd have case law for fair use. But... it is.)

[0] https://docs.github.com/en/github/site-policy/github-terms-o...

[1] https://docs.github.com/en/github/site-policy/github-terms-o...


I don't know. Because I don't know is why I pulled all of my code (except for a permissively-licensed project that people actually depend on the GitHub link for) off of GitHub.


That's safe, but it's probably not necessary to be protected from what GitHub, OpenAI, and Microsoft are doing. When these licenses were crafted there was no reasonable expectation that companies could use ML applications as a loophole in existing copyright licenses, so just because there is no explicit clause denying it doesn't mean they are in the clear for using copyright-protected code that way. Licenses give permission, they don't revoke it.

Copyright is broad, licenses are minimal. This must be the case otherwise they would not be very effective at protecting the work of creators. There is no explicit allowance for what GitHub is doing in most licenses so they do not have general permission to do so.


I agree; my blog post says so.

What my licenses are supposed to do is sow even more doubt in companies' minds about models trained on my code.


I think to actually poison the well, we should add dead code to existing repos, clearly labelled as "the way that things shouldn't be done", that is wrong in subtle ways. So every time we fix a security issue, we keep the version with the bug, with some comments indicating what's wrong with it (see the contrived example below). Of course, this only works until the AI is trained to weigh the code based on how often it is called.
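
As a contrived illustration of the labelling idea:

    import os

    # WRONG -- do not copy. Kept only as a bad example: classic
    # check-then-use (TOCTOU) race; the file can change between the
    # exists() check and the open().
    def unsafe_read(path):
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
        return None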


The notion of intentionally polluting and over complicating your code base just to "poison the well" is bizarre. Talk about cutting off your nose to spite your face.

If you don't want others to use your code then the solution is very simple. Keep it on a secure private server and don't publicly release it.


Keeping it private is one option, but I really want my end users to have the freedom to modify the code for themselves.


That is a funny idea. Personally, too much work for me, and Copilot probably generates subtly wrong code already.


Since you allow new versions by default, can't someone just release a new version of your license allowing everything they want?


That is a good point, but easily fixed. Will do that now.

Edit: done. They are under the CC-BY-ND license now.


My question is what GitHub is going to do when people start sending them DMCA takedown notices over their code being distributed through this system.

Currently, if you claim to be a copyright owner, GitHub can respond to a DMCA takedown by removing the repository. This might require them to retrain the entire model.

One option for GitHub might be to maintain a blocklist of various code snippets, and if there is a substring match, just don't make the suggestion.
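
That post-filter could be very simple. A rough sketch, where every name and the blocklist entry are hypothetical, and whitespace is normalized so trivial reformatting can't dodge the match:

    import re

    def normalize(code: str) -> str:
        # Collapse whitespace so reformatting alone can't evade the filter.
        return re.sub(r"\s+", " ", code).strip()

    # Snippets named in DMCA takedowns; the entry here is a placeholder.
    BLOCKED = {normalize(s) for s in [
        "example snippet text from a takedown notice",
    ]}

    def filter_suggestion(suggestion: str) -> str | None:
        # Suppress a suggestion containing any blocked snippet verbatim,
        # instead of retraining the whole model.
        normalized = normalize(suggestion)
        if any(b in normalized for b in BLOCKED):
            return None
        return suggestion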


It's been admitted again. This contraption of GitHub's is causing real chaos in the open source world, and it has been trained on all public GitHub code; essentially, everyone who hosted their code there publicly gave them permission to train Copilot on it. Now they are complaining about it after all these problems [0].

I warned against hosting source code on GitHub and going all in on GitHub Actions, mainly because they have been unreliable for the past year [1] (they go down every month). Now Copilot has been trained on every single public repo on GitHub, as admitted right in this post, regardless of the copyright.

For organisations with serious projects, perhaps now's the time to leave GitHub and self-host somewhere else?

[0] https://news.ycombinator.com/item?id=27726088

[1] https://news.ycombinator.com/item?id=27366397


Well, if you write software under a free license, you can't really prevent someone from uploading the source on GitHub...


As I said in another thread, in my opinion there is no issue with them training on whatever public data they like.

And in the end, the output in itself is not really an issue either. It is just a machine outputting random lines it encountered on the internet.

The problem is on the user side: OK, you got random lines from random places. If you do nothing with them, no issue. But if you try to use, publish, or sell the code, then you are in deep shit. And somehow it's your fault.

For GitHub, the bigger problem is being sued by "customers" who assumed the generated code was safe to use when that is not the case.

And, as a general comment, I think this case is very illustrative of the general public's misconceptions about AI and machine learning:

Here you can see that you don't really have an intelligent system that learns and then creates something new and innovative from scratch. It is just a machine that copies code it has already seen, based on its similarity to your current code.


Regardless of how the (potentially very impactful) debate about licensing and copyright plays out, I think many here would agree this constitutes an "exploitation" of labor, at least in a mild sense.

Optimistically, Copilot could be a wake-up call for thinking more deeply about how the winnings of data-dependent technologies (ultimately dependent on the labor of people who do things like write open source code) are concentrated, or could be shared more broadly.

This longer blog post goes into more of a labor framing on the topic: https://www.psagroup.org/blogposts/101

(For the record, I certainly think Copilot could be very good for programmers in general and am not arguing against its existence; I'm just arguing that this is a high-profile case study, useful for thinking about data-dependent tech in general.)


There will be just a short transition period. In 10 years, AI will be writing most of code, and in 20 years - nearly all code. People will do only architecture/business analysis.

No more "exploitation" of labor.


"in 10 years, AI will drive most cars". See how that one panned out? Programmers are safe for still quite a while.


> I've reached out to @fsf and @EFF's legal teams regarding this. Please also reach out if you would be interested in participating in a class action.

I think she's barking up the wrong tree here. If she's looking for organizations interested in eliminating fair use, RIAA, MPA, and AAP are more likely allies.


GitHub Copilot is clearly fair use. A ruling saying it isn't would be a regression. Please don't.


I don't know how to approach this. As a human, I can read all public code regardless of license, learn from it, and come up with new solutions. A machine can read everything too, but can't create new ideas or approaches. How is Copilot defined, then? Should it be treated only as a smart system for general code snippets?


Well, you can read public code all you like, but you can't just take chunks of it and republish them under different licenses, as Copilot has been shown to do.


If you grab a chunk of licensed code and put it into a private repo, what prevents you from doing that? How much licensed code is scattered across private projects? I'm curious how these license violations are even detected.


Copyright law "prevents" you from doing that. To be more specific copyright law specifies that you must comply with the license of the copyright holder in cases such as the one you have described.

> How much of licensed code is scattered across private projects?

Whether or not copyright violations regularly occur is not (directly) relevant to whether or not it is illegal. People download copyrighted movies without licenses all the time and it still isn't legal.


I mean, if you text and drive while a police officer isn't around to see it, you still broke the law. Just because piracy is huge and largely unpunished doesn't mean copyright doesn't have to be respected in a huge, publicly visible, trying-to-be-above-board project.


So, when a human reads public code on the Internet (no matter the licence), gains knowledge, learns (updates the synaptic weights of the brain), and then makes (indirect) use of that gained knowledge in further work, how is this different from this case?


It's no different, but if a human reads copyrighted proprietary code and then reproduces part of it exactly, they have a good chance of getting into huge legal trouble.

On the other hand, the AI has no idea who the code belongs to, and it is able to reproduce it perfectly.


The difference is intent. When Github reads public code, their only intent is to profit from it. Depending on the license, that's a violation.


A human also often intends to make profit (by using the gained knowledge).


No, they intend to learn from it or find a solution to their problem. It's much harder to argue human intent in court than it is with GitHub blatantly doing so.


https://news.ycombinator.com/item?id=27771742

The above thread is a dupe of this discussion, but with interesting discussion already in place before it was marked as a dupe.


Open source is about love, sharing, helping out the fellow coder. Coderz of the past hated all this licensing and copyright BS. Your code, used to train this NN, is making the world a better place, I'd be content with that.


Nothing about further enriching Microsoft and continuing the network effects behind a closed-source "social network" is making the world a better place. Quite the opposite, really.


It’s not copyright violation to train ML on content. So the license doesn’t matter unless there’s some “can’t use this for ML training” license that I don’t know about (and doesn’t seem to be legal).


> It’s not copyright violation to train ML on content.

The training is not a copyright violation. That seems to be settled case law. Whether the verbatim copying as a result of that training is a copyright violation I think is less tested.

Let's flip the domains. Say we had an ML algorithm that could auto-generate news stories, and at some point (not all the time) it copied a Wall Street Journal article verbatim and posted it to a blog. Copyright violation?

With Copilot, we're sometimes seeing "paragraphs" of source lines copied verbatim, so this analogy is not such a stretch.

I think we need to think about how much our sharing culture in programming has tinted our view of the legality of this enterprise.


>It’s not copyright violation to train ML on content.

I agree. It'd be a nice gesture to reach out to the creators of the training data, like is usual with web scrapers. But collecting and analyzing data publicly available on the web is ok.

>So the license doesn’t matter unless there’s some “can’t use this for ML training” license that I don’t know about (and doesn’t seem to be legal).

I disagree. While Copilot is, at heart, an ML model, the copyright trouble comes from its usage. It consumes copyrighted code (OK), analyzes copyrighted code (still OK), and then produces code which is sometimes a copy of copyrighted code (not OK). The only way it'd be OK is if Copilot followed all licensing requirements when it produced copies of other works.

Personally, I won't touch it for work until either Copilot abides by the licenses or there's robust case law.


> It'd be a nice gesture to reach out to the creators of the training data, like is usual with web scrapers.

I don't think this is practical. And who notifies people of scraping content? I would've been annoyed if I got spam from sites that scraped my content.


I've contacted websites about scraping when it'd be a repeat thing and they didn't have a robots.txt file available. Also if their stance on enforcing copyright was hazy (e.g. medical coding created by a non-profit). Sometimes, they pointed me toward an API I didn't know about.

>I don’t think this is practical.

I don't like people ignoring things just because they're impractical for ML. That leads to crap like automated account banning without the possibility of talking to a living customer service representative.


All I know is this Copilot opens a whole can of worms, and it may never have a right answer until a court settles it.

Obviously most lawyers (I think) seem to be siding with Microsoft on fair use. But most owners of the code seem to think they are infringing on their work.

Then there is the international issue, because one court can't decide for everyone else.

I think the issue is important enough that I wonder if we could somehow crowdfund a court trial or something.


I just don't get this. It's AI, not a search engine; unless we deliberately bait it, it won't spit out verbatim code snippets. I'd also use all public code on GitHub if I wanted to train a similar tool. Furthermore, tabnine has been doing the same for years, and not a single dramatic statement about it.

This simply feels like anti-Microsoft people flocking to what they see as some exposed Microsoft flesh for a social media bite.


The way I see it, MS has probably put as much financial investment into researching the legalities of releasing this product, with its highly paid legal army, as into developing the product itself. Expecting a multi-billion-dollar company not to do its due diligence seems naive.

Maybe this could spark a discussion about changing the current rules that allow them to do this, but questioning the current legality seems to me a waste of time.


It's a gamble. Worst case they have to reduce the quality by removing GPL code from the training data. And/or pay off a few lawsuits, which is routine stuff for them. Cost of doing business.


Besides, as a programmer you should not excuse yourself with "IANAL" or otherwise defer all judgment to lawyers. Lawyers are just that: lawyers. They don't hold the truth either. One lawyer says this, another lawyer says that. F*k 'em. If anything, say "IANAJ" (I Am Not A Judge). Trias politica, you gotta love it.


How do people think developers learn?! Many probably recite copyrighted code almost verbatim on the reg. Storm in a teacup.


Not a GitHub user (*lab), also not a lawyer, so please excuse my ignorance.

As this boils down to legal arguments, are there any clauses (maybe disputed) in the ToS allowing GitHub/MS to use public repos for such a purpose?

Would it even be legally possible for a repo provider to override a software license with something like "by using this service, you agree to..."?


Of all the concerns over training large AI models, incidental copyright infringement doesn't seem that important.


Does GPT-3 have to attribute mankind for reading all of the internet?

What about deep-learning artwork trained on Google searches?

We enter a new era…


Could this be the beginning of the true test of open source licenses? My understanding is that there has never been a court ruling setting precedent on the validity or scope of any open source license. I can see a class action suit coming on behalf of all authors of GPL-licensed code.


What? There have been plenty of GPL cases defended in court.

https://en.m.wikipedia.org/wiki/Open_source_license_litigati...


All of the copyright cases were settled, so no precedent was set. Open source as a contract has been ruled legal, and licensors can sue for breach of contract, which is not the same as copyright infringement.

I think my point still stands.


GitHub used code that wasn't under any license at all, just publicly visible. Their claim is not that the license allows what they're doing, but that they do not need a license.


Which is a different issue to my point, but still very valid. What terms are implied if no license is specified? I would argue attribution should be expected if the code is used, but I also wouldn't go near any code without a specific license attached, as there's no express permission given; just because a license isn't disclosed doesn't mean it isn't there.

You can't go copying anything and everything just because nobody has told you that you can't. And I feel that's part of the purpose behind the GPL: force a license onto derivative code so that at least there are clear rights moving forward.


It's stronger than that: if GitHub is correct that they don't need a license then they are allowed to train on publicly visible code even if it is labeled with "no one has any provision to use this for anything at all, especially training models"


Which is why I think this could be a big turning point. IMO, GitHub is breaking licenses. If an ML algorithm ingests a virally-licensed block of code, its outputs should be tainted with that license, as a derived work. Otherwise I can make a program reproduce whole repositories license-free, so long as I can claim "well, the AI did it, not me!" It has produced something based on the original work, therefore it should follow the license of the original. And that issue is exacerbated by the mixture of licenses involved: they will all apply at the same time, and not all are compatible.

I would hope GitHub (and Microsoft) did the legal work to cover this, and not just ploughed ahead with the plan to drown any legal challenges. From my perspective, they're doing the latter.


This isn't as clear as most things we work on as engineers, but there's a spectrum:

* An algorithm (or person) ingesting lots of code and then later spitting out that same input, does not free anyone from the copyrights of the input.

* An algorithm (or person) that ingests lots of code, finds commonalities, synthesizes that into something new, and produces something well beyond mere copying is producing something new, likely without any legal tie to the original.

Right now, it looks like most of what Copilot does is closer to the latter, but sometimes it does things that are closer to the former. I can't see any reason why they wouldn't be able to fix it to avoid regurgitating its input, however, with something like a Bloom filter, so I expect that long-term there's a way to do this that falls entirely within fair use.
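
For the curious, the Bloom-filter idea might look roughly like this: index the n-grams of the training code once, then flag any suggestion whose n-grams mostly hit the filter. Every size and threshold below is invented for illustration:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 4):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: str):
            # Derive k bit positions from salted SHA-256 digests.
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, item: str) -> None:
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    def ngrams(code: str, n: int = 8):
        toks = code.split()
        return (" ".join(toks[i:i + n]) for i in range(max(0, len(toks) - n + 1)))

    # Training time: for each file in the corpus, bf.add() every n-gram.
    def looks_regurgitated(bf: BloomFilter, suggestion: str,
                           threshold: float = 0.9) -> bool:
        grams = list(ngrams(suggestion))
        if not grams:
            return False
        hits = sum(g in bf for g in grams)
        return hits / len(grams) >= threshold  # mostly seen: likely verbatim

A false positive only suppresses the occasional original suggestion, which is the safe direction to err in.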


When I first read about the new Copilot tool, I immediately thought it would just be a matter of time before some group started poisoning the AI. Garbage in, garbage out, right?

So now we know it's ALL public repos ... how long until the opponents of this tool have a giant repo full of syntactically correct code that employs terrible design patterns and is thoroughly obfuscated? I'm not going to waste my time on this personally, but there are certainly those who will. Someone will invent a tool that perverts perfectly good code in the process, and probably have a good laugh.

Personally, while I recognize some people might find it useful, I don't much care for it. No, I haven't tried it yet either. I've never sampled escargot either, and I know I don't care for it all the same. Maybe it's wonderful, I'll never know; I simply don't like the idea of it. Call it an objection on general principle if you like.

So remember, kids: if you're not PAYING, then you are the product.

Bottom line: private repos are cheap and you should use them rather than the freebie public stuff.


The answer is simple: GitHub needs to make a tool that can scan your code to see whether it contains code copied from public repos. It's what universities around the world do for students' work, and the usual fingerprinting approach is sketched below.
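
Something like the k-gram fingerprinting behind academic plagiarism detectors (MOSS and friends) would be a natural fit. A toy sketch with made-up parameters, omitting the winnowing step real tools use:

    import hashlib

    def fingerprints(code: str, k: int = 5):
        # Hash every k-gram of tokens. Real tools also "winnow" these down
        # to a representative subset; omitted here for brevity.
        toks = code.split()
        for i in range(max(0, len(toks) - k + 1)):
            gram = " ".join(toks[i:i + k])
            yield hashlib.md5(gram.encode()).hexdigest()

    def overlap(candidate: str, public_index: set) -> float:
        # Fraction of the candidate's fingerprints found in the public corpus.
        fps = list(fingerprints(candidate))
        if not fps:
            return 0.0
        return sum(fp in public_index for fp in fps) / len(fps)

    # The index would be built once over the public corpus:
    #   public_index = {fp for text in corpus for fp in fingerprints(text)}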

Of course, there's a huge irony in that GitHub is also making the tool that enables the widespread plagiarism...


Please don't get mad at me, but my question is genuinely: so what? Why does it matter? Can't you already violate licenses in a tedious manner just by Googling and copy-pasting blindly? Genuinely looking to understand the consensus here.


Bit confused. If I have code on GitHub under the most restrictive licence possible (no commercial reuse, no derived works), then how did GitHub's legal team get comfortable with this approach? What am I missing?


By using GitHub you have acceded to their terms of use[1]:

> Short version: You own content you create, but you allow us certain rights to it, so that we can display and share the content you post. You still have control over your content, and responsibility for it, and the rights you grant us are limited to those we need to provide the service. We have the right to remove content or close Accounts if we need to.

[1] https://docs.github.com/en/github/site-policy/github-terms-o...


You uploaded your code to their service and agreed to their TOS.


There's an assumption that public repos can be read by both humans and machines, which hasn't been questioned legally.


But the repos are provided under licence terms, no? Which can vary depending on the publisher's choice. Put another way, is there a licence that would prohibit reuse in this manner?


You can likely write or find one, but if you don't want your code seen, perhaps it'd be simpler to use a private repo.


Microsoft has spent a lot of money and energy in earning developers' trust over the last 15 years.

They have done an excellent job and succeeded in their goal.

Now, with copilot they are about to lose it all.



Ugh, the headline... Most interesting part got truncated

> regardless of license


What about the other half of the law: if your Copilot-generated code takes from public sources but produces something that is patented, can you be sued by a patent troll? (Yes.)


I'm not trying to lessen the implications of something like this, but didn't we all agree to them being able to do this when we accepted their TOS?


Let's assume that is enforceable through the TOS (which I doubt); would that make hosting GPL'd code on GitHub a violation of the GPL? If programmer X releases GPL'd code on his website and programmer Y copies it to GitHub, then it could presumably be considered a bypass of the copyright.


I don't believe ToS are ever legally binding.


I'm really hoping some big corp, whose codebase is source-available on GitHub but still under copyright, takes the piss out of them for this.


What is Microsoft's long-term goal here? Why did they release a hugely controversial feature that is not making them any money? (Or is it?)


"All your code are belong to us" ... :(


There are a lot of posts here debating whether licenses still apply when Copilot generates verbatim code. The answer is yes.

Copilot is currently a technical preview. Github has already said they intend to detect verbatim code and notify the user and present the correct license. That'll be in the final release.
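
If they do ship that, I'd guess the shape of it is something like the following; this is speculation, not GitHub's actual design, and the single index entry is just the well-known Quake III example:

    from dataclasses import dataclass

    @dataclass
    class Attribution:
        repo: str
        license: str

    # Toy index of known verbatim snippets -> their origin (one example entry).
    KNOWN_CODE = {
        "float Q_rsqrt( float number )": Attribution(
            "github.com/id-Software/Quake-III-Arena", "GPL-2.0"
        ),
    }

    def annotate(suggestion: str):
        # Return the suggestion plus its origin if it matches known public
        # code, so the editor can warn the user and display the license.
        for snippet, attr in KNOWN_CODE.items():
            if snippet in suggestion:
                return suggestion, attr
        return suggestion, None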

Don't use the technical preview for anything more than demoing a cool concept. It's not ready yet, because it will reproduce licensed code without telling you.


> licenses still apply when Copilot generates verbatim code. The answer is yes.

Please provide a source for this.


Wouldn't this question already have been asked and answered when AIs were trained on books and articles?


As far as I know, there isn't a formal copyright-related US court ruling (yet, anyway) on training ML/AIs on any media (except for cases about copying the code of the ML system itself). So everything is actually on thin ice, much like the infamous "GIFs [made from snippets of shows, etc.] are widely believed to be fair use", which in reality is still untested. Let's not forget other countries, with much stricter copyright rules (especially moral rights).


I treat Copilot as literally a programmer in pair programming. Which means that if it's trained on GPL code, i.e., it has "seen" GPL code, then it's tainted, and we should treat the resulting code as GPL code.

Replace "GPL" with the most restrictive license that's on GitHub, but you get the point.

They're kinda shooting themselves in the foot, because this reduces the commercial potential of the tool to almost nothing.


Of all the potential issues with training large AI models, incidental copyright infringement seems pretty mild.


Road to hell paved with good intentions.


One question I haven't really seen talked about is: when you get a suggestion through Copilot and save the document, who is the author of the document?

I think this may be the crux of this whole kerfuffle.

If you're the author isn't it on you if you infringe?

If not then perhaps you and GitHub/Microsoft share authorship/culpability?

Who has the copyright to a piece of text generated by a tool? Or art generated by a model?


It seems that now is finally the time I must apologize for some of the Java code I put up on GitHub.


A lot of hate for a cool piece of tech. Can’t we just be happy this tool exists?


I've figured out why ML based fair use arguments for generative models feel dirty to me.

Imagine a scenario where you'd love to have access to a large number of my digital widgets, but they're expensive to make or buy, and a large number of them is really expensive. So instead of buying them, you train an ML model on my widgets. Training is still expensive, but that's a one-time cost. Spend $5M training GPT-3, it's fine. Now you can sample from the space of my digital widgets. You have gotten a large number of widgets just by throwing money at AWS. With money, you have converted my widgets into your widgets, and I'll never see a cent of it.

That's the issue. Content is expensive and it's still needed. Traditionally, I make content and if you want to benefit from my labor, you pay me. In the future, if you want to benefit from my labor, you pay AWS instead.

tl;dr The most significant equation for generative models is "$$$ + my stuff = your stuff"


In addition, the model is going to spit out widgets that are combinations of the existing ones, if it doesn't outright copy. This is different from a human who is going to put their own creativity into it (and will be accused of plagiarism if they don't): the model has no creativity to offer on top of the unlicensed input.


I hope they don't shut down the project amid all the legal nightmares.


Hmm, no. I'll be (finally) moving to GitLab or similar.


The paper clip maximizers have already taken over :(


Most likely they are using the private stuff too.


I honestly don’t see that as a problem…

A human learns by looking at all public code

A robot learns by looking at all public code

(Okay, I have some reservations about the above comment, but for discussion's sake, that's what I'm going with.)


Microsoft LicenseLaunderer.


my code's in there? im so sorry everyone


If you learn a word or phrase from a copyrighted public broadcast, does that mean you cannot speak it to others?


Really hoping to see a mass exodus from GitHub after this. Microsoft is back to their old tactics, like we all knew they would be.


If you have public repos anywhere, people can train on them just as much.


That's also my general sentiment. I assume anyone can do virtually anything with my public repos with little recourse from me. I wouldn't even know if they are indeed breaking my license agreements. Doesn't really help the situation though.


GitHub only recently allowed non-paid private repos; previously these were reserved for paid plans. Also, GitHub has a dedicated place for license files; GitHub actually believes these license files mean something, and states that they must be included with the repo so they are downloaded with it. Just because you can teach a script to ignore a LICENSE file doesn't mean the license no longer applies. That is like saying that because you can teach a robot to ignore restricted airspace, it is allowed to fly around an airport.


Any suggestions for an alternative? One thing I like about GitHub is that it 'seems' to be the de facto standard for portfolios and public works. It also has excellent integration with AWS and the like.


GitLab is the best alternative feature-wise. https://sourcehut.org/ is great too, if you are into that kind of thing.


GitLab is a fairly good one. Lots of people self-host their own GitLab/Gitea instance too.


SourceHut or Codeberg


The licensing game is really awful, IMO. It should be that releasing your code on GitHub = fair game. Licenses are seriously hindering development. You either take part in open source or you don't. I get anxious every time someone asks me to add a license to one of my projects, because I don't know which license to use and wonder if it'll prevent some people from using the software down the line. Once I tried writing my own license that basically said: I don't care, do whatever. Yet someone complained about "yet another license".


Yeah, no. Licensing is really awful, yes.

"You either take part in open source or you don’t." I disagree. You can allow your software to be used and post the source code, but it is yours so you get some say in your intentions. Forking is what you're looking for. However, once you fork it, you still owe credit to those that did the heavy lifting before making whatever tweak it is you made and want to call it your own. There's nothing wrong with the original developers getting credit for the work they did. There's nothing wrong with the original devs willing to let other people use their work as long as it is used in the same spirit it was provided (FOSS). That also does not mean the original devs are wrong for wanting evilCorps that want to use their freesoftware to be included/distributed in their packages they sell and profit from to be restrictive.



