All public GitHub code was used in training Copilot

fleddr · on July 8, 2021

To me, the particular use case and whether it is fair use or not, is of minor interest. A far more pressing matter is at hand: AI centralization and monopolization.

Take Google as an example, running Google Photos for free for several years. And now that this has sucked in a trillion photos, the AI job is done, and they likely have the best image recognition AI in existence.

Which is of course still peanuts compared to training a super AI on the entire web.

My point here is that only companies the size of Google and Microsoft have the resources to do this type of planetary scale AI. They can afford the super expensive AI engineers, have the computing power and own the data or will forcefully get access to it. We will even freely give it to them.

Any "lesser" AI produced from smaller companies trying to compete are obsolete, and the better one accelerates away. There is no second-best in AI, only winners.

If we predict that ultimately AI will change virtually every aspect of society, these companies will become omnipresent, "everything companies". God companies.

As per usual, it will be packaged as an extra convenience for you. And you will embrace it and actively help realize this scenario.

mikewarot · on July 8, 2021

I have about 300,000 photos that haven't been scanned by AI (unless someone at Backblaze did it without permission). I'm sure there are lots of other photographers out there who miss Picassa, which Google killed off to push everyone's data to their service. (It did really well in matching faces, even across age, but the last version has a bug when there are multiple faces in a picture, sometimes it swaps the labels)

If there were offline image recognition we could train on our own data privately, could the results of those trainings be merged to come up with better recognition on average than any one person could do themselves with their own photos?

In other words, would it be possible for us to share the results of training, and build better models, without sharing the photos themselves?

mceachen · on July 9, 2021

Absolutely possible.

What I'm building into PhotoStructure is typically called "transfer learning."

https://en.wikipedia.org/wiki/Transfer_learning

PhotoStructure is entirely self-hosted, including model training and application: the public domain base models (trained on huge datasets) are fetched and cached locally.

By design, none of your data (or even metadata) leaves your server.

(I expect to ship this in an upcoming beta next month.)

mikewarot · on July 9, 2021

I want to label all the faces in the photos I've taken since 1997, and save them in the metadata. I'll be glad to run it against my photos. Windows 10, WSL, and/or Virtual Machine with Linux of your choice.

mceachen · on July 9, 2021

I've got desktop builds for macOS, Windows, and Linux, as well as "headless" builds for Docker and even "directly" via Node.js. Instructions here: https://photostructure.com/install

singhkays · on July 9, 2021

Nice! Will try this out. Are you planning on taking advantage of in-built neural engines like that in Apple M1 for speeding up object/facial recognition?

mceachen · on July 9, 2021

I'd like to, but practically speaking, I'm at the mercy of native support in the libraries I'm using. If support is added, though, it's trivial for me to add the switch as a user-definable setting.

waterheater · on July 8, 2021

Yes, you're talking about federated learning.

https://en.wikipedia.org/wiki/Federated_learning

bayindirh · on July 9, 2021

> If there were offline image recognition we could train on our own data privately...

Apple does all face recognition and image processing stuff on the edge. On your iPhone or Mac.

I wondered why my phone got frighteningly hot while charging sometimes. Then I saw the note after adding some faces manually for it to recognize, which was in the line of "Your phone will update faces when the phone is charging". My all photos are backed up to iCloud, btw.

pulkitsh1234 · on July 9, 2021

While possible, only the tech-savvy people would take part in this "collective", which is of-course a minor fraction of the data which Google has access to. This is the same argument as saying that if you care about privacy "just" don't use Google, easier said than done for the vast majority of people on earth.

cgio · on July 8, 2021

I am not an expert on the field. But my hope was that this could be facilitated by Transfer Learning. Still don’t know how the scale economies could be achieved. Maybe just out of the sweat and network of passionate people like in the case of open source.

Mehdi2277 · on July 9, 2021

I work in the field. Transfer learning helps get you decent/good models, but the best models remain ones trained on large amounts of data. You may be able to get away with good performance and not great on your task. For some areas you really care a lot about long tail performance (like self driving) that you will need massive dataset. For other areas if your goal is to be the best relative to other large companies you will need a massive dataset.

Transfer learning best use cases are for fast prototypes or for ml tasks that do not need state of the art performance.

mikewarot · on July 9, 2021

If there are hundreds of people with 100,000 photos each, that collectively is a massive training database, with a lot more labels and diversity of subjects.

By keeping the training data itself private, distributed and outsourced, you might be able to get otherwise unachievable levels of performance.

BlueTemplar · on July 9, 2021

This isn't going to solve the data ownership issues though, since they contaminate the program trained on them (and its blackbox nature only makes it worse)... though I guess that specifically for copyright it's going to depend on the final usage of that tool ?

mostdataisnice · on July 8, 2021

While interesting, we don't know enough about how models are learning to where we would be able to consider doing this.

bartimus · on July 9, 2021

There would still need to be a central model (and centralized management thereof) if I understand correctly.

nine_k · on July 9, 2021

Photoprism, digiKam, shotwell all have image recognition features, with varied levels of sophistication.

colordrops · on July 9, 2021

You don't encrypt your data before uploading to backblaze?

mikewarot · on July 9, 2021

Oh heck no, I never encrypt data.

I run windows. It can't ever be secure, anyone who wanted to hack me could.

Scrambling the data really makes things worse as any accident requiring recovery of my data is also probably going to lose the encryption key.

The only time I ever lost any significant chunk of data (a persons lifetime set of photos!) was because Windows encrypted data at rest, and thus it couldn't be recovered after a disk crash.

Unless there is some corporate or legal requirement to do so, I'll never encrypt a whole disk, or backup.

serf · on July 10, 2021

> any accident requiring recovery of my data is also probably going to lose the encryption key.

... why?

i'd hate encrypting too if I threw away all best-practices regarding it -- losing a key with the failed system is a "problem exists between chair and keyboard" type of issue.

Encryption protects your data from yourself, from your adversaries, from serendipitous grey-moral types, and from the prying eyes of over-zealous data-collection conglomerates.

You seem experienced in the field, so I won't presume what your best practices are -- but to be enthusiastic against encryption is a form of cheer-leading that I think I cannot ethically support; the longer I live and the more pervasive companies get to be with their data collection policies then the more powerful and required tools like encryption seem to become.

nonbirithm · on July 9, 2021

Agreed. Everybody talks about encrypting backups like it's common sense, but almost nobody talks about the risks involved with failing to back up the encryption key itself properly. The entire integrity of the backup then depends on that sensitive piece of data, and it's not something that can be openly shared by its nature, or included in the encrypted backup itself. It's even deceptive if your measure for success is restoring the backup to make sure it works properly, because there is now an implicit assumption that the encryption key is still valid and undamaged the next time you restore.

I wish backup tools like Duplicity would warn you about the risks of encrypting backups instead of warning the user if they disable encryption, because encryption has the possibility of rendering all those backups useless when the moment to use them finally comes.

I have a similar feeling that large swathes of my digital life would be rendered permanently inaccessible if 2FA was enabled and my device was rendered inoperable. (That's why I keep meticulous physical backups of emergency keys.) I think 2FA and the like should be considered a tradeoff with its own inherit risks and benefits, instead of a universally better option than randomly generated 80-character passwords alone.

spyckie2 · on July 9, 2021

The thought that data monopolization will be a moat against competitors is actually argued against by VC firms specializing in AI companies, who claim that after a certain amount of data (which is accessible to most people) the additional data isn't going to improve the model much.

https://a16z.com/2019/05/09/data-network-effects-moats/

https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-i...

hankman86 · on July 10, 2021

And in case of Copilot, the training data isn’t a moat anyhow. Last I looked, everyone could freely access GitHub public repositories.

abiro · on July 9, 2021

> If we predict that ultimately AI will change virtually every aspect of society, these companies will become omnipresent, "everything companies". God companies.

What we currently call AI is very from AGI, and it's not clear that sitting on piles of proprietary data gives an edge towards AGI. If the goal is human-level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system. :)

Current DL systems need huge amount of data, because they are very primitive: they work with immediate associations, so they require seeing data very similar to all possible inputs to generalize well.

As we develop more sophisticated systems, I expect that the leverage from data will tip over to engineering finesse, and nothing is better at fostering great engineering than the permissionless tinkering environment of open source.

SirensOfTitan · on July 9, 2021

> If the goal is human-level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system.

Pretending that the scientifically managed public school system, that attempts to manufacture uniform educated humans on a conveyer belt, is responsible for human education is fairly ridiculous.

Children have a remarkable capacity to learn, and do so automatically through free play and exploration until public education wrings that curiosity out of them and turns education into a job.

Humans get educated despite the public education system, not because of it.

dragonwriter · on July 9, 2021

> Pretending that the scientifically managed public school system,

Say what now? There may be places on Earth that practice scientific management, there are definitely some that pretend to, but IME public school systems are neither.

SirensOfTitan · on July 9, 2021

Schools (at least American public schools) are one of the last bastions of Taylorism in the west. They treat students like uniform widgets on an assembly line.

You can read for yourself: https://files.eric.ed.gov/fulltext/ED566616.pdf https://radicalpedagogy.icaap.org/content/issue3_2/rees.html

dragonwriter · on July 9, 2021

“Treating X as uniform widgets” (where X are not uniform widgets) and “scientific management” are not only not the same thing, they are anticorrelated.

guythedudebro · on July 9, 2021

Yeah everyone knows children will innately learn calculus from flinging mud at eachother.

visarga · on July 9, 2021

> If the goal is human level intelligence, that has been demonstrably achieved with the far lesser resources of the public school system.

Seems unlikely human education costs less than AI education in total.

aeyes · on July 8, 2021

For years we thought Google Translate was the best machine translation we would ever get. Then DeepL just popped up out of nowhere and today other services still didn't manage to catch up.

Every now and then you get someone to think about an old problem on a clean sheet of paper and you might get a better result with less training data / investment.

pyuser583 · on July 9, 2021

Google doesn’t have the best (publicly) available reverse image search AI. That would be Yandex.

Google is actually pretty crappy at reverse image searches.

https://www.bellingcat.com/resources/how-tos/2019/12/26/guid...

_fizz_buzz_ · on July 9, 2021

The point still stands though, Yandex is also a behemoth with access to a massive amount of data.

colordrops · on July 9, 2021

It's pretty clearly intentionally hobbled for various reasons (e.g. privacy, obscenity, etc). It used to work a lot better.

IfOnlyYouKnew · on July 9, 2021

Which would explain why Yandex, specifically, is the best-in-category: being based in a country where your government enjoys trolling developed world's idea of decency and responsibility can have its advantages.

(Until, of course, they force you out and give company to some crony oligarch. But that idea is also not unknown to Yandex, I believe?)

someperson · on July 9, 2021

On the other hand, DeepL (made by a small German company) is better than Google Translate.

summarity · on July 9, 2021

Makes it sound like DeepL exists in isolation. It's good because the company behind it has the largest hyperlocal (small phrases with confirmed usages) translation data set.

hexa22 · on July 8, 2021

There is a second best though. Apple offers image AI which is worse than googles but wins because it works offline.

lathiat · on July 9, 2021

I've got 70,000 photos in my library, with AI search and recognition, all done on my device. Thanks Apple.

In fairness it's not quite as good, but, it's good enough for the searches I've wanted to do so far and gets better all the time. And they're adding searching text in photos this release. I'm happy to wait a little for this better implementation.

tvirosi · on July 20, 2021

I largely agree. But there are still some fun opportunities around. One are things google would never touch because of PR reasons (e.g. state of the art scalable face identification). Another is just silly out-of-the-box creative uses of AI which wouldn't fit well with Google's brand.

jspaetzel · on July 8, 2021

If Google makes an amazing model that no-one can beat it will only be dominate as long as others get access to it freely. But if there are restrictions on access or if it's too expensive, other options will appear and even if they're not as perfect, they'll still be very usable. Imagine a coalition of companies all feeding data, that could compete just as well.

est31 · on July 8, 2021

Google has all the data of all the users though. I'd wager that they won't just let AI companies scrape it.

londons_explore · on July 9, 2021

I don't think Google uses user photos to train their photo search algorithm.

They use photos from the web for training, and then user photos are only used for the actual indexing.

ncr100 · on July 8, 2021

I think it's an innate quality of technology.

Yes sophisticated AI tech concentrates power for those who already have power.

And the technology we all (presumably readers of HN) create can enhance the impact of the user. And this can result in unfair circumstances, in reality.

Law and force can prevent disproportionate use of power. Of course one must define the law, which may be done AFTER the offense has been committed. Further, if those who make the laws are corrupted by those with e.g. this AI tech power, then no effective law may be enacted and the hypothetical abuse will continue.

mtrn · on July 8, 2021

The final step is to break down these monopolies. The government can do that and has done it before.

hamilyon2 · on July 9, 2021

Interesting, given that HN thinks that it is yandex who has SOA image search, not google https://news.ycombinator.com/item?id=23976172 which kinda counters your logic.

It is yandex who now collects massive amounts of data to improve their image search now, while google apparently doesn't.

Yandex is a giant, for sure, but google is, like, 10 times bigger and still doesn't provide the best service.

visarga · on July 9, 2021

They are not hoarding the latest results, except for a few cases where the general public is a year behind their secret sauce. Take a look at the huge zoo of planetary-scale models that are published by the big companies and universities (HuggingFace, https://modelzoo.co/, ...)

The problem with the huge models like GPT-3 is that they are too expensive even to run by regular people, not train.

hankman86 · on July 10, 2021

Regular people yes, but no problem for decently funded startups.

kulikalov · on July 9, 2021

This seem to be inevitable. An individual doesn't have horizontal scalability, you know... So, unless we'll have some kind of brain extension capabilities, there is no other choice but to build such technologies collectively.

Also, I think you are overdramatizing this. Governments used to be omnipresent (maybe still are), in a different way, more threatening to individuals and probably as threatening to societies as "everything companies" could be.

sdevonoes · on July 9, 2021

We can decide to stop using some (or most of) Google services. It's hard, but it's not that they are pointing us with a gun in order to use their services, right? Sure, for the cases when one cannot escape Google, use it; but for the rest of scenarios? It's all about tradeoffs: Can I live without YouTube? Can I live with DuckDuckGo (Google Search is "better" but I don't mind)? etc.

hankman86 · on July 10, 2021

Google search would be hard to replace. In fact, if Google search was turned off over night, the world would probably see a major economic downturn, caused by a sudden drop in productivity.

redeux · on July 9, 2021

You, a person knowledgeable in this field, may choose to stop using Google services, but that won’t have any societal impact if you can’t also convince the “average” user to do the same.

Siira · on July 10, 2021

And I have yet to see a single life-changing AI application. I haven't tested Copilot yet, but I'll bet it is so precariously useful that a lot of people will feel more productive without it. (BTW, the last time I opened VSCode, it could not even autocomplete Numpy, so I am not holding my breath for AI autocomplete.)

cbolton · on July 9, 2021

Well of course only the huge companies can develop products that require enormous resources.

But I'm not too worried here because everyone gets access to larger datasets every year, and it gets cheaper to process every year, so whatever Microsoft or Google is capable of doing now, smaller companies will be capable of doing in a few years.

hankman86 · on July 10, 2021

It’s also a huge call for innovation. When a student learns to code, (s)he doesn’t need to analyse millions of Git repositories to get good at it. Throughout their entire career most developers will probably only see comparatively little code. Perhaps the equivalent of the Linux kernel, if that. And yet, we’re able to learn from the little we see and get reasonably good at coding. It even stands to reason how much better one gets by reading more code (most of which is pretty crappy anyway).

sgrove · on July 8, 2021

I believe this is actually powered by OpenAI, which while large (now), is nowhere near the behemoth that Microsoft or Google is.

This suggests that seeing the future a bit ahead of the rest of the world, and then assembling a motivated all-star team is (perhaps in the short term at least) one way of out-competing the "super AI" of the giants.

shoo_pl · on July 9, 2021

Not only Microsoft basically bough OpenAI a couple of years ago, they also made the GPT-3 closed thing that you can only access via API.

Don't let the name fool you, OpenAI is anything but Open.

BlueTemplar · on July 9, 2021

Last I've checked, Microsoft pretty much owns OpenAI ?

hankman86 · on July 10, 2021

I don’t think that your premise concerning “planetary-scale AI” (and the ability to pull it off) holds up. If Google and Microsoft are so dominant and had such an insurmountable head start, why are we seeing such an enormous number of AI startups? In fact, there are countless startups busy figuring out how to make AI work for software development. I’d even argue that copilot was not that expensive to build. I very much doubt that GitHub (or Microsoft for that matter) had a huge team working on this or has spent such a vast amount on hardware resources that they’d outcompete the rest of the market by virtue of their cash reserves. Any decently funded startup should be able to finance such as effort. Especially since in this case, the training data is cheap (and legal) to access for anyone.

Where Microsoft does have an “unfair advantage” is in their marketing and sales firepower. Replicating their B2B and B2C sales channels is indeed very expensive. GitHub will be able to monetise Copilot by some upselling campaign. Then again, startups regularly manage to break into markets that are supposedly locked down by the likes of Microsoft.

keonix · on July 8, 2021

If the training set contains verbatim (A)GPL code does this mean that Copilot also should be distributed by Microsoft under GPL? Because without it Copilot (as it is distributed by Microsoft) couldn't be built, wouldn't it make it a derivative work of GPL'd code (and obviously every other license)?

I see a lot of people comparing human learning to machine learning in the comments, but there is a huge difference - we don't distribute copies of humans

AaronFriel · on July 8, 2021

No, see Authors Guild v. Google. Even without a license or permission, fair use permits the mass scanning of books, the storage of the content of those books, and rendering verbatim snippets of those books. The Google Books site is not a derivative work of the millions of authors they copied from, and if they did copy any coincidentally GPL, AGPL, or creative commons copyleft work, the fair use exception applies before we reach the question of whether Google is obligated to provide anything beyond what it is doing.

By comparison, Copilot is even more obviously fair use.

I've had this conversation quite a few times lately, and the non-obvious thing for many developers is that fair use is an exception to copyright itself.

A license is a grant of permission (with some terms) to use a copyrighted work.

This snippet from the Linux kernel doesn't make my comment here or the website Hacker News a GPL derivative work:

    ret = vmbus_sendpacket(dev->channel, init_pkt,
        sizeof(struct nvsp_message),
        (unsigned long)init_pkt, VM_PKT_DATA_INBAND,
        VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);

This snippet from an AGPL licensed project, Bitwarden, does not compel dang or pg to release the Hacker News source code:

    await _sendRepository.ReplaceAsync(send);
    await _pushService.PushSyncSendUpdateAsync(send);
    return (await _sendFileStorageService.GetSendFileDownloadUrlAsync(send, fileId), false, false);

Fair use is an exception to copyright itself. A license cannot remove your right to fair use.

The Free Software Foundation agrees (https://www.gnu.org/licenses/gpl-faq.en.html#GPLFairUse)

> Yes, you do. “Fair use” is use that is allowed without any special permission. Since you don't need the developers' permission for such use, you can do it regardless of what the developers said about it—in the license or elsewhere, whether that license be the GNU GPL or any other free software license.

> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.

(And even this verbatim copying from FSF.org for the purpose of education is... Fair use!)

jrm4 · on July 8, 2021

You're strongly and incorrectly implying that "Fair Use" is a clear (and relatively immutable) concept within copyright law, which couldn't be further from the truth. Even if this or that particular case sets out what appears to be solid grounds, one shouldn't take that as gospel by any means.

This mostly has to do with the nature of the wishy-washy nature of the 4 part Fair Use test, which, unlike decent legal tests, doesn't actually have discrete answers. The judge looks at the 4 questions, talks about them while waving her hands, and makes a decision.

Comparing to, e.g., Patent, where you actually do have yes-or-no questions. Clean Booleans. Is it Novel? Is it Non-Obvious? Is it Useful? If any of the above is "No", then no patent for you.

As for the execution of Fair Use, while I haven't gone too deep into Software, I can assure that for music, the thing is just a silly holy-hell mess; confirmed most recently by the "Blurred Lines" case, where NO DIRECT COPYING (e.g. sampling or melody taking) was alleged, merely that the song sounded really similar to "Got to give it up" and that was enough.

So then, I'd say everything either is, or should be, up in the air, when it comes to Fair Use and software.

cogman10 · on July 8, 2021

Most law is wishy washy. There are very few cut and dry answers in the law (If there were, we wouldn't need lawyers and a court system based on deciphering the law).

All that said, the one thing I'd add about fair use is that it isn't permission to use anything you like, but rather a defense in a legal proceeding about copyright. It's pretty much all about being able to reference copyrighted material with the law later coming in and making final decisions on whether or not that reference went too far. (IE, copying all of a disney movie and saying "What's up with this!" vs copying 1 scene and saying "This is totally messed up and here's why".)

That was a big part of the google oracle lawsuit.

extra88 · on July 8, 2021

> Is it Novel? Is it Non-Obvious?

Those questions for patents are barely more clear-cut than copyright fair use tests, there is lots of room for disagreement.

It's definitely true that a fair use defense against copyright infringement varies a lot by the field of work and norms can develop which are relevant to court cases. The music field is a mess, the "Blurred Lines" judgement was total bullshit. But the software field is not without its own copyright history and norms so there's no reason to expect everything to go to hell.

jrm4 · on July 8, 2021

But there's no reason not to either - I suppose my point is, don't take too much as gospel and think about everybody's best "end-goals" and push or pull with or against the law as needed.

TheSpiceIsLife · on July 8, 2021

There’s also an aspect of this that varies by size, budget, political clout, etc etc, of the individual or organisation.

The big guns like Microsoft, Google, Oracle, do this sort of thing as a matter of course in their business activities, they have the lawyers, the money, and the ear of members of parliaments, senators etc.

Whereas an individual or small business probably wants to conduct themselves within a more narrow set of adherences.

nerdponx · on July 8, 2021

Unanswered question, as far as I know: is a trained model a derivative work? If the model accidentally retains a copy of the work, is that an unauthorized copy?

zarzavat · on July 9, 2021

In my opinion, the model would not be an unauthorized copy given that it's primary purpose was for some other task and the inclusion of the work was merely incidental.

The unauthorized copy arises when someone gets the work out of the model.

Of course if you make a model explicitly for the purpose of evading copyright then the courts will see through that ploy.

jpalawaga · on July 8, 2021

I think it would be pretty easy to stake opinions on those "boolean questions."

Is (was?) a swipe gesture novel? Is it non-obvious?

Arelius · on July 8, 2021

I think what the parent is stating is that even though the patent questions can have debate, once you settle the question "Is it Novel" as yes or no you can determine if the item is patentable... wheras for fair-use, the questions themselves aren't yes/no questions, and further, they are just used as balancing factors, so even if everyone agrees on "the effect of the use upon the potential market for or value of the copyrighted work" it's only weighed as a factor for how fair the use is, and broadly left up to the hand-waving of the particular judge.

jrm4 · on July 8, 2021

Oh, absolutely. Kind of furthers my point. Patent is a silly mess in a lot of ways, but at least there's something like Booleans in it. "Fair use" doesn't even have THAT.

extra88 · on July 8, 2021

Yes to all this.

I think the factor most at risk in a fair use test with Copilot is whether it ever suggests verbatim, code that could be considered the "heart" of the original work. The John Carmack example that's popped up here at least gets closer to this question, it was a relatively small amount but it was doing something very clever and important.

One can imagine a project that has thousands of lines of code to create a GUI, handle error conditions, etc. that's built around a relatively small function; if Copilot spat out that function in my code, it might not be fair use because it's the "heart" of the original work. Additionally, its inclusion in another project could affect the potential market for the original, another fair use test.

But Copilot suggesting a "heart" is unlikely, something that would have to be ruled on in a case-by-case basis and not a reason to shut it down entirely. Companies that are risk-averse could forbid developers from using Copilot.

mthoms · on July 8, 2021

This is an excellent comment because it captures some important nuance missing from other analysis on HN.

I agree with you that the relative importance of the copied code to the end product would be (or should be) the crux of the issue for the courts in determining infringement.

This overall interpretation most closely adheres to the spirit and intent of Fair Use as I understand it.

RandomBK · on July 8, 2021

For any discussion on copyright and fair use, we should distinguish between the implications to Copilot the software itself and the implications to users of Copilot.

For Copilot itself, I do see the case for fair use, though it gets fuzzy should Microsoft ever start commercializing the feature. Nevertheless it remains to be seen whether ML training fits the same public policy benefits public libraries and free debate leverages to enable the fair use defense.

For Copilot users, I don't see an easy defense. In your hypothetical, this would be akin to me going on Google books and copying snippets of copyrighted works for my own book. In the case of Google books, they explicitly call out the limits on how the material they publish can be used. I'm contrast, Copilot seems to be designed to encourage such copying, making it more worry some in comparison.

Talanes · on July 9, 2021

>In your hypothetical, this would be akin to me going on Google books and copying snippets of copyrighted works for my own book.

A book completely written by pasting passages of other books would actually be a pretty interesting transformative work.

BlueTemplar · on July 9, 2021

Yeah, but a book like this would be an artistic work.

While software is in this limbo between copyrights and patents...

jkaplowitz · on July 8, 2021

The world is global. That's a US court ruling from one court of appeals. Most countries have narrower fair use rights than the US. Even if Copilot would fall within that legal precedent (far from guaranteed), a legal challenge in any jurisdiction worldwide outside the US states covered by that particular court of appeals, or which reaches the US Supreme Court, or which goes through the Federal Circuit Court of Appeals due to the initial complaint including a patent claim, would not be bound by that result and (especially in a different country) could very plausibly find otherwise.

What's more, if any of the code implements a patent, fair use does not cover patent law, and relying on fair use rather than a copyright license does not benefit from any patent use grant that may be included in the copyright license. If a codebase infringes a patent due to Copilot automatically adding the code, I can easily imagine GitHub being attributed shared contributory liability for the infringement by a court.

Not a lawyer, just a former law student and law feel layman who has paid attention to these subjects.

jkaplowitz · on July 9, 2021

> law feel layman

What a weird autocorrect typo. This should have read "law geek layman." (And it initially autocorrected again as I was typing this paragraph.)

shakna · on July 8, 2021

> No, see Authors Guild v. Google.

That case required that the output be transformative, in that "words in books are being used in a way they have not been used before".

Copilot only fits the transformative aspect if it is not directly reciting code, that already exists in the form that it is redistributing. So long as it does so, it fails to meet the criteria.

kmeisthax · on July 8, 2021

I think you might be considering two different acts here:

1. The act of training Copilot on public code

2. The resulting use of Copilot to generate presumably new code

#1 is arguably close to the Authors Guild v. Google case. You are literally transforming the input code into an entirely new thing: a series of statistical parameters determining what functioning code "looks like". You can use this information to generate a whole bunch of novel and useful code sequences, not just by feeding it parts of it's training data and acting shocked that it remembered what it saw. That smells like fair use to me.

#2 is where things get more dicey - just because it's legal to train an ML system on copyrighted data wouldn't mean that it's resulting output is non-infringing. The network itself is fair use, but the code it generates would be used in an ordinary commercial context, so you wouldn't be able to make a fair use argument here. This is the difference between scanning a bunch of books into a search engine, versus copying a paragraph out of the search engine and into your own work.

(More generally: Fair use is non-transitive. Each reuse triggers a new fair use analysis of every prior work in the chain, because each fair reuse creates a new copyright around what you added, but the original copyright also still remains.)

AaronFriel · on July 8, 2021

Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

Absent this, I don't think there's a case. The courts have given extraordinarily wide latitude to fair use and ML algorithms are routinely trained on copyrighted works, photos, etc. without a license.

I understand that this feels more personal because it involves our field, but artists and authors have expressed the same sentiment when neural nets began making pictures and sentences.

The question here is no different than "Is GPT-3 an unlicensed, unlawfully created derivative work of millions, if not billions of people?"

No, I'm quite confident it is not.

shakna · on July 8, 2021

> Is there any evidence of Copilot producing substantial (100s of lines) verbatim copies of copyrighted works?

It doesn't need to be substantial. In Google v. Oracle a 9-line function was found to be infringing.

AaronFriel · on July 8, 2021

If I recall correctly, the nine line question wasn't decided by the supreme court, but the API question was.

The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf

shakna · on July 8, 2021

> The Supreme Court did hold that the 11,500 lines of API code copied verbatim constituted fair use.

Yes, because it was _transformative_, in a clear way. Because an API is only an interface. Which makes that part of that decision largely irrelevant to the topic at hand.

> Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different compu-ting environment without discarding a portion of a familiar program-ming language. Google’s purpose was to create a different task-related system for a different computing environment (smartphones) and tocreate a platform—the Android platform—that would help achieve and popularize that objective.

> If I recall correctly, the nine line question wasn't decided by the supreme court, but the API question was.

It was already decided earlier, and Google did not contest it, choosing instead to negotiate a zero payment settlement with Oracle over the rangeCheck function. There was no need for the Supreme Court to hear it.

AaronFriel · on July 8, 2021

A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.

shakna · on July 8, 2021

> A $0 settlement means there is no binding precedent and signals to me that Oracle's attorneys felt they didn't have a strong argument and a potential for more.

That's not the case. It wasn't an out-of-court-settlement, but an agreement about the damages being sought, the court had already found it to be infringing, and that was part of the ruling.

But none of that changes that 9-lines is substantial enough to be infringing. It isn't necessary to be a large body of work.

> If they felt the nine line function made Google's entire library an unlicensed derivative work, they would have pressed their case.

No... It means the rangeCheck function was infringing. The implication you seem to have inferred here wouldn't be inferred by any kind of plagiarism case.

AaronFriel · on July 8, 2021

I think we agree then, and appreciate the correction on the lower court settlement.

If Copilot is infringing, I suspect it's correctable (by GitHub) by adding a bloom filter or something like it to filter out verbatim snippets of GPL or other copyleft code. (And this actually sounds like something corporate users would want even if it was entirely fair use because of their intense aversion to the GPL, anyhow.)

shakna · on July 8, 2021

It may be correctable... It doesn't change that Copilot is probably infringing today, which may mean that damages against GitHub may be sought.

b3morales · on July 8, 2021

The point of Copilot -- its entire value as a product -- is to produce code that matches the intent and semantics of code that was in the input. In other words, very deliberately not transformative in purpose.

infogulch · on July 8, 2021

Why did you choose the standard of "substantial" = "100s of lines"? Especially since we've already seen examples of verbatim output in the dozens of lines range, that choice of standard is rather conveniently just outside what exists so far. If we find a case with 200 lines of verbatim output will you say the only reasonable standard is 1000s of lines?

I don't think your argument is as strong as you're making it out to be.

AaronFriel · on July 8, 2021

Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use. I would be surprised if many of haven't inadvertently "copied" some GPL code in this way!

This goes to the "substantial" test for fair use. Clips from a film can contain core plot points, quotes from a book can contain vital passages to understanding a character, screen captures and scrapes of a website can contain huge amounts of textual detail, but depending on the four factors for fair use, still be fair use. (There have been exceptions though.)

The reaction on Hacker News to a machine producing code trained on their works is no different than the reactions artists and writers have had to other ML models. I suspect many of us are biased because it strikes at what we do and we think that our copyrights (because we have so many neat licenses) are special. They are not.

I think it would need to get to that level of "Copilot will emit a kernel module" before it's not obviously fair use.

After all, Google Books will happily convey to me whole pages from copyrighted works, page after page after page.

https://www.google.com/books/edition/Capital_in_the_Twenty_F...

jcelerier · on July 8, 2021

> Just a fairly arbitrary number. It's easy to produce a few lines from memory, up to 10s of lines and that's "obviously" fair use.

it's anything but obvious. https://www.copyright.gov/fair-use/

> there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.

9 lines of very run-of-the-mill code in Oracle / Google weren't considered fair use.

BlueTemplar · on July 9, 2021

A big difference is that software is both is and isn't an artistic work.

tedunangst · on July 8, 2021

It's not possible to get copilot to output a transformed version of the input?

shakna · on July 8, 2021

Transformed output _may_ fall under fair use.

However - Copilot directly recites code. That is _very unlikely_ to fall under fair use.

Redistributing the exact same code, in the same form, for the same purpose, probably means that Copilot, and thus the people responsible for it, are infringing.

nonfamous · on July 8, 2021

> However - Copilot directly recites code.

You make that statement as an absolute, but in the interests of clarity, all evidence so far shows that it directly recites code very rarely indeed. Even the Quake example had to be prompted by the specific variable names used in the original code.

In practice, the output code is heavily influenced by your own context — the comments you include, the variable names you use, even the name of the file you are editing — and with use it’s obvious that the code is almost certainly not a direct recitation of any existing code.

shakna · on July 8, 2021

> all evidence so far shows that it directly recites code very rarely indeed.

_Once_ is enough for it to be infringing. The law is not very forgiving when you try and handwave it away.

mthoms · on July 8, 2021

You sound quite sure that the outlying instances of direct copying wouldn't be covered by the Fair Use copyright exemption. Any particular reason for that?

I tend to think it would be covered (provided it there were relatively small snippets and not entire functions).

jkaplowitz · on July 8, 2021

I'm not the person you're replying to, but one strong reason is that the global reach and standardization of copyright law is far broader than the global reach and standardization of the fair use exception. A single non-US country in which GitHub Copilot is used in a way that would be infringing without the US fair use exception, and outside the scope of any such exception in that law, would be enough to cause GitHub/MS a legal hassle. There could well be more than one such country.

mthoms · on July 9, 2021

Oh, absolutely.

I'm not American, but like others around here — I was just restricting the discussion to American law for simplicity's sake.

jkaplowitz · on July 9, 2021

Fair, but GitHub/MS (same company now) can't afford to ignore other countries' law in their internal evaluations of whether globally* available products like Copilot are legal.

* Minus a few countries/regions targeted by US sanctions, I assume, though they've gradually broadened their services in sanctioned countries with the necessary licenses from OFAC.

shakna · on July 9, 2021

Precedent. Google v. Oracle found 9 lines, of an "obvious" implementation to be infringing.

mthoms · on July 9, 2021

Right, but would 3-4 lines in the middle of a 50 line function also be infringing? What about 2 lines?

I don't know the answer. I was only surprised that the commenter seemed dead sure that any and all copying (no matter how small) would be infringing.

That just doesn't correlate with my understanding of how Fair Use works: The "amount" of the infringement is one (of several) factors in determining if something falls under Fair Use:

>The third factor assesses the amount and substantiality of the copyrighted work that has been used. In general, the less that is used in relation to the whole, the more likely the use will be considered fair.

From https://en.wikipedia.org/wiki/Fair_use

svaha1728 · on July 8, 2021

So if a foreign company pilfers the source code to Windows, can they add it to a training set and then 'prompt' the machine learning algorithm to spit out a new 'copyright free' Windows, just by transforming the variable names?

rkeene2 · on July 8, 2021

I think that's my question regarding this whole thing:

If it's so fair use, why not train it on all Microsoft code, regardless of license (in addition to GitHub.com) ? Would Microsoft employees be fine with Copilot re-creating "from memory" portions of Windows to use in WINE ?

nonfamous · on July 8, 2021

Well no, because only GitHub has access to the training set. But more importantly this misunderstands how Copilot even works -- even if Windows was in the training set, you couldn't get Copilot to reproduce it. It only generates a few lines of code at a time, and even then it's almost certainly entirely novel code.

Now, if you knew the code you wanted Copilot to generate you could certainly type it character by character and you might save yourself a few keystrokes with the TAB key, but it's going to be much MUCH easier to simply copy the whole codebase as files, and now you're right back where you started.

svaha1728 · on July 9, 2021

GPT-3 is still Microsoft licensed, but a similar model can be put together with the freely available GPT-2 and source code -- especially if your intent is copyright transfer.

As Francois Chollet points out in this talk, ultimately deep neural network models are locally sensitive hash tables, so the examples of people pulling out source code is an inherent shortcoming of deep learning models in general. Give the right 'key' and you can 'recall' the value you are looking for.

https://www.youtube.com/watch?v=J0p_thJJnoo

ThrowawayR2 · on July 8, 2021

> "However - Copilot directly recites code."

Sounds like that wouldn't be difficult to fix? Transform the code to an intermediate representation (https://en.wikipedia.org/wiki/Intermediate_representation) as a pre-processing stage, which ditches any non-essential structure of the code and eliminates comments, variable names, etc., before running the learning algorithms on it. Et voila, much like a human learning something and reimplementing it, only essential code is generated without any possibility of accidentally regurgitating verbatim snippets of the source data.

salawat · on July 8, 2021

At that point, can we all just agree IP is the stupidest concept to ever be layered on top of math (which programming is) and move on with non-copyrightable code?

jcheng · on July 8, 2021

Only if you agree that copyleft licenses are also stupid; without copyright, there's no way to prevent companies from making closed-source forks of code you wrote and intended to stay open.

ThrowawayR2 · on July 9, 2021

The whole point of copyleft was as a stepping stone to get to RMS's four freedoms (https://www.gnu.org/philosophy/free-sw.en.html) which effectively eliminates copyright for software.

jcheng · on July 10, 2021

Freedom 1: “Access to the source code is a precondition”

With no copyright/copyleft, how do you enforce the rule that derived works must provide access to the source code? I’ve never heard that copyleft was a stepping stone—rather, it’s the stick that fully realizes the four freedoms.

salawat · on July 9, 2021

Correct. Copyleft is idiocy as well. You don't really need a pay for a proprietary fork of a tool when no one can keep you out of the free one, and the proprietary stuff diffuses into the free option.

oefnak · on July 8, 2021

Yes, sure. Without copyright there's no need for copyleft left, right?

jcheng · on July 9, 2021

No...? Not unless that closed-source project's source code is leaked?

BlueTemplar · on July 9, 2021

You don't care about attribution and other moral rights ?

(I guess these are going to depend a LOT on the jurisdiction that you're in ?)

salawat · on July 9, 2021

I care, but in the long run, I care more about our descendants not having tools locked out of their hands. Facilitated information asymmetry is the root of far too many evils.

Where is your ego when you're dead and gone? Where could we be if the majority of human advancement we're not tightly clutched as trade secrets?

As someone who has done paid software engineering (yes, you can feel free to call me a hack or sell out if you wish), I've come to find that the salary I've pulled over the years has not gone to me... But keeping a roof over those I love, helping other people's projects grow, giving people a shot, etc.

My time on the other hand, gets dumped into implementing the same handful of processes doing the same damn thing, but different this time, because you can't just bloody make "Here ya go, here's your Enterprise-in-a-box".

I'd like people more people able to solve novel problems than necessarily need to retread the same path over and over. Some degree of that will always have to be done to keep the skills fresh in the population, but we could do way better at marshaling that split, and I'm convinced part of what necessitates it is creating artificial barriers through things like enforced implementation monopolization. Yes. It ensures a minimum level of novelty and variance across populations, but it also does terribly at not consuming the finite amount of human capacity for truly novel thought to innovate.

It may make societies that function based on greed and economic/fiscal measures work, but I'm not convinced other incentive structures won't keep the rolling stone of innovation from accruing moss.

BlueTemplar · on July 10, 2021

I don't understand what you're talking about, I'm talking about the non-commercial parts of the monopoly rights that are copyrights and patents, the non-commercial parts arguably aren't going to restrict the users much, and their commercial parts are temporary by design.

(Copyright has went IMHO overboard with its duration, we should scale to back to the original 14 years renewable once, just like patents, but copyright doesn't apply to processes anyway, and so arguably it shouldn't apply to software that can't claim to have any artistic merit.)

_Understated_ · on July 8, 2021

> By comparison, Copilot is even more obviously fair use.

Not sure I see it that way.

If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

Copying and storing a book isn't recreating another book from it. Copilot is creating new stuff from the contents of the "books" in this case.

Edit: I misunderstood fair use as it turns out...

arianvanp · on July 8, 2021

Google did not scan those books and use it to build new books with different titles. The comparison doesn't hold up at all.

_Understated_ · on July 8, 2021

> Google did not scan those books and use it to build new books with different titles. The comparison doesn't hold up at all.

Not sure if you meant to reply to me but I agree with you: you can't compare what Google did to what Copilot does.

PieUser · on July 8, 2021

Copilot just suggests code.

kroltan · on July 8, 2021

And someone accepts it. Even if suggesting derivatives of licensed code is not a license infringement, then Copilot sure is a vector for mass license infringement by the people clicking "Accept suggestion". And those people are unable to know (without doing extensive investigation that completely nullifies the point of the tool) whether that suggestion is potentially a verbatim copy of some existing work in an incompatible license.

MadcapJake · on July 8, 2021

If I suggest whole lines of dialogue to you, the screenwriter, did I write those lines or you? If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?

Suggesting code is generating code

midev · on July 9, 2021

> did I write those lines or you

Neither. Someone else did, and published it. Copilot copied the dialog and suggested it.

> If you change names in those lines of dialogue to fit your story, do you now gain credit for writing those lines?

It depends. Talking generalities isn't productive or interesting. Can you give an example and we can discuss specifics?

> Suggesting code is generating code

This isn't even superficially true

jollybean · on July 8, 2021

There are situations where the question is are the mishmashes from Copilot 'fair use'.

But the other, more direct question is ... what about the instances where Copilot doesn't come up with a learned mishmash result? What happens when Copilot just gives you a straight up answer from it's learning data, verbatim?

Then you, as a dev, end up with a bunch of code that is effectively copied, via a 'copying tool', which is GPL'd?

It's that specific case that to me sticks out as the 'most concerning part'.

Please correct me if I'm wrong.

nonfamous · on July 8, 2021

For your specific case, “take your hard work that you clearly marked with a GPL license and then make money from it”, you don’t even need to rely on fair use. As long as you comply with the terms of the GPL, making money with the code is perfectly acceptable, and the FSF even endorses the practice. [1] Red Hat is but one billion-dollar example.

[1] https://www.gnu.org/licenses/gpl-faq.en.html#DoesTheGPLAllow...

b3morales · on July 8, 2021

But the person making money from the GPL code has to follow the terms of the license. Attribution, sharing modifications, etc.

nonfamous · on July 8, 2021

Correct. That's why I said "As long as you comply with the terms of the GPL".

AaronFriel · on July 8, 2021

I've edited my comment with examples and a clarification.

Fair use is an exception to copyright and, by definition, copyright licenses.

_Understated_ · on July 8, 2021

I understand the concept of fair use (I think) but I can't see how it applies to Copilot.

Google didn't create new books from the contents of existing ones (whether you agree that they should have been allowed to store the books or not) but Copilot is creating new code/apps from existing ones.

Edit: I guess my understanding of fair use was wrong. I stand corrected.

AaronFriel · on July 8, 2021

If Google Books were creating new books, that would only help their argument. Transformativeness is one of the four parts of the fair use test.

Copilot producing new, novel works (which may contain short verbatim snippets of GPL works) is a strong argument for transformativeness.

FemmeAndroid · on July 8, 2021

It would help the transformativeness, but it would substantially change the effect upon the market. By creating competing products with the copyrighted material, there is a higher degree of transformative, but you also end up disrupting the marketplace.

I don't know how a court would decide this, but I do think the facts in future GPT-3 cases are sufficiently different from Author's Guild that I could see it going any way. Plus, I think the prevalence of GPT-3 and the ramifications of the ruling one way or another could lead some future case to be heard by the Supreme Court. A similar case could come up in California, or another state where the 2nd Circuit Artist Guild case isn't precedent.

peteradio · on July 8, 2021

> short verbatim snippets of GPL works

Define short

unanswered · on July 8, 2021

[flagged]

_Understated_ · on July 8, 2021

Yeah, I realise that now.

However, where does one draw the line between fair use and derivative works?

Creating something based on other stuff (Google creating AI books from the existing ones for example) would possibly be fair use I think but would it not also be derivative works?

thrashh · on July 8, 2021

There's no clear line and there can never be because the world is too complex. We leave up determination to the court system.

Google Books is considered fair use because they got sued and successfully used fair use as a defense. Until someone sues over Copilot, everyone is an armchair lawyer.

elliekelly · on July 8, 2021

I don’t disagree with your point but was it necessary to make it in such a snarky way?

unanswered · on July 8, 2021

[flagged]

dang · on July 9, 2021

Would you please stop breaking the site guidelines? You've been doing it repeatedly and it's not cool. Please just be kind.

https://news.ycombinator.com/newsguidelines.html

unanswered · on July 9, 2021

This is the clearest display yet that moderation on HN has absolutely nothing to do with your purported values like constructive criticism, and has everything to do with whether dang agrees with you or not.

dang · on July 9, 2021

I actually have no idea what you were arguing about, nor which side you were on, nor what your argument was. I haven't paid enough attention to know those things, because (a) I don't want to, (b) I don't need to, and (c) not doing it leaves me in the desirable state of being incapable of agreeing or disagreeing.

It's a happy fact that figuring out people's arguments is often unnecessary for moderating the threads, especially in cases where people are breaking the site guidelines. Everyone needs to follow the site guidelines regardless of what the topic is, what their argument is, and how right they are or feel they are. Please stick to the rules when posting here.

https://news.ycombinator.com/newsguidelines.html

Arelius · on July 8, 2021

I don't think that's an accurate description...

Fair use is a defense for cases of copyright infringement, which means you're starting of from a case of copyright infringement, which sort-of muckys up the whole "innocent until proven guilty" thing. And considering it's a weighted test, it's hardly very cut-and-dry at that.

IgorPartola · on July 8, 2021

If you view GPL code with your browser would that mean that your browser now has to be GPL as well? In the sense that copilot is not much different than a browser for Stack Overflow with some automation, why would it need to be GPLed? Your own code on the other hand…

johndough · on July 8, 2021

For sake of discussion, it would be clearer to split copilot code (not derived from GPL'd works) and the actual weights of the neural network at the heart of copilot (derived from GPL'd works via algorithmic means).

For your browser analogy, that would mean that the "browser" is the copilot code, while the weights would be some data derived from GPL'd works, perhaps a screenshot of the browser showing the code.

I'd think that the weights/screenshot in this analogy would have to abide by the GPL license. In a vacuum, I would not think that the copilot code had to be licensed under GPL, but it might be different in this case since the copilot code is necessary to make use of the weights.

But then again, the weights are sitting on some server, so GPL might not apply anyway. Not sure about AGPL and other licenses though. There is likely some illegal incompatibility between licenses in there.

IgorPartola · on July 8, 2021

As I understand it the things copilot tries to do is automate the loop of “Google your problem, find a Stack Overflow answer, paste in the code from there into my editor”. In that sense, the burden of whether the license of the code being copy pasted is on the person who answered the SO question and on me. If this literally was what copilot did, nobody would bat an eye that some code it produced was GPL or any other license because it wouldn’t be copilot’s problem.

No let’s substitute a different database of for the code that isn’t SO. It doesn’t really matter if that database is a literal RDBMS, a giant git repo or is encoded as a neural net. All copilot is going to do is perform a search in that database, find a result and paste it in. The burden of licensing is still on me to not use GPL code and possibly on the person hosting the database.

The gotcha here is that copilot’s database is a neural network. If you take GPL code and feed it as training data to a neural network to create essentially a lookup table along with non-GPL code did you just create a derived work? It is unclear to me whether you did or not. In particular, can they neural network itself be considered “source code”?

atq2119 · on July 8, 2021

> If you view GPL code with your browser would that mean that your browser now has to be GPL as well?

Some good responses in sibling comments already, but I don't see the narrow answer here, which is: No, because no distribution of the browser took place.

If you created a weird version of the browser in which a specific URL is hardcoded to show the GPL'd code instead of the result of an HTTP request, and you then distributed that browser to others, then I believe that yes, you'd have to do so under the GPL. (You might get away with it under fair use if the amount of GPL'd code is small, etc.)

dtech · on July 8, 2021

If you use your browser to copy some GPL code into your project your project must now be GPL as well.

So following your own argument, even if Copilot is allowed, using it still risks you falling under GPL

IgorPartola · on July 8, 2021

My point exactly. Copilot is innocent in that case just like the browser.

dpe82 · on July 8, 2021

Or if you simply read GPL code and learn something from it - or bits of the code are retained verbatim in your memory, are you (as a person) now GPL'd? Obviously not.

zelphirkalt · on July 8, 2021

That probably depends on how large and how significant the bits you remember are. Otherwise one could take a person with photographic memory and circumvent all GPL licenses easily, by making that person type what they remember.

jcelerier · on July 8, 2021

> Or if you simply read GPL code and learn something from it - or bits of the code are retained verbatim in your memory, are you (as a person) now GPL'd? Obviously not.

I do not find that to be obvious at all.

dpe82 · on July 20, 2021

You do not find it obvious that a human being would not become a GPL'd work?

keonix · on July 8, 2021

To build a browser you don't need a verbatim GPL code, so it's not a derivative work in the same sense copilot is.

Stackoverflow on the other hand is much trickier question...

IgorPartola · on July 8, 2021

SO clearly doesn’t need GPL code to be useful. The wider SE network is evidence of that.

shadowgovt · on July 8, 2021

> If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

If I'm Google, and I scan your code and return a link to it when people ask to find code like that (but show an ad next to that link for someone else's code that might solve their problem too), that's fair use and legal. My search engine has probably stored your code in a partial format, and that's fine.

BlueTemplar · on July 9, 2021

It's fine because a search engine is a generic tool the main purpose of which is not to replicate the code verbatim to be used as code.

Hamuko · on July 8, 2021

>If I take your hard work that you clearly marked with a GPL license and then make money from it, not quite directly, but very closely, how is that fair use? Or legal?

You can wipe your ass with the GPL license if your use of the product falls within Fair Use.

You can actually take snippets from commercial movies and post them onto YouTube if your YouTube video is transformative enough for your usage to be considered fair use. Well, theoretically at least - in reality YouTube might automatically copyright strike it.

>Copying and storing a book isn't recreating another book from it.

That doesn't mean that GitHub has to redistribute Copilot under GPL. However, the end user could potentially have to if they use Copilot to generate new code that happens to copy GPL code verbatim.

_Understated_ · on July 8, 2021

> You can wipe your ass with the GPL license if your use of the product falls within Fair Use.

Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?

> That doesn't mean that GitHub has to redistribute Copilot under GPL

I wasn't saying that was the case: some of the code that Copilot used may not allow redistribution under GPL.

But let's say that all of the code it scanned was GPL for the sake of argument. Why would they not have to distribute their Copilot source yet, if I use it to generate some code, I'd have to distribute mine?

My spidey-sense it tingling at that one!

tshaddox · on July 8, 2021

> Is Copilot fair use? It's reading code, generating other code (some verbatim) and making money from it all while not having to release its source code to the world?

Again, fair use is an exception to copyright protection. If something is fair use, the license does not apply. The fact that Copilot does not release its source code is related only to a specific term of a specific license, which does not apply if Copilot is indeed fair use.

8note · on July 8, 2021

Making money is irrelevant to fair use

gavinhoward · on July 8, 2021

Totally relevant: https://en.wikipedia.org/wiki/Fair_use#1._Purpose_and_charac... .

zelphirkalt · on July 8, 2021

Irrelevant to GPL maybe.

ska · on July 8, 2021

> By comparison, Copilot is even more obviously fair use.

You are correct about (US specific) the fair use exception, but it is in no way as clear as you suggest that what copilot is doing entirely falls under fair use. Fair use is always constrained.

I suspect some variant of this sort of thing will have to be tested in court before the arguments are really clear.

bdowling · on July 9, 2021

> ...the non-obvious thing for many developers is that fair use is an exception to copyright itself.

More precisely, fair use is an affirmative defense to an claim of copyright infringement. A fair use defense basically says, "Yes, I am copying your copyrighted material and I don't have a license (or am exceeding a licensed use), but my usage is allowed under the fair use doctrine (codified in 17 USC 107 in US law)."

jollybean · on July 8, 2021

Thanks for this, but can you answer the question:

Would it be 'fair use' for the devlopers to simply copy code from those repos - even just 10 lines, and claim 'fair use' - i.e. circumventing Copilot?

Even if Copilot is 'fair use' ... does that mean the results are 'fair use' on the part of AutoPilot users?

And a bigger question: is your interpretation of those statues and case law enough to make the answer unambiguous?

I don't have legal background, but I do have an operating background with lawyers and tech ... and my 'gut' says that anyone using Copilot is opening themselves up to lawsuits.

If the code you put in your software comes, via Copilot, but that code is verbatim from some kind of GPL's (or worse, proprietary) ... there's a good chance you could get sued if someone gets the inclination.

Maybe it's because of my personal experience, but I can just see corporate lawyers banning Copilot straight up as the risks are simply now worth the upside. That's now what we like to hear in the classically liberal sense i.e. 'share and innovate' ... but gosh it doesn't feel like a happy legal situation to me.

Looking forward to people with more insight sharing on this important topic.

AaronFriel · on July 8, 2021

> Would it be 'fair use' for the devlopers to simply copy code from those repos - even just 10 lines, and claim 'fair use' - i.e. circumventing Copilot?

Only a lawyer (and truly, only a court) could answer that question.

If you copy 100 lines of code that amounts to no more than a trivial implementation in a popular language of how to invert a binary tree, it's likely fair use.

If you copy 10 lines of code that are highly novel, have never been written before, and solve a problem no one outside the authors have solved... It may not be fair use to copy that.

Other people who have replied have mentioned "the heart" of a work. The US Supreme Court has held that even de minimis - "minimal", to be brief - copying can sometimes be infringement if you copied the "heart" of a work.

otterley · on July 8, 2021

If this issue is eventually litigated, we will see. The law in the Second Circuit (where the final judgment was rendered before the case was eventually settled) may well be different than the law in a different circuit. If there is a split in the circuit courts, then the Supreme Court may have to weigh in on this issue.

When fair use is an issue, the courts look at the facts in context each time. These are obviously different facts than scanning books for populating a search index and rendering previews; and each side is going to argue that the facts are similar or that they are dissimilar. How the court sees it is going to be the key question.

tyre · on July 8, 2021

This could either be:

1. a fascinating Supreme Court opinion.

2. a frustrating ruling because SCOTUS doesn't understand software and code.

3. the type of anti-anticlimactically(?) narrow ruling typical of the Roberts court.

While our Congresspersons can't seem to wrap their minds around technology/social media, I think SCOTUS would understand this one enough to avoid (2).

otterley · on July 8, 2021

Fair use cases tend to produce narrowly-written law because the outcomes hinge on how the court judges the facts against the list of factors codified in the Copyright Act (17 U.S.C. section 107). The courts don't really have breathing room to use a different test. I don't recall any cases in which the courts have set binding guidelines for interpretation of these factors.

dtech · on July 8, 2021

The Google vs Oracle case showed that SCOTUS can handle technical topics

indigochill · on July 8, 2021

Next up, Copilot for college papers! Who needs to pay a professional paper-writer (ahem, I mean write the paper) when you can have an AI write your paper for you! It's fair use, so you're entitled to claim ownership to it, right?

jrochkind1 · on July 8, 2021

I think you are confusing legal protections for intellectual property with plagiarism. (At least that's what I think you're doing if I read your comment as sarcasm and guess what you're trying to say non-sarcastically?) But they are entirely different things.

You can be violating copyright without plagiarizing, so long as you cite your source, but if you copy a copyright-protected work in an illegal way when doing so.

And you can be plagiarizing without violating copyright, if you have the permission of the copyright holder to use their content, or if the content is in the public domain and not protected by copyright, or if it's legal under fair use -- but you pass it off as your own work.

Two entirely separate things. You can get expelled from school for plaguriism without violating anyone's copyright, or prosecuted for copyright without committing any academic dishonesty.

You can indeed have the legal right to make use of content, under fair use or anything else, but it can still be plagiarism. That you have a fair use right does not mean "Oh so that means you are allowed to turn it in to your professor and get an A and the law says you must be allowed to do this and nobody can say otherwise!" -- no.

indigochill · on July 8, 2021

Yeah, I was being sarcastic. But you make a good point about the legality of plagiarism.

tyre · on July 8, 2021

Copilot is not doing what your example does.

If Github had a service that automatically mirrored public repositories on Gitlab, that would be equivalent to the example you gave.

But Github is taking content under specific licenses to build something new for commercial use.

I'm not sure if what Github does falls under Fair Use, but I don't know that it matters. I can read fifty books and then write my own, which would certainly rely—consciously or not—on what I had read. Is that a copyright violation? It doesn't seem like it is but maybe it is and until now has been impossible to prosecute?

nojito · on July 8, 2021

GitHub isn’t building anything.

The end user is.

By this logic any and all neural nets that draw pictures are copyright infringing as well.

saati · on July 8, 2021

If they create exact copies of copyrighted pictures, then yes, they do.

logifail · on July 8, 2021

> Fair use is an exception to copyright itself. A license cannot remove your right to fair use.

...and if you're outside the USA?

Causality1 · on July 8, 2021

Read the Authors Guild v Google dismissal. The court considered it fair use because Google's project was built explicitly to let users find and purchase books, giving revenue to the copyright holders. Copilot does not do that.

bdowling · on July 9, 2021

> ... giving revenue to the copyright holders.

That's a reference to factor four of the fair use test, "the effect of the use upon the potential market for or value of the copyrighted work." (17 USC 107).

None of the factors are dispositive, however. For example, a scathing book review that quotes a passage to show how bad the writing is might eviscerate sales of the book, but such a use is usually protected. For a counter-example, see Harper & Row v. Nation Enterprises 471 U.S. 539 (1985).

ElFitz · on July 8, 2021

> Note, however, that there is no world-wide principle of fair use; what kinds of use are considered “fair” varies from country to country.

Exactly the point I came to make.

The Authors’ Guild is a US entity, and so is Google, so only US law applies. And thus, we have the Fair Use exception.

But developers sharing code on GitHub come from and live all over the world.

Now, Github’s ToS do include the usual provision stating that US & California law applies, et cætera, et cætera [1], but… and even they acknowledge it may be the case, such provisions usually aren’t considered legal outside of the US.

So… developers from outside the US, in countries with less lenient exceptions to copyright, definitely could sue them.

Identifying these countries and finding those developers, however, is a different matter altogether.

[1]: https://docs.github.com/en/github/site-policy/github-terms-o...

adverbly · on July 8, 2021

This was a good point. Really enjoying this discussion. Interesting stuff.

I'm really out of my depth in giving my own opinion here, but I'm not sure that either the "distribution != derivative" characterization, or that "parsing GPL => derivative of GPL" really locks this thing down. The bit that I can't follow with the "distribution != derivative" argument is that the copilot is actually performing distribution rather than "design". I would have said that copilot's core function is generating implementations, which to me does not seem like distribution. This isn't a "search" product, and it's not trying to be one. It is attempting to do design work, and I could see a case where that distinction matters.

wilde · on July 8, 2021

I buy the argument about copilot itself and this comment. But when someone goes to release software that uses the output of Copilot, I fail to see how they wouldn’t be a GPL derivative work if enough source was used. Copilot is essentially really fancy copy/paste in that context.

IgorPartola · on July 8, 2021

I think this is the correct answer. IANAL but the copilot code vs the copilot training data are different things and licensing for one shouldn’t affect the other, right? And the fact that training data happens to also be code is incidental.

8note · on July 8, 2021

One view would be that copilot the app distributes GPL'd code, in a weird encoding. Training the model is a compilation step to that encoding

keonix · on July 8, 2021

I assume the code is a derivative work of training data because given different data code would be also different (neuron weights)

IgorPartola · on July 8, 2021

If I read a GPL implementation of a linked list and then write my own linked list implementation, was my neural network in my brain a derivative work of the GPL code?

keonix · on July 8, 2021

Sure it is, you brain is not software though

IgorPartola · on July 8, 2021

So as long as I read GPL code, then rewrite it from memory and feed it to copilot to train it I can unGPL anything?

mirekrusin · on July 8, 2021

If fair use memorising whole source code byte-by-byte, storing it as ie. some non-100%-lossless compression for subsequent retrieval or arbitrary size snippets?

random314 · on July 8, 2021

If copilot was trained using the entirety of the linux kernel, wouldn't the neural network itself need to be GPLed, if not its output.

echelon · on July 8, 2021

> Even without a license or permission, fair use permits the mass scanning of books, the storage of the content of those books, and rendering verbatim snippets of those books.

For commercial use and derivative works?

Authors won't incorporate snippets of books into new works unless they're reviews. Copilot is different.

AaronFriel · on July 8, 2021

Google Books is a commercial site which incorporated the snippets of millions of copyrighted works. And of course, sitting in thousands of Google servers/databases are full copies of each of those books, photos of each page, the OCRed text of each page, and indexes to search them. Even that egregious copying without a license or permission was considered fair use.

If anything, the ways in which Copilot is different aid Microsoft/GitHub's argument for fair use. Because Copilot creates novel new works, that gives them a strong argument their system is more transformative than Google Books, which just presents verbatim copies of books.

cycomanic · on July 8, 2021

The Google books example really misses the point, one of the reasons why the judges considered it fair use was because it was pointing back to the original sources (and thus potentially increasing publishers earnings).

Copilot does none of that. If all the ML companies are so sure this is fair use I encourage them to train an AI on Disney movies to generate short cartoon snippets based on some description. There sure would be a court case.

Mehdi2277 · on July 9, 2021

The main issue here is less doing it, but getting sufficiently nice results. I've done work in generative AI before and right now the state of the art is passable on single images with some but not enough control and is still weak on videos without heavy structure requirements. I expect in 5-10 years we will have good enough models (or hardware) to do short video generation and the question will get tested then. I also think a meaningful good video requires audio and have fun making well aligned text (for dialogue) audio of that text, and video frames. Aligning all that generation together is still challenging today.

extra88 · on July 8, 2021

> Authors won't incorporate snippets of books into new works

Of course they do, previous works are quoted all the time.

e12e · on July 8, 2021

But that's another thing - co-pilot doesn't quote it encourages something more akin to plagarism, doesn't it?

extra88 · on July 8, 2021

Plagiarism, pretending you made a work entirely yourself when you didn't, is rarely a matter for a court to decide and the standards for what constitutes plagiarism can vary a lot. When I turn in projects for a course, a cite sources in the comments a lot, even if what I turn in is substantially modified. An employer generally doesn't care if you copied and pasted code from StackOverflow or wherever, so long as you don't expose them to a suit and you don't lie if asked "Did you write this 100% yourself?"

Citing your source is not a get out jail free card for copyright infringement, it doesn't really matter.

e12e · on July 8, 2021

> Citing your source is not a get out jail free card

No, but it's a requirement of the license stackoverflow.com uses, which is unfortunate, for code (as opposed to text, where a quote can be easily attributed).

b3morales · on July 8, 2021

...with attribution.