I’m genuinely surprised to see this not discussed more by the FOSS community. There are so many ways to blow past the GPL now:
1. File by file rewrite by AI (“change functions and vars a bit”)
2. One LLM writes a different-language (or pseudocode) version of each function that a second LLM translates back into code, testing for input/output parity
The real danger is that this becomes increasingly undetectable in closed source code and can continue to sync with progress in the GPLed repo.
I don’t think any current license has a plausible defense against this sort of attack.
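To make the parity check in option 2 concrete, the "input/output parity" gate is trivial to build. A minimal sketch in Python, where the two clamp functions are made-up stand-ins for a GPL'd original and an LLM's restructured rewrite:

```python
import random

def io_parity(original, rewrite, test_inputs):
    """Acceptance gate for option 2: the second LLM's output only "ships"
    if it is behaviourally identical to the original on every test input."""
    return all(original(x) == rewrite(x) for x in test_inputs)

# Toy stand-ins for a GPL'd function and an LLM "rewrite" of it.
def gpl_clamp(x):
    return max(0, min(255, x))

def llm_clamp(value):  # renamed variable, restructured control flow
    if value < 0:
        return 0
    return 255 if value > 255 else value

inputs = [random.randint(-1000, 1000) for _ in range(10_000)]
print(io_parity(gpl_clamp, llm_clamp, inputs))  # True
```

Nothing in the rewritten text matches the original, yet the behaviour is byte-identical, which is exactly why this is hard to detect from the outside.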
I’ve never delved fully into IP law, but wouldn’t these be considered derivative works? They’re basically just reimplementing exactly the same functionality with slightly different names?
This would be different from the “API reimplementation” (see Google vs Oracle) because in that case, they’re not reusing implementation details, just the external contract.
Because copyright does not protect ideas. Thankfully. We are free to express ideas, as long as we do so in our own words. How that principle is applied in actual law, and how it is applied to software, is ridiculously complicated, but that is the heart of the principle at play here. The law draws a line between ideas (which cannot be copyrighted) and particular expressions of those ideas (e.g. the original source code), which are protected. However, it is an almost fractally complicated line which, in many places, relies on concepts of "fairness" and, because our legal system uses a system of legal precedent, depends on interpretation of a huge body of prior legal decisions.
Not being a trained lawyer, or a Supreme Court justice, I cannot express a sensible position as to which side of the line this particular case falls on. There are, however, enormously important legal precedents that pretty much all professional software developers use to guide their behaviour with respect to handling of copyrighted material (IBM v. Amdahl and Google v. Oracle, particularly) that seem to suggest to us non-lawyers that this sort of reimplementation is legal. (Seek the advice of a real lawyer if it matters.)
Taking a step back, it seems fairly clear that wherever you set the bar, it should be possible to automate a system that reads code, generates some sort of intermediate representation at the acceptable level of abstraction and then regenerates code that passes an extensive set of integration tests … every day.
At that point our current understanding of open source protections … fails?
"change functions and vars a bit" isn't a rewrite. Anything where the LLM had access to the original code isn't a rewrite. This would just be a derivative work.
However, most of the industry willfully violates the GPL without even trying such tricks anyway, so there are certainly issues.
#1 is already possible and always has been. I never heard of a case of anyone actually trying it. #2 is too nitpicky and unnecessarily costly for LLMs. It would be better to just ask it to generate a spec and tests based on the original, then create a separate implementation based on that. A person can do that today free and clear. If LLMs are able to do this, we will just need to cope. Perhaps the future is in validating software instead of writing it.
(1) sounds like a derivative work, but (2) is an interesting AI-simulacrum of a clean room implementation IF the first LLM writes a specification and not a translation.
+1 I’ve always had the feeling that training from randomly initialized weights without seeding some substructure is unnecessarily slowing LLM training.
Similarly I’m always surprised that we don’t start by training a small set of layers, stack them and then continue.
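A minimal sketch of that stacking idea in plain NumPy, with an arbitrary toy rule (depth-double by duplicating each trained layer); the dimensions and growth factor are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(dim):
    # He-style random init for one square weight matrix (toy setup).
    return rng.normal(0, np.sqrt(2 / dim), size=(dim, dim))

def grow(layers, factor=2):
    """Depth-grow a trained stack by duplicating each layer in place.

    Each small-model layer seeds `factor` consecutive layers of the
    deeper model, so the big net starts from learned structure rather
    than from noise (one common "progressive growing" recipe)."""
    return [w.copy() for w in layers for _ in range(factor)]

small = [init_layer(64) for _ in range(4)]  # pretend this stack is trained
large = grow(small)                         # 8-layer init, no random restart
print(len(large))  # 8
```

In practice you would interleave training between growth steps and possibly add small noise to break the symmetry between duplicated layers.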
Better-than-random initialization is underexplored, but there are some works in that direction.
One of the main issues is: we don't know how to generate useful computational structure for LLMs - or how to transfer existing structure neatly across architectural variations.
What you describe sounds more like a "progressive growing" approach, which isn't the same, but draws from some similar ideas.
In terms of sub structure - in the old days of Core Wars randomly scattering bits of code that did things could pay off. I’m imagining similar things for LLMs - just set 10% of weights as specific known structures and watch to see which are retained / utilized by models and which get treated like random init
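A toy version of that experiment: plant identity-like "known structure" in ~10% of a weight matrix, apply a crude noise process standing in for training updates, and measure which seeds survived. Every size and threshold here is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(256, 256))

# Hypothetical "known structure": plant 1.0s on ~10% of the diagonal,
# the Core Wars-style scattered seed.
seeded = rng.choice(256, size=26, replace=False)
W[seeded, seeded] = 1.0
planted = W[seeded, seeded].copy()

# Crude stand-in for training: 100 small noisy updates, then check which
# seeds are still near their planted value (i.e. were "retained").
for _ in range(100):
    W += rng.normal(0, 1e-3, size=W.shape)

drift = np.abs(W[seeded, seeded] - planted)
retained = (drift < 0.05).mean()
print(f"{retained:.0%} of planted weights retained")
```

With a real optimizer the interesting signal is differential: which planted structures the loss gradient preserves versus erases.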
It’s interesting that you invest in mouse movements vs just targeting a click at X in Y milliseconds. CAD and video games are of course a great reason for this, but I wonder how much typical tool use can be modeled by just next click events.
I’d love to see this sort of thing paired with eye tracking and turned into a general purpose precog predictive tool for computer use … but you probably have many better use cases for your world model!
+1 this does seem to be a genuine attempt to actually build an interpretable model, so nice work!
Having said that, I worry that you run into illusion-of-consciousness issues where the model changes attribution from "sandbagging" to "unctuous" when you control its response, because the response is generated outside of the attribution modules (I don't quite understand how cleanly everything flows through the concept modules and the residual). Either way, this is a sophisticated problem to have. Would love to see if this can be trained to parity with modern 8B models.
Yeah, I've often wondered why folks aren't training two-tier MoEs for VRAM + RAM. We already have designs for shared experts, so it cannot be hard to implement a router that allocates 10x or 100x as often to "core" experts vs the "nice to have" experts. I suppose balancing during training is tricky, but some sort of custom loss on the router layers should work.
I've also wondered why the routers aren't trained to be serially consistent so you can predict layers to swap into VRAM a few layers ahead to maximize available bandwidth.
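A sketch of what the "core experts win 10-100x more often" routing could look like. The fixed logit bonus is a crude stand-in for whatever learned bias or custom router loss you would actually train; all sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, N_CORE, TOP_K = 64, 8, 2

def route(h, w_router, core_bias=2.0):
    """Top-k routing with a fixed logit bonus for the first N_CORE
    ("always resident in VRAM") experts. The constant bonus stands in
    for a learned bias or a custom loss on the router layers."""
    logits = (h @ w_router) / np.sqrt(h.shape[1])  # roughly unit variance
    logits[:, :N_CORE] += core_bias
    return np.argsort(-logits, axis=1)[:, :TOP_K]

h = rng.normal(size=(1024, 128))        # a batch of token activations
w = rng.normal(size=(128, N_EXPERTS))
picks = route(h, w)
core_frac = (picks < N_CORE).mean()
print(f"core experts take {core_frac:.0%} of routing slots "
      f"(uniform would be {N_CORE / N_EXPERTS:.0%})")
```

The balancing problem mentioned above shows up here too: the stronger the bias, the less the long-tail experts get trained.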
I think part of the issue is that in production deployments, you're batching high enough that you'll be paging in those long tail experts constantly.
Unless you're handling that in some kind of fancy way, you'll be holding up the batch while waiting for host memory, which will kill your throughput.
It makes much more sense for non batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.
Ideally, you should rearrange batches so that inference steps that rely on the same experts get batched together, then inferences that would "hold up" a batch simply wait for that one "long tail" expert to be loaded, whereupon they can progress. This might require checkpointing partial inference steps more often, but that ought to be doable.
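A minimal sketch of that re-batching step, assuming you already know which expert each pending token step needs (the data layout is invented for illustration):

```python
from collections import defaultdict

def group_by_expert(pending):
    """Re-batch pending token steps so that steps needing the same expert
    run together, and a cold expert is paged in once per group rather than
    once per token. `pending` is a list of (token_id, expert_id) pairs."""
    groups = defaultdict(list)
    for token_id, expert_id in pending:
        groups[expert_id].append(token_id)
    # Serve hot experts first; cold ones wait behind one DRAM->VRAM load.
    return sorted(groups.items(), key=lambda kv: -len(kv[1]))

pending = [(0, 3), (1, 3), (2, 17), (3, 3), (4, 17), (5, 42)]
for expert, tokens in group_by_expert(pending):
    print(f"expert {expert}: tokens {tokens}")
```

The checkpointing cost mentioned above is the real engineering burden: each deferred token has to park its partial activations until its expert arrives.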
I think this is doable for very long tail experts that get swapped in for specialised topics - say, orbital mechanics.
But for experts that light up at, say, 1% frequency per batch, you're doing an awful lot of transfers from DRAM which you amortize over a single token, instead of reads from HBM which you amortize over 32 tokens.
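Rough numbers for that amortization argument; every figure below is illustrative, not a measurement of any particular system:

```python
# A 1%-frequency expert paged in over PCIe and used for one token,
# vs a resident expert read from HBM and shared across a batch.
expert_bytes = 100e6   # one expert's weights, ~100 MB (assumed)
pcie_bw      = 32e9    # host RAM -> VRAM, ~32 GB/s (PCIe 4.0 x16)
hbm_bw       = 3e12    # resident weight read from HBM, ~3 TB/s
batch        = 32

cold_per_token = expert_bytes / pcie_bw           # amortized over 1 token
hot_per_token = (expert_bytes / hbm_bw) / batch   # amortized over the batch

print(f"cold: {cold_per_token * 1e3:.2f} ms/token")
print(f"hot:  {hot_per_token * 1e6:.2f} us/token")
print(f"ratio: {cold_per_token / hot_per_token:.0f}x")
```

With these assumed numbers the paged-in expert costs roughly three orders of magnitude more per token, which is the crux of the batching objection.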
I think your analysis is right; this would make sense mostly for the 30B-3A style models that are mostly for edge / hobbyist use, where context length is precious so nobody is batching.
Given that experts live per layer, I don't think it makes sense to have orbital mechanics experts, but … I have wondered about swapping out the bottom 10% of layers per topic, given that that is likely where the highest-order concepts live. I've always wondered why people bother with LoRA on all layers, given that the early layers are more likely to be topic-agnostic and focused on more basic pattern assembly (see the recent papers on how LLMs count on a manifold).
1) This is basically the intention of several recent MoE models: keep particular generally useful experts hot in VRAM.
2) Unless you can swap layers in faster than you consume them, there is no point to predicting layers (what does this even really mean? Did you mean predicting experts?).
It seems at the moment the best you can do is keep experts and layers more likely to be used for a given query in VRAM and offload the rest, but this is work-dependent.
So llama.cpp currently statically puts overflow MoE experts in RAM and inferences them on CPU, so you get a mix of GPU + CPU inferencing. You are rooflined by RAM->CPU bandwidth + CPU compute.
With good predictability of MoE routing, you might see a world where it's more efficient to spend PCIe bandwidth (slower than RAM->CPU) on loading MoE experts for the next ~3 layers from RAM to VRAM so you are not rooflined by CPU compute.
VLLM / SGLang (AFAIK) just assume you have enough VRAM to fit all the experts (but will page KV cache to RAM).
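Back-of-envelope for when spending PCIe bandwidth beats CPU inference from RAM, per the scenario above. All figures are illustrative; `reuse` is how many times a transferred expert gets used before eviction, which is exactly what good routing predictability buys you:

```python
expert_bytes = 50e6    # bytes per expert (assumed)
ram_bw       = 80e9    # RAM -> CPU, what CPU inference is rooflined by
pcie_bw      = 32e9    # RAM -> VRAM over PCIe

cpu_per_use  = expert_bytes / ram_bw    # CPU pays this on every use
gpu_transfer = expert_bytes / pcie_bw   # GPU pays once, then reads from HBM

for reuse in (1, 2, 4, 8):  # uses before the expert is evicted again
    winner = ("prefetch to VRAM" if gpu_transfer < reuse * cpu_per_use
              else "CPU inference")
    print(f"reuse={reuse}: cpu {reuse * cpu_per_use * 1e3:.2f} ms, "
          f"pcie {gpu_transfer * 1e3:.2f} ms -> {winner}")
```

With these assumed bandwidths the crossover lands around reuse of 3, so prefetching only pays off when the router is predictable enough that a transferred expert gets reused.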
Most of the work I'm aware of starts from the perspective of optimizing inference, but the implication, that the lessons should be pushed upstream into training, gets mentioned here and there.
Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models (https://arxiv.org/abs/2505.16056)
I'd really like to see this optimized for the 50-120B parameter open source models that are locally viable (gpt-oss-120b, qwen3-80b-3a etc.).
For them I think it would be optimal to provide a tag per function and trust the LLM to rewrite the whole function. As the article notes, full reproduction is generally more reliable than editing for short code.
The token and attention overhead from a per-line hash, I suspect, limits this approach for smaller models.
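A sketch of the per-function tag idea, using Python's `ast` to map tags to line spans so the model only has to name a tag and emit the complete replacement function. The tagging scheme (`fn:1`, `fn:2`, ...) is made up for illustration:

```python
import ast
import textwrap

SOURCE = textwrap.dedent("""
    def add(a, b):
        return a + b

    def sub(a, b):
        return a - b
""")

def tag_functions(source):
    """Give every top-level function a stable tag so a small model can say
    'replace fn:2 with <new body>' instead of emitting per-line hashes.
    Tags map to (start, end) line spans in the source."""
    tree = ast.parse(source)
    return {f"fn:{i}": (node.lineno, node.end_lineno)
            for i, node in enumerate(tree.body, 1)
            if isinstance(node, ast.FunctionDef)}

def apply_rewrite(source, tags, tag, new_code):
    lines = source.splitlines()
    start, end = tags[tag]
    return "\n".join(lines[:start - 1] + new_code.splitlines() + lines[end:])

tags = tag_functions(SOURCE)
patched = apply_rewrite(SOURCE, tags, "fn:2",
                        "def sub(a, b):\n    return b - a  # rewritten")
print("fn:2" in tags, "b - a" in patched)  # True True
```

The model never sees line numbers at all, just one short tag per function, which keeps the context overhead near zero.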
The Earth is an oblate spheroid, to an approximation. It's not that it isn't symmetric, but at the equator the north-south direction has a higher rate of curvature than anywhere else (while the east-west rate is somewhat lower because of the larger circumference due to the bulge).
So it's crazy that the long lines of sight are near the equator on a north-south axis (or, symmetrically, south-north), because the high rate of curvature in that direction at those latitudes should give the shortest distance to the horizon on Earth, making those lines of sight that much more impressive!
Skills aren't the right fulcrum for moving LLMs on something as generic as well-known language best practices. Trust the model owner to cover that in post-training.
Use skills for more specific things (idiosyncratic patterns, specific library docs, project specific info) that an LLM cannot be expected to know already, otherwise you are just wasting context.
I think it may be time for us to think about what the sensible versions of these capabilities are.
Short term hacky tricks:
1. Throwaway accounts - make a spare account with no credit card for Airbnb, Resy, etc.
2. Use read-only access where possible. It's funny that banks are the one place where you can safely get read-only data via an API (Plaid, SimpleFIN, etc.). Make use of it!
3. Pick a safe comms channel - ideally an app you don't use with people - to talk to your assistant. For the love of god, don't expose your two-factor SMS tokens (also, ask your providers to switch you to proper two-factor auth; most finally have the capability).
4. Run the bot in a container with read only access to key files etc.
Long term:
1. We really do need services to provide multiple levels of API access: read-only, plus some sort of very short-lived "my boss said I can do this" transaction token. Ideally your agent would queue up N transactions, give them to you in a standard format, you'd approve them with FaceID, and that would generate a short-lived per-transaction token scoped pretty narrowly for the agent to use.
2. We need sensible micropayments. The more transactional and agent-in-the-middle the world gets, the less services can survive on webpages, apps, ads, and subscriptions.
3. Local models are surprisingly capable for some tasks and privacy safe(er)... I'm hoping these agents will eventually permit you to say "Only subagents that are local may read my chat messages"
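A sketch of the short-lived, narrowly scoped transaction token from point 1. The claim fields, TTL, and key handling are invented for illustration; a real service would use a proper key store and probably a standard token format:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # held by the service, never by the agent

def mint_token(transaction, ttl=120):
    """Mint a short-lived, single-transaction token after the user approves
    (e.g. via FaceID). The claims pin payee, amount, and expiry, so the
    agent can execute exactly this transaction and nothing else."""
    claims = {**transaction, "exp": int(time.time()) + ttl}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def verify(token):
    """Return the claims if the signature checks out and the token is
    unexpired, else None."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims if claims["exp"] > time.time() else None

tok = mint_token({"payee": "resy", "amount_usd": 80})
print(verify(tok)["payee"])  # resy
```

A tampered or expired token verifies to None, so the blast radius of a compromised agent is one pre-approved transaction for two minutes.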