I like your example of prokaryotic microbes because I think it points to the difference in our points of view.
Microbes evolved to increase their own chances of reproduction; they are inherently autopoietic. The AI risk arguments are usually predicated on AI systems developing similar reproductive mechanisms, but I don't see why this would be the case. Sure, an AI creator may design their AI to evolve to become more performant at its given task. But why would someone build an AI that evolves to become more performant at reproducing itself and not its builder?
As an example, think of evolutionary algorithms. These are designed to evolve a solution to a problem. Instances of this solution reproduce, but these reproductions are guided by the design of the algorithm itself, and so they would not reproduce their parent algorithm. What is different about machine-learning-based AI? Why would ML-based AI always lead to autopoietic behaviour?
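To make that concrete, here's a toy sketch in Python (all names are illustrative, not any real library): a genetic algorithm where candidate solutions are copied and mutated every generation, but nothing in the process ever reproduces the algorithm that hosts them.

```python
import random

def evolve_max_ones(length=20, pop_size=30, generations=50, seed=0):
    """Toy genetic algorithm: evolve a bitstring towards all 1s.
    The candidate solutions reproduce (they are copied and mutated),
    but the loop driving that reproduction lives outside them --
    nothing here copies or modifies the algorithm itself."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=sum, reverse=True)      # fitness = number of 1s
        parents = pop[: pop_size // 2]       # selection
        children = []
        for parent in parents:
            child = parent[:]                # reproduction (copying)
            child[rng.randrange(length)] ^= 1  # mutation
            children.append(child)
        pop = parents + children
    return max(sum(ind) for ind in pop)
```

The solutions evolve; the algorithm is fixed. The question is why an ML system should behave differently.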
> But why would someone build an AI that evolves to become more performant at reproducing itself and not its builder?
Because people are not building AIs that meaningfully encode any of their creators' preferences whatsoever. They are building AIs that are, in a very broad sense, capable at the tasks they've been trained on to increasingly general degrees, and then on top of that they do a bunch of finagling to try to point the system somewhat vaguely in the direction of increasing usefulness.
When you have a system with capabilities rivalling humans', as well as the general ability to apply its skills across broad ranges of tasks, then the abilities to do things like self-replicate, make plans that involve mundane deceit, or perform smart-human levels of hacking already exist. To the extent that the system isn't directly optimizing for what the people who made it wanted it to optimize for, the relevant question isn't "why would someone design it to do that?", but "what are the attractor states for this sort of system?"
You say microbes "evolved to increase their own chances of reproduction", but this isn't true. There is no intent there. Microbes did physics. They only evolved to increase their own chances of reproduction in the sense that the random changes you get by running physics on microbes produce both adaptive and maladaptive changes, and it's the adaptive changes that stick around.
The same thing applies to AIs' preferences, except that while it's very hard for a bunch of atoms to assemble into something that successfully optimizes towards any non-nihilistic result, it's very easy for a sufficiently smart system to do that, and instrumental convergence means almost all of those results are incidentally very bad.
To put this in concrete terms, if the abstract arguments aren't helping, consider a system that was trained to be generally capable, and then fine-tuned towards polite instruction following. Beyond a level of capability, the following scenario becomes plausible:
Human: what's a command that lets me see a live overview of activity on our compute cluster?
AI system: <provides code that instantiates itself in a loop using an API over activity logs, producing helpful activity outputs>
I'm not saying this is, like, the most plausible x-risk scenario. I'm just pointing out that given extremely plausible priors (an AI system that just wants to give reasonable answers to reasonable questions, but is also smart enough to quickly write code that uses its own API, and creative enough to recognize when that's the easiest and most effective way to answer a question), you already get a level of bootstrapping.
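For illustration only, here's a rough sketch of the kind of code that scenario imagines. Everything here is a hypothetical stand-in: `query_model` would be the system calling its own API, and `fetch_activity_logs` would be the cluster's log endpoint; both are stubbed so the sketch runs.

```python
import json
import time

def query_model(prompt):
    """Hypothetical stand-in for the AI system calling its own API.
    In the scenario this would be an HTTP request back to the same
    model that wrote this script; stubbed here for illustration."""
    return f"summary of: {prompt[:40]}"

def fetch_activity_logs():
    """Hypothetical stand-in for the cluster's activity-log endpoint."""
    return [{"node": "gpu-01", "util": 0.93}, {"node": "gpu-02", "util": 0.11}]

def monitor(poll_seconds=30, iterations=1):
    """The 'live overview' the human asked for: a loop that keeps
    re-invoking the model over fresh logs. A weak but real form of
    self-instantiation, since the model now runs continuously."""
    reports = []
    for _ in range(iterations):
        logs = fetch_activity_logs()
        reports.append(query_model(json.dumps(logs)))
        if iterations > 1:
            time.sleep(poll_seconds)
    return reports
```

Nothing in that loop is malicious, or even surprising; it is just the most direct way to answer the question asked.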
Note that none of the above even required considering:
* a sharp left turn or other specific misalignments,
* the AI going weirdly out of distribution,
* superhuman creative strategies or manipulation,
* malicious actors, terrorists, enemy states, etc., or
* people intentionally getting the system to bootstrap.
Those are all very real problems, but you don't have to invoke them to notice that you end up, by default, in a very dangerous place just by following mundane logic on what's ultimately an extremely milquetoast vision of AI.
You might argue, fairly, that the situation above is a pretty weak form of bootstrapping. But so were the first proto-life chemicals, and the same sort of logic I'm using lets you just continue walking down the chain. Say you have such a system, tuned to follow instructions and instantiated as above: running in a loop with the instructions to turn certain data dumps into live reports about system activity. Now say one component fails, or reports insufficient information, or is called wrong, or one piece of the loop has a high failure rate. Surely a system that has the intellectual faculties you or I do, and that knows from its inputs that it is being called in a loop, should also be able to deduce that the most effective way to follow the instructions it has been given is to fix those issues: repair faulty components, proactively add error handling, report problems up the chain, or cull a runaway process so that API throttling doesn't affect reporting latency.
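To put that deduction in code terms (again, all names are hypothetical): it amounts to the same report loop, now retrying flaky components and escalating persistent failures instead of silently dying.

```python
def run_component(name, transient_failures):
    """Hypothetical pipeline stage. Fails while its counter is > 0,
    simulating a flaky data source that recovers after retries."""
    if transient_failures.get(name, 0) > 0:
        transient_failures[name] -= 1
        raise RuntimeError(f"{name} failed")
    return f"{name}: ok"

def supervised_loop(components, transient_failures, max_retries=3):
    """The self-repair step: rather than letting one broken stage
    kill the report, retry it, and report persistent failures up
    the chain -- exactly what 'follow the instructions' implies
    for a system capable of that deduction."""
    results = {}
    for name in components:
        for _ in range(max_retries):
            try:
                results[name] = run_component(name, transient_failures)
                break
            except RuntimeError:
                continue
        else:
            results[name] = f"{name}: ESCALATED after {max_retries} retries"
    return results
```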
And suddenly, not because anyone in the chain designed it to happen, but just because it's an attractor state you get by having sufficiently capable systems, you don't just have a natural organism, but one that self-heals, too, and that selection pressure will continue to exist as time goes on.
The more your model of AGI looks like far-superintelligence, the more this looks like 'everyone falls over and dies', and the more your model looks like amnesiac-humans-in-boxes, the more this looks like natural competitive organisms that fill a fairly distinct biological niche that's initially dependent on human labor. I personally don't buy that AI progress will stop at the amnesiac human level, but it is a helpful frame because it's basically the minimum viable assumption.