*Summary: Agents that can reason about their own source code are weirder than you think.*

**Previous in series:** The Problem of IndignationBot, Part 1.

It’s very intuitive to think that IB(CB) = C (where IB stands for IndignationBot and CB for CooperateBot, defined previously), since if IndignationBot cooperates, it “would have no effect” on CooperateBot, who always returns C no matter what. So it would be very strange to find a proof that \[(1)\quad IB(CB) = C \; \rightarrow \; CB(IB) = D.\] Right? The whole point of IndignationBot is that it cooperates with its opponent unless it proves that if it cooperates then its opponent will defect… and surely the simple CooperateBot wouldn’t do that, and surely IndignationBot can see this, right? So let’s try proving by contradiction that IB(CB) = C.
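For concreteness, here is a minimal Python sketch of the two bots. This is my own illustration, not the exact code from Part 1: the proof-search procedure `provable` and the helper `make_IB` are hypothetical names, and a real IndignationBot would enumerate proofs in a formal system like Peano Arithmetic rather than take the prover as a parameter.

```python
def CB(opponent):
    """CooperateBot: cooperates no matter what."""
    return "C"

def make_IB(provable):
    """Build an IndignationBot around an injected proof-search procedure.

    `provable(stmt)` should return True iff `stmt` is provable (in, say,
    Peano Arithmetic up to some proof-length bound).  It is injected here
    only so the *structure* of the agent is visible.
    """
    def IB(opponent):
        # Defect iff we can prove: "if I cooperate, my opponent defects."
        if provable(f"IB({opponent.__name__}) == 'C' -> "
                    f"{opponent.__name__}(IB) == 'D'"):
            return "D"
        return "C"
    return IB

# With a prover that never finds a proof, IB naively cooperates:
naive_IB = make_IB(lambda stmt: False)
assert CB(naive_IB) == "C"
assert naive_IB(CB) == "C"
```

The naive version above matches the intuitive prediction. The surprise of this post is that any *sound and sufficiently strong* prover plugged into this structure behaves quite differently.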

Suppose instead that IB(CB) = D. Then the above material conditional (1) is true, because its left side is false. Now, consider the sub-case where IndignationBot is also able to *prove* that IB(CB) = D. It would quickly thereafter prove the material conditional (1), which, from the source code of IndignationBot in the previous post, was its criterion for defection. So then IB(CB) = D, which is… *not a contradiction*.

Huh. So it looks like a logically consistent state of affairs for all of the following statements to hold simultaneously:

- $IB(CB) = D$,
- $CB(IB) = C$,
- $\blacksquare(IB(CB) = D)$, and therefore also
- $\blacksquare(IB(CB) = C \; \rightarrow \; CB(IB) = D)$

Does this mean it’s somehow undecidable what IndignationBot does? Isn’t the following list of propositions *also* logically consistent?

- $IB(CB) = C$,
- $CB(IB) = C$,
- $\neg\blacksquare(IB(CB) = C \; \rightarrow \; CB(IB) = D)$

I claim no. The way IndignationBot is defined, it turns out that $IB(CB) = D$ is the *only* logically consistent outcome. You are not misreading; this is very strange. You might have guessed this from the fact that I said the answer was surprising, which I admit was a spoiler… but I had to say that, otherwise you’d think, *“The answer is obviously C. I get this, no need to keep reading.”*

The strangeness of the answer derives from a result in logic called Löb’s Theorem, which says that if $\blacksquare p$ denotes “p is provable” in a logical system, such as Peano Arithmetic, that is expressive enough to prove when a computer program halts, then for any logical statement p:

(Löb’s Theorem) $\blacksquare\left(\blacksquare p \rightarrow p\right) \rightarrow \blacksquare p$

## Digression on Löb’s theorem (skip if familiar)

If you’ve never seen it before, maybe take a moment to parse what that string says, intuitively. For example, suppose p is the Riemann Hypothesis, a mathematical statement whose truth value is not currently known to anyone. Inside the lefthand parentheses of Löb’s theorem is the statement “If I can prove the Riemann Hypothesis, then it’s true”, which seems like a fine and reasonable thing to say. However, Löb’s theorem says that if we could prove that seemingly innocuous statement, we could thereby prove the Riemann Hypothesis itself! I won’t digress further about Löb’s theorem here, despite how interesting it is. My favorite proof of it is available on Wikipedia. My colleague Eliezer Yudkowsky has also written a Cartoon Guide to Löb’s theorem which many people have found insightful.

Anyway, let’s apply Löb’s Theorem to the case where p = “IB(CB) = D”. As we saw above, if IndignationBot proves p, then it also proves the material conditional for its own defection, so it defects, i.e. p is true. That is, \[(2) \quad \blacksquare p \rightarrow p\]

Moreover, this argument is itself a proof (under an appropriate encoding), so \[(3) \quad \blacksquare\left(\blacksquare p \rightarrow p\right)\] But then by Löb’s Theorem, we can conclude $\blacksquare p$, which by (2), implies p… in other words, IndignationBot defects!
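The chain of reasoning above is short enough to replay mechanically. Here is a toy sketch (my own illustration, not a real theorem prover: it hard-codes steps (2) and (3) as given facts, and treats Löb’s Theorem as a primitive inference rule of the provability logic GL rather than proving it from scratch):

```python
def lobian_derivation():
    """Replay the Löbian argument for p = 'IB(CB) = D' as a tiny derivation.

    Statements are plain strings; 'box(q)' stands for the provability
    claim that q is provable.
    """
    derived = set()
    # (2): if IB can prove p, it defects, so p holds:  box(p) -> p
    derived.add("box(p) -> p")
    # (3): that reasoning is itself formalizable:  box(box(p) -> p)
    derived.add("box(box(p) -> p)")
    # Löb's Theorem as an inference rule:
    # from box(box(p) -> p), conclude box(p)
    if "box(box(p) -> p)" in derived:
        derived.add("box(p)")
    # Modus ponens with (2): from box(p) and box(p) -> p, conclude p
    if "box(p)" in derived and "box(p) -> p" in derived:
        derived.add("p")
    return derived

assert "p" in lobian_derivation()  # i.e., IB(CB) = D
```

Nothing about this replay is deep on its own; the point is that every step is purely mechanical, so a proof searcher inside IndignationBot really will stumble onto it.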

Moreover, this proof did not use any facts about CooperateBot… so it turns out that IndignationBot *always* defects!

You might say that we therefore have failed to properly codify “indignation”. I’d agree, but the more interesting take-away here is that a *very simple* piece of code that looks like it *super-obviously outputs “cooperate”* turns out to in fact do the exact opposite.

The weirdness here all seems to stem from these programs’ ability to reason about themselves. That self-reference is how Löb’s Theorem comes up, and it is also how other surprising results, such as Gödel’s incompleteness theorems, come about; none of them arise in systems that are not self-reflective in this way.

Is this just a problem with formal logic? Are other self-reflective systems we build unlikely to encounter such weirdness?

I currently think not. Rather, formal logical systems are merely the first self-reflective systems that are simple and well-understood enough for us to prove theorems about them. If, say, we build an artificial general intelligence capable of reasoning about its own source code at or beyond the level of understanding of the engineers who built it, I think we are likely to encounter similarly strange behavior unless we develop a much better understanding of self-reflective agents before we build it.

For exactly this reason, my colleagues at the Machine Intelligence Research Institute in Berkeley have been studying self-reflective agents for a while now, and have found many more such surprising results, including the aforementioned Robust Cooperation in the Prisoner’s Dilemma, numerous results on a broad class of agents called “Modal Agents”, and another paper called Reflective Oracles: A Foundation for Classical Game Theory. I think these papers are circling around some very important and hard problems about self-reflective agents, and are worth checking out if you’re interested in math/CS research.

**Next in series:** The Problem of IndignationBot, Part 3.