# Abstract open problems in AI alignment, v.0.1 — for mathematicians, logicians, and computer scientists with a taste for theory-building

This page is a draft and will be updated in response to feedback and requests to include specific additional problems.

Through my work on logical inductors and robust cooperation of bounded agents, I’m meeting lots of folks in math, logic, and theoretical CS who are curious to know what contributions they can make, in the form of theoretical work, toward control theory for highly advanced AI systems. If you’re one of those folks, this post is for you!

Contents:
(11/28/16) Problem 1: Defining logical control
(11/28/16) Problem 2: Theoretically amenable bounded rationality

Rather than justify the relevance of these problems to AI control, let’s just dive right into them and see if you find any of them intrinsically interesting, which I think is the largest determinant of whether you’ll want to work on them anyway 🙂

### Problem 1: Defining logical control

How should an algorithm determine which quantities it “controls” and which quantities it “does not control”?

To formalize this question, imagine that

• an “agent” is a program which receives as input a string to be interpreted as a source code for a game, and returns a value called an “action”, and
• a “game” is a program which receives an input string to be interpreted as a source code for an agent, and returns value called a “score”.

Now, consider the following simple game, called HowMuch, which gives an agent either 1 or 2 points depending on how much the agent asks for:

def HowMuch(Agent):
x = Agent(HowMuch)
if x == 1:
score = 1
elseif x == 2:
score = 2
else
score = -1
return score


Any score-maximizing agent which views itself as “in control” of the value of $x$ will choose to output 2 and earn a score of 2. So far so good.

But now consider the following code-based version of Newcomb’s problem, which takes some large number $k$ as a parameter:

def Newcombk(Agent):
B = 1,000
search all strings of length ≤k
...for a proof that:
"Agent(Newcombk) == 1"
if found:
A = 1,000,000
else:
A = 0
if Agent(Newcombk) == 1:
U = A
elseif Agent(Newcombk) == 2:
U = A + B
else:
U = -1
return U


(For a more rigorous discussion of programs that reason about each other using proof searches, see arXiv:1602.04184.)

When $Agent$ receives $Newcomb_k$ as an input (indicating that it is “in” Newcomb’s problem), should $Agent$ view itself as “controlling” the first IF statement of $Newcomb_k(Agent)$ (which only writes proofs about the agent and does not run it), or only the second IF statement?

Utility-maximizing agents who view themselves as “controlling” both IF statements will output 1 in order to “make” $A = 1,000,000$, and will earn a payoff of $1,000,000$ (at least when $k$ is large enough for the proof search to determine what they do). Utility-maximizing agents who view themselves as “controlling” the second IF statement only will view A as a fixed value, and choose A+B over A, i.e. will return 2 and earn a payoff of $1,000$.

Thus, it seems better for the agent to view itself as “controlling” both IF statements (when $k$ is sufficiently large that a proof of its behavior can be expected to be found). But what procedure should an algorithm use to ascertain which variables it can “control”?

If a variable $x$ is defined by the command $x=3$, then it should probably be treated as a constant and one should not attempt to “optimize” it. But if $x$ is defined by a string of the form $x=evaluate(your\_source\_code)$ then perhaps you should treat it as a variable that you control, even though in both cases $x$ is a constant (at least, if you’re a halting Turing machine). What about if someone makes a trivial modification to your source code, and evaluates that?

y = your_source_code
z = y.append("#this comment won't affect your output")
x = evaluate(z)

What’s the general answer? It probably involves some kind of non-boolean quantity that indicates a “degree” or “quality” of control, but what should it be, and how should it work?

An answer to this question would help us reason about how AI systems ought to behave in light of the fact that their source codes can in fact be copied, modified, simulated, and otherwise explicitly reasoned about by humans and other AI systems. It turns out that this relates to ensuring that AI systems are not exploitable in certain ways, and perhaps to ensuring cooperative behavior between humans and AIs more generally; see my paper on robust cooperation of bounded agents for the latter.

### Problem 2: Theoretically amenable bounded rationality

How should you manage your beliefs and decision-making if you have exactly 1GB of RAM, a 1GHz processor, and 1TB of hard disk space?

Here’s a story progressing through increasingly realistic theoretical models of “ideal” reasoning:

Bayesian inference describes how to manage uncertain beliefs when you have no computational limitations whatsoever. But doing Bayesian inference exactly is in general not even a computable operation, and in practice usually runs into the curse of dimensionality: just being able to write down an integral for what your beliefs should now be does not mean you, with your limited hardware, can answer the difficult mathematical question of approximating that integral with a computer.

Logical induction is a natural next step in the direction of realism: what does it mean for a Turing machine to manage uncertainty about math problems? Special cases of “math problems” include “What does this Bayesian update integral evaluate to?” or “What would this really hard computation ‘$X()$’ return if I actually ran it?” Logical induction provides an algorithm for answering such questions, with limited computation, that asymptotically satisfies a large number of “reasonableness” desiderata.

The natural next step, I believe, is to specify a procedure for “managing uncertainty well” given some fixed, constant bounds on hardware. The procedure should be theoretically amenable in the same way that Bayesian inference and logical induction are amenable to having theorems proven about them that offer certain performance guarantees (both statistical and deterministic). The procedure also should not be “use massively more hardware to search for the optimal algorithm to run on your limited hardware, and then run that”; the entire procedure, including your procedure for specifying it to me, should run on the hardware limitations.

If your procedure appears to work well but is not amenable to theoretical analysis, it doesn’t help that much with safety. But a theoretically amenable procedure that’s easy to prove theorems about would bring us closer to reasoning about the behavior of highly intelligent systems before they are built, and lower our dependence on empirical testing to determine their safety. Less need for testing means fewer occasions where something can go wrong with a superintelligent AI before we know it’s safe.