# Deserving trust / grokking Newcomb’s problem

Summary: This is a tutorial on how to properly acknowledge that your decision heuristics are not local to your own brain, and that as a result, it is sometimes normatively rational to act in ways that are deserving of trust, for no reason other than to have deserved that trust in the past.

Related posts: I wrote about this 6 years ago on LessWrong (“Newcomb’s problem happened to me”), and last year Paul Christiano also gave numerous consequentialist considerations in favor of integrity (“Integrity for consequentialists”) that included this one. But since I think now is an especially important time for members of society to continue honoring agreements and mutual trust, I’m giving this another go. I was somewhat obsessed with Newcomb’s problem in high school, and have been milking insights from it ever since. I really think folks would do well to actually grok it fully.

You know that icky feeling you get when you realize you almost just fell prey to the sunk cost fallacy, and are now embarrassed at yourself for trying to fix the past by sabotaging the present? Let’s call this instinct “don’t sabotage the present for the past”. It’s generally very useful.

However, sometimes the usually-helpful “don’t sabotage the present for the past” instinct can also lead people to betray one another when there will be no reputational costs for doing so. I claim that not only is this immoral, but even more fundamentally, it is sometimes a logical fallacy. Specifically, whenever someone reasons about you and decides to trust you, you wind up in a fuzzy version of Newcomb’s problem where it may be rational for you to behave somewhat as though your present actions are feeding into their past reasoning process. This seems like a weird claim to make, but that’s exactly why I’m writing this post.

Overview:

1. Introducing Newcomb’s problem
2. Debugging two-boxing behavior
3. Transparent Newcomb
4. Probabilistic Transparent Newcomb
5. What exactly is wrong with the two-boxing argument?
6. People who trust you are like Probabilistic Transparent Newcomb problems
7. Other important arguments for integrity

### Introducing Newcomb’s problem

Let’s start by analyzing Newcomb’s original problem, because it’s an extreme case of “influencing the past”. Being an extreme case makes the original Newcomb easier to understand in technical terms than its fuzzier, real-life variants, which we’ll analyze later.

In Newcomb’s problem, you have a choice between taking either

1. box A (“one-boxing”)
2. boxes A and B together (“two-boxing”)

Box A contains either \$0 or \$1,000,000, and box B definitely contains \$1,000. So far, it seems you should clearly take both boxes, because A + B > A, no matter what A is. But there’s a catch: yesterday, Newcomb scanned your brain and predicted what you’d do in this scenario. If he (yesterday) predicted you’d take only box A (today), he (yesterday) placed \$1,000,000 in box A; otherwise he placed \$0 in box A. In this scenario, the people who one-box get \$1,000,000 (rather than \$0), and the people who two-box get \$1,000 (rather than \$1,001,000). So the one-boxing strategy makes more money than the two-boxing strategy, and is therefore better. But there is a tempting argument in favor of two-boxing, namely: no matter what A is, A + B > A. This makes it “obvious” that you should take both boxes, which we know is “wrong” because that strategy earns less money.
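To make the payoff comparison concrete, here is a minimal sketch in Python of the perfect-predictor version, where Newcomb’s (correct) prediction of your strategy determines the contents of box A (the function name is my own, not part of the problem statement):

```python
# Newcomb's problem with a perfect predictor: Newcomb's correct
# prediction of your strategy determines what he puts in box A.

def payoff(strategy):
    """Payoff for 'one-box' or 'two-box', assuming a perfect prediction."""
    box_a = 1_000_000 if strategy == "one-box" else 0  # filled yesterday
    box_b = 1_000                                      # B always holds $1,000
    return box_a if strategy == "one-box" else box_a + box_b

print(payoff("one-box"))   # 1000000
print(payoff("two-box"))   # 1000
```

The two-boxing argument “A + B > A” is true pointwise, but the strategy you pick changes which value of A you face; the sketch makes that dependence explicit.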

So what’s wrong with the two-boxing argument? It seems like you should treat A as a variable under your control, but now that you’re standing in front of the boxes and Newcomb has left the room, A is a fixed constant. Does it make sense to try to “control” it, to make it \$1,000,000 instead of \$0? What’s going on?

### Debugging two-boxing behavior

I claim that most people who think they know what’s wrong with the two-boxing argument are really just ignoring the two-boxing argument in favor of the one-boxing argument. They (and perhaps you) just redirect mental attention to the one-boxing argument and feel like “That one. That’s the right way.”, instead of nerding out about what exactly is wrong with the two-boxing argument. Figuring out what exactly is wrong with the argument will help you generalize to more scenarios and is much more useful than merely choosing “one-box” and your favorite argument for it.

To say this another way: yes, it’s clear that one-boxing is a better strategy, but knowing that two-boxing is wrong is not the same as knowing what’s wrong with the argument for two-boxing. Knowing that the argument leads to a wrong conclusion is not the same as knowing where the fallacy is, just like knowing your program doesn’t run is not the same as knowing where the bug is. And finding the bug is key to getting better performance in the future!

Moreover, I claim that understanding what’s wrong with the two-boxing argument, at a deep, intuitive/emotional level, is key to understanding how not to be tempted by anti-sunk-cost-fallacy heuristics to violate your integrity. You can progress from resisting the temptation to be untrustworthy to not being tempted at all, by realizing that sometimes, being untrustworthy is a logically incoherent strategy.

### Transparent Newcomb

To make the two-boxing argument more potent, let’s now imagine the boxes are transparent. Say they’re made of glass. A strategy for this game consists of two components:

1. What you’ll do if you see \$1,000,000 in box A
2. What you’ll do if you see \$0 in box A
(Technicality: to ensure his predictions remain accurate, if Newcomb predicts you’ll attempt to defy his predictions (say, by two-boxing in case 1, or one-boxing in case 2), then he makes sure not to give you that opportunity, perhaps by not offering you the game at all.)

Now, if you walk up to the boxes and see \$0 in Box A, it feels really weird to treat that number as a variable under your control. Zero is a constant! If this happens, you should just take both boxes and collect your \$1,000, right? And if you walk up and see A=\$1,000,000, then you might as well take A+B and collect your \$1,001,000, right?

Well, if you’re the kind of person who two-boxes when you see A=\$1,000,000, that scenario will never happen to you, so your payoff is bounded at \$1,000. You have to be the kind of person who can turn down the extra \$1,000 in order to get offered the \$1,000,000 in the first place. This is kind of like being “trustworthy”, insofar as you model Newcomb’s hopes that you won’t defy his predictions as “trust”.

Moreover, to ensure Newcomb definitely sets you up with \$1,000,000 and not \$0 in box A, you have to be the kind of person who would one-box anyway, even if you see A=\$0. That way, when Newcomb imagines you in that scenario, he learns that if he places \$0 in box A (an implicit prediction that you will two-box), then you will defy his prediction and one-box instead. This ensures that the only consistent thing for Newcomb to imagine (and remain an accurate predictor) is you one-boxing. This is kind of like being “trustworthy” even in scenarios where someone didn’t trust you; it means that you would defy Newcomb’s “mistrust” by being trustworthy anyway. Since Newcomb is aiming to be a good predictor, this ensures that he will “trust” you.
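The reasoning in the last two paragraphs can be sketched by enumerating all four transparent-Newcomb strategies and asking which box-A contents Newcomb can fill consistently, i.e. without his implicit prediction being defied (the encoding below is my own, not part of the original problem):

```python
# A transparent-Newcomb strategy says what you do upon SEEING each
# possible content of box A. A fill of $1,000,000 encodes a one-box
# prediction; a fill of $0 encodes a two-box prediction. A fill is
# consistent only if your reaction to seeing it matches that prediction.

PREDICTION = {1_000_000: "one-box", 0: "two-box"}

STRATEGIES = {
    "always one-box":       {1_000_000: "one-box", 0: "one-box"},
    "always two-box":       {1_000_000: "two-box", 0: "two-box"},
    "one-box only if full": {1_000_000: "one-box", 0: "two-box"},
    "defy the prediction":  {1_000_000: "two-box", 0: "one-box"},
}

def consistent_fills(strategy):
    """Box-A contents Newcomb can choose without being proven wrong."""
    return [a for a in (1_000_000, 0) if strategy[a] == PREDICTION[a]]

for name, strategy in STRATEGIES.items():
    print(f"{name}: {consistent_fills(strategy)}")
```

Only “always one-box” forces Newcomb’s sole consistent option to be the full box. “One-box only if full” leaves him free to consistently put in \$0, and “defy the prediction” leaves no consistent fill at all, which is the “no game offered” technicality above.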

### Probabilistic Transparent Newcomb

Some people feel that Newcomb being a 100% accurate predictor of them robs them of free will, making the problem unfair. They feel that as long as Newcomb has only a 99%-or-less chance of predicting them correctly, they should assume they’re in that 1% scenario when they happen upon the \$1,000,000 box, and just go ahead and grab the extra \$1,000 in Box B.

This is a mistake. Consider a further variant of the Transparent Newcomb problem where you’re uncertain about how good a predictor Newcomb really is. Say that Newcomb makes perfect predictions only 10% of the time (independently of what you do), and the other 90% of the time his predictions are random and uncorrelated with you. Well, 10% of \$1,000,000 is still much more than \$1,000, so one-boxing is still the right strategy. That is, if you have a 10% chance of ending up in a perfect Transparent Newcomb scenario, you still want to be a one-boxer. In particular, uncertainty about Newcomb’s predictive power does not eliminate the need to sabotage the present (leave \$1,000 on the table) in favor of the past (to have been offered the \$1,000,000 option). What’s going on?
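To check the arithmetic, here is a sketch of the expected payoffs under the mixture just described: a 10% chance of perfect prediction, and otherwise a random fill of box A, uncorrelated with you (the fair 50/50 fill is my modeling assumption for “random”):

```python
# Expected payoff when Newcomb predicts perfectly with probability q,
# and otherwise fills box A by a fair coin flip, uncorrelated with you.

def expected_payoff(strategy, q=0.10):
    if strategy == "one-box":
        perfect = 1_000_000                               # predicted: A is full
        random = 0.5 * 1_000_000 + 0.5 * 0                # coin-flip A, take A
    else:
        perfect = 0 + 1_000                               # predicted: A is empty
        random = 0.5 * (1_000_000 + 1_000) + 0.5 * 1_000  # coin-flip A, take both
    return q * perfect + (1 - q) * random

print(round(expected_payoff("one-box")))   # 550000
print(round(expected_payoff("two-box")))   # 451000
```

Even with only 10% predictive power, one-boxing comes out ahead by \$99,000 in expectation; under this model the extra \$1,000 in box B only wins out once q drops below about 0.1%.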

### What exactly is wrong with the two-boxing argument?

I’m just gonna spoil it for you now, so stop here if you want to think more. The logical fallacy in the argument for two-boxing is the part where you assume you’re walking up to the boxes in real life. If Newcomb is so good at predicting you, then when you see the \$1,000,000 in box A, you might be a simulation in Newcomb’s imagination, being run yesterday to decide how much money goes in the box! In that case, you should clearly one-box (if you care about how much money real-you gets, which you should, because people whose copies care about each other perform better than people whose copies don’t). More interestingly, if you see A=\$0, then simulation-you should still one-box, to ensure that scenario doesn’t play out, and that Newcomb will end up offering you the \$1,000,000.

Note that I’m not just saying “it’s better to one-box on Newcomb”; I’m saying it is a logical fallacy to be certain you’re not in a simulation when you make the decision. Let’s examine this claim more critically.

First, perhaps Newcomb doesn’t simulate a fully conscious version of you when he prepares for the scenario. And, since you can tell that you’re conscious, it’s a bit weird to say “I can’t tell if I’m in Newcomb’s imagination/simulation”.

However, if you know already that Newcomb is somehow able to predict you, and you know (or believe) that he does this using non-conscious imagined versions of you, it is a fallacy for you to think that you can base your decision on whether you’re conscious. Whatever decision procedure you use, if it starts with “If I’m conscious, do X, else do Y”, then Newcomb’s non-conscious simulation must always be running the “If I’m conscious” branch of your strategy; otherwise his predictions of the real, conscious you would be incorrect.

In other words, even if some part of your mind “really knows” that it’s conscious, the part of your mind that decides what to do on Transparent Newcomb apparently doesn’t “really know” that you’re conscious, in the sense that Newcomb is able to trick that decision algorithm by running it in his head with the “If I’m conscious” boolean set to “true”.

In this case, even if some part of your mind can in some sense legitimately detect your consciousness (I won’t get into arguments about whether that’s possible here), it is a fallacy for your decision procedure to act like *it* can tell that you’re conscious, once you know (or suspect) that Newcomb is going around making perfect predictions of you with non-conscious imagined versions of your mind.

In order to more fully acknowledge this realization, back in high school I spent a bunch of time vividly imagining myself in Newcomb-like hypotheticals, and vividly imagining the uncertainty of not knowing whether I’m the real me or Newcomb’s imagined version of me. After a while, it just felt automatic to be uncertain in that way, and the temptation to two-box went away.

That is, not only does the argument for two-boxing no longer make emotional sense to me, and not only am I able to pinpoint the exact line of the argument where it fails, but my pinpointing of the error itself happens automatically and intuitively, as a result of a clearer visualization of how the scenario works. When the two-boxing argument gets to saying “Since you’re staring right through the glass at the contents of Box A and already know its fixed value, you should optimize your payoff by adding \$1,000 to it.”, my gut goes … Wrong! Because Newcomb can predict my decisions, I (or my decision-algorithm) do not know whether I’m staring at the real Box A, or the one in Newcomb’s imagination yesterday! (If you want to nerd out about exactly how sure I should be that I’m real, the answer is 50%, because of a super cool result in this blog post by Jessica Taylor and Ryan Carey, echoing an equivalent result of Piccione and Rubinstein from 1997.)

This is what it’s like to let decision theory all the way into your soul 😉

### People who trust you are like Probabilistic Transparent Newcomb problems

Now, here’s the fun part, where you realize you’re secretly in Newcomb-like problems all the time. When someone looks you in the eyes, talks to you, and gives you a great opportunity (like the \$1,000,000) in the hope that you won’t exploit them (take the extra \$1,000)… well, before they offered you that opportunity, they had some chance of understanding and predicting whether you would be trustworthy to them. That means you’re in a probabilistic transparent Newcomb problem! On top of caring about that person intrinsically and valuing their happiness and your relationship with them, you should be extra loyal to them because of the value of the great opportunity they gave you based on trust. The strength of this consideration depends both on the value of the opportunity the person gave you, and on how well you think they understood your trustworthiness when they gave it.

When I say you “should” be extra loyal based on this consideration, this isn’t just a moral “should”. It’s a logical, you’re-missing-something-if-you-don’t-realize-this kind of “should”. The kind that maybe ought to make you feel icky and embarrassed with yourself for missing it, if you’re the kind of person who feels icky about that kind of thing.

So, if you ever find yourself feeling “Whoa, why am I being loyal to this person if I’ve already reaped the benefits of their trust? I know I’m supposed to be moral, but isn’t this the sunk cost fallacy?”, then, even if for some reason you don’t like that person anymore (hopefully a rare scenario), and even if for some reason you think there won’t be any reputational cost of letting them down (hopefully also rare), if you think they gave you a great opportunity based on some minimal degree of understanding of your personality, you should still think, “Wait a minute, I’m about to commit the my-decision-algorithm-thinks-it-knows-where-it-is fallacy. The part of me that’s making this decision right now might actually be operating in the imagination of the person who I think already trusted me, but who is currently using this imagined scenario to determine whether I get to reap the benefits of their trust for real. Maybe this isn’t a sunk cost fallacy after all!”

### Other important arguments for integrity

Deserving trust is just one aspect of what you might call “moral” integrity that I think follows from avoiding certain logical fallacies; check out Paul’s post on “Integrity for consequentialists” for a shallower-but-broader overview of some more such considerations that all add up to what I consider a pretty strong force in favor of basic moral behavior, even in the absence of friendship and reputation.

In other words: geez, be a decent person already :p