
I'd like to illustrate the problem using a very simple environment

that looks as follows:

Suppose you live in a world like this,

and your agent starts over here,

and there are 2 possible outcomes.

You can exit the maze over here

where you get a plus 100

or you can exit the maze over here,

where you receive a minus 100.

Now, in a fully observable case,

and in a deterministic case,

the optimal plan might look something like this;

and whether or not it goes straight over here depends on the details.

For example, whether the agent has momentum or not.

But you'll find a single sequence of actions and states that might cut the corners,

as close as possible, to reach the plus 100 as fast as possible.

That's conventional planning.
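Conventional planning of this kind is just graph search: find one fixed sequence of actions from the start to the +100 exit. Here is a minimal sketch using breadth-first search; the maze layout, start position, and exit placement are made-up illustrations, not the lecture's exact figure:

```python
from collections import deque

# Conventional planning as plain graph search. Hypothetical maze layout:
# 'S' start, '+' the +100 exit, '-' the -100 exit, '#' walls.
MAZE = [
    "+.#.-",
    "#.#.#",
    "#...#",
    "##S##",
]
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}

def shortest_plan(maze):
    """Breadth-first search: one open-loop sequence of actions reaching '+'."""
    rows, cols = len(maze), len(maze[0])
    start = next((r, c) for r in range(rows) for c in range(cols)
                 if maze[r][c] == "S")
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        (r, c), plan = frontier.popleft()
        if maze[r][c] == "+":
            return plan                     # a single fixed action sequence
        for action, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), plan + [action]))
    return None

print(shortest_plan(MAZE))   # prints: ['N', 'W', 'N', 'N', 'W']
```

Note the output is one fixed sequence of states and actions, with no branching — which is exactly what breaks down once the world is stochastic or partially observable.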

Let's contrast this with the case we just learned about,

which is the fully observable but stochastic case.

We just learned that the best thing to compute is a policy

that assigns to every possible state, an optimal action;

and, simply speaking, this might look as follows:

where each of these arrows corresponds

to an action of a sample control policy.

And those are defined even in parts of the state space that are far away.

So this would be an example of a control policy

where all the arrows gradually point you over here.

We just learned about this, using MDPs and value iteration.
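As a sketch of that computation, here is value iteration on a toy grid producing one best action per state; the layout, the step cost of 1, the deterministic moves, and the sweep count are illustrative assumptions, not the lecture's exact setup:

```python
# Value iteration on a toy grid MDP (hypothetical layout): '+'/'-' are the
# +100/-100 exits, '#' walls, '.' free cells; every move costs 1.
GRID = [
    "+.#.-",
    "#.#.#",
    "#...#",
    "##.##",
]
TERMINAL = {"+": 100.0, "-": -100.0}
STEP = -1.0
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}

def value_iteration(grid, sweeps=100):
    """Returns state values V and a policy: one best action per non-terminal state."""
    cells = {(r, c): ch for r, row in enumerate(grid)
             for c, ch in enumerate(row) if ch != "#"}
    V = {s: TERMINAL.get(ch, 0.0) for s, ch in cells.items()}

    def q(s, a):
        dr, dc = MOVES[a]
        nxt = (s[0] + dr, s[1] + dc)
        return STEP + V[nxt if nxt in cells else s]   # wall bump: stay in place

    for _ in range(sweeps):
        for s, ch in cells.items():
            if ch not in TERMINAL:
                V[s] = max(q(s, a) for a in MOVES)
    policy = {s: max(MOVES, key=lambda a: q(s, a))
              for s, ch in cells.items() if ch not in TERMINAL}
    return V, policy

V, policy = value_iteration(GRID)
print(V[(3, 2)], policy[(2, 2)])   # prints: 95.0 W
```

Unlike the single action sequence from search, this gives an arrow (an action) for every reachable state, so the agent knows what to do wherever it ends up.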

The case I really want to get at is the case of partial observability

which we will eventually solve, using a technique called a POMDP (partially observable Markov decision process).

And in this case, I'm going to keep the location of the agent in the maze observable.

The part I'm going to make unobservable is where, exactly, I receive plus 100

and where I receive minus 100.

Instead, I'm going to put a sign over here

that tells the agent where to expect plus 100,

and where to expect minus 100.

So the optimal policy would be to first move to the sign,

read the sign;

and then return and go to the corresponding exit,

since the agent now knows where to receive the plus 100.

So, for example, if this exit over here gives us plus 100,

the sign will say Left.

If this exit over here gives us plus 100, the sign will say Right.

What makes this environment interesting is

that if the agent knew which exit would have plus 100,

it would go north from its starting position.

It goes south exclusively to gather information.

So the question becomes: Can we devise a method for planning

that understands that, even though we wish to reach the plus 100 exit,

there's a detour necessary to gather information.

So here's a solution that doesn't work:

Obviously, the agent might be in 2 different worlds, and it doesn't know which.

It might be in the world where there's plus 100 on the Left side

or it might be in the world with plus 100 on the Right side,

with minus 100 in the corresponding other exit.

What doesn't work is to solve the problem for both of these cases

and then put these solutions together

for example, by averaging.

The reason why this doesn't work is that

the agent, after averaging, would go north.

It would never have the idea that it is worthwhile to go south,

read the sign, and then return to the optimal exit.

When it arrives, finally, at the intersection over here,

it doesn't really know what to do.

So here is an approach that does work,

and it's related to information space or belief space.

In the information space or belief space representation you do planning,

not in the set of physical world states,

but in what you might know about those states.

And if you're really honest, you find out that there's a multitude of belief states.

Here's the initial one, where you just don't know where to receive 100.

Now, if you move around and either reach one of these exits or the sign,

you will suddenly know where to receive 100.

and that makes your belief state change.

So, for example, if you find out that 100 is Left,

then your belief state will look like this

where the ambiguity is now resolved.

Now, how would you jump from this state space to that state space?

The answer is: when you read the sign,

there's a 50 percent chance that the location over here

will result in a transition to the location over here

50 percent because there's a 50 percent chance that the plus 100 is on the Left.

There's also a 50 percent chance that the plus 100 is on the Right,

so the transition over here is stochastic;

and with 50 percent chance, it will result in a transition over here.
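This belief-space transition can be sketched directly: reading the sign collapses the uniform belief to one of two certain beliefs, each with probability 0.5. The world labels "L" and "R" are hypothetical names for the two possible worlds:

```python
import random

# The belief-space transition for "read the sign", sketched as simulation.
# Two hypothetical worlds: "L" (+100 at the left exit) and "R" (+100 at the
# right exit); before reading the sign, the belief is uniform over them.

def read_sign(belief, rng=random):
    """Nature samples the true world from the belief; the sign reveals it,
    collapsing the belief to certainty."""
    world = "L" if rng.random() < belief["L"] else "R"
    return {"L": 1.0, "R": 0.0} if world == "L" else {"L": 0.0, "R": 1.0}

# From the planner's point of view this is a stochastic 50/50 transition
# in belief space: half the time you land in the "surely L" belief,
# half the time in the "surely R" belief.
rng = random.Random(0)
uniform = {"L": 0.5, "R": 0.5}
outcomes = [read_sign(uniform, rng) for _ in range(10_000)]
frac_left = sum(b["L"] for b in outcomes) / len(outcomes)
print(frac_left)   # roughly 0.5
```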

If we now do the MDP trick in this new belief space

and pour water in here, it kind of flows through here

and creates all these gradients, as we had before.

We do the same over here, and all these gradients that are being created

point to this exit on the Left side.

Then, eventually, this water will flow through here and create gradients like this;

and then flow back through here, where it creates gradients like this.

So the value function is plus 100 over here, plus 100 over here

that gradually decrease down here, down here;

and then gradually further decrease over here

and even further decrease over there, so we've got arrows like these.

And that shows you that in this new belief space, you can find a solution.

In fact, you can use value iteration (MDP's value iteration)

in this new space to find a solution to this really complicated

partially observable planning process.
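As a sketch of that idea, here is value iteration over a tiny belief-space MDP: each state is a (position, belief) pair, where the belief is "U" (unknown), "L" (+100 at the left exit), or "R" (+100 at the right exit), and stepping onto the sign under belief "U" splits the belief 50/50. The grid layout, step cost, and sweep count are illustrative assumptions:

```python
# Belief-space MDP for the sign maze. Hypothetical layout: 'L'/'R' are the
# two exits, '?' the sign, '#' walls. The agent starts below the junction.
GRID = [
    "L...R",
    "##.##",
    "##.##",
    "##?##",
]
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
STEP = -1.0
CELLS = {(r, c): ch for r, row in enumerate(GRID)
         for c, ch in enumerate(row) if ch != "#"}

def exit_reward(ch, belief):
    """Terminal payoff at an exit; under the uniform belief 'U' the expected
    payoff is 0.5 * 100 + 0.5 * (-100) = 0."""
    if ch not in "LR":
        return None
    return 0.0 if belief == "U" else (100.0 if ch == belief else -100.0)

def belief_value_iteration(sweeps=100):
    states = [(pos, b) for pos in CELLS for b in ("U", "L", "R")]
    V = {(pos, b): (exit_reward(CELLS[pos], b) or 0.0) for pos, b in states}

    def q(pos, belief, action):
        dr, dc = MOVES[action]
        nxt = (pos[0] + dr, pos[1] + dc)
        if nxt not in CELLS:
            nxt = pos                       # bumping a wall leaves you in place
        if CELLS[nxt] == "?" and belief == "U":
            # Reading the sign: stochastic 50/50 belief transition.
            return STEP + 0.5 * V[(nxt, "L")] + 0.5 * V[(nxt, "R")]
        return STEP + V[(nxt, belief)]

    for _ in range(sweeps):
        for pos, b in states:
            if exit_reward(CELLS[pos], b) is None:
                V[(pos, b)] = max(q(pos, b, a) for a in MOVES)
    policy = {s: max(MOVES, key=lambda a: q(*s, a))
              for s in states if exit_reward(CELLS[s[0]], s[1]) is None}
    return V, policy

V, policy = belief_value_iteration()
start = ((2, 2), "U")
print(policy[start], V[start])   # prints: S 94.0
```

Plain value iteration in this augmented space recovers the information-gathering detour on its own: from the start under the uncertain belief, going south to the sign is worth 94, while heading north blindly is worth far less, because the exits pay 0 in expectation when you don't know which is which.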

And the solution, just to reiterate,

will suggest: Go south first,

read the sign,

expose yourself to the random transition to the Left or the Right world,

in which you are now able to reach the plus 100 with absolute confidence.