RL-Basics: Value Functions and Bellman Equation

Yunzhe Wang
5 min read · Mar 15, 2024


A Step-by-Step Guide to the Math Derivation


The Bellman Equation is arguably one of the most important equations in RL, yet many students find it hard to understand, especially when textbooks and lecture materials present it directly in its stochastic form. In fact, it can be understood very intuitively by starting from the non-stochastic form.

This post walks through the Value Functions and the Bellman Equation, building them up step by step from the simplest deterministic forms to the more complex stochastic ones.

The Basic Concepts

Before we start, let’s go over the basic concepts and notation. In RL, agents (like robots) navigate and interact with an environment by taking actions according to a learned policy. This policy, which can be thought of as a strategy, guides the agent in choosing actions based on the current state of the environment. As the agent explores, it constructs and updates value functions. These functions represent the agent’s “mental map” of the rewards it expects to receive for taking certain actions in specific states.

Then the learning goal is to identify the optimal policy that maximizes the expected rewards over time. This involves refining the policy based on the agent’s experiences to ensure that the value functions accurately reflect the true potential rewards the environment offers. In essence, the agent seeks to align its expectations (value functions) with reality as closely as possible, thereby maximizing the cumulative rewards it can achieve through its actions.

  • Agent: The learner or decision-maker.
  • Environment: The world with which the agent interacts.
  • State (s): A representation of the situation the agent is in.
  • Action (a): What the agent can do in its current state.
  • Reward Function R(s, a): A feedback signal from the environment, indicating the benefit of an action.
  • Discount Factor (γ): Determines how much future rewards are worth today. For example, a lottery winner values $1M paid today more than $1M paid out over 10 years, because money received later is worth less now.
  • Policy (π): A function telling the agent what action to choose at a particular state.
  • Value Functions V(s) and Q(s, a): The agent’s “mental map” of the rewards it expects to accumulate from specific states or from taking certain actions in them. V is called the state-value function and Q is called the action-value function.
  • Bellman Equation: A recursive relationship that tells you how to compute the value functions.

Now let’s start piecing them together.

Policy Defines Action at a given State

A policy specifies the agent’s action in any given state. It can be either deterministic or stochastic. A deterministic policy always picks the same single action for a given state, while a stochastic policy assigns a probability to each possible action and samples from that distribution.
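In symbols, the two cases can be written as

a = \pi(s) \quad \text{(deterministic)}, \qquad a \sim \pi(\cdot \mid s) \quad \text{(stochastic)},

where \pi(a \mid s) denotes the probability of selecting action a in state s.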

Return = Total Discounted Reward

Return (u) is the sum of all rewards an agent expects to accumulate over the future, adjusted by a discount factor to reflect the present value of those future rewards.

We use t to denote the timestep; note how the return can be rewritten in a recursive form.
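One common way to write this out, using u_t for the return starting at timestep t and R_t as shorthand for the reward collected at step t (i.e., R(s_t, a_t)):

u_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k} = R_t + \gamma\, u_{t+1}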

Relationship Between Value Functions V(s) and Q(s, a)

Simply put, V(s) and Q(s, a) both quantify the expected rewards an agent can accumulate, but they do so from slightly different perspectives. While the state-value function, V(s), estimates the total future rewards from a given state s, the action-value function, Q(s, a), considers both the state and the immediate action. Essentially, Q is a more detailed value function.

Given a policy π, we can write V in terms of Q:
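For a deterministic policy, V is simply the Q value of the action the policy picks; for a stochastic policy, it is the π-weighted average over actions:

V_\pi(s) = Q_\pi(s, \pi(s)) \qquad \text{or} \qquad V_\pi(s) = \sum_{a} \pi(a \mid s)\, Q_\pi(s, a)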

State Transition Probability Function p(s’|s, a)

It is worth noting that when an agent chooses an action, the outcome is not always deterministic; the same action taken in the same state may lead to a variety of next states s′. For example, when driving, choosing a shortcut may not necessarily result in a shorter trip (there could be an unexpected traffic jam). The state transition probability function models these uncertainties.
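Formally, p(s′|s, a) gives the probability of landing in state s′ after taking action a in state s:

p(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s,\, a_t = a), \qquad \sum_{s'} p(s' \mid s, a) = 1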

Bellman Equation with Deterministic Transition and Deterministic Policy

Now that we have all the basic concepts in place, the Bellman Equation provides a way to quantify the value of being in a state (or taking an action from that state), essentially guiding us in computing V(s) and Q(s, a). This foundation informs the creation of all update rules and learning algorithms in RL and is arguably the most important equation to understand early on.

By definition, the value of a state is its expected return. In a deterministic environment with a deterministic policy, nothing is random, so the expectation disappears. Recalling the recursive form of the return we wrote earlier, we have:
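One way to write this, with s′ denoting the single next state that deterministically follows from taking action π(s) in state s:

V_\pi(s) = R(s, \pi(s)) + \gamma\, V_\pi(s')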

Similarly for Q:
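With the same deterministic next state s′, the action value satisfies:

Q_\pi(s, a) = R(s, a) + \gamma\, V_\pi(s') = R(s, a) + \gamma\, Q_\pi(s', \pi(s'))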

Bellman Equation with Stochastic Transition and Deterministic Policy

Now we introduce the transition probability function p into the equation. This makes the next state s′ in the reward R(s, a, s′) a random variable, so we compute the value as an expected value:
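With the action still fixed by the deterministic policy, a = π(s):

V_\pi(s) = \mathbb{E}_{s' \sim p(\cdot \mid s, a)} \big[ R(s, a, s') + \gamma\, V_\pi(s') \big]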

Where we have:
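Spelling the expectation out as a probability-weighted sum over the possible next states:

V_\pi(s) = \sum_{s'} p(s' \mid s, a)\, \big[ R(s, a, s') + \gamma\, V_\pi(s') \big], \qquad a = \pi(s)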

Similarly for Q:
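Here the next action is again given by the deterministic policy, π(s′):

Q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\, \big[ R(s, a, s') + \gamma\, Q_\pi(s', \pi(s')) \big]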

Bellman Equation with Stochastic Transition and Stochastic Policy

Lastly, with a stochastic policy, the action a in the reward function R(s, a, s′) becomes a second random variable, so we need to take the expectation with respect to the joint distribution over actions and next states:
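In expectation form:

V_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)} \big[ R(s, a, s') + \gamma\, V_\pi(s') \big]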

Where we have:
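Expanded as a double sum, weighting each action by the policy and each next state by the transition probabilities:

V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \big[ R(s, a, s') + \gamma\, V_\pi(s') \big]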

Similarly for Q:
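For Q, the outer action a is already given, and the policy only enters when averaging over the next action a′:

Q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\, \Big[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a') \Big]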

One more thing

The significance of the Bellman equation in RL cannot be overstated; it underpins many learning algorithms by offering a recursive way to estimate value functions. For instance, turning the recursive relation into an update toward sampled one-step returns leads to Temporal Difference learning, while adding a maximization over action values brings us to Q-learning. These adaptations exemplify the equation’s versatility in crafting methods that progressively approximate optimal policies.
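As a rough sketch of what those update rules look like (here α is a learning rate, a symbol not introduced above, and s_t, a_t, R_t, s_{t+1} come from a single sampled transition):

\text{TD(0): } \quad V(s_t) \leftarrow V(s_t) + \alpha \big[ R_t + \gamma\, V(s_{t+1}) - V(s_t) \big]

\text{Q-learning: } \quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ R_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]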
