
Chapter 18: Reinforcement Learning

(Page 609: Introduction - The Excitement of RL)

The chapter starts by highlighting that Reinforcement Learning (RL) is one of the most exciting fields in Machine Learning today, yet it’s also one of the oldest, with roots in the 1950s.

  • Historical Applications: It has seen interesting applications over the years, particularly in games (like TD-Gammon, a Backgammon program) and machine control, but it didn’t often make headline news.
  • The DeepMind Revolution (2013 onwards): A major turning point occurred in 2013 when researchers from DeepMind (a British startup, later bought by Google) demonstrated a system that could learn to play almost any Atari game from scratch, using only raw pixel inputs and no prior knowledge of the game rules. It eventually outperformed humans in most of them.
    • This was followed by a series of amazing feats, most famously AlphaGo defeating legendary Go player Lee Sedol in 2016 and world champion Ke Jie in 2017. This was a monumental achievement, as Go is an incredibly complex game.
  • The “Simple” Idea Behind DeepMind’s Success: With hindsight, the core idea was to apply the power of Deep Learning (which we’ve been studying) to the field of Reinforcement Learning. The combination proved to be incredibly effective.
  • Current State: The field of RL is now “boiling with new ideas” and has a wide range of potential applications.

Chapter Goals:

  1. Explain what Reinforcement Learning is and what it’s good at.
  2. Present two of the most important techniques in Deep Reinforcement Learning (DRL):
    • Policy Gradients (PG)
    • Deep Q-Networks (DQNs)
    • This will include a discussion of Markov Decision Processes (MDPs), which provide a formal framework for RL problems.
  3. Use these techniques to train models for classic RL tasks (e.g., balancing a pole on a cart).
  4. Introduce the TF-Agents library (a TensorFlow library for RL) to build powerful RL systems and train an agent to play the Atari game Breakout.
  5. Briefly look at some of the latest advances in the field.

What this chapter is ultimately trying to achieve: To give you a solid understanding of the fundamental principles of Reinforcement Learning, how deep learning is used to enhance it, and how to start building and training RL agents.

(Page 610-611: Learning to Optimize Rewards)

This is the core definition of what RL is all about.

  • The Setup:
    • A software agent (our learning algorithm or program).
    • Makes observations from an environment.
    • Takes actions within that environment.
    • In return, it receives rewards (or penalties).
  • The Objective of the Agent: To learn to act in a way that will maximize its expected cumulative rewards over time.
  • Anthropomorphic View:
    • Positive rewards = “pleasure.”
    • Negative rewards = “pain.”
    • The agent learns by trial and error to maximize its pleasure and minimize its pain.
  • Broad Applicability: This is a very general framework.
    • Examples (Figure 18-1, page 611):
      • a. Robotics: Agent is the robot’s control program. Environment is the real world. Observations from sensors. Actions are motor commands. Rewards for reaching a target, penalties for wasting time/going wrong.
      • b. Ms. Pac-Man: Agent is the game-playing program. Environment is the Atari game simulation. Actions are joystick moves. Observations are game screenshots. Rewards are game points.
      • c. Go Player: Similar to Ms. Pac-Man, but for a board game.
      • d. Smart Thermostat: Doesn’t have to be a “moving” agent. Agent learns to anticipate human needs. Rewards for target temperature and energy saving, penalties if humans have to adjust it.
      • e. Automatic Stock Trader: Observes market prices, decides to buy/sell. Rewards are monetary gains/losses.
  • Nature of Rewards:
    • Rewards don’t have to be positive. An agent in a maze might get a negative reward at every time step, so its goal is to find the exit as quickly as possible (to minimize total negative reward).
  • Other Applications: Self-driving cars, recommender systems, ad placement, controlling where an image classification system should focus its attention.

What “Learning to Optimize Rewards” is ultimately trying to achieve: It’s about training an agent to develop a strategy (a “policy”) that leads to the best possible long-term outcomes, as measured by the sum of rewards it receives from the environment. This is different from supervised learning (where you have explicit correct labels) and unsupervised learning (where you’re finding structure in unlabeled data). In RL, the “supervision” comes from the sparse and often delayed reward signals.

(Page 612-613: Policy Search)

  • What is a Policy?

    • The algorithm or strategy a software agent uses to determine its actions is called its policy.
    • What a policy is ultimately trying to achieve: It’s the agent’s “brain” or decision-making function. Given an observation from the environment, the policy outputs an action to take.
    • The policy could be a simple rule-based system, a lookup table, or, very commonly in Deep RL, a neural network (Figure 18-2). The neural network would take observations as input and output the action (or probabilities for actions).
  • Types of Policies:

    • Deterministic: For a given observation, always outputs the same action.
    • Stochastic: For a given observation, outputs a probability distribution over possible actions. The action is then sampled from this distribution.
      • Example: A robotic vacuum cleaner’s policy might be:
        • Move forward with probability p each second.
        • Randomly rotate left or right (with probability 1-p) by a random angle between -r and +r.
      • This randomness (stochasticity) ensures the robot explores its environment. The question is, what values of p and r will maximize the dust collected in 30 minutes?
  • Policy Search:

    • The process of finding the best policy parameters (like p and r in the vacuum example, or the weights of a neural network policy).
    • What policy search is ultimately trying to achieve: To find the settings for the agent’s decision-making process that lead to the maximum cumulative reward.
  • Methods for Policy Search:

    1. Brute Force (Figure 18-3, page 613):

      • Try out many different values for the policy parameters.
      • Pick the combination that performs best.
      • This is like searching for a needle in a haystack if the policy space (the set of all possible parameter combinations) is large, which it usually is.
    2. Genetic Algorithms (Page 612):

      • An evolutionary approach.
      • Start with a first generation of, say, 100 randomly created policies.
      • Try them all out in the environment.
      • “Kill” the worst-performing ones (e.g., the bottom 80).
      • Let the survivors “reproduce” to create the next generation. Offspring are copies of their parent(s) plus some random variation (mutation).
      • Iterate through generations until a good policy is found.
      • (Footnote 8 mentions NEAT - NeuroEvolution of Augmenting Topologies - as an interesting example).
    3. Optimization Techniques (Policy Gradients - Page 613):

      • This is a more direct approach that we’ll focus on later in the chapter.
      • Evaluate the gradients of the rewards with respect to the policy parameters.
      • Tweak the parameters by following these gradients towards higher rewards. This is called Gradient Ascent (like Gradient Descent, but we want to maximize rewards, not minimize a cost).
      • Example (vacuum robot): Slightly increase p (probability of moving forward). Does it pick up more dust? If yes, increase p more. If no, reduce p.
      • This approach is called Policy Gradients (PG).
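
The brute-force approach above can be sketched with a toy example. Here, `simulate_dust_collected` is a hypothetical stand-in for actually running the vacuum robot for 30 minutes (we have no simulator here, so it just rewards parameters near an arbitrarily chosen optimum); the search loop itself is the real point:

```python
import random

def simulate_dust_collected(p, r):
    """Hypothetical stand-in for running the robot for 30 minutes.
    In practice this would run a simulator (or the real robot) and
    return the amount of dust collected. Here we just pretend that
    p = 0.8 and r = 20 degrees happen to be optimal."""
    return -((p - 0.8) ** 2) - ((r - 20.0) / 100.0) ** 2

def brute_force_search(n_trials=1000, seed=42):
    """Try many random (p, r) combinations; keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        p = rng.uniform(0.0, 1.0)      # probability of moving forward
        r = rng.uniform(0.0, 180.0)    # max rotation angle, in degrees
        score = simulate_dust_collected(p, r)
        if score > best_score:
            best_params, best_score = (p, r), score
    return best_params, best_score

best, score = brute_force_search()
```

With only two parameters this works fine; with thousands or millions of parameters (the weights of a neural network policy), random search becomes hopeless, which is why gradient-based methods matter.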

The Need for an Environment to Train In:

Before we can implement Policy Gradients (or most RL algorithms), we need an environment for the agent to interact with. This is a major challenge in RL.

  • For Atari games, you need an Atari game simulator.
  • For a walking robot, the environment is the real world (with limitations: can’t easily undo mistakes, can’t speed up time, expensive to run many robots in parallel).
  • So, simulated environments are crucial, at least for initial “bootstrap” training.
    • Libraries like PyBullet or MuJoCo are used for 3D physics simulation.

(Page 613-616: Introduction to OpenAI Gym)

This is where OpenAI Gym comes in.

  • What it is: A toolkit that provides a wide variety of simulated environments for RL.

    • Atari games, board games, 2D and 3D physical simulations, etc.
    • It’s a standard platform to train agents, compare different RL algorithms, or develop new ones.
  • Setting up OpenAI Gym (Page 614):

    • Activate your virtual environment (if using one).
    • pip install -U gym
    • May need extra libraries for rendering some environments (e.g., Mesa OpenGL Utility on Ubuntu).
  • Using OpenAI Gym - Example with CartPole (Page 614): The CartPole environment (Figure 18-4, page 615) is a classic RL “hello world.”

    • A cart can move left or right.
    • A pole is attached to the cart.
    • Goal: Keep the pole balanced upright for as long as possible.
    1. Create an environment:

       import gym
       env = gym.make("CartPole-v1")
    2. Reset the environment: Must be done before starting an episode. Returns the initial observation.

       obs = env.reset()
       print(obs)  # might give array([-0.0125..., -0.0015..., 0.0420..., -0.0018...])
      • Observations: Depend on the environment. For CartPole:
        • A 1D NumPy array with 4 floats:
          1. Cart horizontal position (0.0 = center).
          2. Cart velocity (positive = right).
          3. Pole angle (0.0 = vertical).
          4. Pole angular velocity (positive = clockwise).
    3. Render the environment (optional, for visualization - page 615): env.render() (Opens a window showing the cart and pole).
      • If on a headless server (no screen), rendering will fail unless you use a fake X server like Xvfb.
      • env.render(mode="rgb_array") returns the image as a NumPy array.
    4. Check Possible Actions (Page 615): print(env.action_space) gives Discrete(2).
      • This means there are 2 discrete actions: integers 0 and 1.
      • For CartPole: 0 = accelerate left, 1 = accelerate right.
      • Other environments might have more actions, or continuous action spaces (e.g., apply force between -1.0 and 1.0).
    5. Take a Step in the Environment (Page 616):

       action = 1  # accelerate right
       obs, reward, done, info = env.step(action)

       The step() method executes the chosen action and returns four values:
      • obs: The new observation after the action.
      • reward: The reward received for taking that action in the previous state.
        • For CartPole: You get a reward of 1.0 at every step, no matter what you do.
        • What this reward structure is ultimately trying to achieve: The agent’s goal is to keep the episode running as long as possible (to accumulate as many +1 rewards as possible).
      • done: A boolean. True if the episode is over.
        • For CartPole, this happens if:
          • Pole tilts too much.
          • Cart goes off the screen.
          • The episode reaches its step limit (200 for CartPole-v0, 500 for CartPole-v1). In this case, you’ve “won” the episode.
        • After done is True, you must call env.reset() before using it again.
      • info: An environment-specific dictionary with extra information (often empty, but can be useful for debugging or specific tasks, e.g., number of lives left in a game).
    6. Close the Environment: When done: env.close() to free resources.
  • Simple Hardcoded Policy Example (Page 617): Let’s try a very simple policy: if the pole is leaning left (angle obs[2] < 0), accelerate left (action 0); if it’s leaning right or vertical (obs[2] ≥ 0), accelerate right (action 1).

    def basic_policy(obs):
        angle = obs[2]
        return 0 if angle < 0 else 1
    
    totals = []
    for episode in range(500):
        episode_rewards = 0
        obs = env.reset()
        for step in range(200):  # cap each episode at 200 steps
            action = basic_policy(obs)
            obs, reward, done, info = env.step(action)
            episode_rewards += reward
            if done:
                break
        totals.append(episode_rewards)
    
    • Run this for 500 episodes.
    • Result: np.mean(totals) is around 41-42. np.max(totals) might be around 68.
    • This means this simple policy can’t even keep the pole up for the full 200 steps on average. Not great. The cart tends to oscillate more and more until the pole falls.

Key Takeaway from OpenAI Gym Introduction: OpenAI Gym provides a standardized way to interact with various simulated environments, which is essential for developing and testing RL algorithms. The core interaction loop is: reset -> loop (get action from policy, step, observe reward/done) -> close.
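
That reset -> step -> close loop can be illustrated without installing Gym, using a tiny stand-in environment that mimics the classic Gym interface. The `TinyEnv` class below is invented purely for illustration (a real environment such as `gym.make("CartPole-v1")` would take its place); what matters is the shape of the loop:

```python
import random

class TinyEnv:
    """Minimal stand-in exposing the classic Gym interface
    (reset/step/close). Not a real simulation: it just hands out
    +1 per step, like CartPole, and ends after max_steps."""
    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.steps = 0
    def reset(self):
        self.steps = 0
        return 0.0                    # initial observation
    def step(self, action):
        self.steps += 1
        obs = float(self.steps)       # dummy observation
        reward = 1.0                  # +1 reward at every step
        done = self.steps >= self.max_steps
        return obs, reward, done, {}  # obs, reward, done, info
    def close(self):
        pass

def run_episode(env, policy):
    """The core RL interaction loop: reset, then act until done."""
    obs = env.reset()
    total_reward = 0.0
    while True:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

env = TinyEnv()
total = run_episode(env, policy=lambda obs: random.choice([0, 1]))
env.close()
```

Swapping `TinyEnv` for a real Gym environment and the random lambda for `basic_policy` reproduces the CartPole experiment above.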

(Page 617-618: Neural Network Policies)

We just saw a simple, hardcoded policy for CartPole. Now, let’s create a policy using a neural network.

  • Input: The neural network will take an observation from the environment as input.

  • Output: It will output the action to be executed.

  • Stochastic Policy (More Precisely): The neural network will usually estimate a probability for each possible action. We then select an action randomly according to these estimated probabilities (Figure 18-5, page 618).

    • CartPole Example:
      • Two possible actions: 0 (left) and 1 (right).
      • We only need one output neuron in our neural network.
      • This neuron will output p, the probability of action 0 (going left).
      • The probability of action 1 (going right) will then be 1 - p.
      • If the network outputs p = 0.7, we’ll pick action 0 with 70% probability and action 1 with 30% probability.
  • Why Pick Randomly (Exploration vs. Exploitation - Page 618)?

    • Why not just always pick the action with the highest estimated probability (greedy approach)?
    • This is the classic exploration vs. exploitation dilemma in RL.
      • Exploitation: Choosing actions that are known to give good rewards based on current knowledge.
      • Exploration: Trying out new actions to discover potentially even better rewards or to learn more about the environment.
    • If you only exploit, you might get stuck with a suboptimal policy because you never try actions that could lead to better long-term outcomes. (Analogy: always ordering your favorite dish at a restaurant and never trying others that might be even better).
    • Sampling actions based on their probabilities allows the agent to strike a balance: mostly picking actions it thinks are good, but occasionally trying out less probable ones.
  • Considering Past Actions/Observations (State Representation - Page 618):

    • CartPole Simplicity: In the CartPole environment, each observation (cart position, cart velocity, pole angle, pole angular velocity) contains the environment’s full state. Past actions and observations can be safely ignored because the current observation tells you everything you need to know about the current situation.
    • Hidden State in the Environment: If the environment had some hidden state (i.e., the observation doesn’t fully describe the situation), then the agent might need to consider past actions and observations to infer the true state.
      • Example: If the environment only revealed cart position but not velocity, the agent would need to look at the current and previous positions to estimate velocity.
      • Example: If observations are noisy, an agent might want to average a few past observations to get a more stable estimate of the current state.
    • What this means for policy design: For simple, fully observable environments like CartPole, a policy that only looks at the current observation is sufficient. For partially observable environments, the policy (or the agent’s architecture) might need some form of memory (like an RNN, or by explicitly feeding sequences of observations).
  • Building the Neural Network Policy with tf.keras (Page 619):

    import tensorflow as tf
    from tensorflow import keras
    
    n_inputs = 4 # For CartPole: env.observation_space.shape[0]
    
    model = keras.models.Sequential([
        keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
        keras.layers.Dense(1, activation="sigmoid") # Output probability of action 0 (left)
    ])
    
    • Input Layer: Implicitly defined by input_shape=[n_inputs] in the first Dense layer. n_inputs is 4 for CartPole.
    • Hidden Layer: A single Dense layer with 5 neurons and “elu” activation. This is a simple problem, so a small hidden layer is likely sufficient.
    • Output Layer: A single Dense neuron with “sigmoid” activation.
      • Why sigmoid? Sigmoid outputs a value between 0 and 1, which is perfect for representing the probability p of choosing action 0 (left).
      • If there were more than two possible actions (e.g., up, down, left, right), we would use one output neuron per action and a softmax activation function in the output layer to get a probability distribution over all actions.

We now have a neural network policy! It takes observations and outputs action probabilities. The Big Question: how do we train it? We don’t have explicit “correct action” labels like in supervised learning. This is where the “credit assignment problem” comes in.
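
Before that, note how the sigmoid output is turned into an action: sample action 0 with probability p, else action 1. A minimal sketch (the `sample_action` helper is our own name, not a Keras API; in the full setup `p_left` would come from the model, e.g. `model.predict(obs[np.newaxis])[0, 0]`):

```python
import random

def sample_action(p_left, rng=random):
    """Pick action 0 (left) with probability p_left, else action 1
    (right). This is the sampling step that gives the stochastic
    policy its exploration behavior."""
    return 0 if rng.random() < p_left else 1

# Sanity check: with p_left = 0.7 we should go left about 70% of the time.
rng = random.Random(0)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
frac_left = actions.count(0) / len(actions)  # ≈ 0.7
```

A greedy policy would instead always return `0 if p_left >= 0.5 else 1`, which is exactly the exploitation-only behavior the next section warns against.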

Key Takeaway for Neural Network Policies: Neural networks provide a flexible and powerful way to define an agent’s policy, mapping observations to action probabilities. Using probabilities allows for a natural way to balance exploration and exploitation. The next step is to figure out how to adjust the network’s weights based on the rewards received.

That concept of a neural network outputting action probabilities, which then guide the agent’s choices (allowing for exploration), is fundamental to many RL approaches.

Now we hit a core challenge in Reinforcement Learning: Evaluating Actions: The Credit Assignment Problem (Pages 619-620).

(Page 619: Evaluating Actions: The Credit Assignment Problem)

We have our neural network policy that can suggest actions. But how do we train it?

  • If we knew the best action at each step: We could just do supervised learning. We’d train the network to output a high probability for the “correct” action and low probabilities for others, likely by minimizing cross-entropy between its output distribution and the target distribution (which would be a one-hot vector for the correct action).

  • The RL Reality: The only guidance the agent gets is through rewards, and these rewards are typically:

    • Sparse: You don’t get a reward after every single action. You might only get a reward at the very end of an episode (e.g., winning or losing a game).
    • Delayed: The action that truly led to a future reward might have occurred many steps earlier.
  • The Credit Assignment Problem:

    • When an agent receives a reward (positive or negative), it’s hard for it to know which of its past actions actually contributed to that reward (or penalty).
    • Example: If the CartPole agent balances the pole for 100 steps and then it falls, was the 100th action solely responsible? Probably not. Some earlier actions were good (kept it balanced), and some (perhaps the last few) were bad.
    • Analogy: Rewarding a dog hours after it behaved well. Will the dog understand what it’s being rewarded for? Unlikely.
    • What the credit assignment problem is ultimately trying to solve: How do we assign “credit” or “blame” to individual actions in a sequence when the feedback (reward) is sparse and delayed?
  • Common Strategy: Sum of Discounted Future Rewards (Returns)

    • To tackle this, a common strategy is to evaluate an action based on the sum of all the rewards that come after it in that episode.
    • Usually, we apply a discount factor γ (gamma) at each step.
    • This sum of discounted rewards is called the return for that action.
    • Figure 18-6 (page 620) illustrates this:
      • Agent takes action A₁ -> gets R₁ = +10
      • Agent takes action A₂ -> gets R₂ = 0
      • Agent takes action A₃ -> gets R₃ = -50
      • Assume discount factor γ = 0.8.
      • Return for action A₁: G₁ = R₁ + γ*R₂ + γ²*R₃ = 10 + (0.8 * 0) + (0.8² * -50) = 10 + 0 + (0.64 * -50) = 10 - 32 = -22.
      • Return for action A₂: G₂ = R₂ + γ*R₃ = 0 + (0.8 * -50) = -40.
      • Return for action A₃: G₃ = R₃ = -50.
    • Discount Factor γ:
      • A value between 0 and 1.
      • If γ is close to 0: Future rewards don’t count for much compared to immediate rewards. The agent becomes very “short-sighted.”
      • If γ is close to 1: Rewards far into the future count almost as much as immediate rewards. The agent becomes more “far-sighted.”
      • Typical values: 0.9 to 0.99.
      • For CartPole, actions have fairly short-term effects, so γ = 0.95 is suggested as reasonable. (With γ=0.95, rewards 13 steps ahead are discounted by about half: 0.95¹³ ≈ 0.5).
  • From Returns to Action Advantage (Page 620, bottom):

    • A good action might be followed by a sequence of unlucky bad actions by the agent, leading to a low return for that initially good action.
    • However, if we run many episodes, on average, good actions will tend to get higher returns than bad ones.
    • We want to estimate how much better or worse an action is compared to other possible actions on average. This is called the action advantage.
    • Estimating Advantage:
      1. Run many episodes.
      2. For each action taken in each episode, calculate its return (sum of discounted future rewards for that episode).
      3. Normalize all these action returns across all episodes (e.g., by subtracting the mean of all returns and dividing by their standard deviation).
      4. After normalization:
        • Actions with a positive advantage were probably good.
        • Actions with a negative advantage were probably bad.
  • Ready to Train: Now that we have a way to evaluate each action (by its normalized advantage), we are ready to train our first agent using Policy Gradients.
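
The return and advantage computations above can be sketched directly. The function names are our own (this mirrors the two steps: discount within each episode, then normalize across all episodes), and the example reproduces the Figure 18-6 numbers:

```python
import numpy as np

def discount_rewards(rewards, gamma):
    """Return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    for every step, computed in a single backward pass."""
    discounted = np.zeros(len(rewards))
    cumulative = 0.0
    for t in reversed(range(len(rewards))):
        cumulative = rewards[t] + gamma * cumulative
        discounted[t] = cumulative
    return discounted

def discount_and_normalize(all_rewards, gamma):
    """Per-action returns for several episodes, normalized across
    ALL episodes (subtract mean, divide by std) to get advantage
    estimates: positive = probably good, negative = probably bad."""
    all_returns = [discount_rewards(r, gamma) for r in all_rewards]
    flat = np.concatenate(all_returns)
    mean, std = flat.mean(), flat.std()
    return [(ret - mean) / std for ret in all_returns]

# The Figure 18-6 example: rewards +10, 0, -50 with gamma = 0.8
returns = discount_rewards([10, 0, -50], gamma=0.8)  # → [-22., -40., -50.]
```

The backward pass is the standard trick for computing all returns in O(n) instead of O(n²): each G_t is just r_t plus gamma times the already-computed G_{t+1}.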

Key Takeaway for Credit Assignment and Returns: The core challenge is linking delayed rewards back to the actions that caused them.

  • The return (sum of discounted future rewards) is a way to estimate the long-term value of taking an action from a certain state.
  • By running many episodes and normalizing these returns into advantages, we get a good signal about which actions were “good” (positive advantage) and which were “bad” (negative advantage).
  • What this process is ultimately trying to achieve: To generate a learning signal for each action taken, even if the explicit reward from the environment is sparse or delayed. This learning signal (the advantage) will tell the policy gradient algorithm whether to make that action more or less likely in the future.

This concept of discounted returns is fundamental to many RL algorithms. It’s how we assign value to actions that don’t necessarily yield immediate rewards.