Chapter 15: Processing Sequences Using RNNs and CNNs
The World of Sequences
The chapter opens with a vivid example: an outfielder catching a ball. This involves:
- Anticipating the future: Predicting the ball’s trajectory.
- Adapting to new information: Tracking the ball and adjusting movements.
This ability to process information over time and predict what comes next is something humans do constantly (finishing a friend’s sentence, smelling coffee brewing). This chapter introduces Recurrent Neural Networks (RNNs) as a class of neural networks designed for precisely this kind of task.
What RNNs are good for:
- Analyzing time series data (e.g., stock prices, weather patterns).
- Anticipating trajectories (e.g., in autonomous driving).
- Working on sequences of arbitrary lengths (unlike the fixed-sized inputs we’ve seen so far with MLPs and standard CNNs). This is a key differentiator.
- Processing sentences, documents, or audio samples.
- Hugely important for Natural Language Processing (NLP) tasks like automatic translation or speech-to-text.
Chapter Focus (for Chapter 15):
- Fundamental concepts of RNNs.
- How to train them using backpropagation through time (BPTT).
- Using RNNs to forecast a time series.
- Exploring two main difficulties RNNs face:
- Unstable gradients (vanishing/exploding, as discussed in Chapter 11), and techniques to alleviate them (recurrent dropout, recurrent layer normalization).
- Limited short-term memory, and how to extend it using LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells.
Alternatives to RNNs for Sequential Data (Page 497, bottom):
- The chapter acknowledges that RNNs aren’t the only way to handle sequential data.
- For small sequences, a regular dense network (MLP) might suffice.
- For very long sequences (audio, long text), Convolutional Neural Networks (CNNs) can also work surprisingly well. The chapter will discuss this and implement WaveNet, a CNN architecture for long sequences.
- (Chapter 16 will continue with RNNs for NLP and then move to attention mechanisms).
What this chapter is ultimately trying to achieve: To introduce a new type of neural network architecture capable of understanding and making predictions based on ordered sequences of data, where the order and context over time are crucial.
Recurrent Neurons and Layers
Up to now, we’ve focused on feedforward neural networks, where activations flow in one direction: input -> hidden layers -> output.
Recurrent Neural Network (RNN) Difference: An RNN looks like a feedforward network, but it also has connections pointing backward. This creates a loop, allowing information to persist.
Simplest RNN: A Single Recurrent Neuron (Figure 15-1, page 498):
- Left diagram (rolled): A single neuron receives an input `x`, produces an output `y`, and, importantly, sends its own output back to itself as an input for the next step.
- At each time step `t` (also called a frame):
  - The recurrent neuron receives the current external input `x(t)`.
  - It also receives its own output from the previous time step, `y(t-1)`. This `y(t-1)` is the "memory" or "state" from the past.
  - At the very first time step (`t=0`), there's no previous output, so `y(-1)` is generally initialized to 0.
- Right diagram (unrolled through time): This is a crucial visualization. It shows the same recurrent neuron represented at different points in time (`t-3`, `t-2`, `t-1`, `t`).
  - At time `t`, the neuron takes `x(t)` and its own output at `t-1` (which is `y(t-1)`) to produce `y(t)`.
  - This "unrolling" helps visualize how information flows and how gradients will be calculated.
- What this recurrent connection is ultimately trying to achieve: It allows the neuron to maintain a "state" or "memory" of past inputs, influencing its current output based not just on the current input but also on what it has seen before.
Layer of Recurrent Neurons (Figure 15-2, page 499):
You can have a whole layer of these recurrent neurons.
At each time step `t`:
- Every neuron in the layer receives the entire input vector `x(t)`.
- Every neuron also receives the entire output vector `y(t-1)` from all neurons in this same layer at the previous time step.

Weights: Each recurrent neuron (or the layer as a whole) now has two sets of weights:
- `Wₓ`: Weights for the current inputs `x(t)`.
- `Wᵧ`: Weights for the outputs of the previous time step, `y(t-1)`.

Equation 15-1: Output of a recurrent layer for a single instance at time `t`:
- `y(t) = φ( Wₓᵀ x(t) + Wᵧᵀ y(t-1) + b )`
- `φ` is the activation function (e.g., tanh is common; ReLU can be used but needs care, as mentioned in Ch 11).
- `b` is the bias vector.
- What this equation is ultimately trying to achieve: It formalizes how the current output `y(t)` is a function of both the current input `x(t)` and the collective state/output of the layer from the previous time step, `y(t-1)`.

Equation 15-2: Output for a mini-batch:
- `Y(t) = φ( X(t) Wₓ + Y(t-1) Wᵧ + b )`
- This is the vectorized form for a whole mini-batch `X(t)`.
- `Y(t)`: Matrix of outputs for all instances in the batch at time step `t`.
- `X(t)`: Matrix of inputs for all instances at time step `t`.
- `Y(t-1)`: Matrix of outputs from the previous time step.
- A common convention: `Wₓ` and `Wᵧ` are often concatenated into a single weight matrix `W`, and the inputs `[X(t) Y(t-1)]` are concatenated before multiplication.
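Equation 15-2 is a single matrix step, and can be sketched in a few lines of NumPy. The shapes here (4 instances, 3 input features, 5 recurrent units) are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size, n_inputs, n_units = 4, 3, 5

Wx = rng.normal(size=(n_inputs, n_units))  # weights for the current inputs X(t)
Wy = rng.normal(size=(n_units, n_units))   # weights for the previous outputs Y(t-1)
b = np.zeros(n_units)                      # bias vector

X_t = rng.normal(size=(batch_size, n_inputs))
Y_prev = np.zeros((batch_size, n_units))   # Y(-1) is initialized to 0

# Equation 15-2: Y(t) = φ( X(t) Wx + Y(t-1) Wy + b ), with φ = tanh
Y_t = np.tanh(X_t @ Wx + Y_prev @ Wy + b)
print(Y_t.shape)  # (4, 5): one 5-dim output vector per instance
```

Note the outputs are bounded in (-1, 1) because of the tanh activation.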
Memory and State (Page 500):
- Since the output `Y(t)` depends on all inputs from `X(0)` up to `X(t)`, the RNN has a form of memory.
- A part of a neural network that preserves some state across time steps is called a memory cell (or simply a "cell").
- A single recurrent neuron or a layer of them is a very basic cell, typically capable of learning only short patterns (e.g., ~10 steps). More complex cells (LSTM, GRU) can learn longer patterns.
- Cell State `h(t)`: In general, a cell's state at time `t` is denoted `h(t)` ("h" for hidden). It's a function of the previous state `h(t-1)` and the current input `x(t)`: `h(t) = f(h(t-1), x(t))`.
- Output `y(t)`: The cell's output `y(t)` is also a function of the previous state and the current inputs.
- Figure 15-3: For basic recurrent cells like `SimpleRNN` in Keras, the output `y(t)` is simply equal to the hidden state `h(t)`. However, for more complex cells like LSTMs (which we'll see later), the hidden state `h(t)` and the output `y(t)` can be different: the cell may maintain an internal state richer than what it chooses to output at each time step.
Input and Output Sequences (Figure 15-4, page 501): RNNs can handle various types of input/output sequence configurations:
- Sequence-to-Sequence (Top-Left):
  - Input: A sequence of vectors `X(0), X(1), ..., X(T)`.
  - Output: A sequence of vectors `Y(0), Y(1), ..., Y(T)`.
  - Example: Predicting stock prices (input the last N days' prices, output the prices shifted one day into the future).
  - What it's ultimately trying to achieve: Transform an input sequence into an output sequence, often for prediction or transduction.
- Sequence-to-Vector (Top-Right):
- Input: A sequence of vectors.
- Output: A single vector (usually from the last time step’s output). All other outputs are ignored.
- Example: Sentiment analysis of a movie review (input sequence of words, output a single sentiment score like -1 to +1).
- What it’s ultimately trying to achieve: Summarize or classify an entire input sequence into a single vector representation or decision.
- Vector-to-Sequence (Bottom-Left):
- Input: A single fixed vector, fed repeatedly at each time step (or just at the first time step with subsequent inputs being zero).
- Output: A sequence of vectors.
- Example: Image captioning (input an image or its CNN feature vector, output a sequence of words forming a caption).
- What it’s ultimately trying to achieve: Generate a sequence based on a static input.
- Encoder-Decoder (Bottom-Right):
- A sequence-to-vector network (the encoder) is followed by a vector-to-sequence network (the decoder).
- Example: Neural Machine Translation (NMT).
- Encoder reads an input sentence (e.g., English) and compresses it into a single vector representation (often called the “context vector” or “thought vector”).
- Decoder takes this context vector and generates an output sentence (e.g., French), word by word.
- This two-step model generally works much better for tasks like translation than a single sequence-to-sequence RNN because the meaning of the whole input sentence needs to be captured before starting to generate the translation (the end of an English sentence can affect the beginning of its French translation).
- (More complex Encoder-Decoder details in Chapter 16).
The core idea so far is that recurrent neurons have a “loop” that allows them to pass their previous output (their state) as an input to their next computation, enabling them to process sequences and maintain a memory of past information.
The next logical step in the chapter is Training RNNs (Page 502).
Training RNNs - Backpropagation Through Time (BPTT)
So, we have this recurrent structure. How do we train it? How do we adjust Wₓ, Wᵧ, and b?
The Trick: Unroll and Use Regular Backpropagation:
- The core idea is surprisingly simple:
  - Unroll the RNN through time for the length of your input sequences (as we saw in Figures 15-1 and 15-2). Unrolled, it looks just like a very deep feedforward network, where each "time slice" of the RNN becomes a layer.
  - Crucially, the weights (`Wₓ`, `Wᵧ`, `b`) are shared across all these time-step "layers" in the unrolled network. The `Wₓ` used at time `t=1` is the same `Wₓ` used at `t=2`, `t=3`, etc.
  - Then, you simply apply regular backpropagation (as discussed in Chapter 10) to this unrolled network.
- This strategy is called Backpropagation Through Time (BPTT).
The BPTT Process (Figure 15-5, page 503):
- Forward Pass:
  - Feed the input sequence through the unrolled network, time step by time step (dashed arrows in the figure).
  - At each time step, calculate the outputs `Y(0), Y(1), ..., Y(T)` (where `T` is the maximum time step, i.e., the sequence length).
- Evaluate Loss:
  - The output sequence `Y(0)...Y(T)` is evaluated using a cost function `C(Y(0), ..., Y(T))`.
  - Important note: the cost function might ignore some outputs. For example:
    - In a sequence-to-vector RNN (like sentiment analysis), the cost function might only care about the very last output `Y(T)`.
    - In Figure 15-5, the cost function `C(Y(2), Y(3), Y(4))` uses only the last three outputs.
- Backward Pass:
  - The gradients of the cost function are then propagated backward through the unrolled network (solid arrows in the figure).
  - Gradients flow backward through all the outputs that were used by the cost function. In Figure 15-5, gradients flow from `Y(4)`, `Y(3)`, and `Y(2)`; they do not flow from `Y(1)` or `Y(0)`, because those were not part of the cost calculation in this example.
- Update Parameters:
  - Since the same parameters `Wₓ`, `Wᵧ`, and `b` are used at each time step in the unrolled network, the gradients calculated for them at each time step are summed (or averaged) across all time steps.
  - These aggregated gradients are then used to update `Wₓ`, `Wᵧ`, and `b` with a standard gradient descent step.
  - What this summing of gradients across time achieves: It ensures that the parameter updates reflect how those parameters contributed to the error across the entire sequence, not just at one particular time step.
Keras Handles the Complexity:
- The good news is that `tf.keras` takes care of all this BPTT complexity for you when you use its recurrent layers. You define the recurrent layer, and during `fit()`, Keras handles the unrolling, BPTT, and parameter updates.
Key Takeaway for BPTT: BPTT is the standard way to train RNNs. It cleverly transforms the problem of training a network with loops into training a deep feedforward network (the unrolled version) where weights are shared across “layers” (time steps). Regular backpropagation can then be applied, with the understanding that gradients for shared weights are aggregated across all time steps.
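The "gradients for shared weights are summed across time steps" idea can be checked on a toy example. The snippet below is an illustrative sketch, not the book's code: a single linear recurrent neuron (activation omitted for simplicity) with made-up weights `w`, `v`, where the loss is simply the final output. The hand-rolled BPTT gradient matches a finite-difference estimate:

```python
def forward(w, v, xs):
    # Unrolled linear recurrent neuron: y_t = w * x_t + v * y_{t-1}, with y_{-1} = 0.
    y, ys = 0.0, []
    for x in xs:
        y = w * x + v * y
        ys.append(y)
    return ys

def bptt_grads(w, v, xs):
    # Loss = final output y_T. Walk backward through the unrolled steps,
    # SUMMING the per-step gradients for the shared weights w and v.
    ys = forward(w, v, xs)
    dw = dv = 0.0
    dy = 1.0                       # dLoss/dy_t, starting at the last step
    for t in range(len(xs) - 1, -1, -1):
        dw += dy * xs[t]           # local gradient of y_t w.r.t. w
        y_prev = ys[t - 1] if t > 0 else 0.0
        dv += dy * y_prev          # local gradient of y_t w.r.t. v
        dy *= v                    # propagate to the previous time step
    return dw, dv

xs = [0.5, -1.0, 2.0]
w, v = 0.8, 0.3
dw, dv = bptt_grads(w, v, xs)

# Sanity check against a numerical (central finite-difference) gradient.
eps = 1e-6
num_dw = (forward(w + eps, v, xs)[-1] - forward(w - eps, v, xs)[-1]) / (2 * eps)
num_dv = (forward(w, v + eps, xs)[-1] - forward(w, v - eps, xs)[-1]) / (2 * eps)
print(dw, num_dw, dv, num_dv)
```

The summed gradient for `w` works out to `x_2 + v*x_1 + v²*x_0`: each earlier time step's contribution is attenuated by one more factor of `v`, which is exactly the mechanism behind vanishing/exploding gradients discussed later in the chapter.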
This is a foundational concept for training any kind of recurrent network. Next, the chapter moves into a practical example: forecasting a time series.
This section will also involve Keras implementations, but we'll continue to focus on what each RNN-specific concept is trying to achieve, and on the data handling.
Forecasting a Time Series - The Task
- What is a Time Series?
- A sequence of one or more values per time step.
- Univariate time series: Single value per time step (e.g., daily temperature in your city, active users per hour on a website).
- Multivariate time series: Multiple values per time step (e.g., a company’s quarterly revenue, debt, profit, etc.).
- Common Tasks:
- Forecasting: Predicting future values.
- Imputation: Predicting (or “postdicting”) missing values from the past.
- Example for this section (Figure 15-6, page 504):
- Three univariate time series, each 50 time steps long.
- Goal: Forecast the value at the next time step (the ‘X’ in the figure) for each of them.
Generating Time Series Data
For simplicity, the book uses a `generate_time_series()` function to create synthetic data:
- It creates `batch_size` time series, each of length `n_steps`.
- Each series is univariate (one value per time step).
- Each series is a sum of two sine waves with random frequencies and phases, plus some noise.
- The function returns a NumPy array of shape `[batch_size, n_steps, 1]` (the last dimension is the number of features per time step, which is 1 for univariate series).
Input Data Shape for RNNs (Bird Icon, page 504):
- When dealing with sequences, input features are generally represented as 3D arrays: `[batch_size, time_steps, dimensionality]`.
- `dimensionality` is 1 for univariate time series, more for multivariate.
Creating Train/Validation/Test Sets:
- `n_steps = 50`.
- Generate 10,000 series, each `n_steps + 1` long (the extra step is the target): `series = generate_time_series(10000, n_steps + 1)`.
- Split the data:
  - `X_train`: first 7,000 series, first `n_steps` time steps each: `series[:7000, :n_steps]`.
  - `y_train`: first 7,000 series, the last time step of each: `series[:7000, -1]` (the value we want to predict after seeing `n_steps` steps).
  - Similarly for `X_valid` (next 2,000) and `X_test` (last 1,000).
- Shapes: `X_train` is `[7000, 50, 1]`; `y_train` is `[7000, 1]`.
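Putting the generation and split steps together — the function below is a reconstruction consistent with the description above (two sine waves with random frequency/phase, plus noise); the specific amplitudes and frequency ranges are one reasonable choice, not something to treat as canonical:

```python
import numpy as np

def generate_time_series(batch_size, n_steps):
    """Univariate series: sum of two random sine waves plus noise."""
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)    # noise
    return series[..., np.newaxis].astype(np.float32)              # [batch, steps, 1]

n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]
print(X_train.shape, y_train.shape)  # (7000, 50, 1) (7000, 1)
```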
Baseline Metrics
Before building complex RNNs, it’s crucial to establish some simple baselines.
- What baselines are ultimately trying to achieve: They give you a simple reference point. If your fancy RNN can’t beat a very basic model, something is wrong, or the task might be too simple for an RNN, or the basic model is surprisingly effective.
Naive Forecasting:
- Predict the last observed value in each series as the next value: `y_pred_naive = X_valid[:, -1, :]` (take the last time step from each validation sequence).
- Calculate the Mean Squared Error (MSE): `np.mean(keras.losses.mean_squared_error(y_valid, y_pred_naive))`.
- Result: ~0.020 MSE.
Simple Linear Regression Model:
- Use a fully connected network (a Dense layer) with a linear activation.
- Since Dense layers expect a flat list of features, add a `Flatten` layer first.

```python
model_linear = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),  # Input: 50 time steps, 1 feature each
    keras.layers.Dense(1)                       # Output: 1 predicted value
])
```
- Compile with MSE loss and the Adam optimizer. Train for 20 epochs.
- Result: ~0.004 MSE. Much better than naive forecasting!
Implementing a Simple RNN
Now let’s try a very basic RNN. Can it beat the linear model?
```python
model_simple_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])  # Single layer, single neuron
])
```
- `keras.layers.SimpleRNN(1, ...)`: Creates a single recurrent layer with just one recurrent neuron.
- `input_shape=[None, 1]`:
  - `None`: We don't need to specify the length of the input sequences (`n_steps`). RNNs can process sequences of arbitrary length. This is a major advantage.
  - `1`: Dimensionality of the input at each time step (univariate).
Default Behavior of `SimpleRNN`:
- Uses `tanh` (hyperbolic tangent) as the activation function.
- Returns only the final output: By default, recurrent layers in Keras only return the output of the last time step. If our input sequence has 50 steps (`y(0)` to `y(49)`), this layer will output `y(49)`. (Bird icon, page 506.)
- This is exactly what we want for this forecasting task (predicting the value at the next step after seeing `n_steps` steps).
How it works (conceptually, as in Figure 15-1):
- The initial hidden state `h_init` is set to 0.
- For `t=0`: the neuron takes `x(0)` and `h_init`, computes `h(0)` (which is also `y(0)` for `SimpleRNN`).
- For `t=1`: the neuron takes `x(1)` and `h(0)`, computes `h(1)` (which is also `y(1)`).
- …
- For `t=49`: the neuron takes `x(49)` and `h(48)`, computes `h(49)` (which is also `y(49)`).
- The layer outputs this final `y(49)`.
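This loop can be written out directly in NumPy. The weight values below are made up for illustration; the point is that the entire 50-step computation uses just 3 parameters and returns only the last output:

```python
import numpy as np

rng = np.random.default_rng(0)
wx, wy, b = 0.5, 0.9, 0.0      # the 3 parameters of a SimpleRNN(1) neuron
xs = rng.normal(size=50)       # one univariate sequence, 50 time steps

h = 0.0                        # initial hidden state h_init
for x in xs:
    h = np.tanh(wx * x + wy * h + b)  # h(t) = tanh(wx*x(t) + wy*h(t-1) + b)

y_49 = h                       # the layer returns only the final output y(49)
print(y_49)
```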
Training: Compile with MSE, Adam optimizer, train for 20 epochs.
Result: MSE of ~0.014.
- Better than naive forecasting (0.020).
- But worse than the simple linear model (0.004)!
Why is it worse? Parameters:
- The linear model had `50 inputs * 1 weight_per_input + 1 bias = 51` parameters.
- A simple RNN neuron has:
  - A weight for the current input `x(t)` (a vector `Wₓ` if `x(t)` is a vector; here `x(t)` is scalar, so 1 weight).
  - A weight for the previous hidden state `h(t-1)` (a vector `Wᵧ` if `h(t-1)` is a vector; here `h(t-1)` is scalar, the output of one neuron, so 1 weight).
  - One bias term.
  - Total: 3 parameters for this single `SimpleRNN(1)` neuron.
- With only 3 parameters, it's much less powerful than the 51-parameter linear model for this task.
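These counts follow from the general formula for a simple RNN layer — `units × n_inputs` (for `Wₓ`) plus `units × units` (for `Wᵧ`) plus `units` biases — sketched here as a quick sanity check:

```python
def simple_rnn_params(n_inputs, units):
    # Wx: [n_inputs, units], Wy: [units, units], b: [units]
    return units * n_inputs + units * units + units

def dense_params(n_inputs, units):
    # W: [n_inputs, units], b: [units]
    return units * n_inputs + units

print(simple_rnn_params(1, 1))   # 3   (the SimpleRNN(1) neuron above)
print(dense_params(50, 1))       # 51  (the Flatten + Dense(1) linear model)
print(simple_rnn_params(1, 20))  # 440 (a 20-unit layer on univariate input)
```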
Trend and Seasonality (Sidebar, page 506):
- Traditional time series models (like ARIMA) often require you to manually remove trends (e.g., 10% monthly growth) and seasonality (e.g., sunscreen sales peak in summer) from the data before training. After prediction, you add them back.
- With RNNs, this is generally not necessary. The RNN can learn trends and seasonality if they exist in the data. However, preprocessing to remove them might sometimes improve performance, as the model then has a simpler pattern to learn.
Deep RNNs
The simple RNN with one neuron was too simple. Let’s try stacking multiple layers of cells – a deep RNN (Figure 15-7, page 507).
Keras Implementation: Just stack recurrent layers.
```python
model_deep_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),  # Second layer
    keras.layers.SimpleRNN(1)                           # Output layer
])
```
- First `SimpleRNN(20, return_sequences=True, ...)`:
  - 20 recurrent neurons (units).
  - `return_sequences=True`: This is crucial! By default, a recurrent layer only outputs the hidden state of the last time step. For stacking, the next recurrent layer needs to receive a full sequence of outputs (one output vector per time step). Setting `return_sequences=True` makes the layer output its hidden state at every time step: if the input is 50 steps, the output is 50 steps (each a 20-dim vector).
- Second `SimpleRNN(20, return_sequences=True)`:
  - Also 20 units; it also returns sequences, to feed into the next layer.
  - (The scorpion icon on page 507 emphasizes setting `return_sequences=True` for all recurrent layers except possibly the last one, if you only care about its final output.)
- Third `SimpleRNN(1)`:
  - The final layer has 1 unit (to predict a single value).
  - It does not set `return_sequences=True` (it's `False` by default), so it outputs only the value from the very last time step.
Training: Compile with MSE, Adam optimizer, train.
Result: MSE of ~0.003! We finally beat the linear model (0.004).
Improving the Output Layer (Page 507):
- The last `SimpleRNN(1)` layer is not ideal:
  - Its hidden state is just a single number, which isn't much memory. The model likely relies more on the hidden states of the previous 20-unit layers.
  - `SimpleRNN` uses `tanh` by default, so predictions are bounded between -1 and 1. What if our time series values go outside this range?
- Better approach: Replace the final recurrent layer with a `Dense` layer.

```python
model_deep_rnn_dense_output = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),  # Now the last recurrent layer: no return_sequences=True
    keras.layers.Dense(1)
])
```
- The second `SimpleRNN` now only outputs its final state (a 20-dim vector).
- The `Dense(1)` layer takes this 20-dim vector and outputs a single value.
- Benefits:
  - The `Dense` layer is often faster.
  - No `tanh` constraint on the output value (unless you add an activation to the `Dense` layer).
  - This model converges faster and performs just as well or better.
Forecasting Several Time Steps Ahead
So far, we’ve only predicted the very next value (t+1). What if we want to predict multiple steps ahead (e.g., t+1 to t+10)?
Option 1: Iterative Predictions (Predict one step, use it as input for next, etc. - Page 508)
- Use one of the models trained to predict just one step ahead.
- To predict 10 steps:
  - Feed the last `n_steps` of known data to the model; get the prediction for step `n_steps+1`.
  - Take this prediction and append it to the known sequence (as if it were a true value).
  - Feed the new last `n_steps` (which now include the first prediction) to the model to get the prediction for step `n_steps+2`.
  - Repeat 10 times.
- Code Example:

```python
# series is [1, n_steps + 10, 1]
# X_new is the initial sequence [1, n_steps, 1]
# Y_new is the target future sequence [1, 10, 1]
X = X_new
for step_ahead in range(10):
    y_pred_one = model.predict(X[:, step_ahead:])                  # Predict one step
    X = np.concatenate([X, y_pred_one[:, np.newaxis, :]], axis=1)  # Append prediction
Y_pred_iterative = X[:, n_steps:]  # The 10 appended predictions
```
- Problem: Errors can accumulate. If the prediction for step 1 is slightly off, it makes the input for predicting step 2 slightly off, and so on. (Figure 15-8, page 509, shows this.)
- Result: MSE on validation set is ~0.029. Much higher than previous models, but it’s a harder task.
- Naive forecast (predict last value for all 10 future steps) gives MSE ~0.223 (terrible).
- A simple linear model (trained to predict 10 steps ahead) gives MSE ~0.0188 – much better than this iterative RNN approach and faster to train.
- When it might work: For a few steps ahead, or for complex tasks where linear models fail, this iterative approach might be okay.
Option 2: Train an RNN to Predict All Future Values at Once (Sequence-to-Vector - Page 509)
- Change the targets: Instead of `y_train` being the single value at `n_steps+1`, make `Y_train` a vector of the 10 values from `n_steps+1` to `n_steps+10`.
  - `Y_train = series[:7000, -10:, 0]` (where each series now has length `n_steps + 10`, so `-10:` is the same slice as `n_steps:n_steps+10`; the trailing `0` index drops the feature dimension, giving targets of shape `[batch_size, 10]`).
- Modify the model: The output layer now needs 10 units (instead of 1).

```python
model_seq_to_vec_10 = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),  # Last RNN outputs its final state (20-dim vector)
    keras.layers.Dense(10)       # Dense layer predicts 10 values
])
```
- Result: MSE for the next 10 steps is ~0.008. Much better than the linear model (0.0188)!
Option 3: Train an RNN to Predict Future Values at Each Time Step (Sequence-to-Sequence - Pages 509-510)
This is even better. Instead of forecasting the next 10 values only at the very last time step, train the model to forecast the next 10 values at each and every time step of the input.
Targets `Y` (Page 510):
- For each input sequence `X(0)...X(n_steps-1)`, the target `Y` needs to be a sequence of the same length.
- At time step `t` of the input, the target `Y(t)` is a vector of the actual next 10 values: `[actual(t+1), actual(t+2), ..., actual(t+10)]`.
- The code snippet in the book shows how to construct this `Y` array of shape `[batch_size, n_steps, 10]`.
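The target construction can be sketched as follows. A random placeholder array stands in for the generated series here — the indexing, not the data, is the point: `Y[:, t, k]` holds the value at time step `t + k + 1`.

```python
import numpy as np

n_steps, horizon, batch = 50, 10, 1000
# Placeholder for generate_time_series(batch, n_steps + horizon):
series = np.random.rand(batch, n_steps + horizon, 1).astype(np.float32)

# Build Y so that Y[:, t, k] = series value at time step t + k + 1
Y = np.empty((batch, n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

X = series[:, :n_steps]
print(X.shape, Y.shape)  # (1000, 50, 1) (1000, 50, 10)
```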
Causal Model (Bird Icon, page 510):
- The targets `Y_train` will contain values that also appear in `X_train` (e.g., `X_train[0, t+1]` is part of the target for time step `t`). Is this cheating?
- No, because at each time step `t`, the model only knows about past time steps up to `t`. It cannot look ahead in the input `X` to see `X(t+1)` when predicting the target `Y(t)`. It's a causal model.
Model Architecture (Sequence-to-Sequence - Page 510):
- All recurrent layers must have `return_sequences=True` (even the last recurrent one).
- The output `Dense(10)` layer must be applied at every time step. Keras offers the `TimeDistributed` layer wrapper for this.

```python
model_seq_to_seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),    # Both RNNs return sequences
    keras.layers.TimeDistributed(keras.layers.Dense(10))  # Apply Dense(10) at each time step
])
```
- `TimeDistributed(Dense(10))`: Wraps the `Dense(10)` layer. It takes the input sequence from the previous `SimpleRNN` (shape `[batch_size, time_steps, 20]`), reshapes it to `[batch_size * time_steps, 20]`, applies the `Dense(10)` layer (getting `[batch_size * time_steps, 10]`), and reshapes the result back to `[batch_size, time_steps, 10]`.
- Footnote 2 on page 510 mentions that a `Dense` layer itself can often handle sequence inputs correctly (it is applied to the last dimension independently at each time step), making `TimeDistributed(Dense(...))` sometimes redundant. However, `TimeDistributed` makes the intent very clear.
Loss and Evaluation (Page 511):
- During training, all outputs are needed, so MSE over all outputs is fine.
- For prediction and final evaluation, often only the output at the last time step is useful (i.e., given the full input sequence `X(0)...X(n_steps-1)`, what's the forecast for `X(n_steps)...X(n_steps+9)`?).
- A custom metric `last_time_step_mse` is defined to compute MSE only on the output at the last time step: `model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])`.
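In spirit, the metric just slices out the final time step before computing MSE. Here is a NumPy sketch (in actual Keras code you would do the same with `keras.metrics.mean_squared_error` on `Y_true[:, -1]` and `Y_pred[:, -1]`):

```python
import numpy as np

def last_time_step_mse(Y_true, Y_pred):
    # MSE computed only on the outputs at the final time step
    return np.mean((Y_true[:, -1] - Y_pred[:, -1]) ** 2)

Y_true = np.zeros((2, 3, 10))  # [batch, time_steps, 10]
Y_pred = np.zeros((2, 3, 10))
Y_pred[:, -1] = 0.1            # error only at the last time step
print(last_time_step_mse(Y_true, Y_pred))  # ≈ 0.01
```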
Result: Validation MSE (for the last time step’s 10-step forecast) is ~0.006. This is 25% better than the sequence-to-vector model (0.008)!
Why it’s better:
- The loss includes terms for the RNN’s output at every time step.
- This means many more error gradients flow through the model during BPTT.
- Gradients don’t just flow through time (horizontally in unrolled view); they also flow from the output of each time step (vertically).
- This stabilizes and speeds up training.
Generating Arbitrarily Long Sequences (Page 511): You can combine this sequence-to-sequence approach (predicting next 10 values) with the iterative approach (feed predictions back as input) to generate very long sequences (e.g., for music or text generation, Chapter 16). May not be accurate for long-term forecasting but fine for generation.
Error Bars with MC Dropout (Bird Icon, page 511): For forecasting, it’s useful to have error bars. MC Dropout (from Chapter 11) can be used: add MC Dropout layers within recurrent cells, then at inference, run the model many times with dropout active and compute mean/std dev of predictions.
Handling Long Sequences - The Problems
The chapter states: “Simple RNNs can be quite good at forecasting time series or handling other kinds of sequences, but they do not perform as well on long time series or sequences.”
- Why? To train an RNN on long sequences, we must run it over many time steps. This means the unrolled RNN becomes a very deep network.
- Problems with Deep Unrolled RNNs:
- Unstable Gradients Problem (Vanishing/Exploding): Just like any deep neural network (as discussed in Chapter 11), it may suffer from gradients becoming too small or too large as they propagate back through many time steps. This makes training very slow or unstable.
- Forgetting Early Inputs (Limited Short-Term Memory): As an RNN processes a long sequence, the information from the earlier time steps has to be carried through many transformations in the hidden state. Due to these repeated transformations (and often the “squashing” nature of activation functions like tanh), the information from the initial inputs tends to get diluted or lost. The RNN’s state effectively “forgets” what happened much earlier in the sequence.
This section will address both these problems.
Fighting the Unstable Gradients Problem in RNNs
Many tricks from Chapter 11 for deep feedforward nets can also be used for RNNs:
- Good parameter initialization (Glorot, He).
- Faster optimizers (Adam, Nadam).
- Dropout (with caveats).
However, some techniques behave differently or have limitations with RNNs:
Nonsaturating Activation Functions (e.g., ReLU):
- May not help as much with unstable gradients in RNNs and can even make them more unstable.
- Why? The same weights (`Wₓ`, `Wᵧ`) are applied at every time step. If gradient descent updates these weights in a way that slightly increases the outputs at time step 1, this effect can compound at time step 2, then time step 3, and so on, potentially leading to exploding outputs. A nonsaturating activation function (like ReLU) doesn't prevent this explosion for positive values.
- Solution/Preference: A saturating activation function like `tanh` is often the default and preferred choice for simple RNN cells, because its bounded output (-1 to 1) helps control the explosion of activations. The vanishing gradient issue (due to tanh saturation) is then tackled by other means (like LSTMs/GRUs or a smaller learning rate).
- If training is unstable (gradients exploding), monitor gradient sizes (e.g., with TensorBoard) and perhaps use Gradient Clipping (as discussed in Chapter 11).
Batch Normalization (BN) with RNNs:
- Cannot be used as efficiently with RNNs as with deep feedforward nets.
- You cannot use standard BN between time steps (i.e., to normalize the hidden state `h(t)` before it's fed into the next time step `t+1` along with `x(t+1)`). This is because BN's moving averages of the mean/std are computed per feature across a batch, and the statistics of the hidden state can vary significantly from time step to time step; applying the same BN parameters (learned across all time steps) at each step doesn't work well.
- Where it can be used:
  - It is technically possible to add a BN layer inside a memory cell so it's applied at each time step (to both the `x(t)` and `h(t-1)` inputs to the cell). However, the same BN parameters (`γ`, `β`, and the moving averages `μ`, `σ`) would be used at each time step, regardless of the actual scale/offset of the inputs at that step. A 2015 paper by Laurent et al. (footnote 3) found this didn't yield good results when applied to the hidden states; it was only slightly beneficial when applied to the inputs `x(t)`.
  - What works (somewhat): Apply BN between recurrent layers (i.e., "vertically" in the unrolled view of Figure 15-7, not "horizontally" across time steps). You can do this in Keras by adding a `BatchNormalization` layer before each recurrent layer. Don't expect miracles, but it might help a bit.
Layer Normalization (LN) with RNNs (Page 512):
Introduced by Ba et al. in a 2016 paper (footnote 4). Often works better than BN with RNNs.
Key Difference from BN:
- BN normalizes across the batch dimension (for each feature, calculate mean/std over all instances in the batch).
- LN normalizes across the features dimension (for each instance, calculate mean/std over all features/units at the current time step).
Advantages for RNNs:
- LN can compute its required statistics (mean/std for normalization) on the fly at each time step, independently for each instance. It doesn’t need to average over a batch or estimate population stats with moving averages.
- This means it behaves the same way during training and testing.
Like BN, LN learns scale (γ) and offset (β) parameters per layer. In an RNN, it's typically used right after the linear combination of the inputs x(t) and hidden states h(t-1), before the activation function.

Implementing Layer Normalization in a Custom Keras Cell (Page 513): The book shows how to create a custom RNN cell (`LNSimpleRNNCell`) that incorporates Layer Normalization.
- It inherits from `keras.layers.Layer`.
- The constructor (`__init__`) sets up:
  - `self.state_size` and `self.output_size` (both equal to `units` for a simple RNN cell).
  - An internal `keras.layers.SimpleRNNCell(units, activation=None)` (note: `activation=None` because LN will happen before the final activation).
  - A `keras.layers.LayerNormalization()` layer.
  - The desired final activation function (e.g., `keras.activations.get("tanh")`).
- The `call(self, inputs, states)` method (which defines the cell's logic for one time step):
  - `outputs, new_states = self.simple_rnn_cell(inputs, states)`: get the linear combination from the internal simple RNN cell (`outputs` and `new_states` are the same here).
  - `norm_outputs = self.layer_norm(outputs)`: apply Layer Normalization to these pre-activation values.
  - `activated_norm_outputs = self.activation(norm_outputs)`: apply the final activation.
  - Return `activated_norm_outputs, [activated_norm_outputs]` (the output and the new state are the same).
- Using the Custom Cell (Page 514): wrap this custom cell in a `keras.layers.RNN` layer: `model.add(keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True, ...))`. The `RNN` layer handles the unrolling through time, feeding states, etc., using the logic defined in your custom cell.
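Assembled from the steps above, a sketch of the custom cell (close to the book's listing; the two-layer model wrapping it is illustrative):

```python
from tensorflow import keras

class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        # activation=None: Layer Normalization is applied before the activation
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)

    def call(self, inputs, states):
        # linear combination of x(t) and h(t-1); no activation applied yet
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        # for a simple RNN cell, the output and the new hidden state coincide
        return norm_outputs, [norm_outputs]

model = keras.models.Sequential([
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True,
                     input_shape=[None, 1]),
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
```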
Dropout in RNNs (Page 514):
- You can apply dropout to the inputs fed to the recurrent layer (at each time step).
- You can also apply dropout to the hidden states as they are passed between time steps. This is often called recurrent dropout.
- Most Keras recurrent layers (like `SimpleRNN`, `LSTM`, and `GRU`, but not the generic `keras.layers.RNN` wrapper unless the cell implements it) have two dropout hyperparameters:
  - `dropout`: dropout rate for the inputs to the cell.
  - `recurrent_dropout`: dropout rate for the hidden states.
- No need for a custom cell just for dropout if using built-in LSTM/GRU layers.
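For instance, a sketch with both dropout rates set on built-in LSTM layers (the 0.2 rates and layer sizes are illustrative choices, not from the book):

```python
from tensorflow import keras

# dropout regularizes the inputs x(t); recurrent_dropout regularizes the
# hidden states h(t-1) passed between time steps.
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1],
                      dropout=0.2, recurrent_dropout=0.2),
    keras.layers.LSTM(20, return_sequences=True,
                      dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
```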
With these techniques (proper initialization, careful choice of activation, Layer Normalization, recurrent dropout, Gradient Clipping), you can significantly alleviate the unstable gradients problem and train RNNs much more effectively, even fairly deep ones.
Key Takeaway for Fighting Unstable Gradients in RNNs: The same weights being applied repeatedly across time steps makes RNNs particularly susceptible to unstable gradients.
- Standard solutions like good initialization and optimizers help.
- Nonsaturating activations like ReLU need care; saturating ones like `tanh` are often safer defaults for basic RNN cells.
- Batch Normalization is tricky to apply effectively within the recurrence (horizontally); Layer Normalization is generally better.
- Recurrent dropout and gradient clipping are also valuable tools.
- What these techniques are ultimately trying to achieve: Create a stable learning environment where gradients can flow effectively through many time steps without vanishing or exploding, allowing the RNN to learn dependencies across time.
Tackling the Short-Term Memory Problem - Introduction
The Problem:
- When an RNN processes a sequence, the data goes through many transformations as the hidden state is updated at each time step.
- Due to these repeated transformations (and often the squashing effect of activation functions like `tanh`), information from the early parts of the sequence tends to get diluted or completely lost by the time the RNN reaches later time steps.
- The RNN's state h(t) might contain virtually no trace of the first inputs x(0), x(1), ... if t is large.
- This is a showstopper for tasks requiring understanding long-range dependencies (e.g., understanding the beginning of a long paragraph to make sense of the end, or translating a long sentence where the first word in the source affects the last word in the target).
- The book uses the analogy of Dory the fish from “Finding Nemo” trying to translate a long sentence – by the time she finishes reading it, she has no clue how it started.
The Solution: Cells with Long-Term Memory:
- To tackle this, various types of recurrent cells with mechanisms for long-term memory have been introduced.
- These have proven so successful that basic `SimpleRNN` cells are not used much anymore for complex sequence tasks.
- We'll look at the most popular ones: LSTM and GRU cells.
LSTM Cells - Long Short-Term Memory
History: Proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, and gradually improved by others (Alex Graves, Haşim Sak, Wojciech Zaremba).
As a Black Box (Page 515):
- From the outside, an LSTM cell can be used much like a basic `SimpleRNN` cell: just replace `keras.layers.SimpleRNN` with `keras.layers.LSTM`.
- Benefits:
  - Performs much better.
  - Training converges faster.
  - Can detect long-term dependencies in the data.
- Keras Implementation:

  ```python
  model = keras.models.Sequential([
      keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
      keras.layers.LSTM(20, return_sequences=True),
      keras.layers.TimeDistributed(keras.layers.Dense(10))
  ])
  ```

  Or, using the generic `RNN` layer with an `LSTMCell` (though the specialized `LSTM` layer is usually preferred, as it can use optimized GPU implementations; see Chapter 19):

  ```python
  model = keras.models.Sequential([
      keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences=True,
                       input_shape=[None, 1]),
      # ...
  ])
  ```
How Does an LSTM Cell Work? (The “Internals” - Figure 15-9, page 516): This is where the magic happens.
- Split State: An LSTM cell's state is split into two vectors:
  - h(t) (hidden state): the short-term state. This is also the output y(t) of the cell at time step t.
  - c(t) (cell state or memory cell): the long-term state.
- Core Idea: The network can learn what to store in the long-term state c(t), what to throw away from c(t-1), and what to read from c(t) to influence the current output h(t).
- Data Flow (left to right for c(t)):
  - The previous long-term state c(t-1) first goes through a forget gate.
  - Some new memories (selected by an input gate) are added to it via an addition operation.
  - The result is the new long-term state c(t), which is sent straight out (without further transformation) to be the long-term state for the next time step.
- Output Path (for h(t)):
  - After the addition operation (that produced c(t)), the long-term state c(t) is copied.
  - This copy is passed through a tanh function.
  - The result is then filtered by an output gate.
  - This produces the short-term state h(t) (which is also the cell's output y(t)).
The Gates (The Key Components - Page 516): The “forget,” “input,” and “output” gates are the controllers.
- The current input vector x(t) and the previous short-term state h(t-1) are fed into four different fully connected layers (these are internal to the LSTM cell, each with its own weights and biases).
- 1. Main Layer (outputs g(t)):
  - This layer analyzes x(t) and h(t-1). Its output g(t) (passed through tanh) represents candidate values/memories to be potentially added to the long-term state.
  - In a basic RNN, this g(t) would directly become h(t) and y(t). In an LSTM, its most important parts are stored in c(t).
- 2, 3, 4. Gate Controller Layers (output f(t), i(t), o(t)):
  - These three layers use the logistic (sigmoid) activation function, so their outputs range from 0 to 1.
  - These outputs are then used in element-wise multiplication operations, acting like gates:
    - Output near 0: "closes" the gate (blocks information flow).
    - Output near 1: "opens" the gate (lets information through).
  - Forget Gate (controlled by f(t)): c(t) ← f(t) ⊗ c(t-1) + ... It decides which parts of the previous long-term state c(t-1) should be erased or kept. If an element of f(t) is 0, the corresponding part of c(t-1) is forgotten.
  - Input Gate (controlled by i(t)): c(t) ← ... + i(t) ⊗ g(t) It decides which parts of the candidate memories g(t) (from the main layer) should be added to the long-term state.
  - Output Gate (controlled by o(t)): h(t) = y(t) = o(t) ⊗ tanh(c(t)) It decides which parts of the current long-term state c(t) (after passing through tanh) should be read out and output as the short-term state h(t) and the cell output y(t).
Equation 15-3: LSTM Computations (Page 517): Summarizes the math for a single instance:
- i(t) = σ(Wₓᵢᵀ x(t) + Wₕᵢᵀ h(t-1) + bᵢ)  (input gate)
- f(t) = σ(Wₓfᵀ x(t) + Wₕfᵀ h(t-1) + b_f)  (forget gate; TensorFlow initializes b_f to 1s to prevent forgetting everything at the start of training)
- o(t) = σ(Wₓₒᵀ x(t) + Wₕₒᵀ h(t-1) + bₒ)  (output gate)
- g(t) = tanh(Wₓgᵀ x(t) + Wₕgᵀ h(t-1) + b_g)  (main layer transforming inputs/previous state)
- c(t) = f(t) ⊗ c(t-1) + i(t) ⊗ g(t)  (new long-term state: forget old, add new)
- y(t) = h(t) = o(t) ⊗ tanh(c(t))  (output/new short-term state: filter the long-term state)
- Each gate and the main layer have their own weight matrices for x(t) (e.g., Wₓᵢ) and h(t-1) (e.g., Wₕᵢ), and their own bias terms (e.g., bᵢ). These are all learned.
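A NumPy sketch of one LSTM time step mirroring Equation 15-3 (the dict packing of the weights W, U, b is an illustrative convention of this sketch, not the book's code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step for a single instance (Equation 15-3).
    W[k]: input weights, U[k]: recurrent weights, b[k]: biases,
    for k in 'i' (input gate), 'f' (forget), 'o' (output), 'g' (main layer)."""
    i = sigmoid(W['i'].T @ x + U['i'].T @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'].T @ x + U['f'].T @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['o'].T @ x + U['o'].T @ h_prev + b['o'])  # output gate
    g = np.tanh(W['g'].T @ x + U['g'].T @ h_prev + b['g'])  # candidate memories
    c = f * c_prev + i * g       # new long-term state: forget old, add new
    h = o * np.tanh(c)           # new short-term state (= the output y(t))
    return h, c
```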
What the LSTM cell is ultimately trying to achieve: By having these gate mechanisms, the LSTM can learn:
- When to let new important information in (input gate).
- What old information to forget (forget gate).
- What information to preserve in its long-term memory cell c(t) over many time steps.
- What part of its current memory to output or use for its short-term state h(t) (output gate).
- This allows it to capture long-term patterns and dependencies much more effectively than simple RNNs.
Peephole Connections (Felix Gers & Jürgen Schmidhuber, 2000 - Page 518):
- An LSTM variant where the gate controllers can also "peek" at the long-term state: c(t-1) for the input/forget gates, c(t) for the output gate, in addition to x(t) and h(t-1).
- This can sometimes improve performance, but not always.
- The standard Keras `LSTMCell` doesn't support them, but `tf.keras.experimental.PeepholeLSTMCell` does.
1. Why Sigmoid (Logistic) for the Gates (f(t), i(t), o(t))?
What a Gate Does: A gate’s purpose is to control the flow of information. It needs to decide “how much” of some signal should be allowed to pass through. This is like a valve that can be fully open, fully closed, or partially open.
Sigmoid Output Range (0 to 1): The sigmoid function σ(z) = 1 / (1 + exp(-z)) outputs values strictly between 0 and 1.
- Output near 0: "close the gate" – block most/all of the information.
- Output near 1: "open the gate" – let most/all of the information pass through.
- Output around 0.5: "partially open the gate" – let some proportion of the information pass.

Use in Element-wise Multiplication: The outputs of these sigmoid gates are then used in element-wise multiplications (the ⊗ symbol in Equation 15-3):
- f(t) ⊗ c(t-1): how much of the old long-term state to keep (if f(t) is near 1) or forget (if f(t) is near 0).
- i(t) ⊗ g(t): how much of the new candidate memory g(t) to let in.
- o(t) ⊗ tanh(c(t)): how much of the (processed) long-term state to output.

Differentiability: Sigmoid is differentiable, which is essential for backpropagation to learn the weights of the gate controller layers.
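The gating effect can be seen with three tiny numbers (the values are arbitrary, chosen only to show the three regimes):

```python
import numpy as np

# Element-wise gating: a sigmoid output near 0 blocks a component,
# near 1 passes it through, and in between scales it proportionally.
gate = np.array([0.0, 0.5, 1.0])   # e.g. f(t), one value per state component
state = np.array([4.0, 4.0, 4.0])  # e.g. c(t-1)
kept = gate * state                # the element-wise ⊗ operation
# kept == [0.0, 2.0, 4.0]: blocked, halved, passed through unchanged
```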
What if we used Tanh for Gates?
- Tanh outputs values between -1 and 1.
- If a gate output were -1 and you multiplied it by a value, it would invert the sign of that value and also scale it. That isn't what "gating" or "controlling flow" means in an intuitive sense: we want to scale information between "none of it" (0) and "all of it" (1).
- It would be possible to rescale tanh's output (e.g., (tanh(z) + 1) / 2 maps it to 0–1), but sigmoid naturally provides the 0–1 range needed for this multiplicative gating effect.
What if we used ReLU for Gates?
- ReLU outputs 0 for negative inputs and z for positive inputs (unbounded positive output).
- An output of 0 would effectively "close the gate."
- An output > 1 would not just "open the gate" but also amplify the signal passing through. This could lead to exploding values and instability, which is precisely what LSTMs are trying to manage better than simple RNNs.
- The gating mechanism generally needs a bounded "how much" signal, and sigmoid's 0–1 range is perfect for this.
2. Why Tanh for the Main Layer (g(t)) and for Processing c(t) before the Output Gate?
Main Layer g(t): The layer g(t) = tanh(W_xgᵀ x(t) + W_hgᵀ h(t-1) + b_g) computes the "candidate values" to be potentially added to the long-term cell state c(t).

Processing c(t) for Output: The output is y(t) = o(t) ⊗ tanh(c(t)); the long-term state c(t) is passed through tanh before being gated by o(t).

Tanh Output Range (-1 to 1):
- The tanh function squashes its input to a range between -1 and 1.
- What this achieves for g(t) (candidate values): it keeps the potential updates to the cell state bounded, which helps prevent the cell state c(t) from growing uncontrollably large (exploding).
- What this achieves for tanh(c(t)) before output: it ensures that the values being considered for output are in a bounded range before the output gate decides how much to pass through.

Zero-Centered Output: The output of tanh is roughly centered around 0. This can be beneficial for learning in subsequent layers or time steps, as discussed in Chapter 10 (a property that sometimes helps convergence speed compared to sigmoid, which is centered at 0.5).

Differentiability: Tanh is also differentiable.
What if we used Sigmoid for g(t) or tanh(c(t))?
- Sigmoid outputs 0 to 1. If you continually add positive values (from sigmoid) to the cell state, the cell state could still grow very large without bound (unless the forget gate is very aggressive).
- Tanh, by allowing both positive and negative values (and being zero-centered), gives the network the flexibility to increment or decrement values in the cell state c(t) and to output values centered around zero. This can lead to more stable dynamics.
What if we used ReLU for g(t) or tanh(c(t))?
- For g(t): if ReLU were used, g(t) could be unbounded positive. If the input gate i(t) is open, this could lead to the cell state c(t) growing indefinitely, causing exploding values. LSTMs are designed to prevent exactly this kind of instability, which plagues simple RNNs with ReLU.
- For tanh(c(t)) before output: using ReLU here would also be problematic if c(t) takes on large values, as the output h(t) could then become unbounded before being used by subsequent layers or calculations. The tanh here provides a final "squashing" to keep things in a controlled range.
In Summary:
- Sigmoid for Gates (0 to 1 range):
- Ultimately trying to achieve: A probabilistic-like control over information flow – “block” (0), “pass through” (1), or “partially pass through” (between 0 and 1). This is ideal for multiplicative gating.
- Tanh for Candidate Values and Pre-Output Processing (-1 to 1 range):
- Ultimately trying to achieve: Keep the values that modify the cell state and the values considered for output bounded and roughly zero-centered. This contributes to the stability of the LSTM cell and helps prevent the cell state from exploding, which is a key part of solving the long-term memory problem.
The choice of these specific activation functions is a result of careful design and empirical evidence showing that this combination allows LSTMs to effectively learn, store, and retrieve information over long sequences while maintaining training stability. They are not arbitrary choices but rather integral to the LSTM’s mechanism for managing information flow and memory.
GRU Cells - Gated Recurrent Unit
Proposed by Kyunghyun Cho et al. in 2014 (in the same paper that introduced the Encoder-Decoder architecture we briefly touched upon).
What it is: A simplified version of the LSTM cell.
Performance: It seems to perform just as well as LSTMs on many tasks (footnote 12 refers to a 2015 paper by Greff et al. that found LSTM variants perform roughly the same). This, along with its simplicity, explains its growing popularity.
Main Simplifications Compared to LSTM (Figure 15-10, page 519):
- Merged State Vectors: Both state vectors (c(t) for long-term and h(t) for short-term in the LSTM) are merged into a single state vector h(t) in the GRU.
- Single Update Gate z(t) Controls Both Forget and Input:
  - The LSTM has a separate forget gate f(t) and input gate i(t); the GRU has a single update gate z(t).
  - The state update is h(t) = (1 - z(t)) ⊗ h(t-1) + z(t) ⊗ g(t) (matching Equation 15-4 below):
    - If z(t) ≈ 1 (update gate "active for the new candidate"): h(t) ≈ 0 ⊗ h(t-1) + 1 ⊗ g(t) ≈ g(t), i.e., forget the previous state and take the new candidate.
    - If z(t) ≈ 0: h(t) ≈ 1 ⊗ h(t-1) + 0 ⊗ g(t) ≈ h(t-1), i.e., keep the old state and ignore the candidate.
  - What the update gate z(t) is ultimately trying to achieve: it learns to decide how much of the previous state h(t-1) to keep versus how much of the new candidate state g(t) to incorporate.
- No Output Gate: The full state vector h(t) is output at every time step. (In the LSTM, the output gate o(t) controlled which parts of tanh(c(t)) were output as h(t).)
- New Reset Gate r(t):
  - There's a new gate controller r(t) (also using sigmoid, outputting 0 to 1).
  - This gate controls which parts of the previous state h(t-1) are shown to the main layer g(t) when calculating the candidate state.
  - The candidate state is g(t) = tanh(W_xgᵀ x(t) + W_hgᵀ (r(t) ⊗ h(t-1)) + b_g).
  - If r(t) is close to 0 for some components, those components of h(t-1) are effectively ignored when computing the new candidate g(t).
  - What the reset gate r(t) is ultimately trying to achieve: it allows the cell to "forget" or ignore parts of its previous state that are deemed irrelevant for computing the next candidate state, before the update gate decides how to blend that candidate with the old state.
Equation 15-4: GRU Computations (Page 520): Summarizes the math for a single instance.
- z(t) = σ(W_xzᵀ x(t) + W_hzᵀ h(t-1) + b_z)  (update gate)
- r(t) = σ(W_xrᵀ x(t) + W_hrᵀ h(t-1) + b_r)  (reset gate)
- g(t) = tanh(W_xgᵀ x(t) + W_hgᵀ (r(t) ⊗ h(t-1)) + b_g)  (candidate hidden state)
- h(t) = (1 - z(t)) ⊗ h(t-1) + z(t) ⊗ g(t)  (final hidden state for the time step)
- Parameters to Learn: W_xz, W_hz, b_z; W_xr, W_hr, b_r; and W_xg, W_hg, b_g. There are fewer weight matrices than in the LSTM because there are fewer gates and only one state vector.
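A NumPy sketch of one GRU time step mirroring Equation 15-4 (as with the LSTM sketch, the dict packing of W, U, b is an illustrative convention):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x, h_prev, W, U, b):
    """One GRU step for a single instance (Equation 15-4).
    W, U, b are keyed by 'z' (update gate), 'r' (reset gate), 'g' (candidate)."""
    z = sigmoid(W['z'].T @ x + U['z'].T @ h_prev + b['z'])        # update gate
    r = sigmoid(W['r'].T @ x + U['r'].T @ h_prev + b['r'])        # reset gate
    g = np.tanh(W['g'].T @ x + U['g'].T @ (r * h_prev) + b['g'])  # candidate
    # blend: z near 0 keeps the old state, z near 1 takes the candidate
    return (1 - z) * h_prev + z * g
```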
Keras Implementation:
- Keras provides a `keras.layers.GRU` layer (based on `keras.layers.GRUCell`).
- Using it is just a matter of replacing `SimpleRNN` or `LSTM` with `GRU` in your `Sequential` model.
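For instance, the chapter's two-layer forecasting model with GRU swapped in (a sketch following that pattern):

```python
from tensorflow import keras

# Same architecture as the LSTM version, with GRU layers swapped in.
model = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
```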
Why LSTMs and GRUs are Successful:
- “LSTM and GRU cells are one of the main reasons behind the success of RNNs.”
- They can tackle much longer sequences than simple RNNs.
- However, they still have fairly limited short-term memory and can have a hard time learning long-term patterns in sequences of 100 time steps or more (e.g., very long audio samples, long time series, long sentences).
One way to help LSTMs/GRUs handle even longer sequences is to shorten the input sequences they directly see, for example, by using 1D convolutional layers first.
Key Takeaway for GRUs: GRUs are a streamlined version of LSTMs that often perform just as well but are computationally simpler (fewer parameters, one state vector).
- They use an update gate (to decide how much of the past state to keep vs. new candidate state) and a reset gate (to decide how much of the past state influences the current candidate state).
- What GRUs are ultimately trying to achieve: Similar to LSTMs, they aim to capture dependencies over longer sequences by using gating mechanisms to control information flow and memory, but with a more compact architecture.
This covers the main advanced recurrent cell types (LSTM and GRU) designed to combat the short-term memory problem of simple RNNs. The next step in the chapter is to see how to combine these with other types of layers, like 1D convolutions, to process sequences even more effectively.
Using 1D Convolutional Layers to Process Sequences
While LSTMs and GRUs are much better than simple RNNs at handling longer sequences, they can still struggle with very long dependencies (e.g., hundreds or thousands of time steps, like in raw audio or very long text documents).
The Idea: One way to solve this is to shorten the input sequences before feeding them to the recurrent layers. How? By using 1D convolutional layers.
Recall 2D Convolutional Layers (from Chapter 14, though not yet covered by us in detail):
- They work by sliding small kernels (filters) across an image (a 2D grid of pixels).
- They produce multiple 2D feature maps (one per kernel).
- Each kernel learns to detect a specific local pattern (e.g., an edge, a texture).
1D Convolutional Layers for Sequences:
- Similarly, a 1D convolutional layer slides several kernels (filters) across an input sequence (a 1D array of features over time).
- It produces a 1D feature map per kernel.
- What each 1D kernel is ultimately trying to achieve: It learns to detect a single, very short sequential pattern (no longer than the kernel size). For example, if processing text character by character, a kernel of size 3 might learn to detect common trigrams like “the” or “ing”.
- If you use, say, 10 kernels, the layer’s output will be composed of 10 1-dimensional sequences (feature maps). You can view this output as a single sequence where each time step now has 10 features (instead of, say, 1 feature if it was a univariate time series).
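A quick shape check of that claim (the kernel count and sequence length here are arbitrary):

```python
import numpy as np
from tensorflow import keras

# 10 kernels over a univariate sequence: each output time step has 10
# features, one per kernel/feature map.
conv = keras.layers.Conv1D(filters=10, kernel_size=3, padding="same")
x = np.random.rand(1, 50, 1).astype("float32")  # [batch, time steps, features]
y = conv(x)  # shape (1, 50, 10): same length (stride 1, "same" padding)
```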
Building Hybrid Networks:
- This means you can build neural networks composed of a mix of:
- Recurrent layers (LSTMs, GRUs).
- 1D convolutional layers.
- Even 1D pooling layers (which downsample the sequence, similar to 2D pooling for images).
Impact on Sequence Length:
- If a 1D convolutional layer uses a stride of 1 and "same" padding, the output sequence will have the same length as the input sequence.
- If it uses "valid" padding or a stride greater than 1, the output sequence will be shorter than the input sequence.
  - This downsampling of the sequence by the convolutional layer is the key benefit here.
  - The convolutional layer can learn to preserve useful local information while dropping unimportant details, effectively compressing the sequence.
  - The subsequent recurrent layers then have a shorter sequence to process, making it easier for them to detect longer-range patterns within that compressed representation.
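The shortening can be verified directly (filter/kernel/stride values below match the chapter's Conv1D example; the input length of 50 is arbitrary):

```python
import numpy as np
from tensorflow import keras

# "valid" padding with stride > 1 shortens the sequence:
# output_length = (input_length - kernel_size) // strides + 1
conv = keras.layers.Conv1D(filters=20, kernel_size=4, strides=2,
                           padding="valid")
x = np.random.rand(1, 50, 1).astype("float32")
y = conv(x)  # (50 - 4) // 2 + 1 = 24 time steps
```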
Keras Example (Page 521): The book shows modifying the time series forecasting model to include a 1D convolutional layer at the beginning.
```python
model = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid",
                        input_shape=[None, 1]),  # input is [batch, time steps, features]
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))  # seq-to-seq, 10-step forecast
])
```

- `keras.layers.Conv1D(...)`:
  - `filters=20`: it will learn 20 different short patterns.
  - `kernel_size=4`: each pattern is 4 time steps long.
  - `strides=2`: the kernel slides by 2 steps each time, effectively downsampling the input sequence by a factor of 2.
  - `padding="valid"`: no padding, so the output sequence length is (input_length - kernel_size) / strides + 1.
  - `input_shape=[None, 1]`: takes sequences of any length, with 1 feature per time step.
- Target Adjustment: Because the `Conv1D` layer shortens the sequence (and the first output corresponds to input steps 0–3), the target sequences `Y_train` need to be cropped and downsampled to match the output length and alignment of the `Conv1D` layer. The slicing `Y_train[:, 3::2]` does this:
  - `3:`: skips the first 3 time steps in the target (because the first Conv1D output depends on inputs 0–3, it effectively predicts starting from a later point relative to the original targets).
  - `::2`: downsamples the targets by a factor of 2 to match `strides=2`.
Result: The book states that if you train and evaluate this model, “you will find that it is the best model so far. The convolutional layer really helps.”
It even mentions that it’s possible to use only 1D convolutional layers (and drop recurrent layers entirely) for sequence processing, which leads to the WaveNet architecture.
Key Takeaway for 1D Convolutions in Sequence Processing: 1D convolutional layers can be used as a preprocessing step for RNNs (or even as a replacement for them in some cases like WaveNet).
- What they are ultimately trying to achieve when used with RNNs:
- Feature Extraction: Learn to detect relevant local patterns in the input sequence.
- Downsampling/Shortening Sequences: By using strides, they can reduce the length of the sequence fed into the RNN layers. This allows the RNNs (like LSTMs/GRUs) to focus their memory capacity on longer-range dependencies within a more compressed, feature-rich representation of the original sequence.
(Page 521-522: WaveNet)
This is an example of a powerful architecture that uses only 1D convolutional layers (no recurrent layers) to process very long sequences.
Introduced in a 2016 paper by van den Oord et al. (DeepMind) for generative modeling of raw audio.
Architecture (Figure 15-11, page 522):
- It stacks 1D convolutional layers.
- Crucially, it doubles the dilation rate at every layer.
- Dilation Rate: Controls how spread apart the kernel’s input taps are.
- Layer 1: Dilation rate 1 (looks at, say, 2 adjacent time steps).
- Layer 2: Dilation rate 2 (looks at inputs 2 steps apart, e.g., time t and t-2). Its receptive field is wider.
- Layer 3: Dilation rate 4 (looks at inputs 4 steps apart).
- And so on (1, 2, 4, 8, 16, …).
- What doubling dilation rates achieves: The receptive field of the network (how far back in time it can “see”) grows exponentially with depth. This allows the network to capture very long-range dependencies efficiently.
- Lower layers learn short-term patterns.
- Higher layers learn long-term patterns by combining features from lower layers that already cover wider time spans.
- The paper stacked blocks of 10 such convolutional layers (dilation rates 1, 2, …, 512), and repeated these blocks.
- Used “causal” padding to ensure convolutions don’t peek into the future.
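The exponential growth of the receptive field can be checked with a little arithmetic (for kernel size 2, each dilated layer extends the receptive field by its dilation rate; the helper function below is illustrative, not from the book):

```python
def receptive_field(dilation_rates, kernel_size=2):
    # each dilated causal conv layer adds (kernel_size - 1) * d past time steps
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

one_block = [2 ** i for i in range(10)]  # dilation rates 1, 2, 4, ..., 512
# receptive_field(one_block) == 1024: a single block already sees 1,024 steps,
# and repeating blocks (as the paper does) extends this further.
```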
Keras Implementation of a Simplified WaveNet (Page 522):
```python
model = keras.models.Sequential()
model.add(keras.layers.InputLayer(input_shape=[None, 1]))
for rate in (1, 2, 4, 8) * 2:  # two blocks of these dilation rates
    model.add(keras.layers.Conv1D(filters=20, kernel_size=2, padding="causal",
                                  activation="relu", dilation_rate=rate))
model.add(keras.layers.Conv1D(filters=10, kernel_size=1))  # output layer
```

- `padding="causal"`: ensures that the output at time t only depends on inputs up to time t (no peeking into the future).
- The output layer is a `Conv1D` with `kernel_size=1`. This is like applying a `Dense` layer independently at each time step to the features extracted by the dilated convolutions.
Performance:
- The book states that these last two models (RNN with initial Conv1D, and the simplified WaveNet) offer the best performance so far on the time series forecasting task.
- The original WaveNet achieved state-of-the-art on audio generation (text-to-speech, music), handling tens of thousands of time steps per second of audio. This is something LSTMs/GRUs would struggle immensely with.
Key Takeaway for WaveNet: WaveNet demonstrates that stacked 1D convolutional layers with exponentially increasing dilation rates can be extremely effective at capturing long-range dependencies in sequences, offering an alternative to RNNs for very long sequences.
- What it’s ultimately trying to achieve: Efficiently model long-term patterns by having a hierarchy of convolutions that look at progressively larger time scales, without the computational cost of very large kernels or the memory limitations of very deep unrolled RNNs.
This concludes the main content of Chapter 15! It’s taken us from the basic concept of recurrence to sophisticated cells like LSTMs and GRUs, and finally to using CNNs (either with RNNs or alone like WaveNet) for sequence processing.
The core theme has been how to enable neural networks to effectively “remember” and utilize information from previous parts of a sequence to make current predictions or classifications.