Chapter 15: Processing Sequences Using RNNs and CNNs
The World of Sequences
The chapter opens with a vivid example: an outfielder catching a ball. This involves:
- Anticipating the future: Predicting the ball’s trajectory.
- Adapting to new information: Tracking the ball and adjusting movements.
This ability to process information over time and predict what comes next is something humans do constantly (finishing a friend’s sentence, smelling coffee brewing). This chapter introduces Recurrent Neural Networks (RNNs) as a class of neural networks designed for precisely this kind of task.
What RNNs are good for:
- Analyzing time series data (e.g., stock prices, weather patterns).
- Anticipating trajectories (e.g., in autonomous driving).
- Working on sequences of arbitrary lengths (unlike the fixed-sized inputs we’ve seen so far with MLPs and standard CNNs). This is a key differentiator.
- Processing sentences, documents, or audio samples.
- Hugely important for Natural Language Processing (NLP) tasks like automatic translation or speech-to-text.
Chapter Focus (for Chapter 15):
- Fundamental concepts of RNNs.
- How to train them using backpropagation through time (BPTT).
- Using RNNs to forecast a time series.
- Exploring two main difficulties RNNs face:
- Unstable gradients (vanishing/exploding, as discussed in Chapter 11), and techniques to alleviate them (recurrent dropout, recurrent layer normalization).
- Limited short-term memory, and how to extend it using LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells.
Alternatives to RNNs for Sequential Data (Page 497, bottom):
- The chapter acknowledges that RNNs aren’t the only way to handle sequential data.
- For small sequences, a regular dense network (MLP) might suffice.
- For very long sequences (audio, long text), Convolutional Neural Networks (CNNs) can also work surprisingly well. The chapter will discuss this and implement WaveNet, a CNN architecture for long sequences.
- (Chapter 16 will continue with RNNs for NLP and then move to attention mechanisms).
What this chapter is ultimately trying to achieve: To introduce a new type of neural network architecture capable of understanding and making predictions based on ordered sequences of data, where the order and context over time are crucial.
Recurrent Neurons and Layers
Up to now, we’ve focused on feedforward neural networks, where activations flow in one direction: input -> hidden layers -> output.
Recurrent Neural Network (RNN) Difference: An RNN looks like a feedforward network, but it also has connections pointing backward. This creates a loop, allowing information to persist.
Simplest RNN: A Single Recurrent Neuron (Figure 15-1, page 498):
- Left diagram (rolled): A single neuron receives an input `x`, produces an output `y`, and, importantly, sends its own output back to itself as an input for the next step.
- At each time step `t` (also called a frame):
  - The recurrent neuron receives the current external input `x(t)`.
  - It also receives its own output from the previous time step, `y(t-1)`. This `y(t-1)` is the "memory" or "state" from the past.
  - At the very first time step (`t=0`), there's no previous output, so `y(-1)` is generally initialized to 0.
- Right diagram (unrolled through time): This is a crucial visualization. It shows the same recurrent neuron represented at different points in time (`t-3`, `t-2`, `t-1`, `t`).
  - At time `t`, the neuron takes `x(t)` and its own output at `t-1` (which is `y(t-1)`) to produce `y(t)`.
  - This "unrolling" helps visualize how information flows and how gradients will be calculated.
- What this recurrent connection is ultimately trying to achieve: It allows the neuron to maintain a "state" or "memory" of past inputs, influencing its current output based not just on the current input but also on what it has seen before.
Layer of Recurrent Neurons (Figure 15-2, page 499):
You can have a whole layer of these recurrent neurons.
At each time step `t`:
- Every neuron in the layer receives the entire input vector `x(t)`.
- Every neuron also receives the entire output vector `y(t-1)` from all neurons in this same layer at the previous time step.

Weights: Each recurrent neuron (or the layer as a whole) now has two sets of weights:
- `Wₓ`: Weights for the current inputs `x(t)`.
- `Wᵧ`: Weights for the outputs of the previous time step, `y(t-1)`.

Equation 15-1: Output of a recurrent layer for a single instance at time `t`:
- `y(t) = φ( Wₓᵀ x(t) + Wᵧᵀ y(t-1) + b )`
- `φ` is the activation function (e.g., tanh is common; ReLU can be used but needs care, as mentioned in Ch 11).
- `b` is the bias vector.
- What this equation is ultimately trying to achieve: It formalizes how the current output `y(t)` is a function of both the current input `x(t)` and the collective state/output of the layer from the previous time step, `y(t-1)`.

Equation 15-2: Output for a mini-batch:
- `Y(t) = φ( X(t) Wₓ + Y(t-1) Wᵧ + b )`
- This is the vectorized form for a whole mini-batch `X(t)`.
- `Y(t)`: Matrix of outputs for all instances in the batch at time step `t`.
- `X(t)`: Matrix of inputs for all instances at time step `t`.
- `Y(t-1)`: Matrix of outputs from the previous time step.
- A common convention: `Wₓ` and `Wᵧ` are often concatenated into a single weight matrix `W`, and the inputs `[X(t) Y(t-1)]` are concatenated before multiplication.
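Equation 15-2 is a single matrix step, and can be sketched in a few lines of NumPy. The shapes here (4 instances, 3 input features, 5 recurrent units) are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size, n_inputs, n_units = 4, 3, 5

Wx = rng.normal(size=(n_inputs, n_units))  # weights for the current inputs X(t)
Wy = rng.normal(size=(n_units, n_units))   # weights for the previous outputs Y(t-1)
b = np.zeros(n_units)                      # bias vector

X_t = rng.normal(size=(batch_size, n_inputs))
Y_prev = np.zeros((batch_size, n_units))   # Y(-1) is initialized to 0

# Equation 15-2: Y(t) = φ( X(t) Wx + Y(t-1) Wy + b ), with φ = tanh
Y_t = np.tanh(X_t @ Wx + Y_prev @ Wy + b)
print(Y_t.shape)  # (4, 5): one 5-dim output vector per instance
```

Note the outputs are bounded in (-1, 1) because of the tanh activation.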
Memory and State (Page 500):
- Since the output `Y(t)` depends on all inputs from `X(0)` up to `X(t)`, the RNN has a form of memory.
- A part of a neural network that preserves some state across time steps is called a memory cell (or simply a "cell").
- A single recurrent neuron or a layer of them is a very basic cell, typically capable of learning only short patterns (e.g., ~10 steps). More complex cells (LSTM, GRU) can learn longer patterns.
- Cell State `h(t)`: In general, a cell's state at time `t` is denoted `h(t)` ("h" for hidden). It's a function of the previous state `h(t-1)` and the current input `x(t)`: `h(t) = f(h(t-1), x(t))`.
- Output `y(t)`: The cell's output `y(t)` is also a function of the previous state and the current inputs.
- Figure 15-3: For basic recurrent cells like `SimpleRNN` in Keras, the output `y(t)` is simply equal to the hidden state `h(t)`. However, for more complex cells like LSTMs (which we'll see later), the hidden state `h(t)` and the output `y(t)` can be different: the cell may maintain an internal state richer than what it chooses to output at each time step.
Input and Output Sequences (Figure 15-4, page 501): RNNs can handle various types of input/output sequence configurations:
- Sequence-to-Sequence (Top-Left):
  - Input: A sequence of vectors `X(0), X(1), ..., X(T)`.
  - Output: A sequence of vectors `Y(0), Y(1), ..., Y(T)`.
  - Example: Predicting stock prices (input the last N days' prices, output the prices shifted one day into the future).
  - What it's ultimately trying to achieve: Transform an input sequence into an output sequence, often for prediction or transduction.
- Sequence-to-Vector (Top-Right):
- Input: A sequence of vectors.
- Output: A single vector (usually from the last time step’s output). All other outputs are ignored.
- Example: Sentiment analysis of a movie review (input sequence of words, output a single sentiment score like -1 to +1).
- What it’s ultimately trying to achieve: Summarize or classify an entire input sequence into a single vector representation or decision.
- Vector-to-Sequence (Bottom-Left):
- Input: A single fixed vector, fed repeatedly at each time step (or just at the first time step with subsequent inputs being zero).
- Output: A sequence of vectors.
- Example: Image captioning (input an image or its CNN feature vector, output a sequence of words forming a caption).
- What it’s ultimately trying to achieve: Generate a sequence based on a static input.
- Encoder-Decoder (Bottom-Right):
- A sequence-to-vector network (the encoder) is followed by a vector-to-sequence network (the decoder).
- Example: Neural Machine Translation (NMT).
- Encoder reads an input sentence (e.g., English) and compresses it into a single vector representation (often called the “context vector” or “thought vector”).
- Decoder takes this context vector and generates an output sentence (e.g., French), word by word.
- This two-step model generally works much better for tasks like translation than a single sequence-to-sequence RNN because the meaning of the whole input sentence needs to be captured before starting to generate the translation (the end of an English sentence can affect the beginning of its French translation).
- (More complex Encoder-Decoder details in Chapter 16).
The core idea so far is that recurrent neurons have a “loop” that allows them to pass their previous output (their state) as an input to their next computation, enabling them to process sequences and maintain a memory of past information.
The next logical step in the chapter is Training RNNs (Page 502).
Training RNNs - Backpropagation Through Time (BPTT)
So, we have this recurrent structure. How do we train it? How do we adjust Wₓ, Wᵧ, and b?
The Trick: Unroll and Use Regular Backpropagation:
- The core idea is surprisingly simple:
  - Unroll the RNN through time for the length of your input sequences (as we saw in Figures 15-1 and 15-2). Unrolled, it looks just like a very deep feedforward network, where each "time slice" of the RNN becomes a layer.
  - Crucially, the weights (`Wₓ`, `Wᵧ`, `b`) are shared across all these time-step "layers" in the unrolled network. The `Wₓ` used at time `t=1` is the same `Wₓ` used at `t=2`, `t=3`, etc.
  - Then, you simply apply regular backpropagation (as discussed in Chapter 10) to this unrolled network.
- This strategy is called Backpropagation Through Time (BPTT).
The BPTT Process (Figure 15-5, page 503):
- Forward Pass:
  - Feed the input sequence through the unrolled network, time step by time step (dashed arrows in the figure).
  - At each time step, calculate the outputs `Y(0), Y(1), ..., Y(T)` (where `T` is the maximum time step, i.e., the sequence length).
- Evaluate Loss:
  - The output sequence `Y(0)...Y(T)` is evaluated using a cost function `C(Y(0), ..., Y(T))`.
  - Important note: the cost function might ignore some outputs. For example:
    - In a sequence-to-vector RNN (like sentiment analysis), the cost function might only care about the very last output `Y(T)`.
    - In Figure 15-5, the cost function `C(Y(2), Y(3), Y(4))` uses only the last three outputs.
- Backward Pass:
  - The gradients of the cost function are then propagated backward through the unrolled network (solid arrows in the figure).
  - Gradients flow backward through all the outputs that were used by the cost function. In Figure 15-5, gradients flow from `Y(4)`, `Y(3)`, and `Y(2)`; they do not flow from `Y(1)` or `Y(0)`, because those were not part of the cost calculation in this example.
- Update Parameters:
  - Since the same parameters `Wₓ`, `Wᵧ`, and `b` are used at each time step in the unrolled network, the gradients calculated for them at each time step are summed (or averaged) across all time steps.
  - These aggregated gradients are then used to update `Wₓ`, `Wᵧ`, and `b` with a standard gradient descent step.
  - What this summing of gradients across time achieves: It ensures that the parameter updates reflect how those parameters contributed to the error across the entire sequence, not just at one particular time step.
Keras Handles the Complexity:
- The good news is that `tf.keras` takes care of all this BPTT complexity for you when you use its recurrent layers. You define the recurrent layer, and during `fit()`, Keras handles the unrolling, BPTT, and parameter updates.
Key Takeaway for BPTT: BPTT is the standard way to train RNNs. It cleverly transforms the problem of training a network with loops into training a deep feedforward network (the unrolled version) where weights are shared across “layers” (time steps). Regular backpropagation can then be applied, with the understanding that gradients for shared weights are aggregated across all time steps.
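The "gradients for shared weights are summed across time steps" idea can be checked on a toy example. The snippet below is an illustrative sketch, not the book's code: a single linear recurrent neuron (activation omitted for simplicity) with made-up weights `w`, `v`, where the loss is simply the final output. The hand-rolled BPTT gradient matches a finite-difference estimate:

```python
def forward(w, v, xs):
    # Unrolled linear recurrent neuron: y_t = w * x_t + v * y_{t-1}, with y_{-1} = 0.
    y, ys = 0.0, []
    for x in xs:
        y = w * x + v * y
        ys.append(y)
    return ys

def bptt_grads(w, v, xs):
    # Loss = final output y_T. Walk backward through the unrolled steps,
    # SUMMING the per-step gradients for the shared weights w and v.
    ys = forward(w, v, xs)
    dw = dv = 0.0
    dy = 1.0                       # dLoss/dy_t, starting at the last step
    for t in range(len(xs) - 1, -1, -1):
        dw += dy * xs[t]           # local gradient of y_t w.r.t. w
        y_prev = ys[t - 1] if t > 0 else 0.0
        dv += dy * y_prev          # local gradient of y_t w.r.t. v
        dy *= v                    # propagate to the previous time step
    return dw, dv

xs = [0.5, -1.0, 2.0]
w, v = 0.8, 0.3
dw, dv = bptt_grads(w, v, xs)

# Sanity check against a numerical (central finite-difference) gradient.
eps = 1e-6
num_dw = (forward(w + eps, v, xs)[-1] - forward(w - eps, v, xs)[-1]) / (2 * eps)
num_dv = (forward(w, v + eps, xs)[-1] - forward(w, v - eps, xs)[-1]) / (2 * eps)
print(dw, num_dw, dv, num_dv)
```

The summed gradient for `w` works out to `x_2 + v*x_1 + v²*x_0`: each earlier time step's contribution is attenuated by one more factor of `v`, which is exactly the mechanism behind vanishing/exploding gradients discussed later in the chapter.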
This is a foundational concept for training any kind of recurrent network. Next, the chapter moves into a practical example: forecasting a time series.
This section will also involve Keras implementations, but we'll continue to focus on what each RNN-specific concept is trying to achieve, and on the data handling.
Forecasting a Time Series - The Task
- What is a Time Series?
- A sequence of one or more values per time step.
- Univariate time series: Single value per time step (e.g., daily temperature in your city, active users per hour on a website).
- Multivariate time series: Multiple values per time step (e.g., a company’s quarterly revenue, debt, profit, etc.).
- Common Tasks:
- Forecasting: Predicting future values.
- Imputation: Predicting (or “postdicting”) missing values from the past.
- Example for this section (Figure 15-6, page 504):
- Three univariate time series, each 50 time steps long.
- Goal: Forecast the value at the next time step (the ‘X’ in the figure) for each of them.
Generating Time Series Data
For simplicity, the book uses a `generate_time_series()` function to create synthetic data:
- It creates `batch_size` time series, each of length `n_steps`.
- Each series is univariate (one value per time step).
- Each series is a sum of two sine waves with random frequencies and phases, plus some noise.
- The function returns a NumPy array of shape `[batch_size, n_steps, 1]` (the last dimension is the number of features per time step, which is 1 for univariate series).
Input Data Shape for RNNs (Bird Icon, page 504):
- When dealing with sequences, input features are generally represented as 3D arrays: `[batch_size, time_steps, dimensionality]`.
- `dimensionality` is 1 for univariate time series, more for multivariate.
Creating Train/Validation/Test Sets:
- `n_steps = 50`.
- Generate 10,000 series, each `n_steps + 1` long (the extra step is the target): `series = generate_time_series(10000, n_steps + 1)`.
- Split the data:
  - `X_train`: first 7,000 series, first `n_steps` time steps each: `series[:7000, :n_steps]`.
  - `y_train`: first 7,000 series, the last time step of each: `series[:7000, -1]` (the value we want to predict after seeing `n_steps` steps).
  - Similarly for `X_valid` (next 2,000) and `X_test` (last 1,000).
- Shapes: `X_train` is `[7000, 50, 1]`; `y_train` is `[7000, 1]`.
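Putting the generation and split steps together — the function below is a reconstruction consistent with the description above (two sine waves with random frequency/phase, plus noise); the specific amplitudes and frequency ranges are one reasonable choice, not something to treat as canonical:

```python
import numpy as np

def generate_time_series(batch_size, n_steps):
    """Univariate series: sum of two random sine waves plus noise."""
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)    # noise
    return series[..., np.newaxis].astype(np.float32)              # [batch, steps, 1]

n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]
print(X_train.shape, y_train.shape)  # (7000, 50, 1) (7000, 1)
```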
Baseline Metrics
Before building complex RNNs, it’s crucial to establish some simple baselines.
- What baselines are ultimately trying to achieve: They give you a simple reference point. If your fancy RNN can’t beat a very basic model, something is wrong, or the task might be too simple for an RNN, or the basic model is surprisingly effective.
Naive Forecasting:
- Predict the last observed value in each series as the next value: `y_pred_naive = X_valid[:, -1, :]` (take the last time step from each validation sequence).
- Calculate the Mean Squared Error (MSE): `np.mean(keras.losses.mean_squared_error(y_valid, y_pred_naive))`.
- Result: ~0.020 MSE.
Simple Linear Regression Model:
- Use a fully connected network (a Dense layer) with a linear activation.
- Since Dense layers expect a flat list of features, add a `Flatten` layer first.

```python
model_linear = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),  # Input: 50 time steps, 1 feature each
    keras.layers.Dense(1)                       # Output: 1 predicted value
])
```
- Compile with MSE loss and the Adam optimizer. Train for 20 epochs.
- Result: ~0.004 MSE. Much better than naive forecasting!
Implementing a Simple RNN
Now let’s try a very basic RNN. Can it beat the linear model?
```python
model_simple_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])  # Single layer, single neuron
])
```
- `keras.layers.SimpleRNN(1, ...)`: Creates a single recurrent layer with just one recurrent neuron.
- `input_shape=[None, 1]`:
  - `None`: We don't need to specify the length of the input sequences (`n_steps`). RNNs can process sequences of arbitrary length. This is a major advantage.
  - `1`: Dimensionality of the input at each time step (univariate).
Default Behavior of `SimpleRNN`:
- Uses `tanh` (hyperbolic tangent) as the activation function.
- Returns only the final output: By default, recurrent layers in Keras only return the output of the last time step. If our input sequence has 50 steps (`y(0)` to `y(49)`), this layer will output `y(49)`. (Bird icon, page 506.)
- This is exactly what we want for this forecasting task (predicting the value at the next step after seeing `n_steps` steps).
How it works (conceptually, as in Figure 15-1):
- The initial hidden state `h_init` is set to 0.
- For `t=0`: the neuron takes `x(0)` and `h_init`, computes `h(0)` (which is also `y(0)` for `SimpleRNN`).
- For `t=1`: the neuron takes `x(1)` and `h(0)`, computes `h(1)` (which is also `y(1)`).
- …
- For `t=49`: the neuron takes `x(49)` and `h(48)`, computes `h(49)` (which is also `y(49)`).
- The layer outputs this final `y(49)`.
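This loop can be written out directly in NumPy. The weight values below are made up for illustration; the point is that the entire 50-step computation uses just 3 parameters and returns only the last output:

```python
import numpy as np

rng = np.random.default_rng(0)
wx, wy, b = 0.5, 0.9, 0.0      # the 3 parameters of a SimpleRNN(1) neuron
xs = rng.normal(size=50)       # one univariate sequence, 50 time steps

h = 0.0                        # initial hidden state h_init
for x in xs:
    h = np.tanh(wx * x + wy * h + b)  # h(t) = tanh(wx*x(t) + wy*h(t-1) + b)

y_49 = h                       # the layer returns only the final output y(49)
print(y_49)
```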
Training: Compile with MSE, Adam optimizer, train for 20 epochs.
Result: MSE of ~0.014.
- Better than naive forecasting (0.020).
- But worse than the simple linear model (0.004)!
Why is it worse? Parameters:
- The linear model had `50 inputs * 1 weight_per_input + 1 bias = 51` parameters.
- A simple RNN neuron has:
  - A weight for the current input `x(t)` (a vector `Wₓ` if `x(t)` is a vector; here `x(t)` is scalar, so 1 weight).
  - A weight for the previous hidden state `h(t-1)` (a vector `Wᵧ` if `h(t-1)` is a vector; here `h(t-1)` is scalar, the output of one neuron, so 1 weight).
  - One bias term.
  - Total: 3 parameters for this single `SimpleRNN(1)` neuron.
- With only 3 parameters, it's much less powerful than the 51-parameter linear model for this task.
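These counts follow from the general formula for a simple RNN layer — `units × n_inputs` (for `Wₓ`) plus `units × units` (for `Wᵧ`) plus `units` biases — sketched here as a quick sanity check:

```python
def simple_rnn_params(n_inputs, units):
    # Wx: [n_inputs, units], Wy: [units, units], b: [units]
    return units * n_inputs + units * units + units

def dense_params(n_inputs, units):
    # W: [n_inputs, units], b: [units]
    return units * n_inputs + units

print(simple_rnn_params(1, 1))   # 3   (the SimpleRNN(1) neuron above)
print(dense_params(50, 1))       # 51  (the Flatten + Dense(1) linear model)
print(simple_rnn_params(1, 20))  # 440 (a 20-unit layer on univariate input)
```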
Trend and Seasonality (Sidebar, page 506):
- Traditional time series models (like ARIMA) often require you to manually remove trends (e.g., 10% monthly growth) and seasonality (e.g., sunscreen sales peak in summer) from the data before training. After prediction, you add them back.
- With RNNs, this is generally not necessary. The RNN can learn trends and seasonality if they exist in the data. However, preprocessing to remove them might sometimes improve performance, as the model then has a simpler pattern to learn.
Deep RNNs
The simple RNN with one neuron was too simple. Let’s try stacking multiple layers of cells – a deep RNN (Figure 15-7, page 507).
Keras Implementation: Just stack recurrent layers.
```python
model_deep_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),  # Second layer
    keras.layers.SimpleRNN(1)                           # Output layer
])
```
- First `SimpleRNN(20, return_sequences=True, ...)`:
  - 20 recurrent neurons (units).
  - `return_sequences=True`: This is crucial! By default, a recurrent layer only outputs the hidden state of the last time step. For stacking, the next recurrent layer needs to receive a full sequence of outputs (one output vector per time step). Setting `return_sequences=True` makes the layer output its hidden state at every time step: if the input is 50 steps, the output is 50 steps (each a 20-dim vector).
- Second `SimpleRNN(20, return_sequences=True)`:
  - Also 20 units; it also returns sequences, to feed into the next layer.
  - (The scorpion icon on page 507 emphasizes setting `return_sequences=True` for all recurrent layers except possibly the last one, if you only care about its final output.)
- Third `SimpleRNN(1)`:
  - The final layer has 1 unit (to predict a single value).
  - It does not set `return_sequences=True` (it's `False` by default), so it outputs only the value from the very last time step.
Training: Compile with MSE, Adam optimizer, train.
Result: MSE of ~0.003! We finally beat the linear model (0.004).
Improving the Output Layer (Page 507):
- The last `SimpleRNN(1)` layer is not ideal:
  - Its hidden state is just a single number, which isn't much memory. The model likely relies more on the hidden states of the previous 20-unit layers.
  - `SimpleRNN` uses `tanh` by default, so predictions are bounded between -1 and 1. What if our time series values go outside this range?
- Better approach: Replace the final recurrent layer with a `Dense` layer.

```python
model_deep_rnn_dense_output = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),  # Now the last recurrent layer: no return_sequences=True
    keras.layers.Dense(1)
])
```
- The second `SimpleRNN` now only outputs its final state (a 20-dim vector).
- The `Dense(1)` layer takes this 20-dim vector and outputs a single value.
- Benefits:
  - The `Dense` layer is often faster.
  - No `tanh` constraint on the output value (unless you add an activation to the `Dense` layer).
  - This model converges faster and performs just as well or better.
Forecasting Several Time Steps Ahead
So far, we’ve only predicted the very next value (t+1). What if we want to predict multiple steps ahead (e.g., t+1 to t+10)?
Option 1: Iterative Predictions (Predict one step, use it as input for next, etc. - Page 508)
- Use one of the models trained to predict just one step ahead.
- To predict 10 steps:
  - Feed the last `n_steps` of known data to the model; get the prediction for step `n_steps+1`.
  - Take this prediction and append it to the known sequence (as if it were a true value).
  - Feed the new last `n_steps` (which now include the first prediction) to the model to get the prediction for step `n_steps+2`.
  - Repeat 10 times.
- Code Example:

```python
# series is [1, n_steps + 10, 1]
# X_new is the initial sequence [1, n_steps, 1]
# Y_new is the target future sequence [1, 10, 1]
X = X_new
for step_ahead in range(10):
    y_pred_one = model.predict(X[:, step_ahead:])                  # Predict one step
    X = np.concatenate([X, y_pred_one[:, np.newaxis, :]], axis=1)  # Append prediction
Y_pred_iterative = X[:, n_steps:]  # The 10 appended predictions
```
- Problem: Errors can accumulate. If the prediction for step 1 is slightly off, it makes the input for predicting step 2 slightly off, and so on. (Figure 15-8, page 509, shows this.)
- Result: MSE on validation set is ~0.029. Much higher than previous models, but it’s a harder task.
- Naive forecast (predict last value for all 10 future steps) gives MSE ~0.223 (terrible).
- A simple linear model (trained to predict 10 steps ahead) gives MSE ~0.0188 – much better than this iterative RNN approach and faster to train.
- When it might work: For a few steps ahead, or for complex tasks where linear models fail, this iterative approach might be okay.
Option 2: Train an RNN to Predict All Future Values at Once (Sequence-to-Vector - Page 509)
- Change the targets: Instead of `y_train` being the single value at `n_steps+1`, make `Y_train` a vector of the 10 values from `n_steps+1` to `n_steps+10`.
  - `Y_train = series[:7000, -10:, 0]` (where each series now has length `n_steps + 10`, so `-10:` is the same slice as `n_steps:n_steps+10`; the trailing `0` index drops the feature dimension, giving targets of shape `[batch_size, 10]`).
- Modify the model: The output layer now needs 10 units (instead of 1).

```python
model_seq_to_vec_10 = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),  # Last RNN outputs its final state (20-dim vector)
    keras.layers.Dense(10)       # Dense layer predicts 10 values
])
```
- Result: MSE for the next 10 steps is ~0.008. Much better than the linear model (0.0188)!
Option 3: Train an RNN to Predict Future Values at Each Time Step (Sequence-to-Sequence - Pages 509-510)
This is even better. Instead of forecasting the next 10 values only at the very last time step, train the model to forecast the next 10 values at each and every time step of the input.
Targets `Y` (Page 510):
- For each input sequence `X(0)...X(n_steps-1)`, the target `Y` needs to be a sequence of the same length.
- At time step `t` of the input, the target `Y(t)` is a vector of the actual next 10 values: `[actual(t+1), actual(t+2), ..., actual(t+10)]`.
- The code snippet in the book shows how to construct this `Y` array of shape `[batch_size, n_steps, 10]`.
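The target construction can be sketched as follows. A random placeholder array stands in for the generated series here — the indexing, not the data, is the point: `Y[:, t, k]` holds the value at time step `t + k + 1`.

```python
import numpy as np

n_steps, horizon, batch = 50, 10, 1000
# Placeholder for generate_time_series(batch, n_steps + horizon):
series = np.random.rand(batch, n_steps + horizon, 1).astype(np.float32)

# Build Y so that Y[:, t, k] = series value at time step t + k + 1
Y = np.empty((batch, n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

X = series[:, :n_steps]
print(X.shape, Y.shape)  # (1000, 50, 1) (1000, 50, 10)
```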
Causal Model (Bird Icon, page 510):
- The targets `Y_train` will contain values that also appear in `X_train` (e.g., `X_train[0, t+1]` is part of the target for time step `t`). Is this cheating?
- No, because at each time step `t`, the model only knows about past time steps up to `t`. It cannot look ahead in the input `X` to see `X(t+1)` when predicting the target `Y(t)`. It's a causal model.
Model Architecture (Sequence-to-Sequence - Page 510):
- All recurrent layers must have `return_sequences=True` (even the last recurrent one).
- The output `Dense(10)` layer must be applied at every time step. Keras offers the `TimeDistributed` layer wrapper for this.

```python
model_seq_to_seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),    # Both RNNs return sequences
    keras.layers.TimeDistributed(keras.layers.Dense(10))  # Apply Dense(10) at each time step
])
```
- `TimeDistributed(Dense(10))`: Wraps the `Dense(10)` layer. It takes the input sequence from the previous `SimpleRNN` (shape `[batch_size, time_steps, 20]`), reshapes it to `[batch_size * time_steps, 20]`, applies the `Dense(10)` layer (getting `[batch_size * time_steps, 10]`), and reshapes the result back to `[batch_size, time_steps, 10]`.
- Footnote 2 on page 510 mentions that a `Dense` layer itself can often handle sequence inputs correctly (it is applied to the last dimension independently at each time step), making `TimeDistributed(Dense(...))` sometimes redundant. However, `TimeDistributed` makes the intent very clear.
Loss and Evaluation (Page 511):
- During training, all outputs are needed, so MSE over all outputs is fine.
- For prediction and final evaluation, often only the output at the last time step is useful (i.e., given the full input sequence `X(0)...X(n_steps-1)`, what's the forecast for `X(n_steps)...X(n_steps+9)`?).
- A custom metric `last_time_step_mse` is defined to compute MSE only on the output at the last time step: `model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])`.
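In spirit, the metric just slices out the final time step before computing MSE. Here is a NumPy sketch (in actual Keras code you would do the same with `keras.metrics.mean_squared_error` on `Y_true[:, -1]` and `Y_pred[:, -1]`):

```python
import numpy as np

def last_time_step_mse(Y_true, Y_pred):
    # MSE computed only on the outputs at the final time step
    return np.mean((Y_true[:, -1] - Y_pred[:, -1]) ** 2)

Y_true = np.zeros((2, 3, 10))  # [batch, time_steps, 10]
Y_pred = np.zeros((2, 3, 10))
Y_pred[:, -1] = 0.1            # error only at the last time step
print(last_time_step_mse(Y_true, Y_pred))  # ≈ 0.01
```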
Result: Validation MSE (for the last time step’s 10-step forecast) is ~0.006. This is 25% better than the sequence-to-vector model (0.008)!
Why it’s better:
- The loss includes terms for the RNN’s output at every time step.
- This means many more error gradients flow through the model during BPTT.
- Gradients don’t just flow through time (horizontally in unrolled view); they also flow from the output of each time step (vertically).
- This stabilizes and speeds up training.
Generating Arbitrarily Long Sequences (Page 511): You can combine this sequence-to-sequence approach (predicting next 10 values) with the iterative approach (feed predictions back as input) to generate very long sequences (e.g., for music or text generation, Chapter 16). May not be accurate for long-term forecasting but fine for generation.
Error Bars with MC Dropout (Bird Icon, page 511): For forecasting, it’s useful to have error bars. MC Dropout (from Chapter 11) can be used: add MC Dropout layers within recurrent cells, then at inference, run the model many times with dropout active and compute mean/std dev of predictions.
Handling Long Sequences - The Problems
The chapter states: “Simple RNNs can be quite good at forecasting time series or handling other kinds of sequences, but they do not perform as well on long time series or sequences.”
- Why? To train an RNN on long sequences, we must run it over many time steps. This means the unrolled RNN becomes a very deep network.
- Problems with Deep Unrolled RNNs:
- Unstable Gradients Problem (Vanishing/Exploding): Just like any deep neural network (as discussed in Chapter 11), it may suffer from gradients becoming too small or too large as they propagate back through many time steps. This makes training very slow or unstable.
- Forgetting Early Inputs (Limited Short-Term Memory): As an RNN processes a long sequence, the information from the earlier time steps has to be carried through many transformations in the hidden state. Due to these repeated transformations (and often the “squashing” nature of activation functions like tanh), the information from the initial inputs tends to get diluted or lost. The RNN’s state effectively “forgets” what happened much earlier in the sequence.
This section will address both these problems.
Fighting the Unstable Gradients Problem in RNNs
Many tricks from Chapter 11 for deep feedforward nets can also be used for RNNs:
- Good parameter initialization (Glorot, He).
- Faster optimizers (Adam, Nadam).
- Dropout (with caveats).
However, some techniques behave differently or have limitations with RNNs:
Nonsaturating Activation Functions (e.g., ReLU):
- May not help as much with unstable gradients in RNNs and can even make them more unstable.
- Why? The same weights (`Wₓ`, `Wᵧ`) are applied at every time step. If gradient descent updates these weights in a way that slightly increases the outputs at time step 1, this effect can compound at time step 2, then time step 3, and so on, potentially leading to exploding outputs. A nonsaturating activation function (like ReLU) doesn't prevent this explosion for positive values.
- Solution/Preference: A saturating activation function like `tanh` is often the default and preferred choice for simple RNN cells, because its bounded output (-1 to 1) helps control the explosion of activations. The vanishing gradient issue (due to tanh saturation) is then tackled by other means (like LSTMs/GRUs or a smaller learning rate).
- If training is unstable (gradients exploding), monitor gradient sizes (e.g., with TensorBoard) and perhaps use Gradient Clipping (as discussed in Chapter 11).
Batch Normalization (BN) with RNNs:
- Cannot be used as efficiently with RNNs as with deep feedforward nets.
- You cannot use standard BN between time steps (i.e., to normalize the hidden state `h(t)` before it's fed into the next time step `t+1` along with `x(t+1)`). This is because BN's moving averages of the mean/std are computed per feature across a batch, and the statistics of the hidden state can vary significantly from time step to time step; applying the same BN parameters (learned across all time steps) at each step doesn't work well.
- Where it can be used:
  - It is technically possible to add a BN layer inside a memory cell so it's applied at each time step (to both the `x(t)` and `h(t-1)` inputs to the cell). However, the same BN parameters (`γ`, `β`, and the moving averages `μ`, `σ`) would be used at each time step, regardless of the actual scale/offset of the inputs at that step. A 2015 paper by Laurent et al. (footnote 3) found this didn't yield good results when applied to the hidden states; it was only slightly beneficial when applied to the inputs `x(t)`.
  - What works (somewhat): Apply BN between recurrent layers (i.e., "vertically" in the unrolled view of Figure 15-7, not "horizontally" across time steps). You can do this in Keras by adding a `BatchNormalization` layer before each recurrent layer. Don't expect miracles, but it might help a bit.
Layer Normalization (LN) with RNNs (Page 512):
Introduced by Ba et al. in a 2016 paper (footnote 4). Often works better than BN with RNNs.
Key Difference from BN:
- BN normalizes across the batch dimension (for each feature, calculate mean/std over all instances in the batch).
- LN normalizes across the features dimension (for each instance, calculate mean/std over all features/units at the current time step).
Advantages for RNNs:
- LN can compute its required statistics (mean/std for normalization) on the fly at each time step, independently for each instance. It doesn’t need to average over a batch or estimate population stats with moving averages.
- This means it behaves the same way during training and testing.
Like BN, LN learns scale (γ) and offset (β) parameters per layer. In an RNN, it's typically used right after the linear combination of the inputs x(t) and hidden states h(t-1), before the activation function.

Implementing Layer Normalization in a Custom Keras Cell (Page 513): The book shows how to create a custom RNN cell (`LNSimpleRNNCell`) that incorporates Layer Normalization.
- It inherits from `keras.layers.Layer`.
- The constructor (`__init__`) sets up:
  - `self.state_size` and `self.output_size` (both equal to `units` for a simple RNN cell).
  - An internal `keras.layers.SimpleRNNCell(units, activation=None)` (note: `activation=None` because LN will happen before the final activation).
  - A `keras.layers.LayerNormalization()` layer.
  - The desired final activation function (e.g., `keras.activations.get("tanh")`).
- The `call(self, inputs, states)` method (which defines the cell's logic for one time step):
  - `outputs, new_states = self.simple_rnn_cell(inputs, states)`: get the linear combination from the internal simple RNN cell (`outputs` and `new_states` are the same here).
  - `norm_outputs = self.layer_norm(outputs)`: apply Layer Normalization to these pre-activation values.
  - `activated_norm_outputs = self.activation(norm_outputs)`: apply the final activation.
  - Return `activated_norm_outputs, [activated_norm_outputs]` (the output and the new state are the same).
- Using the Custom Cell (Page 514): wrap this custom cell in a `keras.layers.RNN` layer: `model.add(keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True, ...))`. The `RNN` layer handles the unrolling through time, feeding states, etc., using the logic defined in your custom cell.
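Assembled from the steps above, a sketch of the custom cell (close to the book's listing; the two-layer model wrapping it is illustrative):

```python
from tensorflow import keras

class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        # activation=None: Layer Normalization is applied before the activation
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)

    def call(self, inputs, states):
        # linear combination of x(t) and h(t-1); no activation applied yet
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        # for a simple RNN cell, the output and the new hidden state coincide
        return norm_outputs, [norm_outputs]

model = keras.models.Sequential([
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True,
                     input_shape=[None, 1]),
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
```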
Dropout in RNNs (Page 514):
- You can apply dropout to the inputs fed to the recurrent layer (at each time step).
- You can also apply dropout to the hidden states as they are passed between time steps. This is often called recurrent dropout.
- Most Keras recurrent layers (like `SimpleRNN`, `LSTM`, and `GRU`, but not the generic `keras.layers.RNN` wrapper unless the cell implements it) have two dropout hyperparameters:
  - `dropout`: dropout rate for the inputs to the cell.
  - `recurrent_dropout`: dropout rate for the hidden states.
- No need for a custom cell just for dropout if using built-in LSTM/GRU layers.
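For instance, a sketch with both dropout rates set on built-in LSTM layers (the 0.2 rates and layer sizes are illustrative choices, not from the book):

```python
from tensorflow import keras

# dropout regularizes the inputs x(t); recurrent_dropout regularizes the
# hidden states h(t-1) passed between time steps.
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1],
                      dropout=0.2, recurrent_dropout=0.2),
    keras.layers.LSTM(20, return_sequences=True,
                      dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
```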
With these techniques (proper initialization, careful choice of activation, Layer Normalization, recurrent dropout, Gradient Clipping), you can significantly alleviate the unstable gradients problem and train RNNs much more effectively, even fairly deep ones.
Key Takeaway for Fighting Unstable Gradients in RNNs: The same weights being applied repeatedly across time steps makes RNNs particularly susceptible to unstable gradients.
- Standard solutions like good initialization and optimizers help.
- Nonsaturating activations like ReLU need care; saturating ones like `tanh` are often safer defaults for basic RNN cells.
- Batch Normalization is tricky to apply effectively within the recurrence (horizontally); Layer Normalization is generally better.
- Recurrent dropout and gradient clipping are also valuable tools.
- What these techniques are ultimately trying to achieve: Create a stable learning environment where gradients can flow effectively through many time steps without vanishing or exploding, allowing the RNN to learn dependencies across time.
Tackling the Short-Term Memory Problem - Introduction
The Problem:
- When an RNN processes a sequence, the data goes through many transformations as the hidden state is updated at each time step.
- Due to these repeated transformations (and often the squashing effect of activation functions like `tanh`), information from the early parts of the sequence tends to get diluted or completely lost by the time the RNN reaches later time steps.
- The RNN's state h(t) might contain virtually no trace of the first inputs x(0), x(1), ... if t is large.
- This is a showstopper for tasks requiring understanding long-range dependencies (e.g., understanding the beginning of a long paragraph to make sense of the end, or translating a long sentence where the first word in the source affects the last word in the target).
- The book uses the analogy of Dory the fish from “Finding Nemo” trying to translate a long sentence – by the time she finishes reading it, she has no clue how it started.
The Solution: Cells with Long-Term Memory:
- To tackle this, various types of recurrent cells with mechanisms for long-term memory have been introduced.
- These have proven so successful that basic `SimpleRNN` cells are not used much anymore for complex sequence tasks.
- We'll look at the most popular ones: LSTM and GRU cells.
LSTM Cells - Long Short-Term Memory
History: Proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, and gradually improved by others (Alex Graves, Haşim Sak, Wojciech Zaremba).
As a Black Box (Page 515):
- From the outside, an LSTM cell can be used much like a basic `SimpleRNN` cell: just replace `keras.layers.SimpleRNN` with `keras.layers.LSTM`.
- Benefits:
  - Performs much better.
  - Training converges faster.
  - Can detect long-term dependencies in the data.
- Keras Implementation:

  ```python
  model = keras.models.Sequential([
      keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
      keras.layers.LSTM(20, return_sequences=True),
      keras.layers.TimeDistributed(keras.layers.Dense(10))
  ])
  ```

  Or, using the generic `RNN` layer with an `LSTMCell` (though the specialized `LSTM` layer is usually preferred, as it can use optimized GPU implementations; see Chapter 19):

  ```python
  model = keras.models.Sequential([
      keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences=True,
                       input_shape=[None, 1]),
      # ...
  ])
  ```
How Does an LSTM Cell Work? (The “Internals” - Figure 15-9, page 516): This is where the magic happens.
- Split State: An LSTM cell's state is split into two vectors:
  - h(t) (hidden state): the short-term state. This is also the output y(t) of the cell at time step t.
  - c(t) (cell state or memory cell): the long-term state.
- Core Idea: The network can learn what to store in the long-term state c(t), what to throw away from c(t-1), and what to read from c(t) to influence the current output h(t).
- Data Flow (left to right for c(t)):
  - The previous long-term state c(t-1) first goes through a forget gate.
  - Some new memories (selected by an input gate) are added to it via an addition operation.
  - The result is the new long-term state c(t), which is sent straight out (without further transformation) to be the long-term state for the next time step.
- Output Path (for h(t)):
  - After the addition operation (that produced c(t)), the long-term state c(t) is copied.
  - This copy is passed through a tanh function.
  - The result is then filtered by an output gate.
  - This produces the short-term state h(t) (which is also the cell's output y(t)).
The Gates (The Key Components - Page 516): The “forget,” “input,” and “output” gates are the controllers.
- The current input vector x(t) and the previous short-term state h(t-1) are fed into four different fully connected layers (these are internal to the LSTM cell, each with its own weights and biases).
- 1. Main Layer (outputs g(t)):
  - This layer analyzes x(t) and h(t-1). Its output g(t) (passed through tanh) represents candidate values/memories to be potentially added to the long-term state.
  - In a basic RNN, this g(t) would directly become h(t) and y(t). In an LSTM, its most important parts are stored in c(t).
- 2, 3, 4. Gate Controller Layers (output f(t), i(t), o(t)):
  - These three layers use the logistic (sigmoid) activation function, so their outputs range from 0 to 1.
  - These outputs are then used in element-wise multiplication operations, acting like gates:
    - Output near 0: "closes" the gate (blocks information flow).
    - Output near 1: "opens" the gate (lets information through).
  - Forget Gate (controlled by f(t)): c(t) ← f(t) ⊗ c(t-1) + ... It decides which parts of the previous long-term state c(t-1) should be erased or kept. If an element of f(t) is 0, the corresponding part of c(t-1) is forgotten.
  - Input Gate (controlled by i(t)): c(t) ← ... + i(t) ⊗ g(t) It decides which parts of the candidate memories g(t) (from the main layer) should be added to the long-term state.
  - Output Gate (controlled by o(t)): h(t) = y(t) = o(t) ⊗ tanh(c(t)) It decides which parts of the current long-term state c(t) (after passing through tanh) should be read out and output as the short-term state h(t) and the cell output y(t).
Equation 15-3: LSTM Computations (Page 517): Summarizes the math for a single instance:
- i(t) = σ(Wₓᵢᵀ x(t) + Wₕᵢᵀ h(t-1) + bᵢ)  (input gate)
- f(t) = σ(Wₓfᵀ x(t) + Wₕfᵀ h(t-1) + b_f)  (forget gate; TensorFlow initializes b_f to 1s to prevent forgetting everything at the start of training)
- o(t) = σ(Wₓₒᵀ x(t) + Wₕₒᵀ h(t-1) + bₒ)  (output gate)
- g(t) = tanh(Wₓgᵀ x(t) + Wₕgᵀ h(t-1) + b_g)  (main layer transforming inputs/previous state)
- c(t) = f(t) ⊗ c(t-1) + i(t) ⊗ g(t)  (new long-term state: forget old, add new)
- y(t) = h(t) = o(t) ⊗ tanh(c(t))  (output/new short-term state: filter the long-term state)
- Each gate and the main layer have their own weight matrices for x(t) (e.g., Wₓᵢ) and h(t-1) (e.g., Wₕᵢ), and their own bias terms (e.g., bᵢ). These are all learned.
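A NumPy sketch of one LSTM time step mirroring Equation 15-3 (the dict packing of the weights W, U, b is an illustrative convention of this sketch, not the book's code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step for a single instance (Equation 15-3).
    W[k]: input weights, U[k]: recurrent weights, b[k]: biases,
    for k in 'i' (input gate), 'f' (forget), 'o' (output), 'g' (main layer)."""
    i = sigmoid(W['i'].T @ x + U['i'].T @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'].T @ x + U['f'].T @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['o'].T @ x + U['o'].T @ h_prev + b['o'])  # output gate
    g = np.tanh(W['g'].T @ x + U['g'].T @ h_prev + b['g'])  # candidate memories
    c = f * c_prev + i * g       # new long-term state: forget old, add new
    h = o * np.tanh(c)           # new short-term state (= the output y(t))
    return h, c
```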
What the LSTM cell is ultimately trying to achieve: By having these gate mechanisms, the LSTM can learn:
- When to let new important information in (input gate).
- What old information to forget (forget gate).
- What information to preserve in its long-term memory cell c(t) over many time steps.
- What part of its current memory to output or use for its short-term state h(t) (output gate).
- This allows it to capture long-term patterns and dependencies much more effectively than simple RNNs.
Peephole Connections (Felix Gers & Jürgen Schmidhuber, 2000 - Page 518):
- An LSTM variant where the gate controllers can also "peek" at the long-term state: c(t-1) for the input/forget gates, c(t) for the output gate, in addition to x(t) and h(t-1).
- This can sometimes improve performance, but not always.
- The standard Keras `LSTMCell` doesn't support them, but `tf.keras.experimental.PeepholeLSTMCell` does.
1. Why Sigmoid (Logistic) for the Gates (f(t), i(t), o(t))?
What a Gate Does: A gate’s purpose is to control the flow of information. It needs to decide “how much” of some signal should be allowed to pass through. This is like a valve that can be fully open, fully closed, or partially open.
Sigmoid Output Range (0 to 1): The sigmoid function σ(z) = 1 / (1 + exp(-z)) outputs values strictly between 0 and 1.
- Output near 0: "close the gate" – block most/all of the information.
- Output near 1: "open the gate" – let most/all of the information pass through.
- Output around 0.5: "partially open the gate" – let some proportion of the information pass.

Use in Element-wise Multiplication: The outputs of these sigmoid gates are then used in element-wise multiplications (the ⊗ symbol in Equation 15-3):
- f(t) ⊗ c(t-1): how much of the old long-term state to keep (if f(t) is near 1) or forget (if f(t) is near 0).
- i(t) ⊗ g(t): how much of the new candidate memory g(t) to let in.
- o(t) ⊗ tanh(c(t)): how much of the (processed) long-term state to output.

Differentiability: Sigmoid is differentiable, which is essential for backpropagation to learn the weights of the gate controller layers.
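The gating effect can be seen with three tiny numbers (the values are arbitrary, chosen only to show the three regimes):

```python
import numpy as np

# Element-wise gating: a sigmoid output near 0 blocks a component,
# near 1 passes it through, and in between scales it proportionally.
gate = np.array([0.0, 0.5, 1.0])   # e.g. f(t), one value per state component
state = np.array([4.0, 4.0, 4.0])  # e.g. c(t-1)
kept = gate * state                # the element-wise ⊗ operation
# kept == [0.0, 2.0, 4.0]: blocked, halved, passed through unchanged
```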
What if we used Tanh for Gates?
- Tanh outputs values between -1 and 1.
- If a gate output were -1 and you multiplied it by a value, it would invert the sign of that value and also scale it. That isn't what "gating" or "controlling flow" means in an intuitive sense: we want to scale information between "none of it" (0) and "all of it" (1).
- It would be possible to rescale tanh's output (e.g., (tanh(z) + 1) / 2 maps it to 0–1), but sigmoid naturally provides the 0–1 range needed for this multiplicative gating effect.
What if we used ReLU for Gates?
- ReLU outputs 0 for negative inputs and z for positive inputs (unbounded positive output).
- An output of 0 would effectively "close the gate."
- An output > 1 would not just "open the gate" but also amplify the signal passing through. This could lead to exploding values and instability, which is precisely what LSTMs are trying to manage better than simple RNNs.
- The gating mechanism generally needs a bounded "how much" signal, and sigmoid's 0–1 range is perfect for this.
2. Why Tanh for the Main Layer (g(t)) and for Processing c(t) before the Output Gate?
Main Layer g(t): The layer g(t) = tanh(W_xgᵀ x(t) + W_hgᵀ h(t-1) + b_g) computes the "candidate values" to be potentially added to the long-term cell state c(t).

Processing c(t) for Output: The output is y(t) = o(t) ⊗ tanh(c(t)); the long-term state c(t) is passed through tanh before being gated by o(t).

Tanh Output Range (-1 to 1):
- The tanh function squashes its input to a range between -1 and 1.
- What this achieves for g(t) (candidate values): it keeps the potential updates to the cell state bounded, which helps prevent the cell state c(t) from growing uncontrollably large (exploding).
- What this achieves for tanh(c(t)) before output: it ensures that the values being considered for output are in a bounded range before the output gate decides how much to pass through.

Zero-Centered Output: The output of tanh is roughly centered around 0. This can be beneficial for learning in subsequent layers or time steps, as discussed in Chapter 10 (a property that sometimes helps convergence speed compared to sigmoid, which is centered at 0.5).

Differentiability: Tanh is also differentiable.
What if we used Sigmoid for g(t) or tanh(c(t))?
- Sigmoid outputs 0 to 1. If you continually add positive values (from sigmoid) to the cell state, the cell state could still grow very large without bound (unless the forget gate is very aggressive).
- Tanh, by allowing both positive and negative values (and being zero-centered), gives the network the flexibility to increment or decrement values in the cell state c(t) and to output values centered around zero. This can lead to more stable dynamics.
What if we used ReLU for g(t) or tanh(c(t))?
- For g(t): if ReLU were used, g(t) could be unbounded positive. If the input gate i(t) is open, this could lead to the cell state c(t) growing indefinitely, causing exploding values. LSTMs are designed to prevent exactly this kind of instability, which plagues simple RNNs with ReLU.
- For tanh(c(t)) before output: using ReLU here would also be problematic if c(t) takes on large values, as the output h(t) could then become unbounded before being used by subsequent layers or calculations. The tanh here provides a final "squashing" to keep things in a controlled range.
In Summary:
- Sigmoid for Gates (0 to 1 range):
- Ultimately trying to achieve: A probabilistic-like control over information flow – “block” (0), “pass through” (1), or “partially pass through” (between 0 and 1). This is ideal for multiplicative gating.
- Tanh for Candidate Values and Pre-Output Processing (-1 to 1 range):
- Ultimately trying to achieve: Keep the values that modify the cell state and the values considered for output bounded and roughly zero-centered. This contributes to the stability of the LSTM cell and helps prevent the cell state from exploding, which is a key part of solving the long-term memory problem.
The choice of these specific activation functions is a result of careful design and empirical evidence showing that this combination allows LSTMs to effectively learn, store, and retrieve information over long sequences while maintaining training stability. They are not arbitrary choices but rather integral to the LSTM’s mechanism for managing information flow and memory.
GRU Cells - Gated Recurrent Unit
Proposed by Kyunghyun Cho et al. in 2014 (in the same paper that introduced the Encoder-Decoder architecture we briefly touched upon).
What it is: A simplified version of the LSTM cell.
Performance: It seems to perform just as well as LSTMs on many tasks (footnote 12 refers to a 2015 paper by Greff et al. that found LSTM variants perform roughly the same). This, along with its simplicity, explains its growing popularity.
Main Simplifications Compared to LSTM (Figure 15-10, page 519):
- Merged State Vectors: Both state vectors (c(t) for long-term and h(t) for short-term in the LSTM) are merged into a single state vector h(t) in the GRU.
- Single Update Gate z(t) Controls Both Forget and Input:
  - The LSTM has a separate forget gate f(t) and input gate i(t); the GRU has a single update gate z(t).
  - The state update is h(t) = (1 - z(t)) ⊗ h(t-1) + z(t) ⊗ g(t) (matching Equation 15-4 below):
    - If z(t) ≈ 1 (update gate "active for the new candidate"): h(t) ≈ 0 ⊗ h(t-1) + 1 ⊗ g(t) ≈ g(t), i.e., forget the previous state and take the new candidate.
    - If z(t) ≈ 0: h(t) ≈ 1 ⊗ h(t-1) + 0 ⊗ g(t) ≈ h(t-1), i.e., keep the old state and ignore the candidate.
  - What the update gate z(t) is ultimately trying to achieve: it learns to decide how much of the previous state h(t-1) to keep versus how much of the new candidate state g(t) to incorporate.
- No Output Gate: The full state vector h(t) is output at every time step. (In the LSTM, the output gate o(t) controlled which parts of tanh(c(t)) were output as h(t).)
- New Reset Gate r(t):
  - There's a new gate controller r(t) (also using sigmoid, outputting 0 to 1).
  - This gate controls which parts of the previous state h(t-1) are shown to the main layer g(t) when calculating the candidate state.
  - The candidate state is g(t) = tanh(W_xgᵀ x(t) + W_hgᵀ (r(t) ⊗ h(t-1)) + b_g).
  - If r(t) is close to 0 for some components, those components of h(t-1) are effectively ignored when computing the new candidate g(t).
  - What the reset gate r(t) is ultimately trying to achieve: it allows the cell to "forget" or ignore parts of its previous state that are deemed irrelevant for computing the next candidate state, before the update gate decides how to blend that candidate with the old state.
Equation 15-4: GRU Computations (Page 520): Summarizes the math for a single instance.
- z(t) = σ(W_xzᵀ x(t) + W_hzᵀ h(t-1) + b_z)  (update gate)
- r(t) = σ(W_xrᵀ x(t) + W_hrᵀ h(t-1) + b_r)  (reset gate)
- g(t) = tanh(W_xgᵀ x(t) + W_hgᵀ (r(t) ⊗ h(t-1)) + b_g)  (candidate hidden state)
- h(t) = (1 - z(t)) ⊗ h(t-1) + z(t) ⊗ g(t)  (final hidden state for the time step)
- Parameters to Learn: W_xz, W_hz, b_z; W_xr, W_hr, b_r; and W_xg, W_hg, b_g. There are fewer weight matrices than in the LSTM because there are fewer gates and only one state vector.
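A NumPy sketch of one GRU time step mirroring Equation 15-4 (as with the LSTM sketch, the dict packing of W, U, b is an illustrative convention):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x, h_prev, W, U, b):
    """One GRU step for a single instance (Equation 15-4).
    W, U, b are keyed by 'z' (update gate), 'r' (reset gate), 'g' (candidate)."""
    z = sigmoid(W['z'].T @ x + U['z'].T @ h_prev + b['z'])        # update gate
    r = sigmoid(W['r'].T @ x + U['r'].T @ h_prev + b['r'])        # reset gate
    g = np.tanh(W['g'].T @ x + U['g'].T @ (r * h_prev) + b['g'])  # candidate
    # blend: z near 0 keeps the old state, z near 1 takes the candidate
    return (1 - z) * h_prev + z * g
```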
Keras Implementation:
- Keras provides a `keras.layers.GRU` layer (based on `keras.layers.GRUCell`).
- Using it is just a matter of replacing `SimpleRNN` or `LSTM` with `GRU` in your `Sequential` model.
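For instance, the chapter's two-layer forecasting model with GRU swapped in (a sketch following that pattern):

```python
from tensorflow import keras

# Same architecture as the LSTM version, with GRU layers swapped in.
model = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10)),
])
```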
Why LSTMs and GRUs are Successful:
- “LSTM and GRU cells are one of the main reasons behind the success of RNNs.”
- They can tackle much longer sequences than simple RNNs.
- However, they still have fairly limited short-term memory and can have a hard time learning long-term patterns in sequences of 100 time steps or more (e.g., very long audio samples, long time series, long sentences).
One way to help LSTMs/GRUs handle even longer sequences is to shorten the input sequences they directly see, for example, by using 1D convolutional layers first.
Key Takeaway for GRUs: GRUs are a streamlined version of LSTMs that often perform just as well but are computationally simpler (fewer parameters, one state vector).
- They use an update gate (to decide how much of the past state to keep vs. new candidate state) and a reset gate (to decide how much of the past state influences the current candidate state).
- What GRUs are ultimately trying to achieve: Similar to LSTMs, they aim to capture dependencies over longer sequences by using gating mechanisms to control information flow and memory, but with a more compact architecture.
This covers the main advanced recurrent cell types (LSTM and GRU) designed to combat the short-term memory problem of simple RNNs. The next step in the chapter is to see how to combine these with other types of layers, like 1D convolutions, to process sequences even more effectively.
Using 1D Convolutional Layers to Process Sequences
While LSTMs and GRUs are much better than simple RNNs at handling longer sequences, they can still struggle with very long dependencies (e.g., hundreds or thousands of time steps, like in raw audio or very long text documents).
The Idea: One way to solve this is to shorten the input sequences before feeding them to the recurrent layers. How? By using 1D convolutional layers.
Recall 2D Convolutional Layers (from Chapter 14, though not yet covered by us in detail):
- They work by sliding small kernels (filters) across an image (a 2D grid of pixels).
- They produce multiple 2D feature maps (one per kernel).
- Each kernel learns to detect a specific local pattern (e.g., an edge, a texture).
1D Convolutional Layers for Sequences:
- Similarly, a 1D convolutional layer slides several kernels (filters) across an input sequence (a 1D array of features over time).
- It produces a 1D feature map per kernel.
- What each 1D kernel is ultimately trying to achieve: It learns to detect a single, very short sequential pattern (no longer than the kernel size). For example, if processing text character by character, a kernel of size 3 might learn to detect common trigrams like “the” or “ing”.
- If you use, say, 10 kernels, the layer’s output will be composed of 10 1-dimensional sequences (feature maps). You can view this output as a single sequence where each time step now has 10 features (instead of, say, 1 feature if it was a univariate time series).
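A quick shape check of that claim (the kernel count and sequence length here are arbitrary):

```python
import numpy as np
from tensorflow import keras

# 10 kernels over a univariate sequence: each output time step has 10
# features, one per kernel/feature map.
conv = keras.layers.Conv1D(filters=10, kernel_size=3, padding="same")
x = np.random.rand(1, 50, 1).astype("float32")  # [batch, time steps, features]
y = conv(x)  # shape (1, 50, 10): same length (stride 1, "same" padding)
```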
Building Hybrid Networks:
- This means you can build neural networks composed of a mix of:
- Recurrent layers (LSTMs, GRUs).
- 1D convolutional layers.
- Even 1D pooling layers (which downsample the sequence, similar to 2D pooling for images).
Impact on Sequence Length:
- If a 1D convolutional layer uses a stride of 1 and "same" padding, the output sequence will have the same length as the input sequence.
- If it uses "valid" padding or a stride greater than 1, the output sequence will be shorter than the input sequence.
  - This downsampling of the sequence by the convolutional layer is the key benefit here.
  - The convolutional layer can learn to preserve useful local information while dropping unimportant details, effectively compressing the sequence.
  - The subsequent recurrent layers then have a shorter sequence to process, making it easier for them to detect longer-range patterns within that compressed representation.
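The shortening can be verified directly (filter/kernel/stride values below match the chapter's Conv1D example; the input length of 50 is arbitrary):

```python
import numpy as np
from tensorflow import keras

# "valid" padding with stride > 1 shortens the sequence:
# output_length = (input_length - kernel_size) // strides + 1
conv = keras.layers.Conv1D(filters=20, kernel_size=4, strides=2,
                           padding="valid")
x = np.random.rand(1, 50, 1).astype("float32")
y = conv(x)  # (50 - 4) // 2 + 1 = 24 time steps
```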
Keras Example (Page 521): The book shows modifying the time series forecasting model to include a 1D convolutional layer at the beginning.
```python
model = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid",
                        input_shape=[None, 1]),  # input is [batch, time steps, features]
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))  # seq-to-seq, 10-step forecast
])
```

- `keras.layers.Conv1D(...)`:
  - `filters=20`: it will learn 20 different short patterns.
  - `kernel_size=4`: each pattern is 4 time steps long.
  - `strides=2`: the kernel slides by 2 steps each time, effectively downsampling the input sequence by a factor of 2.
  - `padding="valid"`: no padding, so the output sequence length is (input_length - kernel_size) / strides + 1.
  - `input_shape=[None, 1]`: takes sequences of any length, with 1 feature per time step.
- Target Adjustment: Because the `Conv1D` layer shortens the sequence (and the first output corresponds to input steps 0–3), the target sequences `Y_train` need to be cropped and downsampled to match the output length and alignment of the `Conv1D` layer. The slicing `Y_train[:, 3::2]` does this:
  - `3:`: skips the first 3 time steps in the target (because the first Conv1D output depends on inputs 0–3, it effectively predicts starting from a later point relative to the original targets).
  - `::2`: downsamples the targets by a factor of 2 to match `strides=2`.
Result: The book states that if you train and evaluate this model, “you will find that it is the best model so far. The convolutional layer really helps.”
It even mentions that it’s possible to use only 1D convolutional layers (and drop recurrent layers entirely) for sequence processing, which leads to the WaveNet architecture.
Key Takeaway for 1D Convolutions in Sequence Processing: 1D convolutional layers can be used as a preprocessing step for RNNs (or even as a replacement for them in some cases like WaveNet).
- What they are ultimately trying to achieve when used with RNNs:
- Feature Extraction: Learn to detect relevant local patterns in the input sequence.
- Downsampling/Shortening Sequences: By using strides, they can reduce the length of the sequence fed into the RNN layers. This allows the RNNs (like LSTMs/GRUs) to focus their memory capacity on longer-range dependencies within a more compressed, feature-rich representation of the original sequence.
(Page 521-522: WaveNet)
This is an example of a powerful architecture that uses only 1D convolutional layers (no recurrent layers) to process very long sequences.
Introduced in a 2016 paper by van den Oord et al. (DeepMind) for generative modeling of raw audio.
Architecture (Figure 15-11, page 522):
- It stacks 1D convolutional layers.
- Crucially, it doubles the dilation rate at every layer.
- Dilation Rate: Controls how spread apart the kernel’s input taps are.
- Layer 1: Dilation rate 1 (looks at, say, 2 adjacent time steps).
- Layer 2: Dilation rate 2 (looks at inputs 2 steps apart, e.g., time t and t-2). Its receptive field is wider.
- Layer 3: Dilation rate 4 (looks at inputs 4 steps apart).
- And so on (1, 2, 4, 8, 16, …).
- What doubling dilation rates achieves: The receptive field of the network (how far back in time it can “see”) grows exponentially with depth. This allows the network to capture very long-range dependencies efficiently.
- Lower layers learn short-term patterns.
- Higher layers learn long-term patterns by combining features from lower layers that already cover wider time spans.
- The paper stacked blocks of 10 such convolutional layers (dilation rates 1, 2, …, 512), and repeated these blocks.
- Used “causal” padding to ensure convolutions don’t peek into the future.
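The exponential growth of the receptive field can be checked with a little arithmetic (for kernel size 2, each dilated layer extends the receptive field by its dilation rate; the helper function below is illustrative, not from the book):

```python
def receptive_field(dilation_rates, kernel_size=2):
    # each dilated causal conv layer adds (kernel_size - 1) * d past time steps
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

one_block = [2 ** i for i in range(10)]  # dilation rates 1, 2, 4, ..., 512
# receptive_field(one_block) == 1024: a single block already sees 1,024 steps,
# and repeating blocks (as the paper does) extends this further.
```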
Keras Implementation of a Simplified WaveNet (Page 522):
```python
model = keras.models.Sequential()
model.add(keras.layers.InputLayer(input_shape=[None, 1]))
for rate in (1, 2, 4, 8) * 2:  # two blocks of these dilation rates
    model.add(keras.layers.Conv1D(filters=20, kernel_size=2, padding="causal",
                                  activation="relu", dilation_rate=rate))
model.add(keras.layers.Conv1D(filters=10, kernel_size=1))  # output layer
```

- `padding="causal"`: ensures that the output at time t only depends on inputs up to time t (no peeking into the future).
- The output layer is a `Conv1D` with `kernel_size=1`. This is like applying a `Dense` layer independently at each time step to the features extracted by the dilated convolutions.
Performance:
- The book states that these last two models (RNN with initial Conv1D, and the simplified WaveNet) offer the best performance so far on the time series forecasting task.
- The original WaveNet achieved state-of-the-art on audio generation (text-to-speech, music), handling tens of thousands of time steps per second of audio. This is something LSTMs/GRUs would struggle immensely with.
Key Takeaway for WaveNet: WaveNet demonstrates that stacked 1D convolutional layers with exponentially increasing dilation rates can be extremely effective at capturing long-range dependencies in sequences, offering an alternative to RNNs for very long sequences.
- What it’s ultimately trying to achieve: Efficiently model long-term patterns by having a hierarchy of convolutions that look at progressively larger time scales, without the computational cost of very large kernels or the memory limitations of very deep unrolled RNNs.
This concludes the main content of Chapter 15! It’s taken us from the basic concept of recurrence to sophisticated cells like LSTMs and GRUs, and finally to using CNNs (either with RNNs or alone like WaveNet) for sequence processing.
The core theme has been how to enable neural networks to effectively “remember” and utilize information from previous parts of a sequence to make current predictions or classifications.