
Chapter 2: Language Modeling Basics


2.1 Bag of Words (BoW)

  • What it’s ultimately trying to achieve: The most fundamental problem when dealing with text in machine learning is that algorithms understand numbers, not words. BoW is one of the simplest and oldest methods, yet often an effective one, for converting a piece of text (a document) into a numerical vector that an ML model can then use, typically for tasks like classification (e.g., “Is this email spam or not?” or “What is the topic of this news article?”).

  • The Intuition (The “Bag”): Imagine you have a document. You take all the words, throw them into a “bag,” shake it up, and then count how many times each unique word appears (or simply note its presence/absence). The order of words is lost, just like items in a bag.

    • Corpus: This is just the collection of all documents you’re working with.
    • Tokenization: Before you can count, you need to define what a “word” (or more generally, a “token”) is. This involves splitting text into pieces – usually words, but sometimes subwords. We often convert everything to lowercase and remove punctuation to treat “The” and “the” as the same.
    • Vocabulary: This is the list of all unique tokens found across your entire corpus.
    • Vectorization: For each document, you create a vector. The length of this vector is the size of your vocabulary. Each position in the vector corresponds to a unique word in the vocabulary. The value at that position can be:
      • Binary: 1 if the word is in the document, 0 if not.
      • Count: The number of times the word appears in the document.
      • (Later, we’d see TF-IDF, but for basic BoW, counts or binary are common). This collection of vectors forms a Document-Term Matrix (DTM), where rows are documents and columns are terms (tokens) from the vocabulary.
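The tokenize → vocabulary → vectorize pipeline above can be sketched in plain Python. This is a minimal illustration on a made-up two-document corpus; a real pipeline would typically use a library implementation such as scikit-learn’s CountVectorizer:

```python
from collections import Counter

def tokenize(text):
    # Lowercase, replace punctuation with spaces, split on whitespace.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return cleaned.split()

# Toy two-document corpus (illustrative).
corpus = ["The cat sat on the mat.", "The dog bites the man."]
tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary: all unique tokens across the corpus, in a fixed (sorted) order.
vocab = sorted({tok for doc in tokenized for tok in doc})

# Document-Term Matrix: one count vector per document;
# entry j of row i is how often vocab[j] occurs in document i.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
```

Note how sparse the rows already are with only two short documents; with a realistic vocabulary of tens of thousands of terms, almost every entry would be zero.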
  • Realism and Limitations:

    • Sparsity: Most documents only contain a small subset of the entire vocabulary. So, these BoW vectors are usually very sparse (mostly zeros), which can be inefficient.
    • Loss of Order: “Man bites dog” and “Dog bites man” contain exactly the same words, so their count-based BoW vectors are identical, even though their meanings are opposite. Context is lost.
    • No Semantic Understanding: “Automobile” and “car” are treated as completely different words, even though they are synonyms. The model has no inherent understanding of their similarity.
    • Out-of-Vocabulary (OOV) words: If a new word appears during prediction that wasn’t in the training corpus (and thus not in the vocabulary), BoW has no way to represent it.
  • Connecting to ML (for classification, as in the book): Once you have these numerical vectors for your documents, and you have labels (e.g., “Cinema,” “Music,” “Science”), you can train a classifier. For more than two classes (multiclass classification), we typically use the softmax activation function in the output layer of a neural network. Softmax converts raw scores (logits) from the network into a probability distribution over the classes – each class gets a probability, and they all sum to 1. The cross-entropy loss is then used to measure how far these predicted probabilities are from the true (one-hot encoded) labels.
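The softmax and cross-entropy steps described above can be sketched with the standard library. The logits below are made-up raw scores for three hypothetical classes; in practice they would come from the network’s output layer:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    # Negative log of the probability assigned to the true class.
    return -math.log(probs[true_index])

# Hypothetical raw scores for three classes, e.g. ["Cinema", "Music", "Science"].
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)         # a probability distribution summing to 1
loss = cross_entropy(probs, 0)  # small when the model is confident and right
```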

2.2 Word Embeddings

  • What it’s ultimately trying to achieve: To overcome the limitations of BoW, especially the lack of semantic understanding and sparsity. Word embeddings aim to represent words as dense, lower-dimensional vectors such that words with similar meanings have similar vector representations (i.e., they are “close” in the vector space).

  • The Intuition (Words as Dense Vectors): Instead of a massive, sparse vector where only one position is ‘1’ (like in a one-hot encoding of a word from BoW), each word gets a shorter vector (e.g., 100-300 dimensions instead of 50,000) where most numbers are non-zero. These numbers aren’t just counts; they are learned parameters that capture semantic properties. The classic example is that the vector operation vector("king") - vector("man") + vector("woman") results in a vector very close to vector("queen").

  • How it’s typically done (conceptual, like with Word2vec’s skip-gram): The core idea, popularized by algorithms like Word2vec (Tomáš Mikolov’s work!), is that words that appear in similar contexts tend to have similar meanings.

    • A neural network is trained on a large corpus of text. Its task isn’t classification, but something like: given a word (the “target” word), predict its surrounding words (the “context” words), or vice-versa.
    • The magic happens in the “embedding layer” of this network. After training, this layer contains the dense vector representations for each word in the vocabulary. We then discard the rest of the network and just keep these learned embeddings.
    • These embeddings are learned from unlabeled text – you don’t need explicit labels like “spam” or “sports” for every document.
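The self-supervised setup above can be made concrete by showing how skip-gram training pairs are extracted. This is a minimal sketch on a toy sentence with a context window of 1; real Word2vec training then feeds such (target, context) pairs to the network:

```python
def skipgram_pairs(tokens, window=2):
    # For each position i, pair the target token with every token
    # within `window` positions to its left or right.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
```

No labels are needed: the surrounding text itself provides the prediction targets.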
  • Benefits:

    • Semantic Similarity: Cosine similarity between word vectors can quantify how similar words are in meaning.
    • Dimensionality Reduction: Vectors are much smaller than BoW vocabularies.
    • Improved Generalization: Models can generalize better because they understand that “car” and “automobile” are related.
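Cosine similarity, mentioned above, can be computed directly. The three-dimensional vectors below are made-up illustrative values, not trained embeddings (real ones have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    # cos(angle) = (u . v) / (|u| * |v|); close to 1 means "pointing the same way".
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up 3-dimensional vectors purely for illustration.
car = [0.9, 0.1, 0.3]
automobile = [0.85, 0.15, 0.25]
banana = [0.1, 0.9, 0.2]
```

With learned embeddings, we would expect cosine_similarity(car, automobile) to be much higher than cosine_similarity(car, banana), which these toy values mimic.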
  • Realism: These embeddings are powerful but are only as good as the data they were trained on. They can still inherit biases present in the training text.

2.3 Byte-Pair Encoding (BPE)

  • What it’s ultimately trying to achieve: To address two main problems:

    1. Huge Vocabularies: If you treat every unique word as a token, vocabularies can become enormous, especially with morphologically rich languages or typos.
    2. Out-of-Vocabulary (OOV) Words: Word embedding models still struggle if a word was never seen during their training.

    BPE is a subword tokenization algorithm: it breaks words into smaller, frequently occurring units (subwords).
  • The Intuition (Smart Splitting): Start by treating every individual character as a token. Then, iteratively find the most frequent pair of adjacent tokens (characters or already formed subwords) in your corpus and merge them into a new, single token. Repeat this for a fixed number of merges or until your vocabulary reaches a desired size. For example, “lowest” might initially be l o w e s t. If es is a frequent pair, it becomes l o w es t. If est then becomes frequent, it’s l o w est. The underscore _ is often added to mark the start of a word (e.g., _low est) to differentiate subwords at the beginning of words from those in the middle.
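The merge loop above can be sketched as follows. This is a simplified illustration on a tiny made-up corpus, and the naive string replace stands in for the token-aware matching a real BPE implementation would do; with this corpus, the pair e s merges first, then es t, mirroring the “est” example:

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent token pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        tokens = word.split()
        for pair in zip(tokens, tokens[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Merge every occurrence of the pair into a single token.
    # (Naive string replace; real implementations match whole tokens.)
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in words.items()}

# Tokens are space-separated; "_" marks the start of a word.
words = {"_ l o w e s t": 3, "_ b e s t": 4}
for _ in range(2):
    words = merge_pair(words, most_frequent_pair(words))
```

The number of merge iterations is exactly the knob that controls the final vocabulary size.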

  • Benefits:

    • Manages Vocabulary Size: You can control the final vocabulary size.
    • Handles OOV Words: Unknown words can often be represented as a sequence of known subwords (e.g., “unfamiliarword” might become “_un”, “famil”, “iar”, “word”).
    • Captures Morphology: “running” might be tokenized as “_runn”, “ing”, sharing the “_runn” part with “runner.”
  • Realism: The choice of vocabulary size is a hyperparameter. There are other subword tokenization methods too (like WordPiece, SentencePiece). BPE is a common and foundational one.

2.4 Language Model (The Core Definition)

  • What it’s ultimately trying to achieve: To assign a probability to a sequence of tokens. More practically, and more commonly for autoregressive models (which we focus on), it’s about predicting the next token in a sequence given all the preceding tokens. P(current_token | previous_tokens)

  • The Intuition (Super Autocomplete): Think of the autocomplete on your phone, but much more powerful. Given “The cat sat on the…”, a good language model would assign a high probability to “mat,” a lower one to “table,” and a very low one to “banana.”

  • Key Idea: It learns the statistical patterns of language. It’s not understanding in a human sense, but it’s very good at figuring out what typically follows what. The output for the next token is a probability distribution over the entire vocabulary.

2.5 Count-Based Language Model (e.g., n-gram models)

  • What it’s ultimately trying to achieve: To implement the language model definition P(current_token | previous_tokens) using simple frequency counts of n-grams from a large text corpus. “n-gram” just means a sequence of ’n’ tokens.

  • The Intuition (Simple Frequencies): If we want to predict the word after “the cat sat on,” we look at our corpus and see how many times “the cat sat on” appears. Then, among those occurrences, we count how many times each different word followed.

    • A unigram model considers only the current word’s frequency (P(word)).
    • A bigram model considers the previous word (P(word_i | word_{i-1})).
    • A trigram model (as in the book example) considers the two previous words (P(word_i | word_{i-2}, word_{i-1})). The probability is calculated as P(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1}). This is a Maximum Likelihood Estimate (MLE).
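The trigram MLE formula can be sketched directly from counts on a toy corpus (illustrative only; a real model trains on millions of tokens):

```python
from collections import Counter

def train_trigram(tokens):
    # Count every trigram and every bigram prefix in the corpus.
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def trigram_prob(trigrams, bigrams, w1, w2, w3):
    # MLE: count(w1, w2, w3) / count(w1, w2); zero if the prefix was never seen.
    prefix = bigrams[(w1, w2)]
    return trigrams[(w1, w2, w3)] / prefix if prefix else 0.0

tokens = "the cat sat on the mat the cat sat on the sofa".split()
trigrams, bigrams = train_trigram(tokens)
p = trigram_prob(trigrams, bigrams, "sat", "on", "the")  # "the" always follows "sat on" here
```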
  • Realism and Limitations:

    • Sparsity & Zero Probabilities: What if the specific trigram (w_{i-2}, w_{i-1}, w_i) or even the bigram (w_{i-2}, w_{i-1}) never appeared in your training corpus? You’d get a zero probability, which is bad because it means the model thinks a perfectly valid sequence is impossible.
    • Solutions:
      • Backoff: If the trigram count is zero, “back off” and use the bigram probability. If that’s zero, use the unigram probability.
      • Smoothing (e.g., Laplace / Add-one Smoothing): Add a small count (like 1) to every possible n-gram count, seen and unseen alike, and add the vocabulary size to the denominator so the probabilities still sum to 1. This ensures no probability is ever strictly zero.
    • Limited Context: N-gram models can only look back ’n-1’ tokens. They can’t capture long-range dependencies in text (e.g., a pronoun referring to a noun many sentences earlier).
    • Storage: Storing counts for all possible n-grams (especially for n > 3 or 4) can become massive.
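Add-one smoothing from the Solutions list can be sketched as follows (toy corpus; a full implementation would also reserve vocabulary entries for words unseen at training time):

```python
from collections import Counter

def laplace_trigram_prob(trigrams, bigrams, vocab_size, w1, w2, w3):
    # Add-one smoothing: (count + 1) / (prefix count + V).
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + vocab_size)

tokens = "the cat sat on the mat".split()
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))  # vocabulary size: the, cat, sat, on, mat -> 5

p_seen = laplace_trigram_prob(trigrams, bigrams, V, "the", "cat", "sat")
p_unseen = laplace_trigram_prob(trigrams, bigrams, V, "the", "cat", "mat")  # never observed, yet nonzero
```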

2.6 Evaluating Language Models

  • What it’s ultimately trying to achieve: To objectively measure how good a language model is. This is crucial for comparing different models or different versions of the same model during development.

  • Key Metrics:

    • Perplexity (PPL):
      • Intuition: A measure of how “surprised” or “perplexed” the model is by a sequence of text it hasn’t seen before (the test set). Lower perplexity is better. It’s related to the average number of choices the model effectively has at each step of prediction. If PPL is 10, it’s as if the model is choosing uniformly from 10 words at each step.
      • Calculation: It’s the exponential of the average negative log-likelihood of the test set words. The negative log-likelihood is high if the model assigns low probabilities to the words that actually occurred.
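The calculation above can be sketched directly from its definition; the probabilities below are made-up per-token probabilities a model might assign to a 20-token test sequence:

```python
import math

def perplexity(token_probs):
    # Exponential of the average negative log-likelihood over the test tokens.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.1 to every test token behaves as if it
# were choosing uniformly among 10 words at each step:
uniform = perplexity([0.1] * 20)     # about 10.0
confident = perplexity([0.5] * 20)   # about 2.0 (lower is better)
```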
    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
      • Intuition: Primarily used for evaluating tasks like text summarization or machine translation, where the model generates a chunk of text. It measures the overlap (e.g., of unigrams, bigrams, or longest common subsequences) between the model-generated text and one or more human-written reference texts (ground truth).
      • Variants: ROUGE-1 (unigram overlap), ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence). Higher is generally better.
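ROUGE-1 recall can be sketched as clipped unigram overlap divided by the reference length (a minimal illustration; real evaluations use a reference implementation such as the rouge-score package):

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    # Clipped unigram overlap divided by the number of reference unigrams.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], count) for w, count in ref.items())
    return overlap / sum(ref.values())

score = rouge_1_recall("the cat sat on the mat", "the cat lay on the mat")
```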
    • Human Evaluation:
      • Intuition: Automated metrics don’t capture everything (e.g., fluency, coherence, factual correctness, engagingness). Humans are often the gold standard.
      • Methods:
        • Likert Scales: Raters score model outputs on predefined scales (e.g., 1-5 for fluency).
        • Pairwise Comparison / Elo Ratings: Raters are shown outputs from two different models and choose which one is better. Over many comparisons, this can produce a relative ranking (like Elo ratings in chess).
  • Realism: No single metric is perfect. The best metric often depends on the specific downstream task the language model is intended for. Comparing perplexity scores is only meaningful if the models use the exact same vocabulary and tokenization.