Equation for Autoregressive Models: As explained by Claude, sprinkled with info by me

Harsimar Singh
2 min read · Jul 30, 2024


This equation is from the paper Fine-Tuning Language Models from Human Preferences (https://arxiv.org/pdf/1909.08593).

This equation describes how a language model defines the probability of a sequence of tokens. Let’s break it down:

1. Vocabulary Σ: This is the set of all possible tokens (words, subwords, or characters) the model can use.

2. Language model ρ: This is a probability distribution over sequences of tokens.

3. Σⁿ: This represents all possible sequences of n tokens from the vocabulary.

4. ρ(x₀ … xₙ₋₁): This is the probability of a specific sequence of n tokens (from x₀ to xₙ₋₁).

5. The equation: ρ(x₀ … xₙ₋₁) = ∏₀≤k<n ρ(xₖ|x₀ … xₖ₋₁)

This says that the probability of the entire sequence is equal to the product of the probabilities of each token, given all the tokens that came before it.

6. ρ(xₖ|x₀ … xₖ₋₁): This is the conditional probability of token xₖ given all previous tokens x₀ to xₖ₋₁.

7. The product ∏ is taken over all k from 0 to n−1, multiplying all these conditional probabilities together. Note that for k = 0 the condition is empty, so the first factor is simply the unconditional probability ρ(x₀) of the first token.
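The factorization above can be sketched in a few lines of Python. The conditional probability tables below are made up purely for illustration (they are not from the paper); the point is only that the sequence probability is the product of per-token conditionals:

```python
# Toy autoregressive "model" over a tiny vocabulary. The conditional
# tables are hypothetical, chosen just to illustrate the chain rule.
COND_TABLES = {
    ():             {"the": 0.6, "cat": 0.2, "sat": 0.2},
    ("the",):       {"the": 0.1, "cat": 0.7, "sat": 0.2},
    ("the", "cat"): {"the": 0.1, "cat": 0.1, "sat": 0.8},
}

def cond_prob(token, prefix):
    # rho(x_k | x_0 ... x_{k-1}): look up the conditional distribution
    # for this prefix and return the probability of `token`.
    return COND_TABLES[tuple(prefix)][token]

def sequence_prob(tokens):
    # rho(x_0 ... x_{n-1}) = product over k of rho(x_k | x_0 ... x_{k-1})
    prob = 1.0
    for k, tok in enumerate(tokens):
        prob *= cond_prob(tok, tokens[:k])
    return prob

print(sequence_prob(["the", "cat", "sat"]))  # 0.6 * 0.7 * 0.8 = 0.336
```

A real language model replaces the lookup tables with a neural network that maps the prefix to a distribution over the whole vocabulary, but the multiplication is exactly the same.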

This equation encapsulates the fundamental principle of autoregressive language models: they predict each token based on all the previous tokens, and the probability of a full sequence is the product of these individual predictions. This allows the model to generate coherent sequences of text by sampling from these probabilities one token at a time.
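That one-token-at-a-time generation can also be sketched directly. Again, the conditional distributions here are hypothetical stand-ins for a trained model; the loop structure (sample a token from ρ(xₖ|x₀ … xₖ₋₁), append it, repeat) is the part that mirrors real autoregressive decoding:

```python
import random

def next_token_dist(prefix):
    # Hypothetical conditional distribution rho(x_k | prefix).
    # A real model would compute this with a neural network.
    if not prefix:
        return {"the": 0.6, "a": 0.4}
    if prefix[-1] in ("the", "a"):
        return {"cat": 0.5, "dog": 0.5}
    return {"<eos>": 1.0}  # end-of-sequence token

def generate(max_len=10, seed=0):
    # Sample one token at a time, each conditioned on all previous tokens.
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_len):
        dist = next_token_dist(tokens)
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

print(generate())  # e.g. ["the", "cat"]
```

Each sampled token is fed back in as part of the prefix for the next step, which is exactly why the model can only ever condition on tokens to its left.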
