Equation for Autoregressive Models: as explained by Claude, sprinkled with info by me
This equation is from the paper Fine-Tuning Language Models from Human Preferences (https://arxiv.org/pdf/1909.08593).
This equation describes how a language model defines the probability of a sequence of tokens. Let’s break it down:
1. Vocabulary Σ: This is the set of all possible tokens (words, subwords, or characters) the model can use.
2. Language model ρ: This is a probability distribution over sequences of tokens.
3. Σⁿ: This represents all possible sequences of n tokens from the vocabulary.
4. ρ(x₀ … xₙ₋₁): This is the probability of a specific sequence of n tokens (from x₀ to xₙ₋₁).
5. The equation: ρ(x₀ … xₙ₋₁) = ∏₀≤k<n ρ(xₖ|x₀ … xₖ₋₁)
This says that the probability of the entire sequence equals the product of the conditional probabilities of each token given all the tokens that came before it. This is just the chain rule of probability applied to a sequence.
6. ρ(xₖ|x₀ … xₖ₋₁): This is the conditional probability of token xₖ given all previous tokens x₀ to xₖ₋₁. For k = 0 the context is empty, so the first factor is simply the unconditional probability ρ(x₀) of the first token.
7. The product ∏ is taken over all k from 0 to n−1, multiplying all these conditional probabilities together (a small worked sketch of this product follows the list).
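To make the product concrete, here is a minimal Python sketch. The conditional probabilities in the table are made-up numbers for an imaginary three-token sequence, and the names `rho` and `sequence_probability` are just for illustration; a real language model would compute these conditionals with a neural network.

```python
# Toy autoregressive "model": a lookup table of hypothetical conditional
# probabilities rho(x_k | x_0 ... x_{k-1}). The values are invented purely
# to illustrate the factorization.
CONDITIONALS = {
    ((), "the"): 0.20,              # rho(x0) — empty context
    (("the",), "cat"): 0.05,        # rho(x1 | x0)
    (("the", "cat"), "sat"): 0.10,  # rho(x2 | x0 x1)
}

def rho(token, context):
    return CONDITIONALS[(tuple(context), token)]

def sequence_probability(tokens):
    """rho(x0 ... x_{n-1}) = product over 0 <= k < n of rho(x_k | x_0 ... x_{k-1})."""
    prob = 1.0
    for k in range(len(tokens)):
        prob *= rho(tokens[k], tokens[:k])
    return prob

# 0.20 * 0.05 * 0.10 = 0.001
print(sequence_probability(["the", "cat", "sat"]))
```

In practice, implementations sum log-probabilities rather than multiplying raw probabilities, since a product of many numbers below 1 quickly underflows for long sequences.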
This equation encapsulates the fundamental principle of autoregressive language models: they predict each token based on all the previous tokens, and the probability of a full sequence is the product of these individual predictions. This allows the model to generate coherent sequences of text by sampling from these probabilities one token at a time.
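Here is a hedged sketch of that one-token-at-a-time generation loop. `next_token_distribution` is a hypothetical stand-in for whatever the real model computes from the context; the fixed distribution it returns here is made up purely so the loop runs.

```python
import random

def next_token_distribution(context):
    # Hypothetical stand-in for the model: returns rho(. | x_0 ... x_{k-1}).
    # A real model would condition on `context`; this toy version ignores it.
    return {"the": 0.5, "cat": 0.3, "<eos>": 0.2}

def generate(max_tokens=10):
    tokens = []
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        # Sample x_k ~ rho(. | x_0 ... x_{k-1}).
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<eos>":  # stop token ends the sequence
            break
        tokens.append(token)
    return tokens

print(generate())
```

Each sampled token is appended to the context before the next step, which is exactly why the factorization above conditions every token on all of its predecessors.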