AI Glossary
What people say vs what things actually mean
Agent
An autonomous AI that thinks and acts on its own
A while loop where an LLM decides what tool to call next, executes it, sees the result, and repeats
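That while loop can be sketched in a few lines. Everything here is hypothetical: `fake_llm` stands in for a real model call, and the two tools are invented for illustration.

```python
# Minimal agent loop: the LLM picks a tool, we run it, feed back the result.

def get_time(_):
    return "12:00"

def add(args):
    return str(sum(int(x) for x in args.split("+")))

TOOLS = {"get_time": get_time, "add": add}

def fake_llm(history):
    # A real agent would send `history` to an LLM and parse its reply.
    # This stub asks for one tool call, then answers.
    if not any(h.startswith("result:") for h in history):
        return ("call", "add", "2+3")
    return ("final", "The answer is " + history[-1].split(":")[1], None)

def run_agent(task, max_steps=5):
    history = ["task: " + task]
    for _ in range(max_steps):                     # the "while loop"
        kind, a, b = fake_llm(history)             # LLM decides what to do
        if kind == "final":
            return a
        history.append("result:" + TOOLS[a](b))    # execute tool, record result
    return "gave up"
```

The loop's only real complexity in practice is parsing the model's tool-call output and deciding when to stop.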
Attention
How the AI focuses on important parts
A mechanism where every token computes a weighted sum of all other tokens' values, with weights determined by how relevant they are (via dot product of query and key vectors)
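A minimal sketch of that computation, assuming single-head attention with no masking and toy random vectors:

```python
import numpy as np

def attention(Q, K, V):
    # Weights: softmax of scaled dot products between queries and keys.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V          # each row: weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))     # 4 tokens, 8-dim head
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)        # shape (4, 8): one output per query token
```

In a real transformer, Q, K, and V are learned linear projections of the same token embeddings.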
Alignment
Making AI safe
The technical challenge of making an AI system's behavior match human intentions, values, and preferences, including edge cases the designer didn't anticipate
Autoregressive
The AI generates one word at a time
A model that predicts the next token conditioned on all previous tokens, then feeds that prediction back as input for the next step. GPT, LLaMA, and Claude are all autoregressive.
Activation Function
The nonlinear thing between layers
A function applied after each linear layer that introduces nonlinearity. Without it, stacking any number of linear layers collapses to a single linear transformation. ReLU, GELU, and SiLU are the most common. The choice directly affects whether gradients flow during training.
Adam (Optimizer)
The default optimizer
Adaptive Moment Estimation. Combines momentum (first moment) with adaptive learning rates per parameter (second moment). Has bias correction for early steps. Works well across most tasks without much tuning.
AdamW
Adam but better
Adam with decoupled weight decay. In standard Adam, L2 regularization gets scaled by the adaptive learning rate per parameter, which is not what you want. AdamW applies weight decay directly to the weights, independent of the gradient statistics. The default optimizer for training transformers.
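A sketch of one AdamW update on a single scalar weight, showing the decay term applied directly to the weight rather than folded into the gradient (hyperparameter defaults are illustrative):

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # Adam moments with bias correction for early steps.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: wd * w added outside the adaptive scaling.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    g = 2 * w                   # gradient of the toy loss w**2
    w, m, v = adamw_step(w, g, m, v, t)
```

In plain Adam, the decay term would be added to `g` first and then divided by `sqrt(v_hat)`, weakening decay exactly where gradients are large.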
Autograd
Automatic gradients
A system that records operations on tensors and automatically computes gradients via reverse-mode differentiation. PyTorch's autograd builds a computation graph on-the-fly (dynamic graph), while JAX uses function transformations (grad). This is what makes backpropagation practical -- you write the forward pass, and the framework computes all the derivatives.
Batch Size
How many examples at once
The number of training examples processed in one forward/backward pass before updating weights. Larger batches give more stable gradient estimates but use more memory. Typical values: 32-512 for training, larger for inference. Batch size interacts with learning rate -- double the batch, double the LR (linear scaling rule).
Backpropagation
How neural networks learn
An algorithm that computes how much each weight contributed to the error by applying the chain rule backward through the network, then adjusts weights proportionally
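A worked example on a tiny two-layer network, with each chain-rule step spelled out (the network and numbers are made up for illustration):

```python
# Tiny network: y = w2 * relu(w1 * x); loss = (y - target)**2.
# Backprop applies the chain rule from the loss back to each weight.
def forward_backward(w1, w2, x, target):
    h = max(0.0, w1 * x)                      # hidden activation (ReLU)
    y = w2 * h                                # output
    loss = (y - target) ** 2
    dy = 2 * (y - target)                     # dL/dy
    dw2 = dy * h                              # dL/dw2 = dL/dy * dy/dw2
    dh = dy * w2                              # dL/dh
    dw1 = dh * (x if w1 * x > 0 else 0.0)     # chain through the ReLU
    return loss, dw1, dw2

loss, dw1, dw2 = forward_backward(w1=0.5, w2=2.0, x=3.0, target=1.0)
```

Frameworks automate exactly this bookkeeping for millions of weights; the math per weight is no deeper than these five lines.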
Context Window
How much the AI can remember
The maximum number of tokens (input + output) that fit in a single API call. Not memory — it's a fixed-size buffer that resets every call
Chain of Thought (CoT)
Making the AI think step by step
A prompting technique where you ask the model to show its reasoning steps, which improves accuracy on multi-step problems because each step conditions the next token generation
CNN (Convolutional Neural Network)
Image AI
A neural network that uses convolution operations (sliding filters over the input) to detect local patterns. Stacking convolutions detects increasingly complex features: edges, textures, objects.
CUDA
GPU programming
NVIDIA's parallel computing platform. Lets you run matrix operations on thousands of GPU cores simultaneously. PyTorch and TensorFlow use CUDA under the hood.
Chunking
Splitting documents into pieces
Breaking long text into smaller segments for retrieval. Choices include: character count (simple but ignores semantic boundaries), sentence-based (better semantic coherence), or semantic chunking (clusters similar sentences). Chunks are embedded and stored in a vector database. Chunk size and overlap directly affect retrieval quality.
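A minimal character-count chunker with overlap, the simple baseline described above (the size and overlap values are illustrative):

```python
def chunk_text(text, size=200, overlap=50):
    # Character-count chunking: fast but ignores semantic boundaries.
    # Overlap keeps context that straddles a boundary retrievable from both chunks.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 500, size=200, overlap=50)   # spans 0-200, 150-350, 300-500
```

Sentence-based or semantic chunkers replace the fixed `step` with boundaries detected in the text itself.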
Constitutional AI
AI that follows rules
A technique where you give the model a set of principles (the 'constitution') and have it critique and revise its own outputs based on them. Anthropic introduced this for Claude; the self-revisions and AI-generated preference labels (RLAIF) replace much of the human labeling that RLHF requires.
Context Engineering
Putting stuff in the prompt
The discipline of structuring, ordering, and optimizing all the information passed to an LLM in a single call. Includes: instruction placement (system prompt first), few-shot examples, retrieved context windows, chain of thought, and output format constraints. Small changes in ordering can dramatically change results.
Contrastive Learning
Learning by comparing
A self-supervised technique where the model learns representations by pulling similar examples together and pushing dissimilar ones apart. SimCLR, CLIP, and sentence transformers use this. Works by creating augmentations of the same image/text and training the model to recognize they're similar.
Convolution
A sliding window thing
An operation where a small filter (kernel) slides across the input, computing a weighted sum at each position. Detects features like edges in images or patterns in time series. The weights are learned during training. Padding and stride control output size.
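A minimal 1D version, assuming no padding and a hand-set edge-detecting kernel rather than learned weights:

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    # Slide the kernel across x, computing a weighted sum at each position.
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(0, len(x) - k + 1, stride)])

signal = np.array([0., 0., 1., 1., 1., 0., 0.])
edge_detector = np.array([-1., 1.])        # fires on rising/falling edges
out = conv1d(signal, edge_detector)        # +1 at the rise, -1 at the fall
```

In a CNN the kernel values start random and are learned; the sliding mechanics stay exactly this.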
Cross-Entropy Loss
The default loss function
Measures the difference between the predicted probability distribution and the true distribution. For classification, it's the negative log likelihood of the correct class. Minimizing cross-entropy is equivalent to maximum likelihood estimation. Widely used because it has nice gradients and works well with softmax.
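A sketch for a single example, combining softmax and negative log likelihood (the logits are made up):

```python
import numpy as np

def cross_entropy(logits, target):
    # Softmax, then negative log likelihood of the correct class.
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return -log_probs[target]

loss = cross_entropy(np.array([2.0, 1.0, 0.1]), target=0)   # about 0.417
```

The stability shift matters: computing `softmax` then `log` separately overflows or underflows for large logits, which is why frameworks fuse the two.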
Decoder
The part that generates output
The component that produces tokens one at a time, conditioned on all previous tokens (autoregressive). In transformers, the decoder has masked self-attention (can't see future tokens) and cross-attention to encoder outputs. GPT is a decoder-only model.
Dimensionality Reduction
Making things smaller
Techniques to represent high-dimensional data in fewer dimensions while preserving important structure. PCA finds linear projections with maximum variance. t-SNE and UMAP are nonlinear and better at preserving local structure for visualization.
Distributed Training
Training on multiple GPUs
Techniques to parallelize training across multiple GPUs/machines. Data parallelism splits batches across GPUs. Model parallelism splits the model itself. FSDP (Fully Sharded Data Parallel) shards model weights. Synchronous vs asynchronous updates. Communication bandwidth is often the bottleneck.
DPO (Direct Preference Optimization)
Better than PPO for alignment
A simplification of RLHF that recasts preference learning as a classification-style loss over pairs of responses. Optimizes the policy directly against a frozen reference model, with no separate reward model and no PPO loop. Simpler implementation, faster training, often comparable results to RLHF.
Embedding
Converting text to numbers
A learned vector representation of tokens, sentences, or documents that captures semantic meaning. Trained by predicting neighboring context (word2vec) or through contrastive learning (CLIP). Enables math on text: similar meanings = similar vectors. Stored in vector databases for RAG.
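Similar vectors are compared with cosine similarity. The 3-dimensional 'embeddings' below are toy values chosen for illustration; real embeddings have hundreds of dimensions and come from a trained model:

```python
import numpy as np

def cosine_similarity(a, b):
    # Similar meanings -> similar directions -> cosine near 1.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat    = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])   # points the same way as `cat`
car    = np.array([0.1, 0.2, 0.9])     # points elsewhere

sim_close = cosine_similarity(cat, kitten)
sim_far   = cosine_similarity(cat, car)
```

This comparison is the core operation a vector database runs at scale for retrieval.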
Encoder
The part that reads input
The component that processes the full input sequence simultaneously (bidirectional). In transformers, the encoder uses self-attention without masking. In encoder-decoder models, its representations feed the decoder via cross-attention. BERT is encoder-only.
Fine-tuning
Training on your data
Continuing training of a pre-trained model on a specific task or domain. Full fine-tuning updates all parameters. Parameter-efficient methods (LoRA, adapters) update only a small subset, reducing memory and catastrophic forgetting. Task-specific data usually improves performance over base models.
Flash Attention
Fast attention
An attention implementation that uses tiling to compute attention without materializing the full attention matrix. Reduces memory from O(n²) to O(n) and speeds up training 2-4x. Exploits GPU memory hierarchy. Now standard in modern LLM training.
GAN (Generative Adversarial Network)
Two AIs fighting
A generator creates samples; a discriminator tries to distinguish real from fake. They train jointly in a minimax game. Generator improves until discriminator can't tell the difference. Famous for images (StyleGAN, BigGAN) but training is unstable and mode collapse is a common problem.
GELU (Gaussian Error Linear Unit)
The fancy activation
Defined as x * Φ(x), where Φ is the Gaussian CDF; commonly approximated as x * sigmoid(1.702x). More expensive than ReLU but has smoother gradients. The default activation in modern transformers (BERT, GPT, ViT).
Gradient Clipping
Preventing exploding gradients
Scaling gradients down when their norm exceeds a threshold (typically 1.0). Prevents gradient explosion during training, especially in RNNs and early transformer training. Doesn't solve the underlying problem but is a simple safeguard.
Gradient Descent
Following the slope downhill
An optimization algorithm that updates parameters in the direction of steepest descent of the loss function. The learning rate determines step size. Variants include SGD (stochastic, uses batches), Adam (adaptive rates), and RMSprop (per-parameter learning rates).
Hallucination
When AI makes stuff up
Confident generation of factually incorrect or nonsensical content. Root causes: training on flawed data, memorization rather than reasoning, lack of grounding in facts, and maximization of fluency over accuracy. Mitigations: retrieval augmentation, chain of thought, uncertainty quantification, and human feedback.
Hyperparameter
Settings to tune
Parameters set before training that control the learning process. Examples: learning rate, batch size, number of layers, attention heads, dropout rate. Unlike model weights, hyperparameters aren't learned from data. Tuned via grid search, random search, or Bayesian optimization.
Inference
Running the model
Using a trained model to make predictions on new data. Contrast with training. For LLMs, inference is often the expensive part in production — models are large and generation is slow. Optimization techniques: quantization, batching, KV caching, speculative decoding.
KV Cache
Caching for faster generation
Storing the key and value matrices from attention so they don't need to be recomputed for each new token. Critical for efficient autoregressive generation. Memory grows linearly with sequence length and batch size. Techniques like paged attention (vLLM) and grouped-query attention reduce its memory footprint.
Layer Normalization
Normalizing each layer
Normalizes activations across the feature dimension within a single example: computes mean and variance over the features, then rescales with learned gamma/beta. Modern transformers apply it before each sub-layer (pre-LN); the original design applied it after (post-LN). Stabilizes training and enables deeper networks.
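A minimal sketch of the computation (gamma and beta left as scalars for simplicity; in practice they are learned per-feature vectors):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize across the feature dimension of each example, then rescale.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # one example, four features
y = layer_norm(x)                       # mean ~0, std ~1 per example
```

Contrast with batch normalization, which normalizes each feature across the batch and therefore behaves differently at inference time.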
Learning Rate
How fast it learns
The step size multiplier in gradient descent. Too high: training diverges. Too low: training is slow and may get stuck. Often uses schedules: warmup (gradually increase), decay (reduce over time), or cyclical. Adam adapts learning rates per parameter automatically.
LLaMA
Meta's open model
A family of open-weight language models from Meta, with sizes ranging from 7B to 70B parameters across versions. Uses the transformer architecture with refinements (RMSNorm, SwiGLU, RoPE). The original 13B model outperformed GPT-3 (175B) on most benchmarks despite being far smaller. Many fine-tuned variants exist.
LoRA (Low-Rank Adaptation)
Efficient fine-tuning
Adds small trainable rank-decomposition matrices alongside frozen pretrained weights. Reduces trainable parameters by 1000x with minimal performance loss. Different adapters can be swapped for different tasks. The go-to method for parameter-efficient fine-tuning.
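A sketch of the core idea, with illustrative sizes: the frozen weight W gets a low-rank update B @ A, and B starts at zero so the adapted model initially matches the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # hidden size, LoRA rank

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, init to zero

# Effective weight is W + B @ A; at init B is zero, so output is unchanged.
x = rng.normal(size=(d,))
y = W @ x + B @ (A @ x)

# Trainable params: 2*d*r for LoRA vs d*d for full fine-tuning.
savings = (d * d) / (2 * d * r)      # = d / (2r)
```

Only A and B receive gradients during fine-tuning; swapping task adapters means swapping these two small matrices.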
Loss Function
How we measure error
A function that quantifies how wrong a model's predictions are. The training objective is to minimize this. Cross-entropy for classification, MSE for regression. The choice affects what the model learns and the geometry of the optimization landscape.
Masked Language Modeling
Fill in the blank
A pretraining objective where random tokens are replaced with a [MASK] token and the model must predict the originals. Used by BERT. Doesn't teach the model to generate — only to understand context. Creates bidirectional representations, since each prediction sees context on both sides.
Maximum Likelihood Estimation
Finding the most likely parameters
A statistical method that finds parameters that maximize the probability of observing the training data. Equivalent to minimizing cross-entropy loss. The foundation of most neural network training.
Mixture of Experts (MoE)
Different parts for different inputs
A model architecture where different 'expert' networks handle different types of inputs. A routing mechanism selects which experts to activate for each token. Allows much larger total parameter counts at roughly constant compute per token. Switch Transformer and Mixtral use MoE; GPT-4 is rumored to.
Model Distillation
Small model learns from big model
Training a smaller model to mimic a larger model's outputs or internal representations. Uses soft targets (probability distributions) from the teacher instead of hard labels. Can compress knowledge 10-100x while retaining most performance.
Multimodal
AI that sees and hears
Models that process multiple types of data: text, images, audio, video. Requires aligned representations across modalities. CLIP aligned images and text. GPT-4V processes images. Audio-Visual models exist. True any-to-any generation is an active research area.
Neural Network
AI brain
A differentiable function that learns to map inputs to outputs via gradient descent. Composed of layers of simple functions (linear transformation + nonlinear activation). 'Hidden layers' learn intermediate representations. Depth (layers) enables hierarchical feature learning.
One-Shot Learning
Learning from one example
A model must learn to recognize a category after seeing only a single example. Contrast with few-shot (few examples) and zero-shot (no examples, uses transferred knowledge). Metric learning approaches (Siamese networks) are common.
Overfitting
Memorizing instead of learning
When a model learns the training data too well, including noise and irrelevant patterns, and performs poorly on new data. The model has high variance. Mitigations: regularization, dropout, early stopping, data augmentation, cross-validation.
Perceptron
The simplest neural network
A single neuron that computes a weighted sum of inputs and applies a step function. The original neural network (1957). Can't learn XOR — the motivation for multilayer networks. Foundation for understanding modern deep learning.
Positional Encoding
Where words are
A mechanism to inject position information into transformers, which have no inherent notion of order. Options: sinusoidal (fixed patterns), learned (trainable), RoPE (rotary, relative positions), ALiBi (attention with linear biases). Critical for sequence modeling.
Prompt Engineering
Writing instructions for AI
Crafting inputs to get desired outputs from LLMs. Techniques include: clear task description, few-shot examples, chain of thought, output format constraints, and system prompts. More art than science but patterns exist.
Quantization
Making models smaller
Reducing weight precision (e.g., 32-bit float → 8-bit int) to decrease model size and speed up inference. Post-training quantization is simple but loses accuracy. Quantization-aware training preserves more performance. GPTQ and AWQ are popular methods; GGUF is a common format for quantized models.
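A sketch of symmetric int8 post-training quantization on random weights. Real methods like GPTQ calibrate against layer outputs; this is the naive round-trip:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric quantization: map floats into [-127, 127] integers.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=100).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
error = np.abs(w - w_hat).max()    # bounded by scale / 2
```

Storage drops 4x (int8 vs float32); the rounding error per weight is at most half the scale, which is why outlier weights (which inflate the scale) hurt quantization quality.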
RAG (Retrieval-Augmented Generation)
AI that can look things up
A pattern where an LLM retrieves relevant documents from a knowledge base and includes them in the context for generation. Reduces hallucination, enables up-to-date knowledge, and allows grounding in specific documents. Components: embedding model, vector database, retriever, generator.
ReLU (Rectified Linear Unit)
The simple activation
f(x) = max(0, x). Simple and effective. Sparse activation (only active for positive inputs). Mitigates vanishing gradients compared to sigmoid/tanh, since the gradient is exactly 1 for positive inputs, though neurons stuck at zero can 'die' and stop learning. Variants: Leaky ReLU (small slope for negatives), PReLU (learned slope). Still common, but GELU is preferred for transformers.
RLHF (Reinforcement Learning from Human Feedback)
Training AI with human preferences
A three-stage process: supervised fine-tuning, reward model training (humans rank outputs), and policy optimization (PPO against reward model). Makes models like ChatGPT helpful, harmless, and honest. Expensive due to human labeling. DPO is a simpler alternative.
RoPE (Rotary Position Embedding)
Better position encoding
Encodes position by rotating the key and query vectors in attention. Naturally captures relative positions without needing separate positional bias. Used by LLaMA, PaLM, and others. Better extrapolation to longer sequences than learned or sinusoidal encodings.
Sampling
Picking the next word
The process of selecting the next token from the model's probability distribution. Greedy (always pick most likely) is deterministic but repetitive. Temperature scaling adjusts distribution sharpness. Top-k and top-p (nucleus) sampling introduce controlled randomness for diversity.
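A sketch combining temperature scaling and top-k filtering (top-p would filter by cumulative probability instead of a fixed count):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, rng=None):
    rng = rng or np.random.default_rng()
    z = np.array(logits, dtype=float) / temperature   # <1 sharpens, >1 flattens
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]                   # keep only the k best
        z = np.where(z >= cutoff, z, -np.inf)         # -inf -> probability 0
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

logits = [4.0, 3.0, 0.1, -2.0]
token = sample(logits, temperature=0.7, top_k=2)   # only tokens 0 or 1 possible
greedy = sample(logits, temperature=0.01)          # near-deterministic: token 0
```

Greedy decoding is the limit of temperature approaching zero; pure sampling is temperature 1 with no filtering.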
Self-Attention
Words looking at other words
A mechanism where each position attends to all positions in the sequence, computing weights based on learned query, key, and value projections. Captures long-range dependencies without recurrence. The core of transformer architectures.
SFT (Supervised Fine-Tuning)
Fine-tuning with examples
Fine-tuning a pretrained model on demonstrations of desired behavior. Uses human-written or AI-generated (with good prompts) input-output pairs. The first stage of RLHF. Much simpler than reinforcement learning but requires high-quality data.
Softmax
Making probabilities
A function that converts a vector of real numbers into a probability distribution (all outputs sum to 1, all positive). Used at the output of classification models. Temperature-adjusted softmax controls how 'peaked' the distribution is.
Token
A piece of a word
The unit of text that models process. Not characters or words — a learned subword unit. For English text, GPT tokenizers average roughly 4 characters (about three-quarters of a word) per token. Models have context windows measured in tokens, not words.
Tokenizer
Breaking text into pieces
The algorithm that converts text to tokens. Common approaches: BPE (Byte Pair Encoding, used by GPT), WordPiece (used by BERT), SentencePiece (language-agnostic). Training learns frequent subword units from corpus. Tokenizer choice affects vocabulary size and out-of-vocabulary handling.
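One BPE training step can be sketched as counting adjacent symbol pairs and merging the most frequent one (the toy corpus and frequencies are illustrative; real tokenizers repeat this for thousands of merges):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    return {w.replace(" ".join(pair), "".join(pair)): f for f in (words,) for w, f in words.items()}

# Words pre-split into characters, mapped to corpus frequency.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
pair = most_frequent_pair(corpus)      # ('w', 'e') appears 8 times
corpus = merge_pair(corpus, pair)      # 'w e' becomes the new symbol 'we'
```

The learned merge list, applied in order, is what turns raw text into token IDs at inference time.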
Transformer
The architecture that changed everything
A neural network architecture using self-attention and feedforward layers. No recurrence — processes entire sequence in parallel. Introduced in 'Attention Is All You Need' (2017). Scales better than RNNs. Foundation of modern LLMs, vision transformers, and more.
Underfitting
Not learning enough
When a model fails to capture the underlying pattern in the data. The model has high bias. Symptoms: poor training and validation performance. Fixes: larger model, more features, longer training, less regularization.
Vector Database
Database for embeddings
A database optimized for storing and searching high-dimensional vectors. Enables nearest neighbor search by semantic similarity. Popular: Pinecone, Weaviate, Chroma, Qdrant. Critical for RAG. Indexes like HNSW enable fast approximate nearest neighbor search.
Vision Transformer (ViT)
Transformers for images
Treats images as sequences of patches (16x16 pixels each), linearly embedded, then processed by transformer encoder. No convolutions. Benefits: learns global dependencies, scales well with data. Outperforms CNNs on large datasets. The foundation for many vision models.
Weight Decay
Regularization for weights
An L2-style penalty that shrinks model weights toward zero at each step, preventing them from growing too large. The strength (coefficient) is a hyperparameter. AdamW decouples it from the adaptive gradient scaling, which regularizes more predictably than folding it into the gradient.
Zero-Shot Learning
AI that never saw your task
A model performs a task without any task-specific training examples. Uses knowledge transferred from pretraining. The model reasons about the task description given in the input. GPT-3 showed that large language models exhibit impressive zero-shot capabilities.
The SLAFAI glossary provides clear, practical explanations of AI terminology.
No jargon, no fluff — just what things actually mean.