AI/ML Glossary for Engineers
A working glossary of AI/ML terms aimed at software engineers, not ML researchers. Each entry explains what the term means, what people get wrong, and how it shows up in practice.
Quick Reference
| Term | One-Line Definition |
|---|---|
| Model | A function learned from data that maps inputs to outputs |
| Inference | Running a trained model on new input to get a prediction |
| Token | The atomic unit a language model reads and writes |
| Context window | Maximum number of tokens a model can see in one request |
| Temperature | Controls randomness in output generation |
| Embedding | A fixed-size numeric vector representing meaning |
| RAG | Retrieval-augmented generation — look up, then answer |
| Fine-tuning | Further training a model on your data |
| Hallucination | Model generates plausible but false content |
| Prompt | The text you send to a model as input |
| Agent | A model that calls tools in a loop to accomplish a goal |
| MCP | Model Context Protocol — standard interface for tool servers |
| Quantization | Reducing model precision to shrink size and speed inference |
| Grounding | Tying model output to verifiable sources |
Core Concepts
Model
A mathematical function with learned parameters that maps inputs to outputs. A “70B model” has 70 billion parameters. Larger models generally perform better but cost more to run.
What people get wrong: A model is not a database of memorized facts. It learns statistical patterns during training. It cannot “look up” a specific document it was trained on.
```
Input:  "Translate to French: Hello"
Model:  f(input, parameters) → output
Output: "Bonjour"
```
Inference
Running a trained model on new input. Every API call to Claude, GPT, or Gemini is an inference request. The model’s parameters stay fixed — no learning happens.
What people get wrong: Inference does not update the model. Sending corrections in a conversation does not retrain anything — the model just sees your correction as additional context in the current prompt.
Training
The process of adjusting a model’s parameters using data. Pre-training runs on massive datasets (books, code, web pages) and costs millions of dollars. Engineers rarely do pre-training; they use pre-trained models via API.
Fine-tuning
Further training a pre-trained model on a smaller, domain-specific dataset. Changes the model’s weights to specialize its behavior.
What people get wrong: Fine-tuning is not the first tool to reach for. It requires curated datasets, costs money, and can degrade general capabilities. RAG or well-crafted prompts solve most customization needs without touching weights.
| Approach | When to Use | Cost |
|---|---|---|
| Prompting | Behavior change, format control, constraints | Free |
| RAG | Custom knowledge, current data | Moderate |
| Fine-tuning | Tone, style, specialized domain language | High |
| Pre-training | Building a foundation model from scratch | Extreme |
Parameters and Weights
Parameters are the numbers inside a model that determine its behavior. Weights are the most common type of parameter — the multipliers applied to inputs at each layer. In casual usage, “parameters” and “weights” are interchangeable.
Model sizes (approximate):

```
  7B parameters  → ~4 GB quantized   → runs on a laptop
 70B parameters  → ~40 GB quantized  → needs a beefy GPU
405B parameters  → ~200 GB quantized → needs a cluster
```
Language Models
Token
The atomic unit a language model reads and writes. Not a word — a subword piece determined by the model’s tokenizer. In English, one token averages about 0.75 words (roughly 4 characters).
What people get wrong: “1 token = 1 word” is a common but inaccurate
shorthand. The word “unhappiness” might be 3 tokens (un, happi, ness).
Code tokenizes especially poorly — a single line of Python can be 20+ tokens.
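The 4-characters-per-token rule of thumb is easy to encode as a quick planning helper. This is a sketch, not a tokenizer: `estimate_tokens` is a made-up name, and real counts come only from the model's own tokenizer, which can diverge badly on code and non-English text.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose: ~4 characters per token.

    A planning heuristic only. Use the model's tokenizer for real counts.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello"))           # a short word is about one token
print(estimate_tokens("authentication"))  # longer words span several tokens
```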
```
# Approximate token counts (varies by model)
"Hello"          # 1 token
"authentication" # 3 tokens
"123456"         # 2-3 tokens
"こんにちは"       # 3-5 tokens (non-English costs more)
```
Context Window
The maximum number of tokens a model can process in a single request — including both the input (prompt) and the output (completion). A 200K context window means the model can see roughly 150,000 words at once.
What people get wrong: A large context window does not mean the model pays equal attention to everything in it. Performance degrades on information buried in the middle of very long contexts (“lost in the middle” effect). Place critical information at the beginning or end of the prompt.
Temperature
A parameter (0.0–2.0, typically) that controls how random the model’s output is. Low temperature (0.0–0.3) makes output deterministic and focused. High temperature (0.7–1.0) makes output more creative and varied.
| Temperature | Use Case | Character |
|---|---|---|
| 0.0 | Code generation, structured output | Repeatable |
| 0.3–0.5 | Technical writing, summaries | Focused |
| 0.7–1.0 | Creative writing, brainstorming | Varied |
What people get wrong: Temperature 0 does not guarantee identical outputs across requests. Implementation details (batching, floating-point order) can produce slight variations even at temperature 0.
Top-p (Nucleus Sampling)
Limits token selection to the smallest set whose cumulative probability exceeds
p. With top_p=0.9, the model considers only the top 90% of probability mass,
ignoring unlikely tokens.
Practical rule: Adjust either temperature or top_p, not both. Most APIs default top_p to 1.0 (no filtering) and let temperature do the work.
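The interplay of temperature and top-p is easiest to see in a toy sampler. This sketch assumes a tiny token-to-score dict in place of a real vocabulary of logits; `sample` is a hypothetical helper, not a library function.

```python
import math
import random

def sample(logits: dict[str, float], temperature: float = 1.0, top_p: float = 1.0) -> str:
    """Temperature scaling, then nucleus (top-p) filtering, then sampling."""
    if temperature == 0.0:  # greedy: always pick the highest-scoring token
        return max(logits, key=logits.get)

    # Softmax with temperature: lower T sharpens, higher T flattens
    scaled = {t: s / temperature for t, s in logits.items()}
    m = max(scaled.values())
    probs = {t: math.exp(s - m) for t, s in scaled.items()}
    total = sum(probs.values())
    probs = {t: p / total for t, p in probs.items()}

    # Nucleus: keep the smallest set whose cumulative probability >= top_p
    kept, cum = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = p
        cum += p
        if cum >= top_p:
            break

    # Renormalize the nucleus and sample from it
    total = sum(kept.values())
    tokens, weights = zip(*kept.items())
    return random.choices(tokens, [w / total for w in weights])[0]
```

At `temperature=0.0` the function is greedy; at `top_p` close to 0 only the single most likely token survives filtering, which is why tuning both at once is redundant.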
Max Tokens
The maximum number of tokens the model will generate in its response. This is a hard cap — the model stops generating once it hits this limit, even mid-sentence.
What people get wrong: Max tokens applies to the output only. It does not
limit the input. Setting max_tokens=100 with a 10,000-token prompt is valid.
Stop Sequences
Strings that cause the model to stop generating. Useful for structured output where you know the delimiter.
```python
# Stop at the first newline — useful for single-line extraction
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    stop_sequences=["\n"],
    messages=[{"role": "user", "content": "Extract the email: ..."}],
)
```
Prompting
System Prompt
A special message that sets the model’s behavior for the entire conversation. Appears before user messages and establishes role, constraints, and output format.
```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    system="You are a code reviewer. Point out bugs and security issues only. "
           "Do not suggest style changes.",
    messages=[{"role": "user", "content": code}],
)
```
Few-Shot Prompting
Providing examples of desired input-output pairs inside the prompt. The model infers the pattern and applies it to new input.
```
Classify the sentiment:

"The deploy was smooth" → positive
"Tests are failing again" → negative
"We shipped the feature" → positive
"The API keeps timing out" →
```
What people get wrong: More examples are not always better. Two or three well-chosen examples often outperform ten mediocre ones. Pick examples that cover edge cases, not just the happy path.
Chain-of-Thought (CoT)
Asking the model to show its reasoning step by step before giving a final answer. Improves accuracy on math, logic, and multi-step problems.
```
# Without CoT
"What is 17 * 24?" → often wrong

# With CoT
"What is 17 * 24? Think step by step."
→ "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408" → correct
```
Structured Output
Constraining the model to output valid JSON, XML, or another format. Achieved through system prompts, function calling, or schema constraints.
```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    system="Respond only with valid JSON matching this schema: "
           '{"name": string, "severity": "low"|"medium"|"high"}',
    messages=[{"role": "user", "content": "Classify this bug report: ..."}],
)
```
Function Calling (Tool Use)
The model selects a function from a provided list, generates the arguments, and returns a structured call for your code to execute. The model never executes the function — your code does.
```
User: "What's the weather in Austin?"

Model decides to call:
  function:  get_weather
  arguments: {"location": "Austin, TX"}

Your code runs get_weather("Austin, TX") → {"temp": 95, "condition": "sunny"}

Model uses the result to answer: "It's 95°F and sunny in Austin."
```
What people get wrong: The model does not execute code. It outputs a JSON object describing which function to call and with what arguments. Your application parses the output and runs the function.
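The application-side half of that exchange can be sketched in a few lines. `execute_tool_call` is a hypothetical helper and `get_weather` is stubbed; a real tool would call an actual API.

```python
import json

def get_weather(location: str) -> dict:
    # Stub standing in for a real weather API call
    return {"temp": 95, "condition": "sunny"}

TOOLS = {"get_weather": get_weather}

def execute_tool_call(call_json: str) -> dict:
    """Parse the model's structured tool call and run the matching function."""
    call = json.loads(call_json)
    fn = TOOLS[call["function"]]          # look up by the name the model chose
    return fn(**call["arguments"])        # your code executes it, not the model

result = execute_tool_call(
    '{"function": "get_weather", "arguments": {"location": "Austin, TX"}}'
)
print(result)
```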
Retrieval & Memory
Embedding
A fixed-length numeric vector (e.g., 1,536 floats) that represents the semantic meaning of text. Similar meanings produce vectors that are close together in vector space.
```python
from openai import OpenAI

client = OpenAI()

# Two semantically similar sentences produce close vectors
emb1 = client.embeddings.create(input="How do I reset my password?",
                                model="text-embedding-3-small")
emb2 = client.embeddings.create(input="I forgot my login credentials",
                                model="text-embedding-3-small")
# cosine_similarity(emb1, emb2) ≈ 0.89 (high similarity)
```
What people get wrong: Embeddings capture semantic similarity, not keyword overlap. “Dog” and “canine” produce close embeddings. “Bank” (finance) and “bank” (river) produce distant ones — but only in good embedding models.
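The cosine similarity used to compare embeddings is simple to compute yourself. A plain-Python sketch (real pipelines usually reach for numpy):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors:
    near 1.0 = same direction, near 0.0 = unrelated, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([0.2, 0.9], [0.25, 0.85]))  # nearly parallel vectors
```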
Vector Database
A database optimized for storing and querying embeddings via similarity search.
Instead of exact matches (SQL WHERE), you search by “nearest neighbors” in
vector space.
| Database | Type | Notes |
|---|---|---|
| pgvector | Postgres extension | Add vectors to existing data |
| Pinecone | Managed cloud | Serverless, pay-per-query |
| ChromaDB | Embedded (in-process) | Good for prototypes |
| Qdrant | Self-hosted or cloud | Rich filtering support |
| Weaviate | Self-hosted or cloud | Hybrid keyword + vector |
RAG (Retrieval-Augmented Generation)
A pattern: retrieve relevant documents from a knowledge base, then include them in the prompt so the model can answer from real data rather than memorized training data.
```
User question: "What is our refund policy?"

Step 1 — Retrieve: Search vector DB for "refund policy" → top 3 documents
Step 2 — Augment:  Insert documents into the prompt as context
Step 3 — Generate: Model answers using the retrieved context
```
What people get wrong: RAG does not eliminate hallucination — it reduces it. The model can still misinterpret, selectively ignore, or incorrectly synthesize retrieved documents. Always verify critical answers.
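The augment step is plain string assembly. A minimal sketch; the instruction wording and `build_rag_prompt` name are illustrative, not a standard:

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Paste retrieved documents into the prompt; the model never queries
    the database itself, it only sees what you insert here."""
    context = "\n\n".join(f"[doc {i + 1}]\n{d}" for i, d in enumerate(documents))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_rag_prompt("What is our refund policy?",
                          ["Refunds are accepted within 30 days of purchase."])
print(prompt)
```

Instructing the model to admit when the context lacks the answer is a cheap hallucination mitigation, though not a guarantee.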
Semantic Search
Searching by meaning rather than keywords. Uses embeddings to find documents that are conceptually similar to a query, even when they share no words.
```
Query: "how to handle app crashes"

Keyword search finds:       documents containing "app" and "crashes"
Semantic search also finds: "error recovery strategies", "exception handling"
```
Chunking
Splitting documents into smaller pieces before embedding them. Chunk size affects retrieval quality — too large and you dilute relevance; too small and you lose context.
| Strategy | Chunk Size | Best For |
|---|---|---|
| Fixed-size | 500–1000 chars | Simple, predictable |
| Sentence | 1–3 sentences | Preserves grammatical boundaries |
| Paragraph | ~200–500 words | Natural topic boundaries |
| Recursive | Varies | Structured docs (Markdown, HTML) |
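The fixed-size strategy with a small overlap is a few lines of code. A sketch; the function name and sizes are illustrative, and the overlap exists so text split at a chunk boundary still appears whole in at least one chunk:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Small sizes so the behavior is visible
print(chunk_fixed("abcdefghij", size=4, overlap=1))
```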
Agents
Tool Use
A model’s ability to call external functions (APIs, databases, file systems) to take actions or retrieve information. The model decides which tool to call and generates the arguments; your runtime executes the call.
Agent Loop
The cycle an agent follows: observe (read context) → think (decide next action) → act (call a tool) → observe (read result) → repeat until done.
```
Goal: "Find and fix the failing test"

Loop iteration 1: read test output → identify assertion error
Loop iteration 2: read source file → spot the bug
Loop iteration 3: edit file → fix the bug
Loop iteration 4: run tests → all pass → done
```
What people get wrong: Agents are not magic. Each loop iteration costs tokens and time. A poorly constrained agent can spin for dozens of iterations burning budget without making progress. Set exit conditions and iteration limits.
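The loop with an iteration cap can be sketched as follows. `call_model` and `tools` are stand-ins for a real LLM API call and real tool implementations; the action shape is hypothetical, not any provider's wire format.

```python
def run_agent(goal: str, call_model, tools: dict, max_iterations: int = 10) -> str:
    """Observe → think → act until the model signals done, or the cap is hit."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_iterations):
        action = call_model(history)                   # think: model picks the next step
        if action["type"] == "done":
            return action["answer"]
        result = tools[action["tool"]](**action["arguments"])      # act
        history.append({"role": "tool", "content": str(result)})   # observe
    raise RuntimeError("Agent hit the iteration limit without finishing")
```

The hard cap turns a runaway agent into a visible error instead of a silent budget drain.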
Orchestration
Coordinating multiple agents or model calls to complete a complex task. Patterns include sequential pipelines, parallel fan-out, and hierarchical delegation.
```
Sequential:   plan → research → write → review → publish
Parallel:     [lint, test, typecheck] → merge results → decide
Hierarchical: lead agent delegates subtasks to specialist agents
```
MCP (Model Context Protocol)
An open standard for connecting AI models to external tools and data sources. Defines a JSON-RPC protocol so any MCP-compatible client (Claude Code, Cursor) can discover and call tools from any MCP server.
```
Client (Claude Code) ←→ MCP Server (your tools)
  ├── list_tools      → returns available functions
  ├── call_tool       → executes a function, returns result
  └── list_resources  → exposes data for context
```
Safety & Quality
Hallucination
When a model generates content that sounds plausible but is factually wrong. Not a bug — it is a fundamental property of how generative models work. They predict likely next tokens, not truthful ones.
Common forms:
| Type | Example |
|---|---|
| Fabricated facts | Cites a paper that does not exist |
| Wrong code | Calls an API method that the library lacks |
| Confident error | States an incorrect date with full confidence |
| Invented URLs | Generates a plausible but nonexistent link |
Grounding
Tying model outputs to verifiable sources. RAG is the most common grounding technique — the model answers from retrieved documents rather than parametric memory.
What people get wrong: Grounding reduces hallucination but does not eliminate it. A grounded model can still misread, ignore, or misquote its sources.
Guardrails
Filters and checks applied before or after model output to enforce safety, format, or business rules. These run in your application code, not inside the model.
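A minimal post-generation guardrail, sketched in Python. The bug-severity schema echoes the Structured Output example earlier in this glossary, and `validate_output` is a hypothetical helper:

```python
import json

ALLOWED_SEVERITIES = {"low", "medium", "high"}

def validate_output(raw: str) -> dict:
    """Reject model output that is not valid JSON matching the expected schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model output is not valid JSON: {e}")
    if not isinstance(data.get("name"), str):
        raise ValueError("Missing or non-string 'name' field")
    if data.get("severity") not in ALLOWED_SEVERITIES:
        raise ValueError("Invalid 'severity' field")
    return data
```

Failing loudly here lets the application retry the model call or fall back, instead of passing malformed data downstream.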
Pre-generation guardrails:
- Input validation (reject prompt injection attempts)
- PII detection (strip sensitive data before sending)

Post-generation guardrails:
- Output validation (parse JSON, check schema)
- Content filtering (flag unsafe or off-topic responses)
- Fact-checking (compare claims against a source of truth)

Red Teaming
Deliberately trying to make a model fail — produce harmful output, leak system prompts, bypass guardrails, or generate incorrect answers. A form of adversarial testing.
Alignment
Training or steering a model to behave according to intended values and instructions. Techniques include RLHF (reinforcement learning from human feedback) and constitutional AI (rules the model self-enforces).
What people get wrong: Alignment is not a binary state. A model is not “aligned” or “unaligned” — it sits on a spectrum and can fail in novel situations. Treat alignment as defense in depth, not a guarantee.
Infrastructure
Inference Endpoint
An HTTP API that serves a model for inference. You send a request with your prompt; it returns the model’s output. All major providers (Anthropic, OpenAI, Google) expose models this way.
```shell
# Typical inference endpoint call
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Model Serving
The infrastructure for hosting and scaling inference endpoints. Includes GPU allocation, request batching, load balancing, and autoscaling. Tools like vLLM, TGI (Text Generation Inference), and Triton handle this.
Quantization
Reducing the numerical precision of model weights to shrink file size and speed inference. A 16-bit model quantized to 4-bit uses roughly a quarter of the memory, which typically translates into substantially faster inference on memory-bound hardware, with modest quality loss.
| Precision | Bits per Weight | Quality | Speed | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | Slow | Research |
| FP16/BF16 | 16 | ~Baseline | Medium | GPU training/serving |
| INT8 | 8 | Very good | Fast | Production serving |
| INT4/Q4 | 4 | Good | Faster | Local/edge inference |
What people get wrong: Quantization is not free compression. Going from 16-bit to 4-bit degrades reasoning on complex tasks. Test your specific workload before deploying a quantized model in production.
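A toy round-trip shows the core idea of symmetric integer quantization: store one scale per tensor plus small integers, and reconstruct approximate floats at load time. Real schemes (like the Q4 variants in llama.cpp) quantize per block and handle outliers more carefully; the function names here are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[float, list[int]]:
    """Map floats onto the int8 range [-127, 127] with a shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return scale, [round(w / scale) for w in weights]

def dequantize(scale: float, q: list[int]) -> list[float]:
    """Reconstruct approximate floats; error is bounded by the scale."""
    return [scale * v for v in q]

scale, q = quantize_int8([0.12, -0.5, 0.33])
approx = dequantize(scale, q)  # close to the originals, at a quarter of FP32 storage
print(approx)
```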
Distillation
Training a small “student” model to mimic a large “teacher” model. The student learns from the teacher’s outputs rather than raw data, capturing much of the teacher’s capability in a fraction of the size.
```
Teacher (405B params) → generates labeled data
Student (7B params)   → trains on teacher's outputs
Result: 7B model that performs close to the 405B on targeted tasks
```
GGUF and ONNX
File formats for distributing models outside their original framework.
| Format | What It Is | Runs On |
|---|---|---|
| GGUF | Quantized model format for llama.cpp | CPU, Apple Silicon |
| ONNX | Open Neural Network Exchange (cross-framework) | Any ONNX runtime |
| SafeTensors | Hugging Face safe serialization format | PyTorch, JAX |
```shell
# Run a GGUF model locally with llama.cpp
./llama-server -m model.Q4_K_M.gguf --port 8080

# Or with Ollama (wraps llama.cpp)
ollama run llama3
```
Myths Engineers Believe
| Myth | Reality |
|---|---|
| “AI understands my code” | Models predict likely next tokens based on patterns. They have no runtime, no debugger, no understanding — just statistical correlation over training data. |
| “A bigger context window fixes everything” | Attention degrades over long contexts. Stuffing 200K tokens into a prompt does not mean the model uses all of it well. Retrieve the right 2K tokens instead. |
| “Temperature 0 = deterministic” | Implementation details (GPU batching, floating-point non-determinism) can produce different outputs even at temperature 0. Use seed parameters when available. |
| “Fine-tuning is the answer” | Fine-tuning requires curated data, costs money, and can degrade general capabilities. Prompt engineering and RAG solve most customization needs. |
| “RAG eliminates hallucination” | RAG reduces hallucination by grounding answers in documents. The model can still misinterpret, selectively ignore, or fabricate beyond its sources. |
| “More examples = better few-shot” | Two precise examples that cover edge cases outperform ten examples that all show the happy path. Quality over quantity. |
| “Embeddings capture meaning perfectly” | Embeddings capture distributional similarity, not meaning. “Not bad” and “good” embed similarly despite carrying different nuance. |
| “Agents can do anything” | Each agent loop iteration costs tokens, time, and money. Poorly scoped agents burn budget spinning in circles. Set exit conditions and iteration caps. |
| “AI-generated code doesn’t need review” | Studies of AI-generated code consistently find elevated rates of logic errors and security flaws. Review every line as if a confident but careless junior wrote it. |
| “Open-source models are free” | The weights are free. The GPUs to run them are not. Serving a 70B model requires hardware that costs $2–5/hour. |
| “Tokens are words” | Tokens are subword pieces. “unhappiness” is 3 tokens. Code tokenizes poorly. Non-English text costs more tokens per word. |
| “The model remembers past conversations” | Each API call is stateless. The model has no memory between requests. Your application must manage conversation history by resending prior messages. |
| “Prompt injection is solved” | No reliable defense exists against all prompt injection attacks. Defense in depth — input validation, output filtering, least-privilege tool access — reduces risk but does not eliminate it. |
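The statelessness point is worth seeing in code: the application resends the whole transcript on every call. A sketch assuming an Anthropic-style SDK client and the response shape used in the examples above; `chat` and the module-level `history` list are illustrative, not a library API.

```python
history: list[dict] = []   # the application owns the conversation, not the model

def chat(client, user_message: str) -> str:
    """Append the new turn, send the FULL history, record the reply."""
    history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=history,            # the whole transcript, not just the new turn
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```

Forgetting to resend prior messages is the classic bug behind "the model forgot what I said two turns ago."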
See Also
- AI CLI — Claude Code CLI usage and prompting patterns
- Claude Code — Agents, hooks, MCP, memory, configuration
- AI Adoption — When and how to adopt AI tools in a team