OpenAI Chat Completions API Deep Dive: Messages, Roles, and Parameters
Understand the message format, system/user/assistant roles, temperature, max_tokens, top_p, and other parameters that control OpenAI chat completion behavior.
The Anatomy of a Chat Completion Request
Every interaction with OpenAI's chat models goes through the Chat Completions API. Understanding how messages, roles, and parameters work together is essential for getting consistent, high-quality outputs from your applications. This post breaks down every component you need to master.
Message Roles Explained
The messages array is the core of every request. Each message has a role and content:
```python
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a senior Python developer who writes concise, production-ready code."},
    {"role": "user", "content": "Write a function to validate email addresses."},
    {"role": "assistant", "content": "Here is a robust email validator using regex..."},
    {"role": "user", "content": "Now add support for checking MX records."},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
```
Here is what each role does:
- system — Sets the assistant's personality, behavior, and constraints. Processed first and given special weight. Use it for instructions that should persist across the entire conversation.
- user — Messages from the human. These are the questions, prompts, and inputs.
- assistant — Previous responses from the model. Including these creates multi-turn conversations.
Building Multi-Turn Conversations
The API is stateless. You must send the full conversation history with each request:
```python
conversation = [
    {"role": "system", "content": "You are a helpful math tutor. Show your work step by step."},
]

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation,
    )
    assistant_message = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message

print(chat("What is the derivative of x^3 + 2x?"))
print(chat("Now integrate the result."))
```
Each call sends the growing conversation list, so the model sees the full context.
Key Parameters
temperature and top_p
Both control randomness; OpenAI's guidance is to adjust one or the other, not both at once:
```python
# Deterministic output — great for code generation, data extraction
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0.0,
)

# Creative output — good for brainstorming, creative writing
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=1.2,
)
```
temperature ranges from 0 to 2. At 0, the model is nearly deterministic. At higher values, outputs become more varied and creative.
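top_p (nucleus sampling) works differently: it restricts sampling to the smallest set of tokens whose cumulative probability reaches top_p. To build intuition, here is a toy sketch of the sampling rule over a made-up next-token distribution; the token names and probabilities are invented for illustration, and this is not the API's actual implementation:

```python
def nucleus_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize so the kept probabilities sum to 1."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

# Hypothetical next-token distribution
probs = {"Flask": 0.5, "Django": 0.3, "FastAPI": 0.15, "Bottle": 0.05}
print(nucleus_filter(probs, top_p=0.9))  # "Bottle" falls outside the nucleus
```

With top_p=0.1 the model would effectively sample only the most likely token, which is why a low top_p behaves much like a low temperature.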
max_tokens
Limits the length of the generated response:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=500,  # cap response at 500 tokens
)

# Check if the response was cut off
if response.choices[0].finish_reason == "length":
    print("Warning: response was truncated")
```
stop sequences
Tell the model to stop generating when it encounters specific strings:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 5 Python web frameworks, one per line."}],
    stop=["6."],  # stop before a 6th item
)
```
n — Multiple Completions
Generate multiple responses in a single request:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    n=3,
    temperature=0.8,
)

for i, choice in enumerate(response.choices):
    print(f"Option {i + 1}: {choice.message.content}")
```
Practical Parameter Combinations
| Use Case | temperature | max_tokens | Notes |
|---|---|---|---|
| Code generation | 0.0 | 2000 | Deterministic, longer output |
| Classification | 0.0 | 10 | Short, consistent labels |
| Creative writing | 1.0 | 1000 | Varied, expressive |
| Summarization | 0.3 | 300 | Slightly varied but focused |
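In application code, the table above can be captured as a small presets dictionary so call sites stay consistent. The preset names and values here are illustrative starting points, not official recommendations; tune them for your workload:

```python
# Illustrative presets mirroring the table above.
PRESETS = {
    "code_generation":  {"temperature": 0.0, "max_tokens": 2000},
    "classification":   {"temperature": 0.0, "max_tokens": 10},
    "creative_writing": {"temperature": 1.0, "max_tokens": 1000},
    "summarization":    {"temperature": 0.3, "max_tokens": 300},
}

def build_request(use_case: str, messages: list[dict]) -> dict:
    """Merge a preset into keyword arguments for chat.completions.create."""
    return {"model": "gpt-4o", "messages": messages, **PRESETS[use_case]}

kwargs = build_request("classification", [{"role": "user", "content": "Label this email: spam or ham?"}])
# Then call: client.chat.completions.create(**kwargs)
```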
FAQ
Should I always include a system message?
It is not required, but strongly recommended. Without a system message, the model uses a generic helpful assistant persona. A well-crafted system message dramatically improves consistency and output quality.
What happens when the conversation exceeds the model's context window?
The API returns an error if total tokens (messages + response) exceed the model's limit. You need to implement conversation trimming — removing older messages or summarizing them to stay within the token budget.
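One simple trimming strategy is to drop the oldest non-system turns until the history fits a budget. The sketch below uses a rough character budget as a stand-in for exact token counting (in production you would count real tokens, for example with a tokenizer library such as tiktoken), and the 12,000-character default is an arbitrary illustrative value:

```python
def trim_history(messages: list[dict], max_chars: int = 12_000) -> list[dict]:
    """Drop the oldest non-system messages until the history fits a rough
    character budget. Always preserves the system message(s)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(len(m["content"]) for m in system + rest) > max_chars:
        rest.pop(0)  # remove the oldest user/assistant turn first
    return system + rest
```

Summarizing the dropped turns into a single synthetic message is a common refinement when older context still matters.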
Is temperature=0 truly deterministic?
Nearly, but not perfectly. OpenAI has noted that identical requests may occasionally produce slightly different outputs due to floating-point computation differences across their infrastructure. For most practical purposes, temperature=0 is effectively deterministic.
#OpenAI #ChatCompletions #APIParameters #Python #LLM #AgenticAI #LearnAI #AIEngineering