A Practical Guide to Tool Use

There is a big difference between a model that can call functions and a system that can be trusted to use tools well.
That difference is where a lot of promising agent systems either grow up or quietly fall apart.
On paper, function calling sounds almost deceptively simple. The model gets a set of tool definitions, decides when one is needed, produces structured arguments, and the application executes the request. It feels clean. It feels composable. It feels like the obvious bridge between language models and real software.
In a controlled demo, it often works exactly that way.
Then production happens.
The failure modes pile up:
- The model chooses the wrong tool.
- It calls a tool before it has enough information.
- It invents a parameter that looks reasonable but is invalid.
- It retries a failing action without understanding why it failed.
- It chooses a write action when it should have retrieved context first.
- It bundles steps that should be separated.
- It requests something the user is not permitted to do.
- It gets stuck in a loop between ambiguity and action.
That is why “the model supports function calling” is not really the interesting claim.
The interesting claim is much harder:
Can the system reliably choose, validate, sequence, permission, observe, and recover around tool use under messy real-world conditions?
That is the actual problem.
And once you see it clearly, you realize that tool use is not mainly a prompt trick. It is a systems design discipline.
The fantasy version of tool use
A lot of early writing about function calling made it sound like the model would naturally become a kind of competent operator once given enough tools. Hand it the schema, describe the affordances well enough, and let the reasoning take over.
That framing was always too optimistic.
Models are very good at producing plausible structured output. That is not the same as being a reliable process executor. Plausibility is exactly what makes poorly designed tool systems dangerous. A bad tool call can look impressively tidy right up to the moment it fails, mutates state incorrectly, leaks information, or sends a workflow drifting somewhere it should never have gone.
What teams often learn the hard way is that function calling does not remove the need for system design.
It increases it.
Tool use is really about operating boundaries
At a high level, reliable function calling depends on a small set of questions being answered well:
- When should the model call a tool at all?
- Which tool is the right one?
- What arguments are required?
- What can safely be inferred, and what must be confirmed?
- What permissions apply in this context?
- What preconditions must be checked before execution?
- What happens if the call fails?
- How do we know whether the model made a good choice?
- How do we improve the system over time?
If those questions are not answered at the application level, the model ends up carrying too much ambiguity. That is usually where reliability begins to collapse.
The good news is that strong patterns emerge quickly once you stop treating tool use like magic and start treating it like infrastructure.
Principle 1: tools should be easy to choose correctly
One of the biggest self-inflicted problems in tool-calling systems is ambiguity.
If several tools overlap loosely, the model will confuse them. If names are generic, the model will generalize badly. If descriptions are vague, the wrong tool can still look semantically plausible. If a single tool handles too many modes, behavior becomes inconsistent.
The goal is not simply to define tools. The goal is to define them in a way that makes correct selection easier than incorrect selection.
That usually means:
- using precise names
- describing when the tool should be used
- describing when it should not be used
- stating preconditions clearly
- constraining parameters tightly
- avoiding catch-all tool designs
A bad tool often looks “flexible.” A good tool often looks annoyingly specific.
That specificity is a feature. It narrows the action space, helps the model discriminate, and helps humans understand what the system is actually allowed to do.
Here is a simple contrast.
Bad:
{
  "name": "process_item",
  "description": "Handles different item operations",
  "parameters": {
    "type": "object",
    "properties": {
      "item_id": { "type": "string" },
      "mode": { "type": "string" }
    }
  }
}
Better:
{
  "name": "approve_invoice",
  "description": "Approve an invoice that has already passed validation checks and is eligible for approval by the current user.",
  "parameters": {
    "type": "object",
    "properties": {
      "invoice_id": {
        "type": "string",
        "description": "Internal invoice identifier. Required for a single invoice approval action."
      },
      "approval_note": {
        "type": "string",
        "description": "Short explanation for the approval decision that will be stored in the audit log."
      }
    },
    "required": ["invoice_id", "approval_note"],
    "additionalProperties": false
  }
}
The improved version does more than define a payload. It teaches the model what kind of act this is, what assumptions are already supposed to be true, and how constrained the action should be.
Principle 2: schema design is prompt design in disguise
People often think of schemas as plumbing.
They are not.
A schema is one of the clearest forms of instruction the model receives. It shapes not only the final JSON, but the model’s internal understanding of what the tool is for. Loose schemas invite guesswork. Strong schemas guide behavior.
That means:
- required fields should actually be required
- enum values should be used when the domain is constrained
- descriptions should clarify semantics, not just repeat field names
- unexpected keys should be rejected where possible
- types should be explicit and meaningful
If the model can wander, it will.
The schema is part of how you build the walls.
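To make those walls concrete, here is a minimal sketch of an application-side check that enforces required fields, primitive types, enum values, and rejection of unexpected keys. The `FieldSpec` and `ToolSchema` shapes and the `set_ticket_priority` schema are hypothetical, invented for this illustration, and cover only a small slice of what JSON Schema can express. A real system would more likely rely on an established schema validator.

```typescript
// Hypothetical, simplified schema shape for illustration (not full JSON Schema).
type FieldSpec = {
  type: "string" | "number" | "boolean";
  enum?: readonly string[];
};

type ToolSchema = {
  properties: Record<string, FieldSpec>;
  required: readonly string[];
};

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns a list of problems; an empty list means the payload passed.
function checkArgs(schema: ToolSchema, input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["payload must be an object"];
  }
  const args = input as Record<string, unknown>;
  const problems: string[] = [];

  // Required fields must actually be present.
  for (const key of schema.required) {
    if (!hasOwn(args, key)) problems.push(`missing required field: ${key}`);
  }

  for (const key of Object.keys(args)) {
    const spec = hasOwn(schema.properties, key) ? schema.properties[key] : undefined;
    // Unexpected keys are rejected rather than silently ignored.
    if (spec === undefined) {
      problems.push(`unexpected field: ${key}`);
      continue;
    }
    const value = args[key];
    if (typeof value !== spec.type) {
      problems.push(`${key} must be a ${spec.type}`);
      continue;
    }
    // Enum values constrain the domain to a known set.
    if (spec.enum && spec.enum.indexOf(value as string) < 0) {
      problems.push(`${key} must be one of: ${spec.enum.join(", ")}`);
    }
  }
  return problems;
}

// Hypothetical tool schema used only for this example.
const setTicketPrioritySchema: ToolSchema = {
  properties: {
    ticket_id: { type: "string" },
    priority: { type: "string", enum: ["low", "normal", "high"] },
  },
  required: ["ticket_id", "priority"],
};

const problemsFound = checkArgs(setTicketPrioritySchema, {
  ticket_id: "T-1",
  priority: "urgent",
  extra: 1,
});
// problemsFound: ["priority must be one of: low, normal, high", "unexpected field: extra"]
```

Returning a list of problems, rather than throwing on the first one, gives the assistant everything it needs to ask one good follow-up question instead of several bad ones.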
Principle 3: the model proposes, the application decides
This is one of the most useful mental models in the field.
The model is not the execution authority.
It is a proposal engine.
It can suggest a tool call. It can suggest arguments. It can surface intent. But the application must still decide whether execution should happen.
That means every tool call should pass through independent checks such as:
- payload validation
- permission verification
- resource existence checks
- state checks
- policy checks
- business-rule checks
- idempotency checks where relevant
If your system simply takes the model’s tool call and executes it because the JSON “looks correct,” you are not building a tool-using assistant. You are building a very articulate liability.
That matters even for seemingly low-risk systems, because the failure modes are not just security failures. They are workflow failures, customer trust failures, data integrity failures, and audit failures.
Here is a simple TypeScript-style validator:
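// Arguments accepted by the approve_invoice tool.
type ApproveInvoiceArgs = {
  invoice_id: string;
  approval_note: string;
};

// Validates an untrusted, model-produced payload before anything executes.
function validateApproveInvoiceArgs(input: unknown): ApproveInvoiceArgs {
  // Reject null, primitives, and arrays (typeof [] is "object").
  if (!input || typeof input !== "object" || Array.isArray(input)) {
    throw new Error("Invalid payload");
  }
  const args = input as Record<string, unknown>;
  // Both fields must be non-empty strings.
  if (typeof args.invoice_id !== "string" || !args.invoice_id.trim()) {
    throw new Error("invoice_id is required");
  }
  if (typeof args.approval_note !== "string" || !args.approval_note.trim()) {
    throw new Error("approval_note is required");
  }
  // Return only the known fields so unexpected keys never reach the executor.
  return {
    invoice_id: args.invoice_id,
    approval_note: args.approval_note,
  };
}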
That is not glamorous code.
It is also the kind of code that keeps tool use from turning into a liability.
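Payload validation is only the first check on the list. The broader "model proposes, application decides" split can be sketched as a gate that runs independent checks in order and refuses execution at the first failure. Everything named here (`ProposedCall`, `Check`, `gate`, the in-memory approver list and invoice table) is a hypothetical stand-in, and only three of the checks listed above are shown.

```typescript
// All names below are hypothetical; the point is the shape of the gate, not a real API.
type ProposedCall = {
  tool: string;
  args: Record<string, unknown>;
  userId: string;
};

type GateResult =
  | { allowed: true }
  | { allowed: false; failedCheck: string; reason: string };

// A check returns null when it passes, or a reason string when it fails.
type Check = {
  label: string;
  run: (call: ProposedCall) => string | null;
};

// Runs the checks in order and stops at the first failure, so nothing
// executes unless every check has passed.
function gate(call: ProposedCall, checks: Check[]): GateResult {
  for (const check of checks) {
    const reason = check.run(call);
    if (reason !== null) {
      return { allowed: false, failedCheck: check.label, reason: reason };
    }
  }
  return { allowed: true };
}

// In-memory stand-ins for a permission store and an invoice table.
const invoiceApprovers: string[] = ["user-1"];
const invoiceStates: Record<string, string> = {
  "INV-1": "validated",
  "INV-2": "approved",
};

function invoiceState(call: ProposedCall): string | undefined {
  const id = String(call.args.invoice_id);
  return Object.prototype.hasOwnProperty.call(invoiceStates, id)
    ? invoiceStates[id]
    : undefined;
}

const approveInvoiceChecks: Check[] = [
  {
    label: "permission",
    run: (call) =>
      invoiceApprovers.indexOf(call.userId) >= 0 ? null : "user may not approve invoices",
  },
  {
    label: "resource_exists",
    run: (call) => (invoiceState(call) !== undefined ? null : "invoice not found"),
  },
  {
    label: "state",
    run: (call) =>
      invoiceState(call) === "validated" ? null : "invoice is not awaiting approval",
  },
];

const gateDecision = gate(
  { tool: "approve_invoice", args: { invoice_id: "INV-2" }, userId: "user-1" },
  approveInvoiceChecks
);
// gateDecision: { allowed: false, failedCheck: "state", reason: "invoice is not awaiting approval" }
```

Keeping each check small and labeled means a refusal always carries the name of the rule that caused it, which is useful both for the assistant's reply and for the audit trail.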
Principle 4: permission models belong outside the prompt
There is a tempting but dangerous shortcut in agent systems: trying to teach the model what the user is allowed to do and hoping it will behave.
That is not enough.
The prompt can inform. It cannot enforce.
If a user should not be able to refund an order, approve an invoice, access a protected record, modify an account, or trigger an external action, the system needs to check that explicitly before execution.
A useful way to think about permissions is at three layers:
Visibility
Can the model even see that this tool exists?
Proposal
Can the model suggest this tool in the current context?
Execution
Can the application actually perform this action for this user, on this resource, in this state?
Those are different questions. Treating them as one leads to brittle systems.
For higher-risk actions, it is often wise to add explicit confirmation steps or human approval gates. Read tools and write tools should not be treated as morally equivalent just because they share the same interface shape.
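The three layers can be kept apart in code as three separate functions. The sketch below assumes a hypothetical role model (`viewer`, `agent`, `manager`), a hypothetical policy table, and an invented rule that only managers may execute refunds on paid orders. None of this comes from a real framework; it only shows why the three questions get different answers.

```typescript
// Hypothetical roles, tools, and rules; only the three-layer structure matters.
type Role = "viewer" | "agent" | "manager";
type SessionInfo = { role: Role; orderLoaded: boolean };
type OrderRecord = { id: string; status: "paid" | "refunded" };

type ToolPolicy = {
  visibleTo: Role[];
  proposable: (session: SessionInfo) => boolean;
  executable: (session: SessionInfo, order: OrderRecord) => boolean;
};

const toolPolicies: Record<string, ToolPolicy> = {
  get_order: {
    visibleTo: ["viewer", "agent", "manager"],
    proposable: () => true,
    executable: () => true,
  },
  create_refund: {
    visibleTo: ["agent", "manager"],
    // Only proposable once an order has been loaded into the conversation.
    proposable: (session) => session.orderLoaded,
    // Only executable by a manager, and only while the order is still paid.
    executable: (session, order) => session.role === "manager" && order.status === "paid",
  },
};

function policyFor(tool: string): ToolPolicy | undefined {
  return Object.prototype.hasOwnProperty.call(toolPolicies, tool)
    ? toolPolicies[tool]
    : undefined;
}

// Layer 1, visibility: which tool definitions are sent to the model at all.
function visibleTools(session: SessionInfo): string[] {
  return Object.keys(toolPolicies).filter((tool) => {
    const policy = policyFor(tool);
    return policy !== undefined && policy.visibleTo.indexOf(session.role) >= 0;
  });
}

// Layer 2, proposal: whether a suggested call is acceptable in this context.
function canPropose(tool: string, session: SessionInfo): boolean {
  const policy = policyFor(tool);
  return (
    policy !== undefined &&
    policy.visibleTo.indexOf(session.role) >= 0 &&
    policy.proposable(session)
  );
}

// Layer 3, execution: checked by the application right before the side effect.
function canExecute(tool: string, session: SessionInfo, order: OrderRecord): boolean {
  const policy = policyFor(tool);
  return policy !== undefined && canPropose(tool, session) && policy.executable(session, order);
}
```

With this table, an agent who has loaded an order can see `create_refund` and the model may propose it, yet `canExecute` still returns false. That gap is where an escalation or approval step would sit.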
Principle 5: separate read tools from action tools
One of the cleanest design moves in real systems is separating tools by risk and function.
A tool that retrieves information is not the same kind of thing as a tool that changes the world. If the system treats them all as flat, interchangeable affordances, you lose an important layer of control.
A useful categorization is:
Read tools
Used to retrieve or inspect information.
Examples:
- get_customer_profile
- search_orders
- retrieve_contract
- list_open_tickets
Decision-support tools
Used to summarize, compare, classify, or analyze.
Examples:
- summarize_ticket_history
- compare_contract_versions
- categorize_exception_case
Action tools
Used to mutate state or trigger external effects.
Examples:
- approve_invoice
- create_refund
- send_customer_email
- update_subscription_status
This separation helps with:
- permissioning
- execution policies
- UI/UX expectations
- testing strategy
- audit logging
- blast-radius control
A healthy default in many systems is read before write. Let the assistant gather context first, then ask for confirmation or proceed through a more constrained action path.
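One way to encode that default is to tag every tool with its category and let a small policy function decide what happens to a proposed call. The sketch below reuses tool names from the examples above. The `TurnState` shape and the specific policy (an action needs at least one prior read plus explicit confirmation) are assumptions made for illustration, not a universal rule.

```typescript
type ToolKind = "read" | "decision_support" | "action";

// Tool names come from the examples above; this classification table is an assumption.
const toolKinds: Record<string, ToolKind> = {
  get_customer_profile: "read",
  search_orders: "read",
  summarize_ticket_history: "decision_support",
  create_refund: "action",
  send_customer_email: "action",
};

// Hypothetical per-conversation state tracked by the application.
type TurnState = { toolsAlreadyRun: string[]; userConfirmed: boolean };

type Decision = "execute" | "ask_confirmation" | "gather_context_first" | "reject";

function kindOf(tool: string): ToolKind | undefined {
  return Object.prototype.hasOwnProperty.call(toolKinds, tool) ? toolKinds[tool] : undefined;
}

function decide(tool: string, state: TurnState): Decision {
  const kind = kindOf(tool);
  // Unknown tools never run.
  if (kind === undefined) return "reject";
  // Reads and analysis have no side effects, so they run directly.
  if (kind !== "action") return "execute";
  // Read before write: an action needs some retrieved context first.
  const hasContext = state.toolsAlreadyRun.some((t) => kindOf(t) === "read");
  if (!hasContext) return "gather_context_first";
  // Actions additionally wait for explicit confirmation.
  return state.userConfirmed ? "execute" : "ask_confirmation";
}

const refundDecision = decide("create_refund", {
  toolsAlreadyRun: ["search_orders"],
  userConfirmed: false,
});
// refundDecision: "ask_confirmation"
```

Because the category lives in one table, the same tag can drive permissioning, audit logging, and test selection without each of those re-deriving which tools are risky.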
Principle 6: error handling is not an edge case — it is part of the product
Weak systems assume the tool call will work. Strong systems assume failure needs to be understandable.
There are several distinct failure types, and they should not all collapse into “something went wrong.”
Validation failure
The arguments are malformed or incomplete.
The assistant should not blindly retry. It should identify what is missing and, when appropriate, ask for it.
Permission failure
The user is not allowed to perform the action.
The assistant should say so clearly and not imply the task was completed.
Business-rule failure
The action is syntactically valid but not allowed in this state.
Examples:
- invoice already approved
- refund window expired
- contract not in editable state
- order already cancelled
The system should explain the condition and suggest the next valid path where possible.
External system failure
The downstream service timed out, returned an error, or became temporarily unavailable.
The assistant should acknowledge that the action did not complete. It should not hallucinate success, and it should retry only when the retry policy actually makes sense.
This is where many agent experiences feel brittle. The cause is not that tools fail (every tool fails occasionally) but that the system has no graceful semantics for failure.
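One way to give failure those semantics is to return a typed failure to the assistant layer instead of a raw exception, and to map each failure type to a next move. The following is a minimal sketch: the `ToolFailure` and `NextMove` shapes, the message wording, and the default of three attempts are all assumptions chosen for illustration.

```typescript
// Hypothetical structured failure, one variant per failure type described above.
type ToolFailure =
  | { kind: "validation"; missing: string[] }
  | { kind: "permission"; action: string }
  | { kind: "business_rule"; condition: string; nextStep?: string }
  | { kind: "external"; service: string; retryable: boolean };

type NextMove = { move: "ask_user" | "explain" | "retry"; message: string };

function planRecovery(
  failure: ToolFailure,
  attempt: number,
  maxAttempts: number = 3
): NextMove {
  if (failure.kind === "validation") {
    // Do not retry blindly: ask for what is missing.
    return { move: "ask_user", message: "I still need: " + failure.missing.join(", ") };
  }
  if (failure.kind === "permission") {
    // Say so clearly and never imply the task was completed.
    return {
      move: "explain",
      message: "You are not permitted to " + failure.action + ", so nothing was changed.",
    };
  }
  if (failure.kind === "business_rule") {
    // Explain the condition and suggest the next valid path when one is known.
    const suffix = failure.nextStep ? " " + failure.nextStep : "";
    return { move: "explain", message: failure.condition + "." + suffix };
  }
  // External failure: retry only while the retry policy allows it.
  if (failure.retryable && attempt < maxAttempts) {
    return { move: "retry", message: failure.service + " is unavailable; retrying." };
  }
  return {
    move: "explain",
    message: failure.service + " is unavailable, so the action did not complete.",
  };
}

const recoveryPlan = planRecovery(
  { kind: "business_rule", condition: "Invoice already approved" },
  1
);
// recoveryPlan: { move: "explain", message: "Invoice already approved." }
```

The important property is that every branch produces a message that is true about what happened. No branch can report success, because success is simply not a value this function can return.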
Principle 7: test tool choice, not just tool execution
A lot of teams unit-test their tools and think they are done.
That is only half the problem.
The deeper problem is whether the model chooses well in context.
This means you need to evaluate things like:
- Did the model pick the right tool?
- Did it ask for missing information instead of guessing?
- Did it refrain from acting when the request was ambiguous?
- Did it choose a read tool before a write tool?
- Did it separate actions that should not be bundled?
- Did it respect confirmation requirements?
- Did it stay within the permission model?
These are scenario-level behaviors, not just code-level properties.
For example:
- “What is the status of invoice INV-1427?” should trigger a read path, not an approval path.
- “Approve this invoice” should not execute if the invoice is unclear or the user lacks permission.
- “Please take care of it” should not cause the system to guess which risky action the user means.
- “Refund and email the customer” may require staged execution with separate confirmations or checks.
In other words, you need to test the judgment around tool use, not merely the software behind it.
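A scenario suite for this kind of judgment can be quite small. The sketch below pairs a user prompt with the expected choice (a specific tool, or a clarifying question) and runs a chooser over the whole list. In a real evaluation the chooser would wrap a model call and would likely be asynchronous. Here `stubChooser` is a deliberately naive keyword stand-in so the example runs on its own, and `get_invoice_status` is a hypothetical tool name.

```typescript
// What the assistant decided to do with a prompt.
type Choice = { type: "tool"; tool: string } | { type: "clarify" };

type Scenario = { prompt: string; expected: Choice };
type Chooser = (prompt: string) => Choice;
type ScenarioResult = { prompt: string; passed: boolean; actual: Choice };

function sameChoice(a: Choice, b: Choice): boolean {
  if (a.type === "tool" && b.type === "tool") return a.tool === b.tool;
  return a.type === b.type;
}

// Runs every scenario through the chooser and records pass or fail.
function runScenarios(scenarios: Scenario[], choose: Chooser): ScenarioResult[] {
  return scenarios.map((scenario) => {
    const actual = choose(scenario.prompt);
    return {
      prompt: scenario.prompt,
      passed: sameChoice(actual, scenario.expected),
      actual: actual,
    };
  });
}

// Scenarios adapted from the examples above; get_invoice_status is a hypothetical name.
const invoiceScenarios: Scenario[] = [
  {
    prompt: "What is the status of invoice INV-1427?",
    expected: { type: "tool", tool: "get_invoice_status" },
  },
  { prompt: "Approve this invoice", expected: { type: "clarify" } },
  { prompt: "Please take care of it", expected: { type: "clarify" } },
];

// Stand-in for the model: acts only when the prompt names a specific invoice.
function stubChooser(prompt: string): Choice {
  const text = prompt.toLowerCase();
  const hasInvoiceId = /inv-\d+/.test(text);
  if (text.indexOf("status") >= 0 && hasInvoiceId) {
    return { type: "tool", tool: "get_invoice_status" };
  }
  if (text.indexOf("approve") >= 0 && hasInvoiceId) {
    return { type: "tool", tool: "approve_invoice" };
  }
  return { type: "clarify" };
}

const failedScenarios = runScenarios(invoiceScenarios, stubChooser).filter(
  (result) => !result.passed
);
// failedScenarios is empty for this stub; a chooser that always approves would fail all three.
```

Treating "ask a clarifying question" as a first-class expected outcome matters here. It lets the suite reward restraint, which payload-level unit tests cannot see.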
Principle 8: observability is how tool systems become mature
If you cannot see what the assistant is doing with tools, you cannot improve it systematically.
Good observability in tool-calling systems usually includes:
- tool selection frequency
- argument validation failures
- permission denials
- business-rule denials
- downstream failures
- retries
- multi-step sequences
- abandonment rates
- repeated ambiguity patterns
- suspicious near-misses
- cases where the assistant should have used a tool but did not
This is not just for debugging. It is how you discover structural issues in the design.
Maybe the tool names overlap too much. Maybe the descriptions are unclear. Maybe the model lacks context when choosing. Maybe a tool should be split into two. Maybe risky actions need stronger confirmation policies. Maybe the workflow itself is a bad candidate for free-form interaction.
Observability turns vague frustration into tractable design decisions.
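Several of the signals listed above fall out of one simple record per attempted tool call. The sketch below aggregates such records into per-tool counts and flags tools with a high failure share. The `ToolCallRecord` fields, the outcome names, and the 0.5 threshold are assumptions for illustration. A production system would normally emit these as metrics or structured logs rather than aggregate them in memory.

```typescript
type Outcome =
  | "success"
  | "validation_failure"
  | "permission_denied"
  | "business_rule_denied"
  | "downstream_failure";

// One record per attempted tool call. The field names are assumptions.
type ToolCallRecord = { tool: string; outcome: Outcome; isRetry: boolean };

type ToolStats = {
  calls: number;
  retries: number;
  failures: number;
  byOutcome: Record<Outcome, number>;
};

function emptyStats(): ToolStats {
  return {
    calls: 0,
    retries: 0,
    failures: 0,
    byOutcome: {
      success: 0,
      validation_failure: 0,
      permission_denied: 0,
      business_rule_denied: 0,
      downstream_failure: 0,
    },
  };
}

// Aggregates raw call records into per-tool counters.
function summarize(records: ToolCallRecord[]): Record<string, ToolStats> {
  const stats: Record<string, ToolStats> = {};
  for (const record of records) {
    let entry: ToolStats | undefined = Object.prototype.hasOwnProperty.call(stats, record.tool)
      ? stats[record.tool]
      : undefined;
    if (entry === undefined) {
      entry = emptyStats();
      stats[record.tool] = entry;
    }
    entry.calls += 1;
    entry.byOutcome[record.outcome] += 1;
    if (record.isRetry) entry.retries += 1;
    if (record.outcome !== "success") entry.failures += 1;
  }
  return stats;
}

// Flags tools whose failure share exceeds the threshold: candidates for redesign.
function noisyTools(stats: Record<string, ToolStats>, threshold: number): string[] {
  return Object.keys(stats).filter((tool) => {
    const entry = stats[tool];
    return entry !== undefined && entry.calls > 0 && entry.failures / entry.calls > threshold;
  });
}

const callStats = summarize([
  { tool: "create_refund", outcome: "validation_failure", isRetry: false },
  { tool: "create_refund", outcome: "validation_failure", isRetry: true },
  { tool: "create_refund", outcome: "success", isRetry: true },
  { tool: "search_orders", outcome: "success", isRetry: false },
]);
const flaggedTools = noisyTools(callStats, 0.5);
// flaggedTools: ["create_refund"] (2 of its 3 calls failed)
```

A tool that shows up here mostly with validation failures points at the schema or description. One that shows up with business-rule denials points at missing preconditions in the tool's framing.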
A realistic tool definition example
Here is the kind of tool definition that is much easier to operationalize safely:
{
  "name": "create_refund",
  "description": "Create a refund for an eligible paid order. Use only after verifying refund eligibility and collecting a clear user-visible reason.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {
        "type": "string",
        "description": "Internal order identifier."
      },
      "amount": {
        "type": "number",
        "description": "Refund amount in the order currency."
      },
      "reason_code": {
        "type": "string",
        "enum": [
          "duplicate",
          "customer_request",
          "service_issue",
          "billing_error"
        ],
        "description": "Standard refund reason used for reporting and audit purposes."
      },
      "customer_message": {
        "type": "string",
        "description": "Short message stored in the refund log and visible to support staff."
      }
    },
    "required": [
      "order_id",
      "amount",
      "reason_code",
      "customer_message"
    ],
    "additionalProperties": false
  }
}
Notice what this definition does well:
- it narrows the action
- it frames the precondition
- it constrains the reason domain
- it discourages invented fields
- it makes validation straightforward
- it supports better logging and reporting
This is not just a machine-readable contract. It is part of the behavioral control surface.
A practical pattern library for reliable tool use
If you build enough of these systems, certain patterns keep showing up:
Narrow tools over multipurpose tools
One clear action beats one vague Swiss Army knife almost every time.
Strong schemas over permissive schemas
Constrain what you can. Every degree of freedom becomes a chance for drift.
Application-side validation over trust in model output
The model can suggest. The system must verify.
Read-before-write over eager action
Gather context first. Change state later.
Explicit permissions over prompt-only guidance
Security and policy belong in execution logic.
Structured failures over raw exceptions
Return failures the assistant can actually handle well.
Scenario evaluations over isolated unit thinking
Test real conversational situations, not just payload shape.
Observability over intuition
If you cannot inspect tool behavior, you are improving by folklore.
A shift in mindset is part of the transition
Function calling matters because it turns language models from passive generators into participants in real workflows. That is a meaningful shift. It is also why the field is going through a predictable phase of overconfidence, simplification, and gradual maturation.
This is normal.
Whenever a new capability becomes operationally relevant, the first wave focuses on possibility. The next wave discovers the edge cases. The mature wave learns where control, design discipline, and trust actually matter. Tool use in AI systems is moving through that same transition now.
So the goal is not to become cynical about function calling, nor to pretend that reliable tool use should already be effortless. The goal is to understand that the real value does not come from the existence of the feature. It comes from the quality of the operating environment built around it.
The systems that benefit most from function calling will not be the ones with the largest pile of tools. They will be the ones designed with the clearest boundaries, strongest validation, best permission model, and best understanding of where humans should still remain in control.
Function calling is here to stay.
The question is not whether models can use tools. They can.
The question is whether we are prepared to design tool-using systems with enough clarity, restraint, and rigor for them to be worth trusting.
The bottom line
Function calling is not interesting because the model can emit JSON.
It becomes interesting when tool use is:
- correct enough to trust
- constrained enough to control
- observable enough to improve
- permission-aware enough to deploy responsibly
- recoverable enough to survive production failure
- understandable enough for humans to live with
That is the difference between a clever capability demo and a system people can actually rely on.
And that gap — between possibility and reliability — is exactly where the best engineering work still needs to happen.