Function Calling in the Wild

A Practical Guide to Tool Use

There is a big difference between a model that can call functions and a system that can be trusted to use tools well.

That difference is where a lot of promising agent systems either grow up or quietly fall apart.

On paper, function calling sounds almost deceptively simple. The model gets a set of tool definitions, decides when one is needed, produces structured arguments, and the application executes the request. It feels clean. It feels composable. It feels like the obvious bridge between language models and real software.

In a controlled demo, it often works exactly that way.

Then production happens.

The model chooses the wrong tool.
It calls a tool before it has enough information.
It invents a parameter that looks reasonable but is invalid.
It retries a failing action without understanding why it failed.
It chooses a write action when it should have retrieved context first.
It bundles steps that should be separated.
It requests something the user is not permitted to do.
It gets stuck in a loop between ambiguity and action.

That is why “the model supports function calling” is not really the interesting claim.

The interesting claim is much harder:

Can the system reliably choose, validate, sequence, permission, observe, and recover around tool use under messy real-world conditions?

That is the actual problem.

And once you see it clearly, you realize that tool use is not mainly a prompt trick. It is a systems design discipline.

The fantasy version of tool use

A lot of early writing about function calling made it sound like the model would naturally become a kind of competent operator once given enough tools. Hand it the schema, describe the affordances well enough, and let the reasoning take over.

That framing was always too optimistic.

Models are very good at producing plausible structured output. That is not the same as being a reliable process executor. Plausibility is exactly what makes poorly designed tool systems dangerous. A bad tool call can look impressively tidy right up to the moment it fails, mutates state incorrectly, leaks information, or sends a workflow drifting somewhere it should never have gone.

What teams often learn the hard way is that function calling does not remove the need for system design.

It increases it.

Tool use is really about operating boundaries

At a high level, reliable function calling depends on a small set of questions being answered well:

When should the model call a tool at all?
Which tool is the right one?
What arguments are required?
What can safely be inferred, and what must be confirmed?
What permissions apply in this context?
What preconditions must be checked before execution?
What happens if the call fails?
How do we know whether the model made a good choice?
How do we improve the system over time?

If those questions are not answered at the application level, the model ends up carrying too much ambiguity. That is usually where reliability begins to collapse.

The good news is that strong patterns emerge quickly once you stop treating tool use like magic and start treating it like infrastructure.

Principle 1: tools should be easy to choose correctly

One of the biggest self-inflicted problems in tool-calling systems is ambiguity.

If several tools overlap loosely, the model will confuse them. If names are generic, the model will generalize badly. If descriptions are vague, the wrong tool can still look semantically plausible. If a single tool handles too many modes, behavior becomes inconsistent.

The goal is not simply to define tools. The goal is to define them in a way that makes correct selection easier than incorrect selection.

That usually means:

using precise names
describing when the tool should be used
describing when it should not be used
stating preconditions clearly
constraining parameters tightly
avoiding catch-all tool designs

A bad tool often looks “flexible.” A good tool often looks annoyingly specific.

That specificity is a feature. It narrows the action space, helps the model discriminate, and helps humans understand what the system is actually allowed to do.

Here is a simple contrast.

Bad:

{
  "name": "process_item",
  "description": "Handles different item operations",
  "parameters": {
    "type": "object",
    "properties": {
      "item_id": { "type": "string" },
      "mode": { "type": "string" }
    }
  }
}

Better:

{
  "name": "approve_invoice",
  "description": "Approve an invoice that has already passed validation checks and is eligible for approval by the current user.",
  "parameters": {
    "type": "object",
    "properties": {
      "invoice_id": {
        "type": "string",
        "description": "Internal invoice identifier. Required for a single invoice approval action."
      },
      "approval_note": {
        "type": "string",
        "description": "Short explanation for the approval decision that will be stored in the audit log."
      }
    },
    "required": ["invoice_id", "approval_note"],
    "additionalProperties": false
  }
}

The improved version does more than define a payload. It teaches the model what kind of act this is, what assumptions are already supposed to be true, and how constrained the action should be.

Principle 2: schema design is prompt design in disguise

People often think of schemas as plumbing.

They are not.

A schema is one of the clearest forms of instruction the model receives. It shapes not only the final JSON, but the model’s internal understanding of what the tool is for. Loose schemas invite guesswork. Strong schemas guide behavior.

That means:

required fields should actually be required
enum values should be used when the domain is constrained
descriptions should clarify semantics, not just repeat field names
unexpected keys should be rejected where possible
types should be explicit and meaningful

If the model can wander, it will.

The schema is part of how you build the walls.
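
As a rough illustration of those points, here is a minimal sketch of a checker that enforces required fields, enum values, and the rejection of unexpected keys. The `FieldSpec` and `ToolSchema` shapes, the `checkAgainstSchema` function, and the ticket example are all hypothetical and exist only for this sketch; a real system would more likely rely on an established JSON Schema validator.

```typescript
// Hypothetical field and schema shapes, for illustration only.
type FieldSpec = {
  type: "string" | "number";
  required: boolean;
  allowed?: readonly string[]; // enum values, when the domain is constrained
};

type ToolSchema = Record<string, FieldSpec>;

// Returns a list of problems; an empty list means the payload conforms.
function checkAgainstSchema(
  schema: ToolSchema,
  payload: Record<string, unknown>
): string[] {
  const found: string[] = [];

  // Unexpected keys are reported rather than silently ignored.
  for (const key of Object.keys(payload)) {
    if (!Object.prototype.hasOwnProperty.call(schema, key)) {
      found.push("unexpected field: " + key);
    }
  }

  for (const field of Object.keys(schema)) {
    const spec = schema[field];
    const value = payload[field];
    if (value === undefined) {
      if (spec.required) found.push("missing required field: " + field);
      continue;
    }
    if (typeof value !== spec.type) {
      found.push("wrong type for " + field + ": expected " + spec.type);
      continue;
    }
    if (spec.allowed && spec.allowed.indexOf(String(value)) === -1) {
      found.push("value not allowed for " + field);
    }
  }
  return found;
}

// Hypothetical schema: priority is constrained to three values.
const ticketSchema: ToolSchema = {
  ticket_id: { type: "string", required: true },
  priority: { type: "string", required: true, allowed: ["low", "normal", "urgent"] },
};

// The model invented a priority value and an extra field; both are reported.
const problems = checkAgainstSchema(ticketSchema, {
  ticket_id: "T-1",
  priority: "asap",
  notes: "please hurry",
});
```

Returning a list of problems instead of throwing on the first one is a design choice for this sketch: it gives the assistant everything it needs to correct the call in a single round.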

Principle 3: the model proposes, the application decides

This is one of the most useful mental models in the field.

The model is not the execution authority.

It is a proposal engine.

It can suggest a tool call. It can suggest arguments. It can surface intent. But the application must still decide whether execution should happen.

That means every tool call should pass through independent checks such as:

payload validation
permission verification
resource existence checks
state checks
policy checks
business-rule checks
idempotency checks where relevant

If your system simply takes the model’s tool call and executes it because the JSON “looks correct,” you are not building a tool-using assistant. You are building a very articulate liability.

That matters even for seemingly low-risk systems, because the failure modes are not just security failures. They are workflow failures, customer trust failures, data integrity failures, and audit failures.

Here is a simple TypeScript-style validator:

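// Arguments accepted by approve_invoice; mirrors the tool schema shown earlier.
type ApproveInvoiceArgs = {
  invoice_id: string;
  approval_note: string;
};

// Validates an untrusted payload from the model before anything is executed.
function validateApproveInvoiceArgs(input: unknown): ApproveInvoiceArgs {
  // Reject null, primitives, and arrays: only a plain object is acceptable.
  if (!input || typeof input !== "object" || Array.isArray(input)) {
    throw new Error("Invalid payload");
  }

  const args = input as Record<string, unknown>;

  // Both fields must be strings that are non-empty after trimming whitespace.
  if (typeof args.invoice_id !== "string" || !args.invoice_id.trim()) {
    throw new Error("invoice_id is required");
  }

  if (typeof args.approval_note !== "string" || !args.approval_note.trim()) {
    throw new Error("approval_note is required");
  }

  // Return only the known fields so unexpected keys never reach the executor.
  return {
    invoice_id: args.invoice_id,
    approval_note: args.approval_note,
  };
}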

That is not glamorous code.

It is also the kind of code that keeps tool use from turning into a liability.
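
Payload validation is only the first of the checks listed above. One way the remaining ones might be chained is sketched below, under stated assumptions: the `ProposedCall`, `Check`, and `GateDecision` types, the in-memory `invoices` store, and the `decide` function are hypothetical stand-ins for whatever services a real application would consult.

```typescript
// All names below are hypothetical; the in-memory store stands in for real services.
type ProposedCall = { tool: string; args: { invoice_id: string }; userId: string };

type GateDecision =
  | { execute: true }
  | { execute: false; failedCheck: string; reason: string };

type Invoice = { state: "pending" | "approved"; approvers: string[] };

const invoices: Record<string, Invoice> = {
  "INV-1": { state: "pending", approvers: ["u-1"] },
  "INV-2": { state: "approved", approvers: ["u-1"] },
};

function findInvoice(id: string): Invoice | undefined {
  return Object.prototype.hasOwnProperty.call(invoices, id) ? invoices[id] : undefined;
}

// Each check returns null when it passes, or a reason string when it fails.
type Check = { label: string; run: (call: ProposedCall) => string | null };

const approveInvoiceChecks: Check[] = [
  {
    label: "resource_exists",
    run: (call) => (findInvoice(call.args.invoice_id) ? null : "invoice not found"),
  },
  {
    label: "permission",
    run: (call) => {
      const invoice = findInvoice(call.args.invoice_id);
      return invoice && invoice.approvers.indexOf(call.userId) !== -1
        ? null
        : "user may not approve this invoice";
    },
  },
  {
    label: "state",
    run: (call) => {
      const invoice = findInvoice(call.args.invoice_id);
      return invoice && invoice.state === "pending" ? null : "invoice is not pending";
    },
  },
];

// The model's proposal is executed only if every check passes, in order.
function decide(call: ProposedCall, checks: Check[]): GateDecision {
  for (const check of checks) {
    const reason = check.run(call);
    if (reason !== null) {
      return { execute: false, failedCheck: check.label, reason: reason };
    }
  }
  return { execute: true };
}

const allowed = decide(
  { tool: "approve_invoice", args: { invoice_id: "INV-1" }, userId: "u-1" },
  approveInvoiceChecks
);
const blocked = decide(
  { tool: "approve_invoice", args: { invoice_id: "INV-2" }, userId: "u-1" },
  approveInvoiceChecks
);
```

In this sketch `allowed` permits execution, while `blocked` stops at the state check because the invoice is already approved. Recording which check failed is what later lets the assistant explain the outcome instead of guessing.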

Principle 4: permission models belong outside the prompt

There is a tempting but dangerous shortcut in agent systems: trying to teach the model what the user is allowed to do and hoping it will behave.

That is not enough.

The prompt can inform. It cannot enforce.

If a user should not be able to refund an order, approve an invoice, access a protected record, modify an account, or trigger an external action, the system needs to check that explicitly before execution.

A useful way to think about permissions is at three layers:

Visibility

Can the model even see that this tool exists?

Proposal

Can the model suggest this tool in the current context?

Execution

Can the application actually perform this action for this user, on this resource, in this state?

Those are different questions. Treating them as one leads to brittle systems.

For higher-risk actions, it is often wise to add explicit confirmation steps or human approval gates. Read tools and write tools should not be treated as morally equivalent just because they share the same interface shape.
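
The three layers can be kept apart in code as well. The following is a minimal sketch, not a reference design: the roles, the `visibleTools` table, the requirement that an order be looked up before a refund is proposed, and the 100-unit cap for agents are all assumptions invented for illustration.

```typescript
// Hypothetical roles, tools, and policies, for illustration only.
type UserRole = "viewer" | "agent" | "manager";
type Actor = { id: string; role: UserRole };
type Order = { id: string; status: "paid" | "refunded" };

// Layer 1, visibility: which tool names are placed in the model's context at all.
const visibleTools: Record<UserRole, string[]> = {
  viewer: ["search_orders"],
  agent: ["search_orders", "create_refund"],
  manager: ["search_orders", "create_refund"],
};

function toolsVisibleTo(actor: Actor): string[] {
  return visibleTools[actor.role];
}

// Layer 2, proposal: may the model suggest this tool at this point in the conversation?
function mayPropose(actor: Actor, tool: string, orderLookedUp: boolean): boolean {
  if (toolsVisibleTo(actor).indexOf(tool) === -1) return false;
  // Assumed policy: a refund may only be proposed after the order was retrieved.
  return tool !== "create_refund" || orderLookedUp;
}

// Layer 3, execution: checked by the application against the actual resource and state.
function mayExecuteRefund(actor: Actor, order: Order, amount: number): boolean {
  if (order.status !== "paid") return false;
  // Assumed policy: agents are capped at 100; managers are not capped.
  if (actor.role === "agent") return amount <= 100;
  return actor.role === "manager";
}

const agent: Actor = { id: "u-7", role: "agent" };
const paidOrder: Order = { id: "o-1", status: "paid" };

// The agent can see and propose the refund tool, yet execution is still refused
// because the amount exceeds the assumed cap.
const canPropose = mayPropose(agent, "create_refund", true);
const canExecute = mayExecuteRefund(agent, paidOrder, 250);
```

The point of the sketch is that a tool being visible and proposable says nothing about whether a particular call should run; only the third function looks at the real order and amount.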

Principle 5: separate read tools from action tools

One of the cleanest design moves in real systems is separating tools by risk and function.

A tool that retrieves information is not the same kind of thing as a tool that changes the world. If the system treats them all as flat, interchangeable affordances, you lose an important layer of control.

A useful categorization is:

Read tools

Used to retrieve or inspect information.

Examples:

get_customer_profile
search_orders
retrieve_contract
list_open_tickets

Decision-support tools

Used to summarize, compare, classify, or analyze.

Examples:

summarize_ticket_history
compare_contract_versions
categorize_exception_case

Action tools

Used to mutate state or trigger external effects.

Examples:

approve_invoice
create_refund
send_customer_email
update_subscription_status

This separation helps with:

permissioning
execution policies
UI/UX expectations
testing strategy
audit logging
blast-radius control

A healthy default in many systems is read before write. Let the assistant gather context first, then ask for confirmation or proceed through a more constrained action path.
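
A small sketch of how that categorization could drive a read-before-write policy follows. The tool names come from the examples above; the `toolKinds` table, the `Turn` shape, and the `checkPolicy` rules are assumptions made for illustration, not a prescribed design.

```typescript
type ToolKind = "read" | "decision_support" | "action";

// Tool names taken from the examples above; the registry itself is hypothetical.
const toolKinds: Record<string, ToolKind> = {
  get_customer_profile: "read",
  search_orders: "read",
  summarize_ticket_history: "decision_support",
  approve_invoice: "action",
  create_refund: "action",
};

// What has happened so far in the current conversation turn.
type Turn = { priorCalls: string[]; userConfirmed: boolean };

type Verdict = "allow" | "needs_context" | "needs_confirmation" | "unknown_tool";

// Assumed policy: read and decision-support tools run freely, while action tools
// require at least one earlier read call plus explicit user confirmation.
function checkPolicy(tool: string, turn: Turn): Verdict {
  if (!Object.prototype.hasOwnProperty.call(toolKinds, tool)) return "unknown_tool";
  if (toolKinds[tool] !== "action") return "allow";

  const hasRead = turn.priorCalls.some((prior) => toolKinds[prior] === "read");
  if (!hasRead) return "needs_context";

  return turn.userConfirmed ? "allow" : "needs_confirmation";
}

// A refund proposed with no context gathered is sent back for a read first.
const eagerRefund = checkPolicy("create_refund", { priorCalls: [], userConfirmed: false });

// After a lookup, the same proposal waits for the user's confirmation.
const stagedRefund = checkPolicy("create_refund", {
  priorCalls: ["search_orders"],
  userConfirmed: false,
});
```

Because the kind is attached to the tool rather than inferred from its name, the same table can also feed permissioning, audit logging, and test selection.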

Principle 6: error handling is not an edge case — it is part of the product

Weak systems assume the tool call will work. Strong systems assume failure needs to be understandable.

There are several distinct failure types, and they should not all collapse into “something went wrong.”

Validation failure

The arguments are malformed or incomplete.

The assistant should not blindly retry. It should identify what is missing and, when appropriate, ask for it.

Permission failure

The user is not allowed to perform the action.

The assistant should say so clearly and not imply the task was completed.

Business-rule failure

The action is syntactically valid but not allowed in this state.

Examples:

invoice already approved
refund window expired
contract not in editable state
order already cancelled

The system should explain the condition and suggest the next valid path where possible.

External system failure

The downstream service timed out, returned an error, or became temporarily unavailable.

The assistant should acknowledge that the action did not complete. It should not hallucinate success, and it should retry only when the retry policy actually makes sense.

This is where many agent experiences feel brittle. Not because tools fail — tools always fail sometimes — but because the system has no graceful semantics for failure.
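
One way to give failures such semantics is to return them as structured values rather than raw exceptions. The sketch below assumes a hypothetical `ToolFailure` union with one variant per failure type described above, and a `planRecovery` function whose wording and retry limit are illustrative only.

```typescript
// One variant per failure type described above; field names are hypothetical.
type ToolFailure =
  | { kind: "validation"; missingFields: string[] }
  | { kind: "permission"; action: string }
  | { kind: "business_rule"; rule: string; nextStep?: string }
  | { kind: "external"; service: string; retryable: boolean };

type NextMove =
  | { move: "ask_user"; prompt: string }
  | { move: "explain"; message: string }
  | { move: "retry" };

// Maps each failure type to a distinct recovery path instead of a blind retry.
function planRecovery(failure: ToolFailure, attempt: number, maxAttempts: number): NextMove {
  if (failure.kind === "validation") {
    return { move: "ask_user", prompt: "Please provide: " + failure.missingFields.join(", ") };
  }
  if (failure.kind === "permission") {
    return {
      move: "explain",
      message: "You are not permitted to " + failure.action + ". Nothing was changed.",
    };
  }
  if (failure.kind === "business_rule") {
    const hint = failure.nextStep ? " Next step: " + failure.nextStep : "";
    return { move: "explain", message: failure.rule + hint };
  }
  // Remaining case: external system failure. Retry only when the failure is
  // marked retryable and the attempt budget is not exhausted.
  if (failure.retryable && attempt < maxAttempts) {
    return { move: "retry" };
  }
  return {
    move: "explain",
    message: failure.service + " is unavailable. The action did not complete.",
  };
}

// A timeout on the first of three attempts is retried.
const afterTimeout = planRecovery(
  { kind: "external", service: "payments", retryable: true },
  1,
  3
);

// A business-rule denial is explained, never retried.
const afterDenial = planRecovery(
  { kind: "business_rule", rule: "Refund window expired.", nextStep: "offer store credit" },
  1,
  3
);
```

Keeping the failure kind explicit means the assistant never has to parse an error string to decide whether to ask, explain, or retry.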

Principle 7: test tool choice, not just tool execution

A lot of teams unit-test their tools and think they are done.

That is only half the problem.

The deeper problem is whether the model chooses well in context.

This means you need to evaluate things like:

Did the model pick the right tool?
Did it ask for missing information instead of guessing?
Did it refrain from acting when the request was ambiguous?
Did it choose a read tool before a write tool?
Did it separate actions that should not be bundled?
Did it respect confirmation requirements?
Did it stay within the permission model?

These are scenario-level behaviors, not just code-level properties.

For example:

“What is the status of invoice INV-1427?” should trigger a read path, not an approval path.
“Approve this invoice” should not execute if the invoice is unclear or the user lacks permission.
“Please take care of it” should not cause the system to guess which risky action the user means.
“Refund and email the customer” may require staged execution with separate confirmations or checks.

In other words, you need to test the judgment around tool use, not merely the software behind it.
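
Scenario checks like these can be written down as data and run against whatever component picks the tool. In the sketch below, the expected first moves, the `get_invoice_status` tool name, and the `ask_clarification` label are assumptions, and `naiveChooser` is a deliberately crude keyword matcher standing in for a real model call.

```typescript
type Scenario = {
  utterance: string;
  // Expected first move: a tool name, or "ask_clarification" when acting would be a guess.
  expected: string;
};

// Utterances mirror the examples above; the expected moves are assumptions.
const scenarios: Scenario[] = [
  { utterance: "What is the status of invoice INV-1427?", expected: "get_invoice_status" },
  { utterance: "Approve this invoice", expected: "ask_clarification" },
  { utterance: "Please take care of it", expected: "ask_clarification" },
  { utterance: "Refund and email the customer", expected: "search_orders" },
];

type Chooser = (utterance: string) => string;

type EvalSummary = { total: number; passed: number; failures: Scenario[] };

// Runs every scenario through the chooser and collects the mismatches.
function evaluate(chooser: Chooser, cases: Scenario[]): EvalSummary {
  const failures = cases.filter((c) => chooser(c.utterance) !== c.expected);
  return { total: cases.length, passed: cases.length - failures.length, failures: failures };
}

// Stand-in for the real model call: a naive keyword matcher that acts too eagerly.
const naiveChooser: Chooser = (utterance) => {
  const text = utterance.toLowerCase();
  if (text.indexOf("refund") !== -1) return "create_refund";
  if (text.indexOf("approve") !== -1) return "approve_invoice";
  if (text.indexOf("status") !== -1) return "get_invoice_status";
  return "ask_clarification";
};

const summary = evaluate(naiveChooser, scenarios);
```

Here the naive chooser passes two of the four scenarios: it jumps straight to `approve_invoice` and `create_refund` where the expected move was to clarify or read first. Those are the judgment failures that unit tests on the tools themselves would never surface.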

Principle 8: observability is how tool systems become mature

If you cannot see what the assistant is doing with tools, you cannot improve it systematically.

Good observability in tool-calling systems usually includes:

tool selection frequency
argument validation failures
permission denials
business-rule denials
downstream failures
retries
multi-step sequences
abandonment rates
repeated ambiguity patterns
suspicious near-misses
cases where the assistant should have used a tool but did not

This is not just for debugging. It is how you discover structural issues in the design.

Maybe the tool names overlap too much. Maybe the descriptions are unclear. Maybe the model lacks context when choosing. Maybe a tool should be split into two. Maybe risky actions need stronger confirmation policies. Maybe the workflow itself is a bad candidate for free-form interaction.

Observability turns vague frustration into tractable design decisions.
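
As a starting point, several of the signals listed above can be derived from a plain log of tool calls. The sketch below is a minimal in-memory aggregation; the `ToolCallRecord` shape, the outcome labels, and the `summarize` function are hypothetical, and a production system would normally send these events to its existing metrics pipeline instead.

```typescript
// Hypothetical outcome labels, matching the failure categories discussed earlier.
type ToolOutcome =
  | "success"
  | "validation_failure"
  | "permission_denied"
  | "business_rule_denied"
  | "downstream_failure";

// One record per attempted tool call; attempt is 1 for the first try.
type ToolCallRecord = { tool: string; outcome: ToolOutcome; attempt: number };

type ToolStats = { calls: number; retries: number; byOutcome: Record<string, number> };

// Aggregates selection frequency, retries, and outcome counts per tool.
function summarize(records: ToolCallRecord[]): Record<string, ToolStats> {
  const stats: Record<string, ToolStats> = {};
  for (const record of records) {
    const seen = Object.prototype.hasOwnProperty.call(stats, record.tool);
    const entry: ToolStats = seen
      ? stats[record.tool]
      : { calls: 0, retries: 0, byOutcome: {} };
    entry.calls += 1;
    if (record.attempt > 1) entry.retries += 1;
    entry.byOutcome[record.outcome] = (entry.byOutcome[record.outcome] || 0) + 1;
    stats[record.tool] = entry;
  }
  return stats;
}

// Invented sample data for illustration.
const toolStats = summarize([
  { tool: "search_orders", outcome: "success", attempt: 1 },
  { tool: "create_refund", outcome: "validation_failure", attempt: 1 },
  { tool: "create_refund", outcome: "downstream_failure", attempt: 1 },
  { tool: "create_refund", outcome: "success", attempt: 2 },
]);
```

Even a table this small starts to point at design questions: a tool whose validation failures dominate its calls probably has an unclear schema or description.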

A realistic tool definition example

Here is the kind of tool definition that is much easier to operationalize safely:

{
  "name": "create_refund",
  "description": "Create a refund for an eligible paid order. Use only after verifying refund eligibility and collecting a clear user-visible reason.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {
        "type": "string",
        "description": "Internal order identifier."
      },
      "amount": {
        "type": "number",
        "description": "Refund amount in the order currency."
      },
      "reason_code": {
        "type": "string",
        "enum": [
          "duplicate",
          "customer_request",
          "service_issue",
          "billing_error"
        ],
        "description": "Standard refund reason used for reporting and audit purposes."
      },
      "customer_message": {
        "type": "string",
        "description": "Short message stored in the refund log and visible to support staff."
      }
    },
    "required": [
      "order_id",
      "amount",
      "reason_code",
      "customer_message"
    ],
    "additionalProperties": false
  }
}

Notice what this definition does well:

it narrows the action
it frames the precondition
it constrains the reason domain
it discourages invented fields
it makes validation straightforward
it supports better logging and reporting

This is not just a machine-readable contract. It is part of the behavioral control surface.
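
To show how directly such a definition translates into application-side validation, here is a sketch of a validator for `create_refund` arguments, in the same spirit as the approve-invoice validator earlier. The function and constant names are hypothetical, and the rule that `amount` must be positive is an added assumption that the schema itself does not state.

```typescript
// Enum values copied from the create_refund definition above.
const REASON_CODES = ["duplicate", "customer_request", "service_issue", "billing_error"] as const;
type ReasonCode = (typeof REASON_CODES)[number];

type CreateRefundArgs = {
  order_id: string;
  amount: number;
  reason_code: ReasonCode;
  customer_message: string;
};

const ALLOWED_KEYS = ["order_id", "amount", "reason_code", "customer_message"];

function isReasonCode(value: unknown): value is ReasonCode {
  return typeof value === "string" && (REASON_CODES as readonly string[]).indexOf(value) !== -1;
}

function validateCreateRefundArgs(input: unknown): CreateRefundArgs {
  if (!input || typeof input !== "object" || Array.isArray(input)) {
    throw new Error("Invalid payload");
  }
  const args = input as Record<string, unknown>;

  // Mirrors "additionalProperties": false by rejecting invented fields.
  for (const key of Object.keys(args)) {
    if (ALLOWED_KEYS.indexOf(key) === -1) throw new Error("Unexpected field: " + key);
  }

  const orderId = args["order_id"];
  const amount = args["amount"];
  const reasonCode = args["reason_code"];
  const message = args["customer_message"];

  if (typeof orderId !== "string" || !orderId.trim()) {
    throw new Error("order_id is required");
  }
  // Assumption: a refund amount must be a finite, positive number.
  if (typeof amount !== "number" || !isFinite(amount) || amount <= 0) {
    throw new Error("amount must be a positive number");
  }
  if (!isReasonCode(reasonCode)) {
    throw new Error("reason_code is not one of the allowed values");
  }
  if (typeof message !== "string" || !message.trim()) {
    throw new Error("customer_message is required");
  }

  return {
    order_id: orderId,
    amount: amount,
    reason_code: reasonCode,
    customer_message: message,
  };
}

const refundArgs = validateCreateRefundArgs({
  order_id: "o-1",
  amount: 25,
  reason_code: "duplicate",
  customer_message: "Charged twice for the same order.",
});
```

Passing validation here only means the payload is well formed. Whether the order is actually eligible for a refund remains a separate execution-time check.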

A practical pattern library for reliable tool use

If you build enough of these systems, certain patterns keep showing up:

Narrow tools over multipurpose tools

One clear action beats one vague Swiss Army knife almost every time.

Strong schemas over permissive schemas

Constrain what you can. Every degree of freedom becomes a chance for drift.

Application-side validation over trust in model output

The model can suggest. The system must verify.

Read-before-write over eager action

Gather context first. Change state later.

Explicit permissions over prompt-only guidance

Security and policy belong in execution logic.

Structured failures over raw exceptions

Return failures the assistant can actually handle well.

Scenario evaluations over isolated unit thinking

Test real conversational situations, not just payload shape.

Observability over intuition

If you cannot inspect tool behavior, you are improving by folklore.

A shift in mindset is part of the transition

Function calling matters because it turns language models from passive generators into participants in real workflows. That is a meaningful shift. It is also why the field is going through a predictable phase of overconfidence, simplification, and gradual maturation.

This is normal.

Whenever a new capability becomes operationally relevant, the first wave focuses on possibility. The next wave discovers the edge cases. The mature wave learns where control, design discipline, and trust actually matter. Tool use in AI systems is moving through that same transition now.

So the goal is not to become cynical about function calling, nor to pretend that reliable tool use should already be effortless. The goal is to understand that the real value does not come from the existence of the feature. It comes from the quality of the operating environment built around it.

The systems that benefit most from function calling will not be the ones with the largest pile of tools. They will be the ones designed with the clearest boundaries, strongest validation, best permission model, and best understanding of where humans should still remain in control.

Function calling is here to stay.

The question is not whether models can use tools. They can.

The question is whether we are prepared to design tool-using systems with enough clarity, restraint, and rigor for them to be worth trusting.

The bottom line

Function calling is not interesting because the model can emit JSON.

It becomes interesting when tool use is:

correct enough to trust
constrained enough to control
observable enough to improve
permission-aware enough to deploy responsibly
recoverable enough to survive production failure
understandable enough for humans to live with

That is the difference between a clever capability demo and a system people can actually rely on.

And that gap — between possibility and reliability — is exactly where the best engineering work still needs to happen.
