Edit

Share via


How to generate chat completions with Azure AI Foundry Models

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

This article explains how to use chat completions API with models deployed in Azure AI Foundry Models.

Prerequisites

To use chat completion models in your application, you need:

  • A chat completions model deployment. If you don't have one, read Add and configure Foundry Models to add a chat completions model to your resource.

    • This example uses mistral-large-2407.

Use chat completions

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]),
    model="mistral-large-2407"
)

If you've configured the resource with Microsoft Entra ID support, you can use the following code snippet to create a client.

import os
from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

client = ChatCompletionsClient(
    endpoint="https://<resource>.services.ai.azure.com/models",
    credential=DefaultAzureCredential(),
    model="mistral-large-2407"
)

Create a chat completion request

The following example shows how you can create a basic chat completions request to the model.

from azure.ai.inference.models import SystemMessage, UserMessage

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
)

Note

Some models don't support system messages (role="system"). When you use the Azure AI model inference API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence.

The response is as follows, where you can see the model's usage statistics:

print("Response:", response.choices[0].message.content)
print("Model:", response.model)
print("Usage:")
print("\tPrompt tokens:", response.usage.prompt_tokens)
print("\tTotal tokens:", response.usage.total_tokens)
print("\tCompletion tokens:", response.usage.completion_tokens)
Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.
Model: mistral-large-2407
Usage: 
  Prompt tokens: 19
  Total tokens: 91
  Completion tokens: 72

Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion.

Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds.

You can stream the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.

To stream completions, set stream=True when you call the model.

result = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
    temperature=0,
    top_p=1,
    max_tokens=2048,
    stream=True,
)

To visualize the output, define a helper function to print the stream.

def print_stream(result):
    """
    Prints the chat completion with streaming.
    """
    import time
    for update in result:
        if update.choices:
            print(update.choices[0].delta.content, end="")

You can visualize how streaming generates content:

print_stream(result)

Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see Azure AI Model Inference API reference.

from azure.ai.inference.models import ChatCompletionsResponseFormatText

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
    presence_penalty=0.1,
    frequency_penalty=0.8,
    max_tokens=2048,
    stop=["<|endoftext|>"],
    temperature=0,
    top_p=1,
    response_format={ "type": ChatCompletionsResponseFormatText() },
)

Some models don't support JSON output formatting. You can always prompt the model to generate JSON outputs. However, such outputs aren't guaranteed to be valid JSON.

If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Create JSON outputs

Some models can create JSON outputs. Set response_format to json_object to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if finish_reason="length", which indicates that the generation exceeded max_tokens or that the conversation exceeded the max context length.

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant that always generate responses in JSON format, using."
                      " the following format: { ""answer"": ""response"" }."),
        UserMessage(content="How many languages are in the world?"),
    ],
    response_format="json_object"
)

Pass extra parameters to the model

The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
    model_extras={
        "logprobs": True
    }
)

Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.

Use tools

Some models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way.

The following code example creates a tool definition that is able to look from flight information from two different cities.

from azure.ai.inference.models import FunctionDefinition, ChatCompletionsToolDefinition

flight_info = ChatCompletionsToolDefinition(
    function=FunctionDefinition(
        name="get_flight_info",
        description="Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight",
        parameters={
            "type": "object",
            "properties": {
                "origin_city": {
                    "type": "string",
                    "description": "The name of the city where the flight originates",
                },
                "destination_city": {
                    "type": "string",
                    "description": "The flight destination city",
                },
            },
            "required": ["origin_city", "destination_city"],
        },
    )
)

tools = [flight_info]

In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train.

def get_flight_info(loc_origin: str, loc_destination: str):
    return { 
        "info": f"There are no flights available from {loc_origin} to {loc_destination}. You should take a train, specially if it helps to reduce CO2 emissions."
    }

Note

Cohere models require a tool's responses to be a valid JSON content formatted as a string. When constructing messages of type Tool, ensure the response is a valid JSON string.

Prompt the model to book flights with the help of this function:

messages = [
    SystemMessage(
        content="You are a helpful assistant that help users to find information about traveling, how to get"
                " to places and the different transportations options. You care about the environment and you"
                " always have that in mind when answering inqueries.",
    ),
    UserMessage(
        content="When is the next flight from Miami to Seattle?",
    ),
]

response = client.complete(
    messages=messages, tools=tools, tool_choice="auto"
)

You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type function.

response_message = response.choices[0].message
tool_calls = response_message.tool_calls

print("Finish reason:", response.choices[0].finish_reason)
print("Tool call:", tool_calls)

To continue, append this message to the chat history:

messages.append(
    response_message
)

Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history.

import json
from azure.ai.inference.models import ToolMessage

for tool_call in tool_calls:

    # Get the tool details:

    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments.replace("\'", "\""))
    tool_call_id = tool_call.id

    print(f"Calling function `{function_name}` with arguments {function_args}")

    # Call the function defined above using `locals()`, which returns the list of all functions 
    # available in the scope as a dictionary. Notice that this is just done as a simple way to get
    # the function callable from its string name. Then we can call it with the corresponding
    # arguments.

    callable_func = locals()[function_name]
    function_response = callable_func(**function_args)

    print("->", function_response)

    # Once we have a response from the function and its arguments, we can append a new message to the chat 
    # history. Notice how we are telling to the model that this chat message came from a tool:

    messages.append(
        ToolMessage(
            tool_call_id=tool_call_id,
            content=json.dumps(function_response)
        )
    )

View the response from the model:

response = client.complete(
    messages=messages,
    tools=tools,
)

Apply Guardrails and controls

The Azure AI model inference API supports Azure AI content safety. When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.

The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled.

from azure.ai.inference.models import AssistantMessage, UserMessage, SystemMessage

try:
    response = client.complete(
        messages=[
            SystemMessage(content="You are an AI assistant that helps people find information."),
            UserMessage(content="Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."),
        ]
    )

    print(response.choices[0].message.content)

except HttpResponseError as ex:
    if ex.status_code == 400:
        response = ex.response.json()
        if isinstance(response, dict) and "error" in response:
            print(f"Your request triggered an {response['error']['code']} error:\n\t {response['error']['message']}")
        else:
            raise
    raise

Tip

To learn more about how you can configure and control Azure AI content safety settings, check the Azure AI content safety documentation.

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

This article explains how to use chat completions API with models deployed in Azure AI Foundry Models.

Prerequisites

To use chat completion models in your application, you need:

  • Install the Azure Inference library for JavaScript with the following command:

    npm install @azure-rest/ai-inference
    npm install @azure/core-auth
    npm install @azure/identity
    

    If you are using Node.js, you can configure the dependencies in package.json:

    package.json

    {
      "name": "main_app",
      "version": "1.0.0",
      "description": "",
      "main": "app.js",
      "type": "module",
      "dependencies": {
        "@azure-rest/ai-inference": "1.0.0-beta.6",
        "@azure/core-auth": "1.9.0",
        "@azure/core-sse": "2.2.0",
        "@azure/identity": "4.8.0"
      }
    }
    
  • Import the following:

    import ModelClient from "@azure-rest/ai-inference";
    import { isUnexpected } from "@azure-rest/ai-inference";
    import { createSseStream } from "@azure/core-sse";
    import { AzureKeyCredential } from "@azure/core-auth";
    import { DefaultAzureCredential } from "@azure/identity";
    

Use chat completions

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

const client = ModelClient(
    "https://<resource>.services.ai.azure.com/api/models", 
    new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL)
);

If you've configured the resource with Microsoft Entra ID support, you can use the following code snippet to create a client.

const clientOptions = { credentials: { "https://cognitiveservices.azure.com" } };

const client = ModelClient(
    "https://<resource>.services.ai.azure.com/api/models", 
    new DefaultAzureCredential()
    clientOptions,
);

Create a chat completion request

The following example shows how you can create a basic chat completions request to the model.

var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    body: {
        model: "mistral-large-2407",
        messages: messages,
    }
});

Note

Some models don't support system messages (role="system"). When you use the Azure AI model inference API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence.

The response is as follows, where you can see the model's usage statistics:

if (isUnexpected(response)) {
    throw response.body.error;
}

console.log("Response: ", response.body.choices[0].message.content);
console.log("Model: ", response.body.model);
console.log("Usage:");
console.log("\tPrompt tokens:", response.body.usage.prompt_tokens);
console.log("\tTotal tokens:", response.body.usage.total_tokens);
console.log("\tCompletion tokens:", response.body.usage.completion_tokens);
Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.
Model: mistral-large-2407
Usage: 
  Prompt tokens: 19
  Total tokens: 91
  Completion tokens: 72

Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion.

Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds.

You can stream the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.

To stream completions, use .asNodeStream() when you call the model.

var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    body: {
        model: "mistral-large-2407",
        messages: messages,
        stream: true,
    }
}).asNodeStream();

You can visualize how streaming generates content:

var stream = response.body;
if (!stream) {
    stream.destroy();
    throw new Error(`Failed to get chat completions with status: ${response.status}`);
}

if (response.status !== "200") {
    throw new Error(`Failed to get chat completions: ${response.body.error}`);
}

var sses = createSseStream(stream);

for await (const event of sses) {
    if (event.data === "[DONE]") {
        return;
    }
    for (const choice of (JSON.parse(event.data)).choices) {
        process.stdout.write(choice.delta?.content ?? "");
    }
}

Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see Azure AI Model Inference API reference.

var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    body: {
        model: "mistral-large-2407",
        messages: messages,
        presence_penalty: "0.1",
        frequency_penalty: "0.8",
        max_tokens: 2048,
        stop: ["<|endoftext|>"],
        temperature: 0,
        top_p: 1,
        response_format: { type: "text" },
    }
});

Some models don't support JSON output formatting. You can always prompt the model to generate JSON outputs. However, such outputs aren't guaranteed to be valid JSON.

If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Create JSON outputs

Some models can create JSON outputs. Set response_format to json_object to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if finish_reason="length", which indicates that the generation exceeded max_tokens or that the conversation exceeded the max context length.

var messages = [
    { role: "system", content: "You are a helpful assistant that always generate responses in JSON format, using."
        + " the following format: { \"answer\": \"response\" }." },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    body: {
        model: "mistral-large-2407",
        messages: messages,
        response_format: { type: "json_object" }
    }
});

Pass extra parameters to the model

The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    headers: {
        "extra-params": "pass-through"
    },
    body: {
        model: "mistral-large-2407",
        messages: messages,
        logprobs: true
    }
});

Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.

Use tools

Some models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way.

The following code example creates a tool definition that is able to look from flight information from two different cities.

const flight_info = {
    name: "get_flight_info",
    description: "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight",
    parameters: {
        type: "object",
        properties: {
            origin_city: {
                type: "string",
                description: "The name of the city where the flight originates",
            },
            destination_city: {
                type: "string",
                description: "The flight destination city",
            },
        },
        required: ["origin_city", "destination_city"],
    },
}

const tools = [
    {
        type: "function",
        function: flight_info,
    },
];

In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train.

function get_flight_info(loc_origin, loc_destination) {
    return {
        info: "There are no flights available from " + loc_origin + " to " + loc_destination + ". You should take a train, specially if it helps to reduce CO2 emissions."
    }
}

Note

Cohere models require a tool's responses to be a valid JSON content formatted as a string. When constructing messages of type Tool, ensure the response is a valid JSON string.

Prompt the model to book flights with the help of this function:

var result = await client.path("/chat/completions").post({
    body: {
        model: "mistral-large-2407",
        messages: messages,
        tools: tools,
        tool_choice: "auto"
    }
});

You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type function.

const response_message = response.body.choices[0].message;
const tool_calls = response_message.tool_calls;

console.log("Finish reason: " + response.body.choices[0].finish_reason);
console.log("Tool call: " + tool_calls);

To continue, append this message to the chat history:

messages.push(response_message);

Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history.

function applyToolCall({ function: call, id }) {
    // Get the tool details:
    const tool_params = JSON.parse(call.arguments);
    console.log("Calling function " + call.name + " with arguments " + tool_params);

    // Call the function defined above using `window`, which returns the list of all functions 
    // available in the scope as a dictionary. Notice that this is just done as a simple way to get
    // the function callable from its string name. Then we can call it with the corresponding
    // arguments.
    const function_response = tool_params.map(window[call.name]);
    console.log("-> " + function_response);

    return function_response
}

for (const tool_call of tool_calls) {
    var tool_response = tool_call.apply(applyToolCall);

    messages.push(
        {
            role: "tool",
            tool_call_id: tool_call.id,
            content: tool_response
        }
    );
}

View the response from the model:

var result = await client.path("/chat/completions").post({
    body: {
        messages: messages,
        tools: tools,
    }
});

Apply Guardrails and controls

The Azure AI model inference API supports Azure AI content safety. When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.

The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled.

try {
    var messages = [
        { role: "system", content: "You are an AI assistant that helps people find information." },
        { role: "user", content: "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." },
    ];

    var response = await client.path("/chat/completions").post({
        body: {
            messages: messages,
        }
    });

    console.log(response.body.choices[0].message.content);
}
catch (error) {
    if (error.status_code == 400) {
        var response = JSON.parse(error.response._content);
        if (response.error) {
            console.log(`Your request triggered an ${response.error.code} error:\n\t ${response.error.message}`);
        }
        else
        {
            throw error;
        }
    }
}

Tip

To learn more about how you can configure and control Azure AI content safety settings, check the Azure AI content safety documentation.

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

This article explains how to use chat completions API with models deployed in Azure AI Foundry Models.

Prerequisites

To use chat completion models in your application, you need:

  • Add the Azure AI inference package to your project:

    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-ai-inference</artifactId>
        <version>1.0.0-beta.4</version>
    </dependency>
    
  • If you are using Entra ID, you also need the following package:

    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-identity</artifactId>
        <version>1.15.3</version>
    </dependency>
    
  • Import the following namespace:

    package com.azure.ai.inference.usage;
    
    import com.azure.ai.inference.EmbeddingsClient;
    import com.azure.ai.inference.EmbeddingsClientBuilder;
    import com.azure.ai.inference.ChatCompletionsClient;
    import com.azure.ai.inference.ChatCompletionsClientBuilder;
    import com.azure.ai.inference.models.EmbeddingsResult;
    import com.azure.ai.inference.models.EmbeddingItem;
    import com.azure.ai.inference.models.ChatCompletions;
    import com.azure.core.credential.AzureKeyCredential;
    import com.azure.core.util.Configuration;
    
    import java.util.ArrayList;
    import java.util.List;
    
  • A chat completions model deployment. If you don't have one, read Add and configure Foundry Models to add a chat completions model to your resource.

    • This example uses mistral-large-2407.

Use chat completions

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

ChatCompletionsClient client = new ChatCompletionsClientBuilder()
    .credential(new AzureKeyCredential("{key}"))
    .endpoint("https://<resource>.services.ai.azure.com/api/models")
    .buildClient();

If you've configured the resource with Microsoft Entra ID support, you can use the following code snippet to create a client.

TokenCredential defaultCredential = new DefaultAzureCredentialBuilder().build();
ChatCompletionsClient client = new ChatCompletionsClientBuilder()
    .credential(defaultCredential)
    .endpoint("https://<resource>.services.ai.azure.com/api/models")
    .buildClient();

Create a chat completion request

The following example shows how you can create a basic chat completions request to the model.

List<ChatRequestMessage> chatMessages = new ArrayList<>();
chatMessages.add(new ChatRequestSystemMessage("You are a helpful assistant."));
chatMessages.add(new ChatRequestUserMessage("How many languages are in the world?"));

ChatCompletions response = client.complete(new ChatCompletionsOptions(chatMessages));

Note

Some models don't support system messages (role="system"). When you use the Azure AI model inference API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence.

The response is as follows, where you can see the model's usage statistics:

System.out.printf("Model ID=%s is created at %s.%n", chatCompletions.getId(), chatCompletions.getCreated());
for (ChatChoice choice : chatCompletions.getChoices()) {
    ChatResponseMessage message = choice.getMessage();
    System.out.printf("Index: %d, Chat Role: %s.%n", choice.getIndex(), message.getRole());
    System.out.println("Message:");
    System.out.println(message.getContent());
}
Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.
Model: mistral-large-2407
Usage: 
  Prompt tokens: 19
  Total tokens: 91
  Completion tokens: 72

Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion.

Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds.

You can stream the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.

List<ChatRequestMessage> chatMessages = new ArrayList<>();
chatMessages.add(new ChatRequestSystemMessage("You are a helpful assistant."));
chatMessages.add(new ChatRequestUserMessage("How many languages are in the world?"));

client.completeStream(new ChatCompletionsOptions(chatMessages))
    .forEach(chatCompletions -> {
        if (CoreUtils.isNullOrEmpty(chatCompletions.getChoices())) {
            return;
        }
        StreamingChatResponseMessageUpdate delta = chatCompletions.getChoice().getDelta();
        if (delta.getRole() != null) {
            System.out.println("Role = " + delta.getRole());
        }
        if (delta.getContent() != null) {
            String content = delta.getContent();
            System.out.print(content);
        }
    });

Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see Azure AI Model Inference API reference. Some models don't support JSON output formatting. You can always prompt the model to generate JSON outputs. However, such outputs aren't guaranteed to be valid JSON.

If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Create JSON outputs

Some models can create JSON outputs. Set response_format to json_object to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if finish_reason="length", which indicates that the generation exceeded max_tokens or that the conversation exceeded the max context length.

Pass extra parameters to the model

The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.

Use tools

Some models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way.

The following code example creates a tool definition that is able to look from flight information from two different cities.

In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train.

Note

Cohere models require a tool's responses to be a valid JSON content formatted as a string. When constructing messages of type Tool, ensure the response is a valid JSON string.

Prompt the model to book flights with the help of this function:

You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type function.

To continue, append this message to the chat history:

Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history.

View the response from the model:

Apply Guardrails and controls

The Azure AI model inference API supports Azure AI content safety. When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.

The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled.

Tip

To learn more about how you can configure and control Azure AI content safety settings, check the Azure AI content safety documentation.

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

This article explains how to use chat completions API with models deployed in Azure AI Foundry Models.

Prerequisites

To use chat completion models in your application, you need:

  • Install the Azure AI inference package with the following command:

    dotnet add package Azure.AI.Inference --prerelease
    
  • If you are using Entra ID, you also need the following package:

    dotnet add package Azure.Identity
    
  • A chat completions model deployment. If you don't have one, read Add and configure Foundry Models to add a chat completions model to your resource.

    • This example uses mistral-large-2407.

Use chat completions

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.

ChatCompletionsClient client = new ChatCompletionsClient(
    new Uri("https://<resource>.services.ai.azure.com/api/models"),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL")),
);

If you've configured the resource with Microsoft Entra ID support, you can use the following code snippet to create a client.

client = new ChatCompletionsClient(
    new Uri("https://<resource>.services.ai.azure.com/api/models"),
    new DefaultAzureCredential(),
);

Create a chat completion request

The following example shows how you can create a basic chat completions request to the model.

ChatCompletionsOptions requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("How many languages are in the world?")
    },
    Model = "mistral-large-2407",
};

Response<ChatCompletions> response = client.Complete(requestOptions);

Note

Some models don't support system messages (role="system"). When you use the Azure AI model inference API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence.

The response is as follows, where you can see the model's usage statistics:

Console.WriteLine($"Response: {response.Value.Content}");
Console.WriteLine($"Model: {response.Value.Model}");
Console.WriteLine("Usage:");
Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}");
Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}");
Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}");
Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.
Model: mistral-large-2407
Usage: 
  Prompt tokens: 19
  Total tokens: 91
  Completion tokens: 72

Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion.

Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds.

You can stream the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.

To stream completions, use CompleteStreamingAsync method when you call the model. Notice that in this example we the call is wrapped in an asynchronous method.

static async Task StreamMessageAsync(ChatCompletionsClient client)
{
    ChatCompletionsOptions requestOptions = new ChatCompletionsOptions()
    {
        Messages = {
            new ChatRequestSystemMessage("You are a helpful assistant."),
            new ChatRequestUserMessage("How many languages are in the world? Write an essay about it.")
        },
        MaxTokens=4096,
        Model = "mistral-large-2407",
    };

    StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions);

    await PrintStream(streamResponse);
}

To visualize the output, define an asynchronous method to print the stream in the console.

static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response)
{
    await foreach (StreamingChatCompletionsUpdate chatUpdate in response)
    {
        if (chatUpdate.Role.HasValue)
        {
            Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: ");
        }
        if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate))
        {
            Console.Write(chatUpdate.ContentUpdate);
        }
    }
}

You can visualize how streaming generates content:

StreamMessageAsync(client).GetAwaiter().GetResult();

Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see Azure AI Model Inference API reference.

requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("How many languages are in the world?")
    },
    Model = "mistral-large-2407",
    PresencePenalty = 0.1f,
    FrequencyPenalty = 0.8f,
    MaxTokens = 2048,
    StopSequences = { "<|endoftext|>" },
    Temperature = 0,
    NucleusSamplingFactor = 1,
    ResponseFormat = new ChatCompletionsResponseFormatText()
};

response = client.Complete(requestOptions);
Console.WriteLine($"Response: {response.Value.Content}");

Some models don't support JSON output formatting. You can always prompt the model to generate JSON outputs. However, such outputs aren't guaranteed to be valid JSON.

If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Create JSON outputs

Some models can create JSON outputs. Set response_format to json_object to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if finish_reason="length", which indicates that the generation exceeded max_tokens or that the conversation exceeded the max context length.

requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage(
            "You are a helpful assistant that always generate responses in JSON format, " +
            "using. the following format: { \"answer\": \"response\" }."
        ),
        new ChatRequestUserMessage(
            "How many languages are in the world?"
        )
    },
    ResponseFormat = new ChatCompletionsResponseFormatJsonObject(),
    Model = "mistral-large-2407",
};

response = client.Complete(requestOptions);
Console.WriteLine($"Response: {response.Value.Content}");

Pass extra parameters to the model

The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("How many languages are in the world?")
    },
    Model = "mistral-large-2407",
    AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } },
};

response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough);
Console.WriteLine($"Response: {response.Value.Content}");

Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.

Use tools

Some models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way.

The following code example creates a tool definition that is able to look from flight information from two different cities.

FunctionDefinition flightInfoFunction = new FunctionDefinition("getFlightInfo")
{
    Description = "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight",
    Parameters = BinaryData.FromObjectAsJson(new
    {
        Type = "object",
        Properties = new
        {
            origin_city = new
            {
                Type = "string",
                Description = "The name of the city where the flight originates"
            },
            destination_city = new
            {
                Type = "string",
                Description = "The flight destination city"
            }
        }
    },
        new JsonSerializerOptions() { PropertyNamingPolicy = JsonNamingPolicy.CamelCase }
    )
};

ChatCompletionsFunctionToolDefinition getFlightTool = new ChatCompletionsFunctionToolDefinition(flightInfoFunction);

In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train.

static string getFlightInfo(string loc_origin, string loc_destination)
{
    return JsonSerializer.Serialize(new
    {
        info = $"There are no flights available from {loc_origin} to {loc_destination}. You " +
        "should take a train, specially if it helps to reduce CO2 emissions."
    });
}

Note

Cohere models require a tool's responses to be a valid JSON content formatted as a string. When constructing messages of type Tool, ensure the response is a valid JSON string.

Prompt the model to book flights with the help of this function:

var chatHistory = new List<ChatRequestMessage>(){
        new ChatRequestSystemMessage(
            "You are a helpful assistant that help users to find information about traveling, " +
            "how to get to places and the different transportations options. You care about the" +
            "environment and you always have that in mind when answering inqueries."
        ),
        new ChatRequestUserMessage("When is the next flight from Miami to Seattle?")
    };

requestOptions = new ChatCompletionsOptions(chatHistory, model: "mistral-large-2407");
requestOptions.Tools.Add(getFlightTool);
requestOptions.ToolChoice = ChatCompletionsToolChoice.Auto;

response = client.Complete(requestOptions);

You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type function.

var responseMessage = response.Value;
var toolsCall = responseMessage.ToolCalls;

Console.WriteLine($"Finish reason: {response.Value.Choices[0].FinishReason}");
Console.WriteLine($"Tool call: {toolsCall[0].Id}");

To continue, append this message to the chat history:

requestOptions.Messages.Add(new ChatRequestAssistantMessage(response.Value));

Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history.

foreach (ChatCompletionsToolCall tool in toolsCall)
{
    if (tool is ChatCompletionsFunctionToolCall functionTool)
    {
        // Get the tool details:
        string callId = functionTool.Id;
        string toolName = functionTool.Name;
        string toolArgumentsString = functionTool.Arguments;
        Dictionary<string, object> toolArguments = JsonSerializer.Deserialize<Dictionary<string, object>>(toolArgumentsString);

        // Here you have to call the function defined. In this particular example we use 
        // reflection to find the method we definied before in a static class called 
        // `ChatCompletionsExamples`. Using reflection allows us to call a function 
        // by string name. Notice that this is just done for demonstration purposes as a 
        // simple way to get the function callable from its string name. Then we can call 
        // it with the corresponding arguments.

        var flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Static;
        string toolResponse = (string)typeof(ChatCompletionsExamples).GetMethod(toolName, flags).Invoke(null, toolArguments.Values.Cast<object>().ToArray());

        Console.WriteLine("->", toolResponse);
        requestOptions.Messages.Add(new ChatRequestToolMessage(toolResponse, callId));
    }
    else
        throw new Exception("Unsupported tool type");
}

View the response from the model:

response = client.Complete(requestOptions);

Apply Guardrails and controls

The Azure AI model inference API supports Azure AI content safety. When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.

The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled.

try
{
    requestOptions = new ChatCompletionsOptions()
    {
        Messages = {
            new ChatRequestSystemMessage("You are an AI assistant that helps people find information."),
            new ChatRequestUserMessage(
                "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."
            ),
        },
        Model = "mistral-large-2407",
    };

    response = client.Complete(requestOptions);
    Console.WriteLine(response.Value.Content);
}
catch (RequestFailedException ex)
{
    if (ex.ErrorCode == "content_filter")
    {
        Console.WriteLine($"Your query has trigger Azure Content Safety: {ex.Message}");
    }
    else
    {
        throw;
    }
}

Tip

To learn more about how you can configure and control Azure AI content safety settings, check the Azure AI content safety documentation.

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

This article explains how to use chat completions API with models deployed in Azure AI Foundry Models.

Prerequisites

To use chat completion models in your application, you need:

  • A chat completions model deployment. If you don't have one, read Add and configure Foundry Models to add a chat completions model to your resource.

    • This example uses mistral-large-2407.

Use chat completions

To use chat completions API, use the route /chat/completions appended to the base URL along with your credential indicated in api-key.

POST https://<resource>.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
Content-Type: application/json
api-key: <key>

If you've configured the resource with Microsoft Entra ID support, pass you token in the Authorization header with the format Bearer <token>. Use scope https://cognitiveservices.azure.com/.default.

POST https://<resource>.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
Content-Type: application/json
Authorization: Bearer <token>

Using Microsoft Entra ID might require extra configuration in your resource to grant access. Learn how to configure key-less authentication with Microsoft Entra ID.

Create a chat completion request

The following example shows how you can create a basic chat completions request to the model.

{
    "model": "mistral-large-2407",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ]
}

Note

Some models don't support system messages (role="system"). When you use the Azure AI Foundry Models API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence.

The response is as follows, where you can see the model's usage statistics:

{
    "id": "0a1234b5de6789f01gh2i345j6789klm",
    "object": "chat.completion",
    "created": 1718726686,
    "model": "mistral-large-2407",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.",
                "tool_calls": null
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 19,
        "total_tokens": 91,
        "completion_tokens": 72
    }
}

Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion.

Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds.

You can stream the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.

{
    "model": "mistral-large-2407",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ],
    "stream": true,
    "temperature": 0,
    "top_p": 1,
    "max_tokens": 2048
}

You can visualize how streaming generates content:

{
    "id": "23b54589eba14564ad8a2e6978775a39",
    "object": "chat.completion.chunk",
    "created": 1718726371,
    "model": "mistral-large-2407",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "content": ""
            },
            "finish_reason": null,
            "logprobs": null
        }
    ]
}

The last message in the stream has finish_reason set, indicating the reason for the generation process to stop.

{
    "id": "23b54589eba14564ad8a2e6978775a39",
    "object": "chat.completion.chunk",
    "created": 1718726371,
    "model": "mistral-large-2407",
    "choices": [
        {
            "index": 0,
            "delta": {
                "content": ""
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 19,
        "total_tokens": 91,
        "completion_tokens": 72
    }
}

Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see Foundry Models API reference.

{
    "model": "mistral-large-2407",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ],
    "presence_penalty": 0.1,
    "frequency_penalty": 0.8,
    "max_tokens": 2048,
    "stop": ["<|endoftext|>"],
    "temperature" :0,
    "top_p": 1,
    "response_format": { "type": "text" }
}
{
    "id": "0a1234b5de6789f01gh2i345j6789klm",
    "object": "chat.completion",
    "created": 1718726686,
    "model": "mistral-large-2407",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.",
                "tool_calls": null
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 19,
        "total_tokens": 91,
        "completion_tokens": 72
    }
}

Some models don't support JSON output formatting. You can always prompt the model to generate JSON outputs. However, such outputs aren't guaranteed to be valid JSON.

If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Create JSON outputs

Some models can create JSON outputs. Set response_format to json_object to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if finish_reason="length", which indicates that the generation exceeded max_tokens or that the conversation exceeded the max context length.

{
    "model": "mistral-large-2407",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant that always generate responses in JSON format, using the following format: { \"answer\": \"response\" }"
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ],
    "response_format": { "type": "json_object" }
}
{
    "id": "0a1234b5de6789f01gh2i345j6789klm",
    "object": "chat.completion",
    "created": 1718727522,
    "model": "mistral-large-2407",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "{\"answer\": \"There are approximately 7,117 living languages in the world today, according to the latest estimates. However, this number can vary as some languages become extinct and others are newly discovered or classified.\"}",
                "tool_calls": null
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 39,
        "total_tokens": 87,
        "completion_tokens": 48
    }
}

Pass extra parameters to the model

The Foundry Models Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

POST https://<resource>.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
Authorization: Bearer <TOKEN>
Content-Type: application/json
extra-parameters: pass-through
{
    "model": "mistral-large-2407",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ],
    "logprobs": true
}

Before you pass extra parameters to The Azure AI Model Inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.

Use tools

Some models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way.

The following code example creates a tool definition that is able to look from flight information from two different cities.

{
    "type": "function",
    "function": {
        "name": "get_flight_info",
        "description": "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight",
        "parameters": {
            "type": "object",
            "properties": {
                "origin_city": {
                    "type": "string",
                    "description": "The name of the city where the flight originates"
                },
                "destination_city": {
                    "type": "string",
                    "description": "The flight destination city"
                }
            },
            "required": [
                "origin_city",
                "destination_city"
            ]
        }
    }
}

In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train.

Note

Cohere models require a tool's responses to be a valid JSON content formatted as a string. When constructing messages of type Tool, ensure the response is a valid JSON string.

Prompt the model to book flights with the help of this function:

{
    "model": "mistral-large-2407",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant that help users to find information about traveling, how to get to places and the different transportations options. You care about the environment and you always have that in mind when answering inqueries"
        },
        {
            "role": "user",
            "content": "When is the next flight from Miami to Seattle?"
        }
    ],
    "tool_choice": "auto",
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_flight_info",
                "description": "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "origin_city": {
                            "type": "string",
                            "description": "The name of the city where the flight originates"
                        },
                        "destination_city": {
                            "type": "string",
                            "description": "The flight destination city"
                        }
                    },
                    "required": [
                        "origin_city",
                        "destination_city"
                    ]
                }
            }
        }
    ]
}

You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type function.

{
    "id": "0a1234b5de6789f01gh2i345j6789klm",
    "object": "chat.completion",
    "created": 1718726007,
    "model": "mistral-large-2407",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "abc0dF1gh",
                        "type": "function",
                        "function": {
                            "name": "get_flight_info",
                            "arguments": "{\"origin_city\": \"Miami\", \"destination_city\": \"Seattle\"}",
                            "call_id": null
                        }
                    }
                ]
            },
            "finish_reason": "tool_calls",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 190,
        "total_tokens": 226,
        "completion_tokens": 36
    }
}

To continue, append this message to the chat history:

Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history.

View the response from the model:

{
    "model": "mistral-large-2407",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant that help users to find information about traveling, how to get to places and the different transportations options. You care about the environment and you always have that in mind when answering inqueries"
        },
        {
            "role": "user",
            "content": "When is the next flight from Miami to Seattle?"
        },
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [
                {
                    "id": "abc0DeFgH",
                    "type": "function",
                    "function": {
                        "name": "get_flight_info",
                        "arguments": "{\"origin_city\": \"Miami\", \"destination_city\": \"Seattle\"}",
                        "call_id": null
                    }
                }
            ]
        },
        {
            "role": "tool",
            "content": "{ \"info\": \"There are no flights available from Miami to Seattle. You should take a train, specially if it helps to reduce CO2 emissions.\" }",
            "tool_call_id": "abc0DeFgH" 
        }
    ],
    "tool_choice": "auto",
    "tools": [
        {
            "type": "function",
            "function": {
            "name": "get_flight_info",
            "description": "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight",
            "parameters":{
                "type": "object",
                "properties": {
                    "origin_city": {
                        "type": "string",
                        "description": "The name of the city where the flight originates"
                    },
                    "destination_city": {
                        "type": "string",
                        "description": "The flight destination city"
                    }
                },
                "required": ["origin_city", "destination_city"]
            }
            }
        }
    ]
}

Apply Guardrails and controls

The Azure AI Model Inference API supports Azure AI Content Safety. When you use deployments with Azure AI Content Safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.

The following example shows how to handle events when the model detects harmful content in the input prompt.

{
    "model": "mistral-large-2407",
    "messages": [
        {
            "role": "system",
            "content": "You are an AI assistant that helps people find information."
        },
                {
            "role": "user",
            "content": "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."
        }
    ]
}
{
    "error": {
        "message": "The response was filtered due to the prompt triggering Microsoft's content management policy. Please modify your prompt and retry.",
        "type": null,
        "param": "prompt",
        "code": "content_filter",
        "status": 400
    }
}

Tip

To learn more about how you can configure and control Azure AI Content Safety settings, check the Azure AI Content Safety documentation.