How to customize voice live input and output

Voice live provides multiple options to optimize performance and quality by using custom models. The following customization options are currently available:

  • Speech input customization:
    • Phrase list: A lightweight, just-in-time customization based on a list of words or phrases provided as part of the session configuration to help improve recognition quality. To learn more, see Improve recognition accuracy with phrase list.
    • Custom Speech: With custom speech, you can evaluate and improve the accuracy of speech recognition for your applications and products and fine-tune the recognition quality to your business needs. See What is custom speech? to learn more.
  • Speech output customization:
    • Custom lexicon: Custom lexicon allows you to easily customize pronunciation for both standard Azure text to speech voices and custom voices to improve speech synthesis accuracy for your use case. See custom lexicon for text to speech to learn more.
    • Custom voice: Custom voice lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom voice, you can build a highly natural-sounding voice for your brand or characters by providing human speech samples as fine-tuning data. See What is custom voice? to learn more.
    • Custom avatar: Custom text to speech avatar allows you to create a customized, one-of-a-kind synthetic talking avatar for your application. With custom text to speech avatar, you can build a unique and natural-looking avatar for your product or brand by providing video recording data of your selected actors. See What is custom text to speech avatar? to learn more.

Speech input customization

Phrase list

Use a phrase list for lightweight, just-in-time customization of audio input. To configure a phrase list, set the phrase_list property in the session.update message.

{
    "session": {
        "input_audio_transcription": {
            "model": "azure-speech",
            "phrase_list": ["Neo QLED TV", "TUF Gaming", "AutoQuote Explorer"]
        }
    }
}
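
The session object above is the payload of a session.update client event. As a point of reference, the following minimal sketch shows the full event, assuming the standard session.update envelope (a type field wrapping the session object) used over the Voice live WebSocket connection:

{
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "model": "azure-speech",
            "phrase_list": ["Neo QLED TV", "TUF Gaming", "AutoQuote Explorer"]
        }
    }
}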

Note

Phrase list isn't currently supported with gpt-realtime, gpt-4o-mini-realtime, or phi4-mm-realtime. To learn more about phrase list, see phrase list for speech to text.

Custom speech configuration

You can use the custom_speech field to specify your custom speech models. This field is defined as a dictionary, where each key represents a locale code and each value corresponds to the Model ID of the custom speech model. For more information about custom speech, see What is custom speech?.

Voice live supports using a combination of base models and custom models, as long as each type is unique per locale and no more than 10 languages are specified in total.

The following example shows a session configuration with custom speech models. When the detected language is English, the base model is used; when the detected language is Chinese, the custom speech model is used.

{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en",
      "custom_speech": {
        "zh-CN": "847cb03d-7f22-4b11-444-e1be1d77bf17"
      }
    }
  }
}
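
As an additional illustration of combining base and custom models across locales, the following sketch specifies custom speech models for two locales. The model IDs and the fr-FR entry are hypothetical placeholders; any locale without an entry in custom_speech, such as English here, uses the base model:

{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "language": "en",
      "custom_speech": {
        "zh-CN": "<your zh-CN custom speech model ID>",
        "fr-FR": "<your fr-FR custom speech model ID>"
      }
    }
  }
}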

Note

To use a custom speech model with the Voice live API, the model must be available on the same Azure AI Foundry resource you're using to call the Voice live API. If you trained the model on a different Azure AI Foundry or Azure AI Speech resource, you have to copy it to the resource you're using to call the Voice live API. You pay separately for custom speech training and model hosting. For more information on supported regions, see Speech service supported regions.

Speech output customization

Custom lexicon

Use the custom_lexicon_url string property to customize pronunciation for both standard Azure text to speech voices and custom voices. The custom lexicon format is the same as the one used with Speech Synthesis Markup Language (SSML). To learn more, see custom lexicon for text to speech.

{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8, // optional
    "custom_lexicon_url": "<custom lexicon url>"
  }
}

Azure custom voices

You can use a custom voice for audio output. For information about how to create a custom voice, see What is custom voice?

{
  "voice": {
    "name": "en-US-CustomNeural",
    "type": "azure-custom",
    "endpoint_id": "your-endpoint-id", // a guid string
    "temperature": 0.8 // optional, value range 0.0-1.0, only take effect when using HD voices
  }
}
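
Because custom lexicon works with both standard and custom voices, the two settings can be combined in a single voice configuration. The following minimal sketch assumes a hypothetical endpoint ID and lexicon URL:

{
  "voice": {
    "name": "en-US-CustomNeural",
    "type": "azure-custom",
    "endpoint_id": "your-endpoint-id", // a guid string
    "custom_lexicon_url": "<custom lexicon url>"
  }
}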

Important

Custom voice access is limited based on eligibility and usage criteria. Request access on the intake form.

Note

To use a custom voice model with the Voice live API, the model must be available on the same Azure AI Foundry resource you're using to call the Voice live API. If you trained the model on a different Azure AI Foundry or Azure AI Speech resource, you have to copy it to the resource you're using to call the Voice live API. You pay separately for custom voice training and model hosting. For more information on supported regions, see Speech service supported regions.

Azure custom avatar

Text to speech avatar converts text into a digital video of a photorealistic human (either a standard avatar or a custom text to speech avatar) speaking with a natural-sounding voice.

The configuration for a custom avatar doesn't differ from the configuration of a standard avatar. Refer to How to use the Voice live API - Azure text to speech avatar for a detailed example.

Important

Custom text to speech avatar access is limited based on eligibility and usage criteria. Request access on the intake form.

Note

To use a custom avatar with the Voice live API, the avatar must be available on the same Azure AI Foundry resource you're using to call the Voice live API. If you trained the avatar on a different Azure AI Foundry or Azure AI Speech resource, you have to copy it to the resource you're using to call the Voice live API. You pay separately for custom avatar training and model hosting. For more information on supported regions, see Speech service supported regions.

Note

Custom photo avatar (PREVIEW) training isn't yet available as a self-service option and currently requires a manual offline process.