LLM Speech for speech transcription and translation (preview)

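Note

This feature is currently in public preview. The preview is provided without a service-level agreement and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see "Supplemental Terms of Use for Microsoft Azure Previews".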

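LLM Speech is powered by a speech model enhanced with a large language model. It delivers improved quality, deep contextual understanding, multilingual support, and prompt-tuning capabilities. It uses GPU acceleration for ultra-fast inference, which suits it to scenarios such as generating captions and subtitles from audio files, summarizing meeting notes, assisting call center agents, and transcribing voicemails.

The LLM Speech API currently supports the following speech tasks: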

transcribe
translate

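Prerequisites

An Azure AI Speech resource in one of the regions where the LLM Speech API is available. For the current list of supported regions, see "Speech service regions".
An audio file (less than 2 hours long and less than 300 MB in size) in one of the formats and codecs supported by the batch transcription API: WAV, MP3, OPUS/OGG, FLAC, WMA, AAC, ALAW in WAV container, MULAW in WAV container, AMR, WebM, and SPEEX. For more information about supported audio formats, see the supported audio formats section.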
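
The size and format prerequisites can be checked locally before uploading. The following is a minimal sketch, not part of the service or its SDK: the function name check_audio_file and the extension allow-list are hypothetical (the service lists formats and codecs, not file extensions), and "300 MB" is assumed to mean 300 * 1024 * 1024 bytes. It does not check the 2-hour duration limit, which requires decoding the audio.

```python
from pathlib import Path
from typing import List

# Assumption: "300 MB" is interpreted as 300 MiB; adjust if the service counts differently.
MAX_BYTES = 300 * 1024 * 1024
# Hypothetical mapping from the listed formats to common file extensions.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".opus", ".ogg", ".flac",
                      ".wma", ".aac", ".amr", ".webm", ".spx"}


def check_audio_file(path: str) -> List[str]:
    """Return a list of problems found; an empty list means the local checks passed."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found"]
    problems = []
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append(f"{p.name}: extension {p.suffix!r} is not in the assumed allow-list")
    size = p.stat().st_size
    if size >= MAX_BYTES:
        problems.append(f"{p.name}: size {size} bytes is not under 300 MB")
    return problems
```

An empty result only means the file passed these two local checks; the service remains the authority on whether the codec inside the container is accepted.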

Use the LLM Speech API

Supported languages

The following languages are currently supported for both the transcribe and translate tasks:

English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese, and Korean.

Upload audio

You can provide audio data in the following ways:

Pass inline audio data.

  --form 'audio=@"YourAudioFile"'

Upload an audio file from a public audioUrl.

  --form 'definition={"audioUrl": "https://crbn.us/hello.wav"}'

The following sections use inline audio upload as the example.

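Call the LLM Speech API

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency.

Replace YourSpeechResourceKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.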

Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.
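
The header swap described above can be expressed as a small helper. This is a sketch only: build_auth_headers is a hypothetical name, and acquiring the Microsoft Entra ID access token is outside its scope.

```python
from typing import Dict, Optional


def build_auth_headers(resource_key: Optional[str] = None,
                       access_token: Optional[str] = None) -> Dict[str, str]:
    """Return the single authentication header for a request.

    Exactly one credential must be supplied: a Speech resource key, or an
    access token for keyless authentication.
    """
    if (resource_key is None) == (access_token is None):
        raise ValueError("Provide exactly one of resource_key or access_token")
    if access_token is not None:
        # Keyless authentication: bearer token in the Authorization header.
        return {"Authorization": f"Bearer {access_token}"}
    # Key-based authentication: resource key in the subscription key header.
    return {"Ocp-Apim-Subscription-Key": resource_key}
```

Rejecting the case where both credentials are passed keeps a request from silently using the key when a token was intended.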

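Transcribe audio with LLM Speech

You can transcribe audio in its input language without specifying a locale code. The model automatically detects and selects the appropriate language based on the audio content.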

curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
  "enhancedMode": {
    "enabled": true,
    "task": "transcribe"
  }
}'
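
For readers calling the API from code rather than curl, the same request can be assembled with the Python standard library. The following is a sketch under stated assumptions: build_transcribe_request is a hypothetical helper, the multipart body is hand-encoded to avoid third-party dependencies, and the audio part's application/octet-stream content type is an assumption. The code builds the request without sending it.

```python
import json
import urllib.request
import uuid

API_VERSION = "2025-10-15"  # version used in the curl example


def build_transcribe_request(region, resource_key, audio_bytes, filename, definition):
    """Build (but do not send) the multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # "audio" part: the raw bytes of the audio file.
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes,
        b"\r\n",
        # "definition" part: the JSON request options.
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="definition"\r\n\r\n',
        json.dumps(definition).encode(),
        b"\r\n",
        # Closing boundary.
        f"--{boundary}--\r\n".encode(),
    ])
    url = (f"https://{region}.api.cognitive.microsoft.com"
           f"/speechtotext/transcriptions:transcribe?api-version={API_VERSION}")
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Ocp-Apim-Subscription-Key": resource_key,
    }
    return urllib.request.Request(url, data=body, method="POST", headers=headers)


request = build_transcribe_request(
    region="YourServiceRegion",
    resource_key="YourSpeechResourceKey",
    audio_bytes=b"placeholder",  # in practice: open("YourAudioFile.wav", "rb").read()
    filename="YourAudioFile.wav",
    definition={"enhancedMode": {"enabled": True, "task": "transcribe"}},
)
```

With real values in place, urllib.request.urlopen(request) sends the request and json.load() on the result parses the response body.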

Translate an audio file with LLM Speech

You can translate audio into a specified target language. To enable translation, you must specify the target language code in the request.

curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
  "enhancedMode": {
    "enabled": true,
    "task": "translate",
    "targetLanguage": "ko"
  }
}'
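
The transcribe and translate definitions differ only in the task value and the required target language. The following sketch builds either one and enforces that requirement on the client side; build_enhanced_mode is a hypothetical helper, not part of any SDK.

```python
from typing import Any, Dict, Optional

SUPPORTED_TASKS = {"transcribe", "translate"}


def build_enhanced_mode(task: str, target_language: Optional[str] = None) -> Dict[str, Any]:
    """Return the 'definition' payload for a transcribe or translate request."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported task: {task!r}")
    if task == "translate" and not target_language:
        raise ValueError("The translate task requires a target language code")
    mode: Dict[str, Any] = {"enabled": True, "task": task}
    if task == "translate":
        # Only translate requests carry a target language.
        mode["targetLanguage"] = target_language
    return {"enhancedMode": mode}
```

For example, build_enhanced_mode("translate", "ko") produces the same definition as the curl request above, and build_enhanced_mode("translate") fails before any request is sent.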

Use prompt tuning to alter performance

You can provide optional text to guide the output style of the transcribe or translate task.

curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
  "enhancedMode": {
    "enabled": true,
    "task": "transcribe",
    "prompt": ["Output must be in lexical format."]
  }
}'

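Here are some best practices for prompts:

The maximum prompt length is 4,096 characters.
Prompts should preferably be written in English.
Prompts can guide output formatting. By default, responses use a display format optimized for readability. To enforce lexical formatting, add: Output must be in lexical format.
Prompts can amplify the salience of specific phrases or acronyms, which improves the likelihood that they are recognized. Use: Pay attention to *phrase1*, *phrase2*, …. For best results, limit the number of phrases per prompt.
Prompts that aren't related to speech tasks (for example, Tell me a story.) are typically ignored.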
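
These practices can be combined in a small prompt builder. This is a sketch under assumptions: build_prompt is a hypothetical helper, the instructions are joined into a single string to match the one-element prompt array shown in the request example, and the 4,096-character limit is assumed to apply to that combined text.

```python
from typing import List

MAX_PROMPT_CHARS = 4096  # limit stated in the best practices
LEXICAL_INSTRUCTION = "Output must be in lexical format."


def build_prompt(phrases: List[str], lexical: bool = False) -> List[str]:
    """Compose the value of the 'prompt' property from the documented patterns."""
    parts = []
    if lexical:
        parts.append(LEXICAL_INSTRUCTION)
    if phrases:
        # Phrase emphasis follows the "Pay attention to ..." pattern.
        parts.append("Pay attention to " + ", ".join(phrases) + ".")
    if not parts:
        return []
    text = " ".join(parts)
    if len(text) > MAX_PROMPT_CHARS:
        raise ValueError(f"Prompt is {len(text)} characters; the maximum is {MAX_PROMPT_CHARS}")
    return [text]
```

The result can be assigned to the prompt property of enhancedMode. Keeping the phrase list short follows the advice to limit the number of phrases per prompt.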

Additional configuration options

You can combine additional fast transcription configuration options with LLM Speech to enable enhanced features such as diarization, profanityFilterMode, and channels.

curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
  "enhancedMode": {
    "enabled": true,
    "task": "transcribe",
    "prompt": ["Output must be in lexical format."]
  },
  "diarization": {
    "maxSpeakers": 2,
    "enabled": true
  },
  "profanityFilterMode": "Masked"
}'

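Some configuration options, such as locales and phraseLists, are either not required or not applicable to LLM Speech and can be omitted from the request. To learn more, see the configuration options for fast transcription.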
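
If you already have a fast transcription definition, one way to reuse it is to drop the options named above and add enhancedMode. The following is a sketch: to_llm_speech_definition is a hypothetical helper, and it only handles the transcribe task.

```python
import copy
from typing import Any, Dict

# Options the text above names as not required or not applicable to LLM Speech.
OMITTED_OPTIONS = ("locales", "phraseLists")


def to_llm_speech_definition(fast_definition: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a fast transcription definition adapted for LLM Speech."""
    definition = copy.deepcopy(fast_definition)  # leave the caller's dict untouched
    for option in OMITTED_OPTIONS:
        definition.pop(option, None)
    definition["enhancedMode"] = {"enabled": True, "task": "transcribe"}
    return definition


existing = {
    "locales": ["en-US"],
    "diarization": {"maxSpeakers": 2, "enabled": True},
    "profanityFilterMode": "Masked",
}
adapted = to_llm_speech_definition(existing)
```

Here adapted keeps diarization and profanityFilterMode, gains enhancedMode, and no longer contains locales.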

Sample response

In the JSON response, the combinedPhrases property contains the full transcribed or translated text, and the phrases property contains segment-level and word-level details.

{
    "durationMilliseconds": 57187,
    "combinedPhrases": [
        {
            "text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products 现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。 Quand vous effectuez une demande de reconnaissance vocale, le modèle de base le plus récent pour chaque langue prise en charge est utilisé par défaut. Le modèle de base fonctionne très bien dans la plupart des scénarios de reconnaissance vocale. A custom model can be used to augment the base model to improve recognition of domain specific vocabulary specified to the application by providing text data to train the model. It can also be used to improve recognition based for the specific audio conditions of the application by providing audio data with reference transcriptions."
        }
    ],
    "phrases": [
        {
            "offsetMilliseconds": 80,
            "durationMilliseconds": 6960,
            "text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products.",
            "words": [
                {
                    "text": "with",
                    "offsetMilliseconds": 80,
                    "durationMilliseconds": 160
                },
                {
                    "text": "custom",
                    "offsetMilliseconds": 240,
                    "durationMilliseconds": 480
                },

                {
                    "text": "speech",
                    "offsetMilliseconds": 720,
                    "durationMilliseconds": 360
                },
		// More transcription results...
	    // Redacted for brevity
            ],
            "locale": "en-us",
            "confidence": 0
        },
        {
            "offsetMilliseconds": 8000,
            "durationMilliseconds": 8600,
            "text": "现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。",
            "words": [
                {
                    "text": "现",
                    "offsetMilliseconds": 8000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "成",
                    "offsetMilliseconds": 8040,
                    "durationMilliseconds": 40
                },
		// More transcription results...
	    // Redacted for brevity
                {
                    "text": "训",
                    "offsetMilliseconds": 16400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "练",
                    "offsetMilliseconds": 16560,
                    "durationMilliseconds": 40
                }
            ],
            "locale": "zh-cn",
            "confidence": 0
		// More transcription results...
	    // Redacted for brevity
                {
                    "text": "with",
                    "offsetMilliseconds": 54720,
                    "durationMilliseconds": 200
                },
                {
                    "text": "reference",
                    "offsetMilliseconds": 54920,
                    "durationMilliseconds": 360
                },
                {
                    "text": "transcriptions.",
                    "offsetMilliseconds": 55280,
                    "durationMilliseconds": 1200
                }
            ],
            "locale": "en-us",
            "confidence": 0
        }
    ]
}
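
Once the response is parsed as JSON, the two properties can be read as follows. This is a minimal sketch: the helper names full_text and segments are hypothetical, and the field names are taken from the sample response above.

```python
from typing import Any, Dict, List, Tuple


def full_text(response: Dict[str, Any]) -> str:
    """Join the text of every entry in combinedPhrases."""
    return " ".join(item["text"] for item in response.get("combinedPhrases", []))


def segments(response: Dict[str, Any]) -> List[Tuple[float, float, str, str]]:
    """Return (start_seconds, end_seconds, locale, text) for each phrase."""
    result = []
    for phrase in response.get("phrases", []):
        offset = phrase["offsetMilliseconds"]
        duration = phrase["durationMilliseconds"]
        result.append((offset / 1000, (offset + duration) / 1000,
                       phrase.get("locale", ""), phrase["text"]))
    return result


sample = {
    "combinedPhrases": [{"text": "With custom speech"}],
    "phrases": [{
        "offsetMilliseconds": 80,
        "durationMilliseconds": 6960,
        "text": "With custom speech",
        "locale": "en-us",
        "confidence": 0,
    }],
}
```

The segment tuples carry the phrase-level timing needed for captions or subtitles, and the per-phrase locale shows where the spoken language changes in multilingual audio.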

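The response format is consistent with other existing speech to text outputs, such as fast transcription and batch transcription. The key differences are:

Word-level durationMilliseconds and offsetMilliseconds aren't supported for the translate task.
diarization isn't supported for the translate task. Only the speaker1 label is returned.
confidence isn't available. It is always 0.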

Note

The Speech service is an elastic service. If you receive a 429 error code (too many requests), follow the best practices to mitigate throttling during autoscaling.
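
One common client-side way to handle 429 responses is to retry with exponential backoff and jitter. The following sketch illustrates that general technique; it is an assumption, not the service's documented best practice, and the names backoff_delays and send_with_retry are hypothetical.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Yield one randomized delay (seconds) per retry, growing exponentially up to cap."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))


def send_with_retry(request, opener=urllib.request.urlopen, sleep=time.sleep, max_retries=5):
    """Send the request, retrying up to max_retries times on HTTP 429 only."""
    delays = backoff_delays(max_retries)
    while True:
        try:
            return opener(request)
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # other errors are not retried
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; surface the 429 to the caller
            sleep(delay)
```

The opener and sleep parameters exist so the retry logic can be exercised without network access or real waiting. Ramping request volume up gradually also gives the service time to scale.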


Last updated on 2025-11-05