Azure AI Speech Services: Speech-to-Text & Text-to-Speech


Microsoft Learn · September 25, 2025 · 35 min video

Build voice-enabled applications with Azure AI Speech Services. Learn speech-to-text transcription, text-to-speech synthesis, speech translation, and custom voice models.

Azure AI Speech Services

Azure AI Speech Services provides speech-to-text, text-to-speech, speech translation, and speaker recognition capabilities. Build voice-enabled apps, transcribe meetings, create audiobooks, and more.

Available Capabilities

Service                   Description
Speech-to-Text            Real-time and batch transcription
Text-to-Speech            400+ neural voices in 140+ languages
Speech Translation        Real-time speech-to-speech translation
Speaker Recognition       Verify and identify speakers
Pronunciation Assessment  Score pronunciation accuracy
Custom Voice              Create a branded neural voice

Prerequisites

# Create a Speech resource
az cognitiveservices account create \
  --name "my-speech" \
  --resource-group "my-rg" \
  --kind "SpeechServices" \
  --sku "S0" \
  --location "eastus"
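The examples below pass the key inline for brevity; in practice, keep it out of source control. A minimal sketch that reads it from environment variables (`SPEECH_KEY` and `SPEECH_REGION` are names assumed here, not set by Azure):

```python
import os

def get_speech_credentials():
    """Read the Speech key and region from environment variables."""
    key = os.environ.get("SPEECH_KEY")
    # Fall back to eastus, matching the resource created above
    region = os.environ.get("SPEECH_REGION", "eastus")
    if not key:
        raise RuntimeError("Set the SPEECH_KEY environment variable first")
    return key, region
```

The returned pair can then be passed straight to `speechsdk.SpeechConfig(subscription=key, region=region)`.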

Speech-to-Text

Real-Time Transcription (Python)

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="your-key",
    region="eastus"
)
speech_config.speech_recognition_language = "en-US"

# From microphone
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

print("Speak now...")
result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    # Surfaces bad keys, network errors, and other failures
    print(f"Canceled: {result.cancellation_details.reason}")

Continuous Recognition

import threading

stop_event = threading.Event()

def recognized_handler(evt):
    print(f"RECOGNIZED: {evt.result.text}")

def session_stopped_handler(evt):
    print("Session stopped")
    stop_event.set()

recognizer.recognized.connect(recognized_handler)
recognizer.session_stopped.connect(session_stopped_handler)

recognizer.start_continuous_recognition()
stop_event.wait()  # block until the session ends
recognizer.stop_continuous_recognition()

Batch Transcription

For pre-recorded audio files (meetings, calls, podcasts):

# REST API call to start batch transcription
import requests

url = "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
headers = {"Ocp-Apim-Subscription-Key": "your-key"}
body = {
    "contentUrls": [
        "https://storage.blob.core.windows.net/audio/meeting.wav"
    ],
    "locale": "en-US",
    "displayName": "Team Meeting Transcription",
    "properties": {
        "wordLevelTimestampsEnabled": True,
        "diarizationEnabled": True,  # Speaker identification
        "speakers": {"minCount": 2, "maxCount": 6}
    }
}

response = requests.post(url, headers=headers, json=body)
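The POST returns immediately with a link to the new job; the transcript is ready only once the job reaches a terminal status. A minimal polling sketch using only the standard library (`wait_for_transcription` is a hypothetical helper; `transcription_url` is the job's `self` link from the creation response):

```python
import json
import time
import urllib.request

# Batch transcription jobs end in one of these statuses
TERMINAL_STATUSES = {"Succeeded", "Failed"}

def wait_for_transcription(transcription_url, key, interval=30):
    """Poll a batch transcription job until it finishes."""
    request = urllib.request.Request(
        transcription_url,
        headers={"Ocp-Apim-Subscription-Key": key},
    )
    while True:
        with urllib.request.urlopen(request) as resp:
            status = json.load(resp)["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval)  # jobs can take minutes for long audio
```

Once the status is `Succeeded`, the result files are listed under the job's `files` link.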

Text-to-Speech

Basic Synthesis (Python)

speech_config = speechsdk.SpeechConfig(
    subscription="your-key", region="eastus"
)

# Choose a neural voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello! Welcome to Azure AI Speech Services.").get()
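By default the synthesizer plays through the speaker. To write a WAV file instead, audio can be routed through `AudioOutputConfig`; a sketch (`synthesize_to_file` is a name chosen here, and the SDK import is deferred so the helper can be defined even where the package is not installed):

```python
def synthesize_to_file(text, filename, key, region, voice="en-US-JennyNeural"):
    """Synthesize `text` into a WAV file instead of the default speaker."""
    import azure.cognitiveservices.speech as speechsdk  # deferred SDK import

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_voice_name = voice
    # Route synthesized audio to a file on disk rather than the speaker
    audio_config = speechsdk.audio.AudioOutputConfig(filename=filename)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )
    return synthesizer.speak_text_async(text).get()
```

For example, `synthesize_to_file("Chapter one.", "chapter1.wav", key, "eastus")` would produce a WAV file suitable for an audiobook pipeline.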

SSML for Fine Control

ssml = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    <prosody rate='medium' pitch='+5%'>
      Welcome to the <emphasis level='strong'>Azure AI</emphasis> Speech tutorial.
    </prosody>
    <break time='500ms'/>
    <prosody rate='slow'>
      Let me walk you through the key features.
    </prosody>
  </voice>
</speak>
"""

result = synthesizer.speak_ssml_async(ssml).get()
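Hand-writing SSML strings becomes error-prone once user-supplied text is involved, since any `&` or `<` must be XML-escaped. A small hypothetical helper that wraps escaped text in the same envelope as above:

```python
from xml.sax.saxutils import escape

def build_ssml(text, voice="en-US-JennyNeural", rate="medium", pitch="+0%"):
    """Wrap plain text in a minimal SSML envelope, XML-escaping the text."""
    return (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>"
        f"<voice name='{voice}'>"
        f"<prosody rate='{rate}' pitch='{pitch}'>{escape(text)}</prosody>"
        "</voice></speak>"
    )
```

The result can be passed directly to `synthesizer.speak_ssml_async(...)`.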

Popular Neural Voices

  • en-US-JennyNeural — Friendly, conversational
  • en-US-GuyNeural — Professional, warm
  • en-US-AriaNeural — Cheerful, expressive
  • en-US-DavisNeural — Casual, natural
  • en-GB-SoniaNeural — British English

Speech Translation

Real-time speech-to-speech translation across 70+ languages:

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="your-key", region="eastus"
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")  # Spanish
translation_config.add_target_language("fr")  # French

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config
)

result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"English: {result.text}")
    print(f"Spanish: {result.translations['es']}")
    print(f"French: {result.translations['fr']}")

.NET Integration

using Microsoft.CognitiveServices.Speech;

var speechConfig = SpeechConfig.FromSubscription("your-key", "eastus");
speechConfig.SpeechRecognitionLanguage = "en-US";

using var recognizer = new SpeechRecognizer(speechConfig);
var result = await recognizer.RecognizeOnceAsync();

Console.WriteLine($"Recognized: {result.Text}");

Resources

Video: Search "Azure Speech Services tutorial" on Microsoft Azure YouTube for official demos.

Tags: Speech Services, Speech-to-Text, Text-to-Speech, Azure AI, Voice

Chapters (5)

  1. Speech Services Overview: Available capabilities and use cases
  2. Speech-to-Text: Real-time and batch transcription with custom models
  3. Text-to-Speech: Neural voices, SSML, and custom neural voice
  4. Speech Translation: Real-time translation across languages
  5. Integration Patterns: Build with SDKs for .NET, Python, and JavaScript

About the Author

Microsoft Learn

Microsoft MVP | AI Engineer

Software & AI Engineer specializing in Microsoft Azure, .NET, and cutting-edge AI technologies.
