Azure AI Speech Services
Azure AI Speech Services provides speech-to-text, text-to-speech, speech translation, and speaker recognition capabilities. Build voice-enabled apps, transcribe meetings, create audiobooks, and more.
Available Capabilities
| Service | Description |
|---|---|
| Speech-to-Text | Real-time and batch transcription |
| Text-to-Speech | 400+ neural voices in 140+ languages |
| Speech Translation | Real-time speech-to-speech translation |
| Speaker Recognition | Verify and identify speakers |
| Pronunciation Assessment | Score pronunciation accuracy |
| Custom Voice | Create a branded neural voice |
Prerequisites
You need an Azure subscription and a Speech resource. Create one with the Azure CLI:
# Create a Speech resource
az cognitiveservices account create \
  --name "my-speech" \
  --resource-group "my-rg" \
  --kind "SpeechServices" \
  --sku "S0" \
  --location "eastus"
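After the resource exists, every sample below needs its key and region. A common pattern is to load them from environment variables rather than hardcoding; the variable names here (`SPEECH_KEY`, `SPEECH_REGION`) are illustrative, not something the SDK requires:

```python
import os

# Illustrative variable names; any names work, the SDK only needs the strings.
speech_key = os.environ.get("SPEECH_KEY", "")
speech_region = os.environ.get("SPEECH_REGION", "eastus")

# The samples below can substitute speech_key / speech_region for the literals.
key_configured = bool(speech_key)
```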
Speech-to-Text
Real-Time Transcription (Python)
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="your-key",
    region="eastus"
)
speech_config.speech_recognition_language = "en-US"

# From microphone
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

print("Speak now...")
result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    print(f"Canceled: {result.cancellation_details.reason}")
Continuous Recognition
import threading

stop_event = threading.Event()

def recognized_handler(evt):
    print(f"RECOGNIZED: {evt.result.text}")

def session_stopped_handler(evt):
    print("Session stopped")
    stop_event.set()

recognizer.recognized.connect(recognized_handler)
recognizer.session_stopped.connect(session_stopped_handler)

recognizer.start_continuous_recognition()
stop_event.wait()  # runs until the session stops
recognizer.stop_continuous_recognition()
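The callback-and-stop-event flow can be exercised without audio hardware. A minimal sketch with a hypothetical stub standing in for `SpeechRecognizer` (the stub class and its method names are invented for illustration):

```python
import threading

class FakeRecognizer:
    """Hypothetical stand-in for SpeechRecognizer: fires one result, then stops."""
    def __init__(self):
        self._recognized = []
        self._stopped = []
    def connect_recognized(self, handler):
        self._recognized.append(handler)
    def connect_stopped(self, handler):
        self._stopped.append(handler)
    def run(self):
        for handler in self._recognized:
            handler("hello world")
        for handler in self._stopped:
            handler()

stop_event = threading.Event()
texts = []

fake = FakeRecognizer()
fake.connect_recognized(texts.append)
fake.connect_stopped(stop_event.set)
fake.run()
stop_event.wait()
```

The key design point is the same as in the real SDK: handlers run on the recognizer's thread, so the main thread blocks on the event rather than busy-waiting.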
Batch Transcription
For pre-recorded audio files (meetings, calls, podcasts):
# REST API call to start batch transcription
import requests

url = "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
headers = {"Ocp-Apim-Subscription-Key": "your-key"}
body = {
    "contentUrls": [
        "https://storage.blob.core.windows.net/audio/meeting.wav"
    ],
    "locale": "en-US",
    "displayName": "Team Meeting Transcription",
    "properties": {
        "wordLevelTimestampsEnabled": True,
        "diarizationEnabled": True,  # Separate speakers ("who spoke when")
        "diarization": {"speakers": {"minCount": 2, "maxCount": 6}}
    }
}

response = requests.post(url, headers=headers, json=body)
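The POST returns immediately; the job itself runs asynchronously, and you poll the transcription's URL (returned in the response) until `status` reaches `Succeeded` or `Failed`. A sketch of the polling loop, with the fetch function injected so the shape can be tested without a live service (the helper name and defaults are illustrative):

```python
import time

def wait_for_transcription(fetch_status, poll_seconds=10, max_polls=360):
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status should return the transcription JSON, e.g.
    requests.get(job_url, headers=headers).json().
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job["status"] in ("Succeeded", "Failed"):
            return job
        time.sleep(poll_seconds)
    raise TimeoutError("transcription did not finish in time")
```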
Text-to-Speech
Basic Synthesis (Python)
speech_config = speechsdk.SpeechConfig(
    subscription="your-key", region="eastus"
)

# Choose a neural voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello! Welcome to Azure AI Speech Services.").get()
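By default the synthesizer plays through the default speaker. The result also carries the rendered audio as bytes (`result.audio_data` in the Python SDK), which can be written to disk; a minimal helper (the function name is illustrative):

```python
def save_audio(audio_bytes: bytes, path: str) -> int:
    """Write synthesized audio bytes to a file; returns the byte count."""
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)
```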
SSML for Fine Control
ssml = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
<voice name='en-US-JennyNeural'>
<prosody rate='medium' pitch='+5%'>
Welcome to the <emphasis level='strong'>Azure AI</emphasis> Speech tutorial.
</prosody>
<break time='500ms'/>
<prosody rate='slow'>
Let me walk you through the key features.
</prosody>
</voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
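When SSML is assembled from user-supplied text, XML special characters (`&`, `<`, `>`) must be escaped or synthesis requests will fail. A small builder sketch (the function name and defaults are illustrative):

```python
from xml.sax.saxutils import escape

def build_ssml(text, voice="en-US-JennyNeural", rate="medium"):
    """Wrap plain text in a minimal SSML document for the given voice."""
    return (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>"
        f"<voice name='{voice}'><prosody rate='{rate}'>"
        f"{escape(text)}"
        "</prosody></voice></speak>"
    )
```

The returned string can be passed straight to `speak_ssml_async`.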
Popular Neural Voices
- en-US-JennyNeural — Friendly, conversational
- en-US-GuyNeural — Professional, warm
- en-US-AriaNeural — Cheerful, expressive
- en-US-DavisNeural — Casual, natural
- en-GB-SoniaNeural — British English
Speech Translation
Real-time speech-to-speech translation across 70+ languages:
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="your-key", region="eastus"
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")  # Spanish
translation_config.add_target_language("fr")  # French

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config
)

result = recognizer.recognize_once_async().get()
print(f"English: {result.text}")
print(f"Spanish: {result.translations['es']}")
print(f"French: {result.translations['fr']}")
.NET Integration
using Microsoft.CognitiveServices.Speech;
var speechConfig = SpeechConfig.FromSubscription("your-key", "eastus");
speechConfig.SpeechRecognitionLanguage = "en-US";
using var recognizer = new SpeechRecognizer(speechConfig);
var result = await recognizer.RecognizeOnceAsync();
Console.WriteLine($"Recognized: {result.Text}");
Resources
Video: Search "Azure Speech Services tutorial" on Microsoft Azure YouTube for official demos.

