Audio/Speech & Transcription
Convert text to speech, transcribe audio to text, and translate spoken audio to English — all through a unified BIF interface across multiple AI providers.
BoxLang AI provides three first-class audio operations through a unified BIF interface: Text-to-Speech (aiSpeak), Speech-to-Text (aiTranscribe), and Audio Translation (aiTranslate). Each BIF works across all supported providers with consistent parameters and return values, so you can switch providers without rewriting application code.
🏗️ Architecture
📊 Provider Support Matrix
| Provider | TTS (aiSpeak) | STT (aiTranscribe) | Translation (aiTranslate) | Env Var |
| --- | --- | --- | --- | --- |
| OpenAI | ✅ tts-1 | ✅ whisper-1 | ✅ | OPENAI_API_KEY |
| Mistral | ✅ voxtral-mini-tts-2603 | ✅ voxtral-mini-latest | ❌ | MISTRAL_API_KEY |
| Groq / Whisper | ❌ | ✅ whisper-large-v3 | ✅ | GROQ_API_KEY |
| Grok / xAI | ✅ custom | ❌ | ❌ | GROK_API_KEY |
| Gemini | ✅ gemini-2.5-flash-preview-tts | ✅ gemini-2.5-flash | ❌ | GEMINI_API_KEY |
| ElevenLabs | ✅ eleven_multilingual_v2 | ✅ scribe_v1 | ❌ | ELEVENLABS_API_KEY |
⚡ Quick Start
🗣️ Text-to-Speech
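A minimal sketch of a text-to-speech call using the traditional `aiSpeak( text, params, options )` form. The `voice` option is mentioned in the module configuration; the `provider` option key and the `"openai"` / `"alloy"` values are assumptions used for illustration.

```boxlang
// Synthesize speech with the module defaults
speech = aiSpeak( "Welcome to BoxLang AI." )

// Same call with explicit options.
// ASSUMPTION: `provider` is an accepted option key; "alloy" is an example voice name.
speech = aiSpeak(
    "Welcome to BoxLang AI.",
    {},
    { provider: "openai", voice: "alloy" }
)
```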
🎙️ Speech-to-Text
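A minimal sketch of a transcription call. It assumes `aiTranscribe()` takes the audio input as its first argument, mirroring `aiSpeak()`; the file path is a placeholder.

```boxlang
// ASSUMPTION: the first argument is the audio input (here, a local file path)
transcript = aiTranscribe( "/path/to/meeting.mp3" )
```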
🌐 Audio Translation
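A minimal sketch of translating non-English speech to English text. As with the transcription sketch, the positional audio argument is an assumption and the file path is a placeholder.

```boxlang
// ASSUMPTION: the first argument is the audio input; the result is English text
english = aiTranslate( "/path/to/interview-es.mp3" )
```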
Note: aiTranslate always outputs English text. It is speech-to-English transcription, not general text-to-text translation.
🧱 Fluent Builder API (v3.2.0+)
All three audio BIFs now support a fluent builder API. Calling any of them with no arguments returns the request object for method chaining.
aiSpeak() Builder
| Method | Description |
| --- | --- |
| of( text ) / .text( text ) | Set the text to synthesize |
| .model( name ) | Set the TTS model |
| .provider( name ) | Set the provider |
| .apiKey( key ) | Set the API key |
| .voice( name ) | Set the voice |
| .male() / .female() | Gender shortcut (resolved per provider) |
| .speed( n ) | Set playback speed |
| .instructions( text ) | Set voice instructions |
| .outputFile( path ) | Set output file path |
| .asMP3() / .asWav() / .asFlac() / .asOpus() / .asPCM() | Format shortcuts |
| .withParams( struct ) | Set provider params |
| .withOptions( struct ) | Set module options |
| .withLogging() | Enable request/response logging |
| .speak() | Execute and return AiSpeechResponse |
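The builder methods listed above can be chained into a single call. This is a sketch, not a definitive usage: the provider identifier `"openai"`, the voice `"alloy"`, and the output path are illustrative assumptions, while `tts-1` comes from the provider support matrix.

```boxlang
// Calling aiSpeak() with no arguments returns the request object for chaining
speech = aiSpeak()
    .text( "Your order has shipped." )
    .provider( "openai" )                    // assumed provider identifier
    .model( "tts-1" )
    .voice( "alloy" )                        // example voice name
    .speed( 1.0 )
    .asMP3()
    .outputFile( "/tmp/order-shipped.mp3" )  // placeholder path
    .speak()                                 // executes and returns an AiSpeechResponse
```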
aiTranscribe() Builder
| Method | Description |
| --- | --- |
| of( audio ) | Static factory: set audio input |
| .file( path ) | Set audio file path |
| .url( url ) | Set audio URL |
| .data( binary ) | Set raw binary audio data |
| .model( name ) | Set the STT model |
| .provider( name ) | Set the provider |
| .apiKey( key ) | Set the API key |
| .language( code ) | Set input audio language (BCP-47) |
| .inputFormat( fmt ) | Set input audio format |
| .withWordTimestamps() | Enable word-level timestamps |
| .withSegmentTimestamps() | Enable segment-level timestamps |
| .withTimestamps() | Enable all timestamps |
| .diarize( bool ) | Enable speaker diarization |
| .asJSON() / .asText() / .asVerboseJSON() / .asSRT() / .asVTT() | Output format shortcuts |
| .withParams( struct ) | Set provider params |
| .withOptions( struct ) | Set module options |
| .withLogging() | Enable request/response logging |
| .transcribe() | Execute transcription |
| .translate() | Execute translation (audio → English) |
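A sketch of a chained transcription request built from the methods above. The provider identifier and file path are assumptions for illustration; `whisper-1` comes from the provider support matrix.

```boxlang
// Calling aiTranscribe() with no arguments returns the request object for chaining
result = aiTranscribe()
    .file( "/path/to/meeting.mp3" )   // placeholder path
    .provider( "openai" )             // assumed provider identifier
    .model( "whisper-1" )
    .language( "en" )                 // BCP-47 code of the spoken language
    .withSegmentTimestamps()
    .asVerboseJSON()
    .transcribe()
```

Whether a given provider honors timestamps or a particular output format is not stated here, so treat those calls as optional.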
aiTranslate() Builder
The aiTranslate() builder shares the same methods as aiTranscribe(). The .translate() terminator executes the translation operation.
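A short sketch of the translation builder, assuming a Spanish-language recording at a placeholder path and the provider identifier `"openai"`:

```boxlang
// Same chainable methods as aiTranscribe(); .translate() is the terminator
english = aiTranslate()
    .file( "/path/to/interview-es.mp3" )  // placeholder path
    .provider( "openai" )                 // assumed provider identifier
    .translate()                          // output is always English text
```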
💡 Backward Compatible: The traditional aiSpeak( text, params, options ) syntax continues to work unchanged. The fluent builder is an additional option; no migration is required.
⚙️ Module Configuration
Configure global audio defaults in boxlang.json to avoid repeating options on every call:
| Setting | Description |
| --- | --- |
| defaultVoice | Default voice name/ID for aiSpeak when no voice option is provided |
| defaultOutputFormat | Default audio output format: mp3, wav, flac, opus, pcm |
| defaultSpeechModel | Default TTS model (leave blank to use the provider's default) |
| defaultTranscriptionModel | Default model for aiTranscribe and aiTranslate calls |
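The settings above might be laid out in boxlang.json roughly as follows. This is a sketch under assumptions: the `modules.bxai.settings` path and the placement of the four keys directly under `settings` (rather than inside a dedicated audio block) are not confirmed by this page, and the values are examples only.

```json
{
  "modules": {
    "bxai": {
      "settings": {
        "defaultVoice": "alloy",
        "defaultOutputFormat": "mp3",
        "defaultSpeechModel": "",
        "defaultTranscriptionModel": "whisper-1"
      }
    }
  }
}
```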
📡 Audio Events
The module fires interception points around every audio operation, giving you full observability and extensibility.
| Event | Fired |
| --- | --- |
| beforeAISpeech | Before sending a TTS request to the provider |
| afterAISpeech | After receiving TTS audio from the provider |
| beforeAITranscription | Before sending a transcription request to the provider |
| afterAITranscription | After receiving a transcription response from the provider |
| beforeAITranslation | Before sending an audio translation request to the provider |
| afterAITranslation | After receiving an audio translation response from the provider |
Register interceptors in your application or script using BoxRegisterInterceptor():
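A minimal sketch of registering a listener for one of the events above. It assumes `BoxRegisterInterceptor()` accepts a closure followed by the interception point name; the contents of the event data struct are not documented on this page, so the listener only prints its keys.

```boxlang
// ASSUMPTION: closure-plus-state-name registration form.
// The shape of `data` is unknown here, so only its top-level keys are printed.
boxRegisterInterceptor(
    ( data ) => {
        println( "afterAITranscription fired with keys: " & structKeyList( data ) )
    },
    "afterAITranscription"
)
```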
📖 In This Section
- Text-to-Speech: Convert text to audio with aiSpeak() (traditional + fluent builder)
- Speech-to-Text: Transcribe audio files with aiTranscribe() (traditional + fluent builder)
- Audio Translation: Translate spoken audio to English with aiTranslate() (traditional + fluent builder)