Speech-to-Text

Transcribe audio files, URLs, or binary data into text using AI providers. aiTranscribe() auto-detects the input type and can optionally return rich metadata such as word-level timestamps and segments.

aiTranscribe() converts audio — from a local file path, a public URL, or raw binary data — into text. With basic usage you get a plain text string. With returnFormat: "response" you get a full AiTranscriptionResponse object containing timestamps, language detection, segments, word-level alignment, and more.

🔧 The aiTranscribe() Function

Syntax

aiTranscribe( audio, params={}, options={} )

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| audio | string or binary | ✅ Yes | File path, URL, or raw binary audio data |
| params | struct | No | Provider API parameters (model, language, temperature, etc.) |
| options | struct | No | Module-level options (provider, returnFormat, responseFormat, etc.) |

Options

| Option | Type | Default | Description |
|---|---|---|---|
| provider | string | (config) | AI provider: openai, groq |
| apiKey | string | (env var) | Provider API key |
| returnFormat | string | "text" | "text" returns a plain string; "response" returns an AiTranscriptionResponse |
| language | string | "" | Language of the input audio in BCP-47 format (e.g. en, es, fr). Optional, but improves accuracy |
| responseFormat | string | "json" | Provider output format: json, text, verbose_json, srt, vtt |
| timestamps | array | [] | Timestamp granularities: ["segment"], ["word"], or ["segment", "word"] |
| diarize | boolean | false | Enable speaker diarization (Groq only) |
| timeout | numeric | 30 | HTTP timeout in seconds |
| logRequest | boolean | false | Log requests to the module log file |
| logRequestToConsole | boolean | false | Print the request payload to the console |
| logResponse | boolean | false | Log responses to the module log file |
| logResponseToConsole | boolean | false | Print the raw provider response to the console |

Audio Input Detection

aiTranscribe() automatically detects the audio input type:

| Input | Detection Method |
|---|---|
| File path string | String ending with an audio extension (.mp3, .wav, .m4a, .webm, .ogg, .flac, etc.) |
| URL string | String beginning with http:// or https:// |
| Binary data | BoxLang binary / Java byte[] value |
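The detection rules above can be approximated in plain BoxLang as follows. This is an illustrative sketch only, not the module's actual implementation. The function name detectAudioInputType is hypothetical. The extension list covers only the formats named in the table, and the module may accept more. The URL check runs before the extension check, so a URL ending in .mp3 is not mistaken for a file path.

```boxlang
// Hypothetical helper approximating the documented detection rules.
// Not the module's real implementation.
function detectAudioInputType( required any audio ) {
    // BoxLang binary / Java byte[] values
    if ( isBinary( arguments.audio ) ) {
        return "binary";
    }
    // URLs are checked first, because a URL may also end in an audio extension
    if ( reFindNoCase( "^https?://", arguments.audio ) ) {
        return "url";
    }
    // Only the extensions listed in the table; the module may support others
    if ( reFindNoCase( "\.(mp3|wav|m4a|webm|ogg|flac)$", arguments.audio ) ) {
        return "file";
    }
    return "unknown";
}

println( detectAudioInputType( "/recordings/meeting.mp3" ) );         // file
println( detectAudioInputType( "https://example.com/podcast.mp3" ) ); // url
println( detectAudioInputType( charsetDecode( "abc", "utf-8" ) ) );   // binary
```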

📦 Return Value — AiTranscriptionResponse

By default (returnFormat: "text"), aiTranscribe() returns a plain string containing the transcribed text.

With returnFormat: "response", it returns an AiTranscriptionResponse object.

AiTranscriptionResponse Properties

| Property | Type | Description |
|---|---|---|
| text | string | Transcribed text |
| segments | array | Array of segment structs with start/end timestamps and text |
| words | array | Array of word structs with start/end timestamps |
| language | string | Detected or specified language code |
| duration | numeric | Total audio duration in seconds |
| model | string | Model used for transcription |
| provider | string | Provider name |
| metadata | struct | Raw provider response metadata |
| timestamp | datetime | When the transcription was created |

AiTranscriptionResponse Methods

| Method | Returns | Description |
|---|---|---|
| getText() | string | Returns the transcribed text |
| hasText() | boolean | Returns true if the transcribed text is non-empty |
| getWordCount() | numeric | Count of words in the transcription |
| getFormattedDuration() | string | Human-readable duration, e.g. "1:23" |
| hasSegments() | boolean | Returns true if segment data is available |
| hasWords() | boolean | Returns true if word-level timestamp data is available |
| getSegments() | array | Returns the array of segment structs |
| getWords() | array | Returns the array of word structs |
| toStruct() | struct | Returns a full struct representation |
| toJSON() | string | Returns the JSON-serialized response |
| toString() | string | Alias for getText() |
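getFormattedDuration() presents the numeric duration property (in seconds) as minutes:seconds. The conversion works roughly as in the sketch below. The name formatDuration is hypothetical. This section does not document how the module handles fractional seconds or durations longer than an hour, so this version simply truncates to whole seconds.

```boxlang
// Hypothetical helper mirroring the "1:23" style of getFormattedDuration().
// Truncates fractional seconds; the module's exact rounding may differ.
function formatDuration( required numeric seconds ) {
    var total   = int( arguments.seconds );
    var minutes = int( total / 60 );
    var secs    = total % 60;
    // numberFormat with mask "00" zero-pads single-digit seconds
    return minutes & ":" & numberFormat( secs, "00" );
}

println( formatDuration( 83 ) );
println( formatDuration( 65.7 ) );
```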

🎼 Output Formats

Set responseFormat in options to request a different provider-level output format:

| Format | Description |
|---|---|
| json | Default JSON with text field (minimal) |
| text | Plain text only; fastest, no metadata |
| verbose_json | Full JSON with segments, words, timestamps, language, duration |
| srt | SubRip subtitle format for video captioning |
| vtt | WebVTT subtitle format for HTML5 <track> elements |

Tip: When using returnFormat: "response", always pair it with responseFormat: "verbose_json" so word/segment data is populated.

💡 Examples

Basic — transcribe and get plain text
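A minimal sketch. It assumes the module is installed and an OpenAI API key is available through module configuration or an environment variable. The file path is a placeholder.

```boxlang
// Placeholder path; any supported audio file works
text = aiTranscribe( "/recordings/meeting.mp3" );

// With the default returnFormat ("text") the result is a plain string
println( text );
```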

Full response object
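This sketch requests an AiTranscriptionResponse and uses only the methods listed in the methods table. It pairs returnFormat: "response" with responseFormat: "verbose_json" so that segment data is populated. The path is a placeholder. The segment key names start, end, and text are assumed from the property description and are not confirmed here.

```boxlang
// Full response object; verbose_json populates segments and words
result = aiTranscribe(
    "/recordings/meeting.mp3",
    {},
    { returnFormat: "response", responseFormat: "verbose_json" }
);

println( "Text: " & result.getText() );
println( "Duration: " & result.getFormattedDuration() );
println( "Words: " & result.getWordCount() );

if ( result.hasSegments() ) {
    for ( segment in result.getSegments() ) {
        // Assumed keys: start, end, text (bracket syntax used for "end")
        println( "[" & segment.start & " - " & segment[ "end" ] & "] " & segment.text );
    }
}
```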

Word-level timestamps
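This sketch enables word-level timestamps through the timestamps option and reads them back with hasWords() and getWords(). The per-word keys (word, start) are assumed to follow the OpenAI verbose_json shape. This section does not specify them, so check toStruct() output for the actual names.

```boxlang
result = aiTranscribe(
    "/recordings/meeting.mp3",
    {},
    {
        returnFormat  : "response",
        responseFormat: "verbose_json",
        timestamps    : [ "word" ]
    }
);

if ( result.hasWords() ) {
    for ( w in result.getWords() ) {
        // Assumed keys: "word" and "start" (seconds from the beginning of the audio)
        println( w.word & " @ " & w.start & "s" );
    }
}
```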

Specify language for better accuracy
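This sketch passes the language option from the Options table so the provider does not have to guess. Here the audio is Spanish, and the path is a placeholder.

```boxlang
// Declare the input language (BCP-47) instead of relying on auto-detection
text = aiTranscribe(
    "/recordings/entrevista.mp3",
    {},
    { language: "es" }
);

println( text );
```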

Groq — fast transcription
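This sketch routes the request to Groq through the provider option. It assumes a Groq API key is configured. The model name whisper-large-v3-turbo is an example of a Groq-hosted Whisper model and may change over time.

```boxlang
// Select Groq as the provider; model name is an example
text = aiTranscribe(
    "/recordings/meeting.mp3",
    { model: "whisper-large-v3-turbo" },
    { provider: "groq" }
);

println( text );
```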

Transcribe from a URL
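A string beginning with http:// or https:// is detected as a URL, so no extra option is needed. The URL below is a placeholder.

```boxlang
// Detected as a URL because of the https:// prefix
text = aiTranscribe( "https://example.com/audio/episode-42.mp3" );

println( text );
```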

Transcribe binary audio data

Useful when audio arrives in memory from an upload, a stream, or another API:
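In this sketch, fileReadBinary() stands in for wherever the bytes actually come from, and the upload path is a placeholder. The binary value is passed directly, and aiTranscribe() detects it as binary input.

```boxlang
// Load audio into memory as binary (stand-in for an upload or stream)
audioData = fileReadBinary( expandPath( "./uploads/voice-note.webm" ) );

// Binary input is detected automatically
text = aiTranscribe( audioData );

println( text );
```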

Generate SRT captions for a video
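This sketch asks the provider for SubRip output and writes it next to the video. It assumes that, with the default returnFormat ("text"), the SRT body comes back as the returned string. Both paths are placeholders.

```boxlang
// Request SubRip subtitles from the provider
srt = aiTranscribe(
    "/videos/demo-audio.mp3",
    {},
    { responseFormat: "srt" }
);

// Save the captions as a .srt file
fileWrite( expandPath( "./captions/demo.srt" ), srt );
```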

🧱 Fluent Builder API (v3.2.0+)

Calling aiTranscribe() with no arguments returns an AiTranscriptionRequest builder object for method chaining. This provides a more readable, self-documenting way to configure transcription.

Basic Builder Usage
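A minimal sketch of the builder, using only methods from the Builder Methods table. It assumes that, with no output format set, .transcribe() returns plain text as the functional form does by default. The path is a placeholder.

```boxlang
// No arguments: aiTranscribe() returns an AiTranscriptionRequest builder
text = aiTranscribe()
    .file( "/recordings/meeting.mp3" )
    .provider( "openai" )
    .language( "en" )
    .transcribe();

println( text );
```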

Builder Methods

| Method | Description |
|---|---|
| of( audio ) | Static factory: sets the audio input |
| .file( path ) | Set the audio file path |
| .url( url ) | Set the audio URL |
| .data( binary ) | Set raw binary audio data |
| .model( name ) | Set the STT model |
| .provider( name ) | Set the provider |
| .apiKey( key ) | Set the API key |
| .language( code ) | Set the input audio language (BCP-47) |
| .inputFormat( fmt ) | Set the input audio format |
| .withWordTimestamps() | Enable word-level timestamps |
| .withSegmentTimestamps() | Enable segment-level timestamps |
| .withTimestamps() | Enable all timestamps |
| .diarize( bool ) | Enable speaker diarization (Groq only) |
| .asJSON() | Output as JSON |
| .asText() | Output as plain text |
| .asVerboseJSON() | Output as verbose JSON with segments/words |
| .asSRT() | Output as SubRip subtitles |
| .asVTT() | Output as WebVTT subtitles |
| .withParams( struct ) | Set provider params |
| .withOptions( struct ) | Set module options |
| .withLogging() | Enable request/response logging |
| .transcribe() | Terminator: executes the transcription |
| .translate() | Terminator: executes a translation (audio to English) |

Fluent Examples
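A few builder sketches, using only methods from the table above. Paths and URLs are placeholders. This section does not say whether .withOptions() merges into or replaces options set by other builder methods. As a precaution, it is called first in the first chain.

```boxlang
// Word-level timestamps as verbose JSON, returned as the full response object
result = aiTranscribe()
    .withOptions( { returnFormat: "response" } )
    .file( "/recordings/meeting.mp3" )
    .withWordTimestamps()
    .asVerboseJSON()
    .transcribe();

// Groq with speaker diarization (Groq only)
diarized = aiTranscribe()
    .url( "https://example.com/audio/panel.mp3" )
    .provider( "groq" )
    .diarize( true )
    .transcribe();

// Translate non-English audio into English text
english = aiTranscribe()
    .file( "/recordings/entrevista.mp3" )
    .translate();
```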

💡 Backward Compatible: The traditional aiTranscribe( audio, params, options ) syntax continues to work unchanged. The fluent builder is an additional option — no migration required.

📡 Events

| Event | Data Available |
|---|---|
| beforeAITranscription | transcriptionRequest, service |
| afterAITranscription | transcriptionRequest, service, result |

