# Speech-to-Text

`aiTranscribe()` converts audio (a local file path, a public URL, or raw binary data) into text. By default it returns a plain text string. With `returnFormat: "response"` it returns a full `AiTranscriptionResponse` object containing timestamps, the detected language, segments, word-level alignment, and more.

## 🔧 The `aiTranscribe()` Function

### Syntax

```javascript
aiTranscribe( audio, params={}, options={} )
```

### Parameters

| Parameter | Type             | Required | Description                                                         |
| --------- | ---------------- | -------- | ------------------------------------------------------------------- |
| `audio`   | string or binary | ✅ Yes    | File path, URL, or raw binary audio data                            |
| `params`  | struct           | No       | Provider API parameters (model, language, temperature, etc.)        |
| `options` | struct           | No       | Module-level options (provider, returnFormat, responseFormat, etc.) |

### Options

| Option                 | Type    | Default   | Description                                                                                   |
| ---------------------- | ------- | --------- | --------------------------------------------------------------------------------------------- |
| `provider`             | string  | (config)  | AI provider: `openai`, `groq`                                                                 |
| `apiKey`               | string  | (env var) | Provider API key                                                                              |
| `returnFormat`         | string  | `"text"`  | `"text"` — returns a plain string; `"response"` — returns `AiTranscriptionResponse`           |
| `language`             | string  | `""`      | Input audio language in BCP-47 format (e.g. `en`, `es`, `fr`). Optional but improves accuracy |
| `responseFormat`       | string  | `"json"`  | Provider format: `json`, `text`, `verbose_json`, `srt`, `vtt`                                 |
| `timestamps`           | array   | `[]`      | Timestamp granularities: `["segment"]`, `["word"]`, or `["segment", "word"]`                  |
| `diarize`              | boolean | `false`   | Enable speaker diarization (Groq only)                                                        |
| `timeout`              | numeric | `30`      | HTTP timeout in seconds                                                                       |
| `logRequest`           | boolean | `false`   | Log requests to the module log file                                                           |
| `logRequestToConsole`  | boolean | `false`   | Print request payload to console                                                              |
| `logResponse`          | boolean | `false`   | Log responses to the module log file                                                          |
| `logResponseToConsole` | boolean | `false`   | Print raw provider response to console                                                        |
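
As a rough sketch of how several of these options combine, the call below raises the timeout for a long recording and prints the request payload to the console while debugging. The file path and the 120-second value are illustrative assumptions, and the generic `catch` is used only because this page does not document which exception type the module throws on failure.

```javascript
try {
    text = aiTranscribe(
        "/recordings/all-hands.mp3", // hypothetical path
        {},
        {
            provider: "openai",
            timeout: 120,              // seconds; the default is 30
            logRequestToConsole: true  // print the request payload while debugging
        }
    )
    println( text )
} catch ( any e ) {
    // The exception type is not documented here, so catch everything
    println( "Transcription failed: #e.message#" )
}
```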

### Audio Input Detection

`aiTranscribe()` automatically detects the audio input type:

| Input            | Detection Method                                                                               |
| ---------------- | ---------------------------------------------------------------------------------------------- |
| File path string | String ending with an audio extension (`.mp3`, `.wav`, `.m4a`, `.webm`, `.ogg`, `.flac`, etc.) |
| URL string       | String beginning with `http://` or `https://`                                                  |
| Binary data      | BoxLang binary / Java `byte[]` value                                                           |

## 📦 Return Value — `AiTranscriptionResponse`

By default (`returnFormat: "text"`), `aiTranscribe()` returns a plain **string** containing the transcribed text.

With `returnFormat: "response"`, it returns an **`AiTranscriptionResponse`** object.

### `AiTranscriptionResponse` Properties

| Property    | Type     | Description                                                 |
| ----------- | -------- | ----------------------------------------------------------- |
| `text`      | string   | Transcribed text                                            |
| `segments`  | array    | Array of segment structs with start/end timestamps and text |
| `words`     | array    | Array of word structs with start/end timestamps             |
| `language`  | string   | Detected or specified language code                         |
| `duration`  | numeric  | Total audio duration in seconds                             |
| `model`     | string   | Model used for transcription                                |
| `provider`  | string   | Provider name                                               |
| `metadata`  | struct   | Raw provider response metadata                              |
| `timestamp` | datetime | When the transcription was created                          |

### `AiTranscriptionResponse` Methods

| Method                   | Returns | Description                                              |
| ------------------------ | ------- | -------------------------------------------------------- |
| `getText()`              | string  | Returns the transcribed text                             |
| `hasText()`              | boolean | Returns `true` if transcribed text is non-empty          |
| `getWordCount()`         | numeric | Count of words in the transcription                      |
| `getFormattedDuration()` | string  | Human-readable duration, e.g. `"1:23"`                   |
| `hasSegments()`          | boolean | Returns `true` if segment data is available              |
| `hasWords()`             | boolean | Returns `true` if word-level timestamp data is available |
| `getSegments()`          | array   | Returns the array of segment structs                     |
| `getWords()`             | array   | Returns the array of word structs                        |
| `toStruct()`             | struct  | Returns a full struct representation                     |
| `toJSON()`               | string  | Returns JSON-serialized response                         |
| `toString()`             | string  | Alias for `getText()`                                    |
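
The guard methods are meant to be checked before reading the corresponding data. The following is a minimal sketch of that pattern, not a definitive usage: the segment keys `start`, `end`, and `text` are assumed from the property description above, and the file path is hypothetical.

```javascript
result = aiTranscribe(
    "/recordings/meeting.mp3", // hypothetical path
    {},
    {
        returnFormat: "response",
        responseFormat: "verbose_json",
        timestamps: [ "segment" ]
    }
)

if ( result.hasText() ) {
    println( "#result.getWordCount()# words in #result.getFormattedDuration()#" )
}

// Guard before iterating: segment data is only present when the provider returns it
if ( result.hasSegments() ) {
    result.getSegments().each( segment => {
        // Assumed keys: start, end, text
        println( "[#segment.start#s to #segment.end#s] #segment.text#" )
    } )
}
```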

## 🎼 Output Formats

Set `responseFormat` in `options` to request a different provider-level output format:

| Format         | Description                                                    |
| -------------- | -------------------------------------------------------------- |
| `json`         | Default JSON with `text` field (minimal)                       |
| `text`         | Plain text only — fastest, no metadata                         |
| `verbose_json` | Full JSON with segments, words, timestamps, language, duration |
| `srt`          | SubRip subtitle format for video captioning                    |
| `vtt`          | WebVTT subtitle format for HTML5 `<track>` elements            |

> **Tip:** When using `returnFormat: "response"`, always pair it with `responseFormat: "verbose_json"` so word/segment data is populated.

## 💡 Examples

### Basic — transcribe and get plain text

```javascript
text = aiTranscribe( "/recordings/meeting.mp3" )
println( text )
```

### Full response object

```javascript
result = aiTranscribe(
    "/recordings/meeting.mp3",
    {},
    { returnFormat: "response", responseFormat: "verbose_json" }
)

println( "Text: #result.getText()#" )
println( "Language: #result.language#" )
println( "Duration: #result.getFormattedDuration()#" )
println( "Words: #result.getWordCount()#" )
```

### Word-level timestamps

```javascript
result = aiTranscribe(
    "/recordings/interview.wav",
    {},
    {
        returnFormat: "response",
        responseFormat: "verbose_json",
        timestamps: [ "word" ]
    }
)

result.getWords().each( word => {
    println( "[#word.start#s–#word.end#s] #word.word#" )
})
```

### Specify language for better accuracy

```javascript
// Provide a BCP-47 language hint when you know the source language
text = aiTranscribe(
    "/recordings/spanish-lecture.mp3",
    {},
    { language: "es" }
)
```

### Groq — fast transcription

```javascript
// Groq serves Whisper models and is generally chosen for low-latency transcription
text = aiTranscribe(
    "/recordings/podcast.mp3",
    { model: "whisper-large-v3" },
    { provider: "groq" }
)
println( text )
```

### Transcribe from a URL

```javascript
text = aiTranscribe( "https://example.com/audio/announcement.mp3" )
println( text )
```

### Transcribe binary audio data

Useful when audio arrives in memory from an upload, a stream, or another API:

```javascript
// Read the audio bytes from a file (for example, a saved multipart upload)
binaryAudio = fileReadBinary( "/tmp/upload.webm" )
text = aiTranscribe( binaryAudio )
println( text )
```

### Generate SRT captions for a video

```javascript
srt = aiTranscribe(
    "/video/presentation.mp4",
    {},
    { responseFormat: "srt" }
)
fileWrite( "/video/presentation.srt", srt )
```
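
The same pattern should apply to WebVTT, which is the format HTML5 `<track>` elements expect. This sketch only swaps `responseFormat` to `"vtt"`; the file paths are illustrative, and it assumes the configured provider supports VTT output.

```javascript
// Request WebVTT instead of SRT; the return value is the caption file content
vtt = aiTranscribe(
    "/video/presentation.mp4",
    {},
    { responseFormat: "vtt" }
)
fileWrite( "/video/presentation.vtt", vtt )
```

The resulting file can then be referenced from a `<track kind="captions" src="presentation.vtt">` element inside the page's `<video>` tag.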

## 🧱 Fluent Builder API (v3.2.0+)

Calling `aiTranscribe()` with **no arguments** returns an `AiTranscriptionRequest` builder object for method chaining. This provides a more readable, self-documenting way to configure transcription.

### Basic Builder Usage

```javascript
text = aiTranscribe()
    .file( "/recordings/meeting.mp3" )
    .transcribe()
```

### Builder Methods

| Method                     | Description                                            |
| -------------------------- | ------------------------------------------------------ |
| `of( audio )`              | Static factory — set audio input                       |
| `.file( path )`            | Set audio file path                                    |
| `.url( url )`              | Set audio URL                                          |
| `.data( binary )`          | Set raw binary audio data                              |
| `.model( name )`           | Set the STT model                                      |
| `.provider( name )`        | Set the provider                                       |
| `.apiKey( key )`           | Set the API key                                        |
| `.language( code )`        | Set input audio language (BCP-47)                      |
| `.inputFormat( fmt )`      | Set input audio format                                 |
| `.withWordTimestamps()`    | Enable word-level timestamps                           |
| `.withSegmentTimestamps()` | Enable segment-level timestamps                        |
| `.withTimestamps()`        | Enable all timestamps                                  |
| `.diarize( bool )`         | Enable speaker diarization (Groq only)                 |
| `.asJSON()`                | Output as JSON                                         |
| `.asText()`                | Output as plain text                                   |
| `.asVerboseJSON()`         | Output as verbose JSON with segments/words             |
| `.asSRT()`                 | Output as SubRip subtitles                             |
| `.asVTT()`                 | Output as WebVTT subtitles                             |
| `.withParams( struct )`    | Set provider params                                    |
| `.withOptions( struct )`   | Set module options                                     |
| `.withLogging()`           | Enable request/response logging                        |
| `.transcribe()`            | **Terminator** — execute transcription                 |
| `.translate()`             | **Terminator** — execute translation (audio → English) |
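
A possible chain combining several of the methods above is sketched here. It assumes the configuration methods can be called in any order before the terminator, that `whisper-1` is the model you have access to on OpenAI, and that the URL and output path (both hypothetical) are valid in your environment.

```javascript
captions = aiTranscribe()
    .url( "https://example.com/audio/keynote.mp3" ) // hypothetical URL
    .provider( "openai" )
    .model( "whisper-1" )
    .language( "en" )
    .asVTT()        // WebVTT output
    .withLogging()  // request/response logging
    .transcribe()

fileWrite( "/tmp/keynote.vtt", captions )
```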

### Fluent Examples

```javascript
// From URL with language hint
text = aiTranscribe()
    .url( "https://example.com/audio/spanish-lecture.mp3" )
    .language( "es" )
    .transcribe()

// With word-level timestamps
result = aiTranscribe()
    .file( "/recordings/interview.wav" )
    .withWordTimestamps()
    .asVerboseJSON()
    .withOptions( { returnFormat: "response" } )
    .transcribe()

// Speaker diarization (Groq)
text = aiTranscribe()
    .file( "/recordings/panel-discussion.mp3" )
    .provider( "groq" )
    .diarize( true )
    .transcribe()

// Generate SRT captions
srt = aiTranscribe()
    .file( "/video/presentation.mp4" )
    .asSRT()
    .transcribe()

// Translate audio to English: .translate() is the terminator here instead of .transcribe()
english = aiTranscribe()
    .file( "/recordings/french-meeting.mp3" )
    .translate()
```

> 💡 **Backward Compatible:** The traditional `aiTranscribe( audio, params, options )` syntax continues to work unchanged. The fluent builder is an **additional** option — no migration required.

## 📡 Events

| Event                   | Data Available                              |
| ----------------------- | ------------------------------------------- |
| `beforeAITranscription` | `transcriptionRequest`, `service`           |
| `afterAITranscription`  | `transcriptionRequest`, `service`, `result` |

```javascript
// Track all transcription requests for cost monitoring
BoxRegisterInterceptor( "afterAITranscription", event => {
    println( "Transcribed #event.result.getWordCount()# words via #event.service.getName()#" )
})
```

***

## 📖 Related Pages

* [Audio Overview](/main-components/audio.md)
* [Text-to-Speech](/main-components/audio/text-to-speech.md)
* [Audio Translation](/main-components/audio/audio-translation.md)
* [aiTranscribe BIF Reference](/advanced/reference/built-in-functions/aitranscribe.md)

