# Speech To Text

> xAI's Grok Speech-to-Text — fast, accurate transcription across 25 languages with speaker diarization, word-level timestamps, multichannel audio, and inverse text normalization.


## Overview

- **Endpoint**: `https://fal.run/xai/speech-to-text/v1`
- **Model ID**: `xai/speech-to-text/v1`
- **Category**: speech-to-text
- **Kind**: inference
- **Tags**: transcription, speech-to-text, multilingual, diarization, multichannel, word-timestamps, audio, grok, xai



## Pricing

$0.001667 per minute. Duration is rounded up to the nearest full minute (e.g., 10s and 50s both count as 1 minute; 1m 1s counts as 2 minutes).
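
As an illustration of how that rounding works, here is a minimal Python sketch (the rate constant is simply the per-minute price quoted above; this is not an official cost calculator):

```python
import math

PRICE_PER_MINUTE_USD = 0.001667  # per-minute rate quoted above

def estimated_cost(duration_seconds: float) -> float:
    # Billing rounds the duration up to the nearest full minute.
    billable_minutes = math.ceil(duration_seconds / 60)
    return billable_minutes * PRICE_PER_MINUTE_USD

print(estimated_cost(10))  # 0.001667 -> 10s bills as 1 minute
print(estimated_cost(50))  # 0.001667 -> 50s bills as 1 minute
print(estimated_cost(61))  # 0.003334 -> 1m 1s bills as 2 minutes
```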

For more details, see [fal.ai pricing](https://fal.ai/pricing).

## API Information

This model can be used via our HTTP API or more conveniently via our client libraries.
See the input and output schema below, as well as the usage examples.


### Input Schema

The API accepts the following input parameters:


- **`audio_url`** (`string`, _required_):
  URL of the audio file to transcribe. Supported formats: mp3, wav, ogg, opus, flac, aac, mp4, m4a, mkv.
  - Examples: "https://storage.googleapis.com/falserverless/model_tests/whisper/dinner_conversation.mp3"

- **`language`** (`LanguageEnum`, _optional_):
  BCP-47 language code for the audio, or 'auto' to let xAI detect the language automatically. Supported: ar, cs, da, de, en, es, fa, fil, fr, hi, id, it, ja, ko, mk, ms, nl, pl, pt, ro, ru, sv, th, tr, vi.
  - Default: `"auto"`
  - Options: `"auto"`, `"ar"`, `"cs"`, `"da"`, `"de"`, `"en"`, `"es"`, `"fa"`, `"fil"`, `"fr"`, `"hi"`, `"id"`, `"it"`, `"ja"`, `"ko"`, `"mk"`, `"ms"`, `"nl"`, `"pl"`, `"pt"`, `"ro"`, `"ru"`, `"sv"`, `"th"`, `"tr"`, `"vi"`

- **`diarize`** (`boolean`, _optional_):
  Whether to enable speaker diarization. When enabled, each word in the response includes a `speaker` integer indicating the detected speaker.
  - Default: `false`

- **`format`** (`boolean`, _optional_):
  Whether to apply Inverse Text Normalization (ITN) to the transcript: adds punctuation and capitalization, and formats numbers and digits. Set to `false` for unformatted output. Note: xAI requires a specific `language` for ITN, so this flag is automatically disabled when `language='auto'`.
  - Default: `true`

- **`multichannel`** (`boolean`, _optional_):
  Whether to transcribe each audio channel independently. When enabled, per-channel transcripts are returned in the `channels` array.
  - Default: `false`



**Required Parameters Example**:

```json
{
  "audio_url": "https://storage.googleapis.com/falserverless/model_tests/whisper/dinner_conversation.mp3"
}
```

**Full Example**:

```json
{
  "audio_url": "https://storage.googleapis.com/falserverless/model_tests/whisper/dinner_conversation.mp3",
  "language": "auto",
  "diarize": false,
  "format": true,
  "multichannel": false
}
```


### Output Schema

The API returns the following output format:

- **`text`** (`string`, _required_):
  The full transcribed text.
  - Examples: "The future belongs to those who believe in the beauty of their dreams."

- **`language`** (`string`, _required_):
  BCP-47 language code detected (or echoed) by xAI.
  - Examples: "en"

- **`duration`** (`float`, _optional_):
  Duration of the audio in seconds as reported by xAI.
  - Examples: 12.34

- **`words`** (`list<TranscriptionWord>`, _optional_):
  Word-level transcription with timestamps.
  - Array of TranscriptionWord

- **`channels`** (`list<TranscriptionChannel>`, _optional_):
  Per-channel transcripts. Only populated when `multichannel=true` is requested.
  - Array of TranscriptionChannel



**Example Response**:

```json
{
  "text": "The future belongs to those who believe in the beauty of their dreams.",
  "language": "en",
  "duration": 12.34,
  "words": [
    {
      "text": ""
    }
  ],
  "channels": [
    {
      "text": "",
      "words": [
        {
          "text": ""
        }
      ]
    }
  ]
}
```
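
When diarization or multichannel transcription is enabled, the optional `words` and `channels` fields can be consumed as sketched below (a minimal example assuming the response has already been parsed into a Python dict; only fields documented above are used):

```python
def summarize_transcript(result: dict) -> None:
    """Print the transcript, plus per-speaker words and per-channel text when present."""
    print(f"[{result['language']}] {result['text']}")

    # With diarize=true, each word carries a `speaker` integer.
    for word in result.get("words") or []:
        speaker = word.get("speaker")
        if speaker is not None:
            print(f"speaker {speaker}: {word['text']}")

    # With multichannel=true, per-channel transcripts are returned in `channels`.
    for index, channel in enumerate(result.get("channels") or []):
        print(f"channel {index}: {channel['text']}")
```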


## Usage Examples

### cURL

```bash
curl --request POST \
  --url https://fal.run/xai/speech-to-text/v1 \
  --header "Authorization: Key $FAL_KEY" \
  --header "Content-Type: application/json" \
  --data '{
     "audio_url": "https://storage.googleapis.com/falserverless/model_tests/whisper/dinner_conversation.mp3"
   }'
```

### Python

Ensure you have the Python client installed:

```bash
pip install fal-client
```

Then use the API client to make requests:

```python
import fal_client

def on_queue_update(update):
    if isinstance(update, fal_client.InProgress):
        for log in update.logs:
            print(log["message"])

result = fal_client.subscribe(
    "xai/speech-to-text/v1",
    arguments={
        "audio_url": "https://storage.googleapis.com/falserverless/model_tests/whisper/dinner_conversation.mp3"
    },
    with_logs=True,
    on_queue_update=on_queue_update,
)
print(result)
```
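
The optional parameters from the input schema are passed through `arguments` in the same way. For example, a sketch with a fixed language, diarization, and multichannel transcription enabled (the values here are illustrative):

```python
import fal_client

result = fal_client.subscribe(
    "xai/speech-to-text/v1",
    arguments={
        "audio_url": "https://storage.googleapis.com/falserverless/model_tests/whisper/dinner_conversation.mp3",
        "language": "en",      # skip auto-detection
        "diarize": True,       # tag each word with a speaker integer
        "multichannel": True,  # return per-channel transcripts in `channels`
        "format": True,        # apply inverse text normalization
    },
)
print(result["text"])  # full transcript (see the output schema above)
```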

### JavaScript

Ensure you have the JavaScript client installed:

```bash
npm install --save @fal-ai/client
```

Then use the API client to make requests:

```javascript
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("xai/speech-to-text/v1", {
  input: {
    audio_url: "https://storage.googleapis.com/falserverless/model_tests/whisper/dinner_conversation.mp3"
  },
  logs: true,
  onQueueUpdate: (update) => {
    if (update.status === "IN_PROGRESS") {
      update.logs.map((log) => log.message).forEach(console.log);
    }
  },
});
console.log(result.data);
console.log(result.requestId);
```


## Additional Resources

### Documentation

- [Model Playground](https://fal.ai/models/xai/speech-to-text/v1)
- [API Documentation](https://fal.ai/models/xai/speech-to-text/v1/api)
- [OpenAPI Schema](https://fal.ai/api/openapi/queue/openapi.json?endpoint_id=xai/speech-to-text/v1)

### fal.ai Platform

- [Platform Documentation](https://docs.fal.ai)
- [Python Client](https://docs.fal.ai/clients/python)
- [JavaScript Client](https://docs.fal.ai/clients/javascript)
