Getting Started

How It Works

Two ways to use AethonVoice. Pick the one that fits.

Two paths: REST API vs MCP Server
Path 1

REST API (For Developers)

Two ways to get results: poll for status, or let us call your webhook when it's ready. Two modes to submit: single job via POST /tts/submit, or batch via POST /tts/batch-submit for many items in one request — see Batch Processing below.

Option A: Polling

Submit a job, check back periodically, download when ready. Simple and reliable.

Polling flow: submit, poll status, download MP3
Polling Flow
1. POST /tts/submit         { jobId, status: "queued" }
2. GET  /tts/status/:jobId   { status: "processing" }
3. GET  /tts/status/:jobId   { status: "done", downloadUrl: "..." }
4. Download MP3 from downloadUrl (no auth needed)

Option B: Webhook Callback

Provide a callbackUrl and we'll POST the results to your server when the job completes. No polling needed.

Webhook callback flow: submit with callbackUrl, receive results via POST
Webhook Flow
1. POST /tts/submit    { jobId, status: "queued" }
   body: { text, voice, langs, callbackUrl: "https://your-server.com/webhook" }

2. (your app does other work — no polling needed)

3. POST your callbackUrl ← AethonVoice calls you
   {
     "jobId": "Xk9mP2qR7vNw",
     "status": "done",
     "downloadUrl": "https://storage.googleapis.com/...",
     "durationMs": 2100
   }

4. Download MP3 from downloadUrl (no auth needed)

Best for batch jobs and server-to-server integrations. You can still poll GET /tts/status/:jobId as a fallback.

If generation fails, you still receive a callback — the payload carries a top-level error string so you don't have to walk the items map:

Failure callback
{
  "batchId": "Bt9xK2pL...",
  "status": "error",
  "error": "All 3 items failed",
  "items": { /* per-item status + error */ }
}

Credits are only debited for items that successfully returned audio — failed items are never charged.

Quick Start

1 Get an API key

Sign up and create an API key from your dashboard. Keys are issued in the form av_live_<32-bytes-base64url> and shown only once at creation — we store a SHA-256 hash, never the raw key. Send it as Authorization: Bearer <key> on every request (the X-API-Key header is also accepted).

2 Submit your first job

curl
curl -X POST https://aethon.lab.ai/api/tts/submit \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "text": "Welcome to AethonVoice.",
    "voice": "aris",
    "langs": ["en"]
  }'

Response:

json
{
  "jobId": "Xk9mP2qR7vNw",
  "status": "queued"
}

3 Poll for status

curl
curl https://aethon.lab.ai/api/tts/status/Xk9mP2qR7vNw \
  -H "Authorization: Bearer YOUR_API_KEY"

Response (when complete):

json
{
  "jobId": "Xk9mP2qR7vNw",
  "status": "done",
  "downloadUrl": "https://storage.googleapis.com/...",
  "durationMs": 2100
}

4 Download

The downloadUrl is a signed URL. Download it directly — no authentication header needed.

Polling Tips

  • Wait 2-3 seconds after submit before first poll
  • Poll every 3-5 seconds
  • First request may take 30-60 seconds (cold start — model loading into GPU memory). Subsequent requests on a warm worker are much faster (3-15 seconds).
  • Timeout after 120 seconds

Endpoints

Endpoint Method Description
/tts/submit POST Submit single TTS job
/tts/batch-submit POST Submit batch of TTS items
/tts/status/:jobId GET Poll single job status
/tts/batch-status/:batchId GET Poll batch status with per-item progress

Job Status Values

Status Meaning
queued Job received, waiting for GPU worker
processing Audio is being generated
done Audio ready — downloadUrl included
error Generation failed — error message included
partial (Batch only) Some items succeeded, some failed

Credits & Credit Gate

1 credit = 1 second of generated audio, rounded up per job. Charges are always based on the actual duration returned — never on your input text length — and failed items are never debited.

Before we queue a job, we run a quick pre-flight estimate based on character count and language speaking rate. If the estimate exceeds your available balance by more than a 30-second tolerance, the request is rejected with 402 Payment Required before any work is done:

402 response
{
  "error": "Insufficient credits",
  "estimatedSec": 420,
  "balance": 85
}

A 30-second tolerance means borderline jobs are allowed through and may end the session with a small negative balance — the next top-up restores you automatically.

Error Codes

Code Meaning When
400 Bad Request Missing/invalid text, voice, or langs
401 Unauthorized Missing Authorization header, or key invalid/revoked
402 Payment Required Estimated duration exceeds balance + 30s tolerance
404 Not Found Unknown jobId or batchId
429 Rate Limited Too many requests in the current window

Batch Processing

Submit multiple items in one request. Each item can have different text, voice, and languages — useful for dubbing, dataset generation, or multi-speaker scripts.

curl
curl -X POST https://aethon.lab.ai/api/tts/batch-submit \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "items": [
      { "key": "word-01", "text": "สวัสดีครับ", "voice": "aris", "langs": ["th"] },
      { "key": "word-02", "text": "こんにちは", "voice": "lyra", "langs": ["ja"] },
      { "key": "word-03", "text": "Auf Wiedersehen", "voice": "nolan", "langs": ["de"] }
    ],
    "callbackUrl": "https://your-server.com/webhook"
  }'

Response:

json
{
  "batchId": "Bk8nQ3rS6uMx",
  "status": "queued",
  "itemCount": 3
}

Poll GET /tts/batch-status/:batchId to see per-item progress. Start downloading completed items before the full batch finishes. If you supplied callbackUrl, we POST the final payload to your server — no polling needed.

Path 2 Coming Soon

MCP Server (For Everyone)

No code required. AethonVoice will work as a tool inside any MCP-compatible AI assistant.

What You Need

1

An AI assistant that supports MCP (Claude Code, Codex, Coworks, Cursor, Windsurf, and others)

2

An AethonVoice API key

3

A one-time MCP Server connection setup

How It Works

Once connected, you talk to your AI assistant in plain language:

"Generate Thai pronunciation for สวัสดีครับ using the Aris voice"

The assistant calls AethonVoice, generates the audio, and returns a download link.

"Read this paragraph aloud with the Lyra voice"

Paste any text. The assistant sends it to AethonVoice and gives you the MP3.

"Create audio for this blog post in French"

Long-form content works too. The assistant handles the submission and polling automatically.

"Pronounce this Japanese word: ありがとうございます"

Quick pronunciation lookups — useful for language learning, content creation, or accessibility.

MCP Server flow
Reference

Input Format

Text

Plain text in any of the 21 supported languages (24 locale variants). Mixed-language text is handled automatically.

Language Tags

For explicit control over language boundaries, wrap portions in XML-style tags:

language tags
<ja>聞いてください</ja> This is English <th>สวัสดีครับ</th>

Tags are optional. Tagged portions use the specified language. Untagged portions are split automatically by character range detection. You can mix tagged and untagged text freely.

When to use tags: When languages share the same script (e.g., Chinese and Japanese both use CJK characters). For most other language pairs, automatic detection works without tags.

Paralinguistic Tags

Insert natural sound effects inline:

Tag Effect
[pause] Silence (0.7-1.1s, randomized)
[laugh] Laughter from voice's sound bank
[sigh] Sigh from voice's sound bank
[er] Hesitation / filler sound
example
I was thinking [pause] maybe we should [laugh] just go for it

Tags work with both API and MCP Server.

Voices

Voice ID Gender Character
Aris aris Male Warm, steady, authoritative
Nolan nolan Male Clear, friendly, upbeat
Lyra lyra Female Gentle, expressive, nuanced
Senna senna Female Calm, articulate, professional

Additional Request Fields

Field Required Description
callbackRef No Opaque reference string stored with the job. Useful for correlating results with your own records.

Language Codes

Specify languages with the langs array. The first element is the primary language. List all languages present in your text.

Code Language Code Language
en English id Indonesian
th Thai ms Malay
ja Japanese hi Hindi
ko Korean ar Arabic
zh Chinese (Mandarin) bn Bengali
yue Cantonese fa Persian
fr French ur Urdu
de German vi Vietnamese
es Spanish tr Turkish
it Italian ru Russian
pt Portuguese

Output Format

Parameter Value
Format MP3
Bitrate 96 kbps
Sample rate 24 kHz
Channels Mono
Download URL expiry 7 days

Why 24 kHz / 96 kbps?

These numbers look lower than music streaming, but they're optimal for speech. Human speech tops out at ~8 kHz — well within the 12 kHz Nyquist limit of a 24 kHz sample rate. 100% of the speech signal is captured.

At 96 kbps MP3, the compressed audio is perceptually transparent for speech — indistinguishable from the uncompressed original in listening tests. The result: smaller files, faster downloads, identical quality.

For context, OpenAI TTS also outputs at 24 kHz. This reflects the TTS research consensus that 24 kHz is the sweet spot for speech synthesis. Higher sample rates add file size with zero audible benefit for voice.

Download URLs are signed — no authentication needed to download. The cryptographic signature carries access. URLs cannot be guessed or enumerated.

Full API Documentation

This page covers the essentials. For complete API reference including all request/response schemas, error codes, rate limits, and advanced usage:

Read the Full API Docs

Get Started

Studio-grade TTS in minutes. Choose your path.

For Developers

Get an API key and make your first TTS call in minutes.

Get API Key Read the Docs →

For Everyone

Use AethonVoice through any MCP-compatible AI assistant. No code needed.

Set Up MCP Soon Claude Code, Cursor, Windsurf & more