Features

Everything AethonVoice can do — from voice quality to batch processing to AI assistant integration.

Quality

Voice Quality

AethonVoice generates speech that is indistinguishable from human voice recordings in most contexts.

WER (English): 1.30%
CER (Chinese): 0.84%
SIM-o Speaker Similarity: 0.729
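For readers unfamiliar with these metrics: WER and CER are conventionally computed as the edit distance between a transcript of the generated audio and the reference text, divided by the reference length in words or characters. The sketch below shows that standard definition only. It is not AethonVoice's evaluation code, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Under this definition, a WER of 1.30% corresponds to roughly 13 word-level errors per 1,000 reference words.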

Quality is consistent across all 21 supported languages. There is no "tier 1 vs. tier 2" language split — Thai, Japanese, German, and Arabic all receive the same model attention and dedicated voice reference audio.

Voices

4 Built-In Voices

Each with a defined personality and dedicated reference recordings for all 21 supported languages.

Voice Personas: Aris, Nolan, Lyra, Senna

| Voice | Gender | Personality | Best For |
| --- | --- | --- | --- |
| Aris | Male | Warm, steady, authoritative. A deep, grounded tone that conveys trust and expertise. | Narration, instruction, educational content, non-fiction audiobooks |
| Nolan | Male | Clear, friendly, upbeat. Approachable with natural energy, like a colleague explaining something over coffee. | Dialogue, conversation, customer-facing content, podcasts, chatbot voices |
| Lyra | Female | Gentle, expressive, emotionally nuanced. A voice that draws the listener in. | Storytelling, fiction audiobooks, emotional content, children's content, meditation |
| Senna | Female | Calm, articulate, professional. Polished and confident without being cold. | Corporate content, e-learning, presentations, professional narration, accessibility |

Each voice maintains its personality and timbre across languages. Aris speaking Thai sounds like the same person as Aris speaking Japanese — the voice identity carries over while pronunciation and prosody adapt naturally.

Voice Cloning (Coming Soon)

Zero-Shot Voice Cloning

The underlying OmniVoice model supports zero-shot voice cloning from a short audio sample (as little as 10–15 seconds). No fine-tuning, no training, no waiting.

1. Upload Reference: Upload a reference audio clip via the API.
2. Clone Instantly: AethonVoice reproduces the speaker's vocal characteristics.
3. All Languages: The cloned voice works across all 21 supported languages.
4. Voice Library: Cloned voices are stored in your account, where you can use, manage, and delete them at any time.

The 4 built-in voices (Aris, Nolan, Lyra, Senna) are available now and use this same zero-shot cloning technology internally, with dedicated per-language reference recordings for consistent quality.

Multilingual

Multilingual Mixing

Handle mixed-language text natively — multiple languages in a single sentence, spoken by one voice, with seamless transitions.

Example input:
ฟังดีๆ นะ คำว่า <ja>コーヒー</ja> แปลว่า coffee

[Figure: multilingual mixing flow]

Automatic Detection

AethonVoice detects language boundaries automatically using Unicode character ranges: Thai script, Japanese kana, Korean hangul, CJK ideographs, Cyrillic, Arabic, Devanagari, Latin, and more. No tags needed.
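As a rough illustration of range-based splitting, the sketch below groups characters by Unicode block. The range list is a small illustrative subset, the names `SCRIPT_RANGES`, `script_of`, and `split_by_script` are hypothetical, and treating every other letter as Latin is a simplification. It is not AethonVoice's actual splitter. Note that ranges identify a script, not a language, so Latin-script languages cannot be told apart this way.

```python
# Illustrative subset of Unicode blocks; a real splitter would cover more.
SCRIPT_RANGES = [
    ("thai", 0x0E00, 0x0E7F),
    ("kana", 0x3040, 0x30FF),        # Hiragana and Katakana
    ("hangul", 0xAC00, 0xD7AF),      # Hangul syllables
    ("cjk", 0x4E00, 0x9FFF),         # CJK Unified Ideographs
    ("cyrillic", 0x0400, 0x04FF),
    ("arabic", 0x0600, 0x06FF),
    ("devanagari", 0x0900, 0x097F),
]


def script_of(ch):
    """Return a script name, or None for neutral characters (spaces, digits, punctuation)."""
    cp = ord(ch)
    for name, lo, hi in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "latin" if ch.isalpha() else None  # simplification, see lead-in


def split_by_script(text):
    """Split text into (script, segment) pairs; neutral characters join the current segment."""
    segments = []
    for ch in text:
        script = script_of(ch)
        if not segments:
            segments.append([script, ch])
        elif script is None or script == segments[-1][0]:
            segments[-1][1] += ch
        elif segments[-1][0] is None:
            # Leading neutral characters adopt the first real script.
            segments[-1][0] = script
            segments[-1][1] += ch
        else:
            segments.append([script, ch])
    return [(script, segment) for script, segment in segments]


print(split_by_script("ฟังดีๆ นะ coffee"))
```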

Explicit Tags

For ambiguous pairs (e.g., Chinese and Japanese share CJK characters), wrap the target language in tags: <ja>テキスト</ja>. Tagged portions use the specified language.
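A minimal sketch of how tagged spans could be separated from untagged text, assuming tags are lowercase two- or three-letter language codes as in the `<ja>` example. The function name `split_tagged` and the regular expression are illustrative, not the service's parser.

```python
import re

# Matches <xx>...</xx> or <xxx>...</xxx> with a matching closing tag.
TAG_RE = re.compile(r"<([a-z]{2,3})>(.*?)</\1>", re.DOTALL)


def split_tagged(text):
    """Return (lang, span) pairs; lang is None for spans left to automatic detection."""
    parts = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            parts.append((None, text[pos:match.start()]))
        parts.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        parts.append((None, text[pos:]))
    return parts


print(split_tagged("ฟังดีๆ นะ คำว่า <ja>コーヒー</ja> แปลว่า coffee"))
```

Untagged spans come back with `None` as the language, so they can be passed on to script-based detection.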

LLM-Assisted Edge Cases

Ambiguous digits and orphan segments are resolved by a lightweight LLM call. This adds roughly 1–2 seconds, but it only triggers when genuine ambiguity exists; most text is split instantly.

Expression

Paralinguistic Expression

Real speech includes pauses, laughter, hesitation, and sighs. AethonVoice supports inline tags that insert these natural sound effects.

| Tag | Effect |
| --- | --- |
| [pause] | Natural silence between phrases |
| [laugh] | Natural laughter |
| [sigh] | Gentle sigh |
| [er] | Hesitation / filler sound |

Example:
So I was thinking [pause] maybe we should [laugh] just go for it

[Figure: paralinguistic waveform]

Each voice has its own set of pre-recorded paralinguistic clips. Clips are randomly selected and seamlessly blended into the surrounding speech. The result: audio that sounds like a person speaking naturally, not a machine reading text.
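The tag handling described above can be sketched as a small tokenizer plus a random clip picker. This is an illustration under assumptions: the function names, the clip library layout, and the file names are hypothetical, and the real blending step is not shown.

```python
import random
import re

SUPPORTED_TAGS = {"pause", "laugh", "sigh", "er"}
TAG_RE = re.compile(r"\[(\w+)\]")


def tokenize(text):
    """Split text into ("speech", str) and ("effect", tag) tokens."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.group(1) not in SUPPORTED_TAGS:
            continue  # unknown bracketed words stay part of the speech text
        speech = text[pos:match.start()].strip()
        if speech:
            tokens.append(("speech", speech))
        tokens.append(("effect", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        tokens.append(("speech", tail))
    return tokens


def pick_clip(voice, tag, clip_library, rng=random):
    """Randomly choose one pre-recorded clip for this voice and tag."""
    return rng.choice(clip_library[voice][tag])


# Hypothetical clip library: voice -> tag -> list of clip file names.
library = {"Lyra": {"laugh": ["lyra_laugh_01.wav", "lyra_laugh_02.wav"]}}
print(tokenize("So I was thinking [pause] maybe we should [laugh] just go for it"))
print(pick_clip("Lyra", "laugh", library))
```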

Languages

21 Languages, 24 Locale Variants

Each locale has dedicated voice clone reference audio for all 4 built-in voices (24 locales × 4 voices = 96 reference pairs in total), verified pronunciation quality, and tested language detection.

| Language | Code | Locale(s) | Script Detection |
| --- | --- | --- | --- |
| English | en | en-US, en-GB | Latin |
| Thai | th | th-TH | Thai script |
| Japanese | ja | ja-JP | Hiragana/Katakana + CJK |
| Korean | ko | ko-KR | Hangul + CJK |
| Chinese (Mandarin) | zh | zh-CN, zh-TW | CJK |
| Cantonese | yue | yue-HK | CJK |
| French | fr | fr-FR | Latin |
| German | de | de-DE | Latin |
| Spanish | es | es-ES | Latin |
| Italian | it | it-IT | Latin |
| Portuguese | pt | pt-BR, pt-PT | Latin |
| Russian | ru | ru-RU | Cyrillic |
| Vietnamese | vi | vi-VN | Latin (extended) |
| Turkish | tr | tr-TR | Latin |
| Indonesian | id | id-ID | Latin |
| Malay | ms | ms-MY | Latin |
| Hindi | hi | hi-IN | Devanagari |
| Arabic | ar | ar-SA | Arabic script |
| Bengali | bn | bn-IN | Bengali script |
| Persian (Farsi) | fa | fa-IR | Arabic (extended) |
| Urdu | ur | ur-PK | Arabic (extended) |
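The language, locale, and reference-pair counts follow directly from the table. A quick check, with the table transcribed into a Python dictionary (the variable names are illustrative):

```python
# Language code -> locale variants, transcribed from the table above.
LOCALES = {
    "en": ["en-US", "en-GB"], "th": ["th-TH"], "ja": ["ja-JP"], "ko": ["ko-KR"],
    "zh": ["zh-CN", "zh-TW"], "yue": ["yue-HK"], "fr": ["fr-FR"], "de": ["de-DE"],
    "es": ["es-ES"], "it": ["it-IT"], "pt": ["pt-BR", "pt-PT"], "ru": ["ru-RU"],
    "vi": ["vi-VN"], "tr": ["tr-TR"], "id": ["id-ID"], "ms": ["ms-MY"],
    "hi": ["hi-IN"], "ar": ["ar-SA"], "bn": ["bn-IN"], "fa": ["fa-IR"],
    "ur": ["ur-PK"],
}
VOICES = ["Aris", "Nolan", "Lyra", "Senna"]

locale_count = sum(len(variants) for variants in LOCALES.values())
pairs = [(voice, locale)
         for variants in LOCALES.values()
         for locale in variants
         for voice in VOICES]

print(len(LOCALES), locale_count, len(pairs))  # 21 24 96
```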

The underlying OmniVoice model supports 646 languages, with 82 languages achieving CER of 5% or less. AethonVoice curates the 21 above where quality has been verified end-to-end with dedicated reference audio. Additional languages can be added by providing reference audio and running quality checks.

Long-Form

Long-Form Generation

Generate up to 1 hour of continuous audio from a single submission (approximately 60,000 characters per request).

AethonVoice automatically splits long text at sentence and paragraph boundaries, generates each segment with the same voice, and concatenates everything with crossfade to eliminate audible seams. The output is one continuous MP3 file.
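The split-and-crossfade approach can be sketched as follows. This is an illustration under assumptions, not the production pipeline: the 500-character chunk budget, the sentence-boundary regular expression, the linear fade curve, and both function names are hypothetical, and audio is modeled as plain lists of float samples.

```python
import re


def split_into_chunks(text, max_chars=500):
    """Greedily pack sentences into chunks of at most max_chars.

    Chunks never cross a paragraph break. A single sentence longer than
    max_chars is kept whole rather than cut mid-sentence.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if not sentence:
                continue
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending the end of a into the start of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    head, tail = a[:-overlap], a[-overlap:]
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # share of b rises across the overlap
        blended.append(tail[i] * (1 - weight) + b[i] * weight)
    return list(head) + blended + list(b[overlap:])


print(split_into_chunks("One. Two. Three.\n\nFour.", max_chars=10))
print(len(crossfade([1.0] * 4, [0.0] * 4, overlap=2)))
```

Because the overlapping samples are blended rather than appended, the joined audio is shorter than the two inputs by the overlap length.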

Long-form vs. Batch API: Long-form produces one continuous audio file from one long text. The Batch API produces many separate audio files from many separate texts.

Use Cases

  • Audiobooks: Convert entire chapters into natural, continuous speech
  • Podcasts: Generate full episodes from scripts
  • Educational content: Create complete lessons with consistent voice
  • Accessibility: Convert articles and reports to audio

Scale

Batch API

Submit hundreds of TTS items in a single request. Each item is tracked independently — start downloading completed items before the full batch finishes.

Use Cases

  • Generate vocabulary pronunciations for a language course (100+ words per batch)
  • Produce audio for an entire flashcard deck
  • Create multiple narration variants with different voices

How It Works

1. Submit an array of items, each with its own text, language, and voice
2. AethonVoice processes items in parallel (configurable concurrency per GPU)
3. Poll batch status to see per-item progress
4. Download each item's MP3 as it completes

Each item in a batch can use a different voice and different languages.
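The submit, poll, and download flow can be sketched on the client side as below. Everything about the wire format here is an assumption made for illustration: the field names (`items`, `id`, `text`, `language`, `voice`, `state`) and the `"completed"` state value are hypothetical, and no network calls are made.

```python
REQUIRED_FIELDS = {"text", "language", "voice"}


def build_batch(items):
    """Validate items and build a request body (field names are hypothetical)."""
    body = {"items": []}
    for index, item in enumerate(items):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            raise ValueError(f"item {index} is missing {sorted(missing)}")
        body["items"].append({"id": index, **item})
    return body


def newly_completed(status, already_downloaded):
    """Return ids of items that are finished but not yet downloaded."""
    return [item["id"] for item in status["items"]
            if item["state"] == "completed" and item["id"] not in already_downloaded]


body = build_batch([
    {"text": "hello", "language": "en", "voice": "Nolan"},
    {"text": "สวัสดีครับ", "language": "th", "voice": "Lyra"},
])

# Hypothetical status payload, as a poll might return mid-batch.
status = {"items": [{"id": 0, "state": "completed"},
                    {"id": 1, "state": "processing"}]}
print(newly_completed(status, already_downloaded=set()))
```

Tracking the set of already-downloaded ids is what lets a client start fetching finished items while the rest of the batch is still processing.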

AI Integration (Coming Soon)

MCP Server

An interface that lets any MCP-compatible AI assistant generate speech directly, without writing code.

What is MCP?

MCP (Model Context Protocol) is an open standard created by Anthropic that lets AI assistants use external tools. Think of it as USB for AI — any AI agent that supports MCP can plug into any MCP-compatible tool without custom integration code. Supported by Claude Code, Codex, Coworks, Cursor, Windsurf, and other major AI platforms.
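Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request. The envelope follows the MCP specification, but the tool name `generate_speech` and its arguments are hypothetical, since the AethonVoice MCP server has not been released.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and argument names, for illustration only.
request = tool_call_request(
    1,
    "generate_speech",
    {"text": "Read this paragraph aloud.", "voice": "Lyra"},
)
print(json.dumps(request, indent=2, ensure_ascii=False))
```

In practice the AI assistant constructs this message itself from the user's natural-language instruction, which is why no code is needed on the user's side.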

What This Means for You

Instead of writing API calls, you tell your AI assistant what you want:

Natural language
"Read this paragraph aloud using the Lyra voice"
"Generate Thai pronunciation for สวัสดีครับ"
"Create an audio version of this blog post"

Who Is This For

  • Content creators: Add audio to their work without touching code.
  • Educators: Generate pronunciation guides effortlessly.
  • Writers: Create audio versions of their text.
  • Anyone who wants TTS without learning an API.

Output

Output Format

A 24 kHz sample rate can represent frequencies up to 12 kHz (the Nyquist limit), which comfortably covers the band below roughly 8 kHz that carries most speech energy. At 96 kbps, MP3 achieves perceptual transparency for speech: the compressed audio is indistinguishable from the uncompressed original in listening tests.

For context, OpenAI TTS also outputs at 24 kHz, which reflects how widely 24 kHz is used for speech synthesis.
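The arithmetic behind these figures is simple enough to check directly. The sketch below assumes constant-bitrate encoding and ignores container headers and metadata, so real file sizes will differ slightly. The function names are illustrative.

```python
SAMPLE_RATE_HZ = 24_000
BITRATE_BPS = 96_000


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def mp3_size_bytes(duration_seconds, bitrate_bps=BITRATE_BPS):
    """Approximate size of constant-bitrate audio, ignoring headers and metadata."""
    return duration_seconds * bitrate_bps / 8


print(nyquist_hz(SAMPLE_RATE_HZ))        # 12000.0
print(mp3_size_bytes(3600) / 1_000_000)  # 43.2 (megabytes for one hour)
```

At these settings, a one-hour long-form file comes to roughly 43 MB.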

| Parameter | Value |
| --- | --- |
| Format | MP3 |
| Bitrate | 96 kbps |
| Sample rate | 24 kHz |
| Channels | Mono |
| Download | Signed URL (7-day expiry) |

Comparison

How We Compare

Among the services compared below, AethonVoice is the only one that combines multilingual mixing, paralinguistic expression, and an open-source model foundation.

| Feature | AethonVoice | ElevenLabs | Google TTS | Amazon Polly | Azure TTS | OpenAI TTS | PlayHT | Cartesia |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Voice quality | Excellent | Excellent | Good-Excellent | Good | Good | Good | Good | Good |
| Zero-shot cloning | Yes (instant) | Yes (instant + pro) | No | No | No | No | Yes | Yes |
| Languages | 21 curated (646 model) | 32+ | 40+ | 30+ | 100+ | ~57 | 142+ | 10+ |
| Multilingual mixing | Yes (auto split + merge) | Limited | No | No | No | No | No | No |
| Paralinguistic tags | Yes (pause, laugh, sigh, er) | No | Limited (SSML) | Limited (SSML) | SSML-based | No | No | No |
| Streaming | Planned | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Batch API | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Long-form (1hr+) | Yes | Yes | No (5000 char) | Yes | Yes | 4096 char limit | Yes | Yes |
| Open-source model | Yes (OmniVoice) | No | No | No | No | No | No | No |
| Credit expiration | Never | Monthly reset | Pay-per-use | Pay-per-use | Pay-per-use | Pay-per-use | Monthly reset | Monthly reset |

Roadmap

What's Coming Next

Streaming TTS

WebSocket endpoint for real-time audio streaming. Target: first audio byte in under 500ms. Enables voice agents, chatbots, and real-time translation.

Auto Emotion & Rhythm Detection

LLM pre-processing that automatically inserts paralinguistic tags. Submit plain text; AethonVoice adds the expressiveness. Configurable: minimal, natural, dramatic.

Additional Output Formats

WAV, OGG, FLAC support. Configurable bitrate (64–320 kbps) and sample rate (16–48 kHz).

Auto Language Detection

Remove the need to specify expected languages. AethonVoice will detect languages present in the text automatically.

SDKs

Python (pip install aethonvoice) and Node.js/TypeScript (npm install aethonvoice) client libraries.

Ready to Get Started?

Hear these features in action, see what it costs, or start building today.