Features — AethonVoice

Quality

Voice Quality

AethonVoice generates speech that is indistinguishable from human voice recordings in most contexts.

Metric	Value
WER (English)	1.30%
CER (Chinese)	0.84%
SIM-o Speaker Similarity	0.729

Quality is consistent across all 21 supported languages. There is no "tier 1 vs. tier 2" language split — Thai, Japanese, German, and Arabic all receive the same model attention and dedicated voice reference audio.

Voices

4 Built-In Voices

Each with a defined personality and dedicated reference recordings for all 21 supported languages.

Voice Personas — Aris, Nolan, Lyra, Senna

Voice	Gender	Personality	Best For
Aris	Male	Warm, steady, authoritative. A deep, grounded tone that conveys trust and expertise.	Narration, instruction, educational content, non-fiction audiobooks
Nolan	Male	Clear, friendly, upbeat. Approachable with natural energy — like a colleague explaining something over coffee.	Dialogue, conversation, customer-facing content, podcasts, chatbot voices
Lyra	Female	Gentle, expressive, emotionally nuanced. A voice that draws the listener in.	Storytelling, fiction audiobooks, emotional content, children's content, meditation
Senna	Female	Calm, articulate, professional. Polished and confident without being cold.	Corporate content, e-learning, presentations, professional narration, accessibility

Each voice maintains its personality and timbre across languages. Aris speaking Thai sounds like the same person as Aris speaking Japanese — the voice identity carries over while pronunciation and prosody adapt naturally.

Listen to all 4 voices

Voice Cloning Coming Soon

Zero-Shot Voice Cloning

The underlying OmniVoice model supports zero-shot voice cloning from a short audio sample (as little as 10–15 seconds). No fine-tuning, no training, no waiting.

1. Upload Reference: Upload a reference audio clip via the API.
2. Clone Instantly: AethonVoice reproduces the speaker's vocal characteristics.
3. All Languages: The cloned voice works across all 21 supported languages.
4. Voice Library: Cloned voices are stored in your account. Use, manage, and delete them at any time.

The 4 built-in voices (Aris, Nolan, Lyra, Senna) are available now and use this same zero-shot cloning technology internally, with dedicated per-language reference recordings for consistent quality.

Multilingual

Multilingual Mixing

Handle mixed-language text natively — multiple languages in a single sentence, spoken by one voice, with seamless transitions.

Example input

ฟังดีๆ นะ คำว่า <ja>コーヒー</ja> แปลว่า coffee

Automatic Detection

Detects boundaries automatically using Unicode character ranges — Thai script, Japanese kana, Korean hangul, CJK ideographs, Cyrillic, Arabic, Devanagari, Latin, and more. No tags needed.
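As a rough illustration of range-based detection, the sketch below splits text into runs by script. The code point ranges are standard Unicode blocks, but the script labels, the subset of scripts, and the rule that spaces and punctuation stay with the preceding run are assumptions made for this example, not a description of AethonVoice's internal splitter.

```python
import unicodedata

# Standard Unicode block ranges; the labels are illustrative only.
SCRIPT_RANGES = [
    (0x0E00, 0x0E7F, "thai"),
    (0x3040, 0x30FF, "kana"),        # Hiragana and Katakana
    (0xAC00, 0xD7AF, "hangul"),      # Hangul syllables
    (0x4E00, 0x9FFF, "cjk"),         # CJK Unified Ideographs
    (0x0400, 0x04FF, "cyrillic"),
    (0x0600, 0x06FF, "arabic"),
    (0x0900, 0x097F, "devanagari"),
    (0x0980, 0x09FF, "bengali"),
]

def script_of(ch):
    """Return a script label, or None for neutral characters."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    if "LATIN" in unicodedata.name(ch, ""):
        return "latin"
    return None  # spaces, digits, punctuation, unlisted scripts

def split_by_script(text):
    """Split text into (script, run) pairs; neutral characters join the current run."""
    segments = []  # each entry is [script, run_text]
    for ch in text:
        script = script_of(ch)
        if not segments:
            segments.append([script, ch])
        elif script is None or script == segments[-1][0]:
            segments[-1][1] += ch
        elif segments[-1][0] is None:
            # Leading neutral characters adopt the first real script.
            segments[-1][0] = script
            segments[-1][1] += ch
        else:
            segments.append([script, ch])
    return [(script, run) for script, run in segments]

for script, run in split_by_script("ฟังดีๆ นะ コーヒー coffee"):
    print(script, repr(run))
```

A "cjk" run on its own cannot tell Chinese from Japanese, which is why explicit tags exist for those pairs.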

Explicit Tags

For ambiguous pairs (e.g., Chinese and Japanese share CJK characters), wrap the target language in tags: <ja>テキスト</ja>. Tagged portions use the specified language.
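One way to picture how tagged text could be separated from untagged text is the sketch below. The tag syntax comes from the example above; the regular expression, the function name `split_language_tags`, and the use of `None` to mean "detect automatically" are assumptions for illustration.

```python
import re

# Matches <xx>...</xx> or <xxx>...</xxx> with the same code in both tags.
TAG_RE = re.compile(r"<([a-z]{2,3})>(.*?)</\1>", re.DOTALL)

def split_language_tags(text):
    """Return (language_or_None, chunk) pairs in document order."""
    parts = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            parts.append((None, text[pos:match.start()]))  # untagged text
        parts.append((match.group(1), match.group(2)))      # tagged text
        pos = match.end()
    if pos < len(text):
        parts.append((None, text[pos:]))
    return parts

for lang, chunk in split_language_tags("ฟังดีๆ นะ คำว่า <ja>コーヒー</ja> แปลว่า coffee"):
    print(lang, repr(chunk))
```

Chunks labelled `None` would then go through automatic script detection, while tagged chunks keep the language the author specified.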

LLM-Assisted Edge Cases

Ambiguous digits and orphan segments are resolved by a lightweight LLM call. This adds about 1–2 seconds, but it triggers only when genuine ambiguity exists; most text is split without it.

Expression

Paralinguistic Expression

Real speech includes pauses, laughter, hesitation, and sighs. AethonVoice supports inline tags that insert these natural sound effects.

Tag	Effect
[pause]	Natural silence between phrases
[laugh]	Natural laughter
[sigh]	Gentle sigh
[er]	Hesitation / filler sound

Example

So I was thinking [pause] maybe we should [laugh] just go for it

Each voice has its own set of pre-recorded paralinguistic clips. Clips are randomly selected and seamlessly blended into the surrounding speech. The result: audio that sounds like a person speaking naturally, not a machine reading text.
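To make the tag handling concrete, here is a minimal sketch that separates spoken text from the four tags listed above. The token format and the choice to treat unknown bracketed words as ordinary speech are assumptions for this example; the actual parser may behave differently.

```python
import re

PARA_TAGS = {"pause", "laugh", "sigh", "er"}  # tags from the table above
TAG_RE = re.compile(r"\[(\w+)\]")

def tokenize(text):
    """Split text into ("speech", str) and ("tag", name) tokens."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.group(1) not in PARA_TAGS:
            continue  # unknown bracketed text stays part of the speech
        speech = text[pos:match.start()].strip()
        if speech:
            tokens.append(("speech", speech))
        tokens.append(("tag", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        tokens.append(("speech", tail))
    return tokens

for kind, value in tokenize("So I was thinking [pause] maybe we should [laugh] just go for it"):
    print(kind, value)
```

Each "speech" token would be synthesized normally, and each "tag" token would be replaced by one of the voice's pre-recorded clips.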

Languages

21 Languages, 24 Locale Variants

Each locale has dedicated voice clone reference audio for all 4 built-in voices (96 reference pairs total), verified pronunciation quality, and tested language detection.

Language	Code	Locale(s)	Script Detection
English	en	en-US, en-GB	Latin
Thai	th	th-TH	Thai script
Japanese	ja	ja-JP	Hiragana/Katakana + CJK
Korean	ko	ko-KR	Hangul + CJK
Chinese (Mandarin)	zh	zh-CN, zh-TW	CJK
Cantonese	yue	yue-HK	CJK
French	fr	fr-FR	Latin
German	de	de-DE	Latin
Spanish	es	es-ES	Latin
Italian	it	it-IT	Latin
Portuguese	pt	pt-BR, pt-PT	Latin
Russian	ru	ru-RU	Cyrillic
Vietnamese	vi	vi-VN	Latin (extended)
Turkish	tr	tr-TR	Latin
Indonesian	id	id-ID	Latin
Malay	ms	ms-MY	Latin
Hindi	hi	hi-IN	Devanagari
Arabic	ar	ar-SA	Arabic script
Bengali	bn	bn-IN	Bengali script
Persian (Farsi)	fa	fa-IR	Arabic (extended)
Urdu	ur	ur-PK	Arabic (extended)
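
The language, locale, and reference-pair counts can be checked against the rows of the table. The dictionary below simply transcribes the Code and Locale(s) columns; the variable names are illustrative.

```python
# Language code -> locales, transcribed from the table above.
LOCALES = {
    "en": ["en-US", "en-GB"], "th": ["th-TH"], "ja": ["ja-JP"],
    "ko": ["ko-KR"], "zh": ["zh-CN", "zh-TW"], "yue": ["yue-HK"],
    "fr": ["fr-FR"], "de": ["de-DE"], "es": ["es-ES"], "it": ["it-IT"],
    "pt": ["pt-BR", "pt-PT"], "ru": ["ru-RU"], "vi": ["vi-VN"],
    "tr": ["tr-TR"], "id": ["id-ID"], "ms": ["ms-MY"], "hi": ["hi-IN"],
    "ar": ["ar-SA"], "bn": ["bn-IN"], "fa": ["fa-IR"], "ur": ["ur-PK"],
}
VOICES = ["Aris", "Nolan", "Lyra", "Senna"]

language_count = len(LOCALES)
locale_count = sum(len(locales) for locales in LOCALES.values())
reference_pairs = locale_count * len(VOICES)  # one reference per voice per locale

print(language_count, locale_count, reference_pairs)  # 21 24 96
```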

The underlying OmniVoice model supports 646 languages, with 82 languages achieving CER of 5% or less. AethonVoice curates the 21 above where quality has been verified end-to-end with dedicated reference audio. Additional languages can be added by providing reference audio and running quality checks.

Long-Form

Long-Form Generation

Generate up to 1 hour of continuous audio from a single submission (approximately 60,000 characters per request).

AethonVoice automatically splits long text at sentence and paragraph boundaries, generates each segment with the same voice, and concatenates everything with crossfade to eliminate audible seams. The output is one continuous MP3 file.
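The split-and-crossfade idea can be sketched in a few lines. This is an illustration under stated assumptions, not the production pipeline: the sentence regex only handles text punctuated with `.`, `!`, or `?`, the `max_chars` limit and linear fade are arbitrary choices, and audio is represented as plain lists of float samples.

```python
import re

def split_into_chunks(text, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending `overlap` samples at the seam."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    start = len(a) - overlap
    blended = []
    for i in range(overlap):
        weight_b = (i + 1) / (overlap + 1)  # ramps up across the seam
        blended.append(a[start + i] * (1 - weight_b) + b[i] * weight_b)
    return a[:start] + blended + b[overlap:]

print(split_into_chunks("One. Two. Three.", max_chars=9))
print(crossfade([1.0] * 4, [0.0] * 4, overlap=2))
```

Because the overlapping samples are blended rather than appended, the joined audio is shorter than the two segments combined by exactly `overlap` samples.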

Long-form vs. Batch API: Long-form produces one continuous audio file from one long text. The Batch API produces many separate audio files from many separate texts.

Use Cases

Audiobooks: Convert entire chapters into natural, continuous speech

Podcasts: Generate full episodes from scripts

Educational content: Create complete lessons with a consistent voice

Accessibility: Convert articles and reports to audio

Scale

Batch API

Submit hundreds of TTS items in a single request. Each item is tracked independently — start downloading completed items before the full batch finishes.

Use Cases

Generate vocabulary pronunciations for a language course (100+ words per batch)
Produce audio for an entire flashcard deck
Create multiple narration variants with different voices

How It Works

1. Submit an array of items, each with its own text, language, and voice
2. AethonVoice processes items in parallel (configurable concurrency per GPU)
3. Poll batch status to see per-item progress
4. Download each item's MP3 as it completes

Each item in a batch can use its own voice and language.
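
The submit-then-poll workflow might look roughly like the sketch below. Every name here is hypothetical: the payload fields (`items`, `text`, `language`, `voice`), the status values (`pending`, `done`, `failed`), and the `get_status` callback stand in for whatever the real API defines, so treat this as a shape to adapt, not as the AethonVoice client.

```python
import time

def build_batch(items):
    """Build a request body from (text, language, voice) tuples (hypothetical schema)."""
    return {"items": [
        {"id": i, "text": text, "language": language, "voice": voice}
        for i, (text, language, voice) in enumerate(items)
    ]}

def poll_until_done(get_status, on_item_ready, interval=2.0, sleep=time.sleep):
    """Poll until every item is done or failed.

    get_status: caller-supplied function returning a list of dicts with
    "id", "state", and "url" keys. on_item_ready is called once per item,
    as soon as that item is first seen in the "done" state.
    """
    seen = set()
    while True:
        statuses = get_status()
        for item in statuses:
            if item["state"] == "done" and item["id"] not in seen:
                seen.add(item["id"])
                on_item_ready(item)  # e.g. download the MP3 right away
        if all(item["state"] in ("done", "failed") for item in statuses):
            return seen
        sleep(interval)

payload = build_batch([
    ("Hello", "en", "Nolan"),
    ("สวัสดีครับ", "th", "Aris"),
])
print(payload)
```

Passing `get_status` in as a function keeps the polling logic independent of any particular HTTP library and makes it easy to test with canned responses.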

AI Integration Coming Soon

MCP Server

An interface that lets any MCP-compatible AI assistant generate speech directly, without writing code.

What is MCP?

MCP (Model Context Protocol) is an open standard created by Anthropic that lets AI assistants use external tools. Think of it as USB for AI — any AI agent that supports MCP can plug into any MCP-compatible tool without custom integration code. Supported by Claude Code, Codex, Coworks, Cursor, Windsurf, and other major AI platforms.
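
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a message. The envelope follows the MCP specification, but the tool name `generate_speech` and its arguments are hypothetical, since the AethonVoice MCP server has not been released.

```python
import json

def tools_call_request(request_id, tool_name, arguments):
    """Build an MCP tools/call request (JSON-RPC 2.0 envelope)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

request = tools_call_request(
    1,
    "generate_speech",  # hypothetical tool name
    {"text": "สวัสดีครับ", "language": "th", "voice": "Lyra"},  # hypothetical arguments
)
print(json.dumps(request, ensure_ascii=False, indent=2))
```

In practice the AI assistant constructs this message itself; the user only writes the natural-language request.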

What This Means for You

Instead of writing API calls, you tell your AI assistant what you want:

Natural language

"Read this paragraph aloud using the Lyra voice"
"Generate Thai pronunciation for สวัสดีครับ"
"Create an audio version of this blog post"

Who Is This For

Content Creators

Add audio to their work without touching code.

Educators

Generate pronunciation guides effortlessly.

Writers

Create audio versions of their text.

Anyone

Who wants TTS without learning an API.

Output

Output Format

At a 24 kHz sample rate, the Nyquist limit is 12 kHz, which covers the range where nearly all speech energy lies (most of it below about 8 kHz). At 96 kbps, MP3 compression is perceptually transparent for mono speech: in listening tests, the compressed audio is indistinguishable from the uncompressed original.

For context, OpenAI TTS also outputs at 24 kHz, a sample rate widely used in speech synthesis.

Parameter	Value
Format	MP3
Bitrate	96 kbps
Sample rate	24 kHz
Channels	Mono
Download	Signed URL (7-day expiry)
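
The numbers in the table translate into bandwidth and file size as follows. This is back-of-the-envelope arithmetic that assumes constant-bitrate MP3 with no container overhead, and 16-bit samples for the uncompressed comparison (the bit depth is an assumption, not a published figure).

```python
SAMPLE_RATE_HZ = 24_000
BITRATE_BPS = 96_000
PCM_BYTES_PER_SAMPLE = 2  # assumed 16-bit mono PCM for comparison

# Highest frequency a given sample rate can represent.
nyquist_hz = SAMPLE_RATE_HZ / 2

def mp3_size_bytes(seconds, bitrate_bps=BITRATE_BPS):
    """Approximate constant-bitrate MP3 size, ignoring headers and tags."""
    return seconds * bitrate_bps / 8

def pcm_size_bytes(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Size of uncompressed mono PCM at the assumed bit depth."""
    return seconds * sample_rate_hz * PCM_BYTES_PER_SAMPLE

one_hour = 3600
print(nyquist_hz)                      # 12000.0
print(mp3_size_bytes(one_hour) / 1e6)  # 43.2 (MB)
print(pcm_size_bytes(one_hour) / 1e6)  # 172.8 (MB)
```

Under these assumptions, a one-hour long-form file is about 43 MB, roughly a quarter of the size of the same audio as uncompressed PCM.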

Comparison

How We Compare

AethonVoice is the only service that combines multilingual mixing, paralinguistic expression, and an open-source model foundation.

Feature	AethonVoice	ElevenLabs	Google TTS	Amazon Polly	Azure TTS	OpenAI TTS	PlayHT	Cartesia
Voice quality	Excellent	Excellent	Good-Excellent	Good	Good	Good	Good	Good
Zero-shot cloning	Yes (instant)	Yes (instant + pro)	No	No	No	No	Yes	Yes
Languages	21 curated (646 model)	32+	40+	30+	100+	~57	142+	10+
Multilingual mixing	Yes (auto split + merge)	Limited	No	No	No	No	No	No
Paralinguistic tags	Yes (pause, laugh, sigh, er)	No	Limited (SSML)	Limited (SSML)	SSML-based	No	No	No
Streaming	Planned	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Batch API	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes
Long-form (1hr+)	Yes	Yes	No (5000 char)	Yes	Yes	4096 char limit	Yes	Yes
Open-source model	Yes (OmniVoice)	No	No	No	No	No	No	No
Credit expiration	Never	Monthly reset	Pay-per-use	Pay-per-use	Pay-per-use	Pay-per-use	Monthly reset	Monthly reset

See pricing comparison

Roadmap

What's Coming Next

Streaming TTS

WebSocket endpoint for real-time audio streaming. Target: first audio byte in under 500ms. Enables voice agents, chatbots, and real-time translation.

Auto Emotion & Rhythm Detection

LLM pre-processing that automatically inserts paralinguistic tags. Submit plain text; AethonVoice adds the expressiveness. Configurable: minimal, natural, dramatic.

Additional Output Formats

WAV, OGG, FLAC support. Configurable bitrate (64–320 kbps) and sample rate (16–48 kHz).

Auto Language Detection

Remove the need to specify expected languages. AethonVoice will detect languages present in the text automatically.

SDKs

Python (pip install aethonvoice) and Node.js/TypeScript (npm install aethonvoice) client libraries.

Ready to Get Started?

Hear these features in action, see what it costs, or start building today.

Audio Demos | Pricing | How It Works