The Technology Behind AethonVoice

AethonVoice is built on OmniVoice, the most capable open-source text-to-speech model released to date. This page explains what OmniVoice is, what it can do, and what AethonVoice adds on top.

Foundation

OmniVoice: The Foundation

OmniVoice was published in April 2026 by the k2-fsa research group as a fully open-source project: model weights, training code, and inference pipeline are all publicly available.

Benchmarks

OmniVoice reports state-of-the-art results on the major TTS metrics listed below and matches or exceeds commercial systems on multilingual tasks.

Result | Benchmark / Condition | What It Measures
1.30% WER | LibriSpeech-PC (English) | Word Error Rate: how accurately generated speech reproduces the input text
0.84% CER | Seed-TTS (Chinese) | Character Error Rate: intelligibility on Chinese text
0.729 SIM-o | LibriSpeech-PC | Speaker Similarity: how closely the voice matches the reference speaker
0.032 RTF | 16-step inference | Real-Time Factor: 1 second of audio generated in 32 ms (about 30x real time)
646 languages | Training coverage | Languages represented in the training data
82 languages | CER ≤ 5% | Languages with verified high-quality output
~3.0 GB | Model size (fp16) | Compact enough for fast cold starts on GPU

[Figure: benchmark chart]
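
The real-time factor in the table can be unpacked with a few lines of arithmetic. The sketch below assumes the usual definition (RTF = compute time divided by audio duration); the function names are illustrative, and the one-hour figure covers model inference only, not the rest of the pipeline.

```python
def realtime_speedup(rtf: float) -> float:
    """Speed-up over real time is the reciprocal of the real-time factor."""
    return 1.0 / rtf


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to synthesize `audio_seconds` of audio."""
    return audio_seconds * rtf


RTF = 0.032  # value quoted in the benchmark table

print(f"{generation_seconds(1.0, RTF) * 1000:.0f} ms per second of audio")  # 32 ms
print(f"{realtime_speedup(RTF):.2f}x real time")                            # 31.25x
print(f"{generation_seconds(3600, RTF):.1f} s for one hour of audio")       # 115.2 s
```

The exact reciprocal is 31.25x, which the page rounds to 30x.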

For context: ElevenLabs and MiniMax are the closest commercial competitors on multilingual benchmarks. OmniVoice matches or exceeds both while being fully open-source.

Architecture

OmniVoice introduces a discrete non-autoregressive (NAR) architecture in the style of a diffusion language model, with several innovations:

Direct Text-to-Acoustic Mapping

Skips the intermediate semantic token stage. Maps text directly to multi-codebook acoustic tokens, eliminating quality loss from two-stage pipelines.

Full-Codebook Random Masking

Tokens are masked across all 8 codebooks simultaneously, yielding 50% of tokens for loss computation, which is dramatically more efficient than per-layer masking.
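
One way to read the 50% figure: if a mask ratio is drawn uniformly at random for each training example and applied across the whole frames-by-codebooks grid, then on average half of all tokens are masked and contribute to the loss. The sketch below illustrates that reading only. The uniform ratio schedule and the function names are assumptions, not details taken from the OmniVoice paper or code.

```python
import random

NUM_CODEBOOKS = 8  # from the description above


def full_codebook_mask(num_frames, rng):
    """Return a frames x codebooks grid of booleans; True means masked (counts toward the loss)."""
    ratio = rng.random()  # ASSUMPTION: mask ratio drawn uniformly from [0, 1)
    return [
        [rng.random() < ratio for _ in range(NUM_CODEBOOKS)]
        for _ in range(num_frames)
    ]


def masked_fraction(mask):
    """Fraction of all tokens in the grid that are masked."""
    total = sum(len(row) for row in mask)
    return sum(cell for row in mask for cell in row) / total


rng = random.Random(0)
fractions = [masked_fraction(full_codebook_mask(100, rng)) for _ in range(2000)]
print(f"mean masked fraction: {sum(fractions) / len(fractions):.2f}")
```

Because every codebook participates in the mask, each training example can supply loss signal from any position in the grid rather than from a single codebook layer.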

LLM Initialization

Backbone initialized from Qwen3-0.6B pre-trained weights. The first NAR TTS model to leverage LLM pre-training for superior text understanding.

Higgs-Audio Tokenizer

Extracts 8-codebook acoustic tokens and reconstructs high-fidelity audio, providing the representation layer between text and sound.

Training Data

581,000 hours of audio
646 languages

OmniVoice was trained on multilingual audio curated from open-source datasets. This is the broadest language coverage of any TTS model to date. Multilingual capability is not bolted on — it is foundational to the architecture.

Value Add

What AethonVoice Adds

OmniVoice is a powerful model, but a model is not a product. AethonVoice adds the production-grade features that turn raw TTS capability into a usable service.

Capability | OmniVoice Alone | AethonVoice
Generate speech from text | Yes (one language at a time) | Yes, with multilingual mixing in one utterance
Voice cloning | Yes (manual reference setup) | Yes, via dashboard: upload audio, get a cloned voice
Mixed-language text | No (must split manually) | Automatic language detection and splitting
Paralinguistic expression | No | Pauses, laughter, sighs, hesitations via inline tags
Long-form audio | No (limited to short segments) | Up to 1 hour of continuous audio per submission
Batch processing | No | Submit hundreds of items, track per-item progress
API access | No (Python code only) | REST API with async job pattern
MCP Server | No | AI assistants can generate speech directly
Post-processing | No (raw model output) | VAD trimming, crossfade, silence normalization
Pipeline

The Audio Pipeline

When you submit text to AethonVoice, here is what happens.

[Figure: audio pipeline]

1. Language Splitting

Mixed-language text is split into per-language segments. Explicit tags (<ja>...</ja>) are extracted first. Remaining text is split by Unicode character ranges. Ambiguous boundaries are resolved by a lightweight LLM call.
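
The first two stages of that split (explicit tags, then Unicode ranges) can be sketched as follows. This is a minimal illustration, not AethonVoice's implementation: the script-to-language table, the two-letter tag pattern, and all function names are assumptions, and the LLM step for ambiguous boundaries is left out.

```python
import re

# Matches explicit tags such as <ja>...</ja> (two-letter codes assumed).
TAG_RE = re.compile(r"<([a-z]{2})>(.*?)</\1>", re.DOTALL)

# ASSUMED script-to-language table; real text needs finer rules.
SCRIPT_RANGES = [
    (0x3040, 0x30FF, "ja"),  # Hiragana and Katakana
    (0x4E00, 0x9FFF, "zh"),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "ko"),  # Hangul Syllables
]


def script_lang(ch, default="en"):
    """Language guess for one character, or None for neutral characters."""
    cp = ord(ch)
    for lo, hi, lang in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return lang
    return default if ch.isalpha() else None  # spaces, digits, punctuation are neutral


def split_by_script(text):
    """Split untagged text wherever the guessed language changes."""
    segments, current, buf = [], None, []
    for ch in text:
        lang = script_lang(ch)
        if lang is not None and current is not None and lang != current:
            segments.append((current, "".join(buf)))
            buf = []
        if lang is not None:
            current = lang
        buf.append(ch)  # neutral characters stay with the segment in progress
    if buf:
        segments.append((current or "en", "".join(buf)))
    return segments


def split_languages(text):
    """Extract explicit tags first, then split the remaining text by script."""
    segments, pos = [], 0
    for m in TAG_RE.finditer(text):
        segments += split_by_script(text[pos:m.start()])
        segments.append((m.group(1), m.group(2)))
        pos = m.end()
    segments += split_by_script(text[pos:])
    return [(lang, seg.strip()) for lang, seg in segments if seg.strip()]


print(split_languages("Hello <ja>東京</ja> and 你好"))
# [('en', 'Hello'), ('ja', '東京'), ('en', 'and'), ('zh', '你好')]
```

Without the explicit tag, 東京 would fall in the CJK ideograph range and be guessed as Chinese. That is the kind of ambiguous boundary a range-based split cannot settle on its own.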

2. Paralinguistic Parsing

Inline tags like [pause], [laugh], [sigh], and [er] are extracted. Speech portions go to the TTS model. Tags are replaced with pre-recorded clips from the voice's sound bank.
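
A parser for these inline tags might look like the sketch below. It is illustrative only: the tag set is limited to the four tags named above, and the output format (a list of "speech" and "clip" items) and the function name are assumptions.

```python
import re

KNOWN_TAGS = {"pause", "laugh", "sigh", "er"}  # the tags named in the text above
TAG_RE = re.compile(r"\[(\w+)\]")


def parse_paralinguistic(text):
    """Split text into ordered ("speech", str) and ("clip", tag) items."""
    items, pos = [], 0
    for m in TAG_RE.finditer(text):
        if m.group(1) not in KNOWN_TAGS:
            continue  # unknown bracketed words stay in the speech text
        speech = text[pos:m.start()].strip()
        if speech:
            items.append(("speech", speech))  # would be sent to the TTS model
        items.append(("clip", m.group(1)))    # would be looked up in the sound bank
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        items.append(("speech", tail))
    return items


print(parse_paralinguistic("Well [er] I think so. [pause] Yes! [laugh]"))
# [('speech', 'Well'), ('clip', 'er'), ('speech', 'I think so.'),
#  ('clip', 'pause'), ('speech', 'Yes!'), ('clip', 'laugh')]
```

Keeping the items in order lets a later stage interleave generated speech and pre-recorded clips exactly as written.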

3. TTS Generation

Each segment is generated using OmniVoice with a language-appropriate voice clone prompt. The same voice identity is maintained across language switches. Generation runs on GPU at 30x real-time speed.

4. Post-Processing

VAD trimming removes silence. A 40 ms fade at segment boundaries prevents clicks. Clips are blended with equal-power crossfades of randomized length (50–120 ms). A 100 ms gap is inserted between language segments for natural pacing.
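
An equal-power crossfade is commonly built from cosine and sine gain curves, whose squares sum to 1 at every point of the overlap. The sketch below uses that standard formulation on plain Python lists. Whether AethonVoice uses exactly this curve is an assumption, and the function names are illustrative.

```python
import math
import random

SAMPLE_RATE = 24_000  # Hz, matching the delivery format described on this page


def equal_power_crossfade(a, b, overlap):
    """Blend the tail of `a` into the head of `b` over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    mixed = []
    for i in range(overlap):
        theta = (i + 0.5) / overlap * math.pi / 2
        fade_out, fade_in = math.cos(theta), math.sin(theta)  # squares sum to 1
        mixed.append(a[len(a) - overlap + i] * fade_out + b[i] * fade_in)
    return list(a[:len(a) - overlap]) + mixed + list(b[overlap:])


def random_overlap_samples(rng, lo_ms=50, hi_ms=120):
    """Pick a crossfade length in samples from the 50-120 ms range quoted above."""
    return int(rng.uniform(lo_ms, hi_ms) * SAMPLE_RATE / 1000)


rng = random.Random(0)
overlap = random_overlap_samples(rng)
clip_a = [0.5] * SAMPLE_RATE  # one second of constant signal as stand-in audio
clip_b = [0.5] * SAMPLE_RATE
joined = equal_power_crossfade(clip_a, clip_b, overlap)
print(len(joined) == 2 * SAMPLE_RATE - overlap)  # True
```

Equal-power curves keep perceived loudness roughly constant when the two clips are uncorrelated, which is why they are often preferred over a linear fade for joining separate recordings.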

5. Encode & Deliver

The final audio is encoded to MP3 (96 kbps, 24 kHz, mono) and uploaded to cloud storage. A signed download URL is returned, valid for 7 days.
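
For a rough sense of what these settings mean in practice, the sketch below estimates file size at 96 kbps and computes when a 7-day link lapses. It assumes constant-bitrate encoding and ignores MP3 header and tag overhead; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

BITRATE_KBPS = 96  # from the encoding settings above
URL_TTL_DAYS = 7   # signed URL validity quoted above


def mp3_size_bytes(duration_seconds, bitrate_kbps=BITRATE_KBPS):
    """Approximate size of a constant-bitrate MP3 (1 kbps = 1000 bits per second)."""
    return duration_seconds * bitrate_kbps * 1000 / 8


def url_expiry(issued_at, ttl_days=URL_TTL_DAYS):
    """Moment at which a link issued at `issued_at` stops working."""
    return issued_at + timedelta(days=ttl_days)


print(f"{mp3_size_bytes(60) / 1e6:.2f} MB per minute")   # 0.72 MB
print(f"{mp3_size_bytes(3600) / 1e6:.1f} MB per hour")   # 43.2 MB
print(url_expiry(datetime(2026, 1, 1, tzinfo=timezone.utc)).date())  # 2026-01-08
```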

Open Source

Why Open Source Matters

AethonVoice's open-source foundation is not a marketing angle — it has direct, practical consequences.

No Vendor Lock-In

The model weights are public. If AethonVoice disappeared tomorrow, the underlying technology would still exist. Compare this to ElevenLabs or OpenAI, where the model is proprietary and inaccessible.

Verifiable Quality

Every benchmark number on this page links to a published paper and reproducible evaluation. You do not have to take our word for it.

No Per-Minute Licensing

Cloud TTS providers charge premium rates because their models are proprietary. AethonVoice pays only for GPU compute; the model itself is free. This is why we can offer $0.015/min instead of $0.03–0.12/min.

Continuous Improvement

OmniVoice is actively developed by the open-source community. As the model improves, AethonVoice improves — without waiting for a vendor to ship updates.

Infrastructure

[Figure: architecture diagram]

Layer | Technology | Purpose
API | Firebase Functions (Node.js) | HTTP endpoints, language splitting, job management
TTS Worker | Python + GPU (RunPod) | OmniVoice inference, post-processing, encoding
Storage | Google Cloud Storage | Generated audio files (signed URLs, 7-day TTL)
Job Tracking | Firestore | Job status, metadata

The API layer and TTS worker are decoupled. The API handles text processing and job orchestration. The GPU worker handles only TTS generation. This separation allows independent scaling — adding GPU capacity does not require changing the API layer.
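
The decoupling described above follows a common submit-and-poll pattern. The sketch below imitates it in a single process with an in-memory dictionary and queue standing in for the job store and the hand-off to the worker. Everything here is hypothetical: the status names, field names, function names, and placeholder URL are illustrative and do not describe the actual AethonVoice API.

```python
import queue
import uuid

jobs = {}                   # stand-in for the job-tracking store
work_queue = queue.Queue()  # stand-in for whatever hands jobs to the GPU worker


def submit_job(text):
    """API side: record the job, enqueue it, and return an id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "text": text, "audio_url": None}
    work_queue.put(job_id)
    return job_id


def get_job(job_id):
    """API side: clients poll this until the status is 'done'."""
    job = jobs[job_id]
    return {"status": job["status"], "audio_url": job["audio_url"]}


def worker_step():
    """Worker side: take one job, process it, and write the result back."""
    job_id = work_queue.get_nowait()
    jobs[job_id]["status"] = "processing"
    # Real TTS inference would run here; this sketch only records a placeholder URL.
    jobs[job_id]["audio_url"] = f"https://example.invalid/audio/{job_id}.mp3"
    jobs[job_id]["status"] = "done"
    return job_id


job_id = submit_job("Hello there")
print(get_job(job_id)["status"])  # queued
worker_step()
print(get_job(job_id)["status"])  # done
```

Because the API functions touch only the store and the queue, more workers can be added to drain the queue faster without any change on the API side.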

See What You Can Build

Explore the full feature set or learn how to integrate AethonVoice into your workflow.