Technology — AethonVoice

Foundation

OmniVoice: The Foundation

Published in April 2026 by the k2-fsa research group as a fully open-source project — model weights, training code, and inference pipeline are all publicly available.

Paper Model on Hugging Face GitHub Official Demos

Benchmarks

OmniVoice sets new benchmarks across every major TTS metric, outperforming commercial systems on multilingual tasks.

Metric	Score	What It Measures
1.30% WER	LibriSpeech-PC (English)	Word Error Rate — how accurately generated speech reproduces input text
0.84% CER	Seed-TTS (Chinese)	Character Error Rate — intelligibility on Chinese text
0.729 SIM-o	LibriSpeech-PC	Speaker Similarity — how closely the voice matches the reference speaker
RTF 0.032	16-step inference	Real-Time Factor — 1 second of audio in 32ms (30x real-time)
646 languages	Training coverage	Languages represented in training data
82 languages	CER ≤ 5%	Languages with verified high-quality output
~3.0 GB	Model size (fp16)	Compact enough for fast cold starts on GPU

For context: ElevenLabs and MiniMax are the closest commercial competitors on multilingual benchmarks. OmniVoice matches or exceeds both while being fully open-source.

Architecture

OmniVoice introduces a diffusion language model-style discrete non-autoregressive (NAR) architecture with several innovations:

Direct Text-to-Acoustic Mapping

Skips the intermediate semantic token stage. Maps text directly to multi-codebook acoustic tokens, eliminating quality loss from two-stage pipelines.

Full-Codebook Random Masking

Tokens masked across all 8 codebooks simultaneously, yielding 50% of tokens for loss computation — dramatically more efficient than per-layer masking.

LLM Initialization

Backbone initialized from Qwen3-0.6B pre-trained weights. The first NAR TTS model to leverage LLM pre-training for superior text understanding.

Higgs-Audio Tokenizer

Extracts 8-codebook acoustic tokens and reconstructs high-fidelity audio, providing the representation layer between text and sound.

Training Data

581,000

Hours of Audio

646

Languages

OmniVoice was trained on multilingual audio curated from open-source datasets. This is the broadest language coverage of any TTS model to date. Multilingual capability is not bolted on — it is foundational to the architecture.

Value Add

What AethonVoice Adds

OmniVoice is a powerful model, but a model is not a product. AethonVoice adds the production-grade features that turn raw TTS capability into a usable service.

Capability	OmniVoice Alone	AethonVoice
Generate speech from text	Yes (one language at a time)	Yes, with multilingual mixing in one utterance
Voice cloning	Yes (manual reference setup)	Yes, via dashboard — upload audio, get cloned voice
Mixed-language text	No — must split manually	Automatic language detection and splitting
Paralinguistic expression	No	Pause, laughter, sighs, hesitations via inline tags
Long-form audio	No — limited to short segments	Up to 1 hour continuous audio per submission
Batch processing	No	Submit hundreds of items, track per-item progress
API access	No — Python code only	REST API with async job pattern
MCP Server	No	AI assistants can generate speech directly
Post-processing	No — raw model output	VAD trimming, crossfade, silence normalization

Pipeline

The Audio Pipeline

When you submit text to AethonVoice, here is what happens.

1

Language Splitting

Mixed-language text is split into per-language segments. Explicit tags (<ja>...</ja>) are extracted first. Remaining text is split by Unicode character ranges. Ambiguous boundaries are resolved by a lightweight LLM call.

2

Paralinguistic Parsing

Inline tags like [pause], [laugh], [sigh], and [er] are extracted. Speech portions go to the TTS model. Tags are replaced with pre-recorded clips from the voice's sound bank.

3

TTS Generation

Each segment is generated using OmniVoice with a language-appropriate voice clone prompt. The same voice identity is maintained across language switches. Generation runs on GPU at 30x real-time speed.

4

Post-Processing

VAD trim removes silence. 40ms fade at boundaries prevents clicks. Crossfade blends clips using equal-power curves (50–120ms randomized). 100ms gaps between language segments for natural pacing.

5

Encode & Deliver

Final audio encoded to MP3 (96 kbps, 24 kHz, mono) and uploaded to cloud storage. A signed download URL is returned, valid for 7 days.

Open Source

Why Open Source Matters

AethonVoice's open-source foundation is not a marketing angle — it has direct, practical consequences.

No Vendor Lock-In

The model weights are public. If AethonVoice disappeared tomorrow, the underlying technology would still exist. Compare this to ElevenLabs or OpenAI, where the model is proprietary and inaccessible.

Verifiable Quality

Every benchmark number on this page links to a published paper and reproducible evaluation. You do not have to take our word for it.

No Per-Minute Licensing

Cloud TTS providers charge premium rates because their model is proprietary. AethonVoice pays only for GPU compute — the model is free. This is why we offer $0.015/min instead of $0.03–0.12.

Continuous Improvement

OmniVoice is actively developed by the open-source community. As the model improves, AethonVoice improves — without waiting for a vendor to ship updates.

Infrastructure

Layer	Technology	Purpose
API	Firebase Functions (Node.js)	HTTP endpoints, language splitting, job management
TTS Worker	Python + GPU (RunPod)	OmniVoice inference, post-processing, encoding
Storage	Google Cloud Storage	Generated audio files (signed URLs, 7-day TTL)
Job Tracking	Firestore	Job status, metadata

The API layer and TTS worker are decoupled. The API handles text processing and job orchestration. The GPU worker handles only TTS generation. This separation allows independent scaling — adding GPU capacity does not require changing the API layer.

See What You Can Build

Explore the full feature set or learn how to integrate AethonVoice into your workflow.

Explore Features How It Works

The Technology Behind AethonVoice