Research
An end-to-end speech-to-speech foundation model, trained on NVIDIA H100 GPUs.
Bulut Labs' core research program is the design, pre-training, and deployment of BL-Voice-1, a multilingual speech-to-speech foundation model that fuses ASR, dialog reasoning, and expressive TTS into a single decoder-only transformer. We do not chain third-party APIs. We train, optimize, and serve our own weights on NVIDIA accelerated computing.
01 — Architecture
BL-Voice-1 model architecture
┌─────────────────────────────────────────────────────────┐
│ BL-Voice-1 · 7B parameters                              │
│ decoder-only · interleaved audio + text tokens          │
├─────────────────────────────────────────────────────────┤
│ Audio In → Mimi-style 12.5 Hz neural codec → RVQ        │
│                        ↓                                │
│ Interleaved [audio | text] stream                       │
│                        ↓                                │
│ 32-layer Transformer · 32 heads · d_model 4096          │
│ RoPE · GQA · FlashAttention-3                           │
│                        ↓                                │
│ dual head: text logits + audio RVQ logits               │
│                        ↓                                │
│ Streaming decoder · 80 ms frames                        │
│                        ↓                                │
│ Audio Out ← Codec decoder ← Speculative decoding        │
└─────────────────────────────────────────────────────────┘
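The token stream in the diagram follows directly from the codec numbers: a 12.5 Hz codec emits one frame every 80 ms, and 8 RVQ codebooks per frame put 100 discrete audio tokens per second of speech into the interleaved stream. A minimal arithmetic sketch (all constants from the diagram above):

```python
# Token-rate arithmetic implied by the diagram: a 12.5 Hz codec emits
# one frame every 80 ms, and with 8 RVQ codebooks each frame contributes
# 8 discrete audio tokens to the interleaved [audio | text] stream.
CODEC_HZ = 12.5   # codec frames per second of audio
CODEBOOKS = 8     # residual vector-quantizer depth

frame_ms = 1000 / CODEC_HZ                  # 80.0 ms per frame
audio_tokens_per_s = CODEC_HZ * CODEBOOKS   # 100 audio tokens / second

print(frame_ms, audio_tokens_per_s)  # 80.0 100.0
```

This is why the streaming decoder works in 80 ms frames: one codec frame per decoding step.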
Key design decisions
- Single decoder instead of a cascaded STT → LLM → TTS pipeline, eliminating ~400 ms of pipeline latency
- Neural audio codec at 12.5 Hz / 8 codebooks for high-fidelity streaming reconstruction
- Joint pre-training on text + speech yields a shared semantic space for cross-modal reasoning
- Grouped-Query Attention (GQA) reduces KV cache size 8× for high-concurrency serving
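The 8× KV-cache claim is easy to check on paper. The sketch below uses the 32 query heads and d_model 4096 from the architecture diagram; the 4 KV-head groups and FP8 cache precision are assumptions chosen to be consistent with the stated 8× reduction, not published details:

```python
# KV-cache bytes per token, with and without Grouped-Query Attention.
# LAYERS, D_MODEL, Q_HEADS come from the architecture diagram; KV_HEADS=4
# and the 1-byte FP8 cache are illustrative assumptions.
LAYERS, D_MODEL, Q_HEADS, KV_HEADS = 32, 4096, 32, 4
HEAD_DIM = D_MODEL // Q_HEADS   # 128
BYTES = 1                       # FP8 cache: 1 byte per element

def kv_bytes_per_token(n_kv_heads: int) -> int:
    # one K and one V tensor per layer, each (n_kv_heads x HEAD_DIM)
    return 2 * LAYERS * n_kv_heads * HEAD_DIM * BYTES

mha = kv_bytes_per_token(Q_HEADS)    # full multi-head attention
gqa = kv_bytes_per_token(KV_HEADS)   # grouped-query attention
print(mha // gqa)  # 8
```

Shrinking the per-token cache 8× is what lets one GPU hold KV state for hundreds of concurrent voice streams instead of dozens.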
Why NVIDIA-only
- FlashAttention-3 + FP8 are first-class on Hopper / Blackwell, not on alternatives
- TensorRT-LLM is the only inference runtime with mature streaming + speculative decoding for our token interleaving scheme
- NeMo + Megatron-Core give us tensor + pipeline + sequence parallelism out of the box for 7B and beyond
- Triton Inference Server is the only stack we trust for million-call concurrency
02 — Training Infrastructure
How we train BL-Voice-1
| Aspect | Detail |
|---|---|
| Hardware | 64 × NVIDIA H100 SXM (8 nodes × 8 GPUs), NVLink + InfiniBand HDR fabric |
| Framework | NVIDIA NeMo + Megatron-Core (TP=4, PP=2, DP=8) |
| Precision | FP8 mixed precision (E4M3 forward, E5M2 backward) via Transformer Engine |
| Optimizer | Distributed AdamW with ZeRO-1, gradient checkpointing on every 4th block |
| Pre-train data | 850k hours multilingual speech + 1.2T text tokens (12 languages) |
| Tokens seen | 2.4T interleaved audio + text tokens (≈340 tokens per parameter, well beyond the ~20 of the Chinchilla optimum) |
| Throughput | 412 TFLOP/s per GPU sustained (38% MFU on H100) |
| Wall-clock | ~21 days for the full pre-training run |
| Eval harness | LibriSpeech, FLEURS, MLS, internal voice-agent dialog benchmark |
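The parallelism figures in the table tile the cluster exactly: TP=4 × PP=2 × DP=8 = 64 GPUs. The sketch below shows one way to map a global rank to its (tensor, pipeline, data) coordinates; the rank ordering (tensor-parallel fastest-varying) mirrors the common Megatron layout but is an assumption here, not a statement about the actual job configuration:

```python
# How TP=4 x PP=2 x DP=8 tiles the 64-GPU cluster (8 nodes x 8 GPUs).
# Ordering assumption: tensor-parallel ranks vary fastest, then pipeline,
# then data-parallel, as in the default Megatron process-group layout.
TP, PP, DP = 4, 2, 8
WORLD = TP * PP * DP   # 64

def coords(rank: int) -> tuple[int, int, int]:
    tp = rank % TP                 # which tensor-parallel shard
    pp = (rank // TP) % PP         # which pipeline stage
    dp = rank // (TP * PP)         # which data-parallel replica
    return tp, pp, dp

# every global rank lands in a unique (tp, pp, dp) cell
assert len({coords(r) for r in range(WORLD)}) == WORLD
```

Keeping TP=4 within a node means the bandwidth-hungry tensor-parallel all-reduces stay on NVLink, while pipeline and data parallelism cross the InfiniBand fabric.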
03 — Inference Benchmarks
Latency on a single H100 (TensorRT-LLM, FP8, batch=32)
| Stage | Cascaded baseline | BL-Voice-1 | Δ |
|---|---|---|---|
| Speech-in → first token | 280 ms | 62 ms | −78% |
| Reasoning step | 190 ms | 41 ms | −78% |
| First audio frame out | 220 ms | 34 ms | −85% |
| End-to-end voice latency | 690 ms | 137 ms | −80% |
| Concurrent streams / H100 | 48 | 312 | 6.5× |
Methodology: warm batches, 16 kHz input, English voice-agent dialog harness, p50 reported. Cascaded baseline = Whisper-large-v3 (Faster-Whisper) → Llama-3-8B (vLLM) → XTTS-v2. BL-Voice-1 served with TensorRT-LLM 0.13, in-flight batching, speculative decoding (draft model = 1B), paged KV cache.
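The end-to-end row is the plain sum of the three stages, and each Δ is a straight percentage reduction. A small sanity-check sketch over the table's numbers:

```python
# Sanity-check the latency table: end-to-end latency is the sum of the
# three stages, and the delta column is a percentage reduction.
stages = {                                 # p50 latencies in ms, from the table
    "speech_in_to_first_token": (280, 62), # (cascaded baseline, BL-Voice-1)
    "reasoning_step":           (190, 41),
    "first_audio_frame_out":    (220, 34),
}
baseline = sum(b for b, _ in stages.values())    # 690 ms
ours     = sum(m for _, m in stages.values())    # 137 ms
reduction = round(100 * (1 - ours / baseline))   # 80 (%)
print(baseline, ours, reduction)  # 690 137 80
```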
04 — Open Artifacts