Research

An end-to-end speech-to-speech foundation model, trained on NVIDIA H100.

Bulut Labs' core research program is the design, pre-training, and deployment of BL-Voice-1, a multilingual speech-to-speech foundation model that fuses ASR, dialog reasoning, and expressive TTS into a single decoder-only transformer. We do not chain third-party APIs. We train, optimize, and serve our own weights on NVIDIA accelerated computing.

01 — Architecture

BL-Voice-1 model architecture

  ┌─────────────────────────────────────────────────────────┐
  │              BL-Voice-1  ·  7B parameters               │
  │     decoder-only · interleaved audio + text tokens      │
  ├─────────────────────────────────────────────────────────┤
  │  Audio In  →  Mimi-style 12.5 Hz neural codec  →  RVQ   │
  │                          ↓                              │
  │             Interleaved [audio | text] stream           │
  │                          ↓                              │
  │     32-layer Transformer · 32 heads · d_model 4096      │
  │             RoPE · GQA · FlashAttention-3               │
  │                          ↓                              │
  │       dual head: text logits  +  audio RVQ logits       │
  │                          ↓                              │
  │            Streaming decoder · 80 ms frames             │
  │                          ↓                              │
  │  Audio Out  ←  Codec decoder  ←  Speculative decoding   │
  └─────────────────────────────────────────────────────────┘
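The interleaving step in the diagram can be sketched in a few lines. This is an illustrative layout only, not the production tokenizer: the `interleave` helper, the `("audio", code)` / `("text", token)` tagging, and the example code values are assumptions; the 12.5 Hz rate, 8 RVQ codebooks, and 80 ms frame period come from the diagram above.

```python
# Illustrative sketch of the interleaved [audio | text] stream (assumed
# layout, not the production code). A 12.5 Hz codec emits one frame every
# 80 ms; each frame carries 8 RVQ codebook indices.

CODEC_HZ = 12.5
N_CODEBOOKS = 8

frame_period_ms = 1000 / CODEC_HZ  # 80.0 ms, matching the streaming decoder

def interleave(audio_frames, text_tokens):
    """Merge RVQ frames and text tokens into one decoder stream.

    Each element is tagged so the dual head knows whether the text logits
    or the audio RVQ logits apply at that position.
    """
    stream = []
    ai, ti = 0, 0
    while ai < len(audio_frames) or ti < len(text_tokens):
        if ai < len(audio_frames):
            for code in audio_frames[ai]:   # 8 codes per 80 ms frame
                stream.append(("audio", code))
            ai += 1
        if ti < len(text_tokens):
            stream.append(("text", text_tokens[ti]))
            ti += 1
    return stream

frames = [[101, 7, 42, 3, 9, 88, 51, 2]]    # one frame, 8 codebooks
print(frame_period_ms)                       # 80.0
print(len(interleave(frames, [5001])))       # 9: 8 audio codes + 1 text token
```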

Key design decisions

  • — Single decoder over cascaded STT → LLM → TTS, eliminating ~400 ms of pipeline latency
  • — Neural audio codec at 12.5 Hz / 8 codebooks for high-fidelity streaming reconstruction
  • — Joint pre-training on text + speech yields shared semantic space for cross-modal reasoning
  • — Grouped-Query Attention (GQA) reduces KV cache size 8× for high-concurrency serving
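The quoted 8× KV-cache reduction follows from back-of-envelope arithmetic over the dimensions in the diagram (32 layers, 32 heads, d_model 4096). The number of KV heads (4) and the one-byte FP8 cache width are assumptions implied by the 8× figure, not published details:

```python
# Back-of-envelope KV-cache sizing (illustrative; kv_heads=4 is an
# assumption implied by the quoted 8x reduction over 32 query heads).

LAYERS, HEADS, D_MODEL = 32, 32, 4096
HEAD_DIM = D_MODEL // HEADS          # 128
BYTES_PER_VALUE = 1                  # assumed FP8 KV cache

def kv_bytes_per_token(kv_heads):
    # K and V tensors, per layer, per KV head
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE

mha = kv_bytes_per_token(32)         # full multi-head attention
gqa = kv_bytes_per_token(4)          # grouped-query attention
print(mha // gqa)                    # 8 -> the 8x cache reduction
```

Smaller KV caches translate directly into more concurrent streams per GPU, which is what the serving numbers in section 03 measure.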

Why NVIDIA-only

  • — FlashAttention-3 + FP8 are first-class on Hopper / Blackwell, not on alternatives
  • — TensorRT-LLM is the only inference runtime with mature streaming + speculative decoding for our token interleaving scheme
  • — NeMo + Megatron-Core give us tensor + pipeline + sequence parallelism out of the box for 7B and beyond
  • — Triton Inference Server is the only stack we trust for million-call concurrency

02 — Training Infrastructure

How we train BL-Voice-1

  Hardware        64 × NVIDIA H100 SXM (8 nodes × 8 GPUs), NVLink + InfiniBand HDR fabric
  Framework       NVIDIA NeMo + Megatron-Core (TP=4, PP=2, DP=8)
  Precision       FP8 mixed precision (E4M3 forward, E5M2 backward) via Transformer Engine
  Optimizer       Distributed AdamW with ZeRO-1, gradient checkpointing on every 4th block
  Pre-train data  850k hours multilingual speech + 1.2T text tokens (12 languages)
  Tokens seen     2.4T interleaved audio + text tokens (well past Chinchilla-optimal for 7B)
  Throughput      412 TFLOPS / GPU sustained, 38% MFU on H100
  Wall-clock      ~21 days for full pre-training run
  Eval harness    LibriSpeech, FLEURS, MLS, internal voice-agent dialog benchmark
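The parallelism layout above can be sanity-checked with simple arithmetic (illustrative only; the note on NVLink-domain placement is our reading of the topology, not a stated fact):

```python
# Sanity check of the quoted 3D-parallel layout.
TP, PP, DP = 4, 2, 8                 # tensor, pipeline, data parallel degrees
gpus = TP * PP * DP
print(gpus)                          # 64 -> 8 nodes x 8 H100s

# Assumed placement: with 8 GPUs per node, TP=4 x PP=2 fits inside one
# NVLink domain, so only data-parallel gradient traffic crosses the
# InfiniBand HDR fabric.
per_node = TP * PP
print(per_node)                      # 8
```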

03 — Inference Benchmarks

Latency on a single H100 (TensorRT-LLM, FP8, batch=32)

  Stage                       Cascaded baseline   BL-Voice-1   Δ
  Speech-in → first token     280 ms              62 ms        −78%
  Reasoning step              190 ms              41 ms        −78%
  First audio frame out       220 ms              34 ms        −85%
  End-to-end voice latency    690 ms              137 ms       −80%
  Concurrent streams / H100   48                  312          6.5×

Methodology: warm batches, 16 kHz input, English voice-agent dialog harness, p50 reported. Cascaded baseline = Whisper-large-v3 (Faster-Whisper) → Llama-3-8B (vLLM) → XTTS-v2. BL-Voice-1 served with TensorRT-LLM 0.13, in-flight batching, speculative decoding (draft model = 1B), paged KV cache.
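The deltas in the table follow directly from the quoted p50 numbers; a minimal script to recompute them:

```python
# Recompute the benchmark table's deltas from the quoted p50 latencies.
rows = {
    "speech-in -> first token": (280, 62),
    "reasoning step":           (190, 41),
    "first audio frame out":    (220, 34),
    "end-to-end voice latency": (690, 137),
}
for stage, (baseline_ms, ours_ms) in rows.items():
    delta = (ours_ms - baseline_ms) / baseline_ms * 100
    print(f"{stage}: {delta:.0f}%")      # -78%, -78%, -85%, -80%

print(f"concurrency: {312 / 48:.1f}x")   # 6.5x
```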

← Back home