Research

An end-to-end speech-to-speech foundation model, trained on NVIDIA H100.

Bulut Labs' core research program is the design, pre-training, and deployment of BL-Voice-1, a multilingual speech-to-speech foundation model that fuses ASR, dialog reasoning, and expressive TTS into a single decoder-only transformer. We do not chain third-party APIs. We train, optimize, and serve our own weights on NVIDIA accelerated computing.

01 — Architecture

BL-Voice-1 model architecture

  ┌─────────────────────────────────────────────────────────┐
  │              BL-Voice-1  ·  7B parameters               │
  │     decoder-only · interleaved audio + text tokens      │
  ├─────────────────────────────────────────────────────────┤
  │  Audio In  →  Mimi-style 12.5 Hz neural codec  →  RVQ   │
  │                          ↓                              │
  │             Interleaved [audio | text] stream           │
  │                          ↓                              │
  │     32-layer Transformer · 32 heads · d_model 4096      │
  │             RoPE · GQA · FlashAttention-3               │
  │                          ↓                              │
  │       dual head: text logits  +  audio RVQ logits       │
  │                          ↓                              │
  │            Streaming decoder · 80 ms frames             │
  │                          ↓                              │
  │  Audio Out  ←  Codec decoder  ←  Speculative decoding   │
  └─────────────────────────────────────────────────────────┘
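The interleaving step in the diagram can be sketched in a few lines. This is an illustrative layout only, not the production tokenizer: the `interleave` helper, the `("audio", code)` / `("text", token)` tagging, and the example code values are assumptions; the 12.5 Hz rate, 8 RVQ codebooks, and 80 ms frame period come from the diagram above.

```python
# Illustrative sketch of the interleaved [audio | text] stream (assumed
# layout, not the production code). A 12.5 Hz codec emits one frame every
# 80 ms; each frame carries 8 RVQ codebook indices.

CODEC_HZ = 12.5
N_CODEBOOKS = 8

frame_period_ms = 1000 / CODEC_HZ  # 80.0 ms, matching the streaming decoder

def interleave(audio_frames, text_tokens):
    """Merge RVQ frames and text tokens into one decoder stream.

    Each element is tagged so the dual head knows whether the text logits
    or the audio RVQ logits apply at that position.
    """
    stream = []
    ai, ti = 0, 0
    while ai < len(audio_frames) or ti < len(text_tokens):
        if ai < len(audio_frames):
            for code in audio_frames[ai]:   # 8 codes per 80 ms frame
                stream.append(("audio", code))
            ai += 1
        if ti < len(text_tokens):
            stream.append(("text", text_tokens[ti]))
            ti += 1
    return stream

frames = [[101, 7, 42, 3, 9, 88, 51, 2]]    # one frame, 8 codebooks
print(frame_period_ms)                       # 80.0
print(len(interleave(frames, [5001])))       # 9: 8 audio codes + 1 text token
```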

Key design decisions

  • — Single decoder over cascaded STT → LLM → TTS, eliminating ~400 ms of pipeline latency
  • — Neural audio codec at 12.5 Hz / 8 codebooks for high-fidelity streaming reconstruction
  • — Joint pre-training on text + speech yields shared semantic space for cross-modal reasoning
  • — Grouped-Query Attention (GQA) reduces KV cache size 8× for high-concurrency serving
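The quoted 8× KV-cache reduction follows from back-of-envelope arithmetic over the dimensions in the diagram (32 layers, 32 heads, d_model 4096). The number of KV heads (4) and the one-byte FP8 cache width are assumptions implied by the 8× figure, not published details:

```python
# Back-of-envelope KV-cache sizing (illustrative; kv_heads=4 is an
# assumption implied by the quoted 8x reduction over 32 query heads).

LAYERS, HEADS, D_MODEL = 32, 32, 4096
HEAD_DIM = D_MODEL // HEADS          # 128
BYTES_PER_VALUE = 1                  # assumed FP8 KV cache

def kv_bytes_per_token(kv_heads):
    # K and V tensors, per layer, per KV head
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE

mha = kv_bytes_per_token(32)         # full multi-head attention
gqa = kv_bytes_per_token(4)          # grouped-query attention
print(mha // gqa)                    # 8 -> the 8x cache reduction
```

Smaller KV caches translate directly into more concurrent streams per GPU, which is what the serving numbers in section 03 measure.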

Why NVIDIA-only

  • — FlashAttention-3 + FP8 are first-class on Hopper / Blackwell, not on alternatives
  • — TensorRT-LLM is the only inference runtime with mature streaming + speculative decoding for our token interleaving scheme
  • — NeMo + Megatron-Core give us tensor + pipeline + sequence parallelism out of the box for 7B and beyond
  • — Triton Inference Server is the only stack we trust for million-call concurrency

02 — Training Infrastructure

How we train BL-Voice-1

  Hardware        64 × NVIDIA H100 SXM (8 nodes × 8 GPUs), NVLink + InfiniBand HDR fabric
  Framework       NVIDIA NeMo + Megatron-Core (TP=4, PP=2, DP=8)
  Precision       FP8 mixed precision (E4M3 forward, E5M2 backward) via Transformer Engine
  Optimizer       Distributed AdamW with ZeRO-1, gradient checkpointing on every 4th block
  Pre-train data  850k hours multilingual speech + 1.2T text tokens (12 languages)
  Tokens seen     2.4T interleaved audio + text tokens (well past Chinchilla-optimal for 7B)
  Throughput      412 TFLOPS / GPU sustained, 38% MFU on H100
  Wall-clock      ~21 days for full pre-training run
  Eval harness    LibriSpeech, FLEURS, MLS, internal voice-agent dialog benchmark
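The parallelism layout above can be sanity-checked with simple arithmetic (illustrative only; the note on NVLink-domain placement is our reading of the topology, not a stated fact):

```python
# Sanity check of the quoted 3D-parallel layout.
TP, PP, DP = 4, 2, 8                 # tensor, pipeline, data parallel degrees
gpus = TP * PP * DP
print(gpus)                          # 64 -> 8 nodes x 8 H100s

# Assumed placement: with 8 GPUs per node, TP=4 x PP=2 fits inside one
# NVLink domain, so only data-parallel gradient traffic crosses the
# InfiniBand HDR fabric.
per_node = TP * PP
print(per_node)                      # 8
```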

03 — Inference Benchmarks

Latency on a single H100 (TensorRT-LLM, FP8, batch=32)

  Stage                       Cascaded baseline   BL-Voice-1   Δ
  Speech-in → first token     280 ms              62 ms        −78%
  Reasoning step              190 ms              41 ms        −78%
  First audio frame out       220 ms              34 ms        −85%
  End-to-end voice latency    690 ms              137 ms       −80%
  Concurrent streams / H100   48                  312          6.5×

Methodology: warm batches, 16 kHz input, English voice-agent dialog harness, p50 reported. Cascaded baseline = Whisper-large-v3 (Faster-Whisper) → Llama-3-8B (vLLM) → XTTS-v2. BL-Voice-1 served with TensorRT-LLM 0.13, in-flight batching, speculative decoding (draft model = 1B), paged KV cache.
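The deltas in the table follow directly from the quoted p50 numbers; a minimal script to recompute them:

```python
# Recompute the benchmark table's deltas from the quoted p50 latencies.
rows = {
    "speech-in -> first token": (280, 62),
    "reasoning step":           (190, 41),
    "first audio frame out":    (220, 34),
    "end-to-end voice latency": (690, 137),
}
for stage, (baseline_ms, ours_ms) in rows.items():
    delta = (ours_ms - baseline_ms) / baseline_ms * 100
    print(f"{stage}: {delta:.0f}%")      # -78%, -78%, -85%, -80%

print(f"concurrency: {312 / 48:.1f}x")   # 6.5x
```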

← Back home