Technology

The full stack, from CUDA to call.

Seven vertically integrated layers, all built on NVIDIA accelerated computing. Every layer is owned and engineered in-house. Nothing is rented from a third-party LLM API.

07

Application — Talkif.ai

Enterprise voice agent platform. Visual flow builder, campaign engine, real-time monitoring. In production with paying customers.

06

Orchestration — Triton Inference Server

Auto-scaling GPU pools across H100 / Blackwell. In-flight batching, dynamic ensemble routing, gRPC streaming. Millions of concurrent calls.
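The batching described above can be sketched as a Triton model config. The model name, batch sizes, and instance counts here are hypothetical, not our production settings; note also that with the TensorRT-LLM backend, in-flight batching is managed by the backend itself, so the generic `dynamic_batching` block below only illustrates Triton's scheduler.

```protobuf
# Hypothetical Triton config.pbtxt sketch; all values illustrative.
name: "bl_voice_1"
backend: "tensorrtllm"
max_batch_size: 64

# Queue requests briefly so the scheduler can form larger batches.
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 1000
}

# Two execution instances per GPU in the pool.
instance_group [
  { count: 2, kind: KIND_GPU }
]
```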

05

Inference runtime — NVIDIA TensorRT-LLM

FP8 engines compiled per GPU SKU. Speculative decoding with a 1B draft model. Continuous batching with paged attention. Custom plugins for our interleaved audio + text token format.
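The speculative-decoding loop can be illustrated with a toy greedy-verification sketch: a cheap draft model proposes `k` tokens, the target model checks them, and the longest agreeing prefix is accepted. The "models" below are stand-in functions, not BL-Voice-1, and the names are ours for illustration.

```python
def draft_next(ctx):
    # Stand-in for the cheap 1B draft model: guesses the next token.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Stand-in for the full target model: the ground truth here.
    return (ctx[-1] + 1) % 10 if ctx[-1] != 4 else 7

def speculative_step(ctx, k=4):
    # 1) Draft proposes k tokens autoregressively (cheap, sequential).
    proposal, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_next(d_ctx)
        proposal.append(t)
        d_ctx.append(t)

    # 2) Target verifies the whole proposal (one batched pass on GPU;
    #    sequential here for clarity). Keep the agreeing prefix, and on
    #    the first mismatch emit the target's token instead and stop.
    accepted, v_ctx = [], list(ctx)
    for t in proposal:
        expect = target_next(v_ctx)
        if t != expect:
            accepted.append(expect)
            break
        accepted.append(t)
        v_ctx.append(t)
    return ctx + accepted

print(speculative_step([0]))              # → [0, 1, 2, 3, 4]
print(speculative_step([0, 1, 2, 3, 4]))  # → [0, 1, 2, 3, 4, 7]
```

When draft and target agree, each step emits up to `k` tokens for one target pass, which is where the latency win comes from.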

04

Custom CUDA kernels

Streaming attention with rolling KV windows. Fused codec decode kernel. Audio frame de-interleaver. RVQ logits sampler. All hand-written for SM_90 (Hopper) and SM_100 (Blackwell).
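A pure-Python reference for the de-interleaver's logic may help: the CUDA kernel does this on-GPU, splitting one interleaved token stream into text and audio sub-streams. The token layout assumed here (text ids below an `AUDIO_BASE` offset, rebased codec ids above it) is our illustration; the real interleaved format is not described above.

```python
AUDIO_BASE = 32_000  # hypothetical: first audio-codec token id

def deinterleave(tokens):
    """Split an interleaved stream into text and audio sub-streams,
    keeping each token's original position for re-alignment."""
    text, audio = [], []
    for pos, tok in enumerate(tokens):
        if tok >= AUDIO_BASE:
            audio.append((pos, tok - AUDIO_BASE))  # rebase to codec range
        else:
            text.append((pos, tok))
    return text, audio

stream = [17, 32_005, 32_011, 42, 32_001]
text, audio = deinterleave(stream)
print(text)   # → [(0, 17), (3, 42)]
print(audio)  # → [(1, 5), (2, 11), (4, 1)]
```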

03

Model — BL-Voice-1 (7B, decoder-only)

Speech-to-speech foundation model. ASR + dialog reasoning + expressive TTS in a single transformer. Trained from scratch by Bulut Labs.

02

Training — NVIDIA NeMo + Megatron-Core

Multi-node H100 / H200 clusters. Tensor + pipeline + data parallelism. FP8 mixed precision via Transformer Engine. SLURM job orchestration. NCCL over InfiniBand HDR.
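The three parallelism axes above factorize the cluster as world_size = TP × PP × DP. A minimal sketch of that arithmetic, with illustrative sizes (the real cluster layout is not stated above):

```python
def parallel_layout(world_size, tp, pp):
    """Factor a GPU cluster into tensor / pipeline / data parallel
    groups, Megatron-style: data parallelism gets the remainder."""
    assert world_size % (tp * pp) == 0, "TP * PP must divide world size"
    dp = world_size // (tp * pp)
    return {"tensor": tp, "pipeline": pp, "data": dp}

# e.g. 16 nodes x 8 H100s = 128 GPUs: 8-way TP inside each node
# (over NVLink), 4-way PP across nodes, data parallel for the rest.
print(parallel_layout(128, tp=8, pp=4))  # → {'tensor': 8, 'pipeline': 4, 'data': 4}
```

Keeping tensor parallelism within a node and pipeline stages across nodes matches the bandwidth hierarchy: NVLink for the chatty TP collectives, InfiniBand for the sparser pipeline traffic.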

01

Hardware — NVIDIA H100 / H200 / Blackwell + Jetson Orin

DGX-class clusters in the cloud for training and inference. Jetson Orin for on-premise enterprise edge deployments. One model family, one toolchain, every layer of the platform.

NVIDIA Software Stack in Production

NeMo
Megatron-Core
TensorRT-LLM
Triton Inference Server
CUDA 12.6
Transformer Engine (FP8)
cuBLAS / cuDNN
NCCL
Riva
Jetson Orin SDK
DCGM
NIM Microservices