Cataleya (Voice-to-Voice AI)
Technology stack: PersonaPlex (Moshi-based architecture), PyTorch, TensorRT-LLM, FastAPI, WebRTC, Telegram Mini App (TMA).
Equipment location: Uzbekistan and Kazakhstan (TAS-IX network), clusters based on NVIDIA RTX 4090.
- Multilingualism: Fine-tuning the model to ensure native-level support for Uzbek (including regional dialects), Kazakh, and Russian languages.
- Latency Optimization: Optimizing inference pipelines to achieve a target response latency of 0.07 seconds.
- Smart RAG (100 GB): Vector knowledge base architecture for educational materials with the implementation of a "triple-check" mechanism to eliminate hallucinations.
- NVIDIA Stack: Optimizing inference for the RTX 4090 environment using vLLM, TensorRT-LLM, and INT4/FP8 quantization.
- Audio Streaming: Implementing real-time audio transmission with low latency via WebRTC / WebSockets (going beyond standard voice message protocols).
- Full-Duplex UI: Developing an interface that supports interruptibility, allowing the AI to respond instantly if the user interrupts it.
- Vocal ID: Integrating voice biometrics for secure user authentication.
- Billing: Integrating local payment gateways (Payme, Click) for subscription management.
- Highload Design: Designing a horizontally scalable system capable of handling high loads from concurrent users.
- Signal Processing: Implementing software echo cancellation (AEC) and noise suppression to ensure high-quality communication.
- Traffic Localization: Optimizing routing protocols to maximize performance within the TAS-IX network.
Candidate Requirements:
- Proven experience with end-to-end (E2E) speech models (Moshi, AudioLM, or similar).
- Deep proficiency in PyTorch and Transformer architectures.
- Hands-on experience in fine-tuning LLM/S2S models for new language groups.
- Expertise in CUDA 12.x and NVIDIA optimization libraries.
- Expert knowledge of WebRTC / WebSockets for real-time media streaming.
- Experience in developing Telegram Mini Apps (TMA).
- Professional proficiency in FastAPI and React / Next.js.
- Deep understanding of the constraints and requirements of low-latency systems.
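For illustration only, the interruptibility requirement in the Full-Duplex UI task could be sketched as below. This is a minimal sketch in Python, assuming an asyncio-based server. `BargeInPlayer`, `on_user_speech`, and `fake_tts` are hypothetical names and are not part of PersonaPlex, Moshi, or WebRTC; a real system would receive the interrupt signal from a voice-activity detector on the incoming audio.

```python
import asyncio


class BargeInPlayer:
    """Sends synthesized audio chunks and stops as soon as the user speaks."""

    def __init__(self) -> None:
        self._interrupted = asyncio.Event()
        self.sent: list[bytes] = []  # stands in for the outgoing audio track

    def on_user_speech(self) -> None:
        # Hypothetical hook: called when speech is detected on the uplink.
        self._interrupted.set()

    async def play(self, chunks) -> bool:
        """Send chunks until finished or interrupted; True if finished."""
        self._interrupted.clear()
        async for chunk in chunks:
            if self._interrupted.is_set():
                return False  # drop the rest of the reply
            self.sent.append(chunk)
            await asyncio.sleep(0)  # yield so other tasks can run
        return True


async def fake_tts(n: int):
    # Hypothetical stand-in for model output: n one-byte audio chunks.
    for i in range(n):
        yield bytes([i])


async def demo() -> tuple[bool, int]:
    player = BargeInPlayer()

    async def interrupt_after_three() -> None:
        # Simulates the user starting to talk after three chunks were sent.
        while len(player.sent) < 3:
            await asyncio.sleep(0)
        player.on_user_speech()

    task = asyncio.create_task(interrupt_after_three())
    completed = await player.play(fake_tts(10))
    await task
    return completed, len(player.sent)
```

Running `asyncio.run(demo())` returns whether the reply finished and how many chunks were sent before the interruption. Checking the flag before each chunk keeps the reaction time bounded by one chunk duration, which is why small chunks matter for this requirement.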
- 11 days, 2200 USD
Hello! I am ready to complete this project and I have extensive experience in developing various applications.
- 20 days, 2500 USD
Hello!
Cataleya sounds exciting, and I also understand how challenging it is to achieve truly natural-sounding speech. I have worked with PyTorch-based models and real-time audio processing pipelines, and I can assist your team in meticulously refining latency, stability, and the entire process from microphone to GPU and to the speaker.
I would start small and practical. First, I would profile the current English processing path from start to finish and record where time is spent on capture, token processing, output, and streaming. Then I would address the largest latencies one by one, ensuring easily verifiable changes and safe deployment on your 4090 clusters. For Uzbek, Kazakh, and Russian languages, I would help create a simple test set that includes regional speech patterns so that fine-tuning is based on real examples rather than just general estimates.
Another simple but useful addition would be an internal latency-tracing view for the team. It gives a brief per-call breakdown showing whether a slowdown comes from WebRTC, the server, or the GPU. This makes the current setup much easier to debug without adding any complexity for users.
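A minimal sketch of what such a per-call latency trace could look like, assuming Python on the server side. `CallTrace` and the stage names are hypothetical, and the `time.sleep` calls are placeholders for real work, not measurements:

```python
import time
from contextlib import contextmanager


class CallTrace:
    """Accumulates wall-clock time per named stage for a single call."""

    def __init__(self, clock=time.perf_counter) -> None:
        self._clock = clock  # injectable so the trace can be unit-tested
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def slowest(self) -> str:
        # Name of the stage that consumed the most time in this call.
        return max(self.stages, key=self.stages.get)

    def summary(self) -> str:
        total = sum(self.stages.values())
        parts = [f"{n}={t * 1000:.1f}ms" for n, t in self.stages.items()]
        return f"total={total * 1000:.1f}ms " + " ".join(parts)


trace = CallTrace()
with trace.stage("webrtc"):
    time.sleep(0.001)  # placeholder for receiving audio
with trace.stage("server"):
    time.sleep(0.001)  # placeholder for request handling
with trace.stage("gpu"):
    time.sleep(0.05)   # placeholder for model inference
print(trace.summary())
```

Logging one `summary()` line per call would be enough to see which stage dominates before deciding what to optimize.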
https://storyai.cc
… https://oscarstories.com
Thank you!
- 15 days, 2200 USD
Hello,
I am interested in participating in the Cataleya project and clearly understand the technical and architectural complexity of the task. I have hands-on experience with end-to-end speech and multimodal models, low-latency inference pipelines, and large-scale deployment on NVIDIA GPU clusters. I work confidently with PyTorch, Transformer-based architectures, CUDA optimization, quantization, and inference acceleration (including TensorRT-LLM and vLLM), as well as multilingual fine-tuning for non-English language groups.
On the product and infrastructure side, I have experience building real-time audio systems using WebRTC and WebSockets, developing low-latency full-duplex interfaces, and integrating AI services into production environments via FastAPI. I also understand the specifics of Telegram Mini Apps, subscription logic, and payment integrations, and I approach system design with a strong focus on scalability, fault tolerance, and regional network optimization.
I work as a product-minded engineer, comfortable with research, adaptation, and production delivery, and I am confident I can contribute to both the core S2S intelligence and the real-time application layer of Cataleya.
Best regards,
… Jeo Vincent Carretas