Cataleya (Voice-to-Voice AI)

AI & Machine Learning, Data Parsing

2200 USD


Format: Project work / Remote (with access to local GPU clusters)
Technology stack: PersonaPlex (Moshi-based architecture), PyTorch, TensorRT-LLM, FastAPI, WebRTC, Telegram Mini App (TMA).
Equipment location: Uzbekistan and Kazakhstan (TAS-IX network), clusters based on NVIDIA RTX 4090.

Project Overview

Cataleya is an innovative multimodal "speech-to-speech" (S2S) ecosystem that mimics natural human communication. We are creating an AI assistant that easily switches between roles: expert tutor (chemistry, history, biology), empathetic conversationalist, and simultaneous interpreter. By directly processing audio tokens, the system achieves unprecedented interaction speed.

Current status: The base model (English) is stable. We are now scaling the solution to account for regional specifics and deploying it within a high-tech application.

Key Responsibilities

1. Core AI & ML (Adaptation and Intelligence)

Multilingualism: Fine-tuning the model to ensure native-level support for Uzbek (including regional dialects), Kazakh, and Russian.
Latency Optimization: Optimizing inference pipelines to achieve a target response latency of 0.07 seconds.
Smart RAG (100 GB): Vector knowledge base architecture for educational materials with the implementation of a "triple-check" mechanism to eliminate hallucinations.
NVIDIA Stack: Optimizing inference for the RTX 4090 environment using vLLM, TensorRT-LLM, and INT4/FP8 quantization.
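
To make the INT4 item above concrete, here is a minimal sketch of symmetric per-tensor INT4 weight quantization in plain Python. It only illustrates the general idea and is not how TensorRT-LLM or vLLM implement it; the function names and sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT4 quantization: map floats to integers in [-8, 7]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int4(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the INT4 codes."""
    return [q * scale for q in quantized]


# Illustrative weights only, not taken from any real model.
weights = [0.42, -1.30, 0.07, 0.95, -0.61]
quantized, scale = quantize_int4(weights)
restored = dequantize_int4(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, round(scale, 4), round(max_error, 4))
```

Because the largest weight maps exactly to 7 or -7, nothing is clipped and the rounding error per weight stays within half a quantization step. Production toolchains typically use per-channel or per-group scales rather than one scale per tensor.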

2. Telegram Mini App and Real-time Web

Audio Streaming: Implementing real-time audio transmission with low latency via WebRTC / WebSockets (going beyond standard voice message protocols).
Full-Duplex UI: Developing an interface that supports interruptibility, allowing the AI to respond instantly if the user interrupts it.
Vocal ID: Integrating voice biometrics for secure user authentication.
Billing: Integrating local payment gateways (Payme, Click) for subscription management.
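
The interruptibility requirement above (often called barge-in) can be sketched with asyncio: playback runs as a task that is cancelled as soon as user speech is detected. This is a simplified illustration under stated assumptions, not the project's implementation; `speak`, `converse`, and the timer-based voice-activity stand-in are all hypothetical, and real code would stream audio over WebRTC or WebSockets.

```python
import asyncio
from typing import List


async def speak(chunks: List[str], played: List[str]) -> None:
    """Stand-in for streaming synthesized audio to the client, chunk by chunk."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.02)  # pretend each chunk takes 20 ms to play


async def converse(chunks: List[str], interrupt_after: float) -> List[str]:
    """Play chunks, but stop immediately if the user starts speaking."""
    played: List[str] = []
    user_spoke = asyncio.Event()

    async def fake_vad() -> None:
        # Hypothetical voice-activity detector: fires after a fixed delay.
        await asyncio.sleep(interrupt_after)
        user_spoke.set()

    playback = asyncio.create_task(speak(chunks, played))
    vad = asyncio.create_task(fake_vad())
    interrupted = asyncio.create_task(user_spoke.wait())

    done, _ = await asyncio.wait(
        {playback, interrupted}, return_when=asyncio.FIRST_COMPLETED
    )
    if playback not in done:
        # Barge-in: cut the AI's speech off mid-utterance.
        playback.cancel()
        try:
            await playback
        except asyncio.CancelledError:
            pass

    # Clean up helper tasks regardless of who finished first.
    for task in (vad, interrupted):
        task.cancel()
    await asyncio.gather(vad, interrupted, return_exceptions=True)
    return played


chunks = [f"chunk-{i}" for i in range(10)]
played = asyncio.run(converse(chunks, interrupt_after=0.05))
print(f"played {len(played)} of {len(chunks)} chunks before the interruption")
```

The design choice is to keep playback in its own cancellable task, so the reaction time to an interruption is bounded by the chunk size rather than by the length of the whole response.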

3. Architecture and Infrastructure

Highload Design: Designing a horizontally scalable system capable of handling high loads from concurrent users.
Signal Processing: Implementing software echo cancellation (AEC) and noise suppression to ensure high-quality communication.
Traffic Localization: Optimizing routing protocols to maximize performance within the TAS-IX network.

Candidate Requirements
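
As an illustration of the software echo cancellation mentioned above, one common approach is a normalized least-mean-squares (NLMS) adaptive filter, which learns the echo path from the far-end (speaker) signal and subtracts the estimated echo from the microphone signal. The sketch below assumes that technique; the posting does not say which algorithm is intended, and the synthetic signals and parameter values are hypothetical.

```python
import random
from typing import List


def nlms_echo_cancel(
    far_end: List[float],
    mic: List[float],
    taps: int = 8,
    mu: float = 0.5,
    eps: float = 1e-8,
) -> List[float]:
    """Return the microphone signal with the estimated far-end echo removed."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate
        # Normalizing by input power keeps the step size stable.
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def energy(samples: List[float]) -> float:
    return sum(s * s for s in samples)


random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(2000)]
# Synthetic echo: the far-end signal delayed by 2 samples at half amplitude.
mic = [0.0, 0.0] + [0.5 * s for s in far_end[:-2]]
residual = nlms_echo_cancel(far_end, mic)
print(energy(residual[-200:]) < 0.01 * energy(mic[-200:]))
```

A real deployment would also need double-talk detection and a much longer filter to cover room reverberation; this sketch only shows the adaptation loop.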

AI / ML Engineering:

Proven experience with end-to-end (E2E) speech models (Moshi, AudioLM, or analogs).
Deep proficiency in PyTorch and Transformer architectures.
Hands-on experience fine-tuning LLM/S2S models for new language groups.
Expertise in CUDA 12.x and NVIDIA optimization libraries.

Fullstack Development:

Expert knowledge of WebRTC / WebSockets for real-time media streaming.
Experience in developing Telegram Mini Apps (TMA).
Professional proficiency in FastAPI and React / Next.js.
Deep understanding of the constraints and requirements of low-latency systems.

Proposals 4

Dmytro Zmenkov

1 1

Projects -
Rating -
Rating 121

Budget: 2200 USD Deadline: 11 days

Hello! I am ready to complete this project and I have extensive experience in developing various applications.

Tamara Ibrahim Sule A.

4 0

Budget: 2500 USD Deadline: 20 days

Hello!

Cataleya sounds exciting, and I also understand how challenging it is to achieve truly natural-sounding speech. I have worked with PyTorch-based models and real-time audio processing pipelines, and I can assist your team in meticulously refining latency, stability, and the entire process from microphone to GPU and to the speaker.

I would start small and practical. First, I would profile the current English processing path from start to finish and record where time is spent on capture, token processing, output, and streaming. Then I would address the largest latencies one by one, ensuring easily verifiable changes and safe deployment on your 4090 clusters. For Uzbek, Kazakh, and Russian languages, I would help create a simple test set that includes regional speech patterns so that fine-tuning is based on real examples rather than just general estimates.

Another simple but useful addition would be an internal latency-tracing view for the team. It gives a brief breakdown of each call, showing whether a slowdown comes from WebRTC, the server, or the GPU. This makes the current setup much easier to debug without complicating anything for users.
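
A per-call latency trace of this kind could look roughly like the sketch below. The class name, stage names, and timings are hypothetical placeholders rather than measurements; the only figure taken from the posting is the 0.07-second latency target.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, Tuple

TARGET_LATENCY_S = 0.07  # response latency target stated in the project description


class CallTrace:
    """Accumulates time spent in each named stage of a single call."""

    def __init__(self) -> None:
        self.stages: Dict[str, float] = {}

    def record(self, name: str, seconds: float) -> None:
        self.stages[name] = self.stages.get(name, 0.0) + seconds

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        # Times whatever runs inside the `with` block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def total(self) -> float:
        return sum(self.stages.values())

    def slowest(self) -> Tuple[str, float]:
        return max(self.stages.items(), key=lambda item: item[1])


trace = CallTrace()
# Placeholder timings for illustration only.
trace.record("webrtc", 0.030)
trace.record("server", 0.010)
trace.record("gpu", 0.045)

name, seconds = trace.slowest()
over_budget = trace.total() > TARGET_LATENCY_S
print(f"slowest stage: {name} ({seconds * 1000:.0f} ms), over budget: {over_budget}")
```

In live code the `stage` context manager would wrap each step, for example `with trace.stage("gpu"): ...` around the inference call, so every call produces the same breakdown.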

https://storyai.cc
https://oscarstories.com

Thank you!

Jeo Vincent C.

4 2

Budget: 2200 USD Deadline: 15 days

Hello,

I am interested in participating in the Cataleya project and clearly understand the technical and architectural complexity of the task. I have hands-on experience with end-to-end speech and multimodal models, low-latency inference pipelines, and large-scale deployment on NVIDIA GPU clusters. I work confidently with PyTorch, Transformer-based architectures, CUDA optimization, quantization, and inference acceleration (including TensorRT-LLM and vLLM), as well as multilingual fine-tuning for non-English language groups.

On the product and infrastructure side, I have experience building real-time audio systems using WebRTC and WebSockets, developing low-latency full-duplex interfaces, and integrating AI services into production environments via FastAPI. I also understand the specifics of Telegram Mini Apps, subscription logic, and payment integrations, and I approach system design with a strong focus on scalability, fault tolerance, and regional network optimization.

I work as a product-minded engineer, comfortable with research, adaptation, and production delivery, and I am confident I can contribute to both the core S2S intelligence and the real-time application layer of Cataleya.

Best regards,
Jeo Vincent Carretas

The list does not show proposals hidden by the client or by a freelancer with a Plus profile, or proposals that violate the rules.

Tulkin Said
Tashkent, Uzbekistan

Projects -
Rating -
Rating 65