AI Voice Cloning Real-Time

  1. 5093
     30  0
    45 days, 607 USD

    My estimate for a first working MVP is 320,000 UAH and about 45 days. It covers a Windows client, audio device selection, sample upload, streaming processing through a GPU server, "faster" and "better" quality modes, packaging into a single .exe, and measurement of actual latency. The 400 ms target can only be confirmed after testing the model, the network, and the audio drivers, so we can start with a short engineering prototype.

    An important point: we only work with voices where usage rights and the owner's consent are in place. For a product like this, I would add scenario limitations, logging, and clear labeling, because otherwise the risk is not technical but legal and reputational. There is also a nuance: the devil is not in the interface but in the latency and artifacts =)

    > On implementation
    >> Windows application for microphone, output, and virtual audio cable
    >> separate GPU service for real-time voice conversion
    >> level, latency, and network status indicators
    >> quality modes, test profiles, and packaging into .exe

    > Questions
    >> Is there already a GPU server or does it need to be selected and configured?
    >> Is an MVP needed on ready-made models or an industrial product level with tests on different microphones, networks, and voices?

    > Similar works by Ingello
    >> https://business.ingello.com/tts - close in voice technologies and speech processing
    >> https://business.ingello.com/fractal - close in complex AI architecture and automation
    >> https://systems-fl.ingello.com - Ingello Systems profile for such systems

    If the goal is public distribution, it's better to start with a prototype and a technical latency audit than to promise quality blindly.

  2. 141  
    2 days, 67 USD

    I can do this for 3k with the help of vibe coding. I have already done something similar. The requirements are that you have a powerful graphics card or money for cloud AI.

  3. 196  
    45 days, 607 USD

    we already have a nearly ready architecture for such a voice AI product, it can be quickly adapted and launched for a Windows client, GPU server, and virtual audio cable
    we are in touch, we can discuss the details here on the platform

    the estimate for the first working stage is 260,000 UAH and about 45 days

    We can keep the start simple - I would go through a technical prototype with measurable latency, and then improve the voice quality
    the goal of 0.3-0.5 seconds is achievable only with careful tuning of the stream processing, the buffers, the model, and the network

    - I will clarify 2 points
    -- do we need a recognizable voice of a specific person or is a change in timbre and speech manner sufficient
    -- is the GPU server already available or does it need to be selected and deployed along with the solution

    - what we will include in the first stage
    -- Windows application with microphone selection, output, and virtual cable
    -- uploading a wav sample and preparing a voice profile
    -- streaming audio to the GPU server
    -- real-time voice transformation
    -- start, stop, level indicator, latency, and connection status
    -- packaging into a single .exe for test distribution

    - similar cases Ingello
    -- https://business.ingello.com/tts - AI voice and speech solutions
    -- https://business.ingello.com/fractal - server architecture for complex AI processes
    -- https://business.ingello.com/vorfahr - strong example of a product with automation and integrations

    main landing for freelancehunt - https://systems-fl.ingello.com

    it seems that first of all, we should test the prototype on 1-2 target voices in real Discord or OBS
    here low latency is more important than a beautiful demo picture - the hardware will show the truth better than the presentation ))

  4. 2116    20  0
    22 days, 584 USD

    I understood the technical specifications: Windows application, real-time voice conversion (microphone → target voice → virtual audio cable), target latency ≤400ms, server-side on GPU. Sample target voice — one file 1-5 minutes. .exe for distribution, UI with device selection, model training, level and latency indicators.

    Stack as I see it.

    Voice model. For real-time voice conversion with 400ms latency and quality without artifacts, the best option in 2026 is RVC (Retrieval-based Voice Conversion) or its evolution Seed-VC. RVC is trained on short samples, supports real-time inference on GPU 12GB+. An alternative is F5-TTS or OpenVoice v2 from MyShell for voice cloning (but they are more for batch generation, real-time with them is harder to keep within 400ms). RVC inference on RTX 3060/4060 gives a confident 200-300ms per chunk, which fits the budget.

    Architecture. A thin Windows client (Python + Qt or C# WPF) captures the microphone via WASAPI/PyAudio, breaks it into chunks of 100-150ms, sends to the GPU server via WebSocket with low-latency options (ping-pong keepalive, no buffering). The server performs inference and returns the processed audio chunk. The client writes to the virtual audio cable (VB-Audio Virtual Cable as the standard for Windows). Latency budget: 30ms capture + 50ms network round-trip (if on the same network) + 200ms GPU inference + 30ms playback = ~310ms. If the server is remote (cloud GPU) — network round-trip can increase to 80-150ms, plus dependency on connection stability.
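The latency budget above can be checked with a few lines of arithmetic. This is only a sketch: the stage figures are the estimates quoted in this proposal, not measurements, and the function and constant names are hypothetical.

```python
# Latency budget from the proposal text. All figures are estimates in ms,
# not measured values; names are illustrative.
CAPTURE_MS = 30
INFERENCE_MS = 200
PLAYBACK_MS = 30
TARGET_MS = 400


def total_latency_ms(network_rtt_ms: int) -> int:
    """Sum the fixed pipeline stages plus the network round-trip."""
    return CAPTURE_MS + network_rtt_ms + INFERENCE_MS + PLAYBACK_MS


# Round-trip times taken from the proposal: 50 ms local, 80-150 ms remote.
for label, rtt in [("same network", 50), ("cloud, low end", 80), ("cloud, high end", 150)]:
    total = total_latency_ms(rtt)
    verdict = "within" if total <= TARGET_MS else "over"
    print(f"{label}: {total} ms ({verdict} the {TARGET_MS} ms target)")
```

With these numbers the local case comes to 310 ms, while a 150 ms round-trip to a remote server gives 410 ms, slightly over the target.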

    UI. Tkinter or PyQt5 for the Windows client (I have production experience with PyQt5 specifically for this class of tasks). Device selection: by enumerating devices with PyAudio (get_device_count() and get_device_info_by_index()) and filtering by input/output channels. Uploading a sample voice, sending it to the server, model training (training step synchronous or in the background). Start/Stop button. Indicators: microphone level (RMS), real-time latency (rolling average over the last 50 chunks), connection status.
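The two indicators mentioned above (RMS microphone level and a rolling average of latency over the last 50 chunks) could be computed roughly as follows. This is a minimal standard-library sketch; the function and class names are hypothetical, and it assumes samples arrive as floats in the range -1.0 to 1.0.

```python
import math
from collections import deque


def rms_level(samples):
    """Root-mean-square level of one audio chunk (0.0 for an empty chunk)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class RollingLatency:
    """Average of the most recent `window` per-chunk latency readings."""

    def __init__(self, window=50):
        # deque with maxlen drops the oldest reading automatically.
        self._values = deque(maxlen=window)

    def add(self, latency_ms):
        self._values.append(latency_ms)

    @property
    def average(self):
        if not self._values:
            return 0.0
        return sum(self._values) / len(self._values)


meter = RollingLatency(window=50)
for reading in (300.0, 320.0, 310.0):
    meter.add(reading)
print(f"level={rms_level([0.5, -0.5, 0.5, -0.5]):.2f}, latency={meter.average:.0f} ms")
```

A fixed-size deque keeps the indicator responsive to recent network changes without storing the full history.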

    Server. FastAPI or WebSocket server on aiohttp with the model loaded in memory, GPU-bound worker queue. If you plan many simultaneous users — a load balancer and several GPU instances are needed, but for MVP one machine with RTX 3090 or 4090 can handle ~5-10 simultaneous users.

    Building into .exe — PyInstaller with bundled dependencies, or Nuitka for production-grade compilation. I have experience with PyInstaller on desktop projects, .exe builds reliably.

    Honestly: real-time voice conversion at this latency is a niche ML task, I haven't done such in production. I have strong backend, ASR/TTS experience (Whisper,

  5. 690    5  1
    14 days, 360 USD

    Hi, write to me in private messages. I think I can handle it, I've done something similar, but I need a more detailed technical specification. I will outline how many tokens it will take, etc.

  6. 9351    20  0   1
    6 days, 562 USD

    Hello. A year ago, I created a similar solution for Windows in .exe format for real-time voice conversion. I have working developments; now I need to update the packages, adapt them to your requirements, and test the connection between the Windows client and the GPU server. I believe I can quickly bring this to MVP.

  7. 3926    15  0
    7 days, 607 USD

    Good afternoon.
    I am currently working with TTS systems like Cartesia/Inworld and local models such as XTTS-v2 (Coqui).
    It's not as simple as it seems; TTS is one thing and STT is another, and a unified solution doesn't always yield acceptable results. Sometimes the TTS is poor, or the STT latency is unsuitable, or the recognition quality doesn't meet your goals. To reach your target of 400ms, some adjustments are necessary. This is what I am focused on right now, and I am trying to get the latency down to about 1 second.
    I am a senior developer, working on an hourly rate of 30 euros/hour for this task.
    It's hard to say how long the core will take; it could be 10 hours or even 40 hours, plus a wrapper for Windows.
    If this suits you and my rate is acceptable, you are welcome to get in touch. I always deliver quality work.
    If we communicate, I will provide a more accurate cost estimate for such a project.

  8. 258  
    50 days, 607 USD

    We have experience in developing AI/audio realtime solutions, including work with voice conversion, streaming audio, GPU inference, and low-latency sound processing.

    We understand the specifics of the realtime voice changing task:
    — capturing and processing audio streams;
    — voice cloning from a short sample;
    — minimizing latency;
    — integration with Discord / Zoom / OBS via virtual audio devices;
    — building a desktop application for Windows in .exe.

    We can implement:
    • desktop client;
    • server GPU component;
    • voice conversion pipeline;
    • training/fine-tuning of the voice model;
    • realtime streaming;
    • quality/latency settings;
    • UI/UX interface of the application.

    We have worked with the AI audio stack:
    RVC, XTTS, So-VITS-SVC, Whisper, PyTorch, WebRTC, CUDA, realtime audio pipelines.

    We pay special attention to:
    — stability of realtime operation;
    — voice quality without strong artifacts;
    — optimization for regular PCs;
    — architecture for further scaling.

    We are ready to discuss the stack, architecture, and showcase relevant experience.

    Sincerely, Benefit Studio

  9. 556    1  0
    30 days, 250 USD

    Hello! I can implement real-time voice conversion with low latency, with a Windows client and a server running GPU inference.

    I have experience with AI integrations and real-time systems (WebRTC/streaming/low-latency processing), so I can design the architecture for this case.

    Architecture:

    * Windows desktop client (UI + audio stream)
    * Virtual audio driver / loopback (VB-Cable or similar)
    * Backend server with GPU (model inference)
    * Streaming via WebSocket / gRPC
    * Buffering for latency ≤ 300–400ms

    ML part:

    * voice conversion model (RVC / so-vits-svc / similar)
    * loading reference voice (1–5 minutes)
    * caching voice embeddings
    * optimization for real-time inference

    Client:

    * selection of input/output devices
    * loading voice sample
    * start/stop streaming button
    * latency/load/audio level indicator
    * integration with Discord / Zoom via virtual audio device

    Work stages:

    1. Architecture + pipeline prototype
    — checking pipeline latency, selecting model
    Deadline: 5 days
    Cost: 400 USD

    2. Backend GPU inference
    — real-time voice conversion API
    — latency optimization
    Deadline: 10 days
    Cost: 800 USD

    3. Windows client
    — UI + audio routing + stream management
    Deadline: 8 days
    Cost: 700 USD

    4. Integration + testing
    — stability, latency tuning, packaging into .exe
    Deadline: 5 days
    Cost: 400 USD

    Total duration: 4 weeks
    Budget: 2300 USD (MVP → stable version)

    Important: the key risk here is the latency and stability of the real-time model. Therefore, I will first create a pipeline prototype to confirm the achievable latency, and only then we will finalize the client.

  10. 368    1  0
    2 days, 56 USD

    Good day, I am ready to take on the project, I have experience in creating similar work.

  11. Another 5 proposals concealed
  • Nikita Rumyantsev
    26 May, 18:59 |

    There are already analogs, though; building something like this will turn out very expensive.

  • Nikita Rumyantsev
    28 May, 11:58 |

    We can roughly estimate how much the costs for tokens and so on will come to.

  • Pavlo B.
    31 May, 7:21 |

    It needs to be done manually.

  • Yevhen Melnik
    5 June, 16:52 |

    There are cases where the speech, the field, or the product is confidential and requires its own build on its own servers, friend )


Client
Odd Man
Ukraine Kyiv
Project published
16 days 3 hours back
131 views
Tags
  • windows 8
  • voice cloning
  • Audio Processing
  • Real-time Processing