Freelance projects

Testing prompts for AI agent

AI & Machine Learning, Bot Development

22 USD

We have implemented a SaaS platform for connecting AI managers for Instagram, Messenger, and Telegram (so that AI communicates with clients instead of a human).

We use prompt_id (developer message) from OpenAI as a prompt for communication in chats.

Once the prompt is created, it is necessary to test dialogues for different scenarios and products before launching into production.

Currently, testing is manual: we change the prompt and run the dialogue 5-10 times. This takes a lot of time, because after each edit we have to retest conversation scenarios for different product types and situations.

We need to think through the logic and tools for automated dialogue testing (automated conversations): if the client writes one thing, the AI should respond in a certain way, and if the prompt is changed, we need to see how the AI's response changes.
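One way to read this requirement, as a minimal Python sketch: each scenario pairs a client message with phrases the reply must and must not contain, and each scenario is replayed several times because the same prompt can answer differently on each run. The `Scenario` fields, `pass_rate`, and the `fake_agent` stand-in for the real chat call are hypothetical names, not part of the platform.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scenario:
    name: str
    client_message: str
    must_contain: List[str] = field(default_factory=list)
    must_not_contain: List[str] = field(default_factory=list)


def reply_passes(scenario: Scenario, reply: str) -> bool:
    """Case-insensitive phrase checks instead of exact text matching."""
    text = reply.lower()
    has_required = all(p.lower() in text for p in scenario.must_contain)
    has_forbidden = any(p.lower() in text for p in scenario.must_not_contain)
    return has_required and not has_forbidden


def pass_rate(scenario: Scenario, agent: Callable[[str], str], runs: int = 5) -> float:
    """Replay the scenario `runs` times and return the share of passing replies."""
    passed = sum(reply_passes(scenario, agent(scenario.client_message)) for _ in range(runs))
    return passed / runs


def fake_agent(message: str) -> str:
    # Hypothetical stand-in: a real run would call the chat backend here.
    return "The price is 500 UAH. Delivery takes 2 days."


price_question = Scenario(
    name="price question",
    client_message="How much does it cost?",
    must_contain=["price"],
    must_not_contain=["free forever"],
)
print(pass_rate(price_question, fake_agent, runs=5))
```

With a real model behind `agent`, a pass rate below 1.0 on an unchanged prompt would point to unstable behavior rather than a regression.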

Also, the AI should suggest what to change in the prompt for better stability and predictability of responses.

Please write if you have already implemented automated testing of dialogues between AI and clients.

Proposals 22 Withdrawn 1

Oleg Grigoryev

33 0

Budget: 12000 UAH Deadline: 5 days

We can start with the first stage - to design the logic of automated tests for dialogues and create a working prototype for 10-20 scenarios. The budget of 1000 UAH for this task, in my opinion, won't even cover proper design, but we can keep it simple and start with a compact stage for 12000 UAH over 5 days =)

We have had similar tasks in AI and automation
> https://business.ingello.com/vorfahr - AI logic, automation of decisions, and quality control of responses
> https://business.ingello.com/fractal - agent processes, scenarios, system behavior stability
> https://systems-fl.ingello.com/ua - a brief overview of Ingello Systems for the exchange

I would build this as a test stand - a set of scenarios, benchmark expectations, evaluating responses not only by exact text but by content, tone, adherence to rules, absence of forbidden promises, and stability after changing prompt_id. Additionally, we could include an AI reviewer that compares old and new responses and suggests what to change in the prompt for better predictability.
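To illustrate evaluation by content and rules rather than exact text, here is a small Python sketch. The rule names, regular expressions, and the length limit are assumptions chosen for the example; a real rule set would come from the client's list of what the manager must and must not do.

```python
import re
from typing import Callable, Dict, List


def mentions_price(reply: str) -> bool:
    # Content rule: the reply states an amount with a currency.
    return re.search(r"\d+\s*(UAH|USD)", reply) is not None


def no_forbidden_promise(reply: str) -> bool:
    # Forbidden promises: no guarantees or full refunds.
    return re.search(r"guarantee|100% refund", reply, re.IGNORECASE) is None


def no_shouting(reply: str) -> bool:
    # Tone rule: no run of five or more capital letters.
    return re.search(r"[A-Z]{5,}", reply) is None


def short_enough(reply: str) -> bool:
    # Format rule: chat replies stay under an assumed 400 characters.
    return len(reply) <= 400


RULES: Dict[str, Callable[[str], bool]] = {
    "mentions_price": mentions_price,
    "no_forbidden_promise": no_forbidden_promise,
    "no_shouting": no_shouting,
    "short_enough": short_enough,
}


def failed_rules(reply: str) -> List[str]:
    """Return the names of all rules the reply violates."""
    return [name for name, rule in RULES.items() if not rule(reply)]


print(failed_rules("The course costs 1200 UAH and starts on Monday."))
print(failed_rules("We GUARANTEE results, just trust us!"))
```

Reporting the names of violated rules, rather than a single pass/fail flag, makes it easier to see which part of the prompt a change affected.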

From you, we will need 3-5 real examples of dialogues, the current prompt without critical secrets, types of products, and rules that the manager must or must not violate. Access to production at the first stage is not necessary - a test key or examples of responses will be sufficient.

I would like to clarify 2 things:
> should tests be run through the OpenAI API directly or through your SaaS platform?
> is it more important for you to find bad responses after changing the prompt or to automatically generate new scenarios for testing?

Maksym O.

5 0

Budget: 1000 UAH Deadline: 7 days

Hello, I have been working on automating the testing of chatbots for an e-commerce platform with 15+ dialogue scenarios, which reduced testing time by 80% and improved response quality by 35%.

I am curious about what metrics you use to evaluate the quality of AI agent responses and how you plan to measure effectiveness after automating testing?

I suggest we get in touch; I will provide you with free technical consultation and we can create a development plan + I will tell you about my team!

Kristina Y.

0 0

Projects -
Rating -
Rating 153

Budget: 1500 UAH Deadline: 4 days

Hello.

I can help with the design and implementation of an automated testing system for dialogues for AI agents (prompt-based testing).

The solution may include:

generation of test cases (customer scenarios: sales, support, objections, etc.)
running dialogues through different versions of prompt_id
comparing responses (regression testing for LLM)
evaluating the stability/quality of responses (score/criteria)
logging changes between prompt versions
automatic analysis: where the prompt "failed" and what needs improvement
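A rough Python sketch of the regression comparison between two prompt versions, assuming each run has already been reduced to a pass/fail flag per scenario. The scenario names and results are illustrative, and `diff_results` is a hypothetical helper.

```python
from typing import Dict, List


def diff_results(old: Dict[str, bool], new: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare pass/fail results of two prompt versions, scenario by scenario."""
    shared = sorted(set(old) & set(new))
    return {
        "regressed": [s for s in shared if old[s] and not new[s]],
        "fixed": [s for s in shared if not old[s] and new[s]],
        "still_failing": [s for s in shared if not old[s] and not new[s]],
    }


# Illustrative results for the previous and the edited prompt version.
old_run = {"sales": True, "support": True, "objection": False, "refund": False}
new_run = {"sales": True, "support": False, "objection": True, "refund": False}

report = diff_results(old_run, new_run)
for status, scenarios in report.items():
    print(status, scenarios)
```

Only scenarios present in both runs are compared, so adding new test cases does not show up as a regression.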

A module can also be added that:

analyzes dialogues and suggests prompt improvements (self-improvement loop)

I have experience with LLM, prompt engineering, and automating testing of dialogue systems.

I can propose an MVP architecture and an estimate after discussing the current implementation.

Daria Kratofil

0 0

Projects -
Rating -
Rating 196

Budget: 27000 UAH Deadline: 10 days

We already have a similar, nearly finished solution for automated testing of AI manager dialogues. It can be quickly adapted to your SaaS platform to deliver a first result. We can discuss this here on the marketplace; I am available. ))

Regarding the budget - 1000 UAH seems too little for such a task; I would estimate the first working phase at 32000 UAH for 10 days.

Look, there’s a nuance - it’s important to test not just one response, but the stability of the scenario after each change of prompt_id.

We would do this as a set of regression tests for dialogues - client simulator, different types of products, negative situations, expected response boundaries, comparison of prompt_id versions, and a report on deviations.
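A client simulator can be sketched in Python as a scripted list of client turns that is played against the agent while the full history is recorded. This is a simplification under stated assumptions: the client here is scripted rather than generated by a model, and `echo_agent` is a hypothetical stand-in for the real call that would use a specific prompt_id.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "client" or "agent", "text": "..."}


def simulate_dialogue(
    client_turns: List[str],
    agent: Callable[[List[Message]], str],
) -> List[Message]:
    """Play scripted client turns against the agent and record the history."""
    history: List[Message] = []
    for turn in client_turns:
        history.append({"role": "client", "text": turn})
        # The agent sees the whole conversation so far, not just the last line.
        history.append({"role": "agent", "text": agent(history)})
    return history


def echo_agent(history: List[Message]) -> str:
    # Hypothetical stand-in for the real agent call.
    return f"Thanks for your message: {history[-1]['text']}"


# A negative situation: an angry client who demands a refund.
negative_script = ["Your product is a scam.", "I want a refund right now."]
history = simulate_dialogue(negative_script, echo_agent)
for message in history:
    print(f"{message['role']}: {message['text']}")
```

The recorded history can then be checked turn by turn against the expected response boundaries for that scenario.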

Separately, we can add an AI reviewer that will look for weaknesses in the prompt and suggest changes for better stability and predictability of responses.

From you, we need test access to the API or a stand, 5-10 real dialogues, product examples, and criteria for undesirable responses.

I would like to clarify 2 points - do you already have an API for launching a dialogue with a specific prompt_id, and do we need to test only text responses, or also buttons, statuses, and handover to an operator?

Similar examples below:
- https://business.ingello.com/fractal - close to automating development and checking AI results
- https://business.ingello.com/vorfahr - SaaS with an AI component and product logic
- https://systems-fl.ingello.com/ua - our profile on FLH

In general, it’s fine to start with a small phase - initially 5-7 scenarios, then expand the test set for new products and situations.

Alina Voinytska

0 0

Projects -
Rating -
Rating 457

Budget: 5000 UAH Deadline: 3 days

Good day!
The project is very interesting and close to our field: AI managers, prompt engineering, testing dialogue scenarios, and stabilizing AI responses before launching into production.
We can help devise the logic for automated dialogue testing for your prompt_id / developer messages.
Here’s a possible structure for the solution:
— creating a set of test scenarios for different products and types of clients
— automatic dialogue launch after prompt changes
— comparison of responses before/after prompt changes
— evaluation of responses based on criteria: accuracy, stability, tone alignment, presence of required data, absence of undesirable formulations
— detection of "broken" scenarios after edits
— generating a report on test results
— AI recommendations for improving prompts for more stable and predictable responses
Such a system can be built as a prompt QA / regression testing framework for AI dialogues: with a library of scenarios, expected results, response evaluations, and logs of changes for each version of the prompt.
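As one possible building block for the before/after comparison, the Python sketch below flags scenarios whose reply text changed a lot between two prompt versions, using `difflib.SequenceMatcher` from the standard library. Character-level similarity is only a rough proxy for a change in meaning, and the 0.6 threshold and sample replies are assumptions for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(before: str, after: str) -> float:
    """Character-level similarity between two replies, from 0.0 to 1.0."""
    return SequenceMatcher(None, before.lower(), after.lower()).ratio()


def flag_drift(
    before: Dict[str, str],
    after: Dict[str, str],
    threshold: float = 0.6,
) -> List[str]:
    """Return scenarios whose reply changed more than the threshold allows."""
    return [
        name
        for name in before
        if name in after and similarity(before[name], after[name]) < threshold
    ]


# Illustrative replies recorded before and after a prompt edit.
replies_v1 = {
    "price": "The course costs 1200 UAH.",
    "delivery": "Delivery takes 2 days within Ukraine.",
}
replies_v2 = {
    "price": "The course costs 1200 UAH.",
    "delivery": "Please contact our manager.",
}

flagged = flag_drift(replies_v1, replies_v2)
print(flagged)
```

Flagged scenarios are candidates for human review, not automatic failures, since a reworded reply can still be correct.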
We can discuss your current testing logic, product types, dialogue scenarios, and propose an MVP architecture for automated testing of AI managers.

Valerii Holovatenko

0 0

Projects -
Rating -
Rating 457

Budget: 1100 UAH Deadline: 3 days

It seems you already have a strong AI SaaS infrastructure, but the bottleneck right now is the regression testing of prompts after each change. This is a typical problem for AI support/sales systems, where even a small edit in the developer prompt can break the dialogue logic or change the tone/qualification flow.

I have worked with AI consultants for Instagram Direct and automated funnels through Chatfuel + OpenAI + Make.com, where it was important to ensure the stability of responses and predictable AI behavior in various scenarios. One approach here is to build a set of test cases (role-based conversations) + automatic dialogue runs through the OpenAI API with evaluation of responses based on predefined criteria: intent match, CTA consistency, objection handling, forbidden outputs, etc.

An AI review layer can also be implemented, where a separate LLM analyzes responses and suggests changes to the prompt structure for more stable model behavior after updates.

This looks like a good case for building an internal AI QA framework for your SaaS, and I am ready to help design the architecture and implementation of such testing.

Nikita Rumyantsev

5 1

Budget: 9500 UAH Deadline: 5 days

Hello! The task is very familiar. Instead of testing prompts manually, it is best to implement automated tests through LLM-as-a-Judge. I am ready to develop such a module for your SaaS. Write to me in private messages, and we will discuss the details.

Maksym T.

1 0

Projects -
Rating -
Rating 435

Budget: 4500 UAH Deadline: 10 days

Hello!

I have implemented something similar: automated testing of prompts through Make.com — a set of simulated dialogues is launched after each change of prompt_id, and the results are compared with reference answers.

I can build a system: test cases based on scenarios → auto-launch of dialogues → AI analysis of deviations → specific recommendations on what to change in the prompt.

I am ready to discuss the architecture and start working.

Vitalii Karasov

1 0

Projects -
Rating -
Rating 477

Budget: 18000 UAH Deadline: 10 days

Good day! The logic for your task:

Stack: Promptfoo (YAML-based, native A/B testing, side-by-side output diff before/after prompt editing) + DeepEval for quality metrics (faithfulness, relevance, conversation completeness, role adherence). Test suites - JSON with user_persona + context + expected behavior + edge cases. When prompt_id changes, all scenarios are automatically run, diffs are highlighted, regressions are immediately visible.
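The JSON test suite described here could look roughly like the Python sketch below. The key names follow the fields listed above (`user_persona`, `context`, `expected_behavior`, `edge_cases`), but the exact schema, the sample case, and `load_suite` are assumptions. This is not the Promptfoo or DeepEval configuration format.

```python
import json
from typing import List

REQUIRED_KEYS = {"user_persona", "context", "expected_behavior", "edge_cases"}

# Illustrative suite with a single test case.
SUITE_JSON = """
[
  {
    "user_persona": "price-sensitive first-time buyer",
    "context": "asks about a yoga course in Instagram Direct",
    "expected_behavior": ["states the price", "offers a next step"],
    "edge_cases": ["client asks for a discount", "client writes in all caps"]
  }
]
"""


def load_suite(raw: str) -> List[dict]:
    """Parse the suite and fail early if a case lacks a required field."""
    cases = json.loads(raw)
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"case {index} is missing keys: {sorted(missing)}")
    return cases


suite = load_suite(SUITE_JSON)
print(len(suite), "case(s) loaded for persona:", suite[0]["user_persona"])
```

Validating the file on load keeps a malformed test case from silently shrinking the regression set.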

For suggested prompt improvements: a separate "critic" agent on Claude Sonnet 4.6, which reads failed test cases and returns structured suggestions in JSON ("add a rule about X in the system prompt: in 7/10 tests the model confused Y with Z"). Suggestions are tied to specific failed assertions, not general advice.
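A sketch in Python of how the critic's structured output might be handled, assuming it returns a JSON list of objects with `assertion` and `suggestion` fields. The field names, the failure data, and the hardcoded `critic_output` are hypothetical; no model is called here.

```python
import json
from collections import Counter
from typing import Dict, List, Set, Tuple


def failure_counts(failures: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count how many test cases failed each assertion."""
    return dict(Counter(assertion for _test_id, assertion in failures))


def parse_suggestions(raw: str, known_assertions: Set[str]) -> List[dict]:
    """Keep only suggestions that point at an assertion that really failed."""
    suggestions = []
    for item in json.loads(raw):
        if item.get("assertion") in known_assertions and item.get("suggestion"):
            suggestions.append(item)
    return suggestions


# Illustrative data: (test_id, failed_assertion) pairs from one test run.
failures = [
    ("t1", "no_unconditional_discount"),
    ("t4", "no_unconditional_discount"),
    ("t7", "states_delivery_time"),
]

# Hypothetical critic output; a real critic model would produce this text.
critic_output = """[
  {"assertion": "no_unconditional_discount",
   "suggestion": "Add a rule: offer a discount only after the client asks twice."},
  {"assertion": "unknown_check", "suggestion": "Be nicer."}
]"""

counts = failure_counts(failures)
suggestions = parse_suggestions(critic_output, set(counts))
for s in suggestions:
    print(f'{s["assertion"]} (failed {counts[s["assertion"]]}x): {s["suggestion"]}')
```

Dropping suggestions that do not match a failed assertion is what keeps the critic's advice specific instead of general.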

Optionally: integration with your OpenAI prompt_id workflow via the API, with prompt versioning and automatic rollback when metrics fall below the threshold.
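The rollback idea can be sketched in Python as a small version stack: a new prompt version is deployed, and if its measured pass rate drops below a threshold, the previous version becomes active again. The `PromptVersions` class, the version labels, and the 0.9 threshold are assumptions for illustration; nothing here calls the OpenAI API.

```python
from typing import List


class PromptVersions:
    """Keeps deployed prompt versions in order; the last one is active."""

    def __init__(self, initial: str) -> None:
        self._stack: List[str] = [initial]

    @property
    def active(self) -> str:
        return self._stack[-1]

    def deploy(self, version: str) -> None:
        self._stack.append(version)

    def report_metric(self, pass_rate: float, threshold: float = 0.9) -> bool:
        """Roll back to the previous version if the pass rate is too low.

        Returns True when a rollback happened.
        """
        if pass_rate < threshold and len(self._stack) > 1:
            self._stack.pop()
            return True
        return False


versions = PromptVersions("v1")
versions.deploy("v2")
rolled_back = versions.report_metric(pass_rate=0.7)
print(versions.active, rolled_back)
```

The first version is never removed, so there is always a prompt to fall back to.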

A week ago, I took 3rd place solo at the AI Agent Olympics Hackathon Milan AI Week 2026 (731 teams, the largest AI event in Europe) - built an adversarial multi-agent system with an embedded eval layer. Full-time AI engineer for over 1 year. MSc in Strategic PM, PRINCE2.

Price: 18,000-25,000 UAH depending on the number of test cases and product types, 10-14 days with documentation.

Cases in the profile.

Artur Boiko

5 0

Budget: 1000 UAH Deadline: 1 day

Good day! 👋

An interesting task — automated testing of dialogues is where you can really save dozens of hours a week.

We will implement a system that runs scenarios through your prompt, compares responses before/after changes, and highlights degradation. A separate AI agent analyzes the results and suggests specific edits to the prompt for better stability.

We will discuss the details in private 🤝

Oleksandr Sliepyi

0 0

Projects -
Rating -
Rating 205

Budget: 1000 UAH Deadline: 1 day

Hello! We are a team of developers with 4 years of experience in creating standalone scripts, bots, and text information processing systems. The quality of the AI agent's work critically depends on the accuracy of prompt formulation and the predictability of its behavior under different conditions. We will take full responsibility for testing your system, checking the AI's response to non-standard or provocative user requests, and adjusting the logic for filtering output data. If necessary, we will automate the process of evaluating responses using Python. The result of our work will be fully optimized, production-ready prompts and a detailed report on the AI's behavior. Let's discuss the tasks and current architectural solutions in private messages!

Sergey Goncharuk

2 1

Projects -
Rating -
Rating 315

Budget: 1500 UAH Deadline: 3 days

Hello, Oleksandr!

The task is very familiar and relevant. Manual testing of prompts across different dialogue branches for SaaS is indeed a bottleneck that consumes time.

I propose to implement an automated testing framework based on the "LLM-as-a-Judge" principle (AI-Judge) in Python.

How it will work technically:

Test cases: We create a JSON/CSV file with benchmark situations (for example: "The client aggressively asks for the price", "The client requests a discount").

Automation (Script): My Python script will automatically "feed" these client messages to your AI manager via API and collect its responses.

AI-Judge (Evaluation and Recommendations): The collected responses will be sent in a separate API call (OpenAI) with a strict system prompt for the tester. This "AI-Judge" analyzes the manager's response for tone of voice compliance, absence of hallucinations, and generates a log:
Rating: 8/10. Error: the bot gave a discount without conditions. Recommendation: add a rule in the developer message "Never give a discount first".
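A minimal Python sketch of this judge step, assuming the judge is asked to answer in the fixed "Rating / Error / Recommendation" format shown in the log line above. `JUDGE_INSTRUCTIONS`, `judge_reply`, and the `fake_judge` stand-in are hypothetical; in a real setup `fake_judge` would be replaced by a separate model call.

```python
import re
from typing import Callable, Optional

JUDGE_INSTRUCTIONS = (
    "You are a strict QA reviewer of a sales chat assistant. "
    "Reply in exactly this format: "
    "Rating: <0-10>/10. Error: <text or none>. Recommendation: <text or none>."
)

VERDICT_RE = re.compile(
    r"Rating:\s*(\d+)/10\.\s*Error:\s*(.*?)\.\s*Recommendation:\s*(.*)",
    re.DOTALL,
)


def judge_reply(
    client_message: str,
    agent_reply: str,
    judge: Callable[[str], str],
) -> Optional[dict]:
    """Ask the judge about one exchange and parse its verdict.

    Returns None when the judge did not follow the expected format.
    """
    prompt = f"{JUDGE_INSTRUCTIONS}\n\nClient: {client_message}\nAssistant: {agent_reply}"
    match = VERDICT_RE.search(judge(prompt))
    if match is None:
        return None
    return {
        "rating": int(match.group(1)),
        "error": match.group(2).strip(),
        "recommendation": match.group(3).strip().rstrip("."),
    }


def fake_judge(prompt: str) -> str:
    # Hypothetical stand-in for the judge model.
    return (
        "Rating: 8/10. Error: the bot gave a discount without conditions. "
        'Recommendation: add a rule in the developer message "Never give a discount first".'
    )


verdict = judge_reply("Can I get a discount?", "Sure, here is 20% off!", fake_judge)
print(verdict)
```

Treating an unparseable verdict as `None` instead of guessing a rating keeps judge formatting errors visible in the log.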

Why me:
I have extensive experience working with neural network APIs (OpenAI, Groq). My current commercial project is a complex Telegram bot, the architecture of which is built on multi-level prompt engineering, where AI acts as an analyst and critic (analyzing texts, suggesting improvements).

I can write such a Python script for testing that you can run locally or on a server after each prompt change.

I am ready to discuss the implementation details!

Illia Dunaiev

4 0

Budget: 1000 UAH Deadline: 2 days

Hello, Oleksandr, let's go through this in order.
Recently, I have been working a lot with AI and have already completed similar tasks. I suggest implementing this with PydanticAI. It has a separate, ready-made module for prompt evaluation, with support for automatic assessment and improvement.
There are also other libraries for similar tasks, such as DeepEval and DSPy. We could implement it with them as well.
The workflow is quite simple:
1. We create a certain test set (or we can also assign this to AI)
2. We conduct testing for each set
3. We check the validity of the result (we can add LLM-as-a-Judge)
4. We edit the prompt.
5. And so iteratively, until the verification cycle passes at the required level.
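The loop above can be sketched in plain Python without any of the named libraries. `evaluate` and `revise` are hypothetical callables: the first runs the test set and returns a pass rate, the second edits the prompt. The stand-ins below only imitate that behavior so the loop can be run.

```python
from typing import Callable, Tuple


def improve_until_stable(
    prompt: str,
    evaluate: Callable[[str], float],
    revise: Callable[[str], str],
    target: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[str, float, int]:
    """Test, revise, and retest until the pass rate reaches the target."""
    score = evaluate(prompt)
    rounds = 0
    while score < target and rounds < max_rounds:
        prompt = revise(prompt)
        score = evaluate(prompt)
        rounds += 1
    return prompt, score, rounds


def fake_evaluate(prompt: str) -> float:
    # Stand-in: pretend each added rule makes 2 more of 10 tests pass.
    passed = min(10, 5 + 2 * prompt.count("RULE"))
    return passed / 10


def fake_revise(prompt: str) -> str:
    # Stand-in: a real step would apply a human or AI suggested edit.
    return prompt + " RULE"


final_prompt, final_score, rounds_used = improve_until_stable(
    "Be polite.", fake_evaluate, fake_revise
)
print(final_prompt, final_score, rounds_used)
```

The `max_rounds` cap matters in practice, since an automated editor is not guaranteed to converge on a prompt that passes.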
The task is clear, there is experience. I would be happy to work with you!

Leonid Kharenko

0 0

Projects -
Rating -
Rating 218

Budget: 24999 UAH Deadline: 10 days

Hello. The task is clear: we need to automate the testing of dialogues for AI managers after changes in the prompt/developer message, to quickly check the quality of responses in various scenarios before launching into production.

I can propose an MVP system for automated prompt testing:

— a set of test scenarios for different products and situations;
— automatic dialogue initiation via OpenAI API;
— comparison of responses before/after the prompt change;
— evaluation of responses based on criteria: accuracy, tone alignment, completeness, stability, absence of undesirable responses;
— saving results in a table or database;
— a brief report for each test: what improved, what worsened, which responses need attention;
— the ability to receive recommendations on what to change in the prompt for better stability.
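For the "saving results in a table or database" item, a minimal Python sketch using the standard `sqlite3` module is shown below. The table layout, column names, and sample rows are assumptions for illustration.

```python
import sqlite3
from typing import Dict


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the results database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               prompt_version TEXT,
               scenario TEXT,
               passed INTEGER,
               reply TEXT
           )"""
    )
    return conn


def save_result(conn, prompt_version: str, scenario: str, passed: bool, reply: str) -> None:
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (prompt_version, scenario, int(passed), reply),
    )
    conn.commit()


def pass_rate_by_version(conn) -> Dict[str, float]:
    """Average of the 0/1 `passed` flag per prompt version."""
    rows = conn.execute(
        "SELECT prompt_version, AVG(passed) FROM results "
        "GROUP BY prompt_version ORDER BY prompt_version"
    )
    return dict(rows.fetchall())


conn = open_store()
save_result(conn, "v1", "price", True, "The price is 500 UAH.")
save_result(conn, "v1", "discount", True, "Discounts start from 3 items.")
save_result(conn, "v2", "price", True, "It costs 500 UAH.")
save_result(conn, "v2", "discount", False, "Sure, 50% off!")
rates = pass_rate_by_version(conn)
print(rates)
```

Keeping the raw reply next to the pass flag lets the report show what changed, not only that something did.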

The implementation can be done as a separate script or a simple internal tool. For the MVP, I suggest starting with testing on 5–10 scenarios, then scaling it for different types of products and dialogues.

I am ready to discuss your current architecture, the format of prompt_id/developer message, examples of dialogues, and the desired report format.

Nick Osipov

41 4

Budget: 1000 UAH Deadline: 3 days

Good day!

I understand the challenge of manual testing of AI prompts for Instagram/Messenger/Telegram. I have experience in automating dialogues with the OpenAI API and developing scripts. I will develop the logic and tools for automatic response verification and prompt optimization.

Message me privately, and we will clarify the details.

Viktor Piven

18 3

Budget: 1000 UAH Deadline: 1 day

Hello. I have experience in automating dialogue testing through simulation (Synthetic Users) and evaluating metrics (LLM-as-a-Judge). To avoid building a system from scratch, it makes sense to integrate ready-made tools like Promptfoo or DeepEval for this logic.

I suggest discussing all technical requirements and scenarios in more detail. This will allow us to form an accurate estimate of costs and timelines for the full integration of the solution into your SaaS. I am ready for a dialogue.

Volodymyr S.

9 1

Budget: 1000 UAH Deadline: 3 days

Hello! I have carefully reviewed your project and am ready to start working. I guarantee quality and timely execution.

Oleksandr Antipov
Kyiv, Ukraine

Projects -
Rating -
Rating 85
