Testing prompts for AI agent
We have implemented a SaaS platform for connecting AI managers for Instagram, Messenger, and Telegram (so that AI communicates with clients instead of a human).
We use prompt_id (developer message) from OpenAI as a prompt for communication in chats.
Once the prompt is created, it is necessary to test dialogues for different scenarios and products before launching into production.
Currently, testing is done manually - changes are made to the prompt, the dialogue is tested 5-10 times, and this takes a lot of time because after each edit, it is necessary to test conversation scenarios for different types of products and under different situations.
It is necessary to think through the logic and tools that can be used for automated testing of dialogues (automated conversations) - if the client writes this way, the AI responds that way; if the prompt is changed, how will the AI's response change.
Also, the AI should suggest what to change in the prompt for better stability and predictability of responses.
Please write if you have already implemented automated testing of dialogues between AI and clients.
-
We can start with the first stage - to design the logic of automated tests for dialogues and create a working prototype for 10-20 scenarios. The budget of 1000 UAH for this task, in my opinion, won't even cover proper design, but we can keep it simple and start with a compact stage for 12000 UAH over 5 days =)
We have had similar tasks in AI and automation
> https://business.ingello.com/vorfahr - AI logic, automation of decisions, and quality control of responses
> https://business.ingello.com/fractal - agent processes, scenarios, system behavior stability
> https://systems-fl.ingello.com/ua - a brief overview of Ingello Systems for the exchange
I would build this as a test stand - a set of scenarios, benchmark expectations, evaluating responses not only by exact text but by content, tone, adherence to rules, absence of forbidden promises, and stability after changing prompt_id. Additionally, we could include an AI reviewer that compares old and new responses and suggests what to change in the prompt for better predictability.
… From you, we will need 3-5 real examples of dialogues, the current prompt without critical secrets, types of products, and rules that the manager must or must not violate. Access to production at the first stage is not necessary - a test key or examples of responses will be sufficient.
I would like to clarify 2 things
> tests should be run through the OpenAI API directly or through your SaaS platform
> is it more important for you to find bad responses after changing the prompt or to automatically generate new scenarios for testing?
-
Hello, I have been working on automating the testing of chatbots for an e-commerce platform with 15+ dialogue scenarios, which reduced testing time by 80% and improved response quality by 35%.
I am curious about what metrics you use to evaluate the quality of AI agent responses and how you plan to measure effectiveness after automating testing?
I suggest we get in touch; I will provide you with free technical consultation and we can create a development plan + I will tell you about my team!
-
177 Hello.
I can help with the design and implementation of an automated testing system for dialogues for AI agents (prompt-based testing).
The solution may include:
generation of test cases (customer scenarios: sales, support, objections, etc.)
running dialogues through different versions of prompt_id
comparing responses (regression testing for LLM)
… evaluating the stability/quality of responses (score/criteria)
logging changes between prompt versions
automatic analysis: where the prompt "failed" and what needs improvement
A module can also be added that:
analyzes dialogues and suggests prompt improvements (self-improvement loop)
I have experience with LLM, prompt engineering, and automating testing of dialogue systems.
I can propose an MVP architecture and an estimate after discussing the current implementation.
-
196 We already have a nearly ready similar solution for automatic testing of AI manager dialogues, which can be quickly adapted to your SaaS platform and launch the first result. We can discuss this here on the marketplace; I am available. ))
Regarding the budget - 1000 UAH seems too little for such a task; I would estimate the first working phase at 32000 UAH for 10 days.
Look, there’s a nuance - it’s important to test not just one response, but the stability of the scenario after each change of prompt_id.
We would do this as a set of regression tests for dialogues - client simulator, different types of products, negative situations, expected response boundaries, comparison of prompt_id versions, and a report on deviations.
Separately, we can add an AI reviewer that will look for weaknesses in the prompt and suggest changes for better stability and predictability of responses.
…
From you, we need test access to the API or a stand, 5-10 real dialogues, product examples, and criteria for undesirable responses.
I would like to clarify 2 points - do you already have an API for launching a dialogue with a specific prompt_id, and do we need to test only text responses, or also buttons, statuses, and handover to an operator?
Similar examples below:
- https://business.ingello.com/fractal - close to automating development and checking AI results
- https://business.ingello.com/vorfahr - SaaS with an AI component and product logic
- https://systems-fl.ingello.com/ua - our profile on FLH
In general, it’s fine to start with a small phase - initially 5-7 scenarios, then expand the test set for new products and situations.
-
457 Good day!
The project is very interesting and close to our field: AI managers, prompt engineering, testing dialogue scenarios, and stabilizing AI responses before launching into production.
We can help devise the logic for automated dialogue testing for your prompt_id / developer messages.
Here’s a possible structure for the solution:
— creating a set of test scenarios for different products and types of clients
— automatic dialogue launch after prompt changes
— comparison of responses before/after prompt changes
— evaluation of responses based on criteria: accuracy, stability, tone alignment, presence of required data, absence of undesirable formulations
— detection of "broken" scenarios after edits
… — generating a report on test results
— AI recommendations for improving prompts for more stable and predictable responses
Such a system can be built as a prompt QA / regression testing framework for AI dialogues: with a library of scenarios, expected results, response evaluations, and logs of changes for each version of the prompt.
We can discuss your current testing logic, product types, dialogue scenarios, and propose an MVP architecture for automated testing of AI managers.
-
457 It seems you already have a strong AI SaaS infrastructure, but the bottleneck right now is the regression testing of prompts after each change. This is a typical problem for AI support/sales systems, where even a small edit in the developer prompt can break the dialogue logic or change the tone/qualification flow.
I have worked with AI consultants for Instagram Direct and automated funnels through Chatfuel + OpenAI + Make.com, where it was important to ensure the stability of responses and predictable AI behavior in various scenarios. One approach here is to build a set of test cases (role-based conversations) + automatic dialogue runs through the OpenAI API with evaluation of responses based on predefined criteria: intent match, CTA consistency, objection handling, forbidden outputs, etc.
An AI review layer can also be implemented, where a separate LLM analyzes responses and suggests changes to the prompt structure for more stable model behavior after updates.
This looks like a good case for building an internal AI QA framework for your SaaS, and I am ready to help design the architecture and implementation of such testing.
-
690 5 1 Hello! The task is very familiar, manually testing prompts, it’s best to implement automated tests through LLM-as-a-Judge. I am ready to develop such a module for your SaaS. Write to me in private messages, we will discuss the details.
-
432 1 0 Hello!
I have implemented something similar: automated testing of prompts through Make.com — a set of simulated dialogues is launched after each change of prompt_id, and the results are compared with reference answers.
I can build a system: test cases based on scenarios → auto-launch of dialogues → AI analysis of deviations → specific recommendations on what to change in the prompt.
I am ready to discuss the architecture and start working.
-
472 1 0 Good day! The logic for your task:
Stack: Promptfoo (YAML-based, native A/B testing, side-by-side output diff before/after prompt editing) + DeepEval for quality metrics (faithfulness, relevance, conversation completeness, role adherence). Test suites - JSON with user_persona + context + expected behavior + edge cases. When prompt_id changes, all scenarios are automatically run, diffs are highlighted, regressions are immediately visible.
For self-suggestion improvements - a separate "critic" agent on Claude Sonnet 4.6, which reads failed test cases and returns structured suggestions in JSON ("add a rule about X in the system prompt — in 7/10 tests the model confused Y with Z"). Suggestions are tied to specific failed assertions, not general advice.
Optionally: integration with your prompt_id workflow OpenAI via API — versioning prompts and automatic rollback when metrics fall below the threshold.
A week ago, I took 3rd place solo at the AI Agent Olympics Hackathon Milan AI Week 2026 (731 teams, the largest AI event in Europe) - built an adversarial multi-agent system with an embedded eval layer. Full-time AI engineer for over 1 year. MSc in Strategic PM, PRINCE2.
…
Price: 18,000-25,000 UAH depending on the number of test cases and product types, 10-14 days with documentation.
Cases in the profile.
-
650 2 0 Good day! 👋
An interesting task — automated testing of dialogues is where you can really save dozens of hours a week.
We are implementing a system that runs scenarios through your prompt, compares responses before/after changes, and highlights degradation. A separate AI agent analyzes the results and suggests specific edits to the prompt for better stability.
We will discuss the details in private 🤝
-
253 Hello! We are a team of developers with 4 years of experience in creating standalone scripts, bots, and text information processing systems. The quality of the AI agent's work critically depends on the accuracy of prompt formulation and the predictability of its behavior under different conditions. We will take full responsibility for testing your system, checking the AI's response to non-standard or provocative user requests, and adjusting the logic for filtering output data. If necessary, we will automate the process of evaluating responses using Python. The result of our work will be fully optimized, production-ready prompts and a detailed report on the AI's behavior. Let's discuss the tasks and current architectural solutions in private messages!
-
256 Hello! Our team has 4 years of experience in process automation, developing smart bots, and working with data in Python. We are professionally engaged in integrating language models and prompt engineering, so testing and calibrating prompts for your AI agent is our specialty. We will approach the process systematically: we will develop test scenarios, conduct stress testing based on prepared datasets, minimize model hallucinations, and ensure clear adherence to system prompts. We will ensure high relevance, stability of responses, and optimize API token costs. We are ready to start testing the first hypotheses today. When would it be convenient for you to discuss the agent's logic in chat?
-
315 2 1 Hello, Oleksandr!
The task is very familiar and relevant. Manual testing of prompts across different dialogue branches for SaaS is indeed a bottleneck that consumes time.
I propose to implement an automated testing framework based on the "LLM-as-a-Judge" principle (AI-Judge) in Python.
How it will work technically:
Test cases: We create a JSON/CSV file with benchmark situations (for example: "The client aggressively asks for the price", "The client requests a discount").
…
Automation (Script): My Python script will automatically "feed" these replies to your AI manager via API and collect its responses.
AI-Judge (Evaluation and Recommendations): The collected responses will be sent in a separate API call (OpenAI) with a strict system prompt for the tester. This "AI-Judge" analyzes the manager's response for tone of voice compliance, absence of hallucinations, and generates a log:
Rating: 8/10. Error: the bot gave a discount without conditions. Recommendation: add a rule in the developer message "Never give a discount first".
Why me:
I have extensive experience working with neural network APIs (OpenAI, Groq). My current commercial project is a complex Telegram bot, the architecture of which is built on multi-level prompt engineering, where AI acts as an analyst and critic (analyzing texts, suggesting improvements).
I can write such a Python script for testing that you can run locally or on a server after each prompt change.
I am ready to discuss the implementation details!
-
919 4 0 Hello, Oleksandr, let's take turns.
Recently, I have been working a lot with AI and have already completed similar tasks. I suggest implementing this using pydanticAI. There is a separate, already implemented module for prompt evaluation, with the ability for automatic assessment and improvement.
There are also other modules for similar tasks, such as deepeval and DSPy. We can implement it through them.
The logic of construction is quite simple:
1. We create a certain test set (or we can also assign this to AI)
2. We conduct testing for each set
3. We check the validity of the result (we can add LLM-as-a-Judge)
4. We edit the prompt.
5. And so iteratively, until the verification cycle passes at the required level.
… The task is clear, there is experience. I would be happy to work with you!
-
266 Hello. The task is clear: we need to automate the testing of dialogues for AI managers after changes in the prompt/developer message, to quickly check the quality of responses in various scenarios before launching into production.
I can propose an MVP system for automated prompt testing:
— a set of test scenarios for different products and situations;
— automatic dialogue initiation via OpenAI API;
— comparison of responses before/after the prompt change;
— evaluation of responses based on criteria: accuracy, tone alignment, completeness, stability, absence of undesirable responses;
— saving results in a table or database;
… — a brief report for each test: what improved, what worsened, which responses need attention;
— the ability to receive recommendations on what to change in the prompt for better stability.
The implementation can be done as a separate script or a simple internal tool. For the MVP, I suggest starting with testing on 5–10 scenarios, then scaling it for different types of products and dialogues.
I am ready to discuss your current architecture, the format of prompt_id/developer message, examples of dialogues, and the desired report format.
-
4975 41 4 1 Good day!
I understand the challenge of manual testing of AI prompts for Instagram/Messenger/Telegram. I have experience in automating dialogues with the OpenAI API and developing scripts. I will develop the logic and tools for automatic response verification and prompt optimization.
Message me privately, and we will clarify the details.
-
2248 18 3 Hello. I have experience in automating dialogue testing through simulation (Synthetic Users) and evaluating metrics (LLM-as-a-Judge). To avoid building a system from scratch, it makes sense to integrate ready-made tools like Promptfoo or DeepEval for this logic.
I suggest discussing all technical requirements and scenarios in more detail. This will allow us to form an accurate estimate of costs and timelines for the full integration of the solution into your SaaS. I am ready for a dialogue.
-
726 9 1 Hello! I have carefully reviewed your project and am ready to start working. I guarantee quality and timely execution.
Current freelance projects in the category AI & Machine Learning
Automatic posting of stories on InstagramGood day, I need help with setting up automatic posting of stories on Instagram. There are already stories in the Instagram archive that have been published, and they need to be reposted. AI & Machine Learning, Bot Development ∙ 11 hours 48 minutes back ∙ 22 proposals |
Creation of an AI assistant for communication with ClientsIt is necessary to create an AI assistant for communication with Clients. The chat window will be located on our website, followed by communication with the bot. Questions about products, settings, capabilities, etc. In cases where the information is unknown or the request can… AI & Machine Learning, AI Consulting ∙ 1 day 7 hours back ∙ 33 proposals |
I am looking for a video editor who creates AI videos.Creation of AI videos for dentists and other experts Objective: To create short vertical videos for Instagram Reels, Facebook Reels, TikTok, and YouTube Shorts that explain complex topics in simple language and hold the viewer's attention through a combination of AI animation… AI & Machine Learning ∙ 1 day 14 hours back ∙ 2 proposals |
I am looking for a mentor/teacher for ComfyUI for online learning (working through RunPod)
16 USD
Hello. I am looking for a practicing specialist and mentor who can help me master working with ComfyUI. The main feature of my request is that the work will be done entirely in the cloud, without downloading the program to a local computer. I plan to rent a graphics card through… AI & Machine Learning ∙ 2 days 1 hour back ∙ 1 proposal |
AI agent of sports nutrition technologistThe agent helps develop formulations for new sports nutrition products — protein bars, proteins, pre-workouts, isotonic drinks, bars, etc. The main feature: the agent knows the legislation of different countries and automatically takes it into account when creating the… AI & Machine Learning, Web Programming ∙ 2 days 1 hour back ∙ 61 proposals |