Switch to English?
Yes
Переключитись на українську?
Так
Переключиться на русскую?
Да
Przełączyć się na polską?
Tak
Post your project for free and start receiving proposals from freelancers within minutes after publication!

Testing prompts for AI agent

Translated22 USD

  1. 5093
     30  0
    Work example:
    Mobile app with admin
    5 days267 USD

    We can start with the first stage - to design the logic of automated tests for dialogues and create a working prototype for 10-20 scenarios. The budget of 1000 UAH for this task, in my opinion, won't even cover proper design, but we can keep it simple and start with a compact stage for 12000 UAH over 5 days =)

    We have had similar tasks in AI and automation
    > https://business.ingello.com/vorfahr - AI logic, automation of decisions, and quality control of responses
    > https://business.ingello.com/fractal - agent processes, scenarios, system behavior stability
    > https://systems-fl.ingello.com/ua - a brief overview of Ingello Systems for the exchange

    I would build this as a test stand - a set of scenarios, benchmark expectations, evaluating responses not only by exact text but by content, tone, adherence to rules, absence of forbidden promises, and stability after changing prompt_id. Additionally, we could include an AI reviewer that compares old and new responses and suggests what to change in the prompt for better predictability.

    From you, we will need 3-5 real examples of dialogues, the current prompt without critical secrets, types of products, and rules that the manager must or must not violate. Access to production at the first stage is not necessary - a test key or examples of responses will be sufficient.

    I would like to clarify 2 things
    > tests should be run through the OpenAI API directly or through your SaaS platform
    > is it more important for you to find bad responses after changing the prompt or to automatically generate new scenarios for testing?

  2. 673
     5  0

    7 days22 USD

    Hello, I have been working on automating the testing of chatbots for an e-commerce platform with 15+ dialogue scenarios, which reduced testing time by 80% and improved response quality by 35%.

    I am curious about what metrics you use to evaluate the quality of AI agent responses and how you plan to measure effectiveness after automating testing?

    I suggest we get in touch; I will provide you with free technical consultation and we can create a development plan + I will tell you about my team!

  3. 177  
    4 days33 USD

    Hello.

    I can help with the design and implementation of an automated testing system for dialogues for AI agents (prompt-based testing).

    The solution may include:

    generation of test cases (customer scenarios: sales, support, objections, etc.)
    running dialogues through different versions of prompt_id
    comparing responses (regression testing for LLM)
    evaluating the stability/quality of responses (score/criteria)
    logging changes between prompt versions
    automatic analysis: where the prompt "failed" and what needs improvement

    A module can also be added that:

    analyzes dialogues and suggests prompt improvements (self-improvement loop)

    I have experience with LLM, prompt engineering, and automating testing of dialogue systems.

    I can propose an MVP architecture and an estimate after discussing the current implementation.

  4. 196  
    10 days601 USD

    We already have a nearly ready similar solution for automatic testing of AI manager dialogues, which can be quickly adapted to your SaaS platform and launch the first result. We can discuss this here on the marketplace; I am available. ))

    Regarding the budget - 1000 UAH seems too little for such a task; I would estimate the first working phase at 32000 UAH for 10 days.

    Look, there’s a nuance - it’s important to test not just one response, but the stability of the scenario after each change of prompt_id.

    We would do this as a set of regression tests for dialogues - client simulator, different types of products, negative situations, expected response boundaries, comparison of prompt_id versions, and a report on deviations.

    Separately, we can add an AI reviewer that will look for weaknesses in the prompt and suggest changes for better stability and predictability of responses.

    From you, we need test access to the API or a stand, 5-10 real dialogues, product examples, and criteria for undesirable responses.

    I would like to clarify 2 points - do you already have an API for launching a dialogue with a specific prompt_id, and do we need to test only text responses, or also buttons, statuses, and handover to an operator?

    Similar examples below:
    - https://business.ingello.com/fractal - close to automating development and checking AI results
    - https://business.ingello.com/vorfahr - SaaS with an AI component and product logic
    - https://systems-fl.ingello.com/ua - our profile on FLH

    In general, it’s fine to start with a small phase - initially 5-7 scenarios, then expand the test set for new products and situations.

  5. 457  
    3 days111 USD

    Good day!
    The project is very interesting and close to our field: AI managers, prompt engineering, testing dialogue scenarios, and stabilizing AI responses before launching into production.
    We can help devise the logic for automated dialogue testing for your prompt_id / developer messages.
    Here’s a possible structure for the solution:
    — creating a set of test scenarios for different products and types of clients
    — automatic dialogue launch after prompt changes
    — comparison of responses before/after prompt changes
    — evaluation of responses based on criteria: accuracy, stability, tone alignment, presence of required data, absence of undesirable formulations
    — detection of "broken" scenarios after edits
    — generating a report on test results
    — AI recommendations for improving prompts for more stable and predictable responses
    Such a system can be built as a prompt QA / regression testing framework for AI dialogues: with a library of scenarios, expected results, response evaluations, and logs of changes for each version of the prompt.
    We can discuss your current testing logic, product types, dialogue scenarios, and propose an MVP architecture for automated testing of AI managers.

  6. 457  
    3 days24 USD

    It seems you already have a strong AI SaaS infrastructure, but the bottleneck right now is the regression testing of prompts after each change. This is a typical problem for AI support/sales systems, where even a small edit in the developer prompt can break the dialogue logic or change the tone/qualification flow.

    I have worked with AI consultants for Instagram Direct and automated funnels through Chatfuel + OpenAI + Make.com, where it was important to ensure the stability of responses and predictable AI behavior in various scenarios. One approach here is to build a set of test cases (role-based conversations) + automatic dialogue runs through the OpenAI API with evaluation of responses based on predefined criteria: intent match, CTA consistency, objection handling, forbidden outputs, etc.

    An AI review layer can also be implemented, where a separate LLM analyzes responses and suggests changes to the prompt structure for more stable model behavior after updates.

    This looks like a good case for building an internal AI QA framework for your SaaS, and I am ready to help design the architecture and implementation of such testing.

  7. 690    5  1
    5 days212 USD

    Hello! The task is very familiar, manually testing prompts, it’s best to implement automated tests through LLM-as-a-Judge. I am ready to develop such a module for your SaaS. Write to me in private messages, we will discuss the details.

  8. 432    1  0
    10 days100 USD

    Hello!

    I have implemented something similar: automated testing of prompts through Make.com — a set of simulated dialogues is launched after each change of prompt_id, and the results are compared with reference answers.

    I can build a system: test cases based on scenarios → auto-launch of dialogues → AI analysis of deviations → specific recommendations on what to change in the prompt.

    I am ready to discuss the architecture and start working.

  9. 472    1  0
    10 days401 USD

    Good day! The logic for your task:

    Stack: Promptfoo (YAML-based, native A/B testing, side-by-side output diff before/after prompt editing) + DeepEval for quality metrics (faithfulness, relevance, conversation completeness, role adherence). Test suites - JSON with user_persona + context + expected behavior + edge cases. When prompt_id changes, all scenarios are automatically run, diffs are highlighted, regressions are immediately visible.

    For self-suggestion improvements - a separate "critic" agent on Claude Sonnet 4.6, which reads failed test cases and returns structured suggestions in JSON ("add a rule about X in the system prompt — in 7/10 tests the model confused Y with Z"). Suggestions are tied to specific failed assertions, not general advice.

    Optionally: integration with your prompt_id workflow OpenAI via API — versioning prompts and automatic rollback when metrics fall below the threshold.

    A week ago, I took 3rd place solo at the AI Agent Olympics Hackathon Milan AI Week 2026 (731 teams, the largest AI event in Europe) - built an adversarial multi-agent system with an embedded eval layer. Full-time AI engineer for over 1 year. MSc in Strategic PM, PRINCE2.

    Price: 18,000-25,000 UAH depending on the number of test cases and product types, 10-14 days with documentation.

    Cases in the profile.

  10. 650    2  0
    1 day22 USD

    Good day! 👋

    An interesting task — automated testing of dialogues is where you can really save dozens of hours a week.

    We are implementing a system that runs scenarios through your prompt, compares responses before/after changes, and highlights degradation. A separate AI agent analyzes the results and suggests specific edits to the prompt for better stability.

    We will discuss the details in private 🤝

  11. 253  
    1 day22 USD

    Hello! We are a team of developers with 4 years of experience in creating standalone scripts, bots, and text information processing systems. The quality of the AI agent's work critically depends on the accuracy of prompt formulation and the predictability of its behavior under different conditions. We will take full responsibility for testing your system, checking the AI's response to non-standard or provocative user requests, and adjusting the logic for filtering output data. If necessary, we will automate the process of evaluating responses using Python. The result of our work will be fully optimized, production-ready prompts and a detailed report on the AI's behavior. Let's discuss the tasks and current architectural solutions in private messages!

  12. 256  
    1 day22 USD

    Hello! Our team has 4 years of experience in process automation, developing smart bots, and working with data in Python. We are professionally engaged in integrating language models and prompt engineering, so testing and calibrating prompts for your AI agent is our specialty. We will approach the process systematically: we will develop test scenarios, conduct stress testing based on prepared datasets, minimize model hallucinations, and ensure clear adherence to system prompts. We will ensure high relevance, stability of responses, and optimize API token costs. We are ready to start testing the first hypotheses today. When would it be convenient for you to discuss the agent's logic in chat?

  13. 315    2  1
    3 days33 USD

    Hello, Oleksandr!

    The task is very familiar and relevant. Manual testing of prompts across different dialogue branches for SaaS is indeed a bottleneck that consumes time.

    I propose to implement an automated testing framework based on the "LLM-as-a-Judge" principle (AI-Judge) in Python.

    How it will work technically:

    Test cases: We create a JSON/CSV file with benchmark situations (for example: "The client aggressively asks for the price", "The client requests a discount").

    Automation (Script): My Python script will automatically "feed" these replies to your AI manager via API and collect its responses.

    AI-Judge (Evaluation and Recommendations): The collected responses will be sent in a separate API call (OpenAI) with a strict system prompt for the tester. This "AI-Judge" analyzes the manager's response for tone of voice compliance, absence of hallucinations, and generates a log:
    Rating: 8/10. Error: the bot gave a discount without conditions. Recommendation: add a rule in the developer message "Never give a discount first".

    Why me:
    I have extensive experience working with neural network APIs (OpenAI, Groq). My current commercial project is a complex Telegram bot, the architecture of which is built on multi-level prompt engineering, where AI acts as an analyst and critic (analyzing texts, suggesting improvements).

    I can write such a Python script for testing that you can run locally or on a server after each prompt change.

    I am ready to discuss the implementation details!

  14. 919    4  0
    2 days22 USD

    Hello, Oleksandr, let's take turns.
    Recently, I have been working a lot with AI and have already completed similar tasks. I suggest implementing this using pydanticAI. There is a separate, already implemented module for prompt evaluation, with the ability for automatic assessment and improvement.
    There are also other modules for similar tasks, such as deepeval and DSPy. We can implement it through them.
    The logic of construction is quite simple:
    1. We create a certain test set (or we can also assign this to AI)
    2. We conduct testing for each set
    3. We check the validity of the result (we can add LLM-as-a-Judge)
    4. We edit the prompt.
    5. And so iteratively, until the verification cycle passes at the required level.
    The task is clear, there is experience. I would be happy to work with you!

  15. 266  
    10 days557 USD

    Hello. The task is clear: we need to automate the testing of dialogues for AI managers after changes in the prompt/developer message, to quickly check the quality of responses in various scenarios before launching into production.

    I can propose an MVP system for automated prompt testing:

    — a set of test scenarios for different products and situations;
    — automatic dialogue initiation via OpenAI API;
    — comparison of responses before/after the prompt change;
    — evaluation of responses based on criteria: accuracy, tone alignment, completeness, stability, absence of undesirable responses;
    — saving results in a table or database;
    — a brief report for each test: what improved, what worsened, which responses need attention;
    — the ability to receive recommendations on what to change in the prompt for better stability.

    The implementation can be done as a separate script or a simple internal tool. For the MVP, I suggest starting with testing on 5–10 scenarios, then scaling it for different types of products and dialogues.

    I am ready to discuss your current architecture, the format of prompt_id/developer message, examples of dialogues, and the desired report format.

  16. Nick Osipov Web4Business
    4975    41  4   1
    3 days22 USD

    Good day!

    I understand the challenge of manual testing of AI prompts for Instagram/Messenger/Telegram. I have experience in automating dialogues with the OpenAI API and developing scripts. I will develop the logic and tools for automatic response verification and prompt optimization.

    Message me privately, and we will clarify the details.

  17. 2248    18  3
    1 day22 USD

    Hello. I have experience in automating dialogue testing through simulation (Synthetic Users) and evaluating metrics (LLM-as-a-Judge). To avoid building a system from scratch, it makes sense to integrate ready-made tools like Promptfoo or DeepEval for this logic.

    I suggest discussing all technical requirements and scenarios in more detail. This will allow us to form an accurate estimate of costs and timelines for the full integration of the solution into your SaaS. I am ready for a dialogue.

  18. 726    9  1
    3 days22 USD

    Hello! I have carefully reviewed your project and am ready to start working. I guarantee quality and timely execution.

  19. Another 4 proposals concealed
    1 proposal concealed

Current freelance projects in the category AI & Machine Learning

Automatic posting of stories on Instagram

Good day, I need help with setting up automatic posting of stories on Instagram. There are already stories in the Instagram archive that have been published, and they need to be reposted.

AI & Machine LearningBot Development ∙ 11 hours 48 minutes back ∙ 22 proposals

Creation of an AI assistant for communication with Clients

It is necessary to create an AI assistant for communication with Clients. The chat window will be located on our website, followed by communication with the bot. Questions about products, settings, capabilities, etc. In cases where the information is unknown or the request can…

AI & Machine LearningAI Consulting ∙ 1 day 7 hours back ∙ 33 proposals

I am looking for a video editor who creates AI videos.

Creation of AI videos for dentists and other experts Objective: To create short vertical videos for Instagram Reels, Facebook Reels, TikTok, and YouTube Shorts that explain complex topics in simple language and hold the viewer's attention through a combination of AI animation…

AI & Machine Learning ∙ 1 day 14 hours back ∙ 2 proposals

I am looking for a mentor/teacher for ComfyUI for online learning (working through RunPod)

16 USD

Hello. I am looking for a practicing specialist and mentor who can help me master working with ComfyUI. The main feature of my request is that the work will be done entirely in the cloud, without downloading the program to a local computer. I plan to rent a graphics card through…

AI & Machine Learning ∙ 2 days 1 hour back ∙ 1 proposal

AI agent of sports nutrition technologist

The agent helps develop formulations for new sports nutrition products — protein bars, proteins, pre-workouts, isotonic drinks, bars, etc. The main feature: the agent knows the legislation of different countries and automatically takes it into account when creating the…

AI & Machine LearningWeb Programming ∙ 2 days 1 hour back ∙ 61 proposals

Client
Project published
26 days 8 hours back
915 views
Tags
  • saas
  • messenger
  • openai
  • Telegram
  • Instagram