First Stage: Baseline Performance – Testing of the Final Output
Second Stage: Deep Dive into Baseline Performance – Testing of Individual Components
First and Second Stages: Robustness Testing – Perturbed Questions
The test design laid a foundation for future iterations that could: (a) account for other factors in user satisfaction, such as clarity, relevance, and fluency; (b) take into account the relative importance of the claims in the ground-truth data; and (c) address LLMs’ tendency to generate varying outputs for the same query. These aspects, elaborated upon below, represent opportunities for a more comprehensive evaluation of the application’s performance.
Capturing linguistic quality and user experience: Responses that are correct but poorly structured or difficult to understand can undermine an application’s utility. Additional metrics such as linguistic quality, relevance and user experience are thus vital to a more comprehensive understanding of the application’s overall performance. That said, ground truth-based metrics will still be needed: user feedback is useful for understanding user experience, but because users are unlikely to have access to ground-truth information, it cannot reliably capture the correctness of a response.
Prioritising claims: Incorporating a measure of the relative importance of claims would allow the testing process to prioritise the most critical ones, aligning it more closely with human judgment and providing more meaningful insights into the application’s performance (claim weighting is illustrated in the sketch after these points).
Executing multiple runs: Since LLMs are probabilistic and can generate varying outputs for the same query, conducting multiple runs per input helps improve the statistical robustness of the testing process. Averaging performance across runs provides a more reliable measure of the application’s output correctness, reducing the impact of response variability. Multiple runs also allow for the identification of inputs that the application consistently gets wrong, highlighting particularly problematic queries that may warrant greater focus, as in the sketch below.
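To make points (b) and (c) concrete, the following is a minimal sketch of multi-run testing with claim weighting. It is illustrative rather than a description of the actual test harness: `generate` and `extract_claims` are hypothetical stand-ins for the application under test and a claim-decomposition step, and exact string matching stands in for the semantic claim matching a real evaluator would need.

```python
import statistics
from typing import Callable

def weighted_f1(output_claims: set[str], truth_weights: dict[str, float]) -> float:
    """Claim-level F1 where each ground-truth claim carries a weight
    reflecting its relative importance (one possible design choice)."""
    # Recall: weighted share of ground-truth claims recovered in the output.
    total_weight = sum(truth_weights.values())
    matched_weight = sum(w for claim, w in truth_weights.items() if claim in output_claims)
    recall = matched_weight / total_weight if total_weight else 0.0
    # Precision: share of output claims that appear in the ground truth.
    precision = (len(output_claims & set(truth_weights)) / len(output_claims)
                 if output_claims else 0.0)
    return (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

def evaluate_query(generate: Callable[[str], str],        # application under test (hypothetical)
                   extract_claims: Callable[[str], set[str]],  # claim decomposition (hypothetical)
                   query: str,
                   truth_weights: dict[str, float],
                   n_runs: int = 5) -> tuple[float, float]:
    # Run the same query several times to smooth out response variability.
    scores = [weighted_f1(extract_claims(generate(query)), truth_weights)
              for _ in range(n_runs)]
    # The mean gives a stable estimate of correctness; a low maximum flags
    # queries the application gets wrong on every single run.
    return statistics.mean(scores), max(scores)
```

In practice, the mean score would be averaged across the full question set to track overall performance, while queries whose maximum score stays low across all runs would be escalated for closer review.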
1 Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., … Zeng, Y. (2025). International AI Safety Report. arXiv. https://doi.org/10.48550/arXiv.2501.17805
2 As set out in Bengio et al. (2025) (see note 1 above), risk identification and assessment form the initial stages of AI risk management practices.
3 We referred to the risk taxonomy in Slattery et al. (2024). Slattery, P., Saeri, A. K., Grundy, E. A. C., Graham, J., Noetel, M., Uuk, R., Dao, J., Pour, S., Casper, S., & Thompson, N. (2024). The AI risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence. arXiv. https://doi.org/10.48550/arXiv.2408.12622
4 Claims are distinct pieces of information or factual statements. For example, a generated response from the application might be “Travellers require a visa for entry and their passport must be valid for at least six months”. This response might be broken down into two claims: (1) A visa is required for entry, and (2) The passport must be valid for at least six months.
5 Precision measures the accuracy of a system’s outputs by assessing how many of its results are correct. Imagine searching for “red shoes” online. High precision means that most of the results are actually red shoes, not sandals.
6 Recall measures the completeness of a system’s outputs by assessing how many of the total possible correct results it successfully identifies. Using the same example, high recall means the search engine found most of the red shoes available online, even if it also showed other red items.
7 The F1 score combines precision and recall into a single score. It measures the overall accuracy of a system, balancing showing only relevant results (i.e. precision) with finding all the relevant results (i.e. recall). A high F1 score for red shoes would mean that most of the results shown are red shoes and that most of the available red shoes were found. A worked example follows these notes.
8 For example, in email spam filters, while high recall (i.e. catching all spam emails) is beneficial, it is generally less harmful to let some spam emails through than to misclassify critical messages as spam and risk them being overlooked.
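To make the definitions in notes 4 to 7 concrete, the following is a small worked example of claim-level precision, recall and F1, built on the travel-visa response in note 4. The third ground-truth claim is a hypothetical addition for illustration, and exact string matching again stands in for the semantic claim matching a real evaluator would need.

```python
# Claims extracted from the generated response (per note 4).
generated = {
    "A visa is required for entry",
    "The passport must be valid for at least six months",
}
# Ground-truth claims; the third is a hypothetical addition for illustration.
ground_truth = {
    "A visa is required for entry",
    "The passport must be valid for at least six months",
    "Proof of onward travel is required",
}

true_positives = generated & ground_truth                 # claims that are both generated and true
precision = len(true_positives) / len(generated)          # 2/2 = 1.00
recall = len(true_positives) / len(ground_truth)          # 2/3 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)        # = 0.80

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the application stated nothing incorrect (precision of 1.0) but omitted one of the three ground-truth claims, so recall, and with it the F1 score, drops.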
Your organisation’s background – Could you briefly share your organisation’s background (e.g. sector, goods/services offered, customers), the AI solution(s) developed, used or deployed in your organisation, and what they are used for (e.g. product recommendation, improving operational efficiency)?
Your AI Verify use case – Could you share the AI model and use case that was tested with AI Verify? Which version of AI Verify did you use?
Your experience with AI Verify – Could you share your journey in using AI Verify? For example, what preparation work was needed for the testing, what challenges were faced, and how were they overcome? How did you find the testing process, and did it take long to complete?
Your key learnings and insights – Could you share 2 to 3 key learnings and insights from the testing process? Have you taken any actions after using AI Verify?