Project Moonshot

An LLM Evaluation Toolkit

Transforming LLM testing with Project Moonshot

Project Moonshot is one of the world’s first Large Language Model (LLM) Evaluation Toolkits, designed to integrate benchmarking, red teaming, and testing baselines. It helps developers, compliance teams, and AI system owners manage LLM deployment risks by providing a seamless way to evaluate their applications’ performance, both pre- and post-deployment. This open-source tool is hosted on GitHub and is currently in beta.

Project Moonshot’s Python library is accompanied by a Web UI that guides users through a streamlined testing workflow.

Extending AI Verify into Generative AI

To advance the AI Verify Foundation’s mission of responsible AI, we have extended our products to cover LLMs. This will help companies address both the significant opportunities and the associated risks of generative AI and the underlying LLM technology.
In the LLM space, companies often ask “which foundation model best suits our goals?” and “how do we ensure our application, built on the model we chose, is robust and safe?”
Moonshot helps companies answer these questions through a comprehensive suite of benchmarking tests and scoring reports, so they can deploy responsible Generative AI systems with confidence.

We’re very excited about Project Moonshot. It is one of the first tangible representations globally of what it means to approach AI safety in a way that is actionable for companies and AI teams.

Leveraging the Global Community

Project Moonshot embodies the AI Verify Foundation’s commitment to involve the global community in making AI trustworthy and safe for humanity. The Foundation collaborates with industry, governments, and civil society to ensure that the unique culture, heritage, and values of our communities are represented and tested.

The launch of Project Moonshot was made possible by the active participation and support of our design partners and contributors.

Design Partners


As one of the first movers towards global testing standards, the Foundation is partnering with MLCommons to develop globally aligned safety benchmarks for LLMs. Our common objective is to develop robust benchmarks that will advance AI safety endeavours, ensuring that AI technologies serve the betterment of humanity.

Why Use Project Moonshot?

1. Benchmark to Measure Model Safety and Performance
Benchmarks are “exam questions” that test a model across a variety of competencies, e.g., language and context understanding.
Project Moonshot offers a list of popular benchmarks, including those widely discussed in the community and those used by leaderboards such as Hugging Face’s, to measure your model’s performance. This provides developers with valuable insights to improve and refine their applications.
2. Setting Testing Baselines & Simple Rating Systems
Project Moonshot simplifies model testing with curated benchmarks scored on a graded scale. This empowers users to make informed decisions based on clear and intuitive test results, enhancing the reliability and trustworthiness of their AI models.
3. Enabling Manual and Automated Red-Teaming
Project Moonshot facilitates manual and automated red-teaming, incorporating automated attack modules based on research-backed techniques to test multiple LLM applications simultaneously.
Red teaming is the adversarial prompting of LLMs to induce them to behave in a manner incongruent with their design.
Because red teaming conventionally relies on humans, it is hard to scale. Project Moonshot includes attack modules that generate prompts automatically, which makes automated red teaming possible.
4. Reduce Testing & Reporting Complexity
Project Moonshot streamlines testing and reporting: it integrates with CI/CD pipelines for unattended test runs and generates shareable reports. This saves time and resources while ensuring thorough evaluation of model performance.
5. Customisation for your unique application needs
Recognising the diverse needs of different applications, Project Moonshot’s Web UI guides users to identify and run only the most relevant tests, to optimise the testing process. Users can also tailor their tests with custom datasets, to evaluate their models for their unique use cases.
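
To make the ideas of a graded scale and a CI/CD gate more concrete, here is a minimal Python sketch. It does not use Project Moonshot’s API. The grade bands, the `BenchmarkResult` class, and the `to_grade` and `ci_exit_code` functions are all hypothetical names and thresholds chosen for illustration; the toolkit’s actual grading scale and pipeline integration may differ.

```python
from dataclasses import dataclass

# Hypothetical grade bands (NOT Moonshot's actual scale):
# (minimum score, grade), ordered from best to worst.
GRADE_BANDS = [(80.0, "A"), (60.0, "B"), (40.0, "C"), (20.0, "D"), (0.0, "E")]


@dataclass
class BenchmarkResult:
    """Outcome of one benchmark run (illustrative structure only)."""
    name: str
    correct: int
    total: int

    @property
    def score(self) -> float:
        """Percentage of prompts answered correctly, from 0 to 100."""
        if self.total == 0:
            raise ValueError(f"benchmark {self.name!r} has no prompts")
        return 100.0 * self.correct / self.total


def to_grade(score: float) -> str:
    """Map a percentage score onto the hypothetical A-E scale."""
    for minimum, grade in GRADE_BANDS:
        if score >= minimum:
            return grade
    raise ValueError(f"score must be non-negative, got {score}")


def ci_exit_code(results: list[BenchmarkResult], minimum_grade: str = "C") -> int:
    """Return 0 if every benchmark meets the minimum grade, otherwise 1."""
    order = [grade for _, grade in GRADE_BANDS]  # best grade first
    worst_allowed = order.index(minimum_grade)
    failed = [
        r.name for r in results
        if order.index(to_grade(r.score)) > worst_allowed
    ]
    for name in failed:
        print(f"benchmark below grade {minimum_grade}: {name}")
    return 1 if failed else 0


if __name__ == "__main__":
    runs = [BenchmarkResult("qa", 45, 50), BenchmarkResult("safety", 15, 50)]
    for run in runs:
        print(run.name, to_grade(run.score))
    # A CI job could pass this value to sys.exit() to fail the build.
    print("exit code:", ci_exit_code(runs))
```

In a pipeline, a script like this would run after the test suite and use the returned value as the process exit code, so a build fails whenever any benchmark falls below the chosen grade.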

Explore Project Moonshot Today!

Use Project Moonshot to evaluate your LLM applications, and share your ideas or contribute to this community effort.
