You are here: Home / Project Moonshot

Project Moonshot

An LLM Evaluation Toolkit

Transforming LLM testing with Project Moonshot

Project Moonshot is one of the world’s first Large Language Model (LLM) Evaluation Toolkits. It is an open-source tool to bring benchmarking and red teaming together to help LLM App developers and compliance teams to test and evaluate their LLMs and LLM applications. This open-source tool is hosted on GitHub.

Project Moonshot’s python library is accompanied with a Web UI that guides users through a streamlined testing workflow

Project Moonshot in 2025 —
Updated to support your growing challenges

01

Ease to implement IMDA’s Starter Kit. Our web interface makes it easy to connect your applications and run benchmark tests recommended in IMDA’s Starter Kit.

02

Comprehensive benchmark datasets. One place to access more than 100 (and still growing) benchmark datasets, with pre-built evaluators.

How is Project Moonshot different from the AI Verify Toolkit?

Simple. Project Moonshot covers LLMs, and AI Verify Toolkit covers traditional AI.

In the LLM space, companies often ask “which foundation LLM model best suits our goals?” and “how do we ensure our application, building on the model we chose, is robust and safe?” Moonshot helps companies answer these questions through a comprehensive suite of benchmarking tests and scoring reports, so they can deploy responsible Generative AI systems with confidence.

We’re very excited about Project Moonshot, it is one of the first tangible representations globally of what it means to approach AI Safety, in a way that is actionable for companies and AI teams.

- JX Wee, Regional Director, Asia Pacific at DataRobot

Extending AI Verify into Generative AI

To enable AI Verify Foundation’s mission of responsible AI, we have extended our products to cover LLMs. This will help companies address the significant opportunities, and associated risks of generative AI and the base LLM technology.

In the LLM space, companies often ask “which foundation LLM model best suits our goals?” and “how do we ensure our application, building on the model we chose, is robust and safe?”

Moonshot helps companies answer these questions through a comprehensive suite of benchmarking tests and scoring reports, so they can deploy responsible Generative AI systems with confidence.

We’re very excited about Project Moonshot, it is one of the first tangible representations globally of what it means to approach AI Safety, in a way that is actionable for companies and AI teams.

- JX Wee, Regional Director, Asia Pacific at DataRobot

Leveraging the Global Community

Project Moonshot embodies the AI Verify Foundation’s commitment to involve the global community in making AI trustworthy and safe for humanity. The Foundation collaborates with industry, governments, and civil societies, to ensure that the unique culture, heritage, and values of our communities are represented and tested.

The launch of Project Moonshot was made possible by the active participation and support of our design partners and contributors.

Design Partners

Contributors

As one of the first-movers towards global testing standards, the Foundation is partnering ML Commons to develop globally aligned safety benchmarks for LLMs. Our common objective is to develop robust benchmarks that will advance AI safety endeavours, ensuring that AI technologies serve the betterment of humanity.

Why Use Project Moonshot?

1. Benchmark to Measure Model Safety and Performance

Benchmarks are “Exam questions” to test the model across a variety of competencies, e.g., language and context understanding.

Project Moonshot offers a list of benchmarks which are popular, including those widely discussed in the community, and those used by leaderboards such as Hugging Face, to measure your model’s performance. This provides developers with valuable insights to improve and refine the application.

2. Setting Testing Baselines & Simple Rating Systems

Project Moonshot simplifies model testing with curated benchmarks scored on a graded scale. This empowers users to make informed decisions based on clear and intuitive test results, enhancing the reliability and trustworthiness of their AI models.

3. Enabling Manual and Automated Red-Teaming

Project Moonshot facilitates manual and automated red-teaming, incorporating automated attack modules based on research-backed techniques to test multiple LLM applications simultaneously.

Red-Teaming allows the adversarial prompting of LLMs to induce them to behave in a manner incongruent with their design.

As Red-Teaming conventionally relies on humans, it is hard to scale. Project Moonshot has developed some attack modules that enable automated prompt generation, which allows automated red teaming.

4. Reduce Testing & Reporting Complexity

Project Moonshot streamlines testing processes and reporting, seamlessly integrating with CI/CD pipelines for unsupervised test runs and generating shareable reports. This saves time and resources while ensuring thorough evaluation of model performance.

5. Customisation for your unique application needs

Recognising the diverse needs of different applications, Project Moonshot’s Web UI guides users to identify and run only the most relevant tests, to optimise the testing process. Users can also tailor their tests with custom datasets, to evaluate their models for their unique use cases.

Explore Project Moonshot Today!

Use Project Moonshot to evaluate your LLM applications, and share your ideas or contribute to this community effort.