The UK’s leading cyber and AI security agencies have broadly welcomed efforts to crowdsource the process of finding and fixing AI safeguard bypass threats.
In a blog post published today, the National Cyber Security Centre’s (NCSC) technical director for security of AI research, Kate S, and AI Security Institute (AISI) research scientist, Robert Kirk, warned of the risk such attacks pose to frontier AI systems.
Cybercriminals have already shown themselves adept at bypassing the built-in security and safety guardrails in models such as ChatGPT, Gemini, Llama and Claude. Just last week, ESET researchers discovered the “first known AI-powered ransomware,” built using an OpenAI model.
The NCSC and AISI said newly launched bug bounty programs from OpenAI and Anthropic could be a useful strategy for mitigating such risks, in the same way that vulnerability disclosure works to make regular software more secure.
Read more on safeguard bypass: GPT-5 Safeguards Bypassed Using Storytelling-Driven Jailbreak
Beyond keeping frontier AI system safeguards fit for purpose after deployment, such programs should help encourage a culture of responsible disclosure and industry collaboration, increase engagement across the security community and enable researchers to practise their skills, they added.
However, the NCSC and AISI warned that there could be significant overheads associated with triaging and managing threat reports, and that participating developers must first have good foundational security practices in place.
The Ingredients of a Good Disclosure Program
The blog outlined several best practice principles for developing effective public disclosure programs in the field of safeguard bypass threats:
- A clearly defined scope to help participants understand what success looks like
- Internal reviews to be completed, and any weaknesses they uncover to be addressed, before the program is launched
- Reports to be easy to track and reproduce, for example through unique IDs and tools for copying and sharing findings
The NCSC and AISI noted that the presence of such a program doesn’t automatically make a model safer or more secure, and encouraged further research into questions such as:
- Can other areas of cybersecurity offer useful tools or approaches to borrow?
- What incentives need to be offered to program participants?
- How should discovered safeguard weaknesses be mitigated?
- Are there methods for cross-sector collaboration that could support the handling of attacks which transfer across models and programs?
- How should we judge the severity of safeguard bypass weaknesses, especially when we don’t know the deployment context?
- How public and open should such programs be?