
Just a month after its initial unveiling, Anthropic’s Claude Mythos model is already showing capabilities beyond those first reported. Originally deemed too powerful for general release, the model has outperformed its own initial benchmark results in updated testing.
The UK AI Security Institute (AISI) recently published a blog post detailing its updated evaluation of Mythos. Its tests found that a newer preview version of the model significantly surpassed both the earlier version’s performance and that of OpenAI’s GPT-5.5, a sign of how quickly AI capabilities are advancing.
Mythos Breaks New Cyber Testing Ground
The AISI’s updated assessment focused on sophisticated cyber ranges. The model completed “The Last Ones” range in 6 out of 10 attempts and solved “Cooling Tower” in 3 out of 10. No AI model had previously completed the “Cooling Tower” range.
When Anthropic first launched Mythos Preview and its Project Glasswing cybersecurity alliance, the AISI’s initial evaluation rated the model as a significant leap forward in AI. That independent validation put the hype surrounding Mythos in context, indicating that it reflected a substantial advance in capability rather than marketing alone.
Rapid Acceleration in AI Performance
The AISI’s latest findings underscore a critical trend: AI models are rapidly improving their ability to handle complex cyber tasks, with profound implications for cybersecurity. Mythos, in particular, has shown a notable aptitude for detecting software vulnerabilities.
According to internal AISI estimates from February 2026, the complexity of cyber tasks that AI models could complete had been doubling every 4.7 months since late 2024, faster than the 8-month doubling time the institute estimated in November 2025. The performance of Claude Mythos Preview and OpenAI’s GPT-5.5 has since substantially exceeded even the faster of these trends.
- In February 2026, AISI estimated AI cyber task capabilities doubled every 4.7 months.
- This was an acceleration from November 2025’s estimate of 8 months.
- Mythos and GPT-5.5 have since substantially exceeded these trends.
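To make the two doubling times easier to compare, the short Python sketch below converts each into a growth factor over a fixed period. It assumes a steady exponential trend, which is a simplification for illustration and not AISI's own method; the function and constant names are hypothetical.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months`, assuming a constant doubling time.

    Illustrative only: real capability trends need not follow a clean exponential.
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


# Doubling times quoted in the article, in months.
NOV_2025_ESTIMATE = 8.0
FEB_2026_ESTIMATE = 4.7

if __name__ == "__main__":
    for label, doubling in [
        ("November 2025 estimate (8 months)", NOV_2025_ESTIMATE),
        ("February 2026 estimate (4.7 months)", FEB_2026_ESTIMATE),
    ]:
        yearly = growth_factor(12, doubling)
        print(f"{label}: about {yearly:.1f}x per year")
```

Under this assumption, an 8-month doubling time works out to a bit under 3x growth in task complexity per year, while a 4.7-month doubling time works out to a bit under 6x.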
The Limits of Current Evaluation Methods
Despite these breakthroughs, AISI acknowledges several unknowns in its current testing methodology. To facilitate historical comparisons, tasks were capped at 2.5 million tokens, which inherently understates the full potential of frontier models.
The success rates of Mythos Preview and GPT-5.5 on the longest tasks in AISI’s “narrow cyber suite” were nearly 100%, even with this token limit. Because the models succeed on almost every task in the suite, the tests cannot precisely pinpoint where they begin to fail or how reliable they would be at much higher task lengths. Without the cap, their measured performance would likely be even higher.
Researchers noted that a 2.5 million token limit is relatively low, especially for advanced models that benefit significantly from greater context. Experiments using up to 100 million tokens suggested that performance would continue to improve beyond that budget, particularly for recent models like Mythos, which thrive on extensive token access and complex agent infrastructure.
Source: ZDNet – AI