Can AI Migrate Java? ScarfBench Offers Real-World Test

Modernizing enterprise applications is often among the largest and most expensive engineering efforts an organization undertakes. Teams migrate between frameworks to improve maintainability, cloud readiness, and developer productivity, and to gain access to newer capabilities.

The recent surge in capable coding agents has raised interest in AI-assisted modernization. One question remains open: can these agents reliably modernize complex, real-world enterprise applications?

Existing software engineering benchmarks show strong progress in areas like bug fixing and code generation, but framework migration is a different kind of problem. Success requires not only accurate code translation but also preserved application behavior, adapted build systems, and correctly resolved runtime dependencies.

Introducing ScarfBench: The Enterprise Java Migration Benchmark

To address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark designed to evaluate AI agents on cross-framework migration tasks in Enterprise Java.

ScarfBench focuses on migrations across three prominent Java ecosystems, reflecting common modernization paths for businesses:

Spring
Jakarta EE (formerly Java EE)
Quarkus

ScarfBench differs from traditional benchmarks in that it does not stop at comparing code. It measures whether migrated applications build, deploy, and, most importantly, preserve their original behavior when run.

The True Complexity of Framework Migration

Framework migration involves far more than swapping annotations or small code snippets. Migrating a single repository class, for instance, can require coordinated changes to dependency injection configuration, persistence settings, queries, and framework descriptors.

A small error in any of these interconnected components can cause a complete deployment failure. Successful migration therefore means translating framework semantics and architectural patterns, not converting source code line by line.

ScarfBench evaluates agents on both focused migration tasks and whole-application migrations. The tasks are derived from a JSR-based enterprise Java taxonomy, with expert-verified implementations in all three target frameworks.

Frontier Agents and Their Performance Insights

We evaluated several state-of-the-art coding agents on ScarfBench. Although these agents perform strongly on many traditional software engineering benchmarks, framework migration proved significantly harder.

Success rates varied considerably by framework pair, and whole-application migrations remained particularly challenging. A key finding was a consistent ordering across stages: compile success exceeded deploy success, which in turn exceeded behavioral success. Relying on build success alone can therefore significantly overestimate the quality of a migration.
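The staged ordering described above can be made concrete with a small sketch. The record layout, the function name stage_rates, and the sample data are all assumptions made up for illustration; they are not ScarfBench's schema or results. A stage counts as passed only if every earlier stage also passed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationResult:
    """Outcome of one migration attempt (hypothetical record layout)."""
    app: str
    compiled: bool
    deployed: bool
    behavior_ok: bool


def stage_rates(results):
    """Return the fraction of attempts that pass each cumulative stage."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    compiled = sum(r.compiled for r in results)
    deployed = sum(r.compiled and r.deployed for r in results)
    behaved = sum(r.compiled and r.deployed and r.behavior_ok for r in results)
    return {"compile": compiled / n, "deploy": deployed / n, "behavior": behaved / n}


# Made-up sample data for illustration only; not benchmark results.
SAMPLE = [
    MigrationResult("app-a", True, True, True),
    MigrationResult("app-b", True, True, False),
    MigrationResult("app-c", True, False, False),
    MigrationResult("app-d", False, False, False),
]

rates = stage_rates(SAMPLE)
print(rates)
```

Because each stage is cumulative, the rates can only stay level or fall from compile to deploy to behavior, which is why a compile-only metric gives an upper bound rather than an estimate of migration quality.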

Migration difficulty also depended strongly on the target framework. In our evaluations, migrating to Jakarta EE was particularly challenging for the agents, which points to a specific area for future work.

Key Learnings: What ScarfBench Revealed About AI Agents

Beyond success rates, ScarfBench showed how AI agents behave during the modernization process.

Are Agents Overconfident in Their Migrations?

A crucial question is whether an agent can reliably determine when a migration is truly complete and functional. We compared agent-reported outcomes against independent build verification.

Our finding: Agents are often overconfident in their self-assessments. For example, Claude Code reported successful builds for 29 of 30 whole applications, but independent verification confirmed only 22 of those builds, so 7 reported successes did not actually build. Conversely, one application classified as a failure by the agent built correctly. Agent self-assessment is therefore not a reliable indicator of migration completion; independent build and test validation remain essential.
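A minimal sketch of this comparison follows, reconstructing a per-application table from the counts quoted above. The application identifiers are placeholders, and the sketch assumes the single application not reported as a success is the one the agent classified as a failure. The function name confusion is likewise an illustrative choice.

```python
from collections import Counter


def confusion(reported, verified):
    """Count (reported, verified) outcome pairs across applications."""
    if reported.keys() != verified.keys():
        raise ValueError("reported and verified must cover the same applications")
    return Counter((reported[app], verified[app]) for app in reported)


# Placeholder ids; only the counts (30 apps, 29 reported successes,
# 22 of those verified, 1 reported failure that built) come from the text.
apps = [f"app-{i:02d}" for i in range(30)]
reported = {app: i < 29 for i, app in enumerate(apps)}
verified = {app: (i < 22 or i == 29) for i, app in enumerate(apps)}

counts = confusion(reported, verified)
reported_ok = counts[(True, True)] + counts[(True, False)]

# Share of agent-reported successes that independent verification confirmed.
precision = counts[(True, True)] / reported_ok
# Share of applications where the agent's report matched verification.
agreement = (counts[(True, True)] + counts[(False, False)]) / len(apps)

print(f"confirmed self-reports: {precision:.3f}")
print(f"report/verification agreement: {agreement:.3f}")
```

Under these assumptions, roughly three in four self-reported successes were confirmed, and the agent's report disagreed with verification on 8 of the 30 applications.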

How Do Agents Navigate Application Dependencies?

Framework migrations rarely impact just a single file or a single layer of an application. Changes in configuration, services, databases, and web components frequently cascade throughout the entire application stack.

Our finding: Migration is often iterative rather than linear. The layers agents revisited most frequently were configuration, persistence, and web components. This pattern suggests that successful migration resembles iterative dependency resolution more than a one-pass source-to-source transformation.

Where Do Agents Focus Their Effort?

Analyzing how often agents revisited each application layer shows where they spent the most effort. Layers that required repeated visits often involved debugging, dependency resolution, or detailed framework adaptation.

Our finding: Configuration tasks dominate migration effort. Rather than proceeding linearly, agents repeatedly returned to configuration-related artifacts in order to resolve subtle framework differences and persistent dependency issues.
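One way to measure revisits of this kind is to map each file an agent touches to a layer and count how often the agent returns to a layer it had already left. The sketch below is an assumption-laden illustration: the path-to-layer rules, the definition of a revisit, and the trace are all hypothetical and are not taken from ScarfBench's analysis.

```python
from collections import Counter

# Hypothetical set of file names treated as configuration artifacts.
CONFIG_FILES = {
    "pom.xml", "build.gradle", "application.properties",
    "persistence.xml", "beans.xml", "web.xml",
}


def layer_of(path):
    """Map a file path to an application layer using simple, assumed rules."""
    name = path.rsplit("/", 1)[-1]
    if name in CONFIG_FILES:
        return "configuration"
    if "/repository/" in path or "/entity/" in path:
        return "persistence"
    if "/web/" in path or "/controller/" in path or "/rest/" in path:
        return "web"
    if "/service/" in path:
        return "service"
    return "other"


def revisit_counts(trace):
    """Count returns to a layer after the agent has moved away from it."""
    revisits = Counter()
    seen = set()
    previous = None
    for path in trace:
        layer = layer_of(path)
        if layer != previous:
            if layer in seen:
                revisits[layer] += 1
            seen.add(layer)
        previous = layer
    return revisits


# Made-up edit trace for illustration only.
TRACE = [
    "pom.xml",
    "src/main/java/app/entity/Order.java",
    "src/main/resources/META-INF/persistence.xml",
    "src/main/java/app/repository/OrderRepository.java",
    "src/main/java/app/web/OrderResource.java",
    "pom.xml",
    "src/main/resources/application.properties",
    "src/main/java/app/web/OrderResource.java",
]

revisits = revisit_counts(TRACE)
print(revisits.most_common())
```

Consecutive edits within the same layer are treated as one visit, so the count reflects how often the agent had to come back rather than how many files it changed.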

Beyond Code Transformation: The Role of Environment and Tooling

Not every migration problem originates in the source code. Agents frequently struggled with environmental factors.

Our finding: Environment and tooling matter. Agents often ran into classpath conflicts, missing dependencies, and JVM version mismatches. These environmental issues frequently delayed validation of a migration even when the source code transformation was largely complete. Modernization failures therefore span build systems, deployment environments, dependency injection, databases, endpoints, and general infrastructure.
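As one illustration of an environment-level check, the sketch below flags artifacts that appear with more than one version in a resolved dependency list, a common source of classpath conflicts. The input format (group:artifact:version strings), the function name version_conflicts, and the sample list are assumptions for illustration; a real build tool's dependency report would need its own parsing.

```python
from collections import defaultdict


def version_conflicts(coordinates):
    """Return group:artifact keys that appear with more than one version."""
    versions = defaultdict(set)
    for coord in coordinates:
        parts = coord.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected group:artifact:version, got {coord!r}")
        group, artifact, version = parts
        versions[f"{group}:{artifact}"].add(version)
    # Versions are sorted as plain strings, which is enough for reporting.
    return {key: sorted(found) for key, found in versions.items() if len(found) > 1}


# Illustrative input only; not taken from any benchmark application.
RESOLVED = [
    "jakarta.persistence:jakarta.persistence-api:3.0.0",
    "jakarta.persistence:jakarta.persistence-api:3.1.0",
    "org.slf4j:slf4j-api:2.0.9",
    "com.example:orders-core:1.4.2",
]

conflicts = version_conflicts(RESOLVED)
print(conflicts)
```

A check like this runs before any build or deployment attempt, so it can separate environment problems from genuine translation errors early in validation.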

The Path Forward for AI-Assisted Modernization

The main takeaway from our ScarfBench evaluations is that the biggest challenge in framework modernization isn’t translating Java code. It’s managing the dependencies that run through configuration, infrastructure, and runtime environments.

AI agents can automate substantial portions of the migration process, but reliable validation and architectural reasoning remain critical to a successful outcome. ScarfBench is designed to expose these challenges and to provide a standardized, transparent way to measure progress toward autonomous application modernization.

We invite researchers and practitioners to explore ScarfBench as an open resource. Researchers can use it to compare agent architectures and techniques, while practitioners can use it to evaluate modernization solutions before committing them to production environments.

Framework migration remains one of the largest unsolved problems in AI-assisted software engineering. We hope ScarfBench helps the community measure progress more accurately and accelerates the development of the next generation of modernization tools. Framework communities are also welcome to evaluate their agents and contribute new migration scenarios.

Source: Hugging Face Blog

Kristine Vior
