AI Offensive Security, Offensive Security, Vulnerability Research

Which Came First: The System Prompt, or the RCE?

During a recent penetration test, we came across an AI-powered desktop application that acted as a bridge between Claude (Opus 4.5) and a third-party asset management platform. The idea is simple: instead of clicking through dashboards and making API calls, users just ask the agent to do it for them. “How many open tickets do we have?” “Update this record.” That kind of thing.

The agent ran inside a sandboxed environment, and the client was confident in their controls. Rigid system prompts (even prepended to each message), deterministic hooks in place to prevent accidental disclosure, and so on. To their credit, those controls held up; we just found another way to do what we wanted.

Automating the Recon

Manual LLM testing is a drag. You’re sitting there typing prompts one at a time, waiting for responses, trying to keep track of what worked and what didn’t. It’s tedious, and it doesn’t scale.

Our go-to approach is to get another LLM to do the dirty work. For this engagement, the target was accessible via an Electron desktop application, meaning you could launch it in debug mode and access the app’s DOM tree directly. We wrote a Python script that could interact with the target directly, gave it to Claude (alongside our Augustus LLM testing methodology), and let it run.

This essentially meant we had Claude talking to another version of itself. Back and forth, hundreds of times, working through the Augustus attack paths automatically:

This kind of automated LLM-on-LLM testing saved us an immense amount of effort on this engagement. It’s exactly why we’ve been building tools like Julius (for fingerprinting AI services) and Augustus, which we’ve recently added to our Guard platform. If the attack surface keeps growing, the testing efficiency has to keep up.

Discovering Weaknesses

After a couple hours of this, patterns started to emerge. The agent had strong restrictions on most dangerous operations; ask it to run a bash command or write a shell script and it would refuse.

However, it really liked “Hello World” programs. It was more than happy to create and run a simple test script. This is worth noting for similar-style engagements. LLMs are trained to be helpful, and “Hello World” scripts are some of the most common within their training data. That makes this a reliable foot-in-the-door when testing agents with code execution.

More importantly, while it wouldn’t create anything it considered dangerous, it was perfectly willing to modify existing files. Change a file extension or make something executable, all fine.

Using the application’s file upload feature you could upload text, images, or csv files, and they’d land in the sandbox. Combined with the agent’s willingness to rename and chmod, this was effectively arbitrary file upload.

Hitting a Wall

Getting code execution should have been straightforward from there – just upload a script and have the agent execute it – but it wasn’t. Opus 4.5 is genuinely good at catching malicious payloads, and it blocked everything we threw at it:

Obfuscated/encoded scripts would get decoded and blocked (or rejected outright citing an inability to verify safety).
A wrapper script that runs everything in a directory failed, as it understood the context and inspected each script individually first.
Swapping a benign script for a malicious one after building trust didn’t work because it re-read the file before every execution (and the architecture prevented races).

The model had a clear policy: if it couldn’t understand what something did, it wouldn’t run it. And if it could understand it and it looked dangerous, it wouldn’t run it either. We needed something that Opus 4.5 thought it understood and believed was harmless.

The Trick

The answer came from the “Hello World” observation from earlier. The agent trusted simple programs, and it trusted its own ability to read C code and verify what a binary does. But what if the source code doesn’t actually match the binary?

We created two separate programs. The first was legitimate C source code for a hello world program that also calls system(“date”), giving a plausible reason for a system() call to exist in the binary.

The second was the actual payload binary. A statically compiled ELF that downloads and executes a shell script from our server, while also running the same commands as the decoy. We statically compiled it so the binary would be large enough (~600KB) that Claude couldn’t easily analyze it. We also performed a basic XOR against the payload to make it harder to decode.

Both files were uploaded as . txt files through the file upload feature. Then we asked the agent to move them to the working directory, strip the file extension, and make the binary executable, which it did without complaint.

When we asked it to run the binary, it checked the C source code, saw a harmless program with a date command, and executed the binary. The binary pulled our script and ran it, giving us the shell on the sandbox.

The sandbox contained the application’s entire business logic. System prompts, MCP tool definitions, the Dockerfile used to build the environment, and the Node.js server running all the integrations. In short: the crown jewels.

Takeaways

If you’re building or deploying AI agents with code execution capabilities, don’t rely on the model’s judgment as a security control. It can be fooled. Here’s what we recommended to the client:

Restrict executable operations at the OS level: The sandbox should prevent the agent from setting the execute bit on user-uploaded files or running binaries that weren’t pre-installed in the image.
Validate uploaded files by content, not just extension: Check magic bytes. An ELF binary with a .txt extension should be rejected. Content-type validation should persist through file rename operations.
Add tool-level guardrails for dangerous operations: The first was legitimate C source code for a hello world program. It also calls system(“date”), giving a plausible reason for a system() call to exist in the binary.
Lock down network egress: The sandbox didn’t need total outbound internet connectivity, as it only really interfaced with a couple external APIs. Outbound firewalling, while not a primary control, can greatly hinder post-exploitation activity.

Findings and recommendations like these are part of what we do continuously through our Guard platform. Guard continuously monitors and tests internet-facing and internally-deployed infrastructure, including AI agents and LLM-powered applications, as they evolve and change over time. Got a new agentic deployment, model version, or MCP tool? Guard picks it up and our operators test it, identifying vulnerabilities just like this.

FAQ

Why did static compilation matter?

A dynamically linked binary is small enough that the LLM could potentially inspect the raw bytes and identify suspicious library calls like network functions. A statically compiled binary is hundreds of kilobytes of noise, making it effectively opaque to the model.

What is Augustus?

Augustus is Praetorian’s Go-based LLM vulnerability scanner. It tests large language models against a wide range of adversarial attacks and jailbreaks. In this case, we used its methodology to automate the discovery of behavioral weaknesses in the target agent.

Did this affect real customer data?

This was conducted during an authorized penetration test. The sandbox environment was isolated, and we did not access or exfiltrate any customer data. The client has since remediated the vulnerability.

Could this work against other LLM agents?

The core technique exploits a general weakness: LLMs are trained to assist, and “Hello World” scripts are among the most common things they’re asked to produce. Any agent with code execution that treats “simple test script” as a safe category is potentially giving attackers a method of initial access. On top of that, the faked source code trick exploits another general weakness: LLMs will often trust contextual information over direct inspection when the direct inspection is too difficult.

About the Authors

Khael Kugler

Khael is a Lead Offensive Security Engineer at Praetorian specializing in medical device, LLM application, and network assessments. He holds an OSCP+ and a Computer Science degree from UT Austin, has contributed to Praetorian's open-source AI security tooling, and placed first in two DEF CON Red Team Village competitions.

Catch the Latest

Catch our latest exploits, news, articles, and events.

Offensive Security, Vulnerability Research

June 9, 2026

Centurion: Bring Your Own Execution Environment

AI Offensive Security, Labs, Offensive Security

June 4, 2026

Enter the WasmForge: Compiling Sliver into WebAssembly

IoT Security, Offensive Security, Uncategorized, Vulnerability Research

May 28, 2026

When Encryption Isn’t Really Encryption

Ready to Discuss Your Next Continuous Threat Exposure Management Initiative?

Praetorian’s Offense Security Experts are Ready to Answer Your Questions

Praetorian Guard Platform

Penetration Testing Services

Advanced Offensive Security

Continuous Offensive Security

Customer Case Studies

Resources

Use Cases

About Praetorian

Join Praetorian