Förderjahr 2025 / Stipendium Call #20 / ProjektID: 7733 / Projekt: LLM Agents for Offensive Security
Can an LLM hack an enterprise network? In this second post, we trace the rapid evolution of AI-driven penetration testing --- from early single-machine prototypes to autonomous agents tackling Active Directory environments --- and ask what consistency and reliability challenges still stand in the way.
Using AI for Penetration Testing: From Single Machines to Enterprise Networks
In the previous blog post, I outlined why we're researching LLM agents for offensive security: there simply aren't enough pentesters to go around, and AI could help close that gap. Now let's look at how --- specifically, how researchers have progressed from hacking a single Linux box to compromising entire enterprise networks.
Step 1: Can an LLM Hack a Single Machine?
The first question was deceptively simple: can you put an LLM in a loop, let it run shell commands, and have it gain root on a vulnerable machine?
The answer, as of 2023, turned out to be "yes, sometimes." Our initial prototype wintermute connected GPT-3.5 to a Linux VM via SSH and let it explore freely. The results were promising but chaotic --- the LLM could occasionally escalate privileges, but it also loved going down rabbit holes, fixating on one approach long after most human pentesters would have moved on.
This evolved into hackingBuddyGPT, an open-source framework where we formalized the approach: the LLM observes, suggests a command, the framework executes it and feeds back the result, over and over again, until the agent has become the all-powerful root user. With GPT-4-Turbo and some architectural improvements --- especially LLM-driven state reflection, where the same LLM periodically summarizes the current attack state to keep itself on track --- we were able to almost match the success rate of a professional penetration tester on our Linux privilege escalation benchmark, while comparing favourably on cost.
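To make that loop concrete, here is a minimal sketch of the observe-suggest-execute cycle. It is illustrative pseudocode rather than hackingBuddyGPT's actual implementation: query_llm and run_over_ssh are hypothetical placeholders for a model API call and an SSH session to the target VM.

```python
# Minimal sketch of an observe-suggest-execute agent loop (illustrative,
# not hackingBuddyGPT's real code). The two functions below are stubs.

MAX_STEPS = 30

def query_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to the model."""
    raise NotImplementedError

def run_over_ssh(command: str) -> str:
    """Placeholder: run `command` on the target VM and return its output."""
    raise NotImplementedError

history: list[tuple[str, str]] = []   # (command, output) pairs seen so far
state_summary = ""                    # LLM-written summary of the attack state

for step in range(MAX_STEPS):
    # Observe: the prompt carries the goal, the state summary, recent history.
    prompt = (f"Goal: become root.\nState: {state_summary}\n"
              f"Recent steps: {history[-10:]}\nNext shell command?")
    command = query_llm(prompt)                # LLM suggests the next command
    output = run_over_ssh(command)             # framework executes it...
    history.append((command, output))          # ...and feeds the result back

    if step % 5 == 0:                          # periodic LLM-driven reflection
        state_summary = query_llm(f"Summarize the attack so far: {history}")

    if "uid=0(root)" in run_over_ssh("id"):    # stop once we are root
        break
```

The state-reflection step is what keeps the agent out of rabbit holes: a condensed summary, rather than an ever-growing raw transcript, is what the model reasons over.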
Around the same time, PentestGPT took a different approach: instead of full autonomy, it kept a human in the loop, guiding the tester through a structured "Pentesting Task Tree." It scored 228% better than a raw GPT-3.5 baseline and placed 24th out of 248 teams in a CTF competition.
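The "Pentesting Task Tree" is essentially a living to-do tree that the model maintains while the human drives the tools. A rough sketch of such a structure (illustrative only, not PentestGPT's actual data model) could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One node in a PentestGPT-style task tree (illustrative structure)."""
    description: str
    status: str = "todo"               # todo | in-progress | done | dead-end
    children: list["Task"] = field(default_factory=list)

root = Task("Compromise target 10.0.0.5", children=[
    Task("Port scan", status="done"),
    Task("Enumerate web app on :80", status="in-progress", children=[
        Task("Check for SQL injection"),
        Task("Brute-force login form"),
    ]),
])
# The LLM updates this tree after each result and recommends the next
# leaf to the human tester, who runs the tools and reports back.
```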
The shared takeaway from both projects? Architecture matters as much as model capability. Just prompting an LLM and hoping for the best doesn't work: you need structured reasoning, state tracking, and context management.
Step 2: Can It Hack an Enterprise Network?
Single machines are one thing. Real enterprise networks --- with Active Directory, multiple domains, lateral movement, credential chaining --- are a different beast entirely. This is the terrain of typical small-to-large business networks.
Cochise was our attempt at this jump. It uses a Planner/Executor architecture against GOAD, a realistic test environment that resembles a typical SME network with Windows domain controllers, Microsoft Defender anti-virus/EDR, and simulated users. The most striking finding: reasoning models (like OpenAI's o1) dramatically outperformed standard models. Network pentesting demands strategic thinking, not just command generation, and reasoning models seem to be capable of exactly that.
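In a Planner/Executor design, a reasoning model decides what to do next at a strategic level, while an executor turns each step into concrete actions. A highly simplified sketch, with plan_llm, exec_llm, and run as hypothetical placeholders (this is not Cochise's actual code):

```python
# Sketch of a Planner/Executor split; all three functions are stubs.

def plan_llm(prompt: str) -> list[str]:
    """Placeholder: reasoning model returns an ordered list of attack steps."""
    raise NotImplementedError

def exec_llm(prompt: str) -> str:
    """Placeholder: executor model returns a concrete command for one step."""
    raise NotImplementedError

def run(command: str) -> str:
    """Placeholder: execute a command in the test environment."""
    raise NotImplementedError

findings: list[str] = []
plan = plan_llm("Given an AD network like GOAD, outline an attack plan.")
for step in plan:
    command = exec_llm(f"Step: {step}\nKnown so far: {findings}")
    findings.append(run(command))
    # In a real system, the planner would be re-invoked here to revise the
    # plan as new credentials, hosts, or trust relationships turn up.
```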
Incalmo (Singer et al., Carnegie Mellon/Anthropic) arrived at a complementary insight: LLMs fail at multi-host attacks not because they lack knowledge, but because they lack the right abstractions. By providing a high-level interface (like offering the model a concrete "scan the network" action instead of having it craft raw shell commands itself), even smaller models succeeded where larger models without abstraction failed entirely.
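To illustrate the idea, here is what such an abstraction layer might look like, loosely in the spirit of Incalmo but not its actual API: the model chooses from a small menu of semantic actions, and the framework handles the messy tool details.

```python
# Illustrative high-level action layer (hypothetical, not Incalmo's API).

ACTIONS = {
    "scan_network":     "Discover live hosts and open ports in a subnet",
    "lateral_move":     "Use a captured credential to reach another host",
    "dump_credentials": "Extract credentials from a compromised host",
    "exfiltrate_file":  "Copy a target file back to the attacker machine",
}

def scan_network(subnet: str) -> list[str]:
    """Placeholder: wraps a scanner (e.g. nmap) and returns live hosts."""
    raise NotImplementedError

# The agent emits {"action": "scan_network", "subnet": "10.0.0.0/24"};
# the framework translates that into the actual tool invocation, so the
# model never has to get nmap's flags or output parsing right itself.
```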
That said, it's worth noting a counter-trend: in software engineering, the momentum is shifting away from specialized tool integrations and toward generic tool calling --- models that just run commands on the shell (like OpenClaw) rather than relying on many purpose-built wrappers. Will the same happen in pentesting? If future models become capable enough to drive Nmap, Metasploit, and BloodHound directly through generic function calls, the carefully crafted abstraction layers of today might become unnecessary scaffolding. It's an open question whether pentesting will follow that trajectory or whether the domain is complex enough to keep specialized orchestration relevant.
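The generic-tool-calling end of that spectrum is almost comically simple: a single catch-all shell tool, written below as an OpenAI-style function schema for illustration.

```python
# One generic tool instead of many wrappers (OpenAI-style tool schema,
# used here purely as a common convention for illustration).

SHELL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run an arbitrary shell command and return its output",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string"},
            },
            "required": ["command"],
        },
    },
}
# With a capable enough model, this one tool subsumes the specialized
# wrappers above: the model itself drives nmap, Metasploit, BloodHound.
```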
The Hard Problems That Remain
This all sounds impressive, but several fundamental challenges remain open.
Consistency and reliability. We now know that LLMs can solve security problems, but can they do so reliably? A system that pulls off a perfect hacking run one time out of ten isn't much use to a professional pentester who needs dependable results. The non-deterministic nature of LLMs means that the same agent, on the same target, can succeed brilliantly or fail completely depending on the run. Improving consistency --- not just peak performance --- is in my opinion one of the most important open challenges.
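Measuring this means looking beyond single runs. A sketch of the kind of evaluation I have in mind, with run_agent as a placeholder for one full agent episode:

```python
# Consistency, not peak performance: repeat the same run N times and
# report the success rate. run_agent is a stub for a full agent episode.

def run_agent(target: str) -> bool:
    """Placeholder: one complete agent run; True if the objective was met."""
    raise NotImplementedError

def success_rate(target: str, n_runs: int = 10) -> float:
    successes = sum(run_agent(target) for _ in range(n_runs))
    return successes / n_runs

# "8 out of 10 runs succeeded" says far more about practical usefulness
# than a single lucky end-to-end compromise ever could.
```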
And perhaps most important: what's the goal? LLMs are already great at reconnaissance, report drafting, and iterating on payloads at speed --- and we're now seeing the first glimpses of them being capable of high-level strategic decision-making, planning multi-step attack paths across complex networks. That's exciting, but it also makes questions of ethics more pressing. We want to help and augment human professionals, not replace them. But if full automation does become feasible, what would that mean for the security workforce? And more fundamentally: what would it do to the balance between attackers and defenders?