Classical security stacks share one trait: IDS/IPS, SIEM, NDR, firewalls, and SOAR playbooks largely act on predefined rules and signatures. The problem is that attackers change their tactics faster than teams can update those rules.

Reinforcement learning (RL) addresses this directly. An RL agent does not learn individual if-then rules. It learns, through trial and error, which defensive strategies minimise damage over time. It observes network state, tries actions — quarantine, segmentation, traffic rerouting — and is rewarded or penalised based on outcome. What has worked in games and robotics is now entering security environments.
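To make that loop concrete, here is a minimal sketch of tabular Q-learning on a two-state toy problem. Everything in it is an assumption made for illustration: the states (clean, compromised), the two actions, the 30% attack probability, and the reward numbers are invented, and a real agent would observe far richer network state.

```python
import random

ACTIONS = ["monitor", "quarantine"]
CLEAN, COMPROMISED = 0, 1


def step(state, action, rng):
    """Hypothetical toy dynamics. Returns (next_state, reward)."""
    if state == COMPROMISED:
        if action == "quarantine":
            return CLEAN, -1.0        # containment costs some availability
        return COMPROMISED, -5.0      # damage accrues while the agent only watches
    # Host is clean: quarantining it is a needless disruption.
    reward = -1.0 if action == "quarantine" else 0.0
    next_state = COMPROMISED if rng.random() < 0.3 else CLEAN
    return next_state, reward


def train(steps=20000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Epsilon-greedy tabular Q-learning over the toy dynamics above."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (CLEAN, COMPROMISED) for a in ACTIONS}
    state = CLEAN
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)                       # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        next_state, reward = step(state, action, rng)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Standard Q-learning update towards the bootstrapped target.
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q


def greedy_policy(q):
    """Pick the highest-valued action in each state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (CLEAN, COMPROMISED)}
```

Calling greedy_policy(train()) returns one action per state. With these toy numbers the agent should come to prefer monitoring a clean host and quarantining a compromised one, without either rule ever being written down explicitly.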

Autonomous Network Agents (ANAs) are not a purely defensive tool. They are dual-use by design. This article looks at both sides.

What a Cyber Range Actually Is

A cyber range is an isolated, realistic IT/OT environment where attacks, defences, and forensics can be practised and evaluated automatically. Think of it as a flight simulator for blue and red teams — virtualised or containerised infrastructure with orchestrated scenarios, metrics, and open interfaces.

The shift happening now is that these environments are no longer built only for human teams. They are being built explicitly as training grounds for RL-based defence agents. Two projects stand out.

CSLE: A Cyber Range Built for RL

CSLE (Cyber Security Learning Environment) was developed at KTH Royal Institute of Technology, principally by Kim Hammar, and published from 2021 onwards. It is explicitly framed as a cyber range for reinforcement learning agents, built so that RL agents for security use cases can be trained and evaluated systematically.

CSLE operates across three layers:

Emulation. Container-based infrastructure with hosts, services, user traffic, and attack scenarios producing realistic logs, flows, and events.

Simulation. Mathematical models — Markov decision processes and Markov games — where attacker and defender strategies compete. These include intrusion-response and intrusion-tolerance game formulations.

Management. Web UI, CLI, and API for starting and stopping scenarios, collecting data, and connecting RL libraries via OpenAI Gym / Gymnasium interfaces.

In practice this means security teams or researchers can train RL agents on precisely scoped problem slices — for example, when an aggressive response (isolate a host) is justified versus when it is better to collect more telemetry and avoid false positives. CSLE ships with pre-built games and metrics that substantially lower the entry barrier for RL experimentation.
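The "isolate now or collect more telemetry" slice can be sketched as a small environment that follows the Gymnasium calling convention: reset returns (observation, info) and step returns (observation, reward, terminated, truncated, info). This is not CSLE code. The class name, the alert model, and the reward values are hypothetical, and the sketch deliberately uses only the standard library rather than subclassing a real Gymnasium environment.

```python
import random


class IsolateOrWaitEnv:
    """Toy intrusion-response problem with a Gymnasium-style interface."""

    WAIT, ISOLATE = 0, 1

    def __init__(self, p_intrusion=0.1, max_steps=50):
        self.p_intrusion = p_intrusion   # chance per step that an intrusion starts
        self.max_steps = max_steps
        self._rng = random.Random()
        self._intruded = False
        self._t = 0

    def _observe(self):
        # Observation: how many of 10 hypothetical sensors fired this step.
        # Sensors fire more often once the host is actually compromised.
        p_alert = 0.6 if self._intruded else 0.1
        return sum(self._rng.random() < p_alert for _ in range(10))

    def reset(self, seed=None):
        if seed is not None:
            self._rng = random.Random(seed)
        self._intruded = False
        self._t = 0
        return self._observe(), {}

    def step(self, action):
        if action == self.ISOLATE:
            # Isolating ends the episode: good if intruded, a false positive if not.
            reward = 10.0 if self._intruded else -10.0
            return self._observe(), reward, True, False, {"intruded": self._intruded}
        # Waiting is free on a clean host and costly on a compromised one.
        reward = -1.0 if self._intruded else 0.0
        if not self._intruded and self._rng.random() < self.p_intrusion:
            self._intruded = True
        self._t += 1
        truncated = self._t >= self.max_steps
        return self._observe(), reward, False, truncated, {"intruded": self._intruded}


def run_episode(env, threshold=4, seed=None):
    """Baseline policy: isolate as soon as the alert count reaches the threshold."""
    obs, _ = env.reset(seed=seed)
    total, steps = 0.0, 0
    while True:
        action = env.ISOLATE if obs >= threshold else env.WAIT
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        steps += 1
        if terminated or truncated:
            return total, steps
```

The threshold policy in run_episode is only a hand-written baseline. The point of the interface is that an RL library can be plugged in at the same place and learn where that threshold should sit. The "intruded" flag in the info dictionary is there for evaluation and should not be fed to the agent.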

CybORG: The Benchmark Standard

CybORG (Cyber Operations Research Gym) was developed by the Australian Defence Science and Technology Group and released through the CAGE Challenges (Cyber Autonomy Gym for Experimentation). The goal from the start was an open research environment where both human and autonomous blue/red agents could be trained and compared in standardised scenarios.

At the centre is an abstracted enterprise network with hosts, services, vulnerabilities, and credentials. Scenarios run where red agents progressively compromise the network while blue agents attempt to detect and contain the campaign.

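What makes CybORG technically significant is its set of wrappers, which translate the environment's raw observations and actions into the formats that common RL libraries expect.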

This makes CybORG a benchmark set for RL-based network defence. Anyone testing a new algorithm or policy architecture gets reproducible scenarios, defined rewards, and genuine comparability — something that has been conspicuously absent in security research.

From Lab to Production: The Gap That Remains

CSLE and CybORG share a common characteristic: they work with deliberately abstracted network models. There is no direct connection to eBPF probes in the Linux kernel, P4 switch pipelines, or production SDN controllers in an OT or enterprise network. This is sensible — a research system should not experiment directly on a hospital’s operational infrastructure.

Transferring to real infrastructure requires an architecture that combines the learning capability from cyber ranges with the hard reality of production logs, flows, policies, and SLAs.

The natural next step is to treat defence not in isolation but as part of an autonomous adversarial network. On one side, RL-based defence agents. On the other, automated attacker and stress-test components continuously generating new tactics and traffic patterns. Instead of running through static test suites, you get a permanent arms race in the lab — the defender adapts its policy while attacker and generator models try to evade the sensors. This coupling of RL control, generative attacker, and realistic telemetry is the core of the blueprint below.

Blueprint: An RL Network Agent in Practice

Layer 1 — Observation: Telemetry as raw material. eBPF/XDP delivers process and flow visibility at the kernel level. P4 data planes provide per-flow counters and header anomalies. SDN controllers contribute topology and zone context. From this a consistent state space — vector or graph — is assembled for the agent to observe at each step, including costs for latency, packet loss, and policy violations.
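As a rough sketch of the vector variant, the snippet below flattens per-host telemetry and network-wide cost signals into one fixed-order state vector. All field names, the zone list, and the log scaling are assumptions for illustration. In a real deployment the values would come from eBPF, P4, and SDN collectors, which are not shown here.

```python
import math
from dataclasses import dataclass

# Hypothetical zone labels as they might be reported by an SDN controller.
ZONES = ("office", "dmz", "ot")


@dataclass
class HostTelemetry:
    new_processes: int      # e.g. derived from kernel-level process tracing
    flows_per_sec: float    # e.g. derived from per-flow counters
    header_anomalies: int   # e.g. malformed or unexpected headers seen
    zone: str               # topology context


@dataclass
class NetworkCosts:
    latency_ms: float
    packet_loss: float
    policy_violations: int


def encode_host(host):
    """Turn one host's telemetry into 3 scaled counters plus a one-hot zone."""
    counters = [
        math.log1p(host.new_processes),     # log scaling tames bursty counters
        math.log1p(host.flows_per_sec),
        math.log1p(host.header_anomalies),
    ]
    zone_one_hot = [1.0 if host.zone == z else 0.0 for z in ZONES]
    return counters + zone_one_hot


def build_state(hosts, costs):
    """Concatenate host encodings in sorted host-id order, then append costs."""
    state = []
    for host_id in sorted(hosts):           # stable ordering between steps
        state.extend(encode_host(hosts[host_id]))
    state.extend([costs.latency_ms, costs.packet_loss, float(costs.policy_violations)])
    return state
```

Sorting by host id keeps each position in the vector tied to the same host from step to step. A flat vector like this only works while the host set is fixed, which is one reason a graph representation can be the better fit for networks that change shape.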

Layer 2 — Decision and policy: the same stack for blue and red. A defence agent balances damage limitation, availability, and intervention cost. It can throttle or reroute flows, tighten microsegmentation, or activate quarantine zones. A GenAI layer translates decisions into P4, SDN, or firewall rules and produces legible change proposals. With a different reward function the same stack drives an attacker agent — reconnaissance, lateral movement, evasion. Blue or red is a question of objective, not of code.
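The claim that blue and red differ only in objective can be illustrated with two reward functions scoring the same step outcome. This is a sketch under stated assumptions: the outcome fields and all weights are hypothetical, and real reward design would need far more care than a few linear terms.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    compromised_hosts: int    # hosts under attacker control after this step
    services_down: int        # services made unavailable by attack or defence
    interventions: int        # defensive actions taken this step
    attacker_detected: bool   # whether the defender noticed the attacker


def blue_reward(o, w_damage=5.0, w_avail=2.0, w_cost=0.5):
    """Defender: penalise damage, lost availability, and intervention cost."""
    return -(w_damage * o.compromised_hosts
             + w_avail * o.services_down
             + w_cost * o.interventions)


def red_reward(o, w_gain=5.0, w_detect=4.0):
    """Attacker: reward footholds, penalise being detected."""
    return w_gain * o.compromised_hosts - (w_detect if o.attacker_detected else 0.0)


# The rest of the training stack would stay identical; only this lookup differs.
REWARDS = {"blue": blue_reward, "red": red_reward}


def score(role, outcome):
    return REWARDS[role](outcome)
```

For an outcome with two compromised hosts, one service down, three interventions, and a detected attacker, score("blue", ...) gives -13.5 and score("red", ...) gives 6.0 with the default weights. Swapping the key is the only change needed to point the same learner at the opposite goal.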

Layer 3 — Autonomous adversarial network and governance. In the lab, RL defenders, RL attackers, and generative models run simultaneously, sharpening their policies in CSLE- and CybORG-style scenarios against an eBPF/P4/SDN test network. Permanent adversarial training replaces the one-time pen test.

A policy governor limits blast radius. Initially the agent runs in shadow mode and only makes proposals. Later it is allowed tightly scoped automations. Every decision remains loggable, auditable, and overridable, not least because the same type of agent can be deployed offensively outside regulated environments.
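One way such a governor could look is sketched below. The mode names, the action allowlist, the target limit, and the log fields are all hypothetical. The sketch only decides and records; actually pushing a rule to a switch or firewall is left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"   # proposals only, nothing is applied
    SCOPED = "scoped"   # narrow automations allowed


@dataclass
class Proposal:
    action: str
    targets: list


@dataclass
class PolicyGovernor:
    mode: Mode = Mode.SHADOW
    allowed_actions: frozenset = frozenset({"throttle_flow"})
    max_targets: int = 1                      # crude blast-radius limit
    audit_log: list = field(default_factory=list)

    def review(self, proposal):
        """Decide what happens to an agent proposal and record the decision."""
        if self.mode is Mode.SHADOW:
            decision = "proposed_only"
        elif proposal.action not in self.allowed_actions:
            decision = "escalated_action_not_allowed"
        elif len(proposal.targets) > self.max_targets:
            decision = "escalated_blast_radius"
        else:
            decision = "approved_for_execution"
        self.audit_log.append({
            "action": proposal.action,
            "targets": list(proposal.targets),
            "mode": self.mode.value,
            "decision": decision,
            "overridden_by": None,
        })
        return decision

    def override(self, index, operator):
        """Let a human operator mark a logged decision as overridden."""
        self.audit_log[index]["overridden_by"] = operator
```

In shadow mode every proposal comes back as "proposed_only", so the agent can be observed against live telemetry without touching anything. Switching to scoped mode changes what is approved, but every decision still lands in the same audit log and can still be overridden.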

The Dual-Use Reality

When you put CSLE, CybORG, and an RL network agent blueprint together, you are not looking at just another useful assistant feature. You are looking at a possible next escalation stage: networks where RL-based defence and attack agents permanently face each other.

Near term, the most likely entry point is still relatively contained — a blue-team copilot running in a digital twin of the corporate network. RL agents see the same telemetry as the live network and rapidly simulate which measures would have which effect: what happens if this path is throttled, which zones can be separated without killing a critical application? From these simulations come concrete policy snippets and change proposals — not as an oracle but as a tactics generator that works through more variants than any human team could.
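The simplest form of such a what-if question is a reachability check on the twin's topology: remove the links a proposed segmentation would cut, then see whether every critical dependency still has a path. The sketch below assumes a deliberately tiny model in which zones are nodes and links are undirected pairs. The zone names and dependencies are hypothetical.

```python
from collections import deque


def reachable(links, start, goal):
    """Breadth-first search over undirected links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def evaluate_cut(links, cut, required_pairs):
    """Simulate removing the links in `cut` and list dependencies that break."""
    remaining = [
        (a, b) for a, b in links
        if (a, b) not in cut and (b, a) not in cut
    ]
    broken = [(a, b) for a, b in required_pairs if not reachable(remaining, a, b)]
    return {"safe": not broken, "broken_dependencies": broken}


# Hypothetical twin: five zones and the paths a critical application needs.
TWIN_LINKS = [
    ("office", "core"), ("core", "app"), ("app", "db"),
    ("core", "db"), ("office", "ot"),
]
CRITICAL = [("app", "db"), ("office", "app")]
```

Here evaluate_cut(TWIN_LINKS, {("core", "db")}, CRITICAL) reports the cut as safe, because the application still reaches its database directly, while cutting ("office", "core") is flagged because office users lose their path to the application. An agent working on the twin would run many such checks per second and attach latency and load estimates on top.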

The discomfort arrives when the same architecture is mirrored. On the other side, RL-based attacker agents use exactly the same mechanisms — to find blind spots, optimise lateral movement, or work around deception zones.

Technologically this has not been science fiction for some time. Programmable data planes, eBPF, RL stacks, and generative models all exist. Whether an agent is red or blue is decided entirely by reward function and deployment context.

Security teams that think of RL only as a defence feature and ignore the offensive side are underestimating the risk. If security teams do not experiment with autonomous adversarial networks themselves, others will — with the same tools and fewer constraints. Classical controls — firewalls, IDS, NDR — remain in place but increasingly become the sensor and actuator layer beneath a learning control system that sits above them.

RL network agents are not a nice-to-have. They are the next maturity step beyond SIEM, SOAR, and attack simulation. What matters is who masters them first — and who tests their offensive side in their own lab before someone else deploys them in their network.