At the Swiss Plasma Center, the Tokamak à Configuration Variable (TCV) serves as a highly adaptable research platform for exploring advanced plasma control. With a major radius of 0.88 m and a vessel height of 1.50 m, TCV’s flexible magnetic coil system enables a wide range of plasma configurations. External heating is provided through electron cyclotron resonance and neutral beam injection systems, while real-time magnetic flux and field measurements from wire loops and probes feed into the control architecture at a 10‑kHz rate.

The control framework integrates a free‑boundary simulator, FGE, which models the coupled dynamics of plasma and conductors. Plasma equilibrium is defined by the Grad–Shafranov equation, balancing Lorentz forces with pressure gradients. Radial profiles of pressure and current density are parameterized by normalized plasma pressure (βp) and the safety factor at the axis (qA), with total plasma current evolution governed by lumped-parameter magnetohydrodynamic equations. Synthetic magnetic measurements from FGE replicate TCV’s sensors, enabling policy training without direct plant interaction.
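For reference, the equilibrium condition mentioned above is the Grad–Shafranov equation, which expresses the balance of Lorentz force and pressure gradient in terms of the poloidal flux ψ(R, Z) (standard form; the profile functions p(ψ) and F(ψ) are those parameterized via βp and qA):

```latex
\Delta^{*}\psi \;\equiv\; R\,\frac{\partial}{\partial R}\!\left(\frac{1}{R}\,\frac{\partial \psi}{\partial R}\right) + \frac{\partial^{2}\psi}{\partial Z^{2}}
\;=\; -\mu_{0}\,R^{2}\,\frac{dp}{d\psi} \;-\; F\,\frac{dF}{d\psi},
```

where p(ψ) is the plasma pressure profile and F(ψ) = R·Bφ the poloidal current function.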
To ensure robustness, domain randomization varies key plasma parameters (resistivity, βp, and qA) within experimentally derived ranges during training. Sensor and power-supply dynamics are modeled with time delays, Gaussian noise, and offsets to capture hardware-specific behavior. Notably, the control policy learns to command higher voltages during polarity changes, avoiding the conditions that cause coil currents to “stick” when reversing direction.
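The per-episode randomization can be sketched as follows; the parameter names mirror the text, but the numeric ranges here are placeholders, not TCV's actual operating envelopes:

```python
import random

# Illustrative parameter ranges (placeholder values, not TCV's real envelopes)
PARAM_RANGES = {
    "resistivity_scale": (0.5, 2.0),  # multiplier on nominal plasma resistivity
    "beta_p": (0.1, 1.0),             # normalized plasma pressure
    "q_axis": (0.8, 1.2),             # safety factor on the magnetic axis
}

def sample_episode_params(ranges=PARAM_RANGES):
    """Draw one randomized physics-parameter set at the start of a training episode."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

params = sample_episode_params()
```

Resampling at every episode forces the policy to succeed across the whole range rather than overfitting to one simulator configuration.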
The reinforcement learning architecture employs two neural networks: a recurrent critic network with a long short‑term memory layer to handle non‑Markovian dynamics, and a compact policy network optimized for execution within 50 μs to meet the 10‑kHz control rate. The policy network outputs Gaussian action parameters, though only the mean is used during deployment for deterministic control. Training uses the MPO algorithm with episodic simulation runs, a distributed setup of up to 5,000 parallel actors, and stochastic policies for exploration.
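The asymmetry between training and deployment can be illustrated with a minimal policy network in numpy (layer sizes and observation/action dimensions are invented for illustration; the larger recurrent critic, used only during training, is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

class PolicyMLP:
    """Compact feed-forward policy outputting Gaussian action parameters."""

    def __init__(self, n_obs=92, n_hidden=256, n_act=19):
        # n_obs and n_act are placeholders, not TCV's real dimensions
        self.w1 = rng.normal(0.0, 0.1, (n_obs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, 2 * n_act))  # mean and log-std
        self.b2 = np.zeros(2 * n_act)
        self.n_act = n_act

    def forward(self, obs):
        h = np.tanh(obs @ self.w1 + self.b1)
        out = h @ self.w2 + self.b2
        mean, log_std = out[: self.n_act], out[self.n_act :]
        return mean, np.exp(log_std)

    def act(self, obs, deterministic=True):
        mean, std = self.forward(obs)
        if deterministic:  # deployment: take the Gaussian mean only
            return mean
        return mean + std * rng.normal(size=self.n_act)  # training: stochastic exploration

policy = PolicyMLP()
action = policy.act(np.zeros(92))
```

During training the stochastic branch drives exploration; at deployment only the deterministic mean path is evaluated, which is what must fit in the 50 μs budget.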
Multiple objectives—such as plasma shape, position, and current—are combined into a single scalar reward using non‑linear transformations and weighted smooth max or geometric mean combiners. Shape targets are generated via spline‑based canonicalization of boundary points, while scalar quantities like plasma current are compared against time‑varying references. Termination conditions yield large negative rewards, enforcing stability.
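One plausible form of the two combiners named above is sketched below; the exact weightings and transform shapes used in practice are not specified here, so treat these as illustrative definitions:

```python
import numpy as np

def smooth_max(rewards, weights, alpha):
    """Weighted softmax-style combiner: alpha -> -inf approaches the weighted
    minimum (worst objective dominates), alpha = 0 gives the weighted mean."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    e = w * np.exp(alpha * r)
    return float(np.sum(r * e) / np.sum(e))

def geometric_mean(rewards, weights):
    """Weighted geometric mean of per-objective rewards in [0, 1]; any single
    near-zero component drags the combined reward toward zero."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.prod(r ** (w / w.sum())))

# A strongly negative alpha makes the combined reward track the worst objective
combined = smooth_max([0.9, 0.4, 0.7], [1.0, 1.0, 1.0], alpha=-5.0)
```

Both combiners penalize neglecting any single objective, which is why they are preferred over a plain weighted sum for multi-objective shape-and-current control.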
Controllers have demonstrated adaptive behaviors, including autonomously forming low‑elongation plasmas to suppress vertical instability, dynamically balancing poloidal and ohmic coil usage, and avoiding undesirable X‑points or hardware limits when these are encoded in the reward structure. Tests with external heating performed consistently with unheated runs, underscoring the controllers' repeatability.
Deployment requires compiling the trained policy into optimized binary code using tfcompile, stripping exploration components, and tailoring memory access for cache efficiency. This ensures real‑time execution within the strict timing constraints of TCV’s control loop. Plasma shape and position are validated post‑experiment using magnetic‑equilibrium reconstruction with LIUQE, comparing reconstructed boundaries against targets via root‑mean‑square error metrics.
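The post-experiment comparison reduces to a root-mean-square distance between reconstructed and target boundary points; a minimal sketch, assuming points are already paired by index (a real comparison would match points along the boundary, e.g. by poloidal angle):

```python
import numpy as np

def boundary_rmse(reconstructed, target):
    """RMSE between paired (R, Z) boundary points, in metres."""
    r = np.asarray(reconstructed, dtype=float)
    t = np.asarray(target, dtype=float)
    d = np.linalg.norm(r - t, axis=1)  # per-point Euclidean distance
    return float(np.sqrt(np.mean(d ** 2)))

# Hypothetical boundary points near TCV's 0.88 m major radius
err = boundary_rmse([[0.88, 0.00], [0.70, 0.30]],
                    [[0.88, 0.00], [0.70, 0.31]])
```

The same metric can be evaluated per time slice to track how shape error evolves over a discharge.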
Compared to prior model‑based and predictive control approaches, this deep reinforcement learning framework unifies diverse fusion‑control challenges within a single adaptable system. Previous RL efforts have targeted specific parameters like safety factor or beta, but here the architecture generalizes across shape, position, and current objectives. The design leverages established Grad–Shafranov modeling and free‑boundary simulation, making it transferable to other tokamaks with appropriate parameter adjustments.
For application to different devices, the simulator must be configured with accurate coil, vessel, and sensor properties, along with operational parameter ranges. The learning algorithm adapts input and output dimensions automatically, but deployment requires a centralized control system capable of high‑frequency neural network evaluation. Existing controllers handle plasma breakdown and ramp‑up before handover to the learned policy, and machine‑protection layers are advised to mitigate disruption risks.
This work demonstrates how targeted reinforcement learning, grounded in realistic physics models and hardware constraints, can deliver precise, robust magnetic control in fusion research, offering a pathway to adaptable control strategies across future tokamak designs.
