Reinforcement learning traditionally relies on approximating the action-value function \(Q(\mathbf{s}_n, a_n)\) to guide decision-making. In this framework, an agent interacts with an environment, selecting actions to maximize cumulative rewards. While deep neural networks have been widely used to approximate \(Q\), they demand substantial computational resources. A photonic delay-based reservoir computing approach offers an alternative, leveraging fast optoelectronic processing to reduce learning costs.

This reservoir computing architecture comprises input, reservoir, and output layers. The reservoir itself is formed by a nonlinear element with a feedback loop, its nodes implemented virtually by slicing the temporal output into short intervals \(\theta\). The number of virtual nodes is \(N = \tau/\theta\), where \(\tau\) is the feedback delay. This temporal multiplexing removes the need for numerous physical nodes.
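The node count follows directly from the two timescales. A minimal sketch of the bookkeeping (the values of \(\tau\) and \(\theta\) below are illustrative, not taken from the experiment):

```python
import numpy as np

# Illustrative timescales (not the paper's experimental values)
tau = 1.0e-6    # feedback delay of the loop, in seconds
theta = 2.0e-8  # width of one virtual-node interval, in seconds

N = int(round(tau / theta))  # number of virtual nodes, N = tau / theta

# One reservoir update spans one delay interval: the system output over
# that interval, sampled once per slot of width theta, gives the N
# virtual node states.
output_trace = np.random.rand(N)  # stand-in for the measured optical output
node_states = output_trace        # one sample per virtual node
```

Slicing the temporal output this way is what lets a single physical nonlinearity stand in for \(N\) network nodes.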
In the input layer, the environment’s state vector \(\mathbf{s}_n\) undergoes a masking procedure. A random mask matrix \(\mathbf{M}\) assigns connection weights from state elements to reservoir nodes, scaled by \(\mu\) and augmented with a bias \(b\). The bias ensures nonzero input even when state elements are near zero, and introduces node-specific nonlinearities by shifting the oscillation center of each node’s input. This diversity in nonlinear response improves the reservoir’s approximation capability.
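The masking step can be sketched as follows; the binary mask distribution, the values of \(\mu\) and \(b\), and the dimensions are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 50, 4  # virtual nodes and state dimension (illustrative)

# Random mask M assigns a connection weight from each state element to
# each virtual node; a +/-1 binary mask is assumed here.
M = rng.choice([-1.0, 1.0], size=(N, S))
mu, b = 0.1, 0.8  # input scaling and bias

def masked_input(s):
    # Node i is driven by mu * (M @ s)_i + b. The bias b keeps the drive
    # nonzero even when s is near zero and shifts each node's operating
    # point on the nonlinearity.
    return mu * (M @ s) + b

u = masked_input(np.zeros(S))
# With s = 0, every node still receives the bias b.
```

Because \(\mathbf{M}\) differs per node, each node sits at a different point on the nonlinearity, which is the source of the response diversity described above.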
The output layer computes \(Q(\mathbf{s}_n, a)\) as a weighted sum of virtual node states. Weights \(\mathbf{w}_a\) are trained using the Q-learning algorithm, which updates them based on temporal difference error \(\delta_n\) and a step-size \(\alpha\). The discount factor \(\gamma\) controls the influence of future rewards.
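A minimal sketch of this readout and its Q-learning update, assuming a linear readout per action; the dimensions and the values of \(\alpha\) and \(\gamma\) are illustrative:

```python
import numpy as np

N, A = 50, 2          # virtual nodes and number of actions (illustrative)
alpha, gamma = 0.1, 0.99
w = np.zeros((A, N))  # one readout weight vector w_a per action

def q_values(x):
    # Q(s, a) is a weighted sum of the virtual node states x for state s.
    return w @ x

def q_update(x, a, r, x_next, done):
    # Temporal-difference update: only the readout weights are trained;
    # the reservoir itself stays fixed.
    target = r if done else r + gamma * np.max(q_values(x_next))
    delta = target - q_values(x)[a]  # TD error delta_n
    w[a] += alpha * delta * x
    return delta

# One illustrative update from all-ones node states and reward 1.
x = np.ones(N)
delta = q_update(x, a=0, r=1.0, x_next=np.ones(N), done=True)
```

Training only the output weights is what keeps the parameter count small relative to a deep network.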
The optoelectronic delay system used for implementation consists of a laser diode, Mach–Zehnder modulator (MZM), optical fiber for delayed feedback, photodetector, and amplifier. The MZM’s \(\cos^2(\cdot)\) nonlinearity transforms electrical inputs into optical outputs. Feedback through the fiber introduces memory effects, enabling the reservoir to incorporate past states into its processing. In experiments without delayed feedback, the system functions as an extreme learning machine.
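A discrete-time caricature of the loop dynamics can make the roles of the \(\cos^2(\cdot)\) nonlinearity and the feedback explicit; this is not the paper's continuous-time model, and \(\kappa\), the phase offset, and the constant drive are illustrative:

```python
import numpy as np

def reservoir_step(x_prev, u, kappa=0.9, phi=0.2):
    # Each virtual node passes its masked input u, plus kappa-scaled
    # feedback of its own state one delay tau earlier, through the MZM's
    # cos^2 transfer function.
    return np.cos(kappa * x_prev + u + phi) ** 2

x = np.zeros(50)
for n in range(10):
    x = reservoir_step(x, u=0.8)  # constant masked drive as a stand-in

# Setting kappa = 0 removes the feedback memory, which corresponds to
# operating the system as an extreme learning machine.
```

The feedback term is what gives the reservoir memory of past inputs; removing it reduces the system to the memoryless extreme-learning-machine regime mentioned above.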
Performance was tested on OpenAI Gym’s CartPole-v0 task, where a pole must be balanced on a moving cart. With an input bias of \(b=0.8\), numerical simulations reached the maximum reward of 200 within 31 episodes and maintained it for 100 consecutive episodes. Without bias, the system failed to solve the task, often selecting the same action regardless of state. Experimental runs achieved maximum reward by episode 110, though learning was slower due to measurement noise and absence of delayed feedback.
Compared with deep neural networks, which required over 150 episodes and far larger parameter sets, the photonic reservoir achieved comparable or faster learning with only a fraction of the trainable parameters.
The MountainCar-v0 task, in which a car must be driven up a steep slope, provided a further test. Numerical simulations solved the task by episode 267. Experiments, however, struggled to reach the reward threshold because the task's negative reward structure discourages the exploratory actions needed to build momentum. Fixing the trained output weights at episode 180 nevertheless allowed the task to be solved, demonstrating robustness to parameter perturbations.
Bias magnitude proved critical. Numerical sweeps revealed that \(b>0.5\) significantly improved rewards, with optimal performance near \(b=1\), aligning with the MZM’s normalized half-wave voltage. Time-delayed feedback also enhanced performance; without it, rewards were lower. Varying feedback strength \(\kappa\) showed best results near \(\kappa=1\), the edge of chaos, where the system’s dynamics balance stability and rich temporal behavior. Beyond \(\kappa=1\), periodic oscillations reduced consistency, impairing learning.
These findings highlight the interplay between input bias, feedback dynamics, and nonlinear optoelectronic behavior in optimizing reinforcement learning. By tuning these parameters, photonic reservoir computing can deliver fast, low-cost decision-making for control tasks, offering potential advantages in domains requiring rapid, adaptive responses.
