Reinforcement learning (RL) for Markov decision processes (MDPs) has long been split between model-based and model-free approaches, each with distinct strengths and weaknesses. Model-based algorithms, such as R-max, use transition data to build explicit models, yielding high sample efficiency when state-action spaces are small. Model-free methods, exemplified by Delayed Q-learning, bypass model construction, scaling better to large state spaces but often requiring more data. The new Dyna-Delayed Q-learning (DDQ) algorithm integrates both paradigms while preserving the probably approximately correct (PAC) property, which guarantees, with high probability, near-optimal behavior on all but a polynomially bounded number of time steps.

DDQ draws inspiration from Sutton’s Dyna-Q framework, which accelerates Q-learning with simulated experiences from a learned model. It merges the model-free update mechanism of Delayed Q-learning with the model-based planning of R-max. Two update types govern learning: type-1 (model-free) updates rely on recent experience samples, decreasing Q-values by at least a set threshold; type-2 (model-based) updates invoke value iteration after sufficient state-action visits. Parameters m₁ and m₂ control the balance between these modes, allowing DDQ to mimic either parent algorithm when tuned accordingly.
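
The two update types can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DDQSketch`, the slack parameter `eps1`, the optimistic initial value `r_max / (1 - gamma)`, the rule that planning runs once a pair has been visited `m2` times, and the choice to let planning only lower Q-values are all assumptions borrowed from common descriptions of Delayed Q-learning and R-max. The bookkeeping flags that the full algorithms use to decide when to attempt updates are omitted.

```python
class DDQSketch:
    """Illustrative hybrid of Delayed Q-learning and R-max style updates."""

    def __init__(self, n_states, n_actions, gamma, m1, m2, eps1, r_max=1.0):
        self.nS, self.nA = n_states, n_actions
        self.gamma, self.m1, self.m2, self.eps1 = gamma, m1, m2, eps1
        v_max = r_max / (1.0 - gamma)  # optimistic initial value (assumption)
        self.Q = [[v_max] * n_actions for _ in range(n_states)]
        # Buffered one-step targets per (s, a), consumed by type-1 updates.
        self.targets = [[[] for _ in range(n_actions)] for _ in range(n_states)]
        # Empirical model statistics per (s, a), consumed by type-2 updates.
        self.counts = [[[0] * n_states for _ in range(n_actions)]
                       for _ in range(n_states)]
        self.reward_sum = [[0.0] * n_actions for _ in range(n_states)]

    def act(self, s):
        """Greedy action with respect to the current Q-values."""
        row = self.Q[s]
        return max(range(self.nA), key=row.__getitem__)

    def observe(self, s, a, r, s_next):
        """Record one transition and trigger whichever updates are due."""
        n = sum(self.counts[s][a])
        if n < self.m2:
            self.counts[s][a][s_next] += 1
            self.reward_sum[s][a] += r
            if n + 1 == self.m2:  # the pair has just become "known"
                self._type2_update()
        self.targets[s][a].append(r + self.gamma * max(self.Q[s_next]))
        if len(self.targets[s][a]) == self.m1:
            self._type1_update(s, a)

    def _type1_update(self, s, a):
        """Model-free: average m1 recent targets, lower Q if the drop is large."""
        batch = self.targets[s][a]
        avg = sum(batch) / len(batch)
        batch.clear()
        if self.Q[s][a] - avg >= 2 * self.eps1:
            self.Q[s][a] = avg + self.eps1

    def _type2_update(self, sweeps=200):
        """Model-based: value iteration over known pairs; others stay fixed."""
        known = [(s, a) for s in range(self.nS) for a in range(self.nA)
                 if sum(self.counts[s][a]) >= self.m2]
        Q = [row[:] for row in self.Q]
        for _ in range(sweeps):
            V = [max(row) for row in Q]
            for s, a in known:
                n = sum(self.counts[s][a])
                r_hat = self.reward_sum[s][a] / n
                ev = sum(c * V[s2] for s2, c in enumerate(self.counts[s][a])) / n
                Q[s][a] = r_hat + self.gamma * ev
        for s, a in known:  # assumption: planning may only lower Q-values
            self.Q[s][a] = min(self.Q[s][a], Q[s][a])
```

In this sketch, setting `m2` very large leaves only the type-1 path active, so the agent behaves like a simplified Delayed Q-learner, while setting `m1` very large leaves only the planning path, which resembles R-max. That mirrors the role the text assigns to m₁ and m₂.
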
Theoretical analysis shows that DDQ’s worst-case sample complexity is the minimum of the R-max and Delayed Q-learning bounds, scaling as O(min(|S|²|A| / (ε³(1−γ)⁸), |S||A| / (ε⁴(1−γ)⁸))). Although this bound is looser than those of the best-known model-based and model-free PAC algorithms taken individually, empirical tests show that DDQ often outperforms them in practice, particularly on difficult-to-learn MDPs. In such cases, DDQ not only requires fewer samples than Mormax (model-based) and UCB Q-learning (model-free) but also reduces computational load by limiting how often the learned model must be re-solved.
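
To see which term of the bound applies, note that the ratio of the first term to the second is |S|·ε, so the model-based term is the smaller one exactly when |S| < 1/ε. The short Python sketch below evaluates both terms as quoted above, with constants and logarithmic factors dropped. The function names are hypothetical, and the values are orders of magnitude rather than actual sample counts.

```python
def ddq_bound_terms(n_states, n_actions, eps, gamma):
    """Return the (model-based, model-free) terms of the quoted bound."""
    horizon = (1.0 - gamma) ** 8
    model_based = n_states ** 2 * n_actions / (eps ** 3 * horizon)
    model_free = n_states * n_actions / (eps ** 4 * horizon)
    return model_based, model_free


def dominant_term(n_states, n_actions, eps, gamma):
    """Name the term that attains the minimum in the bound."""
    model_based, model_free = ddq_bound_terms(n_states, n_actions, eps, gamma)
    return "model-based" if model_based <= model_free else "model-free"
```

For example, with ε = 0.05 the crossover sits at |S| = 20: a nine-state problem falls on the model-based side, whereas a thousand-state problem falls on the model-free side.
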
On a nine-state grid-world benchmark, DDQ showed modest sample-complexity gains over both parent algorithms when m₁ and m₂ were tuned for parity. In a challenging MDP designed to resemble biased-coin detection, DDQ reached a near-optimal policy with 5,662 samples on average, compared with 7,770 for Mormax and 8,097 for UCB Q-learning (roughly 27% and 30% fewer, respectively), while re-solving its model fewer times.
The motivating application for DDQ lies in pediatric motor rehabilitation via human-robot interaction (HRI). Infants with motor delays often have limited opportunities for self-initiated exploration and social engagement. Interactive robots can act as adaptive playmates, encouraging goal-driven movement and reducing caregiver workload. However, each child’s behavior is unique, and data from rehabilitation sessions is sparse, challenging conventional RL methods.
For this domain, MDP modeling is preferred over more complex partially observable models, as it reduces parameter count and data requirements. In a pilot study, DDQ was applied to a simple “chase” game between a mobile robot (Dash) and a 10-month-old infant. States captured the child’s attention and activity levels—such as not looking, looking, touching/expressing excitement, or chasing—while robot actions varied distance and movement patterns. Rewards favored states involving active engagement.
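
One way to encode such an MDP and estimate its transition model from annotated data is sketched below. The four state names follow the examples given in the text. The reward values, the action label `"approach"` used in the usage note, and the function name `estimate_transitions` are hypothetical placeholders, since the source does not list the exact rewards or actions.

```python
from collections import Counter, defaultdict
from enum import IntEnum


class ChildState(IntEnum):
    NOT_LOOKING = 0
    LOOKING = 1
    TOUCHING_OR_EXCITED = 2
    CHASING = 3


# Placeholder rewards (assumption): only actively engaged states are rewarded.
REWARD = {
    ChildState.NOT_LOOKING: 0.0,
    ChildState.LOOKING: 0.0,
    ChildState.TOUCHING_OR_EXCITED: 1.0,
    ChildState.CHASING: 1.0,
}


def estimate_transitions(triples):
    """Empirical P(s' | s, a) from annotated (state, action, next_state) triples."""
    counts = defaultdict(Counter)
    for s, a, s_next in triples:
        counts[(s, a)][s_next] += 1
    return {
        sa: {s2: c / sum(ctr.values()) for s2, c in ctr.items()}
        for sa, ctr in counts.items()
    }
```

Calling `estimate_transitions` on triples such as `(ChildState.LOOKING, "approach", ChildState.CHASING)` returns a nested dictionary of relative frequencies. With only a handful of sessions, many state-action pairs will have few or no samples, which is the data sparsity the text describes.
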
Six one-hour sessions were conducted with the robot teleoperated by a human, generating annotated video data. DDQ was trained on this dataset to learn a policy, which was then deployed autonomously in two additional sessions. Accumulated rewards, normalized by interaction time, were compared between human-controlled and DDQ-driven play. Statistical analysis yielded a 95% confidence interval for the improvement of [0.0289, 3.4197] and a p-value of 0.0477, so the gain of DDQ’s policy over the human operator’s strategy is statistically significant at the 5% level.
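
The time normalization mentioned above amounts to dividing a session's accumulated reward by its duration, so that sessions of different lengths can be compared. A minimal sketch, assuming rewards are logged per annotated step and duration is measured in minutes (the source does not state the unit); the function name is hypothetical:

```python
def normalized_return(rewards, duration_minutes):
    """Accumulated reward per unit of interaction time."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    return sum(rewards) / duration_minutes
```

With made-up numbers, `normalized_return([1.0, 0.0, 1.0, 1.0], 2.0)` gives 1.5 reward units per minute.
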
By blending model-based precision with model-free scalability, DDQ addresses the data efficiency and adaptability demands of real-world HRI in rehabilitation contexts. Its PAC guarantees provide theoretical assurance, while empirical results suggest practical gains in both learning speed and computational economy. This hybrid approach offers a pathway toward more autonomous, responsive robotic partners capable of enhancing therapeutic outcomes for children with motor impairments.
