Heating, ventilation, and air-conditioning systems are responsible for nearly half of a building’s total energy consumption in many countries, making them a prime target for efficiency improvements. Balancing occupant comfort with reduced energy use is a complex challenge, largely due to the dynamic nature of building thermal properties, fluctuating electricity prices, and unpredictable environmental conditions. Traditional model predictive control (MPC) methods have been widely applied, leveraging mathematical models to determine optimal control signals under given constraints. While MPC offers flexibility, its performance depends heavily on model accuracy, which can degrade under real-world complexity such as variable occupancy schedules and weather.

Reinforcement learning (RL) offers a model-free alternative, enabling an agent to learn optimal control policies through interaction with the environment, without prior system knowledge. Early applications in HVAC control often used Q-learning to adjust heating or cooling setpoints, sometimes incorporating occupancy models. Deep reinforcement learning (DRL) extends RL by using deep neural networks to approximate value and policy functions, allowing control in high-dimensional state-action spaces. DRL methods such as Deep Q-Networks, deep deterministic policy gradient (DDPG), and multi-agent approaches have demonstrated potential in reducing energy costs while maintaining temperature within comfort ranges.
However, DRL’s trial-and-error learning can be slow and costly when applied directly to real systems. To address this, a hybrid-model-based DRL (HMB-DRL) framework has been proposed for HVAC control. This approach combines a knowledge-driven model—built from physical principles and measurements—with a data-driven model learned from operational data. The framework operates in two phases: pre-training in a simulated environment using the knowledge-driven model to establish a baseline policy, followed by online learning in the real environment where both models are used to guide policy refinement.
The HVAC control problem is formulated as a Markov decision process (MDP) whose state comprises zone temperatures, ambient temperature, electricity price, and time of day. Actions are continuous: each gives the power level of a variable air volume (VAV) unit as a fraction between 0 and 1. The reward function penalizes energy cost, temperature violations, and action violations (actions that fall outside the permitted range), with each term weighted by a tunable coefficient.
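
To make the reward structure concrete, here is a minimal Python sketch of a per-step reward with the three penalty terms. The exact functional form is not given in the text, so everything specific here is an assumption: the names `RewardWeights` and `step_reward`, the comfort band of 19 to 24 °C, the rated power, the 15-minute step, and the choice of linear penalties are all illustrative.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class RewardWeights:
    """Tunable coefficients. The defaults are placeholders, not values from the source."""
    energy: float = 1.0
    temperature: float = 1.0
    action: float = 1.0


def step_reward(
    raw_actions: Sequence[float],
    zone_temps: Sequence[float],
    price: float,
    weights: RewardWeights,
    comfort_band: Tuple[float, float] = (19.0, 24.0),  # assumed comfort range in deg C
    rated_power_kw: float = 10.0,                      # assumed rated power per VAV unit
    step_hours: float = 0.25,                          # assumed 15-minute control step
) -> float:
    """Return the negative weighted sum of the three penalty terms."""
    low, high = comfort_band

    # Action violation: how far each raw action lies outside [0, 1].
    action_violation = sum(max(0.0, -a) + max(0.0, a - 1.0) for a in raw_actions)

    # Only the clipped action can actually be applied to the equipment.
    applied = [min(1.0, max(0.0, a)) for a in raw_actions]
    energy_cost = price * rated_power_kw * step_hours * sum(applied)

    # Temperature violation: distance of each zone temperature from the comfort band.
    temp_violation = sum(max(0.0, low - t) + max(0.0, t - high) for t in zone_temps)

    return -(
        weights.energy * energy_cost
        + weights.temperature * temp_violation
        + weights.action * action_violation
    )
```

Under these assumptions, a single unit running at half power in a comfortable zone incurs only the energy term, while a raw action of 1.2 in an overheated zone incurs all three.
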
A key innovation in the HMB-DRL framework is the protection mechanism. Before executing an RL-generated action in the real environment, the knowledge-driven model predicts its reward. If the predicted reward falls below a threshold, the system blends the RL action with an MPC-derived action to ensure acceptable performance. This limits worst-case outcomes and reduces costly missteps during learning. Another enhancement is the adjusting reward method, which modifies penalty weights between pre-training and online learning. Initially, strong penalties guide the agent toward feasible actions; later, penalties are reduced to encourage exploration near action boundaries.
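
The two mechanisms described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not specify how the RL and MPC actions are blended, so the fixed `blend` ratio is hypothetical, as are the function names, the `predict_reward` callable standing in for the knowledge-driven model, and the two penalty weights. The sketch also assumes that the action-violation weight is the one relaxed in the online phase.

```python
from typing import Callable, List, Sequence


def protected_action(
    rl_action: Sequence[float],
    mpc_action: Sequence[float],
    predict_reward: Callable[[Sequence[float]], float],  # stand-in for the knowledge-driven model
    reward_threshold: float,
    blend: float = 0.5,  # assumed share of the RL action kept when protection triggers
) -> List[float]:
    """Return the RL action, or a blend with the MPC action if the predicted reward is too low."""
    if predict_reward(rl_action) >= reward_threshold:
        return list(rl_action)

    # Protection triggered: pull the action toward the MPC suggestion.
    mixed = [blend * a_rl + (1.0 - blend) * a_mpc for a_rl, a_mpc in zip(rl_action, mpc_action)]

    # Keep the executed action inside the feasible range [0, 1].
    return [min(1.0, max(0.0, a)) for a in mixed]


def action_penalty_weight(
    phase: str,
    pretrain_weight: float = 10.0,  # assumed strong penalty during pre-training
    online_weight: float = 1.0,     # assumed relaxed penalty during online learning
) -> float:
    """Return the action-violation weight for the given training phase."""
    if phase == "pretrain":
        return pretrain_weight
    if phase == "online":
        return online_weight
    raise ValueError(f"unknown phase: {phase!r}")
```

With a toy predictor such as `lambda a: -sum(a)` and a threshold of -0.5, an RL action of 0.9 would trigger protection and be pulled toward the MPC action, whereas an RL action of 0.2 would pass through unchanged.
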
The proposed hybrid-model-based DDPG (HMB-DDPG) algorithm integrates these mechanisms. During online learning, the agent alternates between interacting with the real environment and the data-driven model, updating its policy and value networks. The knowledge-driven model continues to serve as a protector, not a simulator, in this phase.
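
The alternation between real and model-generated experience might be organized roughly as in the Python sketch below. It is a schematic of the loop structure only, not the published algorithm: the callables `real_step`, `model_step`, `fit_model`, and `update_agent` are hypothetical placeholders, transitions are simplified to tuples of scalars, and the ratio of model steps to real steps is an assumed parameter.

```python
from typing import Callable, List, Tuple

# (state, action, reward, next_state), simplified to scalars for illustration.
Transition = Tuple[float, float, float, float]


def online_learning(
    real_step: Callable[[], Transition],             # one protected step in the real environment
    model_step: Callable[[], Transition],            # one simulated step in the data-driven model
    fit_model: Callable[[List[Transition]], None],   # refit the data-driven model on real data
    update_agent: Callable[[Transition], None],      # one policy and value network update
    n_real_steps: int,
    model_steps_per_real_step: int = 2,              # assumed ratio, not from the source
) -> List[str]:
    """Alternate real and model-based updates; return the order in which they happened."""
    real_data: List[Transition] = []
    schedule: List[str] = []

    for _ in range(n_real_steps):
        # Learn from one real interaction and keep it for model fitting.
        transition = real_step()
        real_data.append(transition)
        update_agent(transition)
        schedule.append("real")

        # Refresh the data-driven model, then learn from cheap simulated steps.
        fit_model(real_data)
        for _ in range(model_steps_per_real_step):
            update_agent(model_step())
            schedule.append("model")

    return schedule
```

In a fuller version, `real_step` would pass each proposed action through the protection check before executing it, which is where the knowledge-driven model enters this phase.
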
Simulation tests used a building model with a VAV system, along with ambient temperature and electricity price data from New South Wales and the Australian Energy Market Operator. Pre-training ran for 60 simulated days, followed by 10 days of testing in the real-environment model and then 60 days of online learning. The final policy, obtained after both phases, reduced average energy cost by 26.99% across all periods and by 32.17% during peak price periods compared to one-period MPC, while keeping temperature violations minimal. Compared to the pre-trained policy alone, the final policy reduced the impact of environmental randomness on temperature violations by over 61% across all periods.
Further comparisons revealed that HMB-DDPG achieved higher learning efficiency and lower online learning costs than standard DDPG. The protection mechanism reduced energy costs and temperature violations during high-risk periods, while the adjusting reward method accelerated convergence by easing penalties after the agent learned feasible action ranges.
These findings underscore the potential of hybrid-model-based reinforcement learning to enhance HVAC efficiency in dynamic environments. By leveraging both physical models and operational data, the approach mitigates the drawbacks of purely model-free learning, delivering faster, safer, and more cost-effective policy development.
