Between 1978 and 2019, China issued a vast array of environmental policies across multiple levels of government and ministries. To capture this breadth, researchers undertook a systematic retrieval of policy documents from official databases and websites, ensuring both comprehensive coverage and integrity in the collected data. Each policy text was read in detail, with only those directly relevant to environmental regulation retained, thereby safeguarding the accuracy of the dataset.

With this curated corpus, the team explored whether learning-based models could automatically estimate the intensity of environmental policy from 1978 to 2016. Several statistical and machine learning algorithms were tested, ranging from linear regression and regularized variants such as Ridge regression and LASSO to more complex models such as support vector machines and random forests. The dataset was split into training (75%) and testing (25%) subsets, with 10-fold cross-validation applied to the training subset to estimate training error. Random forest achieved the lowest root mean squared error on both the training and test sets and was therefore selected for intensity measurement.
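
The evaluation protocol described above (a 75/25 split, 10-fold cross-validation on the training portion, and comparison by root mean squared error) can be sketched roughly as follows. This is a minimal illustration, not the study's code: the data are synthetic, only ordinary least squares and a closed-form ridge fit are compared, and every name here (`fit_ridge`, `cv_rmse`, the `alpha` values) is an assumption made for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_ridge(X, y, alpha):
    """Closed-form ridge fit with intercept; alpha=0 gives ordinary least squares."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unpenalized
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def cv_rmse(X, y, alpha, k=10, seed=0):
    """Mean validation RMSE over k shuffled folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = fit_ridge(X[train], y[train], alpha)
        scores.append(rmse(y[val], predict(coef, X[val])))
    return float(np.mean(scores))

# Synthetic stand-in for (policy features, expert intensity score); not real data.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

# 75% training / 25% testing split.
order = rng.permutation(len(X))
cut = int(0.75 * len(X))
train_idx, test_idx = order[:cut], order[cut:]

# For each candidate model: (10-fold CV RMSE on training data, RMSE on held-out test data).
results = {}
for name, alpha in {"ols": 0.0, "ridge": 1.0}.items():
    cv = cv_rmse(X[train_idx], y[train_idx], alpha)
    coef = fit_ridge(X[train_idx], y[train_idx], alpha)
    results[name] = (cv, rmse(y[test_idx], predict(coef, X[test_idx])))

best = min(results, key=lambda name: results[name][1])
```

In the same spirit, further candidates such as LASSO, support vector machines, or a random forest would be added as extra entries in the loop, each scored by the same two RMSE figures.
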
Beyond prediction accuracy, the study examined which policy features most strongly influenced intensity scores. Using a model-agnostic approach, the importance of each variable was evaluated by removing it from the model and observing the resulting change in the loss function. This yielded a variable importance (“vip”) score that ranked the top 20 contributors. Energy and technology objectives dominated the list, particularly those aimed at optimizing the energy consumption structure and promoting technological transformation for energy conservation and emission reduction. Among policy measures, administrative actions, fiscal and tax instruments, and financial tools were the most influential.
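
One way to read the "remove a variable and observe the change in loss" idea is drop-column importance: refit the model without each feature and record how much the held-out error grows. The sketch below assumes that reading and uses a plain least-squares model on synthetic data; the feature names are hypothetical placeholders, and the study's own procedure may differ (for example, by permuting a feature rather than refitting without it).

```python
import numpy as np

def fit_predict_rmse(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept on the training data; return test RMSE."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    pred = np.column_stack([np.ones(len(X_test)), X_test]) @ coef
    return float(np.sqrt(np.mean((y_test - pred) ** 2)))

def drop_column_importance(X_train, y_train, X_test, y_test, names):
    """Rank features by the increase in test RMSE when each one is left out."""
    baseline = fit_predict_rmse(X_train, y_train, X_test, y_test)
    scores = {}
    for j, name in enumerate(names):
        keep = [c for c in range(X_train.shape[1]) if c != j]
        loss = fit_predict_rmse(X_train[:, keep], y_train, X_test[:, keep], y_test)
        scores[name] = loss - baseline
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical feature names and synthetic data, for illustration only.
names = ["energy_structure", "tech_transformation", "fiscal_tax", "noise_feature"]
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, 2.0, 1.0, 0.0]) + rng.normal(scale=0.2, size=300)

# First 225 rows for fitting, last 75 rows for measuring the loss.
ranking = drop_column_importance(X[:225], y[:225], X[225:], y[225:], names)
```

Here `ranking` is a list of `(feature, loss increase)` pairs sorted from most to least important; a feature that contributes nothing to the target should land near zero.
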
Policies that combined clear objectives with concrete measures tended to register higher intensity. Detailed provisions related to energy systems, industrial processes, fiscal mechanisms, and technological upgrades correlated strongly with elevated scores. Robustness checks using two random forest importance measures, the increase in mean squared error (IncMSE) and the increase in node purity (IncNodePurity), confirmed that objectives generally exerted a greater impact on intensity than measures, though specific interventions such as industrial upgrading or targeted fiscal policies remained critical.
To validate the machine-derived scores, the researchers compared them against manual ratings by human experts for the 1978–2016 period. Two similarity metrics were employed: standardized Euclidean distance, a lockstep measure that compares the two series point by point, and dynamic time warping (DTW), an elastic measure capable of aligning sequences of differing lengths and rhythms. The standardized Euclidean distance between the two series was 0.05, while the DTW distance was 5.59, indicating high concordance. Visualizing the DTW alignment path revealed a smooth trajectory running close to the main diagonal, signifying minimal warping cost and strong agreement between manual and automated quantifications.
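
To make the two comparisons concrete, the sketch below computes a lockstep distance and a DTW distance with its alignment path in plain Python. It rests on assumptions: "standardized" is taken here to mean z-scoring each series before the Euclidean distance, which may not match the normalization used in the study; the DTW uses an absolute-difference local cost with the basic three-step pattern; and the two short series are made-up numbers, not the study's scores.

```python
import math

def zscore(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / std for x in series]

def lockstep_distance(a, b):
    """Euclidean distance comparing the i-th point of a with the i-th point of b."""
    if len(a) != len(b):
        raise ValueError("lockstep comparison needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Return (DTW distance, alignment path) using absolute difference as local cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, preferring the diagonal step when costs tie.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path

# Made-up example scores standing in for manual and machine-derived intensity.
manual = [1.0, 2.0, 3.0, 4.0, 5.0]
machine = [1.1, 2.0, 2.9, 4.2, 5.0]

standardized_distance = lockstep_distance(zscore(manual), zscore(machine))
dtw_distance, dtw_path = dtw(manual, machine)
on_diagonal = all(i == j for i, j in dtw_path)
```

When two series agree closely, the path stays on or near the diagonal, as `on_diagonal` checks here; a path that wanders far from the diagonal would mean one series had to be stretched or compressed in time to match the other.
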
The work demonstrates the potential of text analysis and machine learning to quantify policy intensity in a rigorous, replicable manner. The environmental policy lexicon developed through this process offers a valuable reference for quantitative policy studies, enabling systematic evaluation of China’s environmental governance. The resulting dataset not only streamlines access to relevant policies but also provides a foundation for future research in environmental policy assessment.
While the current collection focuses on central government issuances, its structure and methodology can be adapted for broader scopes, including regional or sector-specific policies. For engineers, data scientists, and policy analysts, this integration of computational modeling with governance documentation exemplifies how advanced analytical tools can illuminate complex regulatory landscapes. It underscores the interplay between technological capability and policy design, revealing how targeted objectives and measures shape the trajectory of environmental regulation over decades.
