Study area
Shanxi Province (34°34′-40°44′N, 110°14′-114°33′E) is located in northern China, in the eastern part of the Loess Plateau, with a total area of 156,700 square kilometers (Fig. 1). The province encompasses twelve planned mining areas, featuring a wealth of diverse mineral resources and a complex geological environment. The climate is classified as a temperate continental monsoon climate, characterized by significant temperature variations across regions, with temperatures decreasing from south to north and from plains to mountainous areas. The province’s geomorphology is predominantly mountainous and hilly, comprising approximately 80% of the total area. The study area includes critical regions such as the Taihang and Lüliang mountains, which are rich in resources but face numerous challenges, including resource depletion and ecological restoration.
Location map of the study area. (This image was created using ArcGIS 10.8 software, which is an Esri product and is publicly available at the URL (https://www.esri.com/). The base map is based on the standard map (GS2024, No. 0650), and no modifications have been made to the map boundaries.)
Technical route
The detailed workflow of this study is illustrated in Fig. 2: (1) RSEI was constructed based on the GEE platform to invert the ecological quality changes in Shanxi Province and its planned mining areas from 2000 to 2020; (2) Theil-Sen analysis and Mann-Kendall trend tests were employed to analyze the RSEI trend; (3) Moran’s I was utilized to assess the spatial correlation of RSEI in Shanxi Province; (4) The interactions between RSEI and potential driving factors were analyzed using CatBoost and GWR models.

Technology roadmap. (This maps was created using ArcGIS 10.8 software, which is an Esri product, generated using our dataset, and is publicly available at the URL (https://www.esri.com/).)
Datasets and preprocessing
This study employed the MODIS dataset to construct the ecological quality assessment index. The data sources included the MOD09A1 (2000–2023, 500 m resolution, 8-day interval) and MOD11A2 (2000–2023, 500 m resolution, 8-day interval) from the MODIS collection. The JRC annual water classification historical data was utilized for water body masking. Additional datasets were leveraged to extract potential factors that may influence ecological quality in the study area, including Terra Climate (2000–2020, providing climate factor data), NASA’s SRTM Digital Elevation data (90 m resolution), population density data provided by East View Cartographic, and MYD17A3HGF (2000–2020, net primary productivity data). All data were acquired from the GEE cloud platform, with all datasets resampled to a resolution of 500 m. Atmospheric correction and relevant preprocessing techniques were applied to enhance data quality. To account for seasonal variation and meteorological conditions, the data time window was set from June 15 to September 15 each year, ensuring consistency in vegetation status and ecological outcomes.
Methods
RSEI indicator construction
RSEI leverages ground information and ecological data obtained through remote sensing technologies, employing mathematical and statistical methods for processing and analysis, thereby facilitating rapid assessment and monitoring of the health status and environmental quality of specific regions31. Given that greenness, humidity, heat, and dryness are crucial components of the ecological environment, this study selects these four ecological factors to construct the RSEI index32.
$$ :RSEI = f(Greenness,Wetness,Heat,Dryness) $$
(1)
where f is the set of functions used to perform PCA.
kNDVI applies a weighted average of NDVI for each pixel and its surrounding pixels to more accurately capture the details and variations in vegetation coverage. Thus, kNDVI serves as a representation of greenness. The calculation formula is as follows:
$$:NDVI=frac{NIR-Red}{NIR+Red}$$
(2)
$$ :kNDVI = tanhleft( {left( {frac{{NIR – Red}}{{2sigma :}}} right)^{2} } right) = tanhleft( {left( {frac{{NDVI}}{{2sigma :}}} right)^{2} } right) $$
(3)
where NIR and RED represent the reflectances of the near-infrared (NIR1) and red bands of MOD09A1, respectively. The symbol σ denotes a length scale that is proportional to the average reflectance of the near-infrared and red bands, which can be adjusted. When σ is set to 0.5(NIR + RED), the calculation formula is as follows:
$$:kNDVI=tanhleft({NDVI}^{2}right)$$
(4)
Humidity (WET) is calculated using the formula:
$$ begin{gathered} :WET = 0.1147 times :rho :_{1} + 0.2489 times :rho :_{2} + 0.2408 times :rho :_{3} + 0.3132 hfill \ quad quad times :rho :_{4} – 0.3122 times :rho :_{5} – 0.6416 times :rho :_{6} – 0.5087 times :rho :_{7} hfill \ end{gathered} $$
(5)
where (:{rho:}_{i}) (i = 1, 2, …, 7) represents the reflectance of the red, near-infrared 1, blue, green, near-infrared 2, short-wavelength infrared 1, and short-wavelength infrared 2 bands of the MOD09A1 image, respectively.
Dryness consists of the Index of building (IBI) and the index of bare Soil (SI). The formula for calculation is:
$$:NDBSI=left(IBI+SIright)/2$$
(6)
$$:IBI=frac{2{rho:}_{6}/({rho:}_{6}+{rho:}_{2})-{rho:}_{2}/({rho:}_{2}+{rho:}_{1})+{rho:}_{4}/({rho:}_{4}+{rho:}_{6})}{2{rho:}_{6}/({rho:}_{6}+{rho:}_{2})+{rho:}_{2}/({rho:}_{2}+{rho:}_{1})+{rho:}_{4}/({rho:}_{4}+{rho:}_{6})}$$
(7)
$$:SI=frac{left({rho:}_{6}+{rho:}_{1}right)-left({rho:}_{2}+{rho:}_{3}right)}{left({rho:}_{6}+{rho:}_{1}right)+left({rho:}_{2}+{rho:}_{3}right)}$$
(8)
where (:{rho:}_{i}) (i = 1, 2, …, 6) denotes the reflectance of the red, near-infrared 1, blue, green, near-infrared 2, and short-wavelength infrared 1 bands of the MOD09A1 image, respectively.
Heat is expressed in terms of land surface temperature (LST). MOD11A2 LST products providing the necessary data. The raw LST data (LST0) requires unit conversion from Kelvin (K) to degrees Celsius (°C). The formula for this conversion is:
$$:LST=0.02LS{T}_{0}-273.15$$
(9)
In this paper, the RSEI is calculated using the following formula:
(:RSEI=f(kNDVI,Wet,NDBSI,LST))(10).
After obtaining the results of the four ecological factors, it is essential to linearly map each indicator’s values to the range of [0, 1] for normalization. This process eliminates the impacts caused by differing units and value ranges, allowing for PCA to construct the RSEI.
Furthermore, RSEI is subjected to additional normalization to facilitate comparison and measurement, as shown in the following formula:
$$:RSEI=frac{{RSEI}_{0}-{RSEI}_{min}}{{RSEI}_{max}-{RSEI}_{min}}$$
(11)
where (:{RSEI}_{0})is the initial RSEI, and (:{RSEI}_{min}) and (:{RSEI}_{max}) represent the minimum and maximum values of (:{RSEI}_{0}), respectively. A higher RSEI value, approaching 1, indicates better ecological environmental quality, while a lower RSEI value signifies poorer ecological environmental quality.
To enhance the assessment and comparison of ecological conditions, the RSEI values were categorized into five levels from high to low: excellent (0.8–1), good (0.6–0.8), moderate (0.4–0.6), fair (0.2–0.4), and poor (0–0.2). This classification provides a clearer understanding of ecological quality, with values closer to 1 indicating better ecological environment health.
Sen’s slope estimator and Mann–Kendall statistical test
Sen’s Slope Estimator can accurately estimate the trend slope in time series data, encompassing both linear and nonlinear trends. Its robustness allows it to effectively handle outliers and data with significant fluctuations. By employing Sen’s Slope Estimator, the strength of the trend in each time series can be determined, thereby identifying the rate and magnitude of ecosystem changes. The calculation formula is as follows:
$$:beta:=Medianleft(frac{{X}_{j}-{X}_{i}}{j-i}right),forall:j>i$$
(12)
where (:beta:) denotes the slope and (:{X}_{i}:)and (:{X}_{j}:)denote the data values for year i and year j, respectively.
The MK test is employed to detect the presence and significance of trends in time series data33. It does not rely on assumptions about the data distribution, making it applicable to various types of ecological data. While Sen’s Slope Estimator provides a quantitative assessment of the trends, the Mann–Kendall test confirms the significance and statistical stability of these trends34. The calculation formulas for the MK test are as follows:
$$:S={sum:}_{i=1}^{n-1}{sum:}_{j=i+1}^{n}sgnleft({x}_{j}-{x}_{i}right)$$
(13)
$$:sgnleft({x}_{j}-{x}_{i}right)=left{begin{array}{c}1,:{x}_{j}-{x}_{i}>0\:0{,x}_{j}-{x}_{i}=0\:-1{,x}_{j}-{x}_{i}<0end{array}right.$$
(14)
$$:V=nleft(n-1right)left(2n+5right)/18$$
(15)
$$:Z=left{begin{array}{c}frac{S-1}{sqrt{V}},S>0\:0,:S=0\:frac{S+1}{sqrt{V}},S<0end{array}right.$$
(16)
where n denotes the time series length; sgn is the sign function; S is the statistic; and Z is the trend significance test statistic.
In the MK test, the significance of RSEI changes is determined based on a set significance level α. Typically, α is set at 0.05, which corresponds to a Z value of ± 1.96. Therefore, when the absolute value of the calculated Z from the MK test exceeds 1.96, the change passes the significance test at a 95% confidence level, indicating that the trend change is significant; conversely, the opposite holds true. The results are categorized into five classes, as shown in the table below(Table 1).
Hurst exponent
The Hurst exponent is an indicator used to measure the long-term memory in time series or fractals, aiding in understanding the fundamental dynamics of the series and providing a degree of predictability for changes in ecological quality35. The formula for calculating the Hurst exponent is as follows:
$$:H=log(R/S)/logleft(nright)$$
(17)
where H is the Hurst exponent. In the actual computation, it is typically necessary to perform multiple decompositions and calculate the R/S values at different scales, followed by averaging these R/S values to obtain the final estimate of the Hurst exponent.
The Hurst exponent is commonly employed to analyze the persistence or anti-persistence of time series, with values ranging from 0 to 1. Specifically, when 0 < H < 0.5, it indicates anti-persistent change; the closer H is to 0, the stronger the anti-persistence. When 0.5 < H < 1, it suggests persistent change; the closer H is to 1, the stronger the persistence. When H equals 0.5, it denotes that the time series is independent and random, exhibiting uncertainty with no correlation between past and future trends.
Global moran’s
Global Moran’s I is a commonly used method for spatial autocorrelation analysis, employed to measure the spatial correlation among observations within a spatial dataset and to reveal the spatial relationships between reference units and their neighboring units26. The Moran’s I statistic ranges from − 1 to + 1; a value close to + 1 indicates that the observations of neighboring geographic units tend to be similar, while a value close to -1 suggests that the observations are inclined to be opposite. A Moran’s I value near 0 indicates that the data are randomly distributed in space, showing no significant spatial autocorrelation. The formula for calculating Moran’s I is as follows:
$$:Moran{prime:}s:I=frac{m*{sum:}_{i=1}^{m}{sum:}_{j}^{m}{w}_{ij}({x}_{i}-stackrel{-}{x})({x}_{j}-stackrel{-}{x})}{{sum:}_{i=1}^{m}{sum:}_{j}^{m}{w}_{ij}{({x}_{i}-stackrel{-}{x})}^{2}}$$
(18)
where m represents the total number of samples, (:{x}_{i})and (:{x}_{j})denote the RSEI values at positions i and j, (:{w}_{ij})indicates the spatial weight value between i and j, and (:stackrel{-}{x}) is the mean value of the RSEI.
CatBoost
CatBoost is a machine learning algorithm based on gradient-boosted decision trees, demonstrating exceptional performance in handling classification and regression problem36. One notable advantage of CatBoost is its ability to efficiently manage categorical features while achieving high prediction accuracy. Additionally, CatBoost employs regularization techniques to mitigate overfitting, enhancing the robustness of the model when dealing with complex data. The fundamental steps for training a CatBoost model are illustrated in Fig. 3: first, create a CatBoost model and train it using the training dataset; once the model is trained, it can be utilized for predicting new data. Throughout the training and prediction processes, CatBoost simplifies the complexities of data preprocessing and offers an efficient gradient boosting algorithm, assisting users in constructing more accurate models.

CatBoost implementation process. In the figure X_traint ,Y traint is the training set, X_test, Y _test is the test set, pt is the prediction of the t-th sample, a is the weight of the t-th sample, and Y is the final prediction obtained by weighting the prediction of each learner.
Geographically weighted regression (GWR)
In spatial research, particularly at large scales, conventional multivariate linear regression models may fail to accurately capture the variations and heterogeneity in geographic space. The GWR model, on the other hand, is adept at exploring the relationships between variables in spatial data, effectively reflecting the correlations between dependent and independent variables across geographic space37. Moreover, the GWR model estimates parameters through independent linear regressions within each local area, utilizing local parameter estimates instead of global ones. The kernel type is “bisquare”and the number of neighbors is determined by the optimal bandwidth (gwr_bw).The formula for the GWR model is as follows:
$$:{varvec{y}}_{varvec{i}}={varvec{beta:}}_{0}({varvec{u}}_{varvec{i}},{varvec{v}}_{varvec{i}})+{sum:}_{varvec{k}}{varvec{beta:}}_{varvec{k}}({varvec{u}}_{varvec{i}},{varvec{v}}_{varvec{i}}){varvec{x}}_{varvec{k},varvec{i}}+{varvec{epsilon:}}_{varvec{i}}$$
(19)
where (:{y}_{i}) is the dependent variable for sample i. (:({u}_{i},{v}_{i})) are the coordinates of sample i. (:{beta:}_{0}({u}_{i},{v}_{i}))is the intercept term for sample i. (:{beta:}_{k}({u}_{i},{v}_{i})) is the k-th regression parameter for sample i. (:{x}_{k,i}) is the k-th independent variable for sample i. (:{epsilon:}_{i}) represents the random error.
Model evaluation
To evaluate the model’s estimation performance, the dataset is divided into 80% training set and 20% testing set. Two evaluation metrics are used: the coefficient of determination (R2) and the root mean square error (RMSE). Additionally, to assess the model’s generalization ability, we use the 5-fold cross-validation method. Specifically, the data is randomly divided into 5 subsets, and the model is trained and tested on each subset. The final model performance is the average of the results from each fold. The calculation formulas are as follows:
$$:{R}^{2}=1-frac{{sum:}_{i=1}^{n}{left({x}_{i}-{y}_{i}right)}^{2}}{{sum:}_{i=1}^{n}{left({x}_{i}-stackrel{-}{{y}_{i}}right)}^{2}}$$
(20)
$$:RMSE=:sqrt{frac{1}{m}{sum:}_{i=1}^{n}{left({x}_{i}-{y}_{i}right)}^{2}}$$
(21)
Where (:{x}_{i}) is the true value of RSEI, (:{y}_{i}) is the estimated value of RSEI, (:{stackrel{-}{y}}_{i}) is the mean of the true RSEI values, and n is the number of samples in the test set.