Zum Inhalt springen

Estimating Global Instantaneous Near-Surface Air Temperature from Clear-Sky Landsat 8/9 Observations Using Ensemble Machine Learning

Prometheus Redaktion

Open AccessArticle Estimating Global Instantaneous Near-Surface Air Temperature from Clear-Sky Landsat 8/9 Observations Using Ensemble Machine Learning by Zhonghu Jiao Zhonghu Jiao SciProfiles Scilit Preprints.org Google Scholar 1,* and Xihan Mu Xihan Mu SciProfiles Scilit Preprints.org Google Scholar 2 1 State Key Laboratory of Earthquake Dynamics and Forecasting, Institute of Geology, China Earthquake Administration, Beijing 100029, China 2 State Key Laboratory of Remote Sensing and Digital Earth, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China * Author to whom correspondence should be addressed. Remote Sens. 2026, 18(12), 1885; https://doi.org/10.3390/rs18121885 (registering DOI) Submission received: 2 April 2026 / Revised: 19 May 2026 / Accepted: 3 June 2026 / Published: 8 June 2026 Highlights What are the main findings? A global ensemble machine-learning model based on clear-sky Landsat 8/9 observations estimated instantaneous near-surface air temperature (Ta) at 90 m resolution with an RMSE of about 3 K, near-zero bias, and R 2 of about 0.92. Land surface temperature was the most important predictor, and the Bayesian model averaging of LightGBM, XGBoost, and CatBoost improved performance and robustness relative to the individual models. What are the implications of the main findings? The approach provides a practical way to map fine-scale global air-temperature patterns that are not resolved by sparse station measurements or coarse-resolution products. The resulting high-resolution Ta fields can support applications such as urban heat monitoring, ecosystem and climate studies, and agricultural or other environmental analyses under clear-sky conditions. Abstract High-resolution estimation of global near-surface air temperature (Ta) is essential for investigating microclimates, ecosystem processes, and agricultural suitability. However, sparse in situ observations do not capture local heterogeneity, whereas existing datasets lack fine-scale detail because of their coarse spatial resolution. To address this limitation, we developed an ensemble machine-learning framework using Landsat 8/9 data. Predictions from LightGBM, XGBoost, and CatBoost were combined through Bayesian model averaging (BMA), which assigns probabilistic weights to individual models to improve robustness. The models were trained using a globally distributed spatiotemporal matchup dataset that paired HadISD in situ Ta observations with MODIS/VIIRS products to support subsequent Landsat-based application. Key inputs included land surface temperature (LST), vegetation indices, elevation, solar zenith angle, and spatiotemporal features. The BMA ensemble achieved strong validation performance, with an RMSE of ~3 K, near-zero bias, and an R 2 of 0.92. Feature-importance analysis identified LST as the dominant predictor, underscoring the primary role of surface thermal state in estimating Ta. The proposed method can generate robust global Ta fields at 90 m resolution, revealing fine-scale thermal patterns that have previously been difficult to resolve at the global scale. Unlike many regional models calibrated for single study area or dependent on dynamic external auxiliary fields, our Landsat-predominant application framework supports operational mapping of clear-sky and overpass-time Ta. Such detailed instantaneous data can advance climate research, improve assessments of ecological responses and climate impacts, and support applications such as urban heat island monitoring and precision agriculture. 1. Introduction Despite its importance, obtaining spatially continuous Ta remains challenging. Ta information is primarily obtained from meteorological observations, reanalysis products, and thermal infrared (TIR) remote sensing. In situ measurements from meteorological stations provide accurate point data with high temporal resolution, but they cannot capture spatial heterogeneity over large areas [ 1]. Reanalysis products such as ERA5-Land have a coarse spatial resolution of ~9 km, while AIRS-based products typically have spatial resolutions ranging from 50 km to 1°. Spatial interpolation, which is commonly applied to enhance spatial coverage, can introduce artifacts that depend on station density and the spatial variability of auxiliary parameters [ 4]. To achieve higher spatial resolution at the global scale, moderate-resolution (~1 km) remote sensing data are increasingly used for detailed regional environmental analyses and for examining links between land-surface thermal patterns and climate change. Beyond empirical approaches, surface-energy-balance-based algorithms offer a physically grounded framework by treating Ta as governed by Earth system energy dynamics, including radiation balance, soil, sensible and latent heat fluxes, and horizontal advection [ 17]. Despite their thermodynamic rigor, these methods require extensive parameterization, which is often unavailable at the resolution needed for high-resolution studies. Despite these methodological advances, spatiotemporal Ta patterns remain highly variable because of heterogeneous environmental controls on the land-atmosphere energy balance. Most remote-sensing-based Ta studies have focused on medium to low spatial resolutions. High-resolution (e.g., 90 m) Ta estimation therefore remains relatively underexplored. For example, Landsat-based regression, support vector machine, and random forest (RF) models have been used to estimate urban maximum air temperatures during heatwaves [ 18]. MODIS and Landsat data have supported 200 m Ta models using RF and extreme gradient boosting (XGBoost) to estimate daily minimum, mean, and maximum Ta in France [ 19]. Reference [ 20] applied TVX to Landsat 7 data in northern Germany to estimate instantaneous air temperature above vegetation and soil. Reference [ 9] used Landsat 8 and auxiliary data with multiple linear regression and RF to estimate daily Ta in Northwest China. These studies have largely focused on regional Ta estimation. However, regional models cannot be directly scaled up to the global domain because the LST-Ta relationship varies substantially across biomes, climate zones, terrain conditions, and surface types. Differences in vegetation structure, soil moisture availability, aerodynamic roughness, snow/ice cover, and urban morphology can alter land-atmosphere coupling and introduce region-specific biases. Therefore, a global model must explicitly learn relationships across diverse environmental regimes rather than relying on input parameters calibrated for a single region. Developing a generalized model for global Ta retrieval is therefore essential for producing an operational remote-sensing product for the scientific community. Against this background, this study aims to estimate instantaneous Ta at the standard 2 m screen height using Landsat 8/9 data, which provide operational fine-resolution multispectral and thermal bands. To this end, we developed a globally applicable clear-sky Ta retrieval framework at 90 m spatial resolution. A globally distributed spatiotemporal matchup dataset was first constructed by linking HadISD in situ Ta observations with MODIS/VIIRS-derived thermal, vegetation, topographic, solar-geometry, and geographic predictors. Based on this dataset, three complementary gradient-boosting ML models, namely LightGBM, XGBoost, and CatBoost, were systematically trained and compared to capture the nonlinear relationships between satellite-derived predictors and in situ Ta measurements. Bayesian model averaging (BMA) was further applied to integrate the individual models, thereby enhancing retrieval accuracy, improving robustness, and reducing model-specific uncertainty. The resulting model was rigorously evaluated using independent ground measurements. Compared with previous regional approaches, the proposed framework enables global monitoring of instantaneous 2 m Ta from Landsat 8/9 at 90 m resolution without relying on external dynamic auxiliary data, thereby helping to fill the current gap in fine-resolution global Ta products. 2. Data 2.1. Landsat Products Data were acquired from Landsat 8 (launched 11 February 2013) and Landsat 9 (launched 27 September 2021) satellites [ 22, 23]. Both satellites operate in a 705 km sun-synchronous orbit with a 16-day revisit cycle. Key instruments include the Operational Land Imager (OLI on Landsat 8; OLI-2 on Landsat 9), which captures nine spectral bands from the visible to the shortwave infrared, and the Thermal Infrared Sensor (TIRS on Landsat 8; TIRS-2 on Landsat 9), which measures thermal infrared radiance in two bands. The OLI and OLI-2 were manufactured by Ball Aerospace & Technologies Corporation in Boulder, Colorado, United States, and the TIRS and TIRS-2 were built by NASA Goddard Space Flight Center in Greenbelt, Maryland, United States. Landsat Collection 2 Level-2 LST products, generated from Level-1 inputs using a radiative-transfer-model-based approach that incorporates emissivity from ASTER GED, NDVI products, and NCEP atmospheric profiles derived from multiple sources [ 24], were used. The resulting LST product was resampled to a nominal 30 m resolution and is available for Landsat 4–9. However, the 30 m LST grid does not imply independent 30 m thermal information, because the Landsat 8/9 TIRS thermal bands have a coarser 100-m native thermal footprint and are resampled to the 30 m OLI grid in the delivered product. To avoid overinterpreting resampled thermal details, all Landsat predictors used for final mapping were aggregated within a 3 × 3 window of nominal 30 m pixels; therefore, the final Ta value was reported at a spatial resolution of 90 m. Landsat 8/9 NDVI was calculated using atmospherically corrected surface reflectance ( R 5 and R 4 ) from Band 5 (near-infrared, NIR) and Band 4 (red, R): N D V I = N I R − R N I R + R = R 5 − R 4 R 5 + R 4 (1) 2.2. LST Products To construct the model training dataset, LST products from MODIS aboard NASA’s Terra (MOD21A1D, MOD21A1N) and Aqua (MYD21A1D, MYD21A1N) satellites (Collection 6.1), and the Visible Infrared Imaging Radiometer Suite (VIIRS) aboard the NASA/NOAA Suomi National Polar-orbiting Partnership (Suomi-NPP) satellite (VNP21A1D, VNP21A1N, Version 1), were used. The MODIS and VIIRS instruments were built by Santa Barbara Remote Sensing in Goleta, CA, United States, and Raytheon Intelligence & Space in El Segundo, CA, United States, respectively. These products employ a physics-based Temperature/Emissivity Separation (TES) algorithm coupled with enhanced Water Vapor Scaling (WVS) atmospheric correction, which improves retrieval stability under warm and humid conditions [ 25]. Daily LST values were derived by averaging high-quality, cloud-free retrievals within each grid. These data were used for Ta model training through spatiotemporal matchups with ground observations. 2.3. Vegetation Index Products NDVI is a long-established remote-sensing index calculated from the red and NIR bands and is highly sensitive to vegetation conditions. In this study, NDVI data from MOD13A1 and MYD13A1 (Collection 6.1, Terra and Aqua) and VNP13A1 (Version 1, Suomi NPP) were used to provide global NDVI at 500 m spatial resolution every 16 days, selecting the best available pixel values within each 16-day period based on low cloud cover, low view angle, and maximum NDVI [ 26, 27]. NDVI data corresponding to ground observation locations were extracted for model training. 2.4. Digital Elevation Model (DEM) Data Surface elevation data were obtained from the Advanced Land Observing Satellite (ALOS) Global Digital Surface Model (AW3D30), produced by the Japan Aerospace Exploration Agency (JAXA) using stereo imagery from the Panchromatic Remote-sensing Instrument for Stereo Mapping (PRISM) aboard ALOS that was built by NEC TOSHIBA Space Systems, Ltd. in Yokohama, Japan. AW3D30 provides global coverage at 30 m spatial resolution and represents both natural and anthropogenic surface features. The reported vertical and horizontal accuracies are approximately 5 m and 3 m, respectively, although they vary with terrain and land-cover characteristics [ 28]. 2.5. In Situ Air Temperature Measurements Ta measurements were obtained from the Met Office Hadley Centre Integrated Surface Database (HadISD), version v3.4.0.2023f. HadISD provides globally distributed, quality-controlled, sub-daily (typically hourly) station observations derived from NOAA’s NCEI ISD dataset [ 29]. Data from global sites spanning 2000–2012 were used for model training, whereas data from 2013–2022 were reserved for independent validation. Air temperature measurements recorded at 2 m height, together with record time and station elevation, were extracted from this dataset. 3. Methods 3.1. Construction of the Spatiotemporal Matchup Training Dataset The selection of input predictor variables for estimating Ta prioritized parameters derivable from Landsat sensors whenever feasible. This approach ensured spatiotemporal consistency and minimized errors arising from spatial downscaling or temporal interpolation of auxiliary datasets from disparate sources (e.g., MODIS and ERA5), consistent with the operational objective of the study. Based on established physical relationships and previous studies summarized Table 1, we selected the following input features: LST, NDVI, surface elevation (Elev), solar zenith angle (SZA), day of year (DOY), latitude (Lat), and longitude (Lon). Together, these variables represent the primary drivers influencing Ta, including the surface energy budget (LST), vegetation status (NDVI), topographic effects (Elev), diurnal and seasonal solar cycles (SZA and DOY), and large-scale climate gradients not resolved by LST or NDVI (Lat and Lon) [ 30]. Compared with previous studies in Table 1, the present work focuses on global clear-sky Landsat 8/9 instantaneous Ta at 90 m resolution using a Landsat-predominant operational predictor set. To construct a training dataset with sufficient sample size and global coverage, we integrated HadISD Ta measurements (2000–2012) with concurrent LST and NDVI products from MODIS and VIIRS using a rigorous spatiotemporal matching procedure. MODIS/VIIRS products were used for training because they provide a long-term, globally consistent LST record with sufficient temporal sampling and overlap with HadISD observations. The subsequent application to Landsat data assumes that the first-order physical dependence of Ta on surface thermal state is transferable across sensors, while recognizing that the spatial support differs between the moderate-resolution training data and the fine-resolution Landsat application. Spatially, satellite-derived LST and NDVI values corresponding to the pixel containing each HadISD station’s geographic coordinates were extracted. For raster-to-raster harmonization during Landsat mapping, all continuous predictor layers were co-registered to the Landsat analysis grid. Elevation was resampled using bilinear interpolation, whereas the other predictors were calculated or derived directly from Landsat products. Station-based Ta observations were treated as point measurements and were not spatially interpolated among stations. To avoid spatial leakage and to clearly define the data partitioning strategy, the matched samples were divided by geographic clusters rather than by random samples. Specifically, HadISD sites were grouped into 100 spatial K-means clusters, and 5-fold spatial CV was performed at the cluster level. In each fold, approximately 80% of the spatial clusters were used for model training, and the remaining 20% were used for validation. The independent Landsat 8/9 evaluation data from 2013–2022 were not used for model fitting or hyperparameter tuning and served as the final test set. The data partition is summarized in Table 2. 3.2. Tree-Based ML Models In this study, model input is a structured station-pixel matchup vector rather than a raw image patch. Each sample consisted of scalar numerical predictors extracted or calculated at the satellite-station matchup location, including LST, NDVI, surface elevation, SZA, DOY, latitude, and longitude, with the corresponding 2 m in situ Ta as the target variable. Although these predictors were derived from satellite images, DEM data, and station records, the final learning problem was a heterogeneous tabular regression task. Tree-based ML methods based on gradient boosting are widely recognized for their state-of-the-art predictive performance on tabular datasets with heterogeneous features. They are particularly suitable for this study because the LST-Ta relationship is nonlinear and is modulated by terrain, solar illumination, seasonal cycle, vegetation condition, and geographic background. Tree-based ensembles can capture these nonlinear interactions without imposing a predefined functional form, are relatively robust to feature scaling and multicollinearity, and provide feature-importance diagnostics for physical interpretation. We used three well-established techniques—XGBoost, LightGBM, and CatBoost—which, although more complex than RF, provide superior handling of tabular data, enable interpretation through feature importance, and generally achieve higher predictive performance when properly tuned. XGBoost is an optimized, distributed gradient-boosting framework that is highly efficient, flexible, and scalable [ 33]. It enhances traditional gradient boosting by incorporating L1 and L2 regularization to mitigate overfitting and improve generalization, applying tree pruning to eliminate splits with minimal performance gains, and leveraging second-order gradient information to accelerate convergence. These innovations have established XGBoost as a benchmark for regression and classification tasks, and it often achieves state-of-the-art results in competitive settings. LightGBM, developed as an advancement over XGBoost, uses gradient-boosting decision trees (GBDT) with techniques such as Gradient-based One-Side Sampling (GOSS) and a leaf-wise tree-growth strategy [ 34]. GOSS prioritizes instances with larger gradients to enhance learning effectiveness, while the leaf-wise approach grows the leaves that contribute most to loss reduction. This strategy can produce deeper trees and higher accuracy, but it can also increase the risk of overfitting. In addition, histogram-based splitting accelerates the identification of optimal splits, making LightGBM highly efficient for large-scale datasets in parallel and distributed computing environments. CatBoost excels at handling categorical features without requiring extensive preprocessing such as one-hot encoding [ 35]. It converts categorical variables into numerical representations during training and uses ordered boosting and symmetric decision trees to reduce prediction shift and improve generalization, thereby accelerating training. CatBoost also introduces ordered boosting, which optimizes the sequence of tree additions and enhances both accuracy and efficiency. Built-in mechanisms such as regularization and early stopping further reduce overfitting and improve model generalization. Although the predictors used in this study are primarily numerical, CatBoost remains useful because its ordered boosting and regularization mechanisms can improve generalization in tabular regression. Overall, these gradient-boosting frameworks provide powerful and versatile tools for complex prediction tasks, and each contributes distinct innovations that make it well suited to the present Ta retrieval application. 3.3. Hyperparameter Tuning Unlike trainable model parameters, hyperparameters had to be defined before training. Effective hyperparameter tuning was critical for maximizing predictive performance and mitigating overfitting, as evaluated on validation datasets. Optimizing hyperparameters reduced manual intervention and enhanced model precision. In this study, we employed Bayesian optimization to address the challenge of high-dimensional hyperparameter tuning for gradient-boosting models. Bayesian optimization provides an efficient framework for optimizing complex black-box functions, particularly for ML hyperparameter calibration [ 36]. It constructs a probabilistic surrogate model, typically a Gaussian process, to approximate the objective function and quantify uncertainty, thereby guiding subsequent sampling points strategically. The optimization uses acquisition functions such as Expected Improvement to balance exploration and exploitation. Monte Carlo acquisition-function optimization and parallel candidate evaluation can accelerate this process relative to grid search, random search, or expert-guided tuning. For this optimization protocol, we initialized the search with 20 random points per parameter and specified a maximum of 200 iterations to identify optimal configurations. This iterative refinement, starting from predefined parameter ranges, produced a precise approximation of the objective function and enabled reliable identification of the global minimum within the evaluated sample space. We applied early stopping to prevent overfitting, with a patience parameter of 50. To further reduce overfitting and obtain a reliable, unbiased estimate of model performance in unseen geographic regions, we used a 5-fold spatial cross-validation (CV) strategy to tune hyperparameters and assess performance during training. Spatial CV is important because ground measurements exhibit spatial autocorrelation, which can artificially inflate performance estimates. Unlike random K-fold CV, spatial CV ensures that test folds are spatially separated from training folds, thereby reducing data leakage from nearby points. K-means clustering was used to partition HadISD sites into 100 spatially distinct clusters ( Figure 1). 3.4. Sensitivity Analysis of Model Inputs SHAP (SHapley Additive exPlanations) is a versatile, model-agnostic tool for interpreting ML predictions [ 37]. In this study, we used SHAP package version 0.48.0. Rooted in cooperative game theory and Shapley values, SHAP quantifies the contribution of each feature to model output. By revealing the impact of individual features on predictions, SHAP enhances transparency and interpretability, enabling insight into complex algorithms across domains such as healthcare, finance, and geoscience. This method supports model debugging and builds trust by highlighting feature importance and interaction effects within the decision-making process. The SHAP interaction index enables separate analysis of pairwise feature interactions for individual predictions, thereby uncovering relationships captured by tree-ensemble models. This capability is particularly valuable for nonlinear models such as the gradient-boosting frameworks used here, where the complexity of high-performing algorithms often obscures the rationale behind their outputs. 3.5. Bayesian Model Averaging (BMA) Model averaging generates a meta-model by computing a weighted average of individual models and thereby improves prediction accuracy. In this study, BMA was used to construct a meta-model for improved Ta estimation. BMA is a statistical framework that addresses model uncertainty in prediction and inference [ 38]. Rather than relying on a single “best” model, BMA combines predictions from multiple models weighted by their posterior probabilities, acknowledging that different models may provide similarly good fits to the data. By averaging over a set of plausible models, BMA produces robust predictions and parameter estimates and captures both within-model and between-model uncertainties, which is particularly valuable when model selection is ambiguous or when multiple models provide similar evidence. The fundamental principle of BMA is to assign each model a weight proportional to its posterior probability, reflecting its consistency with the observed data. The posterior probability integrates the model’s prior probability and its likelihood given the data. For a target variable (y), with observational data y t and models ( M 1 , M 2 , ⋯ M K ), the predictive probability density function under BMA is expressed as p y M 1 , M 2 , ⋯ M K = ∑ k = 1 K p k ( M k | y t ) · p k ( y | M k ) (2) where p k ( y | M k ) represents the predictive distribution of y from each individual model, and p k ( M k | y t ) is the posterior probability that the k-th model is optimal: p k M k y t = P ( y t | M k ) P ( M k ) ∑ l = 1 K P ( y t | M l ) P ( M l ) (3) For each model component, its probability density function is assumed to follow a Gaussian distribution: p k y M k ~ N ( μ k , σ k 2 ) (4) where μ k denotes the mean and σ k is the standard deviation. μ k is obtained via linear regression, and σ k is determined using the maximum likelihood method. Assuming forecast error independence across spatial (s) and temporal (t) coordinates, the log-likelihood function of the BMA model under the given observational constraints is L ( μ k , { σ k 2 } | y t ) = ∑ s , t l o g ∑ k = 1 K p k M k y t N ( y t ; μ k s , t , σ k 2 s , t ) (5) Equation (5) lacks an analytical solution; therefore, the Expectation–Maximization algorithm was used to estimate the posterior probabilities [ 38]. 3.6. Model Evaluation The retrieval accuracy of the proposed Landsat Ta models was evaluated using three statistical metrics: mean bias error (bias), root mean square error (RMSE), and the coefficient of determination (R 2), defined as follows: B i a s = 1 N ∑ i = 1 N X i , e s t − X i , t r u e (6) R M S E = 1 N ∑ i = 1 N X i , e s t − X i , t r u e 2 (7) R 2 = ∑ i = 1 N X i , t r u e − X t r u e ପ୍ତ X i , e s t − X e s t ପ୍ତ 2 ∑ i = 1 N X i , t r u e − X t r u e ପ୍ତ 2 ⋅ ∑ i = 1 N X i , e s t − X e s t ପ୍ତ 2 (8) where N is the total number of validation samples; X i , e s t is the retrieved Ta for the i-th sample; X i , t r u e is the corresponding ground measurement; and X t r u e ପ୍ତ and X e s t ପ୍ତ are the means of the measured and retrieved values, respectively. Spatial and temporal consistency between retrieved and ground data was essential for valid assessment. The satellite estimate assigned to each HadISD station was calculated from a 3 × 3 window of nominal 30 m Landsat pixels (90 m × 90 m) centered on the station coordinate. Only clear-sky pixels passing the Landsat quality mask were included, and the window mean was used as the station-scale satellite estimate. This averaging was adopted to reduce spatial representativeness error between point-based station observations and pixel-based retrievals, especially over complex terrain and heterogeneous land-cover mosaics. Landsat overpass time was used to align satellite retrievals with ground-based Ta measurements. Considering the typically hourly sampling interval of HadISD data, the ground Ta corresponding to the Landsat overpass time was obtained through temporal linear interpolation between the two nearest measurements before and after overpass: T ( t s ) = T ( t 1 ) + T ( t 2 ) − T ( t 1 ) t 2 − t 1 t s − t 1 (9) where t s is the satellite overpass time, and t 1 and t 2 are the nearest HadISD observation times before and after the overpass, respectively. This temporal interpolation reduces the mismatch between hourly station observations and instantaneous satellite retrievals. We acknowledge that this procedure assumes approximately linear temperature change within the bracketing interval. Although this assumption is reasonable for short temporal windows and has been widely used for temporal collocation in satellite LST/Ta validation, it may introduce uncertainty during periods of rapid mid-morning warming, frontal passages, or abrupt changes in cloud, radiation, and wind conditions. Such interpolation uncertainty was therefore considered part of the temporal representativeness error in the validation. 4. Results 4.1. Relative Importance of Input Features To interpret the ML models and assess the physical plausibility of the predictions, we applied SHAP to quantify the relative contribution of each input feature to the estimated Ta ( Figure 2). Across the three models, LST consistently emerged as the primary predictor and exhibited the highest mean absolute SHAP values ( Figure 2a–c). This predominance aligns with the well-established thermodynamic relationship between LST and Ta and is supported by previous remote sensing studies that identify LST as a key driver in Ta estimation, e.g., [ 6, 10]. Geographic coordinates (latitude and longitude) and temporal/observational variables (DOY and SZA) showed moderate importance, with minor ranking variations across models, likely reflecting algorithm-specific handling of feature interactions while maintaining comparable overall performance ( Table 3). Bee swarm plots of SHAP value distributions ( Figure 2b,d,f) further illustrate the directional effects of each feature. Higher LST values (red points) corresponded to positive SHAP values, indicating increased predicted Ta, which is consistent with the expected positive relationship. Latitude and SZA showed negative directional effects, consistent with latitudinal temperature gradients and reduced solar insolation at larger zenith angles. The effect of DOY was weaker and more variable, likely because hemispheric differences in seasonal cycles partially offset each other in this global dataset. Collectively, the SHAP analysis provides transparent insight into model behavior, confirming the physically plausible dominance of LST while highlighting the secondary and more nuanced roles of geospatial, temporal, and surface properties in estimating Ta from these inputs. 4.2. Performance During Model Training Optimized LightGBM, XGBoost, and CatBoost models were developed for Ta estimation following rigorous sequential hyperparameter tuning. The performance of each model in 5-fold spatial CV is summarized in Table 3. All models demonstrated strong generalization to the held-out spatial cross-validation folds. The models achieved an RMSE of ~3 K, near-zero bias, and R 2 > 0.9 when predicting entirely new geographic regions, with low variability across folds. This performance represents a high level of accuracy for Ta estimation and compares favorably with recent ML applications in this field [ 40, 41]. Furthermore, these estimation errors approach the intrinsic uncertainty limits (at least 1–2 K) associated with the primary LST input variable [ 31, 42]. These results suggest that the developed models effectively leveraged the information content of the remote sensing and ancillary data and therefore provide robust capability for high-resolution Ta mapping. 4.3. Landsat Ta Maps Building on the training results, we next examined whether the models produced physically coherent spatial patterns at the Landsat scale. Figure 3 presents spatial maps of Ta estimated over a mountainous region near Chengde, China, characterized by strong elevation gradients, forested slopes and ridges, valley bottoms, and mixed agricultural or built-up lowland areas ( Figure 3e,f), derived from Landsat 8 data, together with the native Landsat LST product for reference. This clear-sky scene was selected because the combination of complex terrain and heterogeneous land cover provides a useful test of whether the retrieval models preserve physically plausible fine-scale thermal gradients. All maps showed a general gradient from cooler elevated regions to warmer lowlands and urban zones. The three ML-based estimates ( Figure 3a–c) exhibited strong overall similarity in their spatial patterns, capturing fine-scale variations such as cooler high-elevation ridges and warmer valley lowlands with high fidelity. The direct Landsat LST product ( Figure 3d) showed a broader dynamic range, with more intense purple hues in high-elevation areas covered by snow ( Figure 3e), consistent with LST differing more strongly from Ta under clear-sky conditions because of surface radiative and land-cover effects. Together, these results indicate that the ML models effectively bridged the LST-Ta gap and produced more physically plausible Ta distributions. 4.4. Model Validation Against In Situ Observations To test whether this performance generalized beyond the development dataset, we conducted an independent validation using ground-based Ta measurements from the HadISD network that were temporally and spatially coincident with available Landsat 8 (March 2013–December 2022) and Landsat 9 (October 2021–December 2022) overpasses on a global scale. The validation results for the three models are shown in Figure 4. All models demonstrated strong performance in estimating Ta from both Landsat missions, with predicted values closely aligned with the 1:1 line across the full observed temperature range, especially within the densely sampled range of 240–320 K. Quantitative metrics indicated consistent performance across models and sensors, with RMSE values ranging from 3.52 to 3.64 K, R 2 values from 0.88 to 0.89, and low bias (absolute values ≤ 0.22 K), indicating minimal systematic error. For example, the CatBoost model exhibited a bias of 0.04 K for Landsat 9 data. Among the models, LightGBM showed a slight RMSE advantage (3.54 K for Landsat 8 and 3.52 K for Landsat 9). This global validation accuracy (RMSE ≈ 3.6 K) compares favorably with, or improves upon, previous large-scale air-temperature estimation studies [ 40], particularly given their global coverage and cross-platform transferability between Landsat 8 and Landsat 9 of the proposed models. The consistent performance across two platforms further demonstrates the robustness and transferability of these models across comparable satellite platforms. 4.5. Model Sensitivity to LST Errors Because LST emerged as the dominant predictor, we next quantified how uncertainty in LST propagated into Ta retrievals. To evaluate this sensitivity, Gaussian white noise with zero mean and prescribed standard deviations from 0.5 to 3.0 K was systematically added to the LST values to simulate realistic uncertainties in LST products. As shown in Figure 5, all models exhibited a pronounced linear relationship between LST uncertainty and the RMSE of retrieved Ta. LightGBM showed the greatest robustness to LST noise, followed closely by XGBoost and CatBoost. This robustness was quantified by the slopes of the fitted linear relationships, approximately 0.42 K/K for LightGBM, 0.44 K/K for XGBoost, and 0.49 K/K for CatBoost, indicating the rate at which Ta RMSE increased with LST uncertainty. Quantitatively, an LST RMSE of approximately 2.1 K [ 31] corresponded to a Ta RMSE of about 1 K across the three models, with LightGBM yielding the lowest value (~0.95 K) and CatBoost the highest (~1.1 K). In the context of TIR remote sensing, where LST serves as a proxy for estimating Ta through empirical or physically based relationships, these findings underscore the importance of improving LST accuracy. By comparison, Figure 4 shows that the overall Ta RMSE values against ground measurements ranged from 3.52 K to 3.64 K under nominal conditions. This gap between total error and the LST-induced contribution suggests that the remaining uncertainty (~2.5 K) arose from multiple sources inherent to TIR remote sensing and station-pixel validation. These sources include atmospheric water vapor and residual errors in atmospheric correction, residual cloud contamination, surface emissivity uncertainty, terrain-induced illumination and view-geometry effects, wind-speed-driven variation in surface-air coupling, soil moisture variability, and vegetation phenology. Additional uncertainties may arise from instrumental errors in ground-based observations, spatial representativeness errors, temporal mismatches between satellite overpass times and ground measurement periods, and model-related limitations. Consequently, the reported RMSE should be interpreted as a total validation error that integrates LST uncertainty with atmospheric, surface, terrain, meteorological, phenological, sampling, and model components, rather than as the effect of LST uncertainty alone. 4.6. Model Ensemble by BMA Given the complementary behavior of the three base models, we finally evaluated whether their combination could further improve retrieval performance. Table 4 summarizes the weights, biases, and standard deviations of the BMA ensemble, which combines predictions from LightGBM, XGBoost, and CatBoost. BMA assigns posterior weights in proportion to each model’s predictive performance on the validation data, while the biases and standard deviations quantify systematic offsets and uncertainty, respectively. Higher weights were typically assigned to the better-performing models. In this study, LightGBM received the highest weight for both Landsat 8 and Landsat 9, thereby enhancing robustness and reducing overfitting relative to any individual model. The BMA ensemble achieved a global validation RMSE of 3.03 K for Landsat 8 and 3.02 K for Landsat 9, with R 2 ≈ 0.92 and near-zero bias ( Figure 6). This result represents a modest improvement in consistency and uncertainty quantification relative to the individual models (RMSE ~3.5–3.6 K; R 2 ~0.89 in some subsets), particularly in data-sparse regions or under variable atmospheric conditions. The ensemble also mitigated model-specific sensitivities, including sensitivity to LST uncertainty, for which LightGBM showed the greatest robustness, with Ta RMSE increasing by ~0.42 K per 1 K increase in LST error. This level of accuracy is competitive for global, high-resolution, instantaneous Ta estimation, where heterogeneity in land cover, topography, and diurnal cycles poses substantial challenges. Regional studies using Landsat and ML often reported lower RMSE values (2–3 K) because they relied on localized calibration and denser station networks (e.g., [ 9] in Northwest China using RF; [ 18] for urban heatwaves). By contrast, global studies remain scarce and typically operate at coarser spatial resolution (e.g., MODIS-based regional daily mean Ta with RMSE ~3 K). For reference, recent global LST products have reported clear-sky instantaneous RMSE of ~2.6 K [ 43], whereas Ta estimation is inherently more variable because of air-surface decoupling. The proposed BMA approach therefore extends beyond TVX-based methods (regional RMSE ~3 K; [ 20]) and is consistent with the benefits of ensemble methods reported in other remote sensing applications, such as BMA for latent heat flux [ 44]. Although the approach remains limited by clear-sky conditions and mid-morning overpass bias, the ~3 K RMSE supports its potential operational utility for radiation balance modeling and heat-risk assessment. 4.7. Spatial Heterogeneity of Model Performance Building on this global validation, we further examined how model performance varied geographically. Evaluating the spatial distribution of model accuracy was important for understanding regional variations and limitations. Analysis of prediction bias relative to ground measurements revealed distinct large-scale regional patterns that were consistent across the three models ( Figure 7a–c). Positive biases (overestimation) predominated in North America, Asia, and Australia, whereas negative biases were dominant in Europe, South America, and Africa. These systematic deviations likely reflect unmodeled regional factors, such as persistent differences in atmospheric conditions and surface characteristics [ 6]. Spatial patterns in RMSE ( Figure 7d–f) showed lower errors concentrated in North America and Europe, regions that were heavily represented in the training dataset. Conversely, higher RMSE values were frequent in sparsely sampled regions, including South America, Africa, and parts of Asia. Notably, the CatBoost model exhibited comparatively lower RMSE values in some of these typically higher-error regions, potentially reflecting algorithmic advantages related to its architecture under specific environmental conditions [ 35]. The spatial distribution of R 2 ( Figure 7g–i) was remarkably consistent among the models, with higher values in the mid-latitudes but markedly lower values in tropical regions. This finding highlights the challenge of reliable satellite-based Ta estimation in the tropics, often because of confounding factors such as persistent cloud cover, high atmospheric water vapor, and lower relative temperature variability compared with instrument noise and algorithm uncertainty [ 30]. 4.8. Temporal Variations in Model Performance for Ta Estimation Beyond spatial heterogeneity, model performance also varied systematically over time. The evaluation revealed pronounced seasonal variability and hemispheric asymmetry ( Figure 8). In terms of bias ( Figure 8a), the monthly patterns showed a clear hemispheric contrast. Northern Hemisphere (NH) sites predominantly exhibited negative biases (underestimation of Ta), generally ranging from −0.5 K to near zero, with minor fluctuations that peaked during the transitional seasons (e.g., spring and autumn). In contrast, Southern Hemisphere (SH) sites showed persistent positive biases (overestimation of Ta), occasionally exceeding 1 K, particularly from February to June. This asymmetry may stem from hemispheric differences in climate regime, land-cover distribution (e.g., the greater continental extent of the NH), atmospheric moisture content, or imbalanced representation in the training data, because NH sites typically benefited from denser in situ observations. These biases highlight the limitations of global models when extrapolating across hemispheres and may be further amplified by differences in aerosol loading or vegetation phenology that influence TIR signal attenuation. The models’ explanatory power, evaluated by R 2 ( Figure 8c), remained strong in the NH (>0.75 throughout the year, peaking at ~0.9 in midsummer), indicating stable regression performance and reliable variance capture. In the SH, however, R 2 was lower and more variable (0.5–0.85), with notable declines during austral summer (December–February) and recovery during winter (June–August). This weaker performance is consistent with the documented challenges of satellite-based retrieval in tropical and subtropical regions, which occupy a substantial fraction of SH land areas, where persistent clouds limit TIR retrievals, reduce LST data availability, and weaken model generalization. Recent studies have shown that hot and humid conditions can further amplify these effects by altering atmospheric transmittance and increasing retrieval uncertainties, e.g., [ 39]. In addition, seasonal fluctuations in LST-Ta coupling may modulate model performance throughout the year: coupling is often stronger under cooler, drier, and more stable conditions, whereas under high insolation, LST may overestimate Ta. 4.9. Computational Cost and Inference Complexity The computational cost of the proposed framework is dominated by training the three gradient-boosting base models rather than by the BMA step. For hyperparameter optimization and model training, each base model was evaluated using 5-fold spatial CV. Experiments were conducted on Intel Core i7-7820HQ and 32 GB RAM, and the total wall-clock training times for LightGBM, XGBoost, and CatBoost were approximately 70 h, 46 h, and 32 h, respectively. For inference, the complexity of a tree-ensemble model is approximately O T t r e e · D per sample, where T t r e e is the number of trees and D is the average tree depth. Therefore, evaluating the three base models sequentially has complexity O ∑ k T t r e e , k D k , where k denotes LightGBM, XGBoost, and CatBoost. The BMA ensemble does not train additional decision trees and does not introduce new feature transformations. After the three base-model predictions are obtained, BMA only computes a weighted average, T ^ a , B M A = ∑ k w k T a , k + b k (10) where w k and b k are the posterior weight and bias terms of the k-th model, respectively. Consequently, the additional computational complexity of BMA is O(K) per sample, with K = 3 in this study, and its memory overhead is also O(K). In practice, this cost is negligible compared with traversing the three tree ensembles. If the three base models are evaluated in parallel, the wall-clock inference time is close to that of the slowest individual model, plus a small overhead for BMA. 5. Discussion Taken together, the results show that the ensemble ML framework, which combines gradient-boosting models (LightGBM, XGBoost, and CatBoost) through BMA, provides robust performance for estimating instantaneous Ta at 90 m spatial resolution from clear-sky Landsat data. Validation against independent HadISD observations yields an RMSE of ~3 K, an R 2 of 0.92, and near-zero bias, with consistent results across Landsat 8 and 9 ( Figure 6). This level of accuracy is competitive for global high-resolution instantaneous Ta products and represents a substantial advance over regionally constrained ML studies (e.g., [ 5, 11]), owing to the global training scope and ensemble strategy, which effectively capture complex nonlinear relationships between Ta and the predictors [ 41]. The BMA framework leverages the complementary strengths of the individual models and thereby improves generalization and reduces overfitting relative to any single model [ 47]. This improvement, however, should be interpreted in light of several limitations. Although an RMSE of ~3 K is competitive at the global scale, it remains higher than that reported in some highly optimized regional studies or in studies using sensors with higher revisit frequency, such as MODIS-based approaches [ 39]. This discrepancy partly arises from scale mismatches and inconsistencies in the LST products used for training and application. The scale mismatch is not simply a resampling issue, because MODIS/VIIRS pixels represent spatially mixed thermal states over approximately kilometer-scale areas, whereas Landsat resolves much finer surface contrasts. Consequently, the model may transfer the learned LST-Ta relationship reliably over homogeneous surfaces but may exhibit additional uncertainty in urban mosaics, irrigated croplands, mountains, coastlines, or fragmented vegetation, where the sub-pixel thermal distribution differs strongly between sensor measurements. Moreover, the nominally 30 m Landsat LST grid should be interpreted as a resampled product derived from coarser thermal observations, rather than as fully independent 30 m thermal measurements. We therefore regard scale-transfer uncertainty as a key limitation of the present framework and have highlighted it as a priority for future stratified validation and scale-aware modeling. The models were trained on MODIS and VIIRS LST products retrieved using the physics-based TES algorithm [ 25], but they were applied to Landsat-derived LST data generated using a radiative transfer model that incorporates ASTER GED emissivity and NCEP atmospheric profiles [ 24]. The two LST retrieval systems have different error structures. MODIS/VIIRS TES estimates LST and emissivity simultaneously from multispectral thermal information and applies water-vapor scaling to improve stability under warm and humid conditions, whereas the Landsat radiative-transfer-based product relies on external emissivity information, NDVI adjustment, and atmospheric profiles. As a result, the train-apply inconsistency may produce systematic offsets when atmospheric water vapor is high, when emissivity is poorly constrained over arid or sparsely vegetated surfaces, or when dense vegetation and humid boundary layers weaken the coupling between LST and screen-height Ta. Urban and topographically complex pixels may also be affected because sub-pixel thermal mixtures differ between MODIS/VIIRS and Landsat. These mechanisms are consistent with the larger errors observed in humid tropical/subtropical regions and in spatially heterogeneous areas, although a fully stratified attribution by humidity and vegetation regime remains a priority for future work. Another related source of uncertainty is spatial representativeness error between point-based HadISD station measurements and satellite pixel averages. This mismatch can inflate RMSE, especially in heterogeneous environments such as mixed land-cover mosaics or urban fringes, where a 90 m pixel contains sub-pixel variability that is not captured by a single ground observation [ 4, 31]. This mismatch is inherent to remote sensing validation and likely contributes to the larger global errors than those observed over homogeneous sites. Model generalization is further constrained by the uneven spatial distribution of training data ( Figure 7), with sparse coverage in polar regions, high-elevation terrain, and much of Africa because of limited HadISD station availability. Performance is therefore likely to degrade in these under-sampled regions, where environmental extremes, such as snow/ice effects on LST-Ta relationships or complex topography, combine with sparse observations to reduce model robustness [ 40]. Building on these observations, future work should prioritize more geographically balanced training datasets, for example by incorporating additional stations or synthetic data augmentation, together with robust gap-filling methods fo

www.mdpi.com

Zum Originalartikel