Highlights What are the main findings? A physics-inspired Class Center Residual Attention Network (CCRANet) is proposed, which effectively extracts invariant thin cloud features and reduces surface interference. CCRANet achieves an mIoU of 85.93% on the Landsat-8 dataset, improving thin cloud IoU by 22.58 percentage points over DeeplabV3+ in snow/ice scenarios. What are the implications of the main findings? The class center mechanism provides a new way to decouple cloud signals from surface interference, enhancing the interpretability of deep learning models in remote sensing. The method improves cloud detection accuracy in complex scenes, benefiting downstream tasks such as agricultural monitoring and disaster response. Abstract High-precision cloud detection is essential for remote sensing applications such as agricultural monitoring and disaster response. However, thin clouds severely limit detection accuracy. The difficulty lies in their semi-transparent nature, which causes their reflected signals to couple with the reflectance of various underlying surfaces. This coupling leads to inconsistent cloud signatures and significant intra-class variability. To address this, we propose a Class Center Residual Attention Network (CCRANet), a radiative transfer theory-inspired framework that employs a class center approach to extract the intrinsic reflective characteristics of thin clouds. Specifically, the core of the network is the Class Center Attention (CCA) module, which extracts invariant intrinsic features of thin clouds, supplemented by the Class Center Residual (CCR) module to eliminate surface-induced interference. Experiments on three public datasets (Landsat-8, CSWV, and CloudS26) show that CCRANet achieves a mean Intersection over Union (mIoU) of 85.93% on the Landsat-8 dataset, outperforming the classic DeeplabV3+ baseline by 10.23 percentage points. 
In particular, it achieves a 22.58 percentage point improvement in thin cloud IoU over DeeplabV3+ in snow/ice scenarios, significantly reducing false positive detections caused by surface spectral similarity. 1. Introduction More than two-thirds of the Earth’s surface is covered by clouds at any given time. Cloud interference leads to severe attenuation of surface reflectance information in optical remote sensing images, which limits the effectiveness of remote sensing products and causes issues including redundant data storage and excessive computational resource consumption. Accurate cloud detection is therefore an essential preprocessing step to improve the applicability of remote sensing image products. Thin clouds are particularly challenging due to their semi-transparent nature, which causes spectral mixing between the cloud layer and the underlying surface, leading to significant intra-class variability in cloud features. Current thin cloud detection methods can be broadly categorized into two groups. Threshold-based and conventional machine learning methods set local detection thresholds according to the radiative characteristics of different underlying surfaces. While intuitively straightforward, these methods often fail in complex scenarios with heterogeneous surface features [1]. They also share a common deficiency: detection performance degrades significantly when surface reflectance properties change across scenes [2]. However, the aforementioned models still exhibit limited performance on thin cloud detection tasks. Thin clouds have low optical thickness, and their observed spectral signatures vary drastically across different underlying surfaces. Most deep learning-based models focus on local spatial information and do not explicitly account for the physical behavior of thin clouds, leading to reduced robustness across complex environments [7].
To address the above limitations, we propose a framework inspired by the decomposability of cloud reflectance in radiative transfer, based on a class center residual attention mechanism. Our method targets two core challenges: intra-class feature dispersion caused by underlying surface coupling, and difficulties in extracting weak thin cloud features. Ablation studies show that the Class Center Attention (CCA) module is the primary driver of performance improvement, while the Class Center Residual (CCR) module provides supplementary refinement by reducing surface interference. This article is an extended version of a conference paper titled "Thin Cloud Detection Method in Thin Cloud Scenarios Based on Class Center Residual Attention" presented at the 5th International Conference on Electronic Information Engineering and Data Processing (EIEDP’26) held in Chengdu, China, from 23–25 January 2026 [ 11]. 2. Related Work Thin cloud detection in remote sensing imagery remains challenging due to the fundamental ambiguity introduced by underlying surfaces. Unlike thick clouds that completely obscure the ground, thin clouds are semi-transparent—their observed spectral signatures are a mixture of the cloud’s own reflectance and that of the surface beneath. This means that the same cloud can appear dramatically different over land, ocean, snow, or vegetation, as illustrated in Figure 1. Addressing this intra-class variability has driven much of the recent work in the field. 2.1. Deep Learning-Based Methods with Attention Mechanisms Most existing deep learning-based cloud detection methods adopt UNet-type architectures, which have been proven effective for dense prediction tasks including cloud segmentation. For example, Zhang et al. [ 12] proposed GCDB-UNet, which incorporated global-context dense blocks and non-local self-attention into a standard UNet. 
This architecture introduces long-range pixel dependencies to capture the spatially dispersed characteristics of thin clouds. However, GCDB-UNet suffers from low computational efficiency when processing large-scale satellite images due to the quadratic complexity of self-attention operations. Subsequent design attempts aim to maintain context modeling ability while improving computational efficiency. CDU-Net adopts a two-path architecture: a local path for fast multi-resolution feature extraction using atrous convolutions, and a global path using a lightweight transformer encoder to capture high-level semantics [ 13]. The two paths are merged via a channel-based attention gate and deformable convolutions to improve boundary localization accuracy. Although this network achieves acceptable performance on simple scenes, it still produces high false alarm rates on complex and diverse underlying surfaces. Another line of research explores multi-scale feature fusion to handle complex surface variations. Thin clouds exhibit effects across multiple scales, so fusing features at different receptive fields can improve detection robustness. EDFF-Unet improved cloud and shadow detection performance by fusing multi-scale edge and semantic features [ 14]. The MSCFF network uses parallel multi-scale convolutional branches to learn cloud features of different sizes, combining results via a learnable weighted channel attention strategy. However, these models still struggle with spectral confusion between thin clouds and bright surfaces such as snow, as attention mechanisms alone cannot fully decouple surface-induced feature variations. Some works attempt to incorporate additional handcrafted features to improve performance on challenging scenes. Shao’s MFCNN combines GLCM texture features and semantic features to distinguish between clouds and bright surfaces [ 8]. 
However, texture features are less discriminative for thin clouds over high-reflectance surfaces: thin clouds over snow often exhibit similar texture patterns to snow itself, so texture enhancement alone cannot solve the detection problem in snow-covered scenarios. Other works explore spatial consistency constraints to improve detection performance. CDNet combines CNN and CRF into an end-to-end model, using CRF to impose spatial consistency smoothing, which improves detection performance on Sentinel-2 images [ 15]. CSDNet jointly extracts spectral information, shape features, and texture details using two branches to detect clouds and shadows simultaneously [ 16]. However, these methods still exhibit degraded performance on coastal areas where shadows over water produce similar spectral signatures to thin clouds. 2.2. Physics-Informed Thin Cloud Detection Methods Another line of research explicitly introduces radiative transfer priors into deep learning models to improve generalization across complex surfaces. For example, WDCD [ 17] embeds radiative transfer constraints into the loss function to generate pseudo labels for weak supervision, reducing the dependence on manual annotations. TANet [ 7] designs a radiation-aware attention module to model the spectral mixing characteristics of thin clouds, achieving notable performance improvement on bright surfaces such as snow. These methods demonstrate that integrating physical knowledge can effectively reduce the data demand of models and improve the interpretability of detection results, which is also the core motivation of our work. Due to the scarcity of large-scale high-quality thin cloud annotation datasets, researchers are exploring weakly supervised methods that require fewer manual annotations. Ru et al. [ 18] proposed a scene synthesis method that uses a multi-scale Gabor filter bank to extract directional texture features, performs non-linear enhancement, and extracts high-frequency information via NSCT. 
A synthetic thin cloud mask generated by a refined region-growing algorithm is superimposed onto clean images, with additional Poisson noise introduced to improve realism. While this method reduces manual annotation workload, synthetic thin clouds still differ from natural thin clouds, and small deviations in radiative characteristics can lead to degraded generalization performance. Chen et al. [ 17] proposed a weakly-supervised deep cloud detection (WDCD) model that uses radiative transfer rules to generate pseudo labels instead of full manual annotations. An adaptive thresholding algorithm adjusts thresholds according to cloud coverage, and a deformable boundary refinement module learns the gradient field of thin clouds to improve segmentation accuracy. A harmonic loss function suppresses high-frequency noise in pseudo labels. WDCD cleverly compensates for the lack of annotated data, but it requires calibrated radiometric information for input images, limiting its applicability to uncalibrated remote sensing products. Liu et al. [ 19] proposed a contrastive learning diffusion framework (CLDiff) that uses a diffusion model for thin cloud detection, with a multi-scale feature rectification module and a diffusion decoder to iteratively refine the probability map. The review of existing methods reveals a consistent pattern: the persistent challenge across all methods is the high intra-class variability of thin clouds. This suggests that instead of adding more modules or scales, a more principled approach to decouple cloud signals from surface interference is needed. This points toward integrating the physical principles of radiative transfer more directly into the feature learning process. 2.3. Challenges and Research Gaps Recent years have seen progress in both synthetic data generation and weakly supervised learning for thin cloud detection. 
Yet when these methods are deployed on real-world imagery, performance degrades as soon as the underlying surface deviates from the training distribution. Physically based approaches like WDCD build on radiative transfer relationships derived under idealized assumptions, which work reasonably well on dark and uniform surfaces such as open water, but break down over bright, heterogeneous surfaces such as snow or coastal zones. This points to a deeper issue in existing literature: deep learning models for cloud detection have borrowed heavily from generic semantic segmentation architectures rather than being tailored to the physical properties of radiative transfer in remote sensing. For thin clouds, their appearance is a property of the cloud-surface system, not just the cloud itself. A segmentation network without explicit physical guidance has no way of knowing that the same cloud should appear different over different surfaces, so it simply attempts to memorize all variations seen during training. Errors in thin cloud detection propagate through the entire remote sensing processing chain. Therefore, what is needed is a model that explicitly incorporates the decomposable nature of cloud reflectance, rather than forcing the network to rediscover these properties from scratch. This paper introduces a framework inspired by radiative transfer, embedding the idea of feature decomposition directly into the deep learning architecture. 3. Materials and Methods 3.1. Thin Cloud Detection Strategy Thin cloud detection is fundamentally challenging due to the semi-transparent nature of clouds, which causes spectral coupling between clouds and underlying surfaces. The optical reflectance of a thin cloud can be conceptually decomposed into two independent components: the intrinsic reflectance of the cloud layer ($R_{\text{cloud}}$) and the transmitted reflectance from the underlying surface ($T \cdot A_{\text{surface}}$).
This relationship is expressed as:

$$F = R_{\text{cloud}} + T \cdot A_{\text{surface}} \tag{1}$$

where $T$ is the transmittance of the cloud layer. When cloud optical thickness $\tau$ is below 3, transmittance $T$ is typically between 0.6 and 0.8. Since underlying surface reflectance varies drastically (from ∼0.05 for dark water to ∼0.9 for bright snow), the observed thin cloud features exhibit large intra-class variability. To address this challenge, we design the network using two complementary principles: (1) Isolate the intrinsic reflectance characteristics of thin clouds that are independent of underlying surfaces; (2) Use an attention mechanism to emphasize regions with high radiation contrast. Specifically, we introduce two core modules: The Class Center Attention (CCA) module learns a stable representation of the intrinsic thin cloud features $R_{\text{cloud}}$, dynamically updated during training to capture the diversity of real cloud properties. For any input feature, the module calculates a spatial attention map by measuring the similarity between each pixel’s feature and the learned class center. The Class Center Residual (CCR) module calculates the residual between the feature and the learned class center. The class center encodes the invariant properties of clouds, while the residual captures the surface-induced interference $T \cdot A_{\text{surface}}$. Removing this residual helps decouple surface effects from cloud features. 3.2. Network Architecture Design To realize the above design principles, we propose CCRANet (Class Center Residual Attention Network), which follows an encoder-decoder structure built on a ResNet-50 backbone. The key differences from standard segmentation networks are two additional components inserted between the encoder and decoder: the class center residual decoupling module and the feature fusion module. The overall architecture is illustrated in Figure 2. 3.2.1.
Deep Feature Extraction High-resolution remote sensing images are first normalized and fed into the pre-trained ResNet-50 backbone to extract high-level semantic features. The ResNet-50 backbone is pre-trained on the ImageNet dataset, and we change the number of input channels of the first convolutional layer from 3 to 4 to accommodate the four visible/near-infrared bands of Landsat-8 images. At the end of the backbone, we replace standard convolutions with depth-wise separable convolutions, which decouple spatial and channel-wise operations. This design choice is more suitable for preserving the fine semi-transparent texture of thin clouds while reducing computational cost. Thin cloud detection requires explicit modeling of the physical radiative transfer process: solar radiation passes through the cloud layer, interacts with the underlying surface, and is reflected back through the cloud layer to the satellite sensor. The observed signal is therefore a combination of cloud radiance and surface reflectance modulated by the cloud layer. Conventional threshold-based methods fail to resolve this coupling effect. To address this, we introduce the Class Center Residual Attention (CCRA) mechanism, which extracts invariant thin cloud features by jointly modeling physical intuitions and deep feature representations. 3.2.2. Class Center Attention (CCA) Module The class center is defined as the prototype of the intrinsic cloud reflectance features in the feature space; that is, the learned class center is an abstract representation of $R_{\text{cloud}}$. Ideally, given the deepest feature map $F \in \mathbb{R}^{H \times W \times C}$, the class center for category $i$ can be written as:

$$\mathrm{CC}_i = \frac{\sum_{j=0}^{HW} \mathbb{I}[y_j = i] \cdot F_j}{\sum_{j=0}^{HW} \mathbb{I}[y_j = i]} \tag{2}$$

where $i \in \{0, 1\}$ denotes the two categories of cloud and non-cloud, $y_j$ is the ground-truth label of pixel $j$, and $\mathbb{I}[\cdot]$ is the indicator function. This formulation forces the class center to converge to a stable pattern corresponding to $R_{\text{cloud}}$ by averaging in feature space.
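The label-based averaging of Equation (2) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `class_centers`, the flattened list-of-vectors layout, and the zero-center fallback for an empty class are assumptions made only for illustration.

```python
def class_centers(features, labels, num_classes=2):
    """Eq. (2): mean feature vector of the pixels carrying each label.

    features: list of H*W per-pixel feature vectors, each of length C
    labels:   list of H*W integer class ids (hypothetical encoding)
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, y in zip(features, labels):
        counts[y] += 1                      # denominator: sum of I[y_j = i]
        for c, value in enumerate(vec):
            sums[y][c] += value             # numerator: sum of I[y_j = i] * F_j
    # Assumption: a class with no pixels keeps a zero center
    # instead of triggering a division by zero.
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(num_classes)
    ]


# Four pixels with two feature channels each.
feats = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
labs = [1, 1, 0, 0]
centers = class_centers(feats, labs)  # [[0.0, 5.0], [2.0, 1.0]]
```

Each center is simply the centroid of its class in feature space, which is why pixels of the same class that differ only by surface-induced offsets are pulled toward a common prototype.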
Since ground truth is unavailable during testing, we use coarse segmentation outputs to estimate pixel-wise class probabilities. The coarse segmentation probability matrix $P^{\text{coarse}}$ is generated by a 1 × 1 convolutional layer attached to the final output of the ResNet-50 encoder. The class center is computed via matrix multiplication (Figure 3) as:

$$\mathrm{CC}_i = \frac{\sum_{j=0}^{HW} P^{\text{coarse}}_{i,j} \cdot F'_j}{\sum_{j=0}^{HW} P^{\text{coarse}}_{i,j}} \tag{3}$$

Here, $P^{\text{coarse}}_{i,j}$ denotes the probability that pixel $j$ belongs to category $i$ in the coarse segmentation. To avoid coarse segmentation noise leading to severe class center fluctuations, we employ a momentum update mechanism to maintain the global class center $\mathrm{CC}_{\text{global}}$:

$$\mathrm{CC}_{\text{global}}^{(t)} = m \cdot \mathrm{CC}_{\text{global}}^{(t-1)} + (1 - m) \cdot \mathrm{CC}_{\text{batch}} \tag{4}$$

where the momentum coefficient $m = 0.999$. To further enhance stability, a soft weighting strategy is adopted: the contribution weight of pixel $j$ is $w_j = P^{\text{coarse}}_{i,j} \cdot \mathrm{sigmoid}(\lVert F'_j \rVert - \mu)$, where $\mu$ is the mean feature norm of the current batch. This gives higher weights to thick cloud pixels naturally, while thin cloud pixels can still participate in the update with lower weights. The Class Center Attention (CCA) module provides a dynamically optimized representation of pixel-level intrinsic cloud features. Its mathematical expression is:

$$\mathrm{CCA}_j = \sum_{i=0}^{N} P^{\text{coarse}}_{i,j} \cdot \mathrm{CC}_i \tag{5}$$

Here, $\mathrm{CCA}_j \in \mathbb{R}^{C}$ is the class center attention for pixel $j$. CCA is the primary driver of performance gain in our framework, as it directly extracts the invariant cloud features. Visualization results (Figure 4) indicate that class center attention effectively measures feature differences between categories. 3.2.3. Class Center Residual (CCR) Module The Class Center Residual (CCR) module provides supplementary refinement by modeling the dynamic offset caused by cloud-surface interactions.
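Before the residual is defined, it may help to see the class center computations of Equations (3)–(5) from the preceding subsection in code. The sketch below is a simplified illustration, not the authors' implementation: the function names and the plain-list tensor layout are hypothetical, and the soft weighting $w_j$ is omitted for brevity.

```python
def soft_class_centers(features, probs):
    """Eq. (3): probability-weighted mean feature per class.

    features: N pixels x C channels; probs: K classes x N pixels.
    """
    dim = len(features[0])
    centers = []
    for class_probs in probs:
        total = sum(class_probs)
        center = [0.0] * dim
        for p, vec in zip(class_probs, features):
            for c in range(dim):
                center[c] += p * vec[c]
        # Assumption: a class with zero total probability gets a zero center.
        centers.append([v / total if total > 0 else 0.0 for v in center])
    return centers


def momentum_update(global_centers, batch_centers, m=0.999):
    """Eq. (4): exponential moving average of the class centers."""
    return [
        [m * g + (1.0 - m) * b for g, b in zip(g_row, b_row)]
        for g_row, b_row in zip(global_centers, batch_centers)
    ]


def class_center_attention(probs, centers):
    """Eq. (5): per-pixel mixture of class centers, weighted by P_coarse."""
    num_pixels, dim = len(probs[0]), len(centers[0])
    return [
        [sum(probs[i][j] * centers[i][c] for i in range(len(centers)))
         for c in range(dim)]
        for j in range(num_pixels)
    ]


# Two pixels, two channels, two classes.
feats = [[2.0, 0.0], [0.0, 2.0]]
probs = [[0.75, 0.25], [0.25, 0.75]]
batch = soft_class_centers(feats, probs)       # [[1.5, 0.5], [0.5, 1.5]]
attention = class_center_attention(probs, batch)  # [[1.25, 0.75], [0.75, 1.25]]
```

With $m = 0.999$, a single batch shifts the global center by only 0.1% of the gap to the batch estimate, which is how the momentum update damps noise from the coarse segmentation.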
While the class center captures intrinsic cloud reflectance $R_{\text{cloud}}$, it does not account for the interference term $T \cdot A_{\text{surface}}$. We define the class center residual as the deviation between the input feature $F \in \mathbb{R}^{H \times W \times C}$ and the class center $\mathrm{CC} \in \mathbb{R}^{I \times C}$. To compute the residual, we first determine the category whose center is closest to the pixel feature. This is implemented efficiently by taking the minimum over the category dimension after broadcasting:

$$\mathrm{CCR} = \min_{i \in I} \left( F_{\text{expanded}} - \mathrm{CC}_{\text{expanded}} \right) \tag{6}$$

This operation is equivalent to selecting the category index $i^{*} = \arg\min_{i} \lVert F - \mathrm{CC}_i \rVert$ and calculating the residual with reference to this category center. The min operation is chosen because the residual between a pixel feature and its most likely class center (i.e., the category with minimal distance) provides the most meaningful measure of deviation; aggregating over all categories via mean or max would dilute this category-specific interference signal. This design aligns with the intuition that the residual should be evaluated relative to the correct underlying class. This residual term implicitly encodes the coupling effect of transmittance $T$ and underlying surface reflectance $A_{\text{surface}}$. The magnitude of the residual $\lVert \mathrm{CCR} \rVert$ correlates with the strength of surface interference (larger over snow, near-zero over water), consistent with the radiative transfer expectation. By introducing this residual term, the model explicitly separates cloud intrinsic properties from scenario-dependent interference. The combination of class center attention and class center residual forms the class center residual attention mechanism (Figure 5), defined as:

$$\mathrm{CCRA}(F) = \mathrm{CCA} + \mathrm{CCR} \tag{7}$$

3.2.4.
Feature Fusion Module To further enhance the utilization of complementary information captured by CCA and CCR features, we propose a feature fusion module (Figure 6) that integrates both feature types using local and global attention mechanisms and adaptive feature weighting. The design adopts local-global attention rather than recurrent or sequential fusion strategies, as the spatial nature of attention is better aligned with the dense prediction requirements of cloud detection. This design choice is validated through the overall performance improvement observed in the full CCRANet configuration (Table 5). The feature fusion module operates as follows: we first perform an element-wise addition operation on the CCA and CCR feature vectors to generate a combined representation $x_a$. We then apply local and global attention mechanisms to $x_a$, obtaining $x_l = \mathrm{local\_att}(x_a)$ and $x_g = \mathrm{global\_att}(x_a)$. These are fused via element-wise addition: $x_{lg} = x_l + x_g$. A sigmoid function generates a weight map $\mathrm{wei} = \sigma(x_{lg})$, and the fused output is calculated as:

$$x_o = \mathrm{CCR} \cdot \mathrm{wei} + \mathrm{CCA} \cdot (1 - \mathrm{wei}) \tag{8}$$

Each spatial location is assigned an independent weight, allowing the network to adaptively emphasize either the residual information or the attention-enhanced cloud features based on local context. 4. Experiments 4.1. Datasets Our primary dataset for this research is the Landsat-8 Cloud Cover Assessment (Landsat-8 CCA) dataset, which includes 96 representative scenes across 8 land cover types with pixel-level cloud annotations published by USGS. For all experiments, clouds are classified into two categories by optical thickness: thin (optical thickness < 3) and thick (optical thickness ≥ 3). All remote sensing images were cropped into 512 × 512 pixel patches using a sliding window with a stride of 512 pixels.
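Because the stride equals the patch size, the sliding window produces non-overlapping tiles. A minimal sketch of the tiling follows; the helper name `patch_origins` is hypothetical, and dropping incomplete patches at the right and bottom borders is an assumption, since the border handling is not specified here.

```python
def patch_origins(height, width, patch=512, stride=512):
    """Top-left (row, col) corners of every full patch inside the image.

    Assumption: patches that would extend past the image border are skipped.
    """
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


# A 1024 x 1536 scene yields a 2 x 3 grid of 512 x 512 patches.
origins = patch_origins(1024, 1536)
# [(0, 0), (0, 512), (0, 1024), (512, 0), (512, 512), (512, 1024)]
```

A patch is then obtained by slicing rows `r:r + 512` and columns `c:c + 512` from each band of the scene.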
We also conduct cross-dataset experiments on two public datasets: CloudS26 (containing 2600 cloud-snow coexistence scenes) and CSWV (containing 1800 winter snow cover scenes). 4.2. Implementation Details All experiments are conducted on a workstation with an NVIDIA RTX 3090 GPU (24 GB VRAM) and an AMD Ryzen 5600X CPU. We use the Adam optimizer with cross-entropy loss and an initial learning rate of $1 \times 10^{-4}$, which is decayed by a factor of 0.1 every 150 epochs. The batch size is set to 8, and the model is trained for 500 epochs. We use the blue, green, red, and near-infrared bands of all datasets as input, and modify the first convolution layer of ResNet-50 (pre-trained on ImageNet) from 3 input channels to 4 to adapt to multi-spectral input. The Landsat-8 dataset is randomly split into training/validation/test sets with a ratio of 8:1:1. All experiments are repeated 3 times with random seed 42, and we report the average value and standard deviation of all metrics. CCRANet has 41.02M parameters, 33.8G FLOPs for a 512 × 512 input, and an inference speed of 76.15 FPS on the RTX 3090, which meets the demand of large-scale remote sensing image processing. 4.3. Evaluation Metrics To go beyond qualitative comparison, we evaluate the model’s performance using a set of standard metrics widely adopted in the remote sensing and computer vision research communities. Quantitative assessment is based on the confusion matrix shown in Table 1. Based on the four fundamental values (TP, FN, FP, TN), we calculate Precision, Recall, F1 score, Overall Accuracy (OA), and mean Intersection over Union (mIoU). For binary cloud detection, mIoU is defined as the average of IoU for cloud and non-cloud categories:

$$\mathrm{mIoU} = \frac{\mathrm{IoU}_{C} + \mathrm{IoU}_{NC}}{2} \tag{9}$$

where $\mathrm{IoU}_{C}$ and $\mathrm{IoU}_{NC}$ represent the IoU of cloud and non-cloud categories, respectively. 4.4. Results 4.4.1. Training Process Analysis Figure 7 shows the training/validation loss and mIoU curves of CCRANet on Landsat-8.
Both training loss (red) and validation loss (green) decrease rapidly, stabilizing around 0.71 and 0.69 after approximately 100 epochs, indicating good convergence. Validation mIoU (blue) steadily increases to 85.93 ± 0.21%, confirming the effectiveness of the class center residual mechanism for thin cloud feature modeling. 4.4.2. Feature Visualization Analysis We employed t-SNE to visualize high-level semantic features before and after applying the class center residual mechanism (Figure 8 and Figure 9). Without the mechanism (Figure 8), cloud features (thin clouds in light green, thick clouds in yellow) heavily overlap with underlying surface features (dark purple), indicating conventional methods fail to isolate thin cloud properties from surface interference. After applying the class center residual mechanism (Figure 9), the separation between cloud features and underlying surface features is improved: thin cloud features form a more distinct cluster, and overlap with thick cloud features is reduced. In feature space, the class center corresponds to an abstract representation of the intrinsic cloud reflectance $R_{\text{cloud}}$, while the residual magnitude $\lVert \mathrm{CCR} \rVert$ correlates with the strength of surface interference (larger over snow, near-zero over water), consistent with the radiative transfer expectation. 4.4.3. Comparison with State-of-the-Art Methods We compared CCRANet with seven representative cloud detection methods and two recent state-of-the-art (SOTA) models on the Landsat-8 dataset, covering classic semantic segmentation models (DeeplabV3+, UNet3+, SegFormer) and cloud detection-specific models (MSCFF, CDNet, CSDNet, MFCNN, TANet [7]). Quantitative results are shown in Table 3. CCRANet achieves an mIoU of 85.93 ± 0.21%, outperforming the classic DeeplabV3+ baseline by 10.23 percentage points, and outperforming the 2025 cloud detection SOTA TANet by 2.76 percentage points.
For thin cloud detection specifically, CCRANet achieves an IoU of 60.34 ± 0.32%, which is 2.11 percentage points higher than TANet. The CCA module is the primary driver of this performance gain, while the CCR module provides supplementary refinement. We also evaluated the inference efficiency of all models to verify practical applicability. As shown in Table 2, CCRANet contains 41.02M parameters and 33.80G FLOPs, achieving an inference speed of 76.15 FPS. Compared with DeeplabV3+, our method reduces the parameter count by 25.0% and FLOPs by 59.5%, while improving mIoU by 10.23 percentage points. Although MFCNN is lightweight with only 14.78M parameters, its FLOPs are substantially higher (126.86G) and its mIoU is 4.06 percentage points lower than ours. MSCFF suffers from extremely high FLOPs (561.33G) and low inference speed (20.36 FPS), making it impractical for large-scale processing. Overall, CCRANet achieves a favorable balance between detection accuracy and computational efficiency. Discussion on Transformer-based methods: In addition to the CNN-based baselines, recent Transformer architectures (e.g., SegFormer) have shown strong performance in remote sensing segmentation. In our experiments, SegFormer achieves an mIoU of 82.45% (Table 3), which is 2.96 percentage points lower than our CCRANet. This indicates that the physics-inspired inductive bias of our method provides complementary benefits beyond the attention mechanism of Transformers. We hypothesize that a hybrid design, combining a Transformer backbone with CCRA modules, could be a promising direction for future work. Visualization results (Figure 10) show that CCRANet accurately identifies thin cloud regions with clear boundaries, and reduces false positives in bright surface scenarios such as snow and sand. 4.4.4.
Sub-Scenario Performance Evaluation To verify the robustness of the model across different land cover types, we divided the Landsat-8 test set into 8 sub-scenarios according to land cover types: Barren, Forest, Grass_Crops, Shrubland, Snow_Ice, Urban, Water, and Wetland. The results are shown in Table 4. Experimental results show that CCRANet achieves competitive performance across all land cover types, with the most notable improvement in scenarios with severe intra-class feature dispersion. Specifically, in the Snow_Ice scenario where spectral confusion between clouds and snow is the most serious, CCRANet achieves a thin cloud IoU of 71.88%, which is 22.58 percentage points higher than MFCNN and 29.56 percentage points higher than DeeplabV3+. It should be noted that the thick cloud IoU of CCRANet in this scenario drops to 61.14%, which represents a trade-off: the model prioritizes reducing false positives of snow misclassified as clouds at the cost of slightly lower thick cloud recall, an acceptable compromise for downstream applications where snow-cloud confusion is more detrimental. 4.4.5. Ablation Study To verify the contribution of each module in CCRANet, we conducted ablation experiments on the Landsat-8 dataset. The baseline model is DeeplabV3+ without the CCA, CCR, and FFM modules. The results are shown in Table 5. The results show that the CCA module, as the core of the model, contributes 8.08 percentage points of mIoU improvement, verifying that the class center mechanism can effectively extract invariant intrinsic features of thin clouds. Adding the CCR module further improves the mIoU by 0.84 percentage points, providing supplementary refinement. It should be noted that CCR cannot work independently without the class center provided by CCA. Finally, the FFM module brings an additional 1.31 percentage points improvement by adaptively fusing CCA and CCR features. 
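As a quick consistency check on these increments, the reported gains can be accumulated starting from the baseline mIoU implied by the headline figures (85.93% minus the 10.23 percentage point gain over DeeplabV3+). The intermediate values below are derived by simple addition from numbers quoted in the text rather than read from Table 5, so they may differ from the table by rounding:

$$
\begin{aligned}
\text{Baseline} &: 85.93 - 10.23 = 75.70 \\
+\,\text{CCA} &: 75.70 + 8.08 = 83.78 \\
+\,\text{CCR} &: 83.78 + 0.84 = 84.62 \\
+\,\text{FFM} &: 84.62 + 1.31 = 85.93
\end{aligned}
$$

The three increments sum to 10.23 percentage points, matching the overall gain over the baseline, with CCA accounting for roughly 79% of it.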
Visualization results of ablation experiments (Figure 11) show that adding CCA can restore most of the missing thin cloud regions, adding CCR can further clarify cloud boundaries, and adding FFM reduces false positives and false negatives. 4.5. Sensitivity to Coarse Segmentation Quality The class center estimation in CCA (Equation (3)) depends on the coarse segmentation outputs. To quantitatively evaluate this dependency, we degraded the coarse probability maps via morphological erosion/dilation (kernel sizes 3 and 5) and confidence thresholding (thresholds 0.3, 0.5, 0.7) on all eight land cover types of the Landsat-8 dataset. Table 6 reports the average mIoU over the eight scenarios under each degradation configuration, benchmarked to the overall Landsat-8 mIoU of 85.93%. As shown in Table 6, even under the most severe degradation (threshold = 0.7), the average mIoU drops by only 0.58 percentage points (from 85.93% to 85.35%). The Snow_Ice scenario exhibits the largest drop (1.46 percentage points, see Supplementary Material), which is expected due to the intrinsic difficulty of discriminating clouds from snow. These results demonstrate that CCRANet is highly robust to imperfections in the coarse segmentation. This robustness stems from two design choices: (1) the momentum update (Equation (4)) with $m = 0.999$, which prevents the class center from being dominated by noisy batch estimates; and (2) the soft weighting strategy $w_j = P^{\text{coarse}}_{i,j} \cdot \mathrm{sigmoid}(\lVert F'_j \rVert - \mu)$, which down-weights low-confidence pixels that are more likely to be erroneous. Importantly, the robustness does not imply that the class center mechanism is unnecessary. The ablation study (Table 5) shows that removing CCA (i.e., the Baseline) reduces mIoU by 8.08 percentage points, confirming that the class center guidance is essential.
The sensitivity analysis further shows that, given a reasonably accurate coarse segmentation (as produced by our trained coarse head), the model can tolerate mild to moderate errors without performance collapse. This combination of effectiveness and robustness makes CCRANet suitable for practical remote sensing applications.

Cross-Dataset Evaluation

To further assess the transferability of CCRANet across different datasets, we conducted experiments on two public datasets: CloudS26 and CSWV. The model was trained only on the Landsat-8 dataset and tested directly on the other two datasets without fine-tuning. The results are shown in Table 7. CCRANet achieves 85.05% mIoU on CloudS26 and 89.02% mIoU on CSWV, which is competitive with or better than most baselines without fine-tuning. Notably, some cloud-snow specialized models (e.g., [17]) achieve higher performance on CSWV (dominated by cloud-snow coexistence), while CCRANet shows advantages on Landsat-8, where thin cloud samples are more abundant. These results indicate promising transferability, but the claims are limited to the evaluated datasets (Landsat-8, CloudS26, CSWV) and cloud/snow scenarios. Generalization to other sensors (e.g., Sentinel-2) or land cover types (e.g., urban, forest) requires further validation.

5. Discussion

Because of the semi-transparent nature of thin clouds, the signals received by satellites are a mixture of cloud albedo and surface albedo, resulting in significant intra-class differences. CCRANet is inspired by the decomposability of apparent cloud reflectance, and extracts an invariant representation of thin clouds through the class center residual attention mechanism. The magnitude of the residual $\|\mathrm{CCR}\|$ correlates with the strength of surface interference (larger over snow, near zero over water), which is consistent with physical expectations and provides qualitative evidence for the interpretability of the learned features.
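The residual magnitude discussed above can be illustrated with a toy computation. The sketch assumes the residual is the element-wise difference between a pixel feature and the class center, which is one reading of the CCR description. The feature values and names are invented for illustration and do not come from the paper.

```python
import math


def residual_magnitude(feature, center):
    """L2 norm of (feature - center); ASSUMED form of the class center residual."""
    return math.sqrt(sum((f - c) ** 2 for f, c in zip(feature, center)))


center = [1.0, 1.0, 1.0]  # hypothetical cloud class center

# Hypothetical thin-cloud features observed over two surfaces.
over_water = [1.0, 1.1, 0.9]  # little surface contribution
over_snow = [2.5, 2.0, 3.0]   # strong surface contribution

r_water = residual_magnitude(over_water, center)
r_snow = residual_magnitude(over_snow, center)
print(f"water: {r_water:.3f}, snow: {r_snow:.3f}")
```

In this toy setting the snow pixel yields the larger residual, mirroring the qualitative trend described in the text. It says nothing about the magnitudes the trained network actually produces.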
We note that the class center calculation relies on coarse segmentation outputs; therefore, the quality of these initial predictions influences the stability of the learned class centers. In challenging scenarios where coarse segmentation produces high false positive rates (e.g., bright surfaces misclassified as clouds), the estimated class center may be biased, potentially affecting detection performance. Future work will incorporate more robust class center estimation strategies, such as uncertainty-aware weighting or iterative refinement, to mitigate this dependency. Failure case analysis. Figure 12 shows representative examples where CCRANet still struggles. The three rows from top to bottom correspond to: (a) extremely thin cloud over land, (b) thin cloud over water (first example), and (c) thin cloud over water (second example). In the extreme thin cloud case (top row), the semi-transparent cloud is nearly invisible in the RGB image, leading to severe under-detection (red circles) where the model predicts no cloud. Over water bodies (middle and bottom rows), the low reflectance contrast between thin clouds and the dark water surface causes both false negatives (missed thin clouds) and false positives (water misclassified as cloud). These failures highlight the model’s sensitivity to very weak signals and to the spectral similarity between thin clouds and certain surfaces. The results are consistent with the physical expectation that when the cloud transmittance is high (i.e., very thin clouds) or the surface reflectance is extremely low (water), the effective signal-to-noise ratio drops, making discrimination difficult. The proposed method still has other limitations. First, under extremely low-light conditions (such as dawn, dusk, and high-latitude winter), the spectral signal is weak, and detection accuracy decreases. 
Similarly, when thin clouds are mixed with aerosols, the additional scattering effect also degrades model performance, as the current model does not consider the influence of aerosols in the radiative transfer decomposition. In future work, we will introduce multi-temporal features and multi-spectral prior constraints to address these limitations, and explore unsupervised learning methods to reduce the demand for large-scale annotated datasets.

6. Conclusions

To address the intra-class feature dispersion problem of thin clouds caused by underlying surface coupling, this paper proposes a framework inspired by the decomposability of cloud reflectance, implemented as a Class Center Residual Attention Network (CCRANet) for high-precision thin cloud detection. The core idea of the method is to decompose the observed thin cloud features into invariant cloud intrinsic features and scene-dependent interference components. The Class Center Attention (CCA) module dynamically learns a stable prototype of cloud intrinsic features and is the primary driver of performance gain. The Class Center Residual (CCR) module provides supplementary refinement by calculating the deviation between input features and the class center. The Feature Fusion Module (FFM) adaptively combines the two types of features to further improve detection performance. Experiments on three public datasets show that CCRANet achieves an mIoU of 85.93% on the Landsat-8 dataset, outperforming DeeplabV3+ by 10.23 percentage points, and achieves consistent performance across eight land cover types, especially in snow/ice scenarios where thin cloud IoU is improved by 22.58 percentage points. Cross-dataset experiments also verify the model’s transferability to related scenes.

Important note: The current framework does not enforce hard physical constraints. The physical interpretation remains qualitative, as we do not directly optimize or validate a specific radiative transfer equation.
The residual magnitude $\|\mathrm{CCR}\|$ correlates with surface interference as expected from theory, but this is an emergent property, not a quantitatively verified physical quantity. Future work could integrate differentiable physical simulators for stronger guidance. This study demonstrates that incorporating a physics-inspired inductive bias into deep learning is an effective way to improve robustness and interpretability for remote sensing image processing, while the claims are limited to qualitative consistency rather than quantitative physical validation.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs18111840/s1.

Author Contributions

Conceptualization, M.Z. and P.W.; methodology, M.Z.; software, M.Z.; validation, M.Z., P.W., S.Z. and J.H.; formal analysis, M.Z.; investigation, M.Z.; resources, P.W.; data curation, M.Z.; writing – original draft preparation, M.Z.; writing – review and editing, P.W.; visualization, M.Z.; supervision, P.W.; project administration, P.W.; funding acquisition, P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the corresponding author’s research grant.

Data Availability Statement

The datasets used in this study are publicly available: Landsat-8 CCA dataset (https://www.usgs.gov/landsat-missions/landsat-cloud-cover-assessment-validation-data, accessed on 1 June 2026), CloudS26 and CSWV (available upon request from the original providers). To ensure full reproducibility, we provide a supplementary package containing: (1) raw per-scene evaluation logs and confusion matrices for all results reported in Table 4; (2) exact training/validation/test splits; (3) evaluation scripts to reproduce the numbers; and (4) inference code and pretrained model weights (anonymized link for review). The package is available as Supplementary Material of this manuscript.
Conflicts of Interest

The authors declare no conflict of interest.

References

1. Irish, R.R.; Barker, J.L.; Goward, S.N.; Arvidson, T. Characterization of the Landsat-7 ETM+ Automated Cloud-Cover Assessment (ACCA) Algorithm. Photogramm. Eng. Remote Sens. 2006, 72, 1179–1188.
2. Jeppesen, J.H.; Jacobsen, R.H.; Inceoglu, F.; Toftegaard, T. A Cloud Detection Algorithm for Satellite Imagery Based on Deep Learning. Remote Sens. Environ. 2019, 229, 247–259.
3. Zhang, J.; Wang, H.; Wang, Y.; Zhou, Q.; Li, Y. Deep Network Based on Up and Down Blocks Using Wavelet Transform and Successive Multi-Scale Spatial Attention for Cloud Detection. Remote Sens. Environ. 2021, 261, 112483.
4. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
5. Li, Z.; Shen, H.; Cheng, Q.; Liu, Y.; You, S.; He, Z. Deep learning based cloud detection for medium and high resolution remote sensing images of different sensors. ISPRS J. Photogramm. Remote Sens. 2019, 150, 197–212.
6. Ding, L.; Xia, M.; Lin, H.; Hu, K. Multi-Level Attention Interactive Network For Cloud And Snow Detection Segmentation. Remote Sens. 2023, 16, 112.
7. Xu, X.; He, W.; Xia, Y.; Zhang, H.; Wu, Y.; Jiang, Z.; Hu, T. TANet: Thin Cloud Aware Network for Cloud Detection in Optical Remote Sensing Image. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15.
8. Shao, Z.; Pan, Y.; Diao, C.; Cai, J. Cloud detection in remote sensing images based on multiscale features-convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4062–4076.
9. Poussin, C.; Peduzzi, P.; Giuliani, G. Snow Observation from Space: An approach to improving snow cover detection using four decades of Landsat and Sentinel-2 imageries across Switzerland. Sci. Remote Sens. 2025, 11, 100182.
10. Zhang, J.; Li, Y.; Yang, X.; Jiang, R.; Zhang, L. RSAM-Seg: A SAM-Based Model with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation. Remote Sens. 2025, 17, 590.
11. Zhang, M.; He, J.; Zhou, S.; Wang, P. Thin cloud detection method in thin cloud scenarios based on class center residual attention. In Proceedings of the 5th International Conference on Electronic Information Engineering and Data Processing (EIEDP 2026), Chengdu, China, 23–25 January 2026.
12. Li, X.; Yang, X.; Li, X.; Lu, S.; Ye, Y.; Ban, Y. GCDB-UNet: A novel robust cloud detection approach for remote sensing images. Knowl.-Based Syst. 2022, 238, 107890.
13. Hu, K.; Zhang, D.; Xia, M. CDUNet: Cloud detection UNet for remote sensing imagery. Remote Sens. 2021, 13, 4533.
14. Xie, F.; Shi, M.; Shi, Z.; Yin, J.; Zhao, D. Multilevel Cloud Detection in Remote Sensing Images Based on Deep Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3631–3640.
15. Yang, J.; Guo, J.; Yue, H.; Liu, Z.; Hu, H.; Li, K. CDnet: CNN-based cloud detection for remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6195–6211.
16. Zhang, G.; Gao, X.; Yang, Y.; Wang, M.; Ran, S. Controllably deep supervision and multi-scale feature fusion network for cloud and snow detection based on medium-and high-resolution imagery dataset. Remote Sens. 2021, 13, 4805.
17. Chen, Y.; Weng, Q.; Tang, L.; Liu, Q.; Fan, R. An automatic cloud detection neural network for high-resolution remote sensing imagery with cloud–snow coexistence. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
18. Ru, Y.; Zhang, F.; Hu, W. Cloud Detection Network Based on Scenario Synthesis and Transformer in Remote Sensing Images. In IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium; IEEE: New York, NY, USA, 2023; pp. 6886–6889.
19. Li, Y.; Chen, W.; Zhang, Y.; Tao, C.; Xiao, R.; Tan, Y. Accurate cloud detection in high-resolution remote sensing imagery by weakly supervised deep learning. Remote Sens. Environ. 2020, 250, 112045.

Figure 1. Examples of thin cloud images over different underlying surfaces. The two boxes highlight the appearance of thin clouds over water (left box) and over bare land (right box).
Figure 2. Overall architecture of CCRANet for thin cloud detection.
Figure 3. Class center calculation process.
Figure 4. Feature visualization of class center, class center residual and class center residual attention.
Figure 5. Class center residual and class center residual attention mechanism.
Figure 6. Attention fusion module structure.
Figure 7. Training loss, validation loss, and mIoU curves of CCRANet on Landsat-8 dataset.
Figure 8. Feature visualization results without the class center residual mechanism.
Figure 9. Feature visualization results with the class center residual mechanism.
Figure 10. Visual comparison of cloud detection results on Landsat-8 dataset (red: thin cloud false negative, green: thin cloud false positive, yellow: thick cloud false negative, blue: thick cloud false positive).
Figure 11. Visual results of ablation experiments showing the effect of each module (red: thin cloud false negative, green: thin cloud false positive).
Figure 12. Typical failure cases of CCRANet. From top to bottom: extremely thin cloud over land, thin cloud over water (two examples). Left column: RGB image; middle column: ground truth cloud mask; right column: prediction by CCRANet. Red circles/arrows indicate missed thin clouds; green arrows indicate false positives.

Table 1. Binary Classification Confusion Matrix for Cloud Detection.

| True/Predicted Class | Cloud | Non-Cloud |
|---|---|---|
| Cloud | TP | FN |
| Non-Cloud | FP | TN |

Table 2. Comparison of model efficiency for different cloud detection methods (input size 512 × 512).

| Method | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|
| DeepLabV3+ | 54.71 | 83.42 | 67.57 |
| MFCNN | 14.78 | 126.86 | 77.99 |
| MSCFF | 51.90 | 561.33 | 20.36 |
| CDNet | 67.65 | 70.19 | 57.43 |
| CSDNet | 8.66 | 152.75 | 50.33 |
| CCRANet (Ours) | 41.02 | 33.80 | 76.15 |

All models were evaluated on an NVIDIA RTX 3090 GPU with an input size of 512 × 512 pixels. FPS values for MFCNN, MSCFF, CDNet, and CSDNet were measured in our environment using their publicly available implementations. FPS values for UNet3+, SegFormer, and TANet are cited from their original papers (tested on different hardware: NVIDIA 2080Ti for UNet3+ and TANet, A100 for SegFormer). Therefore, the comparison of absolute FPS numbers is indicative but not strictly head-to-head under identical conditions. We provide relative rankings but caution against exact numerical cross-comparison.

Table 3. Comparison of cloud detection performance on Landsat-8 dataset ($\mathrm{IoU}_1$: thin cloud, $\mathrm{IoU}_2$: thick cloud). * FPS values for UNet3+, SegFormer and TANet are referenced from official public reports, not tested in our experiments.

| Method | Precision | Recall | F1 | OA | $\mathrm{IoU}_1$ | $\mathrm{IoU}_2$ | mIoU (%) | FPS |
|---|---|---|---|---|---|---|---|---|
| DeeplabV3+ | 0.7705 | 0.7812 | 0.7745 | 0.9089 | 0.5063 | 0.8297 | 75.70 | 67.57 |
| UNet3+ | 0.7421 | 0.7328 | 0.7371 | 0.8953 | 0.4812 | 0.7914 | 69.84 | 52.18 * |
| SegFormer | 0.8325 | 0.8417 | 0.8367 | 0.9278 | 0.5712 | 0.8778 | 82.45 | 31.25 * |
| MSCFF | 0.8093 | 0.8207 | 0.8134 | 0.9215 | 0.5432 | 0.8621 | 80.06 | 20.36 |
| CDNet | 0.7934 | 0.8041 | 0.7972 | 0.9168 | 0.5276 | 0.8419 | 78.23 | 57.43 |
| CSDNet | 0.8247 | 0.8362 | 0.8289 | 0.9251 | 0.5578 | 0.8735 | 81.26 | 50.33 |
| MFCNN | 0.8291 | 0.8408 | 0.8333 | 0.9267 | 0.5623 | 0.8789 | 81.87 | 77.99 |
| TANet [7] | 0.8462 | 0.8573 | 0.8514 | 0.9324 | 0.5823 | 0.8811 | 83.17 | 28.74 * |
| CCRANet (Ours) | 0.8712 | 0.8834 | 0.8756 | 0.9412 | 0.6034 | 0.9231 | 85.93 | 76.15 |
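The metrics reported in Table 3 can be related to the confusion matrix entries of Table 1. The sketch below uses the standard single-class definitions of precision, recall, F1, overall accuracy, and IoU. The pixel counts are made up for illustration, the function name is hypothetical, and the paper's exact averaging over classes and scenes is not reproduced here.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from the four confusion matrix counts of Table 1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + fn + fp + tn)  # overall accuracy
    iou = tp / (tp + fp + fn)             # intersection over union for the cloud class
    return {"precision": precision, "recall": recall, "f1": f1, "oa": oa, "iou": iou}


# Hypothetical pixel counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=80, fn=20, fp=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A mean IoU is then typically obtained by averaging the per-class IoU values, although which classes enter the average in Table 3 is not specified in this section.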