Abstract
Ultra-high-voltage (UHV) converter transformer equipment is critical for UHVDC transmission systems. This paper proposes a Cross-modal Transformer framework for fault diagnosis that fuses dissolved gas analysis (DGA) and infrared (IR) thermography data. The framework encodes DGA measurements into temporal tokens and processes IR images through a ResNet-18 backbone to generate spatial tokens. A Cross-modal Transformer module enables deep semantic interaction via bidirectional cross-attention, allowing DGA tokens to attend to relevant IR regions and vice versa. A modality-gating mechanism adaptively reweights the two modalities under measurement degradation, including partial and fully missing-modality scenarios. The novelty lies in adapting these components into a leakage-controlled DGA-IR diagnostic framework for UHV converter transformers, with explicit interaction between gas-evolution tokens and spatial thermal tokens. Evaluation is performed under a leakage-controlled grouped chronological split that isolates equipment units, converter stations, and fault episodes across train, validation, and test partitions. Labels are drawn exclusively from maintenance inspection and operational records, independent of the IEC 60599 ratio features seen by the model. Under this protocol, the proposed framework consistently improves accuracy and macro-F1 over encoder-matched simple-fusion baselines (Transformer-DGA + ResNet-18 with concatenation, late fusion, and gated averaging). Additional missing-modality, noise, and ablation experiments indicate that the gains come from bidirectional cross-attention and adaptive gating rather than from stronger unimodal encoders alone.
1. Introduction
DGA and IR thermography are widely used in transformer diagnostics: DGA infers internal insulation faults from gas concentrations, while IR images visualize surface temperature anomalies.
With the rise of artificial intelligence, diagnosis has shifted from rule-based interpretation to data-driven methods [5, 6], improving accuracy and adaptability.
Single-modal diagnosis remains limited. Classical DGA methods such as IEC ratio codes and Duval-based graphical interpretation are threshold-sensitive and struggle with ambiguous or mixed fault patterns [7, 8]. IR-based diagnosis is vulnerable to environment and viewpoint, and some fault pairs, such as low-temperature overheating (LTO) versus high-temperature overheating (HTO) and partial discharge (PD) versus high-energy discharge (HED), show overlapping signatures within a single modality. These issues motivate DGA-IR fusion to exploit complementary information. The scientific novelty of this work is the adaptation of cross-modal attention to this diagnostic setting: DGA gas-evolution tokens and IR spatial-temperature tokens are explicitly aligned so that chemical evidence of insulation stress can interact with surface thermal evidence before classification.
Beyond architecture, a second gap concerns evaluation rigor. Many multi-modal studies of transformer fault diagnosis report in-dataset accuracy under random sample-level splits and use IEC-style heuristics both for label assignment and as direct model inputs. When each DGA window spans multiple consecutive measurements from the same unit, such splits can leak highly similar observations into the test set; when labels are seeded by the same ratios that the model reads, measured performance may merely reflect rule recovery. We therefore design the present study so that (i) training, validation, and test partitions are grouped by equipment unit, converter station, and fault episode under a chronological ordering, and (ii) labels are drawn exclusively from physical maintenance and operational records that are independent of the IEC 60599 ratio features consumed by the model.
To address these gaps, we propose a Cross-modal Transformer framework for UHV converter transformer fault diagnosis. DGA measurements are encoded into temporal tokens and IR images into spatial tokens via a convolutional backbone; both are projected into a shared latent space and fused through self- and cross-attention for deep alignment between modalities. A modality-gating mechanism adaptively weights DGA and IR per sample under noisy, degraded, or partially missing measurements. The main contributions are as follows:
A fault-mechanism-aware Cross-modal Transformer framework for DGA-IR fusion, in which gas-evolution tokens and IR spatial tokens interact through bidirectional cross-attention and an explicit modality-gating mechanism that supports full, degraded, and missing-modality inputs.
A leakage-controlled grouped chronological evaluation protocol for UHV converter transformer fault diagnosis, together with encoder-matched simple-fusion baselines (Transformer-DGA + ResNet-18 with concatenation, late fusion, and gated averaging) that isolate the effect of cross-attention from stronger unimodal backbones.
A fully disclosed dataset covering 24,500 DGA-IR windows drawn from 459 monitoring episodes across 18 UHV converter transformer units at 6 converter stations over six years, with unit-, station-, and episode-level provenance statistics, and labels derived from maintenance inspection and operational records that are independent of the DGA ratios used as model inputs.
2. Related Work
3. Problem Formulation and Data Description
3.1. Transformer Equipment Fault Types
This study focuses on five operating states of UHV converter transformer equipment: the normal condition and four common fault types. The normal condition (N) represents healthy operation without significant abnormalities. Low-temperature overheating (LTO) occurs below 300 °C and is often caused by poor contacts or circulating currents.
High-temperature overheating (HTO) exceeds 700 °C and typically results from severe winding faults or core problems. Partial discharge (PD) involves small electrical discharges in insulation voids or interfaces. High-energy discharge (HED) includes arcing and sparking faults that can cause severe damage in a short time [7]. The temperature thresholds cited above refer to estimated internal fault temperature categories used by domain experts. In this study they are used only to describe fault phenomenology, not to generate labels: all class labels are assigned from independent maintenance inspection and operational records as described in Section 3.5. IR images reflect surface temperature distributions, in our dataset from −20 °C to 150 °C, and provide complementary spatial cues rather than direct hotspot temperature measurements. Accurate classification of these five states is important for maintenance decisions because different fault types require different repair strategies.
3.2. DGA Data Description
DGA measures gas concentrations in transformer oil. Six key gases are monitored in this study: hydrogen (H₂), methane (CH₄), ethane (C₂H₆), ethylene (C₂H₄), acetylene (C₂H₂), and carbon monoxide (CO). Each fault type produces characteristic gas patterns. Thermal faults generate mainly hydrocarbon gases, while electrical faults produce more hydrogen and acetylene. In raw monitoring records, a typical overheating episode appears as a gradual weekly rise in CH₄, C₂H₆, and C₂H₄ concentrations (ppm), whereas discharge-related episodes often show sharper increases in H₂ and C₂H₂. These trajectories are used as monitoring signals for the model rather than as hand-coded decision rules. The DGA data consist of gas concentration measurements in ppm, collected using automated gas chromatograph equipment with measurement intervals of approximately seven days.
For each equipment unit, measurements from three consecutive time points $(t-2, t-1, t)$ are used to form the temporal input, capturing fault evolution trends. Gas concentration values below the detection limit are set to zero. In addition to raw concentrations, four ratio features are computed from the current time step: CH₄/H₂, C₂H₂/C₂H₄, C₂H₆/CH₄, and C₂H₄/C₂H₆. These ratios are provided to the model only as input features; they are not used in the labeling pipeline (Section 3.5), so the model and the ground truth do not share a common rule-based source. The complete DGA feature vector has 22 dimensions: $6 \times 3 = 18$ from temporal concentrations and 4 from ratios. To avoid undefined ratios when denominators are near zero, each ratio is computed with stabilized division $(a + \varepsilon)/(b + \varepsilon)$ using a small constant $\varepsilon$ set to 1 ppm.
3.3. IR Image Data Description
IR thermal imaging provides the surface temperature distribution of transformers. IR cameras detect thermal radiation and convert it to temperature values. Hotter regions appear brighter in IR images. Faults often cause abnormal temperature patterns. For example, winding overheating produces hot spots on the tank surface, while bushing faults show temperature rise at the bushing connections. In an illustrative IR monitoring record, normal operation shows a relatively uniform tank and radiator temperature field after load normalization, while a developing bushing or winding-related thermal fault produces a compact hotspot near the bushing connection or tank wall. The images therefore provide spatial context for abnormal heating that may not be separable from gas ratios alone. IR images in this dataset are captured using FLIR thermal cameras with thermal sensitivity of 0.05 °C and accuracy of ±2%. The original image resolution is 512 × 512 pixels in 16-bit grayscale format, representing temperatures in the range of −20 °C to 150 °C.
Images are captured from standardized viewing angles covering the main tank, cooling radiators, bushings, and tap changer. For model input, preprocessing includes per-image outlier clipping, resizing to 224 × 224 pixels, and normalization using dataset statistics. CLAHE is used only for visualization in figures and is not applied to the model input. During training, data augmentation includes random horizontal flipping with 50% probability, rotation within ±15°, and temperature scaling in the range $[0.9, 1.1]$ to simulate varying load conditions.
3.4. Dataset Description and Provenance
Because each DGA sample is constructed from three consecutive time points $(t-2, t-1, t)$ from the same unit, adjacent windows share raw observations; a purely sample-level random split would therefore leak near-duplicate information across train and test. To prevent such leakage, the dataset is partitioned using a grouped chronological split:
Grouping. All windows belonging to the same (unit, fault episode) pair are assigned to exactly one of the train, validation, or test partitions, so that no episode is split across partitions and no unit contributes overlapping windows from the same episode to two partitions.
Chronological ordering. Within each unit, episodes that ended before 30 June 2023 are used for training, episodes ending between 1 July 2023 and 31 December 2023 for validation, and episodes starting on or after 1 January 2024 for testing. This enforces strict temporal separation between train and test, eliminating future-to-past leakage.
Station-disjoint held-out subset. For a stricter robustness check, one converter station (two units, 37 episodes, 2184 windows) is held out entirely from training and used only as a station-disjoint test set; its results are reported separately in Section 5.3.
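The grouping and chronological rules above can be sketched as a small assignment routine. This is a minimal illustration rather than the authors' pipeline: the function names, the tuple layout of an episode, the inclusive treatment of 30 June 2023, and the "unassigned" bucket for episodes that straddle a cut-off date are all assumptions made for the example.

```python
from datetime import date

# Cut-off dates quoted in Section 3.4.
TRAIN_END = date(2023, 6, 30)
VAL_START, VAL_END = date(2023, 7, 1), date(2023, 12, 31)
TEST_START = date(2024, 1, 1)


def assign_partition(station, start, end, held_out_station=None):
    """Pick one partition for a whole episode, so its windows never split."""
    if held_out_station is not None and station == held_out_station:
        return "station_disjoint_test"
    if end <= TRAIN_END:  # assumption: 30 June itself counts as training
        return "train"
    if VAL_START <= end <= VAL_END:
        return "val"
    if start >= TEST_START:
        return "test"
    # Assumption: episodes straddling a cut-off are set aside, not forced in.
    return "unassigned"


def split_windows(windows, episodes, held_out_station=None):
    """windows: list of (window_id, episode_id) pairs.
    episodes: dict episode_id -> (station, start_date, end_date)."""
    parts = {}
    for window_id, episode_id in windows:
        station, start, end = episodes[episode_id]
        name = assign_partition(station, start, end, held_out_station)
        parts.setdefault(name, []).append(window_id)
    return parts
```

Because the partition is decided per episode and then applied to every window of that episode, overlapping windows that share raw DGA observations cannot end up on both sides of the train/test boundary.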
This protocol yields 16,842 training windows, 2367 validation windows, 3107 in-distribution test windows from unseen episodes, and 2184 station-disjoint test windows. Class balance is preserved within ±1.5 percentage points of the overall distribution in each partition by a constrained assignment algorithm that groups whole episodes while matching marginal class frequencies. Equipment identifiers, station identifiers, and absolute timestamps are not used as model inputs, further reducing the risk that the classifier learns unit- or station-specific signatures. The random sample-level split reported in earlier iterations of this work is used in this paper only as a diagnostic reference to quantify the leakage gap (Section 5.3).
3.5. DGA-IR Alignment and Labeling Protocol
Each sample in our dataset consists of a (DGA, IR, label) triplet collected from the same equipment unit under consistent operational conditions.
Temporal alignment. DGA measurements are collected at approximately seven-day intervals via automatic oil sampling. IR images are captured within a ±2-day window of the DGA measurement time to ensure correspondence. When multiple IR images are available within the window, we select the image with the smallest time difference; if ties occur, we choose the image with the highest automated quality score, and if still tied, we select the median-mean-temperature image.
Spatial alignment. IR images are captured from fixed camera positions installed at each converter transformer station. Standardized viewing angles, including a front view at 0° and side views at ±45°, ensure consistent spatial coverage. Image metadata records the camera position, enabling spatial registration across time points.
Labeling protocol independent of model inputs.
To avoid circular label definition, fault labels in this study are drawn exclusively from physical maintenance and operational records, not from IEC ratio codes or any gas-ratio threshold applied to the same DGA readings consumed by the model:
Primary label source. For every fault-class episode, the class label is extracted from internal maintenance inspection reports that document the physical finding at intervention (e.g., winding hot-spot discoloration, partial-discharge tracking patterns, arcing debris, and bushing degradation). For normal-class episodes, the label is assigned from operational log entries confirming continuous healthy operation with no dispatched maintenance event during the interval.
Independent corroboration. Each fault-class label is cross-checked against at least one additional independent source among: (i) offline laboratory analysis reports (e.g., furan, degree-of-polymerization, or interfacial-tension tests unrelated to the online DGA ratios used as model inputs), (ii) on-site component teardown or replacement records, or (iii) partial-discharge diagnostic reports from UHF or acoustic-emission (AE) measurements. A sample is admitted only when the physical-inspection category and the independent corroboration agree.
Multi-expert adjudication. All 335 fault-class episodes, and a randomly sampled 10% of normal-class episodes, are reviewed by three independent maintenance engineers. The Fleiss' κ inter-rater agreement across engineers is 0.84. Before final inclusion, 27 additional candidate fault-class episodes without unanimous agreement are excluded from the dataset.
Explicit exclusion of DGA ratio rules. During labeling, adjudicators do not consult the CH₄/H₂, C₂H₂/C₂H₄, C₂H₆/CH₄, and C₂H₄/C₂H₆ ratios computed from the measurement windows seen by the model. These IEC 60599-style ratios are used by the model as auxiliary input features but play no role in label generation.
Error control measures.
Several additional measures ensure data quality: (i) DGA measurements with total gas concentration below 50 ppm are discarded as unreliable; (ii) IR images with motion blur or camera obstructions are excluded; (iii) samples whose maintenance-inspection label is inconsistent with independent corroboration are flagged and, where the disagreement cannot be resolved, removed (included in the 3187 QC exclusions in Table 1); and (iv) normal samples are validated against historical baselines and operational logs for each equipment unit.
3.6. Problem Formulation
Let $x_{\mathrm{DGA}} \in \mathbb{R}^{d_1}$ denote the DGA feature vector extracted from gas concentrations and ratios, with $d_1 = 22$ after preprocessing. Let $I_{\mathrm{IR}} \in \mathbb{R}^{H \times W \times C}$ represent the IR image, where $H$ and $W$ are image dimensions in pixels and $C$ is the channel count. The DGA concentration components are measured in ppm, the ratio features are dimensionless, and IR pixel values record temperature in °C before normalization. The goal is to learn a mapping function $f$ that predicts the fault class $y \in \{1, 2, 3, 4, 5\}$:
$$y = f(x_{\mathrm{DGA}}, I_{\mathrm{IR}}, \theta). \tag{1}$$
Here $\theta$ denotes all learnable parameters of the DGA encoder, IR encoder, Cross-modal Transformer fusion module, modality gate, and classification head. Traditional approaches often process each modality separately and combine results at the decision level. In contrast, this paper proposes a joint modeling approach. The DGA features are encoded into token sequences, the IR image is processed by a CNN backbone to extract spatial tokens, and both token sequences are then fed into a Cross-modal Transformer module for deep interaction before final classification.
4. Proposed Cross-Modal Transformer Framework
4.1. Overall Architecture
Figure 1 illustrates the overall framework of the proposed method. The system consists of four main components: a DGA encoder branch, an IR encoder branch, a Cross-modal Transformer fusion module, and a classification head.
The DGA branch processes gas concentration data and generates temporal tokens. The IR branch extracts spatial features from IR images. Both branches project their features into a shared latent space. The Cross-modal Transformer module receives tokens from both branches and applies self-attention within each modality and cross-attention between modalities. A modality-gating mechanism adaptively weights the contributions of DGA and IR features, with an explicit missing-modality mask during training, before the classification head predicts the fault type.
4.2. DGA Branch: Temporal Feature Encoder
The DGA branch processes gas concentration data and related features. The input DGA feature vector $x \in \mathbb{R}^{22}$ consists of six key gas concentrations at three consecutive time steps $(t-2, t-1, t)$, yielding 18 dimensions, together with four gas-ratio features computed from the current time step. Because labels are generated independently of these ratios (Section 3.5), including them as model inputs does not reintroduce circular labeling; they are retained for their diagnostic value in distinguishing thermal from electrical fault mechanisms. To enable token-based processing, we reshape $x$ into a sequence of tokens. Specifically, six gas-type tokens are formed from temporal concentrations, and one ratio token is formed from the four ratios:
$$g_i = \left[x_i^{(t-2)},\; x_i^{(t-1)},\; x_i^{(t)}\right] \in \mathbb{R}^{3}, \quad i \in \{1, \dots, 6\}, \tag{2}$$
$$r = \left[\frac{x_{\mathrm{CH_4}}^{(t)} + \varepsilon}{x_{\mathrm{H_2}}^{(t)} + \varepsilon},\; \frac{x_{\mathrm{C_2H_2}}^{(t)} + \varepsilon}{x_{\mathrm{C_2H_4}}^{(t)} + \varepsilon},\; \frac{x_{\mathrm{C_2H_6}}^{(t)} + \varepsilon}{x_{\mathrm{CH_4}}^{(t)} + \varepsilon},\; \frac{x_{\mathrm{C_2H_4}}^{(t)} + \varepsilon}{x_{\mathrm{C_2H_6}}^{(t)} + \varepsilon}\right] \in \mathbb{R}^{4}. \tag{3}$$
Here $\varepsilon$ is a small constant set to 1 ppm to avoid division by zero. The concentration terms $x_i^{(t)}$ are measured in ppm, while the four gas ratios are dimensionless.
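The token construction in Eqs. (2) and (3) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the names `build_dga_tokens` and `flatten_features`, the dictionary layout of a window, and the gas ordering are hypothetical choices for the example; only the six gases, the four ratios, and the 1 ppm stabilizer come from the text.

```python
EPS_PPM = 1.0  # stabilizer from the text (1 ppm)

# Gas order is an assumption made for this example.
GASES = ("H2", "CH4", "C2H6", "C2H4", "C2H2", "CO")

# (numerator, denominator) pairs for the four ratio features in Eq. (3).
RATIO_PAIRS = (
    ("CH4", "H2"),
    ("C2H2", "C2H4"),
    ("C2H6", "CH4"),
    ("C2H4", "C2H6"),
)


def build_dga_tokens(window):
    """window: dict gas -> [x(t-2), x(t-1), x(t)] in ppm.

    Returns six 3-dimensional gas tokens (Eq. 2) and one 4-dimensional
    ratio token computed from the current time step only (Eq. 3)."""
    gas_tokens = [list(window[gas]) for gas in GASES]
    current = {gas: window[gas][-1] for gas in GASES}
    ratio_token = [
        (current[num] + EPS_PPM) / (current[den] + EPS_PPM)
        for num, den in RATIO_PAIRS
    ]
    return gas_tokens, ratio_token


def flatten_features(gas_tokens, ratio_token):
    """Concatenate the tokens into the 22-dimensional DGA feature vector."""
    return [value for token in gas_tokens for value in token] + list(ratio_token)
```

With the stabilizer in place, a gas reported as zero (below the detection limit) yields a finite ratio instead of a division error, which is the reason the text gives for adding $\varepsilon$ to both numerator and denominator.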
Each token is projected to the model dimension with learnable linear embeddings:
$$t_i^{\mathrm{gas}} = \mathrm{LayerNorm}\left(\mathrm{Linear}_{\mathrm{gas}}(g_i)\right), \quad \mathrm{Linear}_{\mathrm{gas}}: \mathbb{R}^{3} \to \mathbb{R}^{256}, \tag{4}$$
$$t^{\mathrm{ratio}} = \mathrm{LayerNorm}\left(\mathrm{Linear}_{\mathrm{ratio}}(r)\right), \quad \mathrm{Linear}_{\mathrm{ratio}}: \mathbb{R}^{4} \to \mathbb{R}^{256}, \tag{5}$$
$$T_{\mathrm{DGA}} = \left\{t_1^{\mathrm{gas}} + p_1,\; \dots,\; t_6^{\mathrm{gas}} + p_6,\; t^{\mathrm{ratio}} + p_7\right\}, \quad p_i \in \mathbb{R}^{256}. \tag{6}$$
The DGA token sequence $T_{\mathrm{DGA}} \in \mathbb{R}^{7 \times 256}$ serves as one modality input to the cross-attention module.
4.3. IR Branch: Spatial Feature Encoder
The IR branch extracts visual features from IR images. A ResNet-18 backbone serves as the feature extractor due to its balance between accuracy and efficiency. The input image $I_{\mathrm{IR}} \in \mathbb{R}^{512 \times 512}$ is preprocessed as follows:
Outlier clipping: pixel values are clipped to mean $\pm 3\sigma$ per image to suppress sensor artifacts.
Resize: the image is resized to 224 × 224 pixels using bilinear interpolation.
Channel handling: for grayscale IR images, the single channel is replicated to three channels for ResNet-18 input.
Normalization: pixel values are normalized using dataset-specific mean and standard deviation.
The ResNet-18 backbone, pretrained on ImageNet and then fine-tuned, processes the normalized image:
$$F_{\mathrm{IR}} = \mathrm{ResNet18}(I_{\mathrm{IR}}), \tag{7}$$
$$t_{i,j}^{\mathrm{patch}} = \mathrm{Linear}\left(F_{\mathrm{IR}}[:, i, j]\right), \quad i, j \in \{1, \dots, 7\}, \tag{8}$$
$$p_{i,j}^{\mathrm{2D}} = \mathrm{PE}_{\mathrm{row}}(i) + \mathrm{PE}_{\mathrm{col}}(j), \quad i, j \in \{1, \dots, 7\}, \tag{9}$$
$$T_{\mathrm{IR}} = \left\{t_{1,1}^{\mathrm{patch}} + p_{1,1}^{\mathrm{2D}},\; t_{1,2}^{\mathrm{patch}} + p_{1,2}^{\mathrm{2D}},\; \dots,\; t_{7,7}^{\mathrm{patch}} + p_{7,7}^{\mathrm{2D}}\right\}. \tag{10}$$
Here $F_{\mathrm{IR}} \in \mathbb{R}^{512 \times 7 \times 7}$ is the feature map output by the last convolutional layer. The 512 channels represent visual–semantic features, while the 7 × 7 spatial grid preserves coarse location information.
4.4. Cross-Modal Transformer Fusion Module
The core of the proposed framework is the Cross-modal Transformer fusion module. This module consists of $L = 4$ transformer blocks, each containing a self-attention layer, a bidirectional cross-attention layer, and feed-forward networks.
4.4.1. Self-Attention Within Modalities
For DGA tokens $T_{\mathrm{DGA}} \in \mathbb{R}^{7 \times 256}$, multi-head self-attention captures dependencies among different gas features:
$$Q_{\mathrm{DGA}} = T_{\mathrm{DGA}} W^{Q}, \quad K_{\mathrm{DGA}} = T_{\mathrm{DGA}} W^{K}, \quad V_{\mathrm{DGA}} = T_{\mathrm{DGA}} W^{V}, \tag{11}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \tag{12}$$
$$T_{\mathrm{DGA}}' = \mathrm{MHSA}(T_{\mathrm{DGA}}) + T_{\mathrm{DGA}}. \tag{13}$$
Here $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable projection matrices and $d_k$ is the per-head key dimension. Similarly, IR tokens undergo self-attention to model spatial relationships, with $N_{\mathrm{IR}} = 49$ tokens.
4.4.2. Cross-Attention Between Modalities
The cross-attention layer enables bidirectional information exchange between modalities. In the forward direction, DGA tokens query IR tokens:
$$Q_{\mathrm{DGA \to IR}} = T_{\mathrm{DGA}}' W_{\mathrm{cross}}^{Q}, \quad K_{\mathrm{IR}} = T_{\mathrm{IR}}' W_{\mathrm{cross}}^{K}, \quad V_{\mathrm{IR}} = T_{\mathrm{IR}}' W_{\mathrm{cross}}^{V}, \tag{14}$$
$$T_{\mathrm{DGA}}'' = \mathrm{CrossAttn}_{\mathrm{DGA \to IR}}\left(T_{\mathrm{DGA}}', T_{\mathrm{IR}}'\right) + T_{\mathrm{DGA}}', \tag{15}$$
$$T_{\mathrm{IR}}'' = \mathrm{CrossAttn}_{\mathrm{IR \to DGA}}\left(T_{\mathrm{IR}}', T_{\mathrm{DGA}}'\right) + T_{\mathrm{IR}}'. \tag{16}$$
Here $W_{\mathrm{cross}}^{Q}$, $W_{\mathrm{cross}}^{K}$, and $W_{\mathrm{cross}}^{V}$ are learnable projection matrices for cross-modal attention. This enables, for example, a hydrogen-concentration token to focus on IR regions exhibiting hot-spot patterns consistent with electrical faults. In the reverse direction, each IR spatial region can incorporate chemical information from all gas concentrations.
4.4.3. Feed-Forward and Layer Structure
After attention layers, a position-wise feed-forward network with GELU activation is applied:
$$\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1) W_2 + b_2, \quad W_1: 256 \to 1024, \quad W_2: 1024 \to 256. \tag{17}$$
Each transformer block follows the pre-layer normalization structure:
$$T_m^{l,\mathrm{sa}} = T_m^{l} + \mathrm{MHSA}\left(\mathrm{LayerNorm}(T_m^{l})\right), \quad m \in \{\mathrm{DGA}, \mathrm{IR}\}, \tag{18}$$
$$T_{\mathrm{DGA}}^{l,\mathrm{ca}} = T_{\mathrm{DGA}}^{l,\mathrm{sa}} + \mathrm{CrossAttn}_{\mathrm{DGA \to IR}}\left(\mathrm{LayerNorm}(T_{\mathrm{DGA}}^{l,\mathrm{sa}}),\; \mathrm{LayerNorm}(T_{\mathrm{IR}}^{l,\mathrm{sa}})\right), \tag{19}$$
$$T_{\mathrm{IR}}^{l,\mathrm{ca}} = T_{\mathrm{IR}}^{l,\mathrm{sa}} + \mathrm{CrossAttn}_{\mathrm{IR \to DGA}}\left(\mathrm{LayerNorm}(T_{\mathrm{IR}}^{l,\mathrm{sa}}),\; \mathrm{LayerNorm}(T_{\mathrm{DGA}}^{l,\mathrm{sa}})\right), \tag{20}$$
$$T_m^{l+1} = T_m^{l,\mathrm{ca}} + \mathrm{FFN}\left(\mathrm{LayerNorm}(T_m^{l,\mathrm{ca}})\right), \quad m \in \{\mathrm{DGA}, \mathrm{IR}\}. \tag{21}$$
Here $l = 0, \dots, 3$ denotes the layer index.
4.4.4. Modality-Gating Mechanism with Missing-Modality Support
The modality-gating mechanism is designed to handle three regimes: (i) both modalities clean, (ii) one or both modalities degraded by noise, and (iii) one modality completely missing (sensor outage). Global features are extracted after the final transformer layer using average pooling:
$$g_{\mathrm{DGA}} = \frac{1}{N_{\mathrm{DGA}}} \sum_{i=1}^{N_{\mathrm{DGA}}} T_{\mathrm{DGA},i}^{(L)}, \quad g_{\mathrm{IR}} = \frac{1}{N_{\mathrm{IR}}} \sum_{j=1}^{N_{\mathrm{IR}}} T_{\mathrm{IR},j}^{(L)}. \tag{22}$$
With $g_{\mathrm{DGA}}, g_{\mathrm{IR}} \in \mathbb{R}^{256}$, and availability masks $m_{\mathrm{DGA}}, m_{\mathrm{IR}} \in \{0, 1\}$ indicating whether each modality is observed in this sample, a gating network computes masked dynamic weights:
$$h_{\mathrm{concat}} = \left[m_{\mathrm{DGA}}\, g_{\mathrm{DGA}};\; m_{\mathrm{IR}}\, g_{\mathrm{IR}};\; m_{\mathrm{DGA}};\; m_{\mathrm{IR}}\right] \in \mathbb{R}^{514}, \tag{23}$$
$$z_{\mathrm{gate}} = \mathrm{ReLU}\left(h_{\mathrm{concat}} W_{\mathrm{gate}}^{(1)} + b_{\mathrm{gate}}^{(1)}\right), \quad W_{\mathrm{gate}}^{(1)}: 514 \to 128, \tag{24}$$
$$s_{\mathrm{gate}} = z_{\mathrm{gate}} W_{\mathrm{gate}}^{(2)} + b_{\mathrm{gate}}^{(2)}, \quad W_{\mathrm{gate}}^{(2)}: 128 \to 2, \tag{25}$$
$$\tilde{s}_{\mathrm{gate}} = s_{\mathrm{gate}} + \log\left[m_{\mathrm{DGA}}, m_{\mathrm{IR}}\right], \tag{26}$$
$$\left[\alpha_{\mathrm{DGA}}, \alpha_{\mathrm{IR}}\right] = \mathrm{Softmax}(\tilde{s}_{\mathrm{gate}}) \in \mathbb{R}^{2}, \tag{27}$$
$$h_{\mathrm{fused}} = \alpha_{\mathrm{DGA}}\, g_{\mathrm{DGA}} + \alpha_{\mathrm{IR}}\, g_{\mathrm{IR}} \in \mathbb{R}^{256}. \tag{28}$$
The additive term $\log[m_{\mathrm{DGA}}, m_{\mathrm{IR}}]$ (with $\log 0$ implemented as a large negative constant) is applied before the softmax and forces the gate weight of any missing modality to zero. At training time, a modality-drop scheme randomly sets $m_{\mathrm{DGA}}$ or $m_{\mathrm{IR}}$ to zero with probability 0.15 each (jointly bounded so that at least one modality is always present), so the network explicitly learns to predict under partial and fully missing-modality conditions. This is evaluated in Section 5.4.
4.5. Classification Head and Training Objective
The fused feature $h_{\mathrm{fused}} \in \mathbb{R}^{256}$ is passed to a classification head with an explicit MLP architecture:
$$z_1 = \mathrm{Dropout}\left(\mathrm{ReLU}(h_{\mathrm{fused}} W_c^{(1)} + b_c^{(1)})\right), \quad W_c^{(1)}: 256 \to 128, \quad p_{\mathrm{dropout}} = 0.3, \tag{29}$$
$$z_2 = \mathrm{Dropout}\left(\mathrm{ReLU}(z_1 W_c^{(2)} + b_c^{(2)})\right), \quad W_c^{(2)}: 128 \to 64, \tag{30}$$
$$\hat{y} = \mathrm{Softmax}\left(z_2 W_c^{(3)} + b_c^{(3)}\right), \quad W_c^{(3)}: 64 \to 5. \tag{31}$$
The model is trained using class-weighted cross-entropy with label smoothing:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{K} y_i \log(\hat{y}_i), \quad K = 5, \tag{32}$$
$$\mathcal{L} = -\sum_{i=1}^{K} \left[(1 - \varepsilon)\, y_i + \frac{\varepsilon}{K}\right] \log(\hat{y}_i), \quad \varepsilon = 0.1, \tag{33}$$
$$w_i = \frac{N_{\mathrm{total}}}{K N_i}, \quad \mathcal{L}_{\mathrm{WCE}} = -\sum_{i=1}^{K} w_i\, y_i \log(\hat{y}_i). \tag{34}$$
Here $y_i$ and $\hat{y}_i$ are the one-hot target and predicted probability for class $i$, $N_i$ is the number of training samples in class $i$, and $N_{\mathrm{total}}$ is the total number of training samples. In Eq. (33), $\varepsilon$ is the label-smoothing coefficient, not the 1 ppm ratio stabilizer of Eq. (3). Optimization uses AdamW with initial learning rate $\eta_0 = 1 \times 10^{-4}$, weight decay $\lambda = 5 \times 10^{-4}$, and $\beta_1 = 0.9$, $\beta_2 = 0.999$. Learning-rate warmup is applied for the first 10 epochs, followed by cosine annealing:
$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_0 - \eta_{\min}\right)\left[1 + \cos\left(\pi\, \frac{t - T_{\mathrm{warmup}}}{T_{\max} - T_{\mathrm{warmup}}}\right)\right], \tag{35}$$
where $T_{\mathrm{warmup}} = 10$, $T_{\max} = 100$, and $\eta_{\min} = 1 \times 10^{-6}$.
5. Experiments
5.1. Experimental Setup
The proposed Cross-modal Transformer framework is implemented in PyTorch 2.1.2 and trained on an NVIDIA A100 GPU with 40 GB memory. Unless stated otherwise, all methods use the grouped chronological split of Section 3.4: 16,842 training windows, 2367 validation windows, and 3107 in-distribution test windows (unseen episodes from training-observed units and stations), with an additional 2184-window station-disjoint held-out set used only for the robustness check in Table 2. To quantify the leakage gap of prior practice, we also report results under a sample-level random split with the same marginal class distribution; those numbers are labeled “random-split reference” and are not used to support any headline claim.
The hidden dimension $d_{\mathrm{model}}$ is 256. The Cross-modal Transformer consists of 4 layers with 8 attention heads. The batch size is 32 for paired DGA and IR samples. The initial learning rate is $1 \times 10^{-4}$ with weight decay $5 \times 10^{-4}$. Training runs for 100 epochs with 10 warmup epochs followed by cosine annealing. Dropout is 0.3.
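The warmup-plus-cosine schedule of Eq. (35) with the hyperparameters listed above can be sketched as a per-epoch function. The cosine branch follows Eq. (35) directly; the shape of the warmup phase is not specified in the text, so the linear ramp used here is an assumption, as is the function name `learning_rate`.

```python
import math

ETA_0 = 1e-4     # initial (peak) learning rate
ETA_MIN = 1e-6   # floor reached at the end of training
T_WARMUP = 10    # warmup epochs
T_MAX = 100      # total epochs


def learning_rate(epoch):
    """Learning rate for a 0-indexed epoch."""
    if epoch < T_WARMUP:
        # Assumption: linear ramp that reaches ETA_0 on the last warmup epoch.
        return ETA_0 * (epoch + 1) / T_WARMUP
    # Cosine annealing from ETA_0 down to ETA_MIN, as in Eq. (35).
    progress = (epoch - T_WARMUP) / (T_MAX - T_WARMUP)
    return ETA_MIN + 0.5 * (ETA_0 - ETA_MIN) * (1.0 + math.cos(math.pi * progress))
```

Evaluating the function at the first annealing epoch returns the peak rate, and evaluating it at $t = T_{\max}$ returns $\eta_{\min}$, which matches the two end points implied by Eq. (35).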
DGA features are normalized with z-score standardization, with gas concentration values clipped to $[0, 5000]$ ppm before standardization. Ratio features use stabilized division with $\varepsilon = 1$ ppm. IR images are clipped to mean $\pm 3\sigma$ per image, resized to 224 × 224, and normalized using dataset-specific statistics computed only on the training partition to avoid test-set statistics leaking into training. Data augmentation includes random horizontal flipping, rotation within ±15°, and random temperature scaling in $[0.9, 1.1]$.
For all comparisons, we report accuracy (Acc), macro-averaged precision, macro-averaged recall, and macro-F1. All tables in this section, including ablations, use the same macro-F1 definition, so numbers are directly comparable across tables. All reported test numbers are averages over five independent runs with different random seeds; standard deviations are given in parentheses where space permits.
5.2. Baseline Methods
To evaluate the proposed approach, we compare against both classical and encoder-matched baselines:
DGA-MLP: a multi-layer perceptron on DGA features with hidden sizes $[128, 64, 32]$.
IR-CNN: a standard CNN with 4 convolutional blocks processing IR images only.
IR-ResNet18: a ResNet-18 backbone pretrained on ImageNet and fine-tuned on IR images.
Transformer-DGA: a standard Transformer encoder processing DGA features only.
ViT-IR: a Vision Transformer processing IR images only.
DGA + IR-Concat (weak): feature-level concatenation of DGA-MLP and IR-CNN features, followed by a 3-layer MLP. Retained to match prior literature.
DGA + IR-Decision (weak): decision-level fusion combining DGA-MLP and IR-CNN predictors via weighted averaging.
EM-Concat: encoder-matched concatenation of Transformer-DGA and ResNet-18 features, followed by the same 3-layer MLP classifier used in our model. Isolates the benefit of stronger unimodal encoders from fusion design.
EM-LateFusion: encoder-matched late fusion, where separate Transformer-DGA and ResNet-18 classifiers are combined via learned weighted averaging of class posteriors.
EM-GatedAvg: encoder-matched gated averaging, where a lightweight gating MLP (identical in capacity to the proposed gate) produces sample-level weights over Transformer-DGA and ResNet-18 features, but no cross-attention is used.
The encoder-matched baselines (EM-*) share Transformer-DGA and ResNet-18 backbones with the proposed model, differ from it only in the fusion mechanism, and are trained with the same optimizer, schedule, class weights, and modality-drop regularization. Consequently, any gap between EM-GatedAvg and the proposed method isolates the contribution of bidirectional cross-attention over simple gated averaging with identical encoders. We therefore restrict the superiority claim to these controlled DGA-IR fusion comparisons rather than claiming dominance over every possible temporal, uncertainty-aware, or physics-informed diagnostic framework.
5.3. Main Experimental Results
Table 2 summarizes the comparison on the grouped chronological test set and the station-disjoint held-out set; the random-split reference column is included for transparency only. The corresponding per-class F1-scores on the grouped chronological test set are reported in Table 3. Several observations follow:
The leakage gap is visible and substantial. Every method loses 5–6 accuracy points when moving from the random-split reference column to the grouped chronological split. The random-split numbers therefore systematically overestimate deployment accuracy, confirming that prior sample-level evaluations are optimistic.
Encoder-matched baselines already absorb most of the “fusion gain”. EM-Concat, EM-LateFusion, and EM-GatedAvg reach 84.67–85.91% accuracy, compared with 81.53–82.08% for the weak-encoder simple-fusion baselines.
This shows that a significant fraction of the headline gap reported in earlier multi-modal studies is attributable to stronger unimodal encoders, not to fusion design itself.
Cross-attention contributes a genuine residual gain. With encoders held fixed (Transformer-DGA + ResNet-18), the proposed bidirectional cross-attention model improves over EM-GatedAvg by 2.82 accuracy points and 0.020 macro-F1 (both statistically significant at p < 0.01 under a paired bootstrap over five seeds). The improvement is largest for the hardest classes (HED: +3.8 F1, PD: +3.3 F1), which supports the claim that cross-modal attention disambiguates confusable fault pairs rather than merely inflating easy-class metrics.
Station-disjoint results degrade gracefully. On the station-disjoint test set, the proposed method retains 85.62% accuracy, a 3.11-point drop from the in-distribution grouped-split test. Encoder-matched baselines drop by comparable margins, suggesting the gain is not an artifact of station-specific overfitting.
From a diagnostic perspective, the largest per-class improvements occur for PD and HED, where gas signatures such as H₂ and C₂H₂ increases need to be interpreted together with localized or weak thermal patterns. For LTO and HTO, the temporal rise of hydrocarbon gases and the spatial distribution of tank or bushing hot spots provide complementary evidence of thermal severity. These trends are consistent with the physical mechanisms summarized in Section 3.2 and Section 3.3 and explain why cross-modal interaction is more beneficial than feature concatenation alone.
5.4. Missing-Modality and Noise Robustness
Table 4 evaluates the modality-gating mechanism under partial and complete sensor failure. EM-Concat, which has no gating mechanism, collapses by 16–24 accuracy points when one modality is missing because feature concatenation with zeroed inputs drives the classifier off distribution.
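For contrast with plain concatenation, the masked softmax of Eqs. (26) to (28) that the gated models rely on can be sketched in plain Python. This is a hedged illustration only: in the paper the gate scores come from the gating MLP of Eqs. (23) to (25), whereas here they are passed in directly, and the names `masked_gate`, `fuse`, and the constant `NEG_INF` are hypothetical.

```python
import math

NEG_INF = -1e9  # stands in for log(0), as described for Eq. (26)


def masked_gate(scores, mask):
    """scores: [s_DGA, s_IR] gate logits; mask: [m_DGA, m_IR] with 0/1 entries.

    Returns [alpha_DGA, alpha_IR]; a missing modality gets weight 0."""
    if not any(mask):
        raise ValueError("at least one modality must be available")
    # Eq. (26): add log(mask), i.e. 0 for present and a large negative if missing.
    shifted = [s + (0.0 if m else NEG_INF) for s, m in zip(scores, mask)]
    # Eq. (27): numerically stable softmax over the two shifted scores.
    peak = max(shifted)
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(g_dga, g_ir, alphas):
    """Eq. (28): convex combination of the two pooled feature vectors."""
    return [alphas[0] * a + alphas[1] * b for a, b in zip(g_dga, g_ir)]
```

When one mask entry is zero, the fused vector reduces to the pooled feature of the remaining modality, so the classifier sees an input from a regime it was trained on through modality drop rather than a concatenation padded with zeros.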
Gated baselines degrade more gracefully, and the proposed model retains the largest fraction of its paired-modality accuracy across all four degradation regimes. This supports robustness under the simulated outage setting, but it should not be read as covering all field conditions such as long DGA delays, analyzer calibration faults, camera aging, seasonal temperature drift, or incomplete maintenance records.

Under per-sample Gaussian noise on DGA concentrations ($\sigma_{\mathrm{DGA}} \in \{0.05, 0.10, 0.20\}\,\bar{x}$) and speckle noise on IR images ($\sigma_{\mathrm{IR}} \in \{0.03, 0.06, 0.12\}$), the proposed model retains 86.12%, 82.45%, and 76.88% accuracy, respectively, versus 83.24%, 78.42%, and 70.55% for EM-GatedAvg. Noise results are consistent with the missing-modality findings and are reported as stress tests rather than as a complete operational certification.

5.5. Ablation Studies

All ablation tables (Table 5, Table 6, Table 7 and Table 8) use the same grouped chronological test set and the same macro-F1 definition as Table 2, so values here are directly comparable to the main results. Bidirectional cross-attention, adaptive gating, and temporal concentrations each contribute measurable and non-redundant gains. The architecture saturates around 4 layers and $d_{\mathrm{model}} = 256$; deeper or wider variants yield diminishing returns at noticeably higher parameter cost.

5.6. Feature Representation Analysis

Figure 2 presents t-SNE projections for the two pre-fusion branches and the proposed fused representation, and Table 9 reports silhouette scores on the grouped chronological test set. DGA features alone are the most mixed (0.079), IR features show moderate grouping (0.152), encoder-matched concatenation improves compactness (0.217), and the proposed cross-modal fusion achieves the highest compactness (0.274), confirming that cross-attention produces more discriminative joint representations.

5.7. Efficiency

Table 10 shows that the proposed model adds 5.0 M parameters and 1.2 ms of per-batch inference over EM-GatedAvg in exchange for 2.82 accuracy points and noticeably stronger missing-modality robustness. In operational deployment, where batch inference runs at the DGA sampling cadence (approximately weekly), this overhead is negligible.

6. Conclusions

This paper presents a Cross-modal Transformer framework for UHV converter transformer equipment fault diagnosis by jointly modeling DGA and IR imaging data. The framework combines Transformer-DGA temporal encoding, a ResNet-18 IR backbone, bidirectional cross-attention, and an adaptive modality-gating mechanism that supports full, degraded, and missing-modality inputs. Evaluation is performed under a leakage-controlled grouped chronological split that isolates equipment units, converter stations, and fault episodes across train, validation, and test partitions, with labels drawn from maintenance inspection and operational records independent of the IEC 60599 ratio features used as model inputs, and with dataset provenance (units, stations, monitoring span, fault episodes, overlapping windows, and QC exclusions) fully disclosed. On 24,500 windows from 459 episodes across 18 units at 6 converter stations over six years, the proposed method reaches 88.73% accuracy and 0.867 macro-F1, retaining 85.62% accuracy on a station-disjoint held-out subset. The closest larger architecture in Table 7 reaches 88.91%, so we do not claim 90%; reaching that level will likely require additional independently labeled episodes and tighter DGA-IR calibration rather than only increasing model depth or width. Crucially, the gain over encoder-matched simple-fusion baselines (EM-Concat, EM-LateFusion, and EM-GatedAvg) is 2.82–4.06 accuracy points with encoders held fixed, which isolates the contribution of bidirectional cross-attention from that of stronger unimodal backbones.
The gating mechanism, trained with modality-drop supervision, further improves robustness under one-modality outage by 2.22–2.31 accuracy points over EM-GatedAvg. Taken together, these results support the claim that explicit cross-modal interaction and adaptive modality weighting are both meaningful improvements for DGA-IR fault diagnosis of UHV converter transformer equipment, beyond what is attributable to stronger feature extractors or to leakage-optimistic evaluation protocols. Future work will extend the benchmark across additional utilities and climates, evaluate cross-manufacturer transfer, and study online adaptation when the monitoring pipeline is retrofitted with new IR camera models.

Author Contributions

Conceptualization, X.Y. and W.L.; methodology, X.Y. and W.L.; investigation, X.Y., W.L., R.L., S.F., Y.F. and Y.Z.; formal analysis, X.Y.; writing—original draft preparation, X.Y.; writing—review and editing, all authors; supervision, W.L.; funding acquisition, X.Y. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of State Grid Sichuan Electric Power Company (Project: Mechanism and Diagnosis Technology of Abnormal Gas Generation in UHV Converter Transformers and Key Components, Grant No. 521997240002).

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request due to privacy or ethical restrictions.

Conflicts of Interest

All authors were employed by the Electric Power Research Institute, State Grid Sichuan Electric Power Company and Sichuan Provincial Key Laboratory of Safety and Operation of New Power System.

References

Figure 1. Overall architecture of the proposed Cross-modal Transformer. The DGA branch encodes three-step gas concentrations and four IEC ratios into seven temporal tokens, and a ResNet-18 backbone turns the 224 × 224 IR image into 49 spatial tokens.
Both token sequences are projected into a shared 256-dimensional latent space, where four transformer blocks apply self-attention within each modality and bidirectional cross-attention between modalities. A modality-gating module with a missing-modality mask combines the global features and feeds the fused embedding to a 5-way classifier.

Figure 2. t-SNE visualization of feature representations. The panels show DGA-only features, IR-only features, and proposed cross-modal fused features, respectively.

Table 1. Dataset provenance statistics. “Episodes” count physically distinct fault intervals or healthy-operation intervals; “Windows” count three-step DGA-IR samples used for modeling; “W/Ep.” gives median/max windows per episode.

| Class | Units | Stations | Episodes | Windows | W/Ep. (med/max) |
|---|---|---|---|---|---|
| Normal (N) | 18 | 6 | 124 | 8000 | 62/118 |
| Low-Temp Overheat (LTO) | 14 | 6 | 118 | 5500 | 45/96 |
| High-Temp Overheat (HTO) | 13 | 5 | 96 | 4800 | 48/102 |
| Partial Discharge (PD) | 11 | 5 | 74 | 3500 | 44/88 |
| High-Energy Discharge (HED) | 9 | 4 | 47 | 2700 | 52/106 |
| Total | 18 | 6 | 459 | 24,500 | – |
| Excluded (QC) | 18 | 6 | – | 3187 | – |

Table 2. Method comparison under the grouped chronological split (leakage-controlled) and the station-disjoint test set.

| Method | Acc (%) | Prec. | Rec. | F1 | Stn-Disj. (%) | RS Ref. (%) |
|---|---|---|---|---|---|---|
| DGA-MLP | 74.82 | 0.741 | 0.736 | 0.738 | 71.93 | 82.45 |
| IR-CNN | 71.55 | 0.707 | 0.702 | 0.704 | 68.74 | 79.31 |
| IR-ResNet18 | 77.34 | 0.764 | 0.758 | 0.761 | 74.85 | 84.62 |
| Transformer-DGA | 78.46 | 0.778 | 0.772 | 0.775 | 75.61 | 85.18 |
| ViT-IR | 79.12 | 0.784 | 0.780 | 0.782 | 76.32 | 86.35 |
| DGA + IR-Concat (weak) | 81.53 | 0.806 | 0.801 | 0.803 | 78.24 | 88.72 |
| DGA + IR-Decision (weak) | 82.08 | 0.812 | 0.807 | 0.809 | 78.91 | 89.14 |
| EM-Concat | 84.67 | 0.837 | 0.832 | 0.834 | 81.46 | 91.35 |
| EM-LateFusion | 85.24 | 0.843 | 0.838 | 0.840 | 82.05 | 91.84 |
| EM-GatedAvg | 85.91 | 0.850 | 0.845 | 0.847 | 82.77 | 92.48 |
| Proposed | 88.73 | 0.876 | 0.871 | 0.867 | 85.62 | 94.28 |

Note: F1 is macro-averaged using the same definition across all tables. Values are means over five seeds. “Stn-disj.” is the station-disjoint accuracy and “RS ref.” is the random-split reference accuracy, reported only to quantify the leakage gap.

Table 3. Per-class F1 on the grouped chronological test set.

| Method | N | LTO | HTO | PD | HED |
|---|---|---|---|---|---|
| DGA + IR-Concat (weak) | 89.4 | 79.1 | 81.7 | 76.2 | 73.5 |
| DGA + IR-Decision (weak) | 90.1 | 79.8 | 82.9 | 77.4 | 74.1 |
| EM-Concat | 92.0 | 82.7 | 85.1 | 80.2 | 77.3 |
| EM-LateFusion | 92.4 | 83.3 | 85.7 | 81.0 | 78.1 |
| EM-GatedAvg | 92.8 | 84.1 | 86.3 | 81.8 | 78.8 |
| **Proposed** | **93.0** | **85.5** | **87.2** | **85.1** | **82.6** |

Bold indicates the best-performing method and values.

Table 4. Accuracy (%) on the grouped chronological test set under missing-modality conditions. “DGA only”/“IR only” denotes a hard sensor outage (the other modality zeroed and masked).
“Drop 20%”/“Drop 50%” denote random per-sample modality drop at test time.

| Method | Both Present | DGA Only | IR Only | Drop 20% | Drop 50% |
|---|---|---|---|---|---|
| EM-Concat | 84.67 | 68.42 | 61.05 | 78.83 | 71.12 |
| EM-LateFusion | 85.24 | 74.81 | 68.37 | 81.56 | 75.24 |
| EM-GatedAvg | 85.91 | 75.92 | 69.41 | 82.34 | 76.43 |
| Proposed | 88.73 | 78.14 | 71.72 | 85.12 | 79.68 |

Table 5. Ablation on cross-attention mechanisms.

| Variant | Acc (%) | Macro-F1 |
|---|---|---|
| Self-attention only | 85.12 | 0.830 |
| Unidirectional (DGA→IR) | 87.04 | 0.848 |
| Bidirectional (ours) | 88.73 | 0.867 |

Table 6. Ablation on modality-gating strategies.

| Fusion | Acc (%) | Macro-F1 |
|---|---|---|
| Simple averaging | 86.18 | 0.839 |
| Learned fixed weights | 87.05 | 0.851 |
| Adaptive gating (ours) | 88.73 | 0.867 |

Table 7. Ablation on model architecture.

| Layers | d_model | Acc (%) | Param (M) |
|---|---|---|---|
| 2 | 128 | 85.32 | 14.4 |
| 2 | 256 | 86.74 | 16.0 |
| 4 | 256 | 88.73 | 17.7 |
| 6 | 256 | 88.91 | 19.9 |
| 4 | 512 | 88.82 | 23.5 |

Table 8. Ablation on DGA feature combinations.

| Features | Acc (%) | Macro-F1 |
|---|---|---|
| Current-step concentrations only | 86.45 | 0.843 |
| Concentrations + ratios | 87.61 | 0.855 |
| Concentrations + ratios + temporal (ours) | 88.73 | 0.867 |

Table 9. Silhouette scores on the grouped chronological test set.

| Feature Type | DGA-Only | IR-Only | EM-Concat | Proposed |
|---|---|---|---|---|
| Score | 0.079 | 0.152 | 0.217 | 0.274 |

Table 10. Efficiency comparison. “Time” is average per-batch (32 samples) inference time in ms on a single A100.

| Method | Param (M) | Time (ms) | Acc (%) |
|---|---|---|---|
| DGA-MLP | 0.8 | 2.1 | 74.82 |
| IR-ResNet18 | 11.2 | 8.5 | 77.34 |
| EM-Concat | 12.5 | 10.8 | 84.67 |
| EM-GatedAvg | 12.7 | 11.1 | 85.91 |
| Proposed | 17.7 | 12.3 | 88.73 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

2. Related Work

3. Problem Formulation and Data Description

3.1. Transformer Equipment Fault Types

This study focuses on five operating states of UHV converter transformer equipment: the normal condition and four common fault types. The normal condition (N) represents healthy operation without significant abnormalities. Low-temperature overheating (LTO) occurs below 300 °C and is often caused by poor contacts or circulating currents. High-temperature overheating (HTO) exceeds 700 °C and typically results from severe winding faults or core problems. Partial discharge (PD) involves small electrical discharges in insulation voids or interfaces. High-energy discharge (HED) includes arcing and sparking faults that can cause severe damage in a short time [7].

The temperature thresholds cited above refer to estimated internal fault temperature categories used by domain experts. In this study they are used only to describe fault phenomenology, not to generate labels: all class labels are assigned from independent maintenance inspection and operational records as described in Section 3.5.
IR images reflect surface temperature distributions (from −20 °C to 150 °C in our dataset) and provide complementary spatial cues rather than direct hotspot temperature measurements. Accurate classification of these five states is important for maintenance decisions because different fault types require different repair strategies.

3.2. DGA Data Description

DGA measures gas concentrations in transformer oil. Six key gases are monitored in this study: hydrogen (H₂), methane (CH₄), ethane (C₂H₆), ethylene (C₂H₄), acetylene (C₂H₂), and carbon monoxide (CO). Each fault type produces characteristic gas patterns. Thermal faults generate mainly hydrocarbon gases, while electrical faults produce more hydrogen and acetylene.

In raw monitoring records, a typical overheating episode appears as a gradual weekly rise in CH₄, C₂H₆, and C₂H₄ concentrations (ppm), whereas discharge-related episodes often show sharper increases in H₂ and C₂H₂. These trajectories are used as monitoring signals for the model rather than as hand-coded decision rules.

The DGA data consist of gas concentration measurements in ppm, collected using automated gas chromatograph equipment with measurement intervals of approximately seven days. For each equipment unit, measurements from three consecutive time points (t − 2, t − 1, t) are used to form the temporal input, capturing fault evolution trends. Gas concentration values below the detection limit are set to zero.

In addition to raw concentrations, four ratio features are computed from the current time step: CH₄/H₂, C₂H₂/C₂H₄, C₂H₆/CH₄, and C₂H₄/C₂H₆. These ratios are provided to the model only as input features; they are not used in the labeling pipeline (Section 3.5), so the model and the ground truth do not share a common rule-based source. The complete DGA feature vector has 22 dimensions: 6 × 3 = 18 from temporal concentrations and 4 from ratios.
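The feature construction described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names, the gas ordering within the vector, and the handling of missing readings are assumptions, while the six gases, the four ratios, the three-step window, and the stabilized division (a + ε)/(b + ε) with ε = 1 ppm come from the text.

```python
from typing import Dict, List, Sequence

# Gas order and ratio order are illustrative assumptions, not the authors' layout.
GASES = ("H2", "CH4", "C2H6", "C2H4", "C2H2", "CO")
RATIOS = (("CH4", "H2"), ("C2H2", "C2H4"), ("C2H6", "CH4"), ("C2H4", "C2H6"))
EPS_PPM = 1.0  # stabilizing constant stated in the text


def clean_ppm(sample: Dict[str, float], gas: str) -> float:
    """Return a non-negative ppm value; missing readings are treated as 0 (assumption)."""
    return max(float(sample.get(gas, 0.0)), 0.0)


def stabilized_ratio(a: float, b: float, eps: float = EPS_PPM) -> float:
    """Compute (a + eps) / (b + eps) so a zero denominator stays finite."""
    return (a + eps) / (b + eps)


def build_dga_features(window: Sequence[Dict[str, float]]) -> List[float]:
    """Build the 22-dimensional vector from samples ordered (t-2, t-1, t)."""
    if len(window) != 3:
        raise ValueError("expected three consecutive measurements")
    # 6 gases x 3 time steps = 18 temporal concentration features
    features = [clean_ppm(s, g) for s in window for g in GASES]
    # 4 ratio features computed from the current time step only
    current = window[-1]
    features += [
        stabilized_ratio(clean_ppm(current, num), clean_ppm(current, den))
        for num, den in RATIOS
    ]
    return features
```

Calling `build_dga_features` on three weekly samples returns a list of length 22, with the 18 concentrations first and the four ratios last. Any normalization or tokenization applied before the Transformer-DGA encoder is outside the scope of this sketch.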
To avoid undefined ratios when denominators are near zero, each ratio is computed with stabilized division (a + ε)/(b + ε) using a small constant ε set to 1 ppm.

3.3. IR Image Data Description

IR thermal imaging provides surface temperature distribution of transformers. IR cameras detect thermal radiation and convert it to temperature values. Hotter regions appear brighter in IR images. Faults often cause abnormal temperature patterns. For example, winding overheating produces hot spots on the tank surface, while bushing faults show temperature rise at the bushing connections.

In an illustrative IR monitoring record, normal operation shows a relatively uniform tank and radiator temperature field after load normalization, while a developing bushing or winding-related thermal fault produces a compact hotspot near the bushing connection or tank wall. The images therefore provide spatial context for abnormal heating that may not be separable from gas ratios alone.

IR images in this dataset are captured using FLIR thermal cameras with thermal sensitivity of 0.05 °C and accuracy of ±2%. The original image resolution is 512 × 512 pixels in 16-bit grayscale format, representing te