CARM: Cross-Modal Alignment Recovery for Lightweight Referring Expression Comprehension

Open Access Article

Gengsheng Zheng, Qiang Zhang, Meng Song, Xinghu Zhang and Jianhua Wang

1 School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China
2 Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology), Wuhan 430073, China
3 Wuhan Yangtze Computing Technology Co., Ltd., Wuhan 430071, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2509; https://doi.org/10.3390/electronics15122509 (registering DOI)
Submission received: 4 May 2026 / Revised: 26 May 2026 / Accepted: 4 June 2026 / Published: 7 June 2026

Abstract

Referring Expression Comprehension (REC) localizes a target object in an image given a natural-language referring expression and is a core benchmark for fine-grained vision–language alignment. Recent detection-style multimodal Transformers achieve strong REC performance but typically rely on high-capacity visual and textual backbones, incurring substantial storage and compute costs. Replacing these backbones with lightweight alternatives greatly reduces model size, yet often degrades cross-modal alignment and yields a persistent accuracy gap. We propose CARM, a minimally invasive Cross-modal Alignment Recovery Module inserted between lightweight backbones and the downstream multimodal Transformer, requiring no changes to either component. CARM injects complementary priors via bidirectional prompts and uses a Cross-Attention Gate (CAG) to adaptively filter and scale prompt-induced updates; it further integrates Tree-of-Attributes Prompts (TAPs) to strengthen fine-grained grounding of attributes such as color, size, and spatial location. On RefCOCO, switching to lightweight backbones drops P@1 (IoU ≥ 0.5) to 84.51, while CARM improves it to 86.23, recovering most of the loss.
Meanwhile, the overall model storage (checkpoint) is reduced by about 80%, demonstrating that the cross-modal alignment degradation induced by compression can be effectively restored.

1. Introduction

2. Background, Related Work, and Motivation

2.1. Detection-Style Multimodal Frameworks for REC

2.2. Recent Advances and Limitations in REC

2.3. Motivation: From Representation Degradation to Alignment Recovery

This limitation becomes particularly important when high-capacity REC models are compressed for efficient deployment. We focus on accuracy recovery after backbone lightweighting in detection-style REC models. Replacing large visual and textual backbones with lightweight ones reduces computation but weakens semantic anchors and cross-modal alignment, leading to less reliable word-to-region correspondence. To address this issue, we propose CARM as a minimally invasive module before the multimodal Transformer. It strengthens visual and textual representations through bidirectional prompts, a Cross-Attention Gate, and Tree-of-Attributes Prompts, aiming to recover fine-grained alignment without modifying the compressed backbones or requiring additional generic pretraining.

3. Method

3.1. Problem Setting and Overall Framework

Our method is built upon an MDETR-style detection-based multimodal Transformer for REC. Given an image $I$ and a referring expression text $T$, the model outputs a target bounding box that matches $T$. Rather than modifying the detection head or downstream reasoning structure, we keep MDETR’s joint Transformer and REC prediction head fixed and make only two minimal changes: (1) we replace the vision and text backbones with TinyViT and DistilRoBERTa to form a lightweight baseline (denoted as Shrunken), and (2) we insert a minimally invasive accuracy recovery module, CARM, between the “backbone outputs” and the “early fusion stage” of the joint Transformer encoder, as shown in Figure 1.
The lightweight vision backbone encodes the image into a visual token sequence $X_{\mathrm{img}} \in \mathbb{R}^{N \times d}$, and the lightweight text encoder encodes the expression into a text token sequence $X_{\mathrm{txt}} \in \mathbb{R}^{L \times d}$. CARM takes $\langle X_{\mathrm{img}}, X_{\mathrm{txt}} \rangle$ as input and outputs the enhanced $\langle \tilde{X}_{\mathrm{img}}, \tilde{X}_{\mathrm{txt}} \rangle$. Following MDETR’s early-fusion strategy, the enhanced sequences are concatenated and fed into the joint Transformer encoder, and then the decoder and REC head perform matching and regression. Therefore, the core problem is: without changing the downstream structure, how can we bring lightweight representations back into an alignable regime before cross-modal interaction?

3.2. CARM: Cross-Modal Alignment Recovery Module

As illustrated in Figure 2, CARM supplements alignable semantic anchors for both modalities before cross-modal interaction and injects them into the visual and textual memories in a controllable manner, enabling stable cross-modal alignment recovery under lightweight backbones. CARM consists of symmetric visual-prompt and text-prompt branches, coupled bidirectionally so that each modality can receive complementary priors from the other side before entering the joint Transformer.

Bidirectional Prompt Generation. Degradation caused by lightweight backbones often first manifests as “insufficient alignment cues before interaction”: visual tokens lack language priors, and text tokens lack complementary visual structure cues. CARM extracts a global summary from each modality (mean pooling in Figure 2), generates conditioned prompts via lightweight projection networks, and maintains learnable base prompts as stable prompt slots [17,18,19,20]. The text-side summary generates visual prompts, and the vision-side summary generates text prompts, forming bidirectional prompts:

$$P_v = P_v^{0} + f_{t \to v}\big(\mathrm{Pool}(X_{\mathrm{txt}})\big), \qquad P_t = P_t^{0} + f_{v \to t}\big(\mathrm{Pool}(X_{\mathrm{img}})\big). \tag{1}$$

Here, $P_v^{0}$ and $P_t^{0}$ are the learnable base prompts, and $f_{t \to v}$ and $f_{v \to t}$ are the projection networks. Intuitively, bidirectional prompts allow the visual branch to carry semantic bias about “what the expression is looking for”, while allowing the text branch to carry bias about “what structural cues exist in the image”, pulling both representations toward a more alignable state before fusion.

Prefix–Suffix Injection. The generated prompts do not replace the original tokens. Instead, they are injected as explicit prompt tokens at both the beginning and the end of each sequence (Figure 2). On the visual side, visual prompts are concatenated as both prefix and suffix to the image token sequence; similarly, on the text side, text prompts are concatenated to the text token sequence. This design is minimally invasive: it requires no change to the backbones or downstream Transformer, while enabling prompts to directly participate in attention computation and provide explicit alignment anchors before early fusion.

Cross-Attention Gate (CAG). Simple concatenation may introduce instability under lightweight representations, especially when prompts trigger responses irrelevant to the current referring target, which can be amplified by attention. To address this, CARM introduces a Cross-Attention Gate (CAG) on both sides to adaptively control the prompt injection strength [21]. The gate uses prompts as queries and the original memory as keys/values to compute prompt responses, then modulates the write-back strength via a sigmoid gate and updates the memory in residual form:

$$A = \mathrm{Attn}(Q = P,\; K = X,\; V = X), \qquad \tilde{X} = X + \sigma(W A + b) \odot \Delta A. \tag{2}$$

Here, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $\Delta A$ is the prompt-induced update written back to the memory. This turns injection from a “hard write” into a controlled recovery process: when prompts are relevant, the gate amplifies useful increments; when prompts deviate, the gate suppresses write-back. CARM finally outputs the enhanced image memory $\tilde{X}_{\mathrm{img}}$ and text memory $\tilde{X}_{\mathrm{txt}}$, which are then fed into the original joint Transformer encoder for early fusion (Figure 1).

3.3. TAPs and Prompt-Modulated Query Attention

Although bidirectional prompts and gating substantially alleviate alignment degradation caused by lightweight backbones, REC remains highly sensitive to fine-grained attribute binding, especially for attributes such as color, size, and location, which are more easily weakened by a lightweight text encoder [22,23]. We therefore introduce TAPs on the text side and further modulate decoder queries using prompt information so that recovery influences not only the encoder input stage but also decoder query selection.

Tree-of-Attributes Prompts (TAPs). As shown in Figure 2, TAPs organize attribute cues into type-aware prompt groups. Referring expressions often contain heterogeneous attributes, such as color, size, and position, which may play different roles in grounding. Instead of treating them as a flat set of independent prompts, TAPs introduce a lightweight hierarchical prior to preserve attribute-type distinctions before prompt fusion. This design helps make fine-grained attribute cues more explicit under lightweight text representations. For each attribute type, TAPs maintain learnable prompts and aggregate the corresponding attribute representations into text-side prompts. These prompts are then injected through the same prefix–suffix and gated fusion process as the text branch. In this way, TAPs provide structured attribute anchors for fine-grained grounding, while irrelevant attribute responses can still be suppressed by the CAG.

Prompt-Modulated Query Attention. As shown in Figure 3 (the query modulation interface in the lower-right of Figure 1), we further use aggregated prompt representations to modulate the positional embeddings of decoder queries so that queries are biased toward expression-relevant regions during decoding.
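To make the data flow of Sections 3.2 and 3.3 concrete, the following is a minimal NumPy sketch of Eq. (1), prefix–suffix injection, the gated write-back of Eq. (2), and the query modulation term. It is an illustration under stated assumptions, not the authors' implementation: the function names, the single linear map standing in for each projection network, the per-prompt scalar gate, the toy sizes, and the random weights are all hypothetical. In particular, Eq. (2) does not state how the prompt-shaped response $A$ is mapped back onto the memory positions; the sketch assumes the gated response is scattered back through the transposed attention weights, and that injection follows gating.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def make_prompts(base, proj, other_tokens):
    """Eq. (1): base prompts plus a projection of the other modality's summary.

    base: (L_p, d) learnable base prompts. proj: (d, L_p * d), a hypothetical
    single linear layer standing in for the projection network.
    """
    summary = other_tokens.mean(axis=0)              # mean pooling -> (d,)
    return base + (summary @ proj).reshape(base.shape)


def inject(prompts, tokens):
    """Prefix-suffix injection: prompts sit at both ends of the sequence."""
    return np.concatenate([prompts, tokens, prompts], axis=0)


def cross_attention_gate(prompts, memory, w_gate, b_gate):
    """Eq. (2), with an ASSUMED write-back through transposed attention weights."""
    d = memory.shape[1]
    weights = softmax(prompts @ memory.T / np.sqrt(d))   # (L_p, N)
    response = weights @ memory                          # A: (L_p, d)
    gate = sigmoid(response @ w_gate + b_gate)           # (L_p, 1), one value per prompt
    update = weights.T @ (gate * response)               # assumed Delta A: (N, d)
    return memory + update, gate


def query_modulation(p_v, p_t, w_q):
    """Pool both prompt sets, concatenate, and map linearly to a (d,) bias."""
    summary = np.concatenate([p_v.mean(axis=0), p_t.mean(axis=0)])  # (2d,)
    return summary @ w_q


rng = np.random.default_rng(0)
d, n_img, n_txt = 8, 6, 4          # toy sizes, not the paper's
len_v, len_t = 10, 5               # prompt lengths reported in Section 4.2
x_img = rng.normal(size=(n_img, d))
x_txt = rng.normal(size=(n_txt, d))

# The text summary conditions visual prompts; the image summary conditions text prompts.
p_v = make_prompts(rng.normal(size=(len_v, d)), 0.1 * rng.normal(size=(d, len_v * d)), x_txt)
p_t = make_prompts(rng.normal(size=(len_t, d)), 0.1 * rng.normal(size=(d, len_t * d)), x_img)

w_gate, b_gate = rng.normal(size=(d, 1)), 0.0
x_img_new, gate_v = cross_attention_gate(p_v, x_img, w_gate, b_gate)
x_txt_new, gate_t = cross_attention_gate(p_t, x_txt, w_gate, b_gate)

seq_img = inject(p_v, x_img_new)   # (n_img + 2 * len_v, d)
seq_txt = inject(p_t, x_txt_new)   # (n_txt + 2 * len_t, d)
query_pos = rng.normal(size=(5, d)) + query_modulation(p_v, p_t, rng.normal(size=(2 * d, d)))
print(seq_img.shape, seq_txt.shape, gate_v.shape, query_pos.shape)
```

With prompt lengths of 10 and 5, the injected sequences grow by 20 and 10 tokens, matching the counts given in Section 4.2. Driving the gate bias strongly negative closes the gate and leaves the memory essentially unchanged, which is the suppression behaviour the text attributes to the CAG.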
Specifically, we aggregate visual and text prompts into summary vectors, concatenate them, and map them linearly into a modulation term $\Delta q$, which is then added to the query positional embedding [24,25,26]. This low-cost design extends prompt information from “enhancing memory” to “guiding queries”, thereby further stabilizing localization decisions in the decoder.

We follow the original MDETR-style training objectives and inference pipeline. Since our focus is modular recovery, the backbones remain frozen and only CARM-related parameters are optimized; training and implementation details are provided in Section 4.

4. Experiments

4.1. Experimental Setup and Metrics

To quantify the “gap” and the “recovery,” we define Accuracy Gap and Recovery Rate:

$$\Delta = \mathrm{Acc}_{\mathrm{full}} - \mathrm{Acc}_{\mathrm{shrunken}}, \qquad \rho = \frac{\mathrm{Acc}_{\mathrm{ours}} - \mathrm{Acc}_{\mathrm{shrunken}}}{\mathrm{Acc}_{\mathrm{full}} - \mathrm{Acc}_{\mathrm{shrunken}}}. \tag{3}$$

Here, $\mathrm{Acc}(\cdot)$ denotes P@1 (IoU ≥ 0.5), and $\mathrm{Acc}_{\mathrm{ours}}$ is the accuracy of Shrunken + CARM. Both $\Delta$ and $\rho$ are computed independently for each reported split. That is, for a given split $s$ (e.g., val, test A, test B, or test), $\mathrm{Acc}_{\mathrm{full}}$, $\mathrm{Acc}_{\mathrm{shrunken}}$, and $\mathrm{Acc}_{\mathrm{ours}}$ all refer to the P@1 values on the same split $s$.

4.2. Implementation Details

We train Shrunken and Shrunken + CARM on all three datasets (RefCOCO, RefCOCO+, and RefCOCOg). Loss functions and matching strategies follow the standard set-based matching and box regression objectives in detection-style frameworks. All models are trained for 30 epochs with batch size 8. To keep the comparison focused on “accuracy recovery” rather than “retraining backbones”, we freeze TinyViT-11M and DistilRoBERTa in the lightweight setting and update only CARM and its associated projection/gating parameters. We adopt grouped learning rates, assigning higher rates to prompts and gates, to stabilize optimization. Unless otherwise specified, the joint Transformer hidden dimension is set to $d = 256$, which is also used for all CARM prompt embeddings, while other structures remain consistent with the MDETR backbone framework.
We set the visual and textual prompt lengths to $L_v = 10$ and $L_t = 5$, respectively; with prefix–suffix insertion, this adds 20 visual prompt tokens and 10 textual prompt tokens. Specifically, we searched $L_v \in \{5, 10, 15\}$ and $L_t \in \{3, 5, 8\}$ on the validation split and found that $L_v = 10$ and $L_t = 5$ offered the best accuracy–efficiency trade-off. These values were fixed for all final experiments.

4.3. Accuracy Recovery Results

Table 1 reports the results of Full, Shrunken, and Shrunken + CARM on the RefCOCO series, together with the Accuracy Gap $\Delta$, Recovery Rate $\rho$, and FLOPs. Directly replacing the original backbones with lightweight ones substantially reduces computation, but also causes consistent accuracy degradation across datasets, indicating that the compressed representations weaken fine-grained cross-modal alignment. After adding CARM, the model recovers most of the lost accuracy while still maintaining a much lower computational cost than the Full model. These results show that CARM achieves a favorable accuracy–efficiency trade-off: it introduces only moderate additional computation over the lightweight baseline yet effectively restores cross-modal alignment under lightweight backbones.

We also provide an external comparison against representative REC models and efficient REC-related works in Table 2 as a reference for performance range. The first seven methods are representative high-performance REC models, while the following three methods focus more on lightweight or efficient REC settings. Except for ours, the results are taken from the corresponding papers. Since these methods use different backbones, pretraining data, training protocols, and reported resource metrics, the numbers should be viewed as an external reference rather than strictly controlled comparisons.
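As a worked check of Eq. (3), the short Python sketch below recomputes $\Delta$ and $\rho$ from the P@1 values reported in Table 1 for four splits, together with two FLOPs ratios from the same table. The helper names are illustrative and are not part of the paper's code.

```python
def accuracy_gap(acc_full, acc_shrunken):
    """Delta in Eq. (3): accuracy lost by switching to lightweight backbones."""
    return acc_full - acc_shrunken


def recovery_rate(acc_full, acc_shrunken, acc_ours):
    """rho in Eq. (3): fraction of the gap recovered on the same split."""
    gap = accuracy_gap(acc_full, acc_shrunken)
    if gap <= 0:
        raise ValueError("recovery rate is undefined without a positive accuracy gap")
    return (acc_ours - acc_shrunken) / gap


# (Full, Shrunken, Shrunken + CARM) P@1 values copied from Table 1.
TABLE1 = {
    "RefCOCO val": (86.75, 84.51, 86.23),
    "RefCOCO test B": (81.41, 79.92, 81.43),
    "RefCOCO+ val": (79.52, 78.03, 79.40),
    "RefCOCOg val": (81.64, 78.98, 80.92),
}

for split, (full, shrunken, ours) in TABLE1.items():
    gap = accuracy_gap(full, shrunken)
    rho = recovery_rate(full, shrunken, ours)
    print(f"{split}: gap = {gap:.2f}, recovery = {100 * rho:.1f}%")

# FLOPs (G) from Table 1: Full, Shrunken, Shrunken + CARM.
flops_full, flops_shrunken, flops_ours = 105.15, 50.25, 62.87
saving_vs_full = 1 - flops_ours / flops_full
overhead_vs_shrunken = (flops_ours - flops_shrunken) / flops_shrunken
print(f"FLOPs saved vs. Full: {100 * saving_vs_full:.1f}%")
print(f"FLOPs added vs. Shrunken: {100 * overhead_vs_shrunken:.1f}%")
```

On the RefCOCO val split this gives a gap of 2.24 and a recovery rate of 76.8%, and on RefCOCO test B it gives 101.3%, the over-recovery case noted under Table 1. The FLOPs ratios work out to roughly a 40.2% saving relative to Full and a 25.1% overhead relative to Shrunken.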
In the external comparison of Table 2, Shrunken + CARM achieves competitive performance with lightweight backbones and only 21.46 M trainable parameters, showing its effectiveness in recovering accuracy after backbone lightweighting.

Overall, on the validation splits of RefCOCO, RefCOCO+, and RefCOCOg, CARM recovers approximately 77%, 92%, and 73% of the accuracy gap, respectively, while the recovery rates on other splits are reported separately in Table 1. In terms of computation, Shrunken + CARM reduces FLOPs from 105.15 G in the Full model to 62.87 G, with only a moderate overhead over the Shrunken baseline (50.25 G). With the backbone and downstream architecture frozen, CARM achieves this recovery by optimizing only 21.46 M trainable parameters, compared with about 176.58 M parameters required by standard end-to-end training on the MDETR backbone. These results indicate that CARM provides an effective recovery strategy that jointly considers accuracy, computation, and training efficiency without extra pretraining or downstream architectural changes.

4.5. Visualization and Analysis

Figure 4 presents both successful and failed qualitative examples. In the first three cases, attention under Prompt OFF is relatively scattered or biased toward distracting regions, while CARM makes the attention more concentrated on the targets described by the referring expressions. The CAG maps further show that different prompt indices receive different gate values, indicating that prompt information is selectively injected rather than uniformly applied. This supports the ablation results that prompts improve cross-modal alignment and gating enhances recovery stability.

The last row shows a representative failure case. The expression “a person wearing a white shirt, black pants, and no hat” should refer to the middle background man without a hat, but the model incorrectly predicts the foreground batter.
This case is challenging because it contains multiple visually similar people, a negated attribute cue, and a visually salient distractor. It suggests that although CARM improves coarse-grained cross-modal alignment, it may still be insufficient for fine-grained exclusion based on negated attributes. Future work will further explore explicit modeling of discriminative and negation-aware attributes.

5. Conclusions

We studied the performance drop that occurs when detection-style multimodal Transformers for REC are switched to lightweight backbones. This degradation is closely related to weakened cross-modal alignment and less stable fine-grained attribute grounding before fusion. We proposed CARM, a minimally invasive Cross-modal Alignment Recovery Module that enhances lightweight representations through bidirectional prompts and selectively writes prompt-induced updates via a Cross-Attention Gate; Tree-of-Attributes Prompts and lightweight query modulation further strengthen attribute grounding and decoding focus. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that CARM recovers most of the accuracy gap introduced by backbone lightweighting while preserving clear advantages in model scale and computational cost, offering a practical path to compact yet accurate REC models without full retraining or large-scale pretraining.

Nevertheless, CARM still introduces moderate additional computation over the lightweight baseline, and its prompt-based recovery remains limited in challenging cases involving multiple similar candidates, negated attributes, or strong visual distractors. Future work will explore more explicit fine-grained and negation-aware reasoning while maintaining the modular and lightweight nature of the framework.

Author Contributions

Conceptualization, Q.Z.; methodology, Q.Z. and G.Z.; software, G.Z.; validation, G.Z., Q.Z., M.S. and X.Z.; formal analysis, Q.Z.; investigation, G.Z.
and J.W.; data curation, Q.Z.; writing (original draft preparation), Q.Z.; writing (review and editing), G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Plan Innovation of Hubei Province (Grant No. 2024BAA005), the Hubei Provincial Science and Technology Program Project (Grant No. 2025BAA002), and the Hubei Key Laboratory of Intelligent Robot Innovation Fund (Wuhan Institute of Technology) (Grant No. HBIRL 202503). The APC was jointly funded by the above grants.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

Author Meng Song was employed by the company Wuhan Yangtze Computing Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A Fast and Accurate One-Stage Approach to Visual Grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2019; pp. 4683–4693.
2. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2023; pp. 4015–4026.
3. Shridhar, M.; Manuelli, L.; Fox, D. CLIPort: What and Where Pathways for Robotic Manipulation. In Proceedings of the 5th Conference on Robot Learning; PMLR: Auckland, New Zealand, 2022; Volume 164, pp. 894–906. Available online: https://proceedings.mlr.press/v164/shridhar22a.html (accessed on 25 May 2026).
4. Zhang, J.; Tu, L.; Zhang, Y.; Xie, L.; Xu, M.; Ming, D.; Yan, Y.; Yin, E. An Accuracy Enhanced Vision Language Grounding Method Fused with Gaze Intention. Electronics 2023, 12, 5007.
5. Chen, Y.; Su, L.; Chen, L.; Lin, Z. LCV2: A Universal Pretraining-Free Framework for Grounded Visual Question Answering. Electronics 2024, 13, 2061.
6. Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR: Modulated Detection for End-to-End Multi-Modal Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2021; pp. 1780–1790.
7. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
8. Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. TinyViT: Fast Pretraining Distillation for Small Vision Transformers. In Computer Vision—ECCV 2022; Springer Nature: Cham, Switzerland, 2022; pp. 68–85.
9. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2019, arXiv:1910.01108.
10. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
11. Liu, D.; Zhang, H.; Wu, F.; Zha, Z.-J. Learning to Assemble Neural Module Tree Networks for Visual Grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2019; pp. 4673–4682.
12. Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2021; pp. 1769–1779.
13. Wang, Y.; Tian, Z.; Qin, Z.; Zhou, S.; Wang, L. RefDetector: A Simple Yet Effective Matching-Based Method for Referring Expression Comprehension. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8033–8041.
14. Lu, M.; Li, R.; Feng, F.; Ma, Z.; Wang, X. LGR-NET: Language Guided Reasoning Network for Referring Expression Comprehension. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7771–7784.
15. Wang, Y.; Ding, H.; He, S.; Jiang, X.; Wei, B.; Liu, J. Hierarchical Alignment-Enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8042–8050.
16. Liu, X.; Liu, T.; Huang, S.; Xin, Y.; Hu, Y.; Qin, L.; Wang, D.; Wu, Y.; Chen, H. M2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension. IEEE Trans. Circuits Syst. Video Technol. 2026, 36, 1341–1354.
17. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Volume 1: Long Papers; IEEE: New York City, NY, USA, 2021; pp. 4582–4597.
18. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; IEEE: New York City, NY, USA, 2021; pp. 3045–3059.
19. Liu, X.; Ji, K.; Fu, Y.; Du, Z.; Yang, Z.; Tang, J. P-Tuning: Prompt Tuning Can Be Comparable to Fine-Tuning across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers; IEEE: New York City, NY, USA, 2022; pp. 61–68.
20. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to Prompt for Vision-Language Models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
21. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736.
22. Huang, S.; Hui, T.; Liu, S.; Li, G.; Wei, Y.; Han, J.; Liu, L.; Li, B. Referring Image Segmentation via Cross-Modal Progressive Comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2020; pp. 10488–10497.
23. Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H.S. LAVT: Language-Aware Vision Transformer for Referring Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2022; pp. 18155–18165.
24. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for Fast Training Convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2021; pp. 3651–3660.
25. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR. In Proceedings of the International Conference on Learning Representations; IEEE: New York City, NY, USA, 2022; Available online: https://openreview.net/forum?id=oMI9PjOb9Jl (accessed on 25 May 2026).
26. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2022; pp. 13619–13627.
27. Kazemzadeh, S.; Ordonez, V.; Matten, M.; Berg, T.L. ReferItGame: Referring to Objects in Photographs of Natural Scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 787–798.
28. Yu, L.; Poirson, P.; Yang, S.; Berg, A.C.; Berg, T.L. Modeling Context in Referring Expressions. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 69–85.
29. Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.L.; Murphy, K. Generation and Comprehension of Unambiguous Object Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2016; pp. 11–20.
30. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning; PMLR: Long Beach, CA, USA, 2019; Volume 97, pp. 6105–6114. Available online: https://proceedings.mlr.press/v97/tan19a.html (accessed on 25 May 2026).
31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2021; pp. 10012–10022.
32. Yang, L.; Xu, Y.; Yuan, C.; Liu, W.; Li, B.; Hu, W. Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York City, NY, USA, 2022; pp. 9499–9508.
33. Deng, J.; Yang, Z.; Liu, D.; Chen, T.; Zhou, W.; Zhang, Y.; Li, H.; Ouyang, W. TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13636–13652.
34. Shi, F.; Gao, R.; Huang, W.; Wang, L. Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1181–1198.
35. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In Computer Vision—ECCV 2024; Springer Nature: Cham, Switzerland, 2024; pp. 38–55.
36. Dai, M.; Yang, L.; Xu, Y.; Feng, Z.; Yang, W. SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-Modal Fusion. Adv. Neural Inf. Process. Syst. 2024, 37, 121670–121698.
37. Dai, M.; Cheng, W.; Zhuang, J.; Liu, J.J.; Zhao, H.; Feng, Z.; Yang, W. PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2025; pp. 7058–7068. Available online: https://openaccess.thecvf.com/content/ICCV2025/html/Dai_PropVG_End-to-End_Proposal-Driven_Visual_Grounding_with_Multi-Granularity_Discrimination_ICCV_2025_paper.html (accessed on 25 May 2026).
38. Ho, C.-H.; Appalaraju, S.; Jasani, B.; Manmatha, R.; Vasconcelos, N. YORO—Lightweight End to End Visual Grounding. In Computer Vision—ECCV 2022 Workshops; Springer: Cham, Switzerland, 2023; pp. 3–23.
39. Liu, T.; Xu, Z.; Hu, Y.; Shi, L.; Wang, Z.; Yin, Q. MaPPER: Multimodal Prior-Guided Parameter Efficient Tuning for Referring Expression Comprehension. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 4984–4994.
40. Ouyang, S.; Niu, Z.; Wang, H.; Luo, H.; Tian, Z.; Wang, L. Region-Aware Anchoring Mechanism for Efficient Referring Visual Grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York City, NY, USA, 2025; pp. 24192–24202.

Figure 1. Overall framework of the proposed lightweight REC pipeline. CARM is inserted between lightweight backbones and the MDETR-style joint Transformer to recover cross-modal alignment.

Figure 2. Architecture of CARM. Bidirectional prompts, cross-attention gating, and TAPs jointly enhance lightweight visual and textual representations before fusion.

Figure 3. Prompt-modulated query attention. Aggregated prompt features are mapped to a bias term and added to decoder query positional embeddings.

Figure 4. Visualization of successful and failed cases. Each row shows the prediction result, attention maps under Prompt OFF and Prompt ON, and CAG coefficients. The first three rows show successful recovery cases, while the last row shows a representative failure case.

Table 1. Main results on RefCOCO, RefCOCO+, and RefCOCOg (P@1, IoU ≥ 0.5).

| Method/Metric | Backbone | FLOPs (G) | RefCOCO val | RefCOCO Test A | RefCOCO Test B | RefCOCO+ val | RefCOCO+ Test A | RefCOCO+ Test B | RefCOCOg val | RefCOCOg Test |
|---|---|---|---|---|---|---|---|---|---|---|
| MDETR [6] | ResNet-101 | 105.15 | 86.75 | 89.58 | 81.41 | 79.52 | 84.09 | 70.62 | 81.64 | 80.89 |
| Shrunken | TinyViT | 50.25 | 84.51 | 87.25 | 79.92 | 78.03 | 82.23 | 68.82 | 78.98 | 78.43 |
| Shrunken + CARM (Ours) | TinyViT | 62.87 | 86.23 | 88.76 | 81.43 | 79.40 | 83.63 | 70.53 | 80.92 | 79.98 |
| Δ = Accuracy Gap | | | 2.24 | 2.33 | 1.49 | 1.49 | 1.86 | 1.80 | 2.66 | 2.46 |
| ρ = Recovery Rate | | | 76.8% | 64.8% | 101.3% | 91.9% | 75.3% | 95.0% | 72.9% | 63.4% |

Here, Δ and ρ denote the accuracy gap and gap recovery ratio, respectively, computed column-wise for each dataset split. ρ > 100% indicates over-recovery beyond the Full model. FLOPs are model-level estimates under a unified input setting.

Table 2. External comparison with representative REC models and efficient REC baselines (P@1, IoU ≥ 0.5).

Columns: Method; Publication; Backbone; Model Scale/Efficiency; RefCOCO (val, Test A, Test B); RefCOCO+ (val, Test A, Test B); RefCOCOg (val, Test).

Model scale and efficiency indicators are reported under “Model Scale/Efficiency”, where “ckpt”, “model params”, and “tuned/trainable params” denote checkpoint size, total model parameters, and optimized parameters, respectively.

Table 3. Stepwise component ablation (P@1, IoU ≥ 0.5).

| Bi-Prompt | CAG | TAPs | Query Mod. | RefCOCO val | RefCOCO+ val | RefCOCOg val |
|---|---|---|---|---|---|---|
| – | – | – | – | 84.51 | 78.03 | 78.98 |
| ✓ | – | – | – | 85.23 | 78.56 | 79.56 |
| ✓ | ✓ | – | – | 85.61 | 78.77 | 79.83 |
| ✓ | ✓ | ✓ | – | 85.96 | 79.17 | 80.65 |
| ✓ | ✓ | ✓ | ✓ | 86.23 | 79.40 | 80.92 |

“✓/–” indicates enabling/removing each component (Bi-Prompt, CAG, TAPs, Query Mod.).

Table 4. Prompt direction ablation (P@1, IoU ≥ 0.5).

| Prompt Direction | RefCOCO val | RefCOCO+ val | RefCOCOg val |
|---|---|---|---|
| Shrunken | 84.51 | 78.03 | 78.98 |
| Text → Visual only | 84.93 | 78.38 | 79.46 |
| Visual → Text only | 84.76 | 78.24 | 79.28 |
| Bidirectional (Text ↔ Visual) | 85.23 | 78.56 | 79.56 |

“Text → Visual” injects language priors into the visual branch only; “Visual → Text” injects visual priors into the text branch only; “Text ↔ Visual” is bidirectional prompting.

Table 5. Gating stability analysis (P@1, IoU ≥ 0.5).

| Setting | RefCOCO (val) | Failure Rate |
|---|---|---|
| Bi-Prompt w/o CAG | 85.23 ± 0.20 | 1/5 |
| Bi-Prompt + CAG | 85.61 ± 0.10 | 0/5 |

“w/o” removes the corresponding module. Failure rate counts how often training diverges or becomes significantly worse than the Shrunken baseline across multiple random seeds.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Zheng, Gengsheng, Qiang Zhang, Meng Song, Xinghu Zhang, and Jianhua Wang. 2026. "CARM: Cross-Modal Alignment Recovery for Lightweight Referring Expression Comprehension" Electronics 15, no. 12: 2509.
https://doi.org/10.3390/electronics15122509