Text-Guided Geometric Relation Parsing with Logic Regularization

Open Access Article

by Pengpeng Jian, Xuhui Zhang *, Lei Wu and Quanhong Sun

Information Engineering Institute, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2460; https://doi.org/10.3390/electronics15112460
Submission received: 14 May 2026 / Revised: 2 June 2026 / Accepted: 2 June 2026 / Published: 4 June 2026

Abstract

Geometric relation parsing is a prerequisite for automated geometry problem solving, especially when diagram interpretation depends jointly on visual appearance and textual conditions. In this study, we examine a text-conditioned parsing setting derived from PGDP5K and propose a lightweight parser with atomic cue extraction, iterative visual–semantic feedback, and differentiable logic regularization. Because the active high-level labels are derived through a rule-based weak-supervision protocol, the results should be interpreted as parser-level evidence under Ext-PGDP5K rather than proof of general geometric semantic understanding. The nominal label space contains five candidate relations, while the current evaluation focuses on four active relations with positive instances: Intersect, Parallel, Perpendicular, and Bisect. Compared with text-only, image-only, global-fusion, and shuffled-text controls, the proposed parser improves Edge-F1 and Macro-F1, with the clearest gains for Parallel and Perpendicular. Ablations show that the atomic probe is the main source of improvement, while logic regularization and feedback exhibit non-monotonic interactions.
Although limited by weak labels, lexical cues, and the absence of downstream solver validation, this study provides a reproducible protocol-aligned testbed for analyzing text-conditioned relation prediction and low-order logic regularization in geometric diagram parsing.

2. Related Work

2.1. Geometric Diagram Parsing and Automated Problem Solving

Research on geometry understanding has progressed along two lines: diagram parsing (detection and inference of relations) and full problem solving (symbolic reasoning and proof programs) [4]. Robust parsers must resolve ambiguous cues, preserve relational structure, and expose outputs that remain compatible with formal reasoning. FGeo-Parser also emphasizes the role of autoformalization in plane geometric problem solving [5]. For solver-oriented systems, the final objective is usually the numerical answer, proof trace, or formalized solution program. Recent solver-oriented systems have demonstrated that formal reasoning and learned search can solve challenging geometry problems when reliable symbolic representations are available [6]. Inter-GPS further shows that formal language and symbolic reasoning can make geometry problem solving more interpretable [7]. LANS introduces layout-aware neural solving for plane geometry problems, highlighting the importance of diagram layout in solver-oriented pipelines [8]. In contrast, parser-oriented studies evaluate whether the diagram and text have been converted into a reliable intermediate representation. This distinction is important for the present work because relation-level errors may be hidden in an end-to-end solver if the final answer happens to remain correct. We therefore evaluate Edge-F1, Macro-F1, and logic-violation behavior directly at the relation-parsing stage. Other recent work has explored solving geometry problems with parsed clauses extracted from diagrams [9].
AutoGPS represents another solver-oriented direction that combines multimodal formalization with deductive reasoning [10]. Diagram formalization has also been used to enhance multimodal geometry problem solving systems [11]. Recent work on Euclidean geometry formalization further indicates that structured intermediate representations are important for reliable theorem-level reasoning [12].

This parser-level perspective also determines the choice of baselines. Existing solver-oriented systems usually optimize different outputs, such as formalized clauses, theorem-guided proof traces, or final numerical answers, rather than relation-level edge predictions. Directly comparing such systems with a relation parser would confound relation prediction with downstream solving ability, so they are discussed above but not used as direct main baselines. Instead, this study uses protocol-aligned controlled baselines, including text-only, image-only, global-fusion, shuffled-text, and component-level ablations, so that all methods are evaluated under the same candidate primitive pairs and active relation labels. These controlled baselines are intended to isolate the effects of modality, paired text, atomic cue extraction, feedback, and logic regularization under a shared candidate-edge protocol. They do not establish competitiveness with stronger external parser-oriented systems. Re-implementing PGDPNet-style parser baselines or connecting the parser to solver-level systems under a shared evaluation interface remains an important direction for future comparison.

2.2. Multimodal Fusion in the Geometric Domain

Multimodal fusion is delicate in geometric diagrams because the visual domain is sparse and precision-sensitive. IconQA further shows that abstract diagrams require visual–language reasoning beyond ordinary image understanding [13]. GeoQA illustrates that geometry question answering requires coordinated reasoning over textual conditions, diagrams, and numerical structures [14]. A recurring weakness of many fusion strategies is that text is used only globally. Vision–language pretraining methods such as LXMERT demonstrate the effectiveness of cross-modal representation learning in general multimodal tasks [15]. mPLUG improves vision–language learning through cross-modal skip connections, but such general fusion designs do not directly address relation-specific geometric grounding [16]. Our atomic semantic probe is motivated by the observation that identifying relation-relevant textual atoms can guide cross-modal attention more effectively than a single global vector.

However, text-conditioned modeling also introduces the risk of language-prior dependence [17]. If a relation label can be partially inferred from frequently occurring phrases or annotation artifacts, a model may appear multimodal while relying only weakly on diagram grounding. Shortcut learning can lead models to rely on superficial correlations rather than robust multimodal grounding [18]. For this reason, modality-control experiments are necessary. In addition to text-only and image-only baselines, the shuffled-text setting used in this study tests whether the parser benefits from correctly paired textual cues rather than arbitrary textual priors or dataset-level language shortcuts.

2.3. Neuro-Symbolic Learning and Logic Regularization

Neuro-symbolic learning seeks to combine neural flexibility with symbolic rigor [19]. Semantic loss further formalizes how symbolic knowledge can be encoded as differentiable learning signals [20].
Our work follows an objective-level path, encoding symmetry, transitivity, and mutual exclusivity as soft constraints so that the parser is penalized when its predicted relation graph drifts away from structurally admissible configurations. Logic rules have been used to guide neural models through regularization-style learning objectives [21]. Semantic-based regularization provides a related framework for incorporating symbolic constraints into learning objectives [22]. The logic component in our model should be understood as soft regularization rather than complete symbolic theorem proving. DL2 also explores training and querying neural networks with logical constraints [23]. Probabilistic soft logic provides another example of using continuous relaxations for rule-based reasoning [24]. Symmetry, transitivity, and mutual exclusivity provide local structural biases during training, but they do not guarantee a globally consistent geometric proof. This distinction is important because the empirical results later show a trade-off between predicting more positive relations and increasing rule-defined conflicts.

3. Materials and Methods

3.1. Overall Architecture

Task definition. Given an input pair $(I, T)$, where $I$ is the diagram image and $T$ is the aligned problem text, our goal is to predict a relation graph over detected primitives. The nominal label space is

$$R_{nom} = \{\mathrm{Intersect}, \mathrm{Tangent}, \mathrm{Parallel}, \mathrm{Perpendicular}, \mathrm{Bisect}\}.$$

In the current derived split, Tangent has no positive instances, so the active evaluation set is

$$R_{act} = \{\mathrm{Intersect}, \mathrm{Parallel}, \mathrm{Perpendicular}, \mathrm{Bisect}\}.$$

Let $E$ denote the candidate edge set and let $P \in [0,1]^{N \times N \times |R_{act}|}$ denote the predicted relation tensor, where $N$ is the number of detected primitives and $P_{ij}^{r}$ is the probability that relation $r$ holds between primitives $i$ and $j$.
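To make the tensor notation concrete, the following minimal Python sketch stores the relation tensor as a nested list of shape N × N × |R_act| and thresholds it over a candidate edge set. The relation index order, the helper names `empty_relation_tensor` and `decode_graph`, the toy probabilities, and the 0.5 threshold are illustrative assumptions and are not taken from the released implementation.

```python
# Active relations named in the text; the index order is an assumption.
ACTIVE_RELATIONS = ("Intersect", "Parallel", "Perpendicular", "Bisect")


def empty_relation_tensor(num_primitives):
    """Return a nested list P of shape N x N x |R_act| filled with 0.0."""
    return [
        [[0.0 for _ in ACTIVE_RELATIONS] for _ in range(num_primitives)]
        for _ in range(num_primitives)
    ]


def decode_graph(P, candidate_edges, threshold=0.5):
    """Keep (i, j, relation) entries on candidate edges whose probability
    reaches the threshold (0.5 is a placeholder value)."""
    graph = []
    for i, j in candidate_edges:
        for r, name in enumerate(ACTIVE_RELATIONS):
            if P[i][j][r] >= threshold:
                graph.append((i, j, name))
    return graph


# Toy example with three primitives and hand-set probabilities.
P = empty_relation_tensor(3)
P[0][1][1] = 0.9  # primitives 0 and 1: Parallel
P[1][2][2] = 0.7  # primitives 1 and 2: Perpendicular
P[0][2][0] = 0.2  # primitives 0 and 2: weak Intersect score, below threshold
candidate_edges = [(0, 1), (1, 2), (0, 2)]
graph = decode_graph(P, candidate_edges)
```

Only entries on candidate edges are inspected, which mirrors the statement that predictions are made over the candidate edge set rather than over all primitive pairs.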
The Ext-PGDP5K protocol is derived from the official PGDP5K split by constructing candidate primitive pairs and assigning high-level relation labels through rule-based parsing of diagram primitives and aligned textual cues. This rule-based label construction is related to weak supervision, in which labeling functions are used to generate training signals at scale [28]. Therefore, the model is trained on weak labels generated by the derived protocol rather than independently verified human semantic annotations. The derived labels, split identifiers, and label-derivation scripts are released with the accompanying project materials. Because textual cues are used both in label derivation and as model input, potential text-prior leakage is evaluated through text-only and shuffled-text controls and further discussed as a validity limitation in Section 4.6.

The overall framework contains three tightly coupled stages: multimodal atomic perception, iterative visual–semantic feedback fusion, and logic-regularized graph reasoning. Figure 2 provides a high-level overview, and Algorithm 1 summarizes the training and inference pipeline. The central idea is to expose explicit semantic cues first, use them to guide visual relation reasoning, and then regularize the resulting graph with geometric structure.

Algorithm 1. Training and inference pipeline of the proposed parser

Input: training set $D$, active relation set $R_{act}$, atomic rule set $A$, model parameters $\theta$
Output: trained parser $F_{\theta}$ and predicted relation graph $\hat{G}$

Training phase:
1. For each sample $(I, T)$ in $D$, detect primitives and extract node features $V$.
2. Construct candidate edge set $E$ and generate weak atomic labels $z$ from $T$ using $A$.
3. Encode $T$ with DistilBERT to obtain token features $H$ and atomic probabilities $a$.
4. Project $a$ to the initial semantic query $q^{(0)}$ and apply semantic-guided cross-attention over $V$.
5. Run graph reasoning to obtain relation logits $S^{(1)}$ and probabilities $P^{(1)}$.
6. Compute feedback $f^{(1)} = \mathrm{MLP}_{fb}(\mathrm{Pool}(S^{(1)}))$ and update $q^{(1)} = \mathrm{LayerNorm}(q^{(0)} + f^{(1)})$.
7. Re-apply cross-attention and graph reasoning to obtain refined logits $S^{(2)}$ and probabilities $P^{(2)}$.
8. Compute $L_{sup}$, $L_{atom}$, $L_{sym}$, $L_{trans}$, and $L_{mutex}$, then update $\theta$ with AdamW.

Inference phase:
9. Given a test pair $(I, T)$, detect primitives, encode text, and obtain $q^{(0)}$.
10. Perform two rounds of semantic-guided attention and graph reasoning.
11. Threshold the final relation probabilities to produce $\hat{G}$.

3.2. Multimodal Atomic Perception

Visual Stream: We adopt ResNet-50 as the visual backbone [29]. An FPN detector is used to support multi-scale primitive perception [30]. In the present study, relation prediction is evaluated on the derived candidate primitives provided by the PGDP5K-based preprocessing pipeline, as recorded in the Ext-PGDP5K split files. Candidate edges are constructed from these preprocessed primitives, and all valid candidate primitive pairs recorded in the split files are retained for relation prediction. We do not perform additional negative sampling; candidate edge–relation entries without derived positive labels are treated as negative instances in the masked binary relation loss. This setting leads to severe class imbalance, so Full Relation Accuracy (FRA) is reported only as an auxiliary metric, while Edge-F1 and Macro-F1 are emphasized for relation-level evaluation. The reported metrics are computed on the resulting candidate primitive pairs; therefore, the evaluation focuses on relation parsing rather than standalone primitive detection. This design isolates relation-level prediction errors from primitive-detection errors.

Text Stream (Explicit Atomic Semantic Probe): Instead of encoding the full problem statement into a single undifferentiated sentence vector, we explicitly model a compact set of geometric atoms.
The seed vocabulary contains six cue families centered on parallel, perpendicular, tangent, bisector, angle-bisector, and intersection semantics. The atomic weak labels are generated by matching normalized cue expressions to these predefined cue families. This procedure is closer to lexical cue extraction than to full natural-language semantic parsing. The vocabulary includes textual forms, symbolic forms, and common variants; for example, “parallel”, “//”, and “||” are all normalized to the Parallel atom. The cue vocabulary and normalization rules are available in the project repository. This design makes the text stream efficient and interpretable, but it may be brittle to paraphrases, implicit relation descriptions, and syntactic forms not covered by the seed vocabulary or normalization rules.

The DistilBERT encoder is kept frozen in all experiments [31]. Only the task-specific projection, attention, feedback, and edge-classification modules are trained. This setting keeps the trainable parameter count low and ensures that the reported parameter numbers reflect the parser components rather than the pretrained language encoder.

3.3. Iterative Visual–Semantic Feedback Fusion

A single forward pass is often insufficient for geometry parsing because the most informative visual evidence may become apparent only after the model forms an initial relational hypothesis. We therefore adopt a two-round iterative feedback mechanism. Two rounds are used as the default setting because the validation-set and test-set sensitivity analyses show that this configuration provides the best trade-off between relation-prediction performance and computational cost. Additional feedback rounds are analyzed in the sensitivity experiment rather than assumed to be monotonically beneficial. Following the indexing of Algorithm 1, where round $t$ produces the logits $S^{(t)}$, the feedback update is

$$f^{(t)} = \mathrm{MLP}_{fb}\big(\mathrm{Pool}(S^{(t)})\big), \qquad q^{(t)} = \mathrm{LayerNorm}\big(q^{(t-1)} + f^{(t)}\big) \tag{1}$$

3.4. Logic Consistency Regularization

The following logic terms are used as differentiable soft regularizers during training. They are auxiliary soft losses, not hard constraints, and they do not guarantee theorem-level consistency or global geometric validity of the final graph. Their purpose is to bias the parser away from local rule violations and to provide a measurable conflict signal through the Logic Violation Rate (LVR).

Let $y_{ij}^{r}$ denote the binary supervision label for relation $r$ on candidate edge $(i, j)$, and let $z_{c}$ denote the weak label for the $c$-th atomic cue family. We define the supervised relation loss and the atomic probe loss as follows:

$$L_{sup} = -\frac{1}{|\Omega|} \sum_{(i,j,r) \in \Omega} \left[ y_{ij}^{r} \log P_{ij}^{r} + \left(1 - y_{ij}^{r}\right) \log\left(1 - P_{ij}^{r}\right) \right] \tag{2}$$

$$L_{atom} = -\frac{1}{C} \sum_{c=1}^{C} \left[ z_{c} \log a_{c} + \left(1 - z_{c}\right) \log\left(1 - a_{c}\right) \right] \tag{3}$$

Here $\Omega$ denotes the set of candidate edge–relation entries included in the masked relation loss, $C$ is the number of atomic cue families, and $a_{c}$ is the predicted probability of the $c$-th atom.

For undirected relations such as Parallel, Perpendicular, and Intersect, collected in $R_{sym}$, the symmetry loss is defined as:

$$L_{sym} = \frac{1}{|E|\,|R_{sym}|} \sum_{(i,j) \in E} \sum_{r \in R_{sym}} \left( P_{ij}^{r} - P_{ji}^{r} \right)^{2} \tag{4}$$

The transitivity term is computed over sampled local triplets from the candidate edge graph rather than over all possible primitive triples. This sampling strategy reduces computational cost and keeps the regularizer focused on local graph consistency. Over sampled triplets $T_{tri}$, we use the Łukasiewicz-style relaxation for transitivity of the Parallel relation:

$$L_{trans} = \frac{1}{|T_{tri}|} \sum_{(i,j,k) \in T_{tri}} \max\left(0,\; P_{ij}^{para} + P_{jk}^{para} - P_{ik}^{para} - 1\right) \tag{5}$$

With $M_{excl}$ denoting the set of mutually exclusive relation pairs, the mutual-exclusivity loss is defined as:

$$L_{mutex} = \frac{1}{|E|\,|M_{excl}|} \sum_{(i,j) \in E} \sum_{(r_{a}, r_{b}) \in M_{excl}} P_{ij}^{r_{a}} P_{ij}^{r_{b}} \tag{6}$$

The final training objective is:

$$L = L_{sup} + \lambda_{atom} L_{atom} + \lambda_{logic} \left( L_{sym} + L_{trans} + L_{mutex} \right) \tag{7}$$

In the reported implementation, $\lambda_{atom}$ was set to 0.2 and $\lambda_{logic}$ was set to 0.1. The same values were used across the controlled variants unless the corresponding component was ablated.

3.5. Design Rationale and Computational Discussion

The logic layer is intentionally low-order rather than theorem-complete: the current Ext-PGDP5K labels do not provide the typed supervision needed for stronger angle-, ratio-, or polygon-level constraints.

4. Results and Discussion

4.1. Experimental Design and Evaluation Protocol

Dataset and task setting. We use the official PGDP5K split, consisting of 3500 training samples, 500 validation samples, and 1000 test samples. The nominal label space contains five candidate relations, but the current main evaluation is defined over four active relations with positive instances: Intersect, Parallel, Perpendicular, and Bisect. Tangent is retained as an audited candidate relation but excluded from the current quantitative evaluation.

We report Edge-F1, Macro-F1, Full Relation Accuracy (FRA), and Logic Violation Rate (LVR). Edge-F1 is computed as micro-F1 over valid edge–relation predictions, whereas Macro-F1 averages F1 across the four active relations. FRA measures graph-level exact matching and is therefore reported as an auxiliary metric because positive high-level relations are sparse. LVR measures the proportion of predicted edges or local triplets that violate predefined symmetry, transitivity, or mutual-exclusivity rules. Edge-F1, Macro-F1, FRA, and LVR are reported in percentage form unless otherwise specified. The prediction threshold is selected on the validation split by maximizing Macro-F1 and is then fixed for all test-set evaluations.

All reported comparisons are conducted within the same derived Ext-PGDP5K candidate-edge protocol. This controlled setting is not intended to replace broader solver-level benchmarks; rather, it is designed to make modality effects, text pairing, threshold behavior, and local consistency conflicts directly observable at the relation-graph level.
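The metric definitions and the threshold-selection rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: positives are represented as `(sample_id, i, j, relation)` tuples, FRA is read as the fraction of samples whose predicted positive set exactly equals the gold set, and all function names, toy scores, and the threshold grid are hypothetical rather than taken from the released evaluation scripts.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def confusion(gold, pred, relation=None):
    """Count TP, FP, FN over positive (sample, i, j, relation) entries."""
    if relation is not None:
        gold = {e for e in gold if e[3] == relation}
        pred = {e for e in pred if e[3] == relation}
    return len(gold & pred), len(pred - gold), len(gold - pred)


def edge_f1(gold, pred):
    """Micro-F1 pooled over all edge-relation entries."""
    return f1_from_counts(*confusion(gold, pred))


def macro_f1(gold, pred, relations):
    """Unweighted mean of per-relation F1."""
    scores = [f1_from_counts(*confusion(gold, pred, r)) for r in relations]
    return sum(scores) / len(relations)


def fra(gold, pred, sample_ids):
    """Fraction of samples whose predicted positive set equals the gold set."""
    def per_sample(entries, s):
        return {e for e in entries if e[0] == s}
    exact = sum(per_sample(gold, s) == per_sample(pred, s) for s in sample_ids)
    return exact / len(sample_ids)


def select_threshold(scores, gold, relations, grid):
    """Return the grid value with the highest Macro-F1 (first one on ties).
    In practice `scores` and `gold` would come from the validation split."""
    def predict(t):
        return {entry for entry, p in scores.items() if p >= t}
    return max(grid, key=lambda t: macro_f1(gold, predict(t), relations))


# Toy data: four samples, only two of which contain a gold positive.
relations = ("Parallel", "Perpendicular")
gold = {(0, 0, 1, "Parallel"), (1, 0, 1, "Perpendicular")}
scores = {
    (0, 0, 1, "Parallel"): 0.8,
    (1, 0, 1, "Perpendicular"): 0.6,
    (2, 0, 1, "Parallel"): 0.4,
    (3, 0, 1, "Perpendicular"): 0.1,
}
best_t = select_threshold(scores, gold, relations, grid=[0.3, 0.5, 0.7])

# An all-negative predictor still matches every sample without gold positives.
empty_pred = set()
sparse_fra = fra(gold, empty_pred, sample_ids=[0, 1, 2, 3])
sparse_edge_f1 = edge_f1(gold, empty_pred)
```

In this toy setting the all-negative predictor reaches an exact-match rate of one half while its Edge-F1 is zero, which illustrates why FRA is treated only as an auxiliary metric when positives are sparse.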
The comparisons reported under this protocol should not be interpreted as direct comparisons with solver-oriented systems whose outputs are final answers, proof traces, or formal solution programs. The absence of independently annotated relation labels and downstream solver-level evaluation is therefore treated as a limitation rather than resolved by the present protocol. The experiments were implemented in Python using PyTorch, with the released dependency file specifying PyTorch 2.2 or later; the project code is available at https://github.com/youger-zero/atom-main (accessed on 13 May 2026).

4.2. Protocol Statistics

Table 1 shows that the derived relation labels are highly sparse. Although the dataset contains more than half a million primitive pairs, the number of positive high-level relations is much smaller, with fewer than 0.5 active relations per sample on average. This sparsity explains why graph-level FRA should not be interpreted alone: a model may obtain a non-trivial FRA by predicting very few positive relations, while still failing to recover relation-level positives. Therefore, the following comparisons emphasize Edge-F1 and Macro-F1.

4.3. Main Comparison and Modality-Control Analysis

Table 2 shows that the proposed parser achieves the best Edge-F1 and Macro-F1 among the controlled settings. Compared with global fusion, Edge-F1 increases from 30.78% to 53.63%, and Macro-F1 increases from 16.16% to 42.56%. These improvements indicate that the proposed design improves positive-relation prediction rather than merely increasing graph-level matching accuracy. The shuffled-text condition performs far worse than the correctly paired text condition, which suggests that paired textual cues are useful under the current protocol. However, this control does not prove robust natural-language understanding, and residual dependence on lexical priors cannot be excluded.
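One simple way to build a shuffled-text control is sketched below. The text does not specify how the shuffling is performed, so this is an assumption: texts are rotated across samples by a seeded non-zero offset, which guarantees that no sample keeps its own text when the texts are distinct. The function name `shuffle_texts` and the example sentences are hypothetical.

```python
import random


def shuffle_texts(texts, seed=0):
    """Rotate texts by a seeded non-zero offset so that every sample is
    paired with the text of a different sample."""
    n = len(texts)
    if n < 2:
        raise ValueError("need at least two samples to mispair texts")
    offset = random.Random(seed).randrange(1, n)  # offset in [1, n - 1]
    return [texts[(i + offset) % n] for i in range(n)]


texts = [
    "AB is parallel to CD",
    "AB is perpendicular to CD",
    "line l bisects angle ABC",
    "AC intersects BD at O",
]
shuffled = shuffle_texts(texts, seed=0)
```

A rotation is used here instead of an unconstrained random permutation because a permutation can leave some samples paired with their own text, which would weaken the control.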
At the same time, the text-only baseline obtains a non-trivial FRA but zero Edge-F1 and Macro-F1, confirming that FRA alone is insufficient under sparse positive labels. Therefore, the main evidence for performance improvement is taken from Edge-F1 and Macro-F1 rather than from FRA alone.

4.4. Relation-Wise Analysis

As shown in Table 3, the relation-wise results reveal uneven behavior across categories, which is important for interpreting the scope of the proposed parser. The proposed model mainly improves Parallel and Perpendicular, which are closely associated with explicit textual cues and line–line geometric constraints. In particular, Parallel increases from 0.00% to 55.70%, and Perpendicular increases from the best baseline value of 23.39% to 60.95%. In contrast, Intersect is still better handled by the global-fusion baseline, and Bisect remains at zero F1. This result indicates that the current model is more effective for text-sensitive line–line relations than for higher-order relations such as bisectors. The weaker Intersect result suggests that not all relation types benefit equally from explicit textual cues, and that some visually grounded relations may already be handled effectively by simpler fusion strategies.

The failure on Bisect suggests that the current pairwise edge-classification formulation is insufficient for relations involving higher-order geometric constructions. Unlike Parallel and Perpendicular, which are mainly line–line relations, Bisect often depends on an angle, a segment partition, or a composed construction involving more than two primitives. Therefore, the zero F1 on Bisect should be interpreted as a limitation of the present parser design rather than as a failure of text conditioning alone. This also limits immediate deployment in downstream solvers for problems that require angle-bisector, segment-partition, or composed-construction reasoning. Illustrative qualitative cases are shown in Figure 3.
Cases A–C illustrate the type of grounding behavior expected for Parallel, Perpendicular, and Intersect, whereas Case D summarizes the unresolved higher-order ambiguity of Bisect under the current design.

4.5. Ablation, Efficiency, and Sensitivity Analysis

Table 4 further clarifies the contribution of each component, but the results should not be interpreted as monotonic gains from every component. Instead, they reveal nonlinear interactions among atomic cues, feedback refinement, and logic regularization. The atomic probe is the main source of improvement, increasing Edge-F1 from 27.05% to 46.45% and Macro-F1 from 16.43% to 36.82%. Adding Logic Loss on top of the atomic probe improves Edge-F1 from 46.45% to 52.14% and reduces LVR from 0.116% to 0.086% in the no-feedback setting, but this improvement is not uniform across metrics: Macro-F1 decreases from 36.82% to 29.85%, suggesting that the improvement is concentrated in micro-level prediction and does not translate into a uniform gain in class-balanced performance. Feedback alone reduces both Edge-F1 and Macro-F1, which suggests that preliminary feedback may introduce noisy relational evidence. Therefore, feedback should be viewed as an interacting refinement mechanism rather than as an independently stable contributor.

The full model achieves the best Edge-F1, Macro-F1, and FRA, but it also predicts the largest number of positive relations and yields the highest LVR. This supports the interpretation that relation coverage and rule-defined consistency are in tension under the current design. LVR should therefore be interpreted jointly with predicted-positive coverage: a model can obtain a low LVR by predicting very few positive relations, whereas a model that recovers more relation edges may expose more opportunities for rule-defined conflicts. Future work may dynamically tune the logic weight or introduce coverage-aware consistency objectives to balance relation recall and structural validity.
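The tension between coverage and rule-defined conflicts can be illustrated with the soft penalties of Equations (4)–(6). The sketch below is a plain-Python reading of those equations, not the released training code: the relation indices, the choice of (Parallel, Perpendicular) as the only mutually exclusive pair, the toy probabilities, and all function names are assumptions made for illustration. These terms are training penalties rather than LVR itself, but they show the same pattern: a predictor that outputs no positives incurs no penalty.

```python
def zeros(n, num_relations):
    """Nested list of shape n x n x num_relations filled with 0.0."""
    return [[[0.0] * num_relations for _ in range(n)] for _ in range(n)]


def symmetry_loss(P, edges, sym_relations):
    """Eq. (4): mean squared gap between P[i][j][r] and P[j][i][r]."""
    total = sum((P[i][j][r] - P[j][i][r]) ** 2
                for i, j in edges for r in sym_relations)
    return total / (len(edges) * len(sym_relations))


def transitivity_loss(P, triplets, para):
    """Eq. (5): hinge on P_ij + P_jk - P_ik - 1 for the Parallel channel."""
    total = sum(max(0.0, P[i][j][para] + P[j][k][para] - P[i][k][para] - 1.0)
                for i, j, k in triplets)
    return total / len(triplets)


def mutex_loss(P, edges, exclusive_pairs):
    """Eq. (6): product of probabilities for pairs assumed to be exclusive."""
    total = sum(P[i][j][a] * P[i][j][b]
                for i, j in edges for a, b in exclusive_pairs)
    return total / (len(edges) * len(exclusive_pairs))


def logic_penalty(P, edges, triplets, sym_relations, para, exclusive_pairs, lam=0.1):
    """Logic part of Eq. (7); lam=0.1 is the logic weight stated in the text."""
    return lam * (symmetry_loss(P, edges, sym_relations)
                  + transitivity_loss(P, triplets, para)
                  + mutex_loss(P, edges, exclusive_pairs))


PARA, PERP = 0, 1                 # assumed relation indices
edges = [(0, 1), (1, 2), (0, 2)]  # toy candidate edge set
triplets = [(0, 1, 2)]            # one sampled local triplet
exclusive_pairs = [(PARA, PERP)]  # assumed exclusive pair

silent = zeros(3, 2)              # predicts no positive relation at all

confident = zeros(3, 2)
for i, j, p in [(0, 1, 0.9), (1, 2, 0.9), (0, 2, 0.1)]:
    confident[i][j][PARA] = p
    confident[j][i][PARA] = p     # Parallel scores kept symmetric
confident[0][1][PERP] = 0.5       # asymmetric and conflicts with Parallel

args = (edges, triplets, [PARA, PERP], PARA, exclusive_pairs)
silent_penalty = logic_penalty(silent, *args)
confident_penalty = logic_penalty(confident, *args)
```

With these toy values the all-zero prediction has zero penalty, while the prediction that commits to two Parallel edges picks up symmetry, transitivity, and mutual-exclusivity terms, which is consistent with reading consistency measures jointly with predicted-positive coverage.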
Figure 4 shows a representative comparison of validation-set LVR trajectories during training for the atomic-probe-only variant (AP-only) and the atomic probe with Logic Loss (AP+Logic). Model selection is performed using validation Macro-F1 rather than minimum validation LVR. Therefore, this trajectory should not be interpreted as the final test-set LVR of the full model, whose higher LVR is reported in Table 4 and is associated with increased predicted-positive coverage. Feedback-round sensitivity in Figure 5 additionally reports Macro-F1 to show the relation-level effect of varying the number of feedback rounds.

Table 5 shows that the proposed model remains lightweight in terms of trainable parameters. Inference time is measured on the same hardware after warm-up and averaged over the test split. Small differences below approximately 0.3 ms should be interpreted as measurement variation rather than meaningful speed differences. The reported parameter count excludes frozen pretrained text-encoder parameters and reflects the trainable parser components. Among the feedback settings, two rounds provide the best FRA–efficiency trade-off, achieving 77.8% FRA with an inference time of 8.02 ms per sample. Three rounds reduce LVR to 0.028% but also lower FRA to 74.3%, suggesting that the configuration with the lowest rule-defined conflict rate is not necessarily the configuration with the best prediction performance.

4.6. Limitations and Threats to Validity

The first threat to validity comes from the derived-label protocol itself. Because the present task extends PGDP5K into a text-conditioned parser-level setting, the results should be interpreted within this derived protocol rather than as a replacement for the original primitive-level benchmark. The derived labels make it possible to study high-level relation parsing, but they also introduce dependence on rule-based label construction.
Because the derived labels are produced by rule-based parsing, the trained parser may partly learn to approximate the label-derivation pipeline. The present results therefore do not prove general semantic understanding or robustness beyond the adopted protocol. Another threat comes from text-prior effects and weak supervision in the semantic probe. Although the shuffled-text control suggests that correctly paired text is important, the atomic cues are still derived from surface-level textual patterns. Therefore, the reported improvements should be interpreted as parser-level gains under the adopted protocol, not as evidence that all text-conditioned geometric ambiguity has been solved. Because textual cues participate in both the derived-label protocol and the model input, residual text-prior leakage cannot be excluded. The text-only and shuffled-text controls reduce this concern but do not fully resolve it. A further limitation concerns relation complexity. Parallel and Perpendicular benefit most clearly from explicit textual cues and low-order geometric constraints, whereas Bisect remains unresolved in the current experiments. This suggests that pairwise relation classification and lexical cue extraction are insufficient for higher-order grounding involving angles, segments, or composed geometric constructions. The atomic probe relies on a seed vocabulary and normalization rules. Although this design is interpretable, it is closer to lexical cue extraction than to full semantic parsing, and may fail under paraphrases or implicit descriptions. Another limitation is the absence of downstream solver-level validation. The predicted relation graph is intended to provide candidate symbolic constraints, such as Parallel and Perpendicular, for future theorem-guided solvers. However, the present study does not demonstrate improved final-answer accuracy, proof generation, or formal reasoning success. 
In addition, the full model predicts more positive relations and obtains better Edge-F1 and Macro-F1, but it also yields a higher LVR. This indicates a trade-off between relation coverage and rule-defined consistency. Future work should investigate stronger grounding mechanisms, improved conflict resolution, and cross-dataset transfer before connecting the parsed relation graph to downstream theorem-guided solvers. A final reproducibility-related limitation is that the derived protocol depends on the correctness of the rule-based label-derivation pipeline. Although the accompanying materials provide derived labels, split identifiers, vocabulary files, and evaluation scripts, independent verification of the label construction remains important for future extensions of the protocol. The main baselines are protocol-aligned variants designed to isolate component effects. Therefore, they do not establish competitiveness with stronger external parser-oriented baselines or solver-integrated systems. These limitations constrain the scope of the claims, but they also define the intended use of the present work: a reproducible parser-level benchmark and analysis framework for studying text-guided geometric relation prediction before downstream solver integration.

5. Conclusions

5.1. Main Findings

This study investigates geometric relation parsing as a text-conditioned, logic-aware structured prediction problem under a derived Ext-PGDP5K protocol. The proposed parser combines atomic semantic probing, iterative visual–semantic feedback, and low-order logic consistency regularization while keeping the visual and textual backbones lightweight. Under the active four-relation evaluation, the model improves Edge-F1 and Macro-F1 relative to the image-only and global-fusion baselines.
The strongest gains are observed for Parallel and Perpendicular, suggesting the value of explicit textual cues for ambiguity-sensitive relation prediction under the current weakly supervised parser-level setting. These improvements should be interpreted within the derived Ext-PGDP5K protocol and should not be taken as evidence of general geometric language understanding. Within this scope, the value of the study lies in providing controlled evidence that paired textual cues and relation-specific cue extraction can improve sparse relation-graph prediction for selected geometric relations.

The ablation further shows that the atomic probe is the main source of improvement and that Logic Loss is useful in the no-feedback setting. At the same time, feedback alone is not a stable independent contributor, and the full model shows a higher LVR because it predicts more positive relations. These findings suggest that text-conditioned relation coverage and rule-defined consistency should be optimized jointly rather than treated as automatically aligned objectives. They also show that the current architecture is not uniformly stable across all component combinations. The unresolved Bisect category further suggests that future parsers should incorporate higher-order geometric construction modeling rather than relying only on pairwise edge classification.

5.2. Limitations and Future Work

Future work will focus on constructing reliable Tangent annotations, improving higher-order grounding for Bisect, reducing rule-defined graph conflicts at higher predicted-positive coverage, validating cross-dataset transfer, comparing with stronger external parser baselines, and connecting parsed relation graphs to downstream theorem-guided solvers. Future work should evaluate whether the parsed relation graphs improve downstream solver accuracy, final-answer prediction, proof generation, or theorem-guided reasoning success.
In particular, phrase-level grounding and typed geometric construction modeling may be necessary for relations that cannot be represented reliably by pairwise edge classification alone. Future extensions should also replace or supplement seed-vocabulary cue extraction with more robust phrase-level grounding or learned semantic parsing, and should evaluate the protocol with manual label auditing or independently verified relation annotations. Thus, the present work should be viewed as a constrained but reproducible step toward more reliable geometry parsing, rather than as a complete AGP system.

Author Contributions
Conceptualization, X.Z. and P.J.; methodology, X.Z.; software, X.Z.; validation, X.Z., L.W., and P.J.; formal analysis, X.Z.; investigation, X.Z. and L.W.; resources, P.J. and Q.S.; data curation, L.W. and X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, P.J., L.W. and Q.S.; visualization, X.Z. and L.W.; supervision, P.J. and Q.S.; project administration, P.J.; funding acquisition, P.J. and Q.S. All authors have read and agreed to the published version of the manuscript.

Funding
This research was supported by the General Project of Natural Science Foundation of Henan Province (No. 262300421801) and the Soft Science Project of Henan Province (No. 262400410529).

Institutional Review Board Statement
Not applicable.

Informed Consent Statement
Not applicable.

Data Availability Statement
The original PGDP5K dataset is publicly available from the official PGDP repository and dataset page (https://github.com/mingliangzhang2018/PGDP, accessed on 15 May 2026; http://www.nlpr.ia.ac.cn/databases/CASIA-PGDP5K/, accessed on 15 May 2026). The project code and review materials are available at https://github.com/youger-zero/atom-main (accessed on 15 May 2026). A stable archival release will be provided through Zenodo or an equivalent repository upon acceptance.

Conflicts of Interest
The authors declare no conflicts of interest.
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Ma, J.; Wang, W.; Jin, Q. A Survey of Deep Learning for Geometry Problem Solving. arXiv 2025, arXiv:2507.11936.
2. Seo, M.; Hajishirzi, H.; Farhadi, A.; Etzioni, O.; Malcolm, C. Solving Geometry Problems: Combining Text and Diagram Interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1466–1476.
3. Zhang, M.-L.; Yin, F.; Hao, Y.-H.; Liu, C.-L. Plane Geometry Diagram Parsing. arXiv 2022, arXiv:2205.09363.
4. Lu, P.; Qiu, L.; Yu, W.; Welleck, S.; Chang, K.-W. A Survey of Deep Learning for Mathematical Reasoning. arXiv 2022, arXiv:2212.10535.
5. Zhu, N.; Zhang, X.; Huang, Q.; Zhu, F.; Zeng, Z.; Leng, T. FGeo-Parser: Autoformalization and Solution of Plane Geometric Problems. Symmetry 2025, 17, 8.
6. Trinh, T.H.; Wu, Y.; Le, Q.V.; He, H.; Luong, T. Solving Olympiad Geometry without Human Demonstrations. Nature 2024, 625, 476–482.
7. Lu, P.; Gong, R.; Jiang, S.; Qiu, L.; Huang, S.; Liang, X.; Zhu, S.-C. Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 6774–6786.
8. Li, Z.-Z.; Zhang, M.-L.; Yin, F.; Liu, C.-L. LANS: A Layout-Aware Neural Solver for Plane Geometry Problem.
In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2596–2608.
9. Zhang, M.-L.; Li, Z.-Z.; Yin, F.; Lin, L.; Liu, C.-L. Fuse, Reason and Verify: Geometry Problem Solving with Parsed Clauses from Diagram. arXiv 2024, arXiv:2407.07327.
10. Ping, B.; Luo, M.; Dang, Z.; Wang, C.; Jia, C. AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning. In Proceedings of the Fourteenth International Conference on Learning Representations, Rio de Janeiro, Brazil, 23–27 April 2026; Available online: https://openreview.net/forum?id=PVtZnUh04m (accessed on 15 May 2026).
11. Zhang, Z.; Cheng, J.-K.; Deng, J.; Tian, L.; Ma, J.; Qin, Z.; Zhang, X.; Zhu, N.; Leng, T. Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver. arXiv 2024, arXiv:2409.04214.
12. Murphy, L.; Yang, K.; Sun, J.; Li, Z.; Anandkumar, A.; Si, X. Autoformalizing Euclidean Geometry. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; Proceedings of Machine Learning Research. Volume 235, pp. 36847–36893.
13. Lu, P.; Qiu, L.; Chen, J.; Xia, T.; Zhao, Y.; Zhang, W.; Yu, Z.; Liang, X.; Zhu, S.-C. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. arXiv 2021, arXiv:2110.13214.
14. Chen, J.; Tang, J.; Qin, J.; Liang, X.; Liu, L.; Xing, E.P.; Lin, L. GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 513–523.
15. Tan, H.; Bansal, M.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 5100–5111.
16. Li, C.; Xu, H.; Tian, J.; Wang, W.; Yan, M.; Bi, B.; Ye, J.; Chen, H.; Xu, G.; Cao, Z.; et al. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7241–7259.
17. Agrawal, A.; Batra, D.; Parikh, D.; Kembhavi, A. Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4971–4980.
18. Geirhos, R.; Jacobsen, J.-H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut Learning in Deep Neural Networks. Nat. Mach. Intell. 2020, 2, 665–673.
19. Besold, T.R.; d’Avila Garcez, A.; Bader, S.; Bowman, H.; Domingos, P.; Hitzler, P.; Kühnberger, K.-U.; Lamb, L.C.; Lowd, D.; Lima, P.M.V.; et al. Neural-Symbolic Learning and Reasoning: A Survey and Interpretation. arXiv 2017, arXiv:1711.03902.
20. Xu, J.; Zhang, Z.; Friedman, T.; Liang, Y.; Van den Broeck, G. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Proceedings of Machine Learning Research. Volume 80, pp. 5502–5511.
21. Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; Xing, E. Harnessing Deep Neural Networks with Logic Rules.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; pp. 2410–2420.
22. Diligenti, M.; Gori, M.; Saccà, C. Semantic-Based Regularization for Learning and Inference. Artif. Intell. 2017, 244, 143–165.
23. Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.; Zhang, C.; Vechev, M. DL2: Training and Querying Neural Networks with Logic. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Proceedings of Machine Learning Research. Volume 97, pp. 1931–1941.
24. Bach, S.H.; Broecheler, M.; Huang, B.; Getoor, L. Hinge-Loss Markov Random Fields and Probabilistic Soft Logic. J. Mach. Learn. Res. 2017, 18, 1–67.
25. Dong, H.; Mao, J.; Lin, T.; Wang, C.; Li, L.; Zhou, D. Neural Logic Machines. arXiv 2019, arXiv:1904.11694.
26. Manhaeve, R.; Dumančić, S.; Kimmig, A.; Demeester, T.; De Raedt, L. DeepProbLog: Neural Probabilistic Logic Programming. arXiv 2018, arXiv:1805.10872.
27. Serafini, L.; d’Avila Garcez, A. Logic Tensor Networks: Deep Learning and Logical Reasoning from Data and Knowledge. arXiv 2016, arXiv:1606.04422.
28. Ratner, A.; Bach, S.H.; Ehrenberg, H.; Fries, J.; Wu, S.; Ré, C. Snorkel: Rapid Training Data Creation with Weak Supervision. Proc. VLDB Endow. 2017, 11, 269–282.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
30. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection.
In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
31. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2019, arXiv:1910.01108.

Figure 1. Motivation of text-conditioned geometric relation parsing.

Figure 2. Overview of the proposed text-guided and logic-regularized parser.

Figure 3. Illustrative qualitative cases under the active four-relation setting. These examples are parser-level cases and do not demonstrate downstream solver-level correctness.

Figure 4. Representative validation-set LVR trajectories for AP-only and AP+Logic during training.

Figure 5. Sensitivity of the full model to the number of feedback rounds.

Table 1. Statistics of the Ext-PGDP5K protocol.

Set    N     Pair     I    T  P    ⊥    B    Avg
Train  3500  363,872  571  0  176  685  105  0.439
Val.   500   55,912   77   0  29   100  23   0.458
Test   1000  110,086  175  0  47   204  53   0.479
Total  5000  529,870  823  0  252  989  181  0.449

I = Intersect; T = Tangent; P = Parallel; ⊥ = Perpendicular; B = Bisect; Avg = average active relations per sample. Tangent is retained in the nominal label space but has no positive instances in the current derived protocol.

Table 2. Main comparison and modality controls.
Method             Text    Edge-F1 (%)  Macro-F1 (%)  FRA (%)  LVR (%)
Text-only          Orig.   0.00         0.00          69.0     0.000
Image-only         None    27.05        16.43         71.6     0.027
Global fusion      Orig.   30.78        16.16         72.8     0.035
Img. + shuf. text  Shuf.   3.67         2.42          70.2     0.001
Ours               Paired  53.63        42.56         77.8     0.244

Edge-F1, Macro-F1, FRA, and LVR are reported in %. Orig. = original text; Shuf. = shuffled text; Paired = correctly paired original text. All methods are protocol-aligned parser variants rather than direct comparisons with solver-oriented systems.

Table 3. Relation-wise F1 under the active four-relation setting.

Rel.        Img.   Fusion  Ours   Δ
Int.        42.32  59.97   53.60  −6.37
Par.        0.00   0.00    55.70  +55.70
Perp.       23.39  4.66    60.95  +37.56
Bis.        0.00   0.00    0.00   0.00
Macro avg.  16.43  16.16   42.56  +26.13

Rel. = Relation; Img. = Image-only; Fusion = Global text fusion; Δ = Ours minus the best baseline. Int. = Intersect; Par. = Parallel; Perp. = Perpendicular; Bis. = Bisect.

Table 4. Revised ablation study of the proposed components.

Variant    Edge-F1 (%)  Macro-F1 (%)  FRA (%)  LVR (%)  Pos./S
Base       27.05        16.43         71.6     0.027    0.1125
+AP        46.45        36.82         75.6     0.116    0.2530
+AP+Logic  52.14        29.85         77.0     0.086    0.2690
+AP+Fb     39.38        29.29         75.1     0.091    0.2320
Full       53.63        42.56         77.8     0.244    0.2985
Gold       —            —             —        —        0.4790

AP = Atomic probe; Fb = Feedback; Pos./S = average predicted positive relations per sample. Edge-F1, Macro-F1, FRA, and LVR are reported in %. The Gold row reports the average number of ground-truth positive relations per sample. The ablation results show non-monotonic component interactions rather than independent monotonic gains from each component.

Table 5. Efficiency and complexity analysis.

Method         Rds.  Params (M)  Time (ms)  FRA (%)  LVR (%)
Image-only     N.A.  3.89        7.82       71.6     0.027
Global fusion  N.A.  3.89        7.84       72.8     0.035
Ours-1R        1     3.89        7.55       76.3     0.115
Ours-2R        2     3.89        8.02       77.8     0.244
Ours-3R        3     3.89        8.45       74.3     0.028

Rds. = feedback rounds. Params are trainable parser parameters in millions. Time is inference time per sample in ms. FRA and LVR are reported in %.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
