
Applied Sciences, Vol. 16, Pages 5629: View-GFN: A Novel View-Based Graph Convolution and Sampling Fusion Network for 3D Shape Recognition

Prometheus Redaktion

Three-dimensional (3D) shape recognition is a fundamental task in computer vision, where view-based methods have recently achieved state-of-the-art performance. However, effectively capturing and exploiting the rich geometric correspondences between different views remains a key challenge, as such information is crucial for accurate shape representation. Existing methods often fall short in explicitly modeling these structured correlations, which limits their ability to fully leverage discriminative shape information. To address this limitation, we propose a novel View-based Graph Convolution and Sampling Fusion Network (View-GFN). View-GFN employs a hierarchical architecture that progressively coarsens the view-graph to learn multi-scale features. In this structure, views are treated as graph nodes, and a predefined-value strategy is introduced to initialize the adjacency matrix (AM) for constructing initial node correlations. For effective graph coarsening, we develop a novel view down-sampling method based on a cluster assignment matrix. Furthermore, a Graph Convolution and Sampling Fusion (CSF) module is designed to seamlessly integrate deep feature embeddings with the topological information derived from view down-sampling. Extensive experiments on benchmark datasets, including ModelNet40 and RGB-D, demonstrate that View-GFN achieves strong performance, performing on par with established baseline methods while reducing the number of model parameters by nearly 50% compared to the baseline View-GCN. These results validate the effectiveness of our hierarchical fusion strategy in capturing multi-view geometric information both efficiently and robustly.

1. Introduction

Specifically, voxel-based methods extract features by discretizing the 3D space into regular grids [1, 2]. Point cloud-based methods represent targets as unordered sets of spatial points and directly process them using advanced geometric aggregation networks [3, 4].
Recent studies have further introduced techniques such as skeleton-aware sampling [5], zero-shot geometry-driven aggregation [6], sample-adaptive auto-augmentation [7], and multi-scale topological networks [8], significantly enhancing recognition robustness. On the other hand, multi-view-based methods project 3D objects into a sequence of two-dimensional (2D) images and integrate these features into a global descriptor. Compared with the former two paradigms, view-based methods generally exhibit highly competitive performance in 3D recognition tasks. This is because they can acquire comprehensive geometric and textural information from different perspectives and can directly leverage mature pre-trained image networks. Recently, novel paradigms have emerged to address diverse challenges in 3D recognition, including progressive interaction transformers [12], lightweight multi-view convolutional-vision models [10], and prototype-based interpretable architectures for fine-grained shape classification [11].

Despite this progress, efficiently modeling the complex geometric correspondences across views remains a core bottleneck. Early multi-view fusion strategies (e.g., view pooling) typically treat all input views equally, ignoring inherent spatial correlations. To introduce relational modeling, Graph Convolutional Network (GCN)-based methods (e.g., View-GCN) treat views as nodes for message passing. However, these methods exhibit notable practical deficiencies. First, existing methods predominantly use "hard sampling" during graph coarsening. Directly discarding lower-ranked view nodes leads to irreversible feature loss and destroys the global manifold topology of 3D objects. Second, current graph constructions over-rely on predefined rigid viewpoint coordinates for initializing the adjacency matrix (AM). This mechanism lacks a global connectivity prior, reducing robustness to viewpoint fluctuations.
Finally, node feature updating and topological evolution are decoupled, restricting the network from exploiting deep discriminative information.

To overcome these limitations, this paper proposes a novel View-based Graph Convolution and Sampling Fusion Network (View-GFN). Unlike traditional methods that rely on fixed spatial coordinates, we propose an AM initialization strategy with a global connectivity prior to endow the initial graph with a global receptive field. To preserve geometric topology, we replace traditional selective dropping with a hierarchical graph coarsening method based on a clustering assignment matrix. By softly mapping semantically similar features into super-nodes, we eliminate redundant information while retaining topological integrity. Furthermore, we design a Graph Convolution and Sampling Fusion (CSF) module to seamlessly integrate deep local feature embeddings with the coarsened macroscopic structure.

To explicitly delineate the methodological differences between our View-GFN and existing view-based graph networks, we summarize three fundamental shifts in our design.
(1) Initialization: We transition from coordinate-dependent static graph construction to a fully dense initialization with a global connectivity prior, eliminating the reliance on rigid camera positions.
(2) Coarsening: We replace destructive hard-node dropping with a structure-preserving soft-clustering mechanism to safeguard the 3D manifold topology.
(3) Architecture: We move from decoupled feature updating and pooling stages to a unified CSF module, which performs simultaneous feature embedding and structural coarsening to significantly reduce parameter overhead.
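The soft-clustering coarsening named in shift (2) can be illustrated with a small sketch. The exact update rule is not given in this part of the text, so the code below assumes the standard DiffPool-style pooling (new features `S^T X`, new adjacency `S^T A S`, with a row-wise softmax assignment `S`). The function names and the hand-set logits are hypothetical; in the described network the assignment matrix would be produced by the CSF module rather than fixed by hand.

```python
import math


def softmax_rows(logits):
    """Row-wise softmax so every node's cluster weights sum to 1."""
    result = []
    for row in logits:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def soft_coarsen(adjacency, features, assign_logits):
    """Pool m_l nodes into m_{l+1} super-nodes with a soft assignment S.

    Assumed DiffPool-style update (an assumption, not taken from the text):
        X_next = S^T X,   A_next = S^T A S
    """
    s = softmax_rows(assign_logits)        # shape (m_l, m_{l+1})
    s_t = transpose(s)                     # shape (m_{l+1}, m_l)
    next_features = matmul(s_t, features)  # shape (m_{l+1}, c)
    next_adjacency = matmul(matmul(s_t, adjacency), s)
    return next_adjacency, next_features, s


if __name__ == "__main__":
    # Four views on a complete graph, one scalar feature per view.
    A = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
    X = [[1.0], [2.0], [3.0], [4.0]]
    # Hypothetical logits: views 0-1 lean to cluster 0, views 2-3 to cluster 1.
    logits = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
    A2, X2, S = soft_coarsen(A, X, logits)
    print(len(X2), "super-nodes; feature mass:", sum(f[0] for f in X2))
```

Because every row of `S` sums to 1, each view contributes its full feature to the super-nodes, so the summed feature mass is preserved up to floating-point rounding. Hard-node dropping, by contrast, would discard the features of the removed views entirely.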
Our main contributions can be explicitly summarized into three distinct structural novelties:
(1) A coordinate-free dense adjacency matrix (AM) initialization strategy: We adopt a complete graph prior with predefined values, which directly equips shallow graph convolutions with a global receptive field and eliminates the model's dependency on rigid physical camera coordinates.
(2) A soft-clustering mechanism for view down-sampling: Unlike traditional hard-node dropping, our cluster-assignment approach softly aggregates semantically similar features, effectively preserving the discriminative geometric topology of 3D objects.
(3) A unified Graph Convolution and Sampling Fusion (CSF) module: We seamlessly integrate feature embedding and structural graph coarsening within a single branch. This design eliminates the massive parameter overhead inherent in traditional two-stage methods, providing a highly lightweight and efficient alternative.

Extensive experiments on ModelNet40 and RGB-D demonstrate that View-GFN achieves robust performance, yielding a competitive recognition accuracy of 97.8% while reducing model parameters by nearly 50% compared to the baseline View-GCN. This validates the strong performance and practical value of our hierarchical fusion strategy. Importantly, our core contribution lies not merely in pushing the saturated accuracy boundaries on synthetic benchmarks, but in achieving top-tier performance with significantly enhanced architectural efficiency. Furthermore, evaluations on the challenging real-world RGB-D dataset demonstrate the model's strong generalization against background noise and varying camera trajectories, making it highly suitable for practical deployments.

2. Related Work

In this section, we provide a systematic review of the literature closely related to our work from three perspectives: multi-view 3D shape recognition, graph construction and relational modeling, and hierarchical graph coarsening and pooling.

2.1. Multi-View 3D Shape Recognition

Converting 3D objects into 2D projections to leverage mature 2D convolutional neural networks (CNNs) for discriminative feature extraction has become a core paradigm in the field of 3D shape analysis. As a pioneering work, MVCNN [13] introduced a view pooling strategy that aggregates multi-view features via element-wise maximum operation to generate global shape descriptors. This work laid the foundation for multi-view methods, enabling 3D recognition tasks to fully benefit from 2D network models pre-trained on large-scale image datasets such as ImageNet.

Subsequent studies have pursued improvements in fusion strategies and feature extraction. GVCNN [14] introduced a group-view convolutional approach that partitions views into different groups based on feature similarity. MHBN [15] proposed harmonized bilinear pooling to capture second-order statistics across cross-view image patches. Furthermore, several works have focused on viewpoint optimization and sequence modeling. For instance, RotationNet [16] treats viewpoints as latent variables for joint optimization, achieving simultaneous improvement in both classification and pose estimation. Methods based on RNNs or LSTMs [17] attempt to capture spatial evolution patterns across view sequences using temporal models.

Despite significant progress, most of these methods rely on simple pooling operations or sequential aggregation, treating each view as an isolated image sample. This paradigm fails to explicitly establish structured topological relationships between views, thereby overlooking the rich geometric correspondence information embedded across different perspectives. This limitation motivated the introduction of graph neural networks (GNNs) into the multi-view domain. View-GCN [18] represents the first attempt to explicitly treat views as graph nodes and perform message passing through graph convolution, opening new directions for graph-driven multi-view fusion research.
Recently, to further address the diverse challenges in 3D recognition, novel paradigms have emerged. For instance, LM-MCVT [10] explores lightweight multimodal fusion optimized for few-view scenarios, highlighting the ongoing demand for deployment efficiency. Meanwhile, Proto-FG3D [11] pioneers prototype-based interpretable architectures for fine-grained 3D classification, pushing the boundaries of detail-oriented shape understanding. Complementary to these specific applications, our View-GFN focuses on maximizing the geometric fidelity and parameter efficiency of graph structures under standard dense multi-view settings (e.g., 20 views).

2.2. Graph Construction and Relational Modeling

The performance of GNNs heavily depends on the quality of the initial graph topology. In multi-view 3D recognition, defining appropriate node adjacency relationships for a set of views constitutes a fundamental challenge. Existing graph-based methods primarily adopt two initialization strategies.

The first is geometry-driven static graph construction, exemplified by View-GCN [18], which initializes the adjacency matrix (AM) using the physical 3D coordinates of camera viewpoints via the K-nearest neighbors (KNN) algorithm. While this approach introduces spatial priors, its fixed graph structure fails to reflect the dynamic semantic evolution of view relationships and exhibits high sensitivity to variations in the number of input views.

The second strategy is semantic-driven dynamic graph construction. For example, Xu et al. [19] proposed a path aggregation graph network that dynamically constructs a view-relation graph by computing semantic correlations between view features. While this approach captures deep semantic relationships, it typically involves expensive pairwise similarity computation, incurring significant computational overhead.

Different from these methods, this paper proposes an AM initialization strategy based on a global connectivity prior. In contrast to methods relying on local geometric constraints [18] or high-overhead dynamic feature dependencies [19], our approach constructs a densely connected initial topology using predefined values. This design endows graph convolution with a global receptive field at shallow layers and eliminates dependence on static viewpoint coordinates, enabling the model to adaptively learn cross-view long-range dependencies while demonstrating inherent robustness to fluctuations in the number of input views.

2.3. Hierarchical Graph Coarsening and Pooling

On the other hand, general-purpose soft pooling methods such as DiffPool [22] and MinCutPool [23] introduce mapping mechanisms based on cluster assignment. However, these methods are primarily designed for generic graph data. When applied to densely connected multi-view graphs, their computational complexity often grows quadratically with the number of nodes, and they fail to utilize geometric priors specific to 3D vision tasks.

To tackle these challenges, we propose a hierarchical multi-view graph coarsening method based on a cluster assignment matrix. Our approach smoothly aggregates semantically similar view features into super-nodes through a learnable soft assignment mechanism, achieving dimensionality reduction while maximally preserving critical geometric topological properties. Building upon this, we design a graph convolution and sampling fusion (CSF) module that jointly optimizes feature embedding and topological evolution within a unified framework. This design effectively mitigates discriminative information loss from a representational perspective and eliminates the error accumulation inherent in traditional two-stage methods from an architectural standpoint.

3. Methodology

3.1. Overview

In this section, we introduce View-GFN, a novel hierarchical graph fusion network for three-dimensional (3D) shape recognition.
The network adopts a multi-stage abstraction architecture designed to capture multi-scale geometric features through progressive graph coarsening. Each level of the hierarchy defines a view-graph denoted as $G^{l} = (V^{l}, E^{l})$. The initial view-graph at the first level is constructed from $M$ input views, where each view corresponds to a node in the graph. To define the initial correlations between nodes, we propose an initialization strategy based on a global connectivity prior. In contrast to traditional methods that rely on unstable viewpoint coordinates or local K-Nearest Neighbor (KNN) constraints, we initialize the first-level adjacency matrix $A^{1} \in \mathbb{R}^{M \times M}$ as a complete graph:

$$A^{1}_{ij} = \begin{cases} 1, & i \neq j \\ 0, & i = j \end{cases} \tag{1}$$

We opt for a static complete graph prior over alternative constructions (e.g., sparse k-NN graphs or dynamically learned connectivity) due to the specific scale of multi-view 3D recognition. Typically, the input consists of only $M = 12$ or $20$ views. At this scale, a complete graph generates at most $20 \times 20 = 400$ edges, making the $O(M^{2})$ computational and memory cost negligible. In contrast, constructing a sparse k-NN graph requires calculating pairwise coordinate distances, and learned connectivity requires computing dynamic weights at each iteration. For such a small number of nodes, these dynamic computations introduce unnecessary overhead. Therefore, our complete graph initialization provides a global receptive field at zero extra computational cost for edge generation, enabling shallow graph convolutions to facilitate global information interaction at the early stages of feature learning.

The overall architecture of View-GFN is illustrated in the accompanying architecture figure. The network consists of a feature extraction module followed by three cascaded Graph Convolution and Sampling Fusion (CSF) modules.
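The complete-graph initialization of Eq. (1) can be sketched in a few lines. The code below builds the adjacency matrix and applies one row-normalized neighbor-averaging step to show that, on a complete graph, a single propagation already mixes information from every other view. The function names are hypothetical, and the averaging step is a simplified stand-in for a graph convolution (no learned weights, no nonlinearity), not the layer used in the paper.

```python
def complete_graph_adjacency(num_views):
    """Return the M x M adjacency of Eq. (1): 1 off the diagonal, 0 on it."""
    return [[0.0 if i == j else 1.0 for j in range(num_views)]
            for i in range(num_views)]


def propagate_once(adjacency, features):
    """One aggregation step: each node takes the mean of its neighbors.

    Simplified illustration only; a real graph convolution would also apply
    a learned weight matrix and a nonlinearity.
    """
    dim = len(features[0])
    output = []
    for row in adjacency:
        degree = sum(row)
        aggregated = [0.0] * dim
        for weight, feat in zip(row, features):
            for k in range(dim):
                aggregated[k] += weight * feat[k]
        if degree > 0:
            aggregated = [v / degree for v in aggregated]
        output.append(aggregated)
    return output


if __name__ == "__main__":
    M = 20
    A = complete_graph_adjacency(M)
    nonzero = int(sum(sum(row) for row in A))
    print("nonzero adjacency entries:", nonzero)

    # One scalar feature per view; after one step every node already
    # depends on all other views.
    X = [[float(i)] for i in range(M)]
    print(propagate_once(A, X)[0])
```

For $M = 20$ the matrix has $20 \times 19 = 380$ nonzero entries, which sits below the $20 \times 20 = 400$ upper bound quoted in the text, so storing it densely is cheap at this scale.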
Each CSF module concurrently performs feature embedding and assignment matrix generation, enabling the hierarchical evolution of the graph structure through differentiable soft-clustering operations.

3.2. Initial Feature Extraction

Given a sequence of multi-view images of a 3D object $I = \{I_1, I_2, \ldots, I_M\}$, we employ ResNet-18, pre-trained on ImageNet and fine-tuned on the target dataset, as the backbone network for initial feature extraction. Each view image $I_i$ is mapped to a $c_0$-dimensional discriminative feature vector. These vectors constitute the initial node feature matrix $X^{1} \in \mathbb{R}^{m_1 \times c_0}$ for the first-level graph, where $m_1 = M$ represents the initial number of nodes.

3.3. Cluster Assignment Based View Sampling

To achieve hierarchical compression of the graph structure, we need to aggregate $m_l$ nodes at level $l$ into $m_{l+1}$ super-nodes at level $l+1$, such that $m_{l+1} < m_l$.

[…] $> 100$ from continuous video sequences). To support massive-view environments, future work will explore two concrete mitigation strategies: (1) a pre-sampling step using Farthest Point Sampling to reduce the initial node set to a manageable size, and (2) replacing the complete graph with sparse, local-windowed attention mechanisms to achieve linear $O(M)$ complexity. Additionally, we plan to extend this unified fusion framework to more complex 3D scene understanding and autonomous robotic perception tasks.

Author Contributions

Conceptualization, M.P. and J.J.; methodology, M.P. and J.J.; software, M.P.; validation, M.P., J.J. and Y.Z.; formal analysis, M.P.; investigation, M.P.; resources, M.P.; data curation, M.P.; writing (original draft preparation), M.P.; writing (review and editing), M.P., J.J. and Y.Z.; visualization, M.P.; supervision, M.P. and J.J.; project administration, M.P.; funding acquisition, M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.
Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure: The overall architecture of the proposed View-GFN. The framework takes multi-view images as input, extracts initial features via a CNN backbone, and processes them through a hierarchical structure consisting of cascaded CSF modules and view sampling modules. The final pooled features are concatenated to generate output scores.

Figure: The detailed architecture of the Graph Convolution and Sampling Fusion (CSF) module. It demonstrates the joint optimization of multi-scale feature embeddings (via stacked GCN layers and mixed pooling) and the generation of the assignment matrix.

Figure: Visualization of the hierarchical graph coarsening process. The soft-assignment matrix effectively groups semantically similar and topologically adjacent view images (e.g., four viewpoints capturing the legs of a chair) into a single localized super-node, preserving the structural integrity of the object while reducing the graph scale without hard-dropping.

Table: Classification accuracy and model complexity comparison on ModelNet40.
Columns: Method, Backbone, Type, Views, Inst. Acc. (%), Class Acc. (%), Params (M), Time (s/epoch)

Table: Classification accuracy on the RGB-D dataset.
Columns: Method, Backbone, Views, Inst. Acc. (%), Params (M), Time (s/epoch)
Note: Methods such as CFK, MMDCNN, and MDSICNN employ 120 views, whereas View-GFN, View-GCN, and MVCNN utilize only 12 views. Despite the significantly fewer input views, View-GFN achieves highly competitive accuracy with substantially fewer parameters, underscoring its superior view utilization efficiency. "–" indicates that the corresponding information is not available from the original publication.

Table: Retrieval task performance comparison on ModelNet40 (mAP).
Method             mAP (%)
GVCNN [14]         85.7
MVCVT [32]         95.4
MLVCNN [27]        92.8
MVPNet [28]        97.4
View-GFN (Ours)    97.8

Table: Ablation study of View-GFN core components (ModelNet40, 20 views).
Configuration      Inst. Acc. (%)   Class Acc. (%)   Description
View-GFN-FPS       96.5             95.2             Replace soft-clustering with Farthest Point Sampling (FPS)
View-GFN-SEP       97.7             96.5             Decouple feature embedding and assignment matrix generation
View-GFN-A1        97.4             96.2             AM considers only 3 nearest neighbor nodes
View-GFN-A2        97.5             96.1             AM initialized with view coordinate encoding
View-GFN (Full)    97.8             96.5             Full model (Global AM + Soft-clustering + CSF)

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Pang, M.; Jiao, J.; Zhang, Y. View-GFN: A Novel View-Based Graph Convolution and Sampling Fusion Network for 3D Shape Recognition. Appl. Sci. 2026, 16, 5629. https://doi.org/10.3390/app16115629

Pang, Min, Jichao Jiao, and Yingjian Zhang. 2026. "View-GFN: A Novel View-Based Graph Convolution and Sampling Fusion Network for 3D Shape Recognition" Applied Sciences 16, no. 11: 5629. https://doi.org/10.3390/app16115629

www.mdpi.com
