A CNN and Transformer-Based Framework for Fine-Grained Plant Species Classification in Real-World Environments

Prometheus Editorial Team

Open Access Article

by Daniel Chwaifo Malann 1,2,*, Nadire Cavus 1,2 and Boran Sekeroglu 2,3

1 Computer Information Systems, Faculty of AI & Informatics, Near East University, N. Cyprus, Mersin 10, Nicosia 99138, Türkiye
2 Computer Information Systems Research and Technology Center, Near East University, N. Cyprus, Mersin 10, Nicosia 99138, Türkiye
3 Computer Engineering, Faculty of AI & Informatics, Near East University, N. Cyprus, Mersin 10, Nicosia 99138, Türkiye
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 5810; https://doi.org/10.3390/app16125810 (registering DOI)
Submission received: 12 May 2026 / Revised: 3 June 2026 / Accepted: 3 June 2026 / Published: 9 June 2026

Abstract

Plant recognition plays a vital role in agriculture and biodiversity monitoring, and deep learning, particularly convolutional neural networks (CNNs), has gained increased attention for automating this task. However, CNNs are limited in their ability to handle complex patterns because they have difficulty capturing global contextual information. Furthermore, plant datasets are often created in laboratory environments that minimize discrimination challenges, enabling the analysis of model performance. This study proposes a hybrid deep learning model, HDL-PlantNet, for real-world plant recognition on the primary dataset, the Cyprus Seasonal Flora Image Dataset (CSFID), comprising 27 plant species. The HDL-PlantNet model integrates an EfficientNetV2-S convolutional backbone with a Transformer encoder to capture both spatial context and long-range dependencies.
Additionally, the Swedish Leaf Dataset is used as a supplementary dataset to analyze the consistency of the HDL-PlantNet under controlled environments. Five benchmark CNN models are used for comparative evaluation, and statistical tests and an ablation study are conducted to assess the results. The proposed model achieved the highest observed Macro-F1 and Macro-AUC scores among the evaluated models, reaching 90.06% and 99.59%, respectively. The results demonstrate that combining convolutional and Transformer architectures yields computationally effective performance in fine-grained plant classification while maintaining a compact model size suitable for further research. This study contributes to real-time plant identification studies and supports informed ecological decision-making. 1. Introduction Plant image recognition has become an important application of artificial intelligence in agriculture and biodiversity informatics. In farming, plant pests and diseases can devastate crop yields; up to 40% of global crop production is lost annually, costing over $220 billion, according to the Food and Agriculture Organization [ 1]. Early and accurate identification of plant ailments is therefore critical for food security. Beyond agriculture, automated plant species identification can assist botanists and laypersons in rapidly recognizing plant types, which is valuable for environmental monitoring and medicinal plant discovery [ 2]. Due to the time-consuming process and the need for an expert in manual plant identification, computer vision techniques offer robust and accurate results in automating the process. In recent years, deep learning has gained an important role in analyzing plant images for different problems, such as disease detection and species classification [ 2, 3]. Convolutional Neural Network (CNN)- based models tend to learn rich hierarchical features from plant images and achieve superior results compared to previous systems [ 2]. 
In this study, we employed a hybrid deep learning model (HDL-PlantNet) that combines a CNN and a Transformer, specifically designed for plant leaf classification. The model includes the EfficientNetV2-Small as a convolutional backbone for initial feature extraction, a custom Transformer encoder to enhance extracted features for long-range dependencies, and a classification head. The aim of this architecture is to improve the class separability across diverse species captured in real-life environments while capturing fine-grained local features and global contextual patterns. The primary contribution of HDL-PlantNet is its task-specific adaptation of a compact EfficientNetV2-S backbone, combined with a Transformer-based feature refinement module, for field-based plant species recognition. Contrary to the large-scale hybrid architectures such as CoAtNet, MaxViT [ 6], and Swin-based models, HDL-PlantNet is designed to improve contextual feature modeling after convolutional feature extraction without increasing computational cost. This design is particularly relevant for plant identification under field conditions, where pattern distinctions are based on small or partially occluded regions. A primary plant dataset, The Cyprus Seasonal Flora Image Dataset (CSFID) [ 7], is used in the experiments, and the Swedish Leaf Dataset [ 8] is also used as a supplementary dataset to assess the consistency of the HDL-PlantNet under controlled environments. 2. Related Studies To leverage the strengths of both CNNs and Transformers, researchers have begun exploring hybrid architectures that combine convolutional and self-attention modules within a single model [ 4]. By merging CNN’s strong local feature extraction with the ViT’s global context modeling, such hybrid models can produce more discriminative representations of plant images. Recent studies in the agricultural domain indicate that CNN–Transformer hybrids can improve classification performance and robustness. 
As CNNs and Transformers each offered unique advantages, researchers increasingly experimented with hybrid architectures that combine convolutional and self-attention mechanisms. The general idea is to utilize CNN layers for efficient local feature extraction and Transformer blocks for modeling long-range dependencies and global context. In 2021, Google’s CoAtNet explicitly married convolutions and Transformers by stacking convolutional layers in early network stages and Transformer layers in later stages [ 17]. In 2022, a number of sophisticated hybrid models emerged in the computer vision community. Tu et al. [ 6] introduced MaxViT (Maximal Multi-Axis ViT), a hierarchical model that integrates depthwise convolutions (MBConv blocks) with a novel multi-axis self-attention mechanism in each building block. Recent studies show that the last decade has seen plant recognition research shift from traditional CNN solutions to the advent of vision transformers to improve capabilities, ultimately leading to hybrid CNN–Transformer approaches that aim to harness the strengths of both. Recently, Xu et al. [ 18] introduced a symmetric hybrid network (PLTransformer) that uses multi-scale convolutional modules and an overlap-attentive downsampler to capture both local textures and global context in plant disease images. Their model achieves 99.95% on PlantVillage, demonstrating the power of CNN+Transformer fusion. Similarly, Lee et al. [ 19] propose a Plant-CNN-ViT ensemble (ResNet50, DenseNet, Xception, and ViT) for leaf classification, achieving nearly 100% on standard leaf datasets. These recent studies illustrate that integrating CNNs and transformers can improve feature representation and performance. For instance, Zhang et al. [ 20] introduce the FewMedical-XJAU benchmark for fine-grained medicinal plants, featuring diverse natural backgrounds and high intra-class variability. 
They also propose a multimodal fusion model to cope with subtle differences under challenging conditions. Rodriguez-Vazquez et al. [ 21] show that unsupervised adversarial alignment can cut plant-counting error by 97% under strong cross-domain shifts. On large-scale plant tasks, Malik et al. [ 22] report only 87% accuracy on the PlantCLEF challenge using EfficientNet-B1, illustrating the difficulty of fine-grained species classification across hundreds of categories. These studies highlight that background clutter, lighting changes, and novel species reduce performance in practice. Importantly, most existing hybrid models and benchmarks still assume relatively clean or controlled data. 3. Materials and Methods 3.1. Datasets 3.1.1. The Cyprus Seasonal Flora Image Dataset (CSFID) The Cyprus Seasonal Flora Image Dataset [ 7] consists of 3072 labeled training and 768 labeled test images of 27 plant species commonly found in Cyprus. Each image represents a distinct flora class captured under real-world conditions, including herbs, shrubs, trees, and flowering plants. Images were captured in different seasons using different mobile phones under varied lighting, backgrounds, and scales to make the dataset suitable for robust training and real-world deployment and evaluation of multiclass plant classification models in computer vision tasks. Table 1 presents the detailed plant species and training and test samples for each species. Contrary to the many existing plant datasets collected under laboratory or standardized conditions [ 8, 12], the CSFID dataset was collected entirely in natural environments without isolation of plants or arrangement of scenes. This eliminates the need for controlled acquisition settings for the users and allows the dataset to reflect real-world scenarios in a more realistic way, where multiple plant species, occlusions, background clutter, and environmental variations naturally occur. 
As a result, the CSFID dataset provides a more challenging and realistic benchmark for evaluating plant recognition models. The dataset was originally split into a train and a test set. Single-plant images or uniform-background images were included in the training set, while more complex multi-plant images captured in crowded plant regions were reserved for testing. This provided an assessment of the model’s generalization under a challenging real-world distribution shift across multiple plants, partial occlusions, background interference, and varying spatial compositions. The official CSFID split does not follow a conventional IID classification protocol. The training set of CSFID primarily contains single-plant or relatively uniform-background images, whereas the test set contains more challenging multi-plant scenes with occlusions, background clutter, and varying spatial compositions. For that reason, the model performances are based on the interpretation of classification performance under an official domain-shift evaluation setting rather than under a standard random train–test split. This train–test split protocol provides a realistic assessment of model robustness to distribution shifts commonly encountered in practical plant identification applications. Additionally, the acquisition conditions of plant images, such as different seasons, times of day, weather conditions, and mobile camera devices, introduce challenging variations in illumination, color distribution, angle, and brightness. The multi-plant scenes in the test set also aim to evaluate the model’s abilities in real-world scenarios. Figure 1 and Figure 2 present example images for plant species of the Cyprus Seasonal Flora Image Dataset. The first row shows training images, while the second row presents test images of the same plant species. 
The test images share the characteristics of the training images; however, they also include additional artifacts, plants, or environmental effects that make recognition challenging. Figure 1 shows how the aloe vera, kalanchoe, and iris images differ between training and testing: single plants or particular flower characteristics appear in the training images, whereas multiple plants, crowded environments, or other flower characteristics appear in the test images. Similarly, Figure 2 presents plants and flowers captured during different seasons and maturity stages.

3.1.2. Swedish Leaf Dataset (SLD)

We considered the Swedish Leaf Dataset [8] as a supplementary dataset to assess the consistency of the proposed HDL-PlantNet architecture under controlled-environment and standardized imaging conditions. This benchmark dataset comprises leaf images from 15 plant species, with approximately 75 samples per class, captured under controlled conditions. Figure 3 presents sample leaf images from the Swedish Leaf Dataset.

3.3. The Proposed HDL-PlantNet Model

The proposed HDL-PlantNet model is a custom hybrid architecture that combines an EfficientNetV2-S backbone with a Transformer encoder block to leverage the complementary strengths of both approaches. HDL-PlantNet is not proposed as a fundamentally new CNN–Transformer paradigm. Instead, its contribution lies in the task-specific adaptation of a compact Transformer-based contextual refinement module for challenging field-based plant recognition. The proposed model accepts 224 × 224 input images, normalized using the standard ImageNet mean and standard deviation. EfficientNetV2-S is employed as the primary backbone network to extract hierarchical representations. For transfer learning, the backbone was initialized with pretrained ImageNet weights and fine-tuned on the target plant dataset.
Prior to the transformer module, the EfficientNetV2-S backbone effectively captured local textures, shapes, and edge information. Although the EfficientNetV2-S backbone was retained within the HDL-PlantNet architecture, only a subset of its parameters participated in gradient-based optimization during fine-tuning. Specifically, the later backbone layers and the Transformer module remained trainable, whereas the remaining backbone layers were frozen. Therefore, trainable parameter counts reflect the adopted training strategy rather than the full architectural complexity of the model. The original EfficientNetV2-S classification head, consisting of pooling, dropout, and fully connected output layers, is removed and replaced with a Transformer-enhanced custom classification module to improve representational efficacy. The extracted features are reshaped into token sequences and processed using a multi-head self-attention mechanism. This block consists of attention layers, residual skip connections, layer normalization, and a feed-forward network. The Transformer module is used to model spatial dependencies and relationships in different image regions. Therefore, the hybrid structure enables simultaneous local and global feature learning. Then, an adaptive average pooling layer is used to generate a compact global descriptor for features. Finally, a fully connected classification layer maps the learned representation to the corresponding plant classes. Figure 4 presents the basic block diagram of the proposed HDL-PlantNet model. Table 2 shows the layer-by-layer composition of the proposed model with its parameters. 3.4. Experimental Design and Evaluation Metrics Due to the Cyprus Seasonal Flora Image Dataset being pre-divided into training and testing subsets, the models were trained using the training set and subsequently evaluated on the test set. The test set was not used for iterative hyperparameter optimization or architecture tuning. 
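The Transformer-enhanced classification module described in Section 3.3 can be illustrated with a minimal PyTorch sketch. It covers only the part after the backbone: tokenizing the feature map, one encoder block, average pooling, and the final linear layer. The class name, the 1 × 1 projection, the token width (256), the feed-forward width, and the number of heads (4) are hypothetical choices for illustration and are not taken from the paper. The input shape assumes a torchvision-style EfficientNetV2-S, whose final feature map for a 224 × 224 image has 1280 channels at 7 × 7; a random tensor stands in for it here.

```python
import torch
from torch import nn


class TransformerRefinementHead(nn.Module):
    """Hypothetical sketch of a Transformer-based head on a CNN feature map."""

    def __init__(self, in_channels=1280, embed_dim=256, num_heads=4,
                 num_classes=27, dropout=0.2):
        super().__init__()
        # Assumed 1x1 convolution that projects backbone channels to the token width.
        self.project = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        # One encoder block: multi-head self-attention, residual connections,
        # layer normalization, and a feed-forward network.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=2 * embed_dim, dropout=dropout, batch_first=True)
        # Adaptive average pooling over the token axis gives one global descriptor.
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, feature_map):
        x = self.project(feature_map)            # (B, D, H, W)
        tokens = x.flatten(2).transpose(1, 2)    # (B, H*W, D): one token per location
        tokens = self.encoder(tokens)            # (B, H*W, D)
        pooled = self.pool(tokens.transpose(1, 2)).squeeze(-1)  # (B, D)
        return self.classifier(pooled)           # (B, num_classes)


# Random stand-in for the backbone output of two 224 x 224 images.
torch.manual_seed(0)
features = torch.randn(2, 1280, 7, 7)
head = TransformerRefinementHead()
logits = head(features)
print(logits.shape)
```

In a full model, the feature map would come from the fine-tuned backbone, with the earlier backbone layers frozen by setting `requires_grad = False` on their parameters.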
The Swedish Leaf Dataset is split into training and test sets using an 80:20 hold-out since the image characteristics are uniform for each class. The results are obtained using the test set. Initially, hyperparameters are tuned for all models on the Cyprus Seasonal Flora Image Dataset training set using 5-fold cross-validation. The same five stratified cross-validation folds were used for all models to ensure a fair comparison. During the optimization process, learning rate, batch size, number of epochs, and dropout rate were systematically adjusted, and selection was based primarily on the mean Macro-F1 score across validation folds because of the class imbalance present in CSFID. Table 3 presents the hyperparameter optimization variations in detail. As a result, a batch size of 16, a learning rate of 1 × 10⁻⁴, and a dropout rate of 0.2 consistently achieved the best overall performance across all evaluated models. Even though the optimal number of epochs varied slightly between models, ranging from 8 to 11, the observed differences in F1-score within this interval remained at approximately ±1.1%. Therefore, we fixed the training epochs at 10 for all models. After determining the optimal hyperparameters, each model was retrained on the full training set and evaluated once on the official test set. The test set was not used during model selection or hyperparameter optimization. The models are trained using the cross-entropy loss function and the Adam optimizer. The models were trained and evaluated independently on each dataset. The performance of the models was evaluated using common multiclass evaluation metrics: Accuracy, Macro-F1 Score, Macro-Recall, Macro-Precision, Macro-Specificity, and Macro-AUC for all datasets.
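To make the macro-averaged metrics concrete, the following sketch computes them from a confusion matrix using the usual one-vs-rest definitions, where each class contributes equally to the average regardless of its size. The function name and the toy 3-class matrix are hypothetical and are not data from the paper. Macro-AUC is omitted because it requires predicted class scores rather than hard labels.

```python
def macro_metrics(confusion):
    """Macro-averaged metrics from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls, f1s, specificities = [], [], [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                          # class k samples that were missed
        fp = sum(confusion[i][k] for i in range(n)) - tp     # other classes predicted as k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        specificities.append(specificity)
    return {
        "accuracy": sum(confusion[k][k] for k in range(n)) / total,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
        "macro_specificity": sum(specificities) / n,
    }


# Hypothetical imbalanced toy example: class 0 dominates the test set.
toy_confusion = [
    [8, 1, 1],
    [2, 2, 0],
    [1, 0, 1],
]
metrics = macro_metrics(toy_confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case, accuracy exceeds Macro-Recall because the dominant class is recognized better than the two small classes, which is the situation in which macro-averaged scores are more informative than accuracy.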
After determining the final configuration, all competing models were trained and evaluated with the same fixed random seed and official dataset split to ensure reproducible, consistent comparison conditions. For all experiments, no additional data augmentation techniques are applied during training. Images are resized to 224 × 224 pixels and normalized using the ImageNet mean and standard deviation. Additionally, a paired McNemar statistical test [28] is conducted to determine the optimal model on the Cyprus Seasonal Flora Image Dataset. To account for multiple pairwise comparisons among the evaluated models, Holm–Bonferroni correction was applied to control the family-wise error rate. Adjusted p-values were used when interpreting statistical significance. All experiments were conducted on a Windows 11 system equipped with 64 GB RAM, an Intel® Core™ i9-14900KF, and an NVIDIA GeForce RTX 5090, using PyTorch 2.8.0.

4. Results

4.1. Results on Cyprus Seasonal Flora Image Dataset

Macro-averaged results are obtained and analyzed for all models on the Cyprus Seasonal Flora Image Dataset in order to perform comparative evaluation, and class-based results are obtained and analyzed using the model that achieved the highest observed performance determined by the macro-averaged results.

4.1.1. Macro-Averaged Results on Cyprus Seasonal Flora Image Dataset

Macro-averaged results showed that the ResNet50 model did not achieve competitive performance in classifying plant species and achieved the lowest scores across all metrics. It achieved 46.99% overall accuracy and 38.62% Macro-F1 Score. Even though the MaxViT model achieved high accuracy (96.54%), it could not obtain effective results in other metrics. The MaxViT model achieved 42.31% Macro-Recall and 42.12% Macro-F1 Score, which indicates that the model focuses on dominant classes while failing to classify classes with lower image counts.
The VGG16 and ConvNeXt-Tiny models obtained similar results; however, ConvNeXt-Tiny achieved slightly higher results in all metrics. In particular, ConvNeXt-Tiny produced more precise results for positive predictions (Macro-Precision = 85.60%), whereas VGG16 failed to predict with high precision (78.31%). EfficientNetV2-S achieved higher results compared to other pre-trained deep learning models and obtained 87.71% Macro-F1 Score and 86.71% Macro-Recall. However, the proposed HDL-PlantNet model achieved the highest observed performance in all metrics (Macro-F1 Score = 90.06%, Macro-Precision = 91.70%, and Macro-Recall = 89.38%) and outperformed all other models considered in this study. Figure 5 presents the confusion matrices of the top 2 models, HDL-PlantNet and EfficientNetV2 Small. Table 4 presents all results for all models in detail. 4.1.2. Class-Based Results on Cyprus Seasonal Flora Image Dataset Table 5 presents the detailed class-based results obtained by the proposed HDL-PlantNet and the backbone EfficientNetV2 S, which obtained the second-highest scores on the Cyprus Seasonal Flora Image Database. Since the proposed HDL-PlantNet outperformed other models in macro-averaged results, it is analyzed in class-based results. The HDL-PlantNet model showed outstanding performance in classifying Crown Daisy, Iris, and Pine plants, achieving 100% across all metrics. However, the model had some difficulties in the discrimination of the mandarin (56.25%), and orange classes (57.14%) belonging to the same family. The model tended to favor the lemon class, which had a larger number of training images. Some species in the dataset contain fewer samples than others, which may reduce the models’ capability to sufficiently discriminate minor classes under complex field conditions. However, despite these challenges, the proposed HDL-PlantNet architecture maintained relatively stable macro-level performance. 
This suggests that the contextual-attention refinement mechanism may improve robustness for minority categories under difficult field conditions.

4.1.3. Statistical Results on Cyprus Seasonal Flora Image Dataset

McNemar analysis was used to evaluate paired prediction disagreement on identical test samples, complementing the macro-averaged performance metrics. Table 6 presents the paired McNemar test results comparing the proposed HDL-PlantNet model with the benchmark architectures on the Cyprus Seasonal Flora Image Dataset. The significance decisions were based on Holm–Bonferroni-adjusted p-values. The findings indicate that HDL-PlantNet showed statistically significant improvements compared to VGG16 (p = 0.0099), ResNet50 (p = 1.4 × 10⁻⁵), ConvNeXt-Tiny (p = 0.0094), and MaxViT (p = 1.3 × 10⁻⁵). These results suggest that the superior classification performance of the proposed HDL-PlantNet model against these models is unlikely to be due to random variation. However, the p-value for the comparison between HDL-PlantNet and EfficientNetV2-S is slightly above the 0.05 threshold. Since EfficientNetV2-S also serves as the backbone of the proposed model, this result suggests that the Transformer enhancement provided measurable but modest gains. Overall, the statistical analysis supports the effectiveness of HDL-PlantNet compared to VGG16, ResNet50, ConvNeXt-Tiny, and MaxViT. HDL-PlantNet achieved a practically meaningful improvement over EfficientNetV2-S in the Macro-F1 score, increasing from 87.71% to 90.06%, but the improvement was not statistically significant at the conventional 0.05 threshold.

4.2. Results on Swedish Leaf Dataset

The experiment on the SLD was performed as a supplementary benchmark evaluation rather than an indicator of real-world generalization of the models.
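As a supplementary illustration of the statistical procedure in Section 4.1.3, the sketch below pairs a McNemar test with Holm–Bonferroni adjustment. The paper does not state which McNemar variant was used, so the exact binomial form is an assumption here. The function names, the toy correctness vectors, and the raw p-values are hypothetical and are not results from the study.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-sample correctness flags."""
    # b: model A right and model B wrong; c: model A wrong and model B right.
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_bonferroni(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical toy data: 14 test samples with 8 + 1 discordant pairs.
model_a_correct = [True] * 8 + [False] * 1 + [True] * 5
model_b_correct = [False] * 8 + [True] * 1 + [True] * 5
p_value = mcnemar_exact(model_a_correct, model_b_correct)

# Hypothetical raw p-values from three pairwise comparisons.
adjusted_p = holm_bonferroni([0.01, 0.04, 0.03])
print(p_value, adjusted_p)
```

Only the discordant pairs enter the test, which is why two models with visibly different Macro-F1 scores can still fail to differ significantly when they disagree on few test images.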
The Swedish Leaf Dataset is used to test the proposed model to determine its effectiveness across different datasets, and a similar comparative evaluation is performed across all the considered deep learning models. Given the low intra-class variation, uniform pattern representation, and clean backgrounds of leaf images in the Swedish Leaf Dataset, all models achieved full or similar recognition performance. However, the VGG16 model obtained slightly lower results with 95.56% Macro-Recall and 95.16% Macro-F1 Score. Even though the EfficientNetV2-S model failed to recognize a few samples, it achieved >99% performance across all metrics. The other models, ResNet50, ConvNeXt-Tiny, and the proposed HDL-PlantNet, achieved 100% performance. The Swedish Leaf Dataset results demonstrate that the considered models and the proposed architecture maintain stable performance under controlled imaging conditions. Table 7 presents the obtained results on the Swedish Leaf Dataset in detail. 4.3. Ablation Study An ablation study was conducted to evaluate and analyze the contributions of the Transformer module and its variants in HDL-PlantNet on the Cyprus Seasonal Flora Image Dataset. In the ablation study, 4 model variants, the EfficientNetV2-S backbone alone (A0_NO_TX), the HDL-PlantNet with a single transformer using two attention heads (A1_H2), the HDL-PlantNet with a single transformer using eight attention heads (A1_H8), and the HDL-PlantNet with two transformers, each with four attention heads (A3_2B), are tested. All other components, including the EfficientNetV2-S backbone, classification head, and training schedule, were kept identical across experiments. Table 8 presents the details and results of the ablation study. The ablation results indicate that the transformer configuration influences the balance between Macro-Recall and Macro-Precision. 
The proposed HDL-PlantNet architecture achieved the highest overall Macro-F1 score (90.06%), representing a modest but practically meaningful improvement over the EfficientNetV2-S backbone (87.71%). Increasing the number of attention heads or transformer blocks did not consistently improve performance. Although the variant with two transformer blocks and four attention heads achieved relatively higher Macro-Precision (99.07%), its Macro-Recall decreased substantially (Macro-Recall = 63.64%), indicating reduced effectiveness at identifying positive samples. This reduction might be attributed to over-parameterization with the limited dataset size, which was insufficient to effectively optimize a larger number of attention subspaces. Similarly, removing the transformer module resulted in lower Macro-Recall and Macro-F1 scores compared to the proposed HDL-PlantNet architecture. Overall, the results demonstrate that larger transformer blocks do not contribute to the convergence, particularly on relatively small, fine-grained plant image datasets. These findings are consistent with previous studies reporting that self-attention mechanisms can improve feature representation and contextual modeling in visual recognition tasks [ 29]. However, transformer integration was not universally beneficial; while the single-block transformer configurations improved performance, increasing the number of attention heads or transformer blocks led to notable performance degradation. 5. Discussion The employed hybrid deep learning model, HDL-PlantNet, demonstrated encouraging performance on two datasets. The use of EfficientNetV2-S as the backbone, combined with a Transformer and attention head module, contributed to the modeling of long-range dependencies within the images. Consequently, color, structural, and regional features could be represented more effectively [ 4, 30]. 
Because CSFID uses a predefined domain-shift split, the reported results reflect the model’s robustness to that distribution shift and should not be interpreted as universally representative of all fine-grained plant classification settings. Instead of employing multiple heavy Transformer blocks, we adopted a compact single-block configuration with carefully selected attention heads to preserve computational feasibility while maintaining strong discriminative performance. Additionally, the ablation study demonstrated that the Transformer encoder module integrated into the EfficientNetV2-S backbone provides a measurable contribution, and the absence of this module or structural modifications negatively affects feature representation. The ablation study further shows that the proposed contextual refinement module improves performance compared with alternative Transformer configurations and the backbone-only architecture [4, 20]. HDL-PlantNet achieved a balance between classification performance and computational efficiency. Although the proposed architecture includes additional Transformer-based operations compared to EfficientNetV2-S, the increase in computational complexity was moderate (approximately +13.8% in GFLOPs). Furthermore, HDL-PlantNet required substantially fewer trainable parameters (9,191,939) than the compared architectures due to partial parameter freezing. This reduced optimization complexity and preserved higher-level representations. Despite the increased architectural complexity, the proposed model maintained real-time inference performance with competitive latency (14.86 ms/image) and FPS (67.29). Table 9 compares the computational efficiency of the models in detail. As determined during hyperparameter optimization, all experiments were performed using a batch size of 16. However, latency and FPS measurements were obtained using a batch size of 1 to evaluate single-image inference performance.
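A single-image latency measurement of this kind can be sketched as a warm-up phase followed by timed forward passes, with FPS derived as 1000 divided by the mean latency in milliseconds. The reported figures are consistent with that relation (1000 / 14.86 ≈ 67.29). The function name and the dummy inference callable below are hypothetical stand-ins, and the authors' actual timing code is not shown in the paper.

```python
import time


def measure_latency(infer, sample, warmup=100, runs=1000):
    """Return (mean latency in ms, FPS) for single-sample inference.

    `infer` is any callable that takes one preprocessed sample; data loading
    and preprocessing are deliberately left outside the timed region.
    """
    # Warm-up passes are not timed, so one-off start-up costs are excluded.
    for _ in range(warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    # On a GPU, a device synchronization (for example torch.cuda.synchronize())
    # would be needed before reading the timer, because kernels run asynchronously.
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / runs
    fps = 1000.0 / latency_ms
    return latency_ms, fps


# Hypothetical stand-in for a model forward pass on one image.
def dummy_infer(values):
    return sum(v * v for v in values)


latency_ms, fps = measure_latency(dummy_infer, list(range(1000)), warmup=10, runs=200)
print(f"{latency_ms:.4f} ms/image, {fps:.1f} FPS")
```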
Latency values correspond to the average of 1000 forward passes after an initial warm-up phase of 100 inferences on the RTX 5090 GPU. Data-loading and preprocessing times were excluded from the measurements. The success and strength of the proposed model were primarily demonstrated on the Cyprus Seasonal Flora Image Dataset, which was introduced as the primary dataset and consists of intertwined plant images collected under real-life environmental conditions. The proposed model outperformed all five benchmark models across all evaluation metrics and successfully classified minority plant classes with relatively fewer images in the imbalanced dataset. These results are consistent with prior studies’ findings, such as ConvTransNet-S (with LPU + transformer), which showed large gains over standalone CNNs and ViTs in complex scenes [ 4]. Compared to large ensembles such as Plant-CNN-ViT [ 19], which achieved ~100% accuracy on lab leaf sets, our approach is more compact and explicitly tailored to noisy field images. The proposed HDL-PlantNet demonstrated strong effectiveness in field-based plant recognition across varied natural backgrounds, lighting conditions, intertwined plants, and visually similar species. The GradCAM++ [ 31, 32] analysis demonstrated that the proposed model focused on biologically discriminative regions of different plant species under varying visual conditions. In Figure 6a,b, the activation map highlights leaf and flower structures of the olive tree, indicating that the model relies on seasonal characteristic texture and patterns for classification. In Figure 6c,d, the model successfully identified the olive tree from trunk-related structural features, despite the absence of visible leaves or flowers. 
Similarly, Figure 6e,f shows that the model concentrated on rose flower regions, whereas Figure 6g,h demonstrates that the model can also utilize leaf morphology and vein structures for rose classification even when flowers are absent from the scene. However, it should be noted that images of lemons, oranges, and mandarins belonging to the same family were occasionally confused, with the minority classes more frequently misclassified as the lemon class, which had more training images. This finding suggests that increasing the number of samples for certain underrepresented classes would lead to a more balanced and practically meaningful real-world dataset. Figure 7 shows how the model misclassified an orange image as a lemon. Even though the Swedish Leaf Dataset is a relatively easy benchmark on which modern deep models can achieve high performance [ 33], it indicates the consistency of HDL-PlantNet across controlled-environment datasets. The obtained results on this dataset are consistent with prior studies, reporting up to 99.47% accuracy on Swedish leaves using a CNN with data standardization, and an ensemble of pre-trained CNNs has reached 100% on this dataset [ 33]. However, these datasets largely consisted of clean and isolated leaf images captured under favorable conditions. In contrast, the Cyprus Seasonal Flora Image Dataset presented a more realistic real-world scenario. Therefore, the strong results obtained by HDL-PlantNet suggest that modern hybrid architectures can substantially narrow the long-standing performance gap between laboratory-based and field-based plant image classification. Table 10 summarizes the plant and disease classification studies for different datasets. Although the proposed HDL-PlantNet model achieved promising results, the study has several limitations. First, the imbalanced class distribution of plant images in CSFID might create bias for dominant classes. 
Second, although the proposed model was trained on two datasets with distinct characteristics, external validation is still required before real-world deployment. Third, visually similar species from the same botanical family remain challenging to distinguish and require further investigation. Future work will address these issues with larger multi-region datasets and balanced sampling strategies. Additionally, the proposed HDL-PlantNet is primarily a task-oriented adaptation of existing CNN–Transformer concepts and does not propose a fundamentally new architectural approach. Its primary contribution is to provide measurable benefits for plant recognition under challenging domain-shift conditions. Finally, the experiments used a fixed random seed to ensure reproducibility, which might limit the generalizability of the results; additional repetitions with multiple random seeds could provide a more comprehensive assessment.

6. Conclusions

In this study, we proposed a hybrid CNN–Transformer architecture, Hybrid Deep Learning PlantNet (HDL-PlantNet), designed for fine-grained plant species classification. Two datasets were used. The primary dataset, the Cyprus Seasonal Flora Image Dataset, incorporates real-world distortions and challenges, while the Swedish Leaf Dataset was used to assess the proposed model's consistency under controlled conditions. Comprehensive comparative experiments were conducted, and the proposed HDL-PlantNet model outperformed several state-of-the-art models, achieving macro-F1 scores of 90.06% and 100% on the Cyprus Seasonal Flora Image Dataset and the Swedish Leaf Dataset, respectively.
These results moderately surpass the best-performing baseline (EfficientNetV2-S), which reached a macro-F1 score of 87.71% on the Cyprus Seasonal Flora Image Dataset, and provide a measurable reference point for plant species classification under real-world field conditions. The gain comes from integrating an EfficientNetV2-S backbone with a Transformer encoder, which combines local and global feature representations within a unified framework. The ablation study confirmed the role of the Transformer module: removing it decreased performance by up to about 3 percentage points (for example, macro-recall fell from 89.38% to 86.71%). This indicates that the selected hybrid configuration can provide more balanced performance than the backbone model alone. The results suggest that a CNN–Transformer hybrid architecture can achieve promising performance on a real-world, field-collected dataset. Overall, the proposed approach and the released dataset can support real-time plant identification in complex real-life environments and informed ecological decision-making for farmers, ecologists, and the general public. Our future work will include extending the species list in the Cyprus Seasonal Flora Image Dataset, evaluating the proposed model on additional field-domain datasets, and developing a mobile application for real-world use.

Author Contributions

Conceptualization, D.C.M., B.S. and N.C.; methodology, D.C.M.; software, D.C.M. and B.S.; validation, D.C.M., B.S. and N.C.; formal analysis, B.S. and N.C.; investigation, B.S.; resources, B.S.; data curation, B.S.; writing (original draft preparation), D.C.M., B.S. and N.C.; writing (review and editing), D.C.M., B.S. and N.C.; visualization, D.C.M., B.S. and N.C.; supervision, B.S. and N.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.
Data Availability Statement

The Cyprus Seasonal Flora Image Dataset is available at https://data.mendeley.com/datasets/dfy8grjkss/1 (accessed on 24 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Example train (first row) and test (second row) plant images of the Cyprus Seasonal Flora Image Dataset: (a) Aloe Vera, (b) Kalanchoe, and (c) Iris.

Figure 2. Example train (first row) and test (second row) plant images of the Cyprus Seasonal Flora Image Dataset: (a) Rose, (b) Rosemary, and (c) Basil.

Figure 3. Example leaf images of the Swedish Leaf Dataset (SLD).

Figure 4. Basic block diagram of the proposed HDL-PlantNet model.

Figure 5. Confusion matrices of the best-performing models: (a) the proposed HDL-PlantNet; (b) the backbone EfficientNetV2 Small.

Figure 6. GradCAM++ visualization results for HDL-PlantNet. (a, c, e, g) Original test images. (b, d, f, h) Corresponding GradCAM++ activation maps. The highlighted regions indicate the discriminative visual features used by the model during classification, including flowers, leaves, trunk structures, and local texture patterns under different seasonal and scene complexities.

Figure 7. GradCAM++ visualizations: (a) orange image and (b) GradCAM++ visualization of the misclassified image.

Table 1. The Cyprus Seasonal Flora Image Dataset distribution according to plant species.

| Plant Species | Class | Train (n) | Test (n) | Total (n) |
|---|---|---|---|---|
| Aloe Vera | P1 | 112 | 28 | 140 |
| Arabian Jasmine | P2 | 104 | 26 | 130 |
| Basil | P3 | 56 | 14 | 70 |
| Cape Marguerite | P4 | 136 | 34 | 170 |
| Crown Daisy | P5 | 44 | 11 | 55 |
| Cycas | P6 | 154 | 39 | 193 |
| Cypress | P7 | 114 | 28 | 142 |
| Fig | P8 | 62 | 16 | 78 |
| Geranium | P9 | 180 | 45 | 225 |
| Grapevine | P10 | 62 | 15 | 77 |
| Iris | P11 | 127 | 32 | 159 |
| Jasmine | P12 | 42 | 11 | 53 |
| Kalanchoe | P13 | 69 | 17 | 86 |
| Lemon | P14 | 242 | 61 | 303 |
| Loquat | P15 | 202 | 50 | 252 |
| Magnolia | P16 | 133 | 33 | 166 |
| Mandarin | P17 | 62 | 16 | 78 |
| Nerium oleander | P18 | 75 | 19 | 94 |
| Nettle | P19 | 42 | 11 | 53 |
| Night Blooming Jasmine | P20 | 167 | 42 | 209 |
| Olive | P21 | 166 | 41 | 207 |
| Orange | P22 | 57 | 14 | 71 |
| Pine | P23 | 29 | 7 | 36 |
| Polygala myrtifolia | P24 | 45 | 11 | 56 |
| Rose | P25 | 365 | 91 | 456 |
| Rosemary | P26 | 80 | 20 | 100 |
| Yellow Jasmine | P27 | 145 | 36 | 181 |

Table 2. Layer-by-layer composition of the proposed HDL-PlantNet model.
| Layer (Type) | Output Shape | Parameters | Description |
|---|---|---|---|
| Input Image | 224 × 224 × 3 | 0 | Input plant image (RGB) |
| EfficientNetV2-S Backbone | 7 × 7 × 1280 | ≈22 million | Pretrained CNN feature extractor (stack of Conv + Fused-MBConv/MBConv blocks) |
| Transformer Encoder | 7 × 7 × 1280 | ≈9 million | 4-head self-attention + feed-forward block over 7 × 7 spatial tokens (includes LayerNorm and dropout) |
| Global Average Pooling | 1 × 1 × 1280 | 0 | Averages spatial features into a 1280-D global descriptor |
| Flatten | 1280 | 0 | Flattens the 1280-D pooled feature for classification |
| Fully Connected (Dense) | K | 1280 × K + K | Output layer for K plant classes (single linear layer producing logits) |

Table 3. Hyperparameter optimization details for the models.

| Hyperparameter | Range | Final Value |
|---|---|---|
| Learning Rate | 1 × 10⁻³, 1 × 10⁻⁴, 1 × 10⁻⁵ | 1 × 10⁻⁴ |
| Batch Size | 4, 8, 16, 32, 64 | 16 |
| Dropout | 0.2, 0.3, 0.5 | 0.2 |
| Epochs | 5–25 | 10 |

Table 4. Macro-averaged comparative performances on Cyprus Seasonal Flora Image Dataset.

| Model | Accuracy (%) | Macro-Precision (%) | Macro-Recall (%) | Macro-F1 Score (%) | Macro-AUC (%) |
|---|---|---|---|---|---|
| VGG16 | 98.99 | 78.31 | 82.75 | 83.52 | 94.02 |
| ResNet50 | 46.99 | 42.16 | 39.69 | 38.62 | 39.13 |
| ConvNeXt-Tiny | 99.00 | 85.60 | 84.63 | 84.16 | 83.30 |
| EfficientNetV2-S | 99.27 | 90.84 | 86.71 | 87.71 | 99.57 |
| MaxViT | 96.54 | 56.02 | 42.31 | 42.12 | 49.50 |
| HDL-PlantNet | 99.32 | 91.70 | 89.38 | 90.06 | 99.59 |

Bold values indicate the highest results.

Table 5. Class-based accuracy (%) results of the proposed HDL-PlantNet and backbone EfficientNetV2 Small on Cyprus Seasonal Flora Image Dataset.
| Plant | Test Samples | HDL-PlantNet | EfficientNetV2-S |
|---|---|---|---|
| Aloe Vera | 28 | 96.43 | 96.42 |
| Arabian Jasmine | 26 | 69.23 | 80.76 |
| Basil | 14 | 92.86 | 78.57 |
| Cape Marguerite | 34 | 100.00 | 100.00 |
| Crown Daisy | 11 | 100.00 | 81.82 |
| Cycas | 39 | 100.00 | 100.00 |
| Cypress | 28 | 92.86 | 92.86 |
| Fig | 16 | 87.50 | 100.00 |
| Geranium | 45 | 93.33 | 95.55 |
| Grapevine | 15 | 73.33 | 100.00 |
| Iris | 32 | 100.00 | 100.00 |
| Jasmine | 11 | 100.00 | 63.64 |
| Kalanchoe | 17 | 94.12 | 82.35 |
| Lemon | 61 | 100.00 | 95.08 |
| Loquat | 50 | 80.00 | 88.00 |
| Magnolia | 33 | 84.85 | 90.90 |
| Mandarin | 16 | 56.25 | 50.00 |
| Nerium oleander | 19 | 84.21 | 63.16 |
| Nettle | 11 | 100.00 | 100.00 |
| Night Blooming Jasmine | 42 | 85.71 | 95.23 |
| Olive | 41 | 100.00 | 95.12 |
| Orange | 14 | 57.14 | 28.57 |
| Pine | 7 | 100.00 | 100.00 |
| Polygala myrtifolia | 11 | 90.91 | 90.91 |
| Rose | 91 | 92.31 | 92.31 |
| Rosemary | 20 | 85.00 | 80.00 |
| Yellow Jasmine | 36 | 97.22 | 100.00 |
| Macro Averaged | 768 (total) | 89.38 | 86.71 |

Bold values indicate the highest results.

Table 6. Detailed statistical results on Cyprus Seasonal Flora Image Dataset.

| Comparison | McNemar p-Value | p-Value Status | Interpretation |
|---|---|---|---|
| HDL-PlantNet vs. VGG16 | 0.0099 | 0.05 | Slightly not significant |
| HDL-PlantNet vs. MaxViT | 1.3 × 10⁻⁵ | <0.05 | Significant |

Table 7. Detailed results on the Swedish Leaf Dataset.

| Model | Accuracy (%) | Macro-Recall (%) | Macro-Specificity (%) | Macro-F1 Score (%) | Macro-AUC (%) |
|---|---|---|---|---|---|
| VGG16 | 95.56 | 95.56 | 99.68 | 95.16 | 99.43 |
| ResNet50 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| ConvNeXt-Tiny | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| EfficientNetV2-S | 99.11 | 99.11 | 99.94 | 99.12 | 100.00 |
| MaxViT | 99.82 | 98.66 | 99.90 | 98.67 | 100.00 |
| HDL-PlantNet | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |

Bold values indicate the highest results.

Table 8. The detailed results of the ablation study.
| Ablation | Transformer Block | Attention Head | Accuracy (%) | Macro-Recall (%) | Macro-Precision (%) | Macro-F1 Score (%) | Macro-AUC (%) |
|---|---|---|---|---|---|---|---|
| A0_NO_TX | - | - | 99.27 | 86.71 | 90.84 | 87.71 | 99.57 |
| A1_H2 | 1 | 2 | 97.79 | 84.12 | 94.11 | 89.01 | 98.39 |
| A1_H8 | 1 | 8 | 94.40 | 75.56 | 75.55 | 61.26 | 93.51 |
| A3_2B | 2 | 4 + 4 | 99.48 | 63.64 | 99.07 | 77.78 | 98.84 |
| HDL-PlantNet | 1 | 4 | 99.32 | 89.38 | 91.70 | 90.06 | 99.59 |

Bold values indicate the highest results.

Table 9. Comparison of the computational complexity and inference efficiency of the proposed HDL-PlantNet and benchmark architectures.

| Model | Total Parameters | Trainable Parameters | FLOPs (GFLOPs) | Latency (ms/Image) | FPS |
|---|---|---|---|---|---|
| ResNet50 | 23,563,355 | 23,563,355 | 8.178 | 5.00 | 199.93 |
| MaxViT | 30,421,475 | 30,421,475 | 11.12 | 22.29 | 44.84 |
| VGG16 | 134,371,163 | 134,371,163 | 3.09 | 1.474 | 679.56 |
| ConvNeXt-Tiny | 27,840,891 | 27,840,891 | 8.90 | 5.26 | 189.85 |
| EfficientNetV2-S | 20,181,331 | 20,181,331 | 5.69 | 11.75 | 85.08 |
| HDL-PlantNet | 29,369,427 | 9,191,939 | 6.48 | 14.86 | 67.29 |

Total parameter count reflects architectural complexity. Trainable parameter count depends on training strategy. Since HDL-PlantNet employed partial parameter freezing, trainable parameter counts should not be interpreted as a direct architectural comparison with fully trainable baseline models.

Table 10. Plant and disease classification studies for different datasets.

| Study | Dataset | Dataset Property | Approach | Accuracy |
|---|---|---|---|---|

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Malann, D.C.; Cavus, N.; Sekeroglu, B. A CNN and Transformer-Based Framework for Fine-Grained Plant Species Classification in Real-World Environments. Appl. Sci. 2026, 16, 5810. https://doi.org/10.3390/app16125810
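The latency protocol stated in the discussion (100 warm-up inferences, then the average of 1000 forward passes, with data loading and preprocessing excluded) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' measurement script: the function name `benchmark_latency`, its parameters, and the `sync` hook are hypothetical, and the article does not say which framework or timer was used.

```python
import time
from statistics import mean
from typing import Callable, Tuple


def benchmark_latency(
    forward: Callable[[], object],
    warmup: int = 100,
    runs: int = 1000,
    sync: Callable[[], None] = lambda: None,
) -> Tuple[float, float]:
    """Return (mean latency in ms per call, calls per second).

    `forward` is assumed to run one inference on data that is already
    loaded and preprocessed, so only the forward pass is timed.
    `sync` is a hypothetical hook for devices that execute work
    asynchronously; by default it does nothing.
    """
    # Warm-up calls are executed but never timed.
    for _ in range(warmup):
        forward()
    sync()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        forward()
        sync()  # wait for pending device work before stopping the clock
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    latency_ms = mean(timings_ms)
    fps = 1000.0 / latency_ms  # throughput implied by the mean latency
    return latency_ms, fps


if __name__ == "__main__":
    # Stand-in workload (hypothetical); a real run would call the model here.
    def dummy_forward() -> int:
        return sum(i * i for i in range(10_000))

    latency, fps = benchmark_latency(dummy_forward, warmup=10, runs=50)
    print(f"{latency:.3f} ms/call, {fps:.1f} calls/s")
```

On a GPU, the forward pass is typically queued asynchronously, so a synchronization call (for example `torch.cuda.synchronize` in PyTorch) would be passed as `sync`; otherwise the timer may stop before the work finishes. The FPS column of Table 9 is consistent with this 1000 / latency relation: 14.86 ms per image corresponds to about 67.3 images per second.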
