Objective With the deepening integration of remote sensing and computer vision, tree detection in complex scenes remains hindered by occlusion, background noise, and high-density canopies. Traditional methods struggle to balance global semantics with local detail, resulting in limited detection accuracy and robustness. To overcome this bottleneck, this study proposes an end-to-end tree detection framework (ViT-Count) built around the Vision Transformer (ViT), aiming to leverage its global context modeling capability to substantially improve the accuracy of tree counting and localization in dense forest and urban environments.
Method Firstly, the Vision Transformer (ViT) was adopted as the backbone owing to its inherent advantage in capturing global contextual information, which makes it well suited to complex environments with diverse visual patterns. Secondly, a Visual Prompt Tuning (VPT) strategy tailored to tree detection was introduced: by injecting learnable prompts into the feature space, it enhances the model's ability to extract discriminative features under high canopy density, illumination variation, and structural differences across tree species, thereby improving adaptability to various forest stand types. Thirdly, a convolutional block attention mechanism was designed to complement the backbone's long-range dependency modeling with local perception, strengthening the model's capacity to distinguish occluded, overlapping, and morphologically similar tree targets and thus improving detection robustness and accuracy. Finally, a tree detection decoder was constructed that progressively restores spatial resolution through stacked convolutional layers, normalization, GELU activation, and upsampling operations, ultimately generating a tree density map for counting and localization.
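The abstract does not provide implementation details, so the following is a minimal PyTorch sketch of how the described components could fit together. The hyperparameters (patch size 16, 384-dimensional embeddings, 8 prompt tokens), the shallow-VPT prompt injection, the CBAM-style attention block, and the decoder layout are all illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn


class PromptedViTBackbone(nn.Module):
    """ViT-style encoder with learnable visual prompt tokens (assumed shallow VPT)."""
    def __init__(self, img_size=256, patch=16, dim=384, depth=6, heads=6, n_prompts=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, dim))  # learnable prompts
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.n_prompts = n_prompts

    def forward(self, x):
        b = x.size(0)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        tokens = torch.cat([self.prompts.expand(b, -1, -1), tokens], dim=1)
        tokens = self.encoder(tokens)[:, self.n_prompts:]           # drop prompt tokens
        h = w = int(tokens.size(1) ** 0.5)
        return tokens.transpose(1, 2).reshape(b, -1, h, w)          # B x C x H/16 x W/16


class CBAMBlock(nn.Module):
    """Convolutional block attention: channel attention followed by spatial attention."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(nn.Linear(dim, dim // reduction), nn.ReLU(),
                                         nn.Linear(dim // reduction, dim))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)            # channel reweighting
        spat = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(spat))           # spatial reweighting


class DensityDecoder(nn.Module):
    """Progressively restores resolution and outputs a single-channel density map."""
    def __init__(self, dim=384):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.GroupNorm(8, cout), nn.GELU(),
                                 nn.Upsample(scale_factor=2, mode="bilinear",
                                             align_corners=False))
        self.up = nn.Sequential(block(dim, 128), block(128, 64), block(64, 32))
        self.head = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        return torch.relu(self.head(self.up(x)))                    # non-negative density


class ViTCount(nn.Module):
    """Assumed pipeline: prompted ViT backbone -> attention block -> density decoder."""
    def __init__(self):
        super().__init__()
        self.backbone = PromptedViTBackbone()
        self.attn = CBAMBlock(384)
        self.decoder = DensityDecoder(384)

    def forward(self, x):
        density = self.decoder(self.attn(self.backbone(x)))
        return density, density.sum(dim=(1, 2, 3))                  # per-image tree count


# Example: a 256x256 RGB tile yields a density map and an estimated tree count.
model = ViTCount()
dmap, count = model(torch.randn(1, 3, 256, 256))
```

In such density-based counting designs, the predicted count is simply the integral (sum) of the density map, and local maxima of the map serve as tree locations; this sketch follows that convention.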
Results Comparative experiments on the public Larch Casebearer and Urban Tree datasets show that, relative to mainstream models, the proposed method reduces MAE by up to 2.53 and RMSE by up to 3.99, demonstrating stronger generalization capability. Visualization results indicate that the model maintains high accuracy in both dense forests and complex urban scenes. Ablation studies further confirm the individual contributions of VPT, the attention module, and the decoder to overall performance.
Conclusion The proposed Vision Transformer-based method for tree counting and localization in complex scenes fully exploits the global modeling capability of ViT and the task adaptability of visual prompt tuning. By further integrating convolutional block attention, it effectively improves the accuracy and robustness of tree counting and localization in complex environments.