
    ViT-Count: a Vision Transformer-based tree counting and localization method for canopy-occluded scenarios

    • Abstract:
      Objective To address the challenges of tree detection in complex scenes, such as occlusion, background interference, and dense distribution, this study proposes ViT-Count, a Vision Transformer (ViT)-based tree detection method that improves detection accuracy and robustness for trees in complex scenes.
      Method ViT is adopted as the backbone because of its inherent advantage in capturing global contextual information in images, which suits complex environments with highly variable tree morphology. A Visual Prompt Tuning (VPT) mechanism tailored to trees is designed: by injecting learnable prompts into the features, it strengthens feature extraction under dense forest canopies, illumination changes, and structural differences among tree species, improving adaptability to different stand types (a minimal sketch of this prompt-injection step follows the abstract). A convolutional block attention module is designed; its long-range dependency modeling on top of local perception sharpens the model's ability to separate occluded, overlapping, and morphologically similar targets, improving overall detection robustness and accuracy. A tree detection decoder is designed that progressively restores spatial resolution through stacked convolution, normalization, GELU activation, and upsampling operations, and the resulting density map is used for tree counting and localization.
      Result The method improves the robustness of tree detection in forest and urban scenes while strengthening generalization to multi-scale tree targets. Experiments on the Larch Casebearer and Urban Tree datasets show that, compared with other mainstream models, the proposed method reduces MAE and RMSE by up to 2.53 and 3.99, respectively, indicating stronger generalization and the best tree detection performance. Visualization results show high tree detection accuracy in both dense forest and complex urban scenes, and ablation experiments confirm the effectiveness of the model's main modules.
      Conclusion The Vision Transformer-based tree counting and localization method for complex scenes fully exploits the global modeling capability of ViT and the task adaptability of visual prompt tuning; combined with the convolutional block attention mechanism, it effectively improves the accuracy and robustness of tree counting and localization in complex scenes.
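
      The prompt-injection step described under Method can be made concrete with a short PyTorch sketch. This is a minimal, illustrative reading of shallow visual prompt tuning on a ViT-style encoder; the class name PromptedViTEncoder, the number of prompt tokens, and all hyper-parameters are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class PromptedViTEncoder(nn.Module):
    """ViT-style encoder with learnable prompt tokens (VPT-shallow sketch)."""

    def __init__(self, embed_dim=768, depth=4, num_heads=12,
                 num_prompts=10, img_size=224, patch_size=16):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Standard ViT stem: one strided convolution as the patch embedding.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # Learnable prompt tokens injected alongside the patch tokens.
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_prompts = num_prompts

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, C)
        tokens = tokens + self.pos_embed
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([prompts, tokens], dim=1)       # prepend prompts
        tokens = self.blocks(tokens)
        return tokens[:, self.num_prompts:]                # keep patch tokens only


if __name__ == "__main__":
    enc = PromptedViTEncoder()
    # VPT-style tuning: the (pretrained) backbone stays frozen and only the
    # prompt tokens receive gradients; the task decoder would also be trained.
    for name, p in enc.named_parameters():
        p.requires_grad = name.startswith("prompts")
    out = enc(torch.randn(2, 3, 224, 224))
    print(out.shape)                                       # torch.Size([2, 196, 768])
```

      In such a shallow-VPT setup only the prompt tokens (and, in the full model, the task-specific decoder) would be updated, which is what keeps the adaptation to different stand types lightweight.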

       

      Abstract:
      Objective Against the backdrop of deep integration between remote sensing and computer vision, tree detection in complex scenes has long been hindered by occlusion, background noise, and high-density canopies. Traditional methods struggle to balance global semantics and local details, resulting in insufficient detection accuracy and robustness. To overcome this bottleneck, this study proposes an end-to-end tree detection framework (ViT-Count) centered on the Vision Transformer (ViT), aiming to leverage its global context modeling capability to significantly enhance the accuracy of tree counting and localization in dense forest and urban environments.
      Method Firstly, the Vision Transformer (ViT) was employed as the backbone model due to its inherent advantage in capturing global contextual information, making it particularly well-suited for complex environments with diverse visual patterns. Secondly, a Visual Prompt Tuning (VPT) strategy tailored for tree detection was introduced. By injecting learnable prompts into the feature space, this strategy enhanced the model’s capability to extract discriminative features under conditions such as high canopy density, illumination variation, and structural differences across tree species, thereby improving its adaptability to various forest stand types. Thirdly, a convolutional block attention module was designed, leveraging its ability to model long-range dependencies on top of local perception. This effectively strengthened the model’s capacity to distinguish between occluded, overlapping, and morphologically similar tree targets, thereby enhancing the overall robustness and accuracy of detection. Finally, a tree detection decoder was constructed, which progressively restored spatial resolution through stacked convolutional layers, normalization, GELU activation, and upsampling operations, ultimately generating a tree density map for tree counting and localization (a sketch of the attention module and decoder follows the abstract).
      Result Comparative experiments on the public Larch Casebearer and Urban Tree datasets showed that, compared with mainstream models, the proposed method achieved reductions of up to 2.53 in MAE and 3.99 in RMSE, demonstrating stronger generalization capability. Visualization results indicate that the model maintains high accuracy in both dense forests and complex urban scenes. Ablation studies further confirm the contributions of the VPT, attention, and decoder components to the overall performance.
      Conclusion This Vision Transformer-based method for tree counting and localization in complex scenes fully exploits the global modeling capability of ViT and the task adaptability of the visual prompt tuning mechanism. By integrating the convolutional block attention mechanism, it effectively enhances the accuracy and robustness of tree counting and localization in complex environments.
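
      The attention and decoder components from Method, and the density-map counting reported in Result, can likewise be sketched in PyTorch under stated assumptions: the CBAM block follows the standard channel-plus-spatial attention formulation, and the decoder mirrors the conv → norm → GELU → upsample recipe from the abstract. Module names, channel widths, and the GroupNorm choice are illustrative, not the published ViT-Count code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBAM(nn.Module):
    """Channel attention followed by spatial attention (CBAM-style block)."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size,
                                 padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel attention from global average- and max-pooled descriptors.
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca
        # Spatial attention from channel-wise mean and max maps.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True),
             x.max(dim=1, keepdim=True).values], dim=1)))
        return x * sa


class DensityDecoder(nn.Module):
    """Decoder head: conv -> GroupNorm -> GELU -> 2x upsample, repeated."""

    def __init__(self, in_channels=768, widths=(256, 128, 64)):
        super().__init__()
        layers, c = [], in_channels
        for w in widths:
            layers += [nn.Conv2d(c, w, 3, padding=1),
                       nn.GroupNorm(8, w),
                       nn.GELU(),
                       nn.Upsample(scale_factor=2, mode="bilinear",
                                   align_corners=False)]
            c = w
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv2d(c, 1, 1)      # single-channel tree density map

    def forward(self, feats):               # feats: (B, C, H/16, W/16)
        return F.relu(self.head(self.body(feats)))


if __name__ == "__main__":
    feats = torch.randn(2, 768, 14, 14)     # e.g. reshaped ViT patch tokens
    density = DensityDecoder()(CBAM(768)(feats))
    counts = density.sum(dim=(1, 2, 3))     # count = integral of the density map
    # MAE / RMSE over a batch would compare these counts with ground truth, e.g.
    # mae = (counts - gt).abs().mean(); rmse = ((counts - gt) ** 2).mean().sqrt()
    print(density.shape, counts)
```

      Summing the density map yields the per-image count evaluated with MAE and RMSE; localization would additionally require extracting local maxima (or applying non-maximum suppression) on the predicted map.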

       
