Detecting the key points of tractor drivers under complex environments using improved YOLO-Pose
Graphical Abstract
Abstract
Missed and false detection of key points pose a great challenge to tractor driver recognition, owing to the lighting, background, and occlusion found in the complex operating environment of farmland. In this study, a joint driver and key point detection method was proposed based on an improved YOLO-Pose. Firstly, a Swin Transformer encoder was introduced into the top-layer C3 module of the CSPDarkNet53 backbone, with the window size set to 8 and the number of self-attention heads set to 16. The encoder applied shifted-window multi-head self-attention (SW-MSA) to learn cross-window interactions, while a masking mechanism isolated invalid information exchange between pixels belonging to non-adjacent regions of the original feature map. Compared with the conventional ViT architecture, this design performed better on dense prediction and high-resolution vision tasks, enabling the improved model to capture global dependencies with high computational efficiency; the stronger global modelling capability in turn improved key point detection under occlusion. Secondly, RepGFPN, an efficient layer aggregation network with skip connections and cross-scale connectivity, was adopted as the neck, and a P6 detection layer was added to the multi-scale outputs of the backbone. The CspStage module combined re-parameterization with layer-aggregation connectivity to fuse high-level semantic information and low-level spatial information, thereby strengthening the multi-scale detection capability of the model. Thirdly, pyramid convolution with a four-level pyramid structure was introduced to replace the standard 3×3 convolution, in order to further optimize the neck. The convolution kernels, enlarged level by level from the bottom up, adaptively adjusted the receptive field, so that the number of model parameters was reduced while feature information at different levels was effectively captured. Finally, the decoupled key point head was optimized by embedding a coordinate attention mechanism, which encodes horizontal and vertical position information into the channel attention. The network thus acquired cross-channel information together with direction-aware and position-sensitive cues, capturing the positional relationships between key points more accurately during prediction and yielding high key point detection accuracy in complex environments. The experimental results show that the improved model achieved a mean average precision (mAP0.5) of 89.59% at an OKS (object keypoint similarity) threshold of 0.5, and an mAP0.5:0.95 (averaged over OKS thresholds of 0.5, 0.55, ..., 0.95) of 62.58%, which were 4.24 and 4.15 percentage points higher than those of the baseline model, respectively, with an average detection time of 21.9 ms per image. Furthermore, compared with the current mainstream key point detection networks Hourglass, HRNet-W32, and DEKR, the mAP0.5 was improved by 7.94, 5.27, and 2.66 percentage points, and the model size was reduced by 257.5, 8.2, and 9.3 MB, respectively. The improved model delivered high detection accuracy and inference speed in complex scenes, especially when the driver was occluded by the driver's own body or by other objects.
The findings can provide a strong theoretical basis for driver behavior recognition and state monitoring in the farmland operation environment.
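The shifted-window masking described in the abstract can be illustrated with a short sketch. The following Python (PyTorch) code builds the SW-MSA attention mask that blocks interaction between pixels which end up in the same window only because of the cyclic shift; it follows the publicly known Swin Transformer construction rather than the exact implementation of this study, and the 32×32 feature map, window size of 8, and shift of 4 in the usage line are illustrative assumptions.

import torch

def swin_attention_mask(feat_size: int, window: int, shift: int) -> torch.Tensor:
    """Build the SW-MSA attention mask for one shifted-window pass.

    Pixels that share a window after the cyclic shift but come from
    non-adjacent regions of the original feature map receive -inf scores,
    so the softmax assigns them zero attention weight.
    """
    # Label each position by the region it belongs to after the shift.
    img = torch.zeros(1, feat_size, feat_size, 1)
    slices = (slice(0, -window), slice(-window, -shift), slice(-shift, None))
    region = 0
    for hs in slices:
        for ws in slices:
            img[:, hs, ws, :] = region
            region += 1
    # Partition the label map into non-overlapping windows.
    win = img.view(1, feat_size // window, window, feat_size // window, window, 1)
    win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window)
    # Positions from different regions must not attend to each other.
    diff = win.unsqueeze(1) - win.unsqueeze(2)
    mask = torch.zeros_like(diff).masked_fill(diff != 0, float('-inf'))
    return mask  # (num_windows, window*window, window*window)

For a 32×32 feature map with window size 8 and shift 4, swin_attention_mask(32, 8, 4) returns a mask of shape (16, 64, 64) that is added to the attention logits of each of the 16 windows.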
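The four-level pyramid convolution that replaces the standard 3×3 convolution in the neck can be sketched as parallel branches whose kernel sizes grow level by level. The kernel sizes (3, 5, 7, 9) and the group counts used here are assumptions for illustration, not the exact configuration of the improved model.

import torch
import torch.nn as nn

class PyConv4(nn.Module):
    """Four-level pyramid convolution: parallel branches with kernel sizes
    growing from the bottom level up (3, 5, 7, 9). Larger kernels use more
    groups, so the parameter count stays below that of a dense large-kernel
    convolution while the receptive field varies across branches."""
    def __init__(self, in_ch: int, out_ch: int,
                 kernels=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        assert out_ch % len(kernels) == 0
        branch_ch = out_ch // len(kernels)
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2,
                      groups=g, bias=False)
            for k, g in zip(kernels, groups)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees the full input but contributes a channel slice.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

As a usage example, PyConv4(64, 64)(torch.randn(1, 64, 80, 80)) produces a (1, 64, 80, 80) tensor; the input and per-branch channel counts must be divisible by the corresponding group sizes for the grouped convolutions to be valid.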
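The coordinate attention embedded in the decoupled key point head factorizes pooling along the height and width directions and injects the resulting position-aware weights back into the channels. The sketch below follows the generic coordinate attention formulation; the reduction ratio, the 17-key-point output layout, and the way the block is attached ahead of the key point regression convolution are illustrative assumptions rather than the paper's exact head design.

import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention: two 1-D poolings along height and width encode
    direction-aware, position-sensitive information into channel weights."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                       # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)   # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w

class KeypointHead(nn.Module):
    """Toy decoupled key point branch: conv -> coordinate attention ->
    1x1 regression conv producing (x, y, confidence) per key point."""
    def __init__(self, channels: int, num_keypoints: int = 17):
        super().__init__()
        self.stem = nn.Conv2d(channels, channels, 3, padding=1)
        self.att = CoordAtt(channels)
        self.reg = nn.Conv2d(channels, num_keypoints * 3, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.reg(self.att(torch.relu(self.stem(x))))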
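The evaluation metric reported in the abstract is the OKS-thresholded mean average precision. Below is a minimal NumPy sketch of the object keypoint similarity between one predicted pose and one ground-truth pose; the per-key-point falloff constants k would be dataset-specific for a tractor-driver key point set, so any concrete values are assumptions.

import numpy as np

def object_keypoint_similarity(pred, gt, visibility, area, k):
    """OKS = mean over labelled key points of exp(-d_i^2 / (2 * s^2 * k_i^2)),
    where d_i is the prediction error of key point i, s^2 is the object
    area, and k_i is a per-key-point falloff constant.

    pred, gt   : (K, 2) arrays of (x, y) coordinates
    visibility : (K,) array, > 0 for labelled key points
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)               # squared pixel errors
    denom = 2.0 * area * (k ** 2) + np.finfo(float).eps
    labelled = visibility > 0
    if not labelled.any():
        return 0.0
    return float(np.mean(np.exp(-d2[labelled] / denom[labelled])))

# mAP0.5 counts a predicted pose as correct when OKS >= 0.5;
# mAP0.5:0.95 averages the AP over thresholds 0.50, 0.55, ..., 0.95.
oks_thresholds = np.arange(0.50, 1.00, 0.05)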