Abstract:
Cherry tomatoes are a small tomato variety with a fruit size of no more than 2.5 cm, and they mostly grow in bunches. The bunches also grow in variable postures. These growth conditions pose a great challenge to harvesting robots that pick at a fixed angle. When a robot performs single-fruit harvesting, the stems often interfere with the end-effector, resulting in low picking efficiency. This may be one reason why picking robots have not yet moved towards commercialization. Moreover, not all fruits in a tomato bunch grow and ripen simultaneously. Ripe fruits must be picked on time to ensure a fresh taste and high economic profit. A robotic vision system is therefore required to identify fruit ripeness rapidly and accurately. In this study, a cascaded vision detection approach was proposed for robotic harvesting of single tomatoes from bunches. The processing procedure covered three key aspects: detection of the harvesting target, determination of target maturity, and estimation of the fruit-stalk position relationship. Firstly, the YOLOv5 target detection model was introduced to detect the tomato fruits and bunches. The tomato fruits were labelled into four categories according to agronomic growing and harvesting requirements: green, turning, ripe, and fully ripe. This differs from the simpler ripeness classifications used previously. Among them, ripe and fully ripe fruits were targeted for robotic harvesting. Because the visual features used for ripeness determination and target detection overlap, the original YOLOv5 was improved with multi-task learning to determine ripeness as well. Owing to the structure of the greenhouse facility, the robot was confined to picking only the tomatoes on both sides of the culture rack. Targets beyond the execution range of the robot were therefore filtered out during target detection. 
The distance between the culture racks was set as 1.55 m in this case. The region of interest (ROI) of the target fruit was approximated as an ellipsoid with an equatorial diameter and a polar diameter of approximately 2.5 cm, and the pinhole camera model was used to calculate the size of the ROI in the image over the picking range. Specifically, the tomatoes growing on culture racks outside the working range of the robot were mostly smaller than 10 pixel×10 pixel in the 640 pixel×640 pixel RGB image. Accordingly, a large number of feature layers were cropped, and such targets were left unlabeled in the annotation stage. This reduced the labeling cost and gave better performance, because out-of-range targets were filtered without being detected at all. This end-to-end approach required no post-processing and was much more adaptable to real scenarios than the traditional approach of filtering targets with a preset threshold. The field experiments showed that interference between the fruit stalk and the end-effector was a major cause of picking failure or low efficiency. The picking angle was therefore one of the most important parameters of the harvesting action. After the targets to be picked were screened, each rectangular detection box was enlarged by 10% in length and width, in order to include peripheral information such as pedicels and calyces. The expanded image block was then input into the MobileNetV3 network model to evaluate the relative position of the target fruit and the fruit stalk. The result was provided to the end-effector, so that it could adjust its picking pose according to the pose of the bunch, approach the fruit from a direction favorable for picking, and then perform the picking action. A harvesting robot system was also built, consisting of a depth camera, a four-degree-of-freedom robot arm, a chassis, and a negative-pressure end-effector. 
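The two geometric steps described above, the pinhole-model size estimate and the 10% enlargement of the detection box, can be illustrated with a minimal sketch. The function names and the focal length of 600 pixels are hypothetical assumptions chosen for illustration; the camera parameters of this study are not given here. The enlargement is interpreted as growing the width and height by 10% about the box centre, with clipping to the image border added as a further assumption.

```python
def projected_size_px(diameter_m, depth_m, focal_px):
    """Pinhole camera model: image size in pixels of an object with the
    given diameter (m), seen at the given depth (m).
    focal_px is the focal length expressed in pixels (assumed value)."""
    if depth_m <= 0:
        raise ValueError("depth_m must be positive")
    return focal_px * diameter_m / depth_m


def expand_box(box, scale=0.10, img_w=640, img_h=640):
    """Enlarge an (x1, y1, x2, y2) box by `scale` in width and height
    about its centre, then clip it to the image (clipping is an assumption)."""
    x1, y1, x2, y2 = box
    dw = (x2 - x1) * scale / 2.0  # half of the extra width goes to each side
    dh = (y2 - y1) * scale / 2.0  # half of the extra height goes to each side
    return (
        max(0.0, x1 - dw),
        max(0.0, y1 - dh),
        min(float(img_w), x2 + dw),
        min(float(img_h), y2 + dh),
    )
```

Under the assumed focal length of 600 pixels, a 2.5 cm fruit at a depth of 1.5 m projects to about 10 pixels, which is the order of magnitude of the size cutoff quoted above; a different camera would shift the corresponding distance.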
The harvesting system was tested in greenhouses at different times of the year, covering object detection, prediction of the position relationship between fruits and stalks, and fruit harvesting. The results showed that the average detection accuracy for cherry tomato bunches and fruits of different ripeness reached 89.9% at an intersection over union (IoU) threshold of 0.5. The average inference time of the cascade detection system was 22 ms. Furthermore, the harvesting efficiency was improved by 28.7 percentage points compared with picking at a fixed angle. The average harvesting time was 10.4 s per fruit, indicating the better performance of the improved system. This finding can also provide a strong reference for fruit and vegetable harvesting robots.
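For readers unfamiliar with the IoU criterion used in the reported accuracy, the following minimal sketch shows how the overlap between a predicted box and a ground-truth box is commonly computed and compared against the 0.5 threshold. It is a generic illustration with hypothetical function names, not the evaluation code of this study.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, true_box, threshold=0.5):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return iou(pred_box, true_box) >= threshold
```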