Reinforcement learning-based optimization algorithm for energy management and path planning of robot chassis

    • Abstract: To address the shortened battery life and low battery utilization efficiency caused by ignoring ground roughness in conventional path planning for greenhouse robot chassis, this study investigated three reinforcement learning algorithms that integrate battery energy management with path planning. First, a graded pre-scoring reward model was constructed from prior knowledge, and the Manhattan distance was added to the reward function to improve battery life and utilization. Second, to overcome the low convergence efficiency and susceptibility to local optima of the traditional Q-Learning (QL) algorithm, an adaptive variable-step-size optimization algorithm (adaptive multi-step Q-learning, AMQL) and an adaptive exploration-rate optimization algorithm (adaptive ε-greedy Q-learning, AEQL) were proposed to improve the performance of Q-Learning. In addition, to further improve feasibility, the AMQL and AEQL algorithms were fused into an adaptive multi-step and variable ε-greedy algorithm (adaptive multi-step and ε-greedy Q-learning, AMEQL), and the performance of AMQL and AMEQL relative to the traditional QL algorithm was verified through comparative simulations in three different ridge-aisle environments. The simulation results showed that, compared with the traditional QL algorithm, AMQL reduced the average training time by 23.74%, the average number of iterations to convergence by 8.82%, the average number of path turning points by 54.29%, and the average number of post-convergence fluctuations by 14.54%; AMEQL reduced the average training time by 34.46%, the average number of iterations to convergence by 18.02%, the average number of path turning points by 63.13%, and the average number of post-convergence fluctuations by 15.62%. Over 400 iterations, AMEQL fluctuated on average once every 7.12 iterations after reaching the maximum reward, whereas AMQL fluctuated once every 6.68 iterations. Thus AMEQL achieved the shortest training time, the fastest convergence, the fewest path turning points, and the smallest reward fluctuation, with AMQL second. The proposed algorithm can provide a theoretical reference for the autonomous path planning of robot chassis.
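
The abstract does not give the grading thresholds, penalty levels, or weights of the reward model, so the following minimal Python sketch only illustrates, under assumed placeholder values, how a graded roughness pre-score and a Manhattan-distance shaping term could be combined into a single reward (here roughness_map is assumed to be a dict mapping a grid cell to a roughness value in [0, 1]):

def manhattan(p, q):
    # Manhattan distance between two grid cells (row, col).
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def graded_roughness_score(roughness, thresholds=(0.2, 0.5, 0.8),
                           penalties=(0.0, -1.0, -2.0, -4.0)):
    # Graded pre-score of a cell from prior roughness knowledge.
    # Thresholds and penalty levels are illustrative placeholders,
    # not the values used in the paper.
    level = sum(roughness > t for t in thresholds)
    return penalties[level]

def reward(state, next_state, goal, roughness_map,
           w_rough=1.0, w_dist=0.1, goal_bonus=10.0):
    # Hypothetical reward: graded roughness penalty plus a Manhattan-
    # distance shaping term that rewards progress toward the goal.
    if next_state == goal:
        return goal_bonus
    r_rough = graded_roughness_score(roughness_map[next_state])
    r_dist = manhattan(state, goal) - manhattan(next_state, goal)
    return w_rough * r_rough + w_dist * r_dist

In this sketch, w_rough, w_dist, and goal_bonus are hypothetical constants; the shaping term rewards moves that shorten the Manhattan distance to the target, which is one way of tying travel distance to battery consumption as described above.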


      Abstract: Ground roughness can significantly affect battery performance in greenhouse environments. In this study, battery energy management was integrated with path planning to address this challenge, and a systematic investigation was carried out on how ground roughness affects the battery life and utilization efficiency of greenhouse vehicle platforms. A graded pre-scoring model was constructed from prior knowledge. In addition, the Manhattan distance between the vehicle's current position and the target point was incorporated into the reinforcement learning reward function, linking travel distance with battery life so that both battery utilization efficiency and battery life were optimized during path planning. Because traditional Q-learning suffers from long iteration times, low convergence efficiency, susceptibility to local optima, and excessive path turns, an Adaptive Multi-step Q-learning algorithm (AMQL) with an adaptive step size and an Adaptive ε-greedy Q-learning algorithm (AEQL) with an adaptive exploration rate were proposed to enhance its performance. AMQL adjusted the step size according to a forward reward assessment: if the reward at the current position increased relative to the previous reward, the step size was increased, and as the current position approached the endpoint, the step size was gradually decreased to avoid suboptimal paths. AEQL adaptively adjusted the exploration rate ε using the difference between adjacent reward values: ε was increased when the adjacent reward value increased and decreased when it decreased. Although AMQL improved convergence efficiency and iteration speed, the variation in step size caused significant reward fluctuations and thus lower stability, and the multi-step length itself had no outstanding impact on convergence efficiency or iteration speed. AEQL enhanced exploration efficiency and stability through dynamic adjustment, but its fluctuating rise during the initial training phase lengthened the training time. Therefore, AMQL and AEQL were combined into an Adaptive Multi-step and ε-greedy Q-learning algorithm (AMEQL) to ensure faster and more nearly optimal global path selection during path planning. In the simulation, a realistic greenhouse tomato scenario was first modeled; an Inertial Measurement Unit (IMU) then recorded aisle roughness changes in real time, and these data were incorporated into the simulation model. Finally, 300 rounds of simulation experiments were conducted to test the traditional Q-learning, AMQL, and AMEQL algorithms for path planning in single-row (30 m×20 m), double-row (50 m×50 m), and triple-row (70 m×50 m) environments. The simulation results showed that, compared with traditional Q-learning, AMEQL reduced the average training time by 44.10%, the average number of iterations required for convergence by 11.06%, the number of path turns by 63.13%, and the average post-convergence fluctuation by 15.62%. Owing to its higher convergence speed over 400 iterations, AMEQL averaged 14 fluctuations per 100 iterations after reaching the maximum reward, while AMQL averaged 15.
This algorithm can provide a theoretical reference for the autonomous path planning of greenhouse platforms.
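
The abstract describes the adaptation rules only qualitatively, so the sketch below is one possible Python formulation of the fused AMEQL idea under assumed constants (the learning rate, ε bounds, step-size bounds, increment sizes, and the n-step backup are placeholders, not the paper's exact algorithm):

import random
from collections import defaultdict

class AMEQLAgent:
    # Minimal sketch of an adaptive multi-step, adaptive ε-greedy Q-learning
    # agent: the step length grows while rewards improve and shrinks near the
    # goal; ε rises when the adjacent reward difference is positive and falls
    # otherwise. All constants are illustrative assumptions.

    def __init__(self, actions, alpha=0.1, gamma=0.9,
                 eps_min=0.05, eps_max=0.5, n_min=1, n_max=4):
        self.Q = defaultdict(float)          # Q[(state, action)]
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        self.eps, self.eps_min, self.eps_max = eps_max, eps_min, eps_max
        self.n, self.n_min, self.n_max = n_min, n_min, n_max

    def choose(self, state):
        # ε-greedy action selection with the current adaptive ε.
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def adapt(self, prev_reward, reward, dist_to_goal):
        # AMQL part: lengthen the step while rewards improve, shorten it
        # when the endpoint is close to avoid overshooting the optimum.
        if reward > prev_reward:
            self.n = min(self.n + 1, self.n_max)
        if dist_to_goal <= self.n:
            self.n = max(self.n - 1, self.n_min)
        # AEQL part: raise ε when the adjacent reward difference is positive,
        # lower it otherwise, clipped to [eps_min, eps_max].
        delta = reward - prev_reward
        self.eps += 0.05 if delta > 0 else -0.05
        self.eps = min(max(self.eps, self.eps_min), self.eps_max)

    def update(self, trajectory):
        # n-step backup: trajectory is a list of (state, action, reward)
        # tuples for the last n transitions, followed by the final state.
        *steps, final_state = trajectory
        G = max(self.Q[(final_state, a)] for a in self.actions)
        for state, action, reward in reversed(steps):
            G = reward + self.gamma * G
        s0, a0, _ = steps[0]
        self.Q[(s0, a0)] += self.alpha * (G - self.Q[(s0, a0)])

In a grid-world simulation, the agent would call choose() at each step, buffer the last n transitions, pass that buffer to update(), and call adapt() with the previous and current rewards and the remaining Manhattan distance to the goal, so that both the step length and the exploration rate track the reward trend as described above.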
