A Review of Developments in Reinforcement Learning for Multi-robot Systems
doi: 10.3969/j.issn.0258-2724.2014.06.015
Abstract: Reinforcement learning (RL) is an effective means for multi-robot systems to adapt to complex and uncertain environments, and is regarded as one of the core technologies for designing intelligent systems. Starting from the basic ideas and theoretical framework of RL, this review addresses the inherent difficulties of partial observability, computational complexity, and convergence, and summarizes the research progress and open problems of multi-robot RL around the key issues of communication, policy negotiation, credit assignment, and interpretability during learning. Applications of RL to robot path planning and obstacle avoidance, unmanned aerial vehicles, robot soccer, and the multi-robot pursuit-evasion problem are then introduced. Finally, several frontier directions and development trends of multi-robot RL, such as qualitative RL, fractal RL, and information-fusion-based RL, are pointed out.
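As context for the theoretical framework the abstract refers to, the update rule at the heart of much of the surveyed work (tabular Q-learning, in the temporal-difference family) can be stated compactly. The sketch below is illustrative only, assuming a generic discrete single-agent environment with a small interface (env.reset, env.step, env.actions) invented for the example; it is not code from the surveyed papers.

```python
# A minimal sketch of tabular Q-learning; the environment interface
# (env.reset, env.step, env.actions) and all parameter values are
# illustrative assumptions, not taken from the survey.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Learn a state-action value table Q(s, a) by temporal-difference updates."""
    Q = defaultdict(float)  # Q[(state, action)] defaults to 0.0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy exploration over the environment's action set.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state

    return Q
```

In the multi-robot setting surveyed here, each robot typically maintains such a value table (or a function approximation of it), and the difficulties the review highlights, partial observability, credit assignment, and convergence, arise largely because the other learning robots make the environment non-stationary from each individual learner's point of view.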