An attention-calibrated MAPPO algorithm for credit assignment and training stability
王琳,陈雯柏,吴双双,李云飞
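Abstract:
Cooperative multi-agent reinforcement learning (MARL) in complex, partially observable environments is prone to coarse credit assignment and unstable training, which limits the engineering application of multi-agent proximal policy optimization (MAPPO). To address these problems, an attention-calibrated MAPPO (AC-MAPPO) algorithm is proposed. On the policy side, a unified cross-scale gated calibration convolutional encoding module is introduced to reweight local observations at multiple scales along the channel and temporal dimensions. On the value side, a dual-channel gated attention module is constructed to perform explicit credit assignment at two levels, channel and entity. Experiments on three typical scenarios of SMAC (StarCraft multi-agent challenge) show that AC-MAPPO improves the final win rate over MAPPO by about 6.26% on average and significantly reduces the variance of the learning curves. Compared with baselines including IPPO (independent proximal policy optimization), QMix (Q-value mixing network), and the parameter-matched MAPPOMLP (MAPPO-multi-layer perceptron), AC-MAPPO shows consistent advantages in sample efficiency and convergence stability, indicating that lightweight structural enhancement within the existing MAPPO framework is effective and feasible.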
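The abstract names two levels of reweighting on the value side, channel and entity, but does not give the module's equations. The sketch below is only an illustration of that general idea, not the authors' AC-MAPPO module: it assumes a sigmoid gate per feature channel followed by softmax attention over entities, and it omits all learned parameters. The function name `two_level_gated_pool` and the toy inputs are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def two_level_gated_pool(entities, query):
    """Hypothetical two-level reweighting (assumption, not the paper's module).

    entities: list of equal-length feature vectors, one per entity.
    query:    vector with the same length as each entity vector.
    """
    n_channels = len(query)

    # Channel level: squeeze each channel to its mean over entities,
    # then map it to a gate in (0, 1) and rescale every entity with it.
    channel_means = [
        sum(e[c] for e in entities) / len(entities) for c in range(n_channels)
    ]
    channel_gates = [sigmoid(m) for m in channel_means]
    gated = [[g * x for g, x in zip(channel_gates, e)] for e in entities]

    # Entity level: scaled dot-product scores against the query,
    # normalised with softmax so the entity weights sum to 1.
    scale = math.sqrt(n_channels)
    scores = [sum(q * x for q, x in zip(query, e)) / scale for e in gated]
    entity_weights = softmax(scores)

    # Weighted sum of the gated entity features.
    pooled = [
        sum(w * e[c] for w, e in zip(entity_weights, gated))
        for c in range(n_channels)
    ]
    return channel_gates, entity_weights, pooled


# Toy input: three entities with three feature channels each.
entities = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [0.0, 0.0, 0.0]]
query = [1.0, 1.0, 1.0]
gates, weights, pooled = two_level_gated_pool(entities, query)
```

In this toy setup the channel gates and entity weights can be read directly as per-channel and per-entity contributions, which is the sense in which such a structure makes credit assignment explicit. A trained critic would replace the mean-based gate and the fixed query with learned projections.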
Keywords: multi-agent reinforcement learning; proximal policy optimization; calibrated convolution; attention mechanism; credit assignment
Foundation: Beijing Natural Science Foundation-Xiaomi Innovation Joint Fund (L233006)
DOI: 10.16508/j.cnki.11-5866/n.2026.02.001