Article details

An attention-calibrated MAPPO algorithm for credit assignment and training stability

WANG Lin (王琳), CHEN Wenbai (陈雯柏), WU Shuangshuang (吴双双), LI Yunfei (李云飞)

1: School of Automation, Beijing Information Science and Technology University

Abstract:

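Cooperative multi-agent reinforcement learning (MARL) in complex, partially observable environments is prone to coarse credit assignment and unstable training, which limits the engineering application of multi-agent proximal policy optimization (MAPPO). To address these problems, an attention-calibrated MAPPO (AC-MAPPO) algorithm is proposed. On the policy side, a unified cross-scale gated calibration convolutional encoding module is introduced to reweight local observations at multiple scales along the channel and temporal dimensions. On the value side, a dual-channel gated attention module is constructed to realize explicit credit assignment at two levels, channel and entity. Experiments on three typical scenarios of the StarCraft multi-agent challenge (SMAC) show that AC-MAPPO improves the final win rate over MAPPO by about 6.26% on average and significantly reduces the variance of the learning curves. Compared with baselines including independent proximal policy optimization (IPPO), QMix (Q-value mixing network), and a parameter-matched MAPPO-MLP (MAPPO with a multi-layer perceptron), AC-MAPPO shows consistent advantages in sample efficiency and convergence stability, indicating that lightweight structural enhancement within the existing MAPPO framework is effective and feasible.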
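The abstract does not give the equations of the value-side module, so the following is only an illustrative sketch of what "two-level" gating over entities and channels could look like, not the authors' implementation. It assumes scaled dot-product attention for the entity level and a per-channel sigmoid gate for the channel level; the function name `gated_entity_attention` and the parameters `query` and `gate_weights` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_entity_attention(entities, query, gate_weights):
    """Hypothetical two-level weighting (an assumption, not the paper's module).

    entities:     list of per-entity feature vectors, all of length `dim`
    query:        vector of length `dim` used to score entities
    gate_weights: vector of length `dim`, one gate parameter per channel
    """
    dim = len(query)
    # Entity level: scaled dot-product scores, normalized with softmax.
    scores = [
        sum(q * e for q, e in zip(query, ent)) / math.sqrt(dim)
        for ent in entities
    ]
    entity_weights = softmax(scores)
    # Pool entity features with the attention weights.
    pooled = [
        sum(w * ent[c] for w, ent in zip(entity_weights, entities))
        for c in range(dim)
    ]
    # Channel level: a sigmoid gate in (0, 1) rescales each pooled channel.
    channel_gates = [sigmoid(g * p) for g, p in zip(gate_weights, pooled)]
    gated = [g * p for g, p in zip(channel_gates, pooled)]
    return entity_weights, channel_gates, gated


entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
entity_weights, channel_gates, gated = gated_entity_attention(
    entities, query=[1.0, 0.0], gate_weights=[1.0, 1.0]
)
```

In this toy input, the entities whose first channel matches the query receive larger attention weights, and those weights can be read as a per-entity credit signal. A real critic would learn the query and gate parameters and operate on batched tensors.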

Keywords: multi-agent reinforcement learning; proximal policy optimization; calibrated convolution; attention mechanism; credit assignment

Foundation: Beijing Natural Science Foundation-Xiaomi Innovation Joint Fund Project (L233006)

DOI: 10.16508/j.cnki.11-5866/n.2026.02.001

Journal of Beijing Information Science and Technology University (Natural Science Edition)

2026, Vol. 41, No. 170 (Issue 02), pp. 1-13+34
