• ISSN 0258-2724
  • CN 51-1277/U
  • Indexed by EI Compendex
  • Indexed by Scopus
  • National Chinese Core Journal
  • Statistical Source Journal of Chinese Science and Technology Papers
  • Source Journal of the Chinese Science Citation Database (CSCD)

A Survey of Human-Object Interaction Detection

GONG Xun, ZHANG Zhiying, LIU Lu, MA Bing, WU Kunlun

Citation: GONG Xun, ZHANG Zhiying, LIU Lu, MA Bing, WU Kunlun. A Survey of Human-Object Interaction Detection[J]. Journal of Southwest Jiaotong University, 2022, 57(4): 693-704. doi: 10.3969/j.issn.0258-2724.20210339

doi: 10.3969/j.issn.0258-2724.20210339
Funding: National Natural Science Foundation of China (61876158); the Fundamental Research Funds for the Central Universities (2682021ZTPY030)
    About the author:

    GONG Xun (1980—), male, professor and doctoral supervisor; research interests: computer vision and pattern recognition. E-mail: xgong@swjtu.edu.cn

  • CLC number: TP391


• Abstract:

    Lying at the intersection of object detection, action recognition, and visual relationship detection, human-object interaction (HOI) detection aims to identify the relationships between humans and objects in specific application scenarios. This paper systematically summarizes and discusses research on image-based HOI detection. First, starting from how interactions are modeled, HOI detection methods are divided into two categories, global-instance-based and local-instance-based, and representative methods of each are described and analyzed in detail. Global-instance-based methods are further subdivided according to the visual features they fuse: spatial location information, appearance information, and human pose information. Next, the applications of zero-shot learning, weakly supervised learning, and Transformer models to HOI detection are discussed. Finally, the challenges facing HOI detection are summarized from three aspects, namely interaction categories, visual interference, and moving viewpoints, and domain generalization, real-time detection, and end-to-end networks are identified as future trends.
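
A minimal sketch may help pin down the survey's basic unit of prediction, the <human, verb, object> triplet. The Python below is illustrative only (none of these names come from the surveyed methods); it shows the typical shape of one detection and of a detector's interface:

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class HOITriplet:
    """One detected <human, verb, object> relation, e.g. <person, ride, bicycle>."""
    human_box: Box      # bounding box of the person
    object_box: Box     # bounding box of the interacted object
    object_label: str   # e.g. "bicycle"
    verb: str           # e.g. "ride"
    score: float        # confidence of the whole triplet

def detect_hoi(image) -> List[HOITriplet]:
    """Hypothetical detector interface: locate humans and objects,
    pair them, then classify the interaction of each candidate pair."""
    raise NotImplementedError
```

Both families of methods surveyed here produce this kind of output; they differ in whether the interaction is inferred from whole-instance (global) features or from local cues such as body parts.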

     

• Figure 1.  Flowchart of HOI detection

    Figure 2.  Relative spatial relationships in HOI

    Figure 3.  HO-RCNN network based on human-object region location information

    Figure 4.  Framework of the spatial priming model

    Figure 5.  Inference time, mAP, and speed of PPDM and similar methods on the HICO-DET dataset

    Figure 6.  VSP-GMN network integrating visual, semantic, and pose features

    Figure 7.  Comparison of overall human pose and local features

    Figure 8.  Local annotations based on instance behavior

    Figure 9.  Flowchart of zero-shot object detection

Table 1.  HOI image datasets

    Dataset              | Publisher                             | Images  | Action classes | Annotation type
    HAKE (2020) [13]     | Shanghai Jiao Tong University         | 118 000 | 156            | bounding boxes + semantic descriptions
    HOI-A (2020) [14]    | Beihang University                    | 17 606  | 10             | bounding boxes + semantic descriptions
    HICO-DET (2018) [12] | University of Michigan, Ann Arbor     | 48 000  | 117            | bounding boxes + semantic descriptions
    HCVRD (2018) [15]    | University of Adelaide, Australia     | 788 160 | 9 852          | bounding boxes + semantic descriptions
    V-COCO (2015) [11]   | University of California, Berkeley    | 10 346  | 80             | bounding boxes
    HICO (2015) [9]      | University of Michigan, Ann Arbor     | 47 774  | 600            | semantic descriptions
    MPII (2014) [10]     | Max Planck Institute for Informatics  | 40 522  | 410            | 2D pose
    TUHOI (2014) [8]     | University of Trento, Povo, Italy     | 10 805  | 2 974          | semantic descriptions

Table 2.  Comparison of mAP of visual-feature-based methods on the HICO-DET dataset

    Source    | Year | Method                               | Features  | mAP (Default)/%: Full / Rare / Non-Rare | mAP (Known Object)/%: Full / Rare / Non-Rare
    Ref. [46] | 2021 | GGNet                                | A         | 29.17 / 22.13 / 30.84                   | 33.50 / 26.67 / 34.89
    Ref. [14] | 2020 | PPDM                                 | A         | 21.73 / 13.78 / 24.10                   | 24.58 / 16.65 / 26.84
    Ref. [17] | 2020 | VS-GATs + PMN                        | A + S + P | 21.21 / 17.60 / 22.29                   | n/a
    Ref. [45] | 2021 | Multi-level pairwise feature network | A + S + P | 20.05 / 16.66 / 21.07                   | 24.01 / 21.09 / 24.89
    Ref. [34] | 2019 | PMFNet                               | A + S + P | 17.46 / 15.65 / 18.00                   | 20.34 / 17.47 / 21.20
    Ref. [31] | 2019 | TIN                                  | A + S + P | 17.22 / 13.51 / 18.32                   | 19.38 / 15.38 / 20.57
    Ref. [20] | 2019 | Multi-branch network                 | A + S     | 16.24 / 11.16 / 17.75                   | 17.73 / 12.78 / 19.21
    Ref. [19] | 2018 | iCAN                                 | A + S     | 14.84 / 10.45 / 16.15                   | 16.26 / 11.33 / 17.73
    Ref. [12] | 2018 | HO-RCNN                              | A + S     |  7.81 /  5.37 /  8.54                   | 10.41 /  8.94 / 10.85

    Note: A, S, and P denote appearance, spatial, and pose (skeleton) features, respectively; Full, Rare, and Non-Rare denote the full, rare, and non-rare category sets, respectively.
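
The Default and Known Object columns are HICO-DET's two evaluation settings: Default computes AP over all test images, whereas Known Object evaluates each HOI class only on images known to contain the target object category, which is why its scores are consistently higher. In both settings, a prediction counts as a true positive only when its interaction class matches a ground-truth triplet and both of its boxes overlap that triplet with IoU at or above 0.5. A minimal sketch of this matching rule (helper names here are illustrative):

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thr=0.5):
    """A prediction matches when the interaction agrees and BOTH the
    human and object boxes reach the IoU threshold."""
    return (pred["verb"] == gt["verb"]
            and box_iou(pred["human_box"], gt["human_box"]) >= thr
            and box_iou(pred["object_box"], gt["object_box"]) >= thr)
```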

Table 3.  Comparison of results of visual-feature-based methods on the V-COCO dataset

    Source    | Year | Method                               | Features  | AP/%
    Ref. [46] | 2021 | GGNet-Hourglass                      | A         | 54.70
    Ref. [45] | 2021 | Multi-level pairwise feature network | A + S + P | 52.80
    Ref. [17] | 2020 | VS-GATs                              | A + S + P | 49.80
    Ref. [34] | 2019 | PMFNet                               | A + S + P | 52.00
    Ref. [31] | 2019 | TIN                                  | A + S + P | 48.70
    Ref. [41] | 2019 | Body-part                            | A + P     | 47.53
    Ref. [20] | 2019 | Multi-branch network                 | A + S     | 47.30
    Ref. [19] | 2018 | iCAN                                 | A + S     | 45.30
    Ref. [24] | 2017 | InteractNet                          | A         | 40.00

Table 4.  Summary of other new techniques

    Source    | Year | Dataset                          | Method                       | Summary of main work                                                                                                        | mAP/%
    Ref. [57] | 2021 | V-COCO                           | QPIC                         | Uses an attention mechanism to aggregate features effectively for detecting diverse HOI classes                            | 58.80
    Ref. [56] | 2021 | HICO-DET                         | HOTR                         | First prediction framework built on a Transformer encoder-decoder structure                                                | 25.10
    Ref. [54] | 2020 | HICO                             | GCNCL                        | Trains unseen classes in a weakly supervised manner                                                                        | 16.02
    Ref. [51] | 2019 | COCO-a / UnRel / HICO-DET        | Triplet model                | Proposes an analogy transfer model that computes embeddings of never-seen visual phrases                                   | 7.30 / 17.50 / 20.90
    Ref. [49] | 2019 | VT60                             | ZSL + S2S                    | Semantics-to-space architecture that incorporates zero-shot learning to jointly capture and query information              | 50.47
    Ref. [50] | 2018 | ILSVRC-2017, Unseen (all) / Seen | End-to-end deep architecture | Extends the zero-shot task to object detection with an end-to-end deep network jointly modeling visual and semantic information | 16.40 / 26.10
    Ref. [48] | 2018 | HICO-DET                         | Multi-task training network  | Extends HOI recognition to a long tail of categories through zero-shot learning, enabling zero-shot detection of unseen verb-object pairs | 6.46
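
Rows [56] and [57] follow the query-based Transformer paradigm: a fixed set of learned queries attends to encoded image features, and each query decodes one candidate triplet. The sketch below captures the spirit of such a head; it is not the exact architecture of HOTR or QPIC, and every dimension and layer choice here is an assumption made for illustration:

```python
import torch.nn as nn

class HOIQueryHead(nn.Module):
    """Illustrative query-based HOI head: each learned query decodes
    one candidate <human, verb, object> triplet."""

    def __init__(self, d_model=256, num_queries=100,
                 num_obj_classes=80, num_verbs=117):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.human_box = nn.Linear(d_model, 4)     # (cx, cy, w, h), normalized
        self.object_box = nn.Linear(d_model, 4)
        self.object_cls = nn.Linear(d_model, num_obj_classes + 1)  # +1: "no object"
        self.verb_cls = nn.Linear(d_model, num_verbs)              # multi-label verbs

    def forward(self, image_tokens):
        # image_tokens: (batch, H*W, d_model) flattened encoder features
        q = self.queries.weight.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        h = self.decoder(q, image_tokens)          # (batch, num_queries, d_model)
        return {
            "human_boxes": self.human_box(h).sigmoid(),
            "object_boxes": self.object_box(h).sigmoid(),
            "object_logits": self.object_cls(h),
            "verb_logits": self.verb_cls(h),       # apply sigmoid per verb
        }
```

During training, heads of this kind are matched to ground-truth triplets with a set-based (Hungarian) loss, which removes hand-crafted human-object pair proposals and keeps the pipeline end-to-end.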
  • [1] JOHNSON J, KRISHNA R, STARK M, et al. Image retrieval using scene graphs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE Computer Society, 2015: 3668-3678.
[2] LI Y K, OUYANG W L, ZHOU B L, et al. Scene graph generation from objects, phrases and region captions[DB/OL]. (2017-07-31)[2021-02-02]. https://arxiv.org/abs/1707.09700.
    [3] XU D F, ZHU Y K, CHOY C B, et al. Scene graph generation by iterative message passing[DB/OL]. (2017-01-10)[2021-02-02]. https://arxiv.org/abs/1701.02426.
    [4] BERGSTROM T, SHI H. Human-object interaction detection: a quick survey and examination of methods[DB/OL]. (2020-09-27)[2021-02-02]. https://arxiv.org/abs/2009.12950.
    [5] GUPTA A, KEMBHAVI A, DAVIS L S. Observing human-object interactions: using spatial and functional compatibility for recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(10): 1775-1789. doi: 10.1109/TPAMI.2009.83
[6] PREST A, SCHMID C, FERRARI V. Weakly supervised learning of interactions between humans and objects[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(3): 601-614. doi: 10.1109/TPAMI.2011.158
    [7] LI L J, LI F F. What, where and who? Classifying events by scene and object recognition[C]//Proceedings of IEEE International Conference on Computer Vision. [S.l.]: IEEE, 2007: 1-8.
[8] LE D T, UIJLINGS J, BERNARDI R. TUHOI: Trento universal human object interaction dataset[C]// Proceedings of the Third Workshop on Vision and Language. Brighton: Brighton University, 2014: 17-24.
    [9] CHAO Y W, WANG Z, HE Y, et al. HICO: a benchmark for recognizing human-object interactions in images[C]//IEEE International Conference on Computer Vision. [S.l.]: IEEE, 2015: 1-9.
[10] ANDRILUKA M, PISHCHULIN L, GEHLER P, et al. 2D human pose estimation: new benchmark and state of the art analysis[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2014: 3686-3693.
[11] GUPTA S, MALIK J. Visual semantic role labeling[DB/OL]. (2015-05-17)[2021-02-02]. https://arxiv.org/abs/1505.04474.
    [12] CHAO Y W, LIU Y, LIU X, et al. Learning to detect human-object interactions[C]//2018 IEEE Winter Conference on Applications of Computer Vision. [S.l.]: IEEE, 2018: 381-389.
[13] LI Y L, XU L, LIU X, et al. PaStaNet: toward human activity knowledge engine[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2020: 379-388.
    [14] LIAO Y, LIU S, WANG F, et al. PPDM: parallel point detection and matching for real-time human-object interaction detection[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2020: 479-487.
[15] ZHUANG B, WU Q, SHEN C, et al. HCVRD: a benchmark for large-scale human-centered visual relationship detection[C/OL]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018. [2021-02-22]. https://ojs.aaai.org/index.php/AAAI/article/view/12260.
[16] XU B J, LI J N, WONG Y K, et al. Interact as you intend: intention-driven human-object interaction detection[J]. IEEE Transactions on Multimedia, 2019, 22(6): 1423-1432.
[17] ULUTAN O, IFTEKHAR A S M, MANJUNATH B S. VSGNet: spatial attention network for detecting human object interactions using graph convolutions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2020: 13617-13626.
    [18] GIRSHICK R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. [S.l.]: IEEE, 2015: 1440-1448.
    [19] GAO C, ZOU Y, HUANG J B. iCAN: instance-centric attention network for human-object interaction detection[DB/OL]. (2018-08-30)[2021-02-22]. https://arxiv.org/abs/1808.10437.
    [20] WANG T, ANWER R M, KHAN M H, et al. Deep contextual attention for human-object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. [S.l.]: IEEE, 2019: 5694-5702.
[21] PENG C, ZHANG X, YU G, et al. Large kernel matters: improve semantic segmentation by global convolutional network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2017: 4353-4361.
    [22] GIRDHAR R, RAMANAN D. Attentional pooling for action recognition[DB/OL]. (2017-11-04)[2021-02-15]. https://doi.org/10.48550/arXiv.1711.01467.
    [23] BANSAL A, RAMBHATLA S S, SHRIVASTAVA A, et al. Spatial priming for detecting human-object interactions[DB/OL]. (2020-04-09)[2021-02-15]. https://arxiv.org/abs/2004.04851.
[24] GKIOXARI G, GIRSHICK R, DOLLÁR P, et al. Detecting and recognizing human-object interactions[DB/OL]. (2017-04-24)[2021-02-22]. https://arxiv.org/abs/1704.07333.
    [25] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
    [26] GUPTA T, SCHWING A, HOIEM D. No-frills human-object interaction detection: factorization, layout encodings, and training techniques[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. [S.l.]: IEEE, 2019: 9677-9685.
    [27] YU F, WANG D, SHELHAMER E, et al. Deep layer aggregation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2018: 2403-2412.
    [28] ZHOU X Y, WANG D Q, KRÄHENBÜHL P. Objects as points[DB/OL]. (2019-04-16)[2021-02-15]. http://arxiv.org/abs/1904.07850.
[29] LAW H, DENG J. CornerNet: detecting objects as paired keypoints[C]//Proceedings of the European Conference on Computer Vision. [S.l.]: Springer, 2018: 734-750.
    [30] NEWELL A, YANG K, DENG J. Stacked hourglass networks for human pose estimation[C]//European Conference on Computer Vision. [S.l.]: Springer, 2016: 483-499.
    [31] LI Y L, ZHOU S, HUANG X, et al. Transferable interactiveness knowledge for human-object interaction detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2019: 3585-3594.
[32] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[33] LI J, WANG C, ZHU H, et al. CrowdPose: efficient crowded scenes pose estimation and a new benchmark[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2019: 10863-10872.
    [34] WAN B, ZHOU D, LIU Y, et al. Pose-aware multi-level feature network for human object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. [S.l.]: IEEE, 2019: 9469-9478.
    [35] CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2018: 7103-7112.
[36] LIANG Z J, LIU J F, GUAN Y S, et al. Pose-based modular network for human-object interaction detection[DB/OL]. (2020-08-05)[2021-02-22]. https://arxiv.org/abs/2008.02042.
    [37] LIANG Z J, LIU J F, GUAN Y S, et al. Visual-semantic graph attention networks for human-object interaction detection[DB/OL]. (2020-01-07)[2021-02-22]. https://arxiv.org/abs/2001.02302.
    [38] FANG H S, CAO J, TAI Y W, et al. Pairwise body-part attention for recognizing human-object interactions[C]//Proceedings of the European Conference on Computer Vision. [S.l.]: Springer, 2018: 51-67.
[39] FANG H S, XIE S, TAI Y W, et al. RMPE: regional multi-person pose estimation[C]//Proceedings of the IEEE International Conference on Computer Vision. [S.l.]: IEEE, 2017: 2334-2343.
[40] MALLYA A, LAZEBNIK S. Learning models for actions and person-object interactions with transfer to question answering[C]//Proceedings of the European Conference on Computer Vision. [S.l.]: Springer, 2016: 414-428.
    [41] ZHOU P, CHI M. Relation parsing neural network for human-object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. [S.l.]: IEEE, 2019: 843-851.
[42] GIRSHICK R, RADOSAVOVIC I, GKIOXARI G, et al. Detectron[CP/OL]. (2020-09-22)[2021-02-11]. https://github.com/facebookresearch/detectron.
    [43] HE K, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. [S.l.]: IEEE, 2017: 2961-2969.
    [44] QI S, WANG W, JIA B, et al. Learning human-object interactions by graph parsing neural networks[C]//Proceedings of the European Conference on Computer Vision. [S.l.]: Springer, 2018: 401-417.
    [45] LIU H C, MU T J, HUANG X L. Detecting human-object interaction with multi-level pairwise feature network[J]. Computational Visual Media, 2021, 7(2): 229-239. doi: 10.1007/s41095-020-0188-2
    [46] ZHONG X, QU X, DING C, et al. Glance and gaze: inferring action-aware points for one-stage human-object interaction detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2021: 13234-13243.
    [47] LAMPERT C H, NICKISCH H, HARMELING S. Learning to detect unseen object classes by between-class attribute transfer[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). [S.l.]: IEEE, 2009: 951-958.
    [48] SHEN L, YEUNG S, HOFFMAN J, et al. Scaling human-object interaction recognition through zero-shot learning[C]//2018 IEEE Winter Conference on Applications of Computer Vision. [S.l.]: IEEE, 2018: 1568-1576.
[49] EUM S, KWON H. Semantics to space (S2S): embedding semantics into spatial space for zero-shot verb-object query inferencing[DB/OL]. (2019-06-13)[2022-02-22]. https://arxiv.org/abs/1906.05894.
[50] RAHMAN S, KHAN S, PORIKLI F. Zero-shot object detection: learning to simultaneously recognize and localize novel concepts[DB/OL]. (2018-03-16)[2021-02-22]. https://arxiv.org/abs/1803.06049.
    [51] PEYRE J, LAPTEV I, SCHMID C, et al. Detecting unseen visual relations using analogies[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. [S.l.]: IEEE, 2019: 1981-1990.
[52] PREST A, SCHMID C, FERRARI V. Weakly supervised learning of interactions between humans and objects[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(3): 601-614.
    [53] PEYRE J, LAPTEV I, SCHMID C, et al. Weakly-supervised learning of visual relations[DB/OL]. (2017-07-29)[2021-02-22]. https://arxiv.org/abs/1707.09472.
[54] SARULLO A, MU T T. Zero-shot human-object interaction recognition via affordance graphs[DB/OL]. (2020-09-02)[2021-02-22]. https://arxiv.org/abs/2009.01039.
[55] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[DB/OL]. (2017-06-12)[2022-02-26]. https://doi.org/10.48550/arXiv.1706.03762.
[56] KIM B, LEE J, KANG J, et al. HOTR: end-to-end human-object interaction detection with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2021: 74-83.
    [57] TAMURA M, OHASHI H, YOSHINAGA T. QPIC: query-based pairwise human-object interaction detection with image-wide contextual information[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2021: 10410-10419.
Publication history
  • Received: 2021-04-28
  • Revised: 2021-09-14
  • Published: 2021-10-27
