Citation: GONG Xun, ZHANG Zhiying, LIU Lu, MA Bing, WU Kunlun. A Survey of Human-Object Interaction Detection[J]. Journal of Southwest Jiaotong University, 2022, 57(4): 693-704. doi: 10.3969/j.issn.0258-2724.20210339
As an interdisciplinary field spanning object detection, action recognition, and visual relationship detection, human-object interaction (HOI) detection aims to identify the interactions between humans and objects in specific application scenarios. This survey systematically summarizes recent work on image-based HOI detection. First, based on the theory of interaction modeling, HOI detection methods are divided into two categories, global-instance-based and local-instance-based, and representative methods of each category are elaborated and analyzed in detail. The global-instance-based methods are then further subdivided according to the visual features they fuse: spatial information, appearance information, and body-posture information. Finally, the applications of zero-shot learning, weakly supervised learning, and the Transformer model to HOI detection are discussed; the challenges facing HOI detection are examined from three aspects (the HOI task itself, visual distraction, and motion perspective); and domain generalization, real-time detection, and end-to-end networks are identified as future development trends.