Citation: GONG Xun, ZHANG Zhiying, LIU Lu, MA Bing, WU Kunlun. A Survey of Human-Object Interaction Detection[J]. Journal of Southwest Jiaotong University, 2022, 57(4): 693-704. doi: 10.3969/j.issn.0258-2724.20210339
As an interdisciplinary field spanning object detection, action recognition, and visual relationship detection, human-object interaction (HOI) detection aims to identify the interactions between humans and objects in specific application scenarios. This survey systematically summarizes recent work on image-based HOI detection. First, based on the theory of interaction modeling, HOI detection methods are divided into two categories, global-instance-based and local-instance-based, and representative methods of each category are elaborated and analyzed in detail. The global-instance-based methods are then further subdivided according to the visual features they fuse: spatial information, appearance information, and body-posture information. Finally, the applications of zero-shot learning, weakly supervised learning, and the Transformer model to HOI detection are discussed; the challenges facing HOI detection are examined from three aspects (the HOI task itself, visual distraction, and motion perspective); and domain generalization, real-time detection, and end-to-end networks are identified as future development trends.