Robot Vision System for Human Detection and Action Recognition

Satoshi Hoshino; Kyohei Niimura

doi:10.20965/jaciii.2020.p0346

JACIII Vol.24 No.3 pp. 346-356

(2020)

Paper:

Satoshi Hoshino and Kyohei Niimura

Department of Mechanical and Intelligent Engineering, Graduate School of Engineering, Utsunomiya University
7-1-2 Yoto, Utsunomiya, Tochigi 321-8585, Japan

Received: December 2, 2019
Accepted: March 4, 2020
Published: May 20, 2020

Keywords: robot vision, generic object recognition, real-time image processing, CNN, optical flow

Abstract

Mobile robots equipped with camera sensors must perceive humans and their actions to navigate autonomously and safely. Because human detection and action recognition have to run simultaneously, the real-time performance of the robot vision system is an important issue. In this paper, we propose a robot vision system in which the original images captured by a camera sensor are described by optical flow. These optical flow images are then used as inputs for human classification and action classification, for which we develop two classifiers based on convolutional neural networks. Moreover, we describe a novel detector (a local search window) that clips partial images around the target human from the original image. Since the camera sensor moves together with the robot, the camera movement affects the optical flow calculated in the image; we address this by further modifying the optical flow to account for the changes caused by the camera movement. Through experiments, we show that the robot vision system can detect humans and recognize their actions in real time. Furthermore, we show that a moving robot can achieve human detection and action recognition when the optical flow is modified in this way.
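The idea of modifying the optical flow for camera movement can be illustrated with a minimal sketch. The abstract does not state how the authors perform the correction, so the approach below is an assumption chosen for simplicity: the camera-induced motion is treated as a single global translation and estimated as the per-component median of the flow field, which is reasonable only when the background covers most of the image. The function name `compensate_camera_motion` and the toy flow field are hypothetical.

```python
from statistics import median


def compensate_camera_motion(flow):
    """Subtract an estimated global translation from a dense flow field.

    flow: list of rows, each row a list of (dx, dy) tuples in pixels per frame.
    Assumption (not from the paper): the background dominates the image,
    so the per-component median approximates the camera-induced motion.
    """
    dxs = [dx for row in flow for dx, _ in row]
    dys = [dy for row in flow for _, dy in row]
    cam_dx, cam_dy = median(dxs), median(dys)
    # Residual flow: what remains after removing the camera-induced part.
    residual = [[(dx - cam_dx, dy - cam_dy) for dx, dy in row] for row in flow]
    return residual, (cam_dx, cam_dy)


# Toy 3x3 flow field: the camera pans so the background appears to move by
# (2, 0); the centre pixel belongs to a person with an extra motion of (0, 3).
flow = [[(2.0, 0.0)] * 3 for _ in range(3)]
flow[1][1] = (2.0, 3.0)

residual, camera = compensate_camera_motion(flow)
print(camera)          # estimated camera-induced translation
print(residual[1][1])  # residual motion at the person's pixel
```

In this toy case the background residual becomes zero and only the person's own motion remains. A pure translation cannot describe rotation or zoom of the camera; a parametric motion model would be needed for those cases.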

Cite this article as:

S. Hoshino and K. Niimura, “Robot Vision System for Human Detection and Action Recognition,” J. Adv. Comput. Intell. Intell. Inform., Vol.24 No.3, pp. 346-356, 2020.

References

[1] R. Girshick, “Fast R-CNN,” IEEE Int. Conf. on Computer Vision, pp. 1440-1448, 2015.
[2] S. Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.39, Issue 6, pp. 1137-1149, 2016.
[3] J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection,” IEEE Conf. on Computer Vision and Pattern Recognition, pp. 779-788, 2016.
[4] W. Liu et al., “SSD: Single Shot MultiBox Detector,” European Conf. on Computer Vision, pp. 21-37, 2016.
[5] T.-Y. Lin et al., “Focal Loss for Dense Object Detection,” IEEE Trans. on Pattern Analysis and Machine Intelligence, pp. 2999-3007, 2017.
[6] Y. LeCun et al., “Gradient-Based Learning Applied to Document Recognition,” Proc. of the IEEE, Vol.86, Issue 11, pp. 2278-2324, 1998.
[7] M. Baccouche et al., “Sequential Deep Learning for Human Action Recognition,” Int. Workshop on Human Behavior Understanding, pp. 29-39, 2011.
[8] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in Videos,” Advances in Neural Information Processing Systems, pp. 568-576, 2014.
[9] C. Feichtenhofer et al., “Spatiotemporal Residual Networks for Video Action Recognition,” Int. Conf. on Neural Information Processing Systems, pp. 3476-3484, 2016.
[10] L. Wang et al., “Temporal Segment Networks: Towards Good Practices for Deep Action Recognition,” European Conf. on Computer Vision, pp. 20-36, 2016.
[11] D. Schuurmans, “Greedy Importance Sampling,” Int. Conf. on Neural Information Processing Systems, pp. 596-602, 1999.
[12] K. E. A. van de Sande et al., “Segmentation as Selective Search for Object Recognition,” IEEE Int. Conf. on Computer Vision, pp. 1879-1886, 2011.
[13] J. R. R. Uijlings et al., “Selective Search for Object Recognition,” Int. J. of Computer Vision, Vol.104, pp. 154-171, 2013.
[14] K. Okuma et al., “A Boosted Particle Filter: Multitarget Detection and Tracking,” European Conf. on Computer Vision, pp. 28-39, 2004.
[15] G. Farnebäck, “Two-frame Motion Estimation based on Polynomial Expansion,” Scandinavian Conf. on Image Analysis, Lecture Notes in Computer Science, Vol.2749, pp. 363-370, 2003.
[16] A. Fathi and G. Mori, “Action recognition by learning mid-level motion features,” IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1-8, 2008.
[17] M. Jain et al., “Better Exploiting Motion for Better Action Recognition,” IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2555-2562, 2013.
[18] Y. LeCun et al., “Deep learning,” Nature, Vol.521, Issue 7553, pp. 436-444, 2015.
[19] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” Int. Conf. on Learning Representations, 2015.
[20] N. Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” J. of Machine Learning Research, Vol.15, Issue 1, pp. 1929-1958, 2014.
[21] N. Otsu, “A Threshold Selection Method from Gray-Level Histograms,” IEEE Trans. on Systems, Man, and Cybernetics, Vol.9, Issue 1, pp. 62-66, 1979.
[22] J. M. Odobez and P. Bouthemy, “Robust Multiresolution Estimation of Parametric Motion Models,” J. of Visual Communication and Image Representation, Vol.6, Issue 4, pp. 348-365, 1995.
[23] F. Goudail et al., “Bhattacharyya Distance as a Contrast Parameter for Statistical Processing of Noisy Optical Images,” J. of the Optical Society of America A, Vol.21, Issue 7, pp. 1231-1240, 2004.
[24] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” IEEE Conf. on Computer Vision and Pattern Recognition, Vol.1, pp. 886-893, 2005.
[25] X. Lu et al., “A Simple, yet Effective and Efficient, Sliding Window Sampling Algorithm,” Int. Conf. on Database Systems for Advanced Applications, Lecture Notes in Computer Science, Vol.5981, pp. 337-351, 2010.
[26] D. Comaniciu et al., “Mean Shift: A Robust Approach Toward Feature Space Analysis,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.24, Issue 5, pp. 603-619, 2002.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
