Optical Flow for Real-Time Human Detection and Action Recognition Based on CNN Classifiers

Satoshi Hoshino; Kyohei Niimura

doi:10.20965/jaciii.2019.p0735


JACIII Vol.23 No.4 pp. 735-742 (2019)

Paper:



Department of Mechanical and Intelligent Engineering, Graduate School of Engineering, Utsunomiya University
7-1-2 Yoto, Utsunomiya, Tochigi 321-8585, Japan

Received: October 5, 2018
Accepted: February 19, 2019
Published: July 20, 2019

Keywords: robot vision, generic object recognition, real-time image processing, CNN, optical flow

Abstract

Mobile robots equipped with camera sensors must perceive surrounding humans and their actions to navigate safely and autonomously. In this work, the target objects are moving humans. Because real-time performance is an important requirement for robot vision, we propose a robot vision system in which the original images captured by a camera sensor are described by optical flow. These optical-flow images are then used as inputs to a classifier. To classify images as human or not human, and to recognize actions, we use a convolutional neural network (CNN) rather than hand-coding invariant features. Moreover, we present a local search window as a novel detector that clips partial images around target objects in an original image. Through experiments, we show that the robot vision system is able to detect moving humans and recognize their actions in real time.
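The abstract does not specify how the optical flow is encoded or how the local search window is positioned, so the following is only a minimal sketch under stated assumptions. It assumes a dense flow field is already available as per-pixel (dx, dy) displacements, encodes direction as hue and speed as brightness (a common visualization convention, not necessarily the authors' encoding), and clips a fixed-size partial image around the centroid of the moving pixels as a stand-in for the local search window. The function names `flow_to_rgb`, `motion_centroid`, and `clip_window` and the `threshold` parameter are hypothetical.

```python
import colorsys
import math


def flow_to_rgb(flow):
    """Encode a dense flow field as an RGB image (assumed encoding).

    flow is a list of rows, each a list of (dx, dy) displacements.
    Hue encodes direction; brightness encodes speed relative to the
    fastest pixel in the frame. Static pixels come out black.
    """
    max_mag = max(math.hypot(dx, dy) for row in flow for dx, dy in row)
    image = []
    for row in flow:
        out_row = []
        for dx, dy in row:
            hue = (math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi)
            value = math.hypot(dx, dy) / max_mag if max_mag > 0 else 0.0
            r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
            out_row.append((round(r * 255), round(g * 255), round(b * 255)))
        image.append(out_row)
    return image


def motion_centroid(flow, threshold):
    """Return the speed-weighted (row, col) centroid of moving pixels.

    Pixels slower than threshold are ignored. Returns None when
    nothing in the frame moves.
    """
    total = row_sum = col_sum = 0.0
    for r, row in enumerate(flow):
        for c, (dx, dy) in enumerate(row):
            mag = math.hypot(dx, dy)
            if mag >= threshold:
                total += mag
                row_sum += mag * r
                col_sum += mag * c
    if total == 0.0:
        return None
    return round(row_sum / total), round(col_sum / total)


def clip_window(image, center, size):
    """Clip a size x size partial image around center, kept inside the frame."""
    height, width = len(image), len(image[0])
    top = max(0, min(center[0] - size // 2, height - size))
    left = max(0, min(center[1] - size // 2, width - size))
    return [row[left:left + size] for row in image[top:top + size]]
```

In such a pipeline, each clipped partial image would be passed to the CNN classifier; a static background contributes only black pixels, which is one plausible reason an optical-flow input suits the detection of moving humans.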

Figure: Real-time human detection and action recognition for multiple persons

Cite this article as:

S. Hoshino and K. Niimura, “Optical Flow for Real-Time Human Detection and Action Recognition Based on CNN Classifiers,” J. Adv. Comput. Intell. Intell. Inform., Vol.23 No.4, pp. 735-742, 2019.



This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
