Learning-Based Stereoscopic View Synthesis with Cascaded Deep Neural Networks

Wei Liu; Liyan Ma; Mingyue Cui

doi:10.20965/jaciii.2022.p0393

JACIII Vol.26 No.3 pp. 393-406 (2022)

Paper:

Learning-Based Stereoscopic View Synthesis with Cascaded Deep Neural Networks

Wei Liu^*,†, Liyan Ma^**, and Mingyue Cui^*

^*College of Electromechanic Engineering, Nanyang Normal University
No.1638 Wolong Road, Wolong District, Nanyang, Henan 473061, China

^**College of Computer Engineering and Science, Shanghai University
No.99 Shangda Road, Baoshan District, Shanghai 200444, China

^†Corresponding author

Received: October 25, 2021

Accepted: March 8, 2022

Published: May 20, 2022

Keywords: DIBR, deep neural networks, hole filling, view synthesis

Abstract

Depth image-based rendering (DIBR) is an important technique in 2D-to-3D conversion: it renders virtual views from a texture image and its associated depth map. However, current DIBR systems still suffer from problems such as disocclusion. In this study, a new learning-based framework that models the conventional DIBR synthesis pipeline is proposed to address these problems. The proposed model adopts a coarse-to-fine approach that performs virtual view prediction and disocclusion region refinement sequentially in a unified deep learning framework comprising two cascaded joint filter block-based convolutional neural networks (CNNs) and one residual learning-based generative adversarial network (GAN). An edge-guided global looping optimization strategy progressively reconstructs the scene structures in the novel view, and a novel directional discounted reconstruction loss is proposed to improve training. As a result, the framework performs well in terms of virtual view quality and is more suitable for 2D-to-3D conversion applications. Experimental results demonstrate that the proposed method generates visually satisfactory views.
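To make the rendering step and the disocclusion problem concrete, the following is a minimal sketch of conventional DIBR forward warping, not the learning-based framework proposed in the paper. The function name `dibr_warp`, the `max_disparity` parameter, the linear depth-to-disparity mapping, the leftward shift direction, and the convention that larger depth values are nearer to the camera are all assumptions made for illustration.

```python
import numpy as np


def dibr_warp(texture, depth, max_disparity=4):
    """Forward-warp a texture image horizontally using its depth map.

    texture: (H, W) or (H, W, C) array.
    depth:   (H, W) array in [0, 255]; larger means nearer (assumed convention).
    Returns (virtual_view, hole_mask), where hole_mask marks target pixels
    that received no source pixel (disocclusions and border gaps).
    """
    h, w = depth.shape
    # Assumed linear mapping from depth value to integer pixel disparity.
    disparity = np.rint(depth.astype(np.float64) / 255.0 * max_disparity).astype(int)

    virtual = np.zeros_like(texture)
    # Disparity of the pixel currently stored at each target; -1 means empty.
    zbuf = np.full((h, w), -1, dtype=int)

    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            xt = x - d  # nearer pixels shift further left (assumed direction)
            # Keep the nearest (largest-disparity) pixel when several collide.
            if 0 <= xt < w and d > zbuf[y, xt]:
                virtual[y, xt] = texture[y, x]
                zbuf[y, xt] = d

    hole_mask = zbuf < 0
    return virtual, hole_mask
```

On a single image row in which two foreground pixels sit in front of a flat background, the foreground shifts left, overwrites the background it lands on, and leaves a hole of the same width on its right. Regions of this kind are what a hole-filling or refinement stage has to reconstruct.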

Cite this article as:

W. Liu, L. Ma, and M. Cui, “Learning-Based Stereoscopic View Synthesis with Cascaded Deep Neural Networks,” J. Adv. Comput. Intell. Intell. Inform., Vol.26 No.3, pp. 393-406, 2022.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
