The influence of the scene on monocular 3D pose estimation through a CNN
This project is already completed.

Introduction
Spatial information about the user is used to register the user in virtual reality (VR), augmented reality (AR), or mixed reality (MR) [3]. In order to register a user in virtual space, one or more sensors have to determine the user's position relative to themselves. The resulting position can be used to render a scene from the user's perspective, as is done by head-mounted displays (HMDs), CAVEs, and virtual mirrors.
For multi-modal interaction scenarios, registering the user's position and pose is crucial in order to design the most direct interaction techniques possible. For example, a pointing ray or cone can be calculated from the head and hand positions [6] and used to select objects in virtual space. Beyond the derivation of such deictic gestures, complete and accurate motion capturing enables the recognition of iconic or symbolic gestures [20]. A wide range of interaction possibilities thus becomes available through gestures.
Human pose estimation is the localization of human joints or of a particular pose, where a pose consists of position and orientation. Pose information is also used in other areas of human-computer interaction (HCI). Proxemic interaction [4][8] uses it to change the system state depending on the user's proximity. Context-aware interaction [19] uses location information to adapt the system to the user's context, for example when the user leaves or joins a group of people. Natural user interaction [24] is meant to let the user interact with a system directly; one possible form is multi-modal interaction (MMI). All these areas strive for interaction without instrumenting the user, in order to create an interaction scenario that is as natural as possible.
There are different methods for motion capturing [25]. Non-visual motion capturing can be realized with acoustic, magnetic, or inertial sensors. Ultrasound-based systems like WearTrack [7] or the Cricket location system [17] determine positions based on the time of flight of audio signals from an emitter. In contrast, magnetic sensors can determine their position and orientation within a global coordinate system using information from the local magnetic environment [10]. Such sensors are independent of the line of sight but prone to ferromagnetic interference. Inertial sensors like accelerometers or gyroscopes can also be used to capture the position of a user; these systems are, however, subject to a drift in the position data over time. All these non-visual systems have in common that they require instrumentation of the user.
Visual motion capturing approaches use either RGB-D or RGB sensors and can be marker-based or markerless [15]. In order to use a marker-based technique, the user has to wear a marker suit, have markers attached directly to the body, or have them built into hardware; the latter is the case in commercial HMDs. Markerless approaches using depth sensors (RGB-D) like the Microsoft Kinect have the advantage that no markers need to be attached to the user before using an application. A problem with infrared sensors, however, is interference from sunlight, which can decrease accuracy. Markerless methods using RGB sensors [15] likewise require no instrumentation of the user and are less prone to sunlight. Since users do not need to be instrumented and RGB sensors are available nearly everywhere, a markerless approach using an RGB sensor should be used to determine the user's position relative to the sensor. Recent approaches use a convolutional neural network (CNN) [12] to predict the user's position and pose. The CNN is trained with a set of RGB images annotated with the position of each joint.
Problem
Machine learning methods require large amounts of data to learn complex problems, since small amounts of data combined with many features can lead to high variance and thus to overfitting [5]. One solution to obtain large amounts of data for this problem is to generate synthetic training data [18]. Varol et al. [22] have shown that CNNs trained with synthetically generated training data can achieve plausible results and are able to predict depth. They use the SMPL body model [13] to generate avatars and insert them into images. This approach does deliver training data containing realistic backgrounds with synthetic-looking avatars. However, the avatars are inserted into the images at random, without reference to the environment (see Figure 2). This results in unrealistic images in which avatars float, stand on walls, or appear at implausible sizes. Additionally, the shadows inserted afterwards are not always consistent with the lighting in the scenes.

Goal
Within the scope of this work it is to be investigated whether synthetic training data in which avatars are placed with reference to the environment leads to more accurate depth prediction, since the CNN can learn reference points. For this purpose, different scenes are to be modeled and avatars [1][23] are to be generated. The scenes, avatars, and lighting should be as photo-realistic as possible. Unlike previous work, our avatars are scans of real persons and not an average model of several people. Using the scenes and avatars, training and test material is selected to train a state-of-the-art CNN such as a ResNet [9][21], which is very deep and performs well compared to other CNNs. Subsequently, the resulting network is to be benchmarked on common datasets such as Human3.6M [11] and compared with reported results.
The following objectives are to be achieved:
- Realistic scenes and avatars are to be modeled.
- Training material is to be generated from the avatars and scenes.
- A CNN is to be trained using the training material.
- The resulting accuracy shall be evaluated on common test material.

Approach
Instead of using SMPL to obtain a variety of avatars, around 80 test persons are to be scanned using the system described in [23]. Since the goal is to be as photo-realistic as possible, high-quality Unity assets (e.g. [2]) and tech demonstrations are used to model the scenes.

In order to produce realistic training material, the Unity engine is used to place and animate avatars with MoCap animations in a 3D scene. During a scripted playback, screenshots are taken from different perspectives and the corresponding joint positions (as vectors) and orientations (as quaternions) are saved as comma-separated values (CSV).
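As an illustration, the following is a minimal Python sketch of how such CSV annotations could be read back for training. The column layout (image filename followed by per-joint position and quaternion values) and the joint list are assumptions made for this sketch; the actual format is defined by the capture script.

```python
# Sketch: load assumed CSV annotations produced by the Unity capture script.
# Assumed layout per row: image filename, then for each joint
# x, y, z (position) and qx, qy, qz, qw (orientation quaternion).
import csv
import numpy as np

JOINTS = ["head", "neck", "l_hand", "r_hand"]  # placeholder joint list

def load_annotations(path):
    """Return image names and an array of joint positions of shape (N, J, 3)."""
    names, positions = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            names.append(row[0])
            values = np.array(row[1:], dtype=np.float32).reshape(len(JOINTS), 7)
            positions.append(values[:, :3])  # keep x, y, z; quaternions are values[:, 3:]
    return names, np.stack(positions)
```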
The images and annotations are used to train a ResNet on dedicated hardware. To this end, the Python scripts from Tobias Schmee's preliminary work must be adapted to predict multiple joints.
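Since those scripts are not part of this document, the following is only a hedged sketch of the intended adaptation, assuming a PyTorch/torchvision setup: the final fully connected layer of a ResNet-50 is replaced so that the network regresses three coordinates per joint, trained with a mean squared error loss against the annotated positions.

```python
# Sketch (assumed PyTorch/torchvision setup): ResNet-50 adapted to regress
# 3D coordinates for NUM_JOINTS joints from a single RGB image.
import torch
import torch.nn as nn
from torchvision import models

NUM_JOINTS = 17  # placeholder; depends on the chosen skeleton

model = models.resnet50(pretrained=True)                    # ImageNet-pretrained backbone
model.fc = nn.Linear(model.fc.in_features, NUM_JOINTS * 3)  # regression head

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, joints):
    """images: (B, 3, H, W) tensor; joints: (B, NUM_JOINTS, 3) ground-truth positions."""
    optimizer.zero_grad()
    pred = model(images).view(-1, NUM_JOINTS, 3)
    loss = criterion(pred, joints)
    loss.backward()
    optimizer.step()
    return loss.item()
```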
The quality of such CNNs is most often compared using the mean per joint position error (MPJPE) [26][14]. The per joint position error is the Euclidean distance between ground truth and prediction for a single joint; the MPJPE is accordingly the mean of the per joint position errors over all joints. This quality indicator allows a comparison with systems reported in the literature. To determine whether our training material leads to better results, the MPJPE of the same CNN trained on a common training set such as Human3.6M also has to be calculated.
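A minimal sketch of this metric, assuming predictions and ground truth are given as arrays of 3D joint positions:

```python
# Sketch: mean per joint position error (MPJPE) between predicted and
# ground-truth joint positions, both of shape (num_frames, num_joints, 3).
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between prediction and ground truth over all joints."""
    per_joint_error = np.linalg.norm(pred - gt, axis=-1)  # (num_frames, num_joints)
    return per_joint_error.mean()
```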
REFERENCES
[1] Jascha Achenbach, Thomas Waltemate, Marc Erich Latoschik, and Mario Botsch. 2017. Fast Generation of Realistic Virtual Humans. In 23rd ACM Symposium on Virtual Reality Software and Technology (VRST). 12:1–12:10.
[2] ArchVizPRO. 2018. ArchVizPRO Interior Vol.6. https://assetstore.unity.com/packages/3d/environments/urban/archvizpro-interior-vol-6-120489
[3] Ronald T Azuma. 1997. A survey of augmented reality. Presence: Teleoperators & Virtual Environments 6, 4 (1997), 355–385.
[4] Till Ballendat, Nicolai Marquardt, and Saul Greenberg. 2010. Proxemic interaction: designing for a proximity and orientation-aware environment. In ACM International Conference on Interactive Tabletops and Surfaces. ACM, 121–130.
[5] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
[6] Doug Bowman, Ernst Kruijff, Joseph J LaViola Jr, and Ivan P Poupyrev. 2004. 3D User interfaces: theory and practice, CourseSmart eTextbook. Addison-Wesley.
[7] E. Foxlin and M. Harrington. 2000. WearTrack: a self-referenced head and hand tracker for wearable computers and portable VR. In Digest of Papers. Fourth International Symposium on Wearable Computers. 155–162. https://doi.org/10.1109/ISWC.2000.888482
[8] Saul Greenberg, Nicolai Marquardt, Till Ballendat, Rob Diaz-Marino, and Miaosen Wang. 2011. Proxemic interactions: the new ubicomp? interactions 18, 1 (2011), 42–50.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[10] J. Hightower and G. Borriello. 2001. Location systems for ubiquitous computing. Computer 34, 8 (Aug 2001), 57–66. https://doi.org/10.1109/2.940014
[11] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2014. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (jul 2014), 1325–1339.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[13] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2015. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34, 6 (2015), 248.
[14] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. 2017. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 International Conference on 3D Vision (3DV). IEEE, 506–516.
[15] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. 2017. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Transactions on Graphics 36, 4, 14. https://doi.org/10.1145/3072959.3073596
[16] Thomas B Moeslund, Adrian Hilton, and Volker Krüger. 2006. A survey of advances in vision-based human motion capture and analysis. Computer vision and image understanding 104, 2-3 (2006), 90–126.
[17] Nissanka B. Priyantha, Anit Chakraborty, and Hari Balakrishnan. 2000. The Cricket Location-support System. In Proceedings of the 6th Annual International Conference on Mobile Computing and Networking (MobiCom ’00). ACM, New York, NY, USA, 32–43. https://doi.org/10.1145/345910.345917
[18] Grégory Rogez and Cordelia Schmid. 2016. Mocap-guided data augmentation for 3d pose estimation in the wild. In Advances in Neural Information Processing Systems. 3108–3116.
[19] Bill Schilit, Norman Adams, and Roy Want. 1994. Context-aware computing applications. In Mobile Computing Systems and Applications, 1994. Proceedings., Workshop on. IEEE, 85–90.
[20] Matthew Turk. 2014. Multimodal interaction: A review. Pattern Recognition Letters 36 (2014), 189–195.
[21] ujjwalkarn.me. 2016. An Intuitive Explanation of Convolutional Neural Networks. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[22] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. 2017. Learning from synthetic humans. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). IEEE, 4627–4635.
[23] Thomas Waltemate, Dominik Gall, Daniel Roth, Mario Botsch, and Marc Erich Latoschik. 2018. The Impact of Avatar Personalization and Immersion on Virtual Body Ownership, Presence, and Emotional Response. IEEE Transactions on Visualization and Computer Graphics 24, 4 (2018), 1643–1652.
[24] Daniel Wigdor and Dennis Wixon. 2011. Brave NUI world: designing natural user interfaces for touch and gesture. Elsevier.
[25] Huiyu Zhou and Huosheng Hu. 2008. Human motion tracking for rehabilitation—A survey. Biomedical Signal Processing and Control 3, 1 (2008), 1–18.
[26] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. 2017. Towards 3d human pose estimation in the wild: a weakly-supervised approach. In Proceedings of the IEEE International Conference on Computer Vision. 398–407.
Contact Persons at the University of Würzburg
Dr. Martin Fischbach (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
martin.fischbach@uni-wuerzburg.de