Markerless Visual Head-Tracking System
This project is already completed.

Motivation
Human body pose, position, and motion estimation, often referred to as motion capture, is a vast field of research and a crucial technique for many applications in human-computer interaction. Many types of applications along the reality-virtuality continuum (Milgram, Takemura, Utsumi, & Kishino, 1995) rely on rendering virtual content from the perspective of the user and thus require the pose and position of the user's head. Examples include CAVEs (Cruz-Neira, Sandin, & DeFanti, 1993), virtual mirrors (Latoschik, Lugrin, & Roth, 2016), Head-Mounted Display (HMD) setups, see-through Fish Tank VR (Ware, Arthur, & Booth, 1993), and autostereoscopic displays (Dodgson, 2005). Moreover, certain interaction paradigms implicitly require information about the user's position and posture, for instance natural user interfaces (Wigdor & Wixon, 2011), proxemic interaction (Ballendat, Marquardt, & Greenberg, 2010; Greenberg, Marquardt, Ballendat, Diaz-Marino, & Wang, 2011), and context-aware interaction (Schilit, Adams, & Want, 1994).
Over the last decades, many different approaches to capturing human motion have been developed and evaluated (Moeslund, Hilton, & Krüger, 2006; Sigal, Balan, & Black, 2010). However, these motion capture systems often require markers attached to the body of the tracked subject. Many applications would benefit from a markerless solution for estimating a human's pose and position. Accordingly, several markerless solutions have been developed in recent years (Elhayek et al., 2017; Mehta et al., 2017; Sigal et al., 2010; Stoll, Hasler, Gall, Seidel, & Theobalt, 2011), but they are often expensive, inaccurate, have high latency, or are not yet publicly available. RGB-D camera-based approaches (Shotton et al., 2013) produce good results but potentially interfere with other infrared-based techniques used in the tracking space or suffer from the distance restrictions of RGB-D cameras.
Goal
In this project, we aim to design a markerless, machine-learning-based outside-in tracking system that estimates the position of a user's head within a tracking space from RGB camera images. Our neural network will combine Convolutional Neural Network (CNN) layers (Krizhevsky, Sutskever, & Hinton, 2012) to extract relevant features with LSTM cells (Hochreiter & Schmidhuber, 1997) to fuse the data of multiple images. To obtain a large amount of training data for the network, we will use a self-built rendering pipeline that exports images of an avatar in a virtual tracking space together with the corresponding ground truth, similar to Liu, Liang, Wang, Li, and Pei (2016). The rendering pipeline will provide highly varied training data by modifying the appearance of the avatar and the environment. We will explore different image pre-processing techniques to improve the quality of the training data and of the tracking.
The expected outcome of this project is a standalone proof-of-concept head-tracking prototype that takes the input of at least two unsynchronized RGB cameras and predicts the approximate position of a person within a tracking space. In addition, we expect to identify recommendations for evaluation and further improvements.
Related Work
Pose estimation
Predicting the pose of human body parts with neural networks is a common approach in recent studies. Particularly interesting for our project is which kind of images are used to train the network. Some approaches make use of depth images (RGB-D) (Gupta, Arbeláez, Girshick, & Malik, 2015; Shotton et al., 2013), others use standard RGB images (Ababsa, Tran, & Charbit, 2017; Liu, Liang, Wang, Li, & Pei, 2016; Toshev & Szegedy, 2014; Vatahska, Bennewitz, & Behnke, 2007). Ultimately, they all achieve at least state-of-the-art accuracy. In addition, the work of Murphy-Chutorian and Trivedi (2009) gives a good overview of different approaches.
The approach of Liu et al. (2016) is similar to what we have planned for this project and is worth a closer look. In this work, they estimate the head pose of a human in front of an RGB camera. They treat this as a regression problem and train a Convolutional Neural Network (CNN) to detect head features that give insight into the rotation (pose) of the head. The network's output therefore comprises three values: pitch, yaw, and roll. The input is a 96×96-pixel RGB image that is passed through three convolutional layers and three max pooling layers. With an error threshold of 10 degrees, they achieve an accuracy of 79.3%. Another interesting concept is presented by Su, Qi, Li, and Guibas (2015). They implement a viewpoint estimation based on a CNN that is trained with 2D images. The orientation is defined as a 3-tuple consisting of the azimuth angle, the elevation angle, and the in-plane rotation angle. In this case, the network acts as a classifier: each output angle is discretized into 360 bins.
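To make this kind of regression network more concrete, the following minimal TensorFlow/Keras sketch mirrors the rough structure described above (a 96×96 RGB input, three convolution/pooling stages, and three regression outputs). The filter counts and the dense layer are our own illustrative assumptions, not the configuration reported by Liu et al. (2016).

```python
import tensorflow as tf

# Minimal sketch of a CNN that regresses pitch, yaw, and roll from a
# 96x96 RGB image. Filter counts and the dense layer size are
# illustrative assumptions, not the settings of Liu et al. (2016).
def build_head_pose_regressor():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(96, 96, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(3),  # pitch, yaw, roll
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```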
Position estimation
The field of face recognition has made great progress with the help of deep neural network approaches (Li, Lin, Shen, Brandt, & Hua, 2015; Parkhi, Vedaldi, & Zisserman, 2015; Sun, Liang, Wang, & Tang, 2015; Taigman, Yang, Ranzato, & Wolf, 2014). Still, most of these works either decide whether a face is present in an image or yield a bounding box around the face. Moreover, there are interesting approaches that estimate facial key points or landmarks in an image (Liang, Ding, & Lin, 2015; Sun, Wang, & Tang, 2013; Wu, Hassner, Kim, Medioni, & Natarajan, 2017). They all use a deep CNN to estimate the 2D coordinates of five facial landmarks (two for the eyes, one for the nose, two for the mouth) in a picture, which comes close to what we want to achieve. The most basic one is probably the work of Sun, Wang, and Tang (2013). They present a three-level network cascade architecture with different networks for different facial points. The first level is responsible for high-level feature detection and a coarse localization of the key points. The second and third levels work more precisely: they fine-tune the position estimates of the levels above. With this CNN approach, they were able to significantly improve the prediction accuracy over state-of-the-art methods.
The mentioned works all have in common that they only detect a position in an image, hence a two-dimensional point. Fortunately, there are other approaches that involve the third spatial dimension.
One of the best known is the use of depth cameras. Pixels in a depth camera store depth information in addition to mere color information. The Microsoft Kinect, for example, uses an infrared projector that emits a grid of infrared points; an infrared camera measures the offset between the dot pattern that was emitted and the actual pattern in the room (Zhang, 2012). This information can be used, for example, to predict the 3D positions of body joints (Shotton et al., 2011).
Furthermore, there are geometric approaches that calculate depth from more than one image of the same scene. Mühlmann, Maier, Hesser, and Männer (2002) search for corresponding pixels in two images; with the help of a calculated disparity map, the depth of the pixels can be determined. A newer approach is presented by Krutikova, Sisojevs, and Kovalovs (2017). By finding similar points in stereo images, they can calibrate the cameras. In the next step, they also calculate a disparity map using the previously found similarities. Naturally, the two images need to overlap. Finally, they can create a 3D model of objects in the images. Worth mentioning is also the work of Hirschmuller (2008), which was adopted by OpenCV as its standard algorithm for stereo image matching. Tippetts, Lee, Lillywhite, and Archibald (2013) give a good overview of various approaches for creating a disparity map from stereo images.
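As a concrete illustration of this family of methods, the short Python sketch below computes a disparity map for a rectified stereo pair with OpenCV's semi-global matching implementation, which builds on Hirschmuller (2008). The file names and matcher parameters are placeholder assumptions.

```python
import cv2

# Minimal sketch: disparity from a rectified stereo pair using OpenCV's
# semi-global block matcher. File names and parameters are placeholders.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,  # must be a multiple of 16
    blockSize=5,
)
# compute() returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype("float32") / 16.0
```

Given a calibrated camera pair, such a disparity map can then be converted into metric depth.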
There are a number of publications that deduce the depth of points in an image using only a single RGB image as input for a neural network (Eigen & Fergus, 2015; Eigen, Puhrsch, & Fergus, 2014; Laina, Rupprecht, Belagiannis, Tombari, & Navab, 2016; Wang et al., 2015). Laina et al. (2016), for example, use a fully convolutional architecture with an input size of 304 × 228 pixels and a depth map of size 160 × 128 as output. With this approach, they need 55 ms to produce one depth map, which yields approximately 18 maps per second. Noteworthy is the approach of Godard, Mac Aodha, and Brostow (2017), as they pursue an unsupervised approach with strong performance. Instead of training with labeled depth data, they train a CNN on binocular stereo footage. They report a duration of 35 ms (28 fps) for one depth map.
Synthesis of training data
Many of the previously mentioned publications already produce rendered data in order to train their networks (Ababsa et al., 2017; Gupta et al., 2015; Liu et al., 2016; Su et al., 2015). They show that training with synthetic data can yield a network that makes valid predictions for natural input. Rozantsev, Lepetit, and Fua (2015) propose techniques that aim to produce synthetic images that are similar to real ones. For example, they apply Gaussian blurring along object boundaries, add random noise to pixels, or introduce material variety by changing the weight of diffuse reflection. Another interesting paper for data acquisition is by Varol et al. (2017). They describe the creation of a training dataset with synthetically generated images of people rendered from 3D scenes. To this end, they created a pipeline that randomizes different render variables, for example the model, shape, pose, and texture of the humanoid avatar or the camera and the lighting of the scene.
Concept
As described before, the field of head pose estimation is already well researched. Therefore, within this project, we will focus on predicting the user's position rather than on head pose estimation. In the following, we present our concept for a head-tracking application based on deep neural networks.
When harnessing the power of deep learning, two main aspects need to be addressed: first, an appropriate dataset has to be created for the task, and second, a neural network architecture suited to the problem has to be found. As the creation of an appropriate dataset often becomes a time-consuming and sophisticated task, we will approach this problem by creating a synthetic dataset from 3D-rendered humans. The neural network will subsequently be trained on the synthetically generated images and their corresponding ground truth. Afterwards, we will test the trained network with real images taken by RGB cameras. A visualization of our planned approach can be found in Figure 1.

Figure 1. Visualization of the planned pipeline for one network
Synthetic Training Data Generation
Generating an appropriate dataset of considerable size is often the most time-consuming aspect of a deep learning approach. Especially in the domain of image processing, a large amount of intra-class variability exists in the data; factors such as lighting, misalignment, non-rigid deformation, occlusion, and corruption create this variability (Chan et al., 2015). Creating a dataset of considerable size and variability can therefore be a challenging and time-consuming task. To tackle this issue, we will automatically generate a dataset of synthetic images to train our neural network. 3D game engines such as Unity or Unreal Engine can be used to create human avatars with different appearance and physiognomy. These engines not only allow us to create a varied set of different-looking persons; other parameters such as illumination and environment can also be changed easily within the engine. As the game engine also holds all relevant position and head-pose variables, it is easy to construct a varied, large, and labeled dataset of RGB images of different avatars in human poses.
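To make the labeling concrete, the sketch below shows one possible data-exchange format between the Unity pipeline and the training code: each sample references the rendered images of both virtual cameras and stores the ground-truth head position in the tracking-space coordinate system. The field names and the JSON-lines layout are our own assumptions, not a fixed specification.

```python
import json

# Hypothetical per-sample record exported by the Unity pipeline;
# field names and layout are illustrative assumptions.
sample = {
    "id": 42,
    "images": ["cam0/000042.png", "cam1/000042.png"],
    "head_position": [0.35, 1.62, -0.80],  # x, y, z in meters (tracking space)
    "head_rotation": [2.0, -15.0, 0.5],    # pitch, yaw, roll in degrees
}

def load_labels(path):
    """Load a list of such records from a JSON-lines file."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```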
To further increase the variability of the dataset and bridge the domains of synthetically generated images and real-world photographs, additional preprocessing steps will be performed on the generated images. These steps may include techniques such as blurring or random noise and may be adopted from Rozantsev, Lepetit, and Fua (2015).
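A minimal sketch of such a preprocessing step is shown below, assuming OpenCV and NumPy; the kernel size and noise level are illustrative values that would have to be tuned.

```python
import cv2
import numpy as np

def degrade(image, blur_ksize=5, noise_sigma=8.0):
    """Blur a rendered image and add Gaussian pixel noise so it looks
    less 'clean' than raw synthetic output. The parameter values are
    illustrative assumptions, not tuned settings."""
    blurred = cv2.GaussianBlur(image, (blur_ksize, blur_ksize), 0)
    noisy = blurred.astype(np.float32) + np.random.normal(0.0, noise_sigma, blurred.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```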
Rendering Variations or Features
Many variations are possible, and we need to decide which ones to consider. With regard to the environment in which the data is produced, we vary a few render parameters. This is important, as the network should be able to estimate head position/orientation in different setups. We want to cover the following settings:
- Indoor room: The images are taken in a generic room with no special furniture.
- Background: The background is rendered with different materials. It is important to have a certain variability here, as the network should not learn any background-dependent features.
- Lighting: In reality, light quality always differs. We account for this by manipulating the intensity of the ambient light; this way, the light that falls on the avatar also changes.
- Shadows: Shadows also change depending on the room setup and the lighting. Therefore, the ceiling light in our room is placed at random positions so that the avatar casts different shadows throughout the pictures.
- Blurring: Synthetically generated images have sharper edges than real ones. Moreover, real images often contain a certain amount of noise. To compensate for both, we apply a Gaussian blur to the images.
Since we aim to predict the position (and orientation) of the avatar's head, we also need scripts that vary the position of the avatar and the rotation of its head across images. This way, the ground truth of our data changes continuously, which is vital for the training process.
Another aspect we want to cover is the variability of the human, or rather of the avatar. We therefore change the look of the avatar throughout the image production, with differences at least in the avatar's biometrics, facial expression, age, gender, race, and style. To realize the avatar generation, we use the UMA tool from Unity's Asset Store, which provides an easy way to change the parameters of an avatar's appearance.
Network Architecture
The second important part of a deep learning application is finding an appropriate architecture for the problem. Deep learning has attracted a lot of attention in the field of computer vision. Krizhevsky, Sutskever, and Hinton (2012) set new standards when they won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 using deep Convolutional Neural Networks. This milestone has encouraged a large body of work that successfully applies CNNs to image processing in machine learning tasks, and CNNs have proven to be among the best architectures for most computer vision tasks. Based on these successes, we will construct a neural network architecture based on CNNs.
Estimating depth is easy for humans because we have two eyes; with one eye closed, it is much more difficult to make a precise prediction of the distance of objects. We expect the same phenomenon to hold for an artificial neural network. Therefore, we will train and test our network with two images taken at the same time. Long Short-Term Memory (LSTM) cells are used in neural networks to process sequences as input (Hochreiter & Schmidhuber, 1997); they are designed to process sequential data and to remember and forget information from earlier timesteps. The input to our network consists of two images taken at the same time, so they do not form a temporal sequence, but they can be regarded as a spatial one. Just as with a temporal sequence, the network needs to carry the relevant information from the first image over to the second, which is exactly what LSTM cells do. So, even though we do not have a temporal sequence of images, we will use LSTM cells to process our input sequence.
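The sketch below illustrates this idea in TensorFlow/Keras: the same convolutional feature extractor is applied to both camera images via TimeDistributed, an LSTM fuses the two feature vectors, and a final dense layer regresses the 3D head position. All layer sizes and the input resolution are illustrative assumptions, not the final architecture.

```python
import tensorflow as tf

# Minimal sketch of the planned CNN + LSTM combination; layer sizes and
# the 128x128 input resolution are illustrative assumptions.
def build_position_network(image_size=128):
    cnn = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(image_size, image_size, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),  # compact per-image feature vector
    ])
    # The "sequence" consists of two images, one per camera, taken at the same time.
    inputs = tf.keras.Input(shape=(2, image_size, image_size, 3))
    features = tf.keras.layers.TimeDistributed(cnn)(inputs)  # per-image CNN features
    fused = tf.keras.layers.LSTM(128)(features)              # fuse both views
    position = tf.keras.layers.Dense(3)(fused)               # x, y, z head position
    model = tf.keras.Model(inputs, position)
    model.compile(optimizer="adam", loss="mse")
    return model
```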
After training our neural networks for the tasks of head detection and head pose estimation, we will analyze their performance. For this purpose, a labeled dataset of images of people in different situations has to be created. Afterwards, this dataset will be split into two sets of equal size: a validation set and a test set. The validation set is used to tune the parameters of the neural network. For our final test, we will use the test set to estimate metrics such as the error rate or accuracy of our networks. A visualization of the network architecture can be found in Figure 1.
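As an example of the kind of metric we have in mind, the following sketch computes the mean Euclidean position error and the fraction of predictions that fall within a hypothetical distance threshold; both the threshold and the array layout are assumptions for illustration.

```python
import numpy as np

def position_metrics(predicted, ground_truth, threshold_m=0.10):
    """Mean Euclidean error and accuracy within a hypothetical 10 cm
    threshold; inputs are (N, 3) arrays of 3D positions in meters."""
    errors = np.linalg.norm(predicted - ground_truth, axis=1)
    return errors.mean(), float((errors <= threshold_m).mean())
```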
Evaluation
After training the neural network, we want to know whether it performs well and therefore evaluate it on the trained tasks. Evaluating neural networks is usually done in a two-step process. In the first step, we use a validation dataset that the network has never seen during training to check its performance. This shows how well the network fits new data and which parameters may be adapted to obtain a better fit. In the second step, we use a test set that was also never seen by the network during training. This dataset is used to actually test the performance of the network with all the adaptations made after validation.
In the beginning, it will be easy to follow this process, as we can simply split the synthetically generated dataset into three parts: a large part for training and two smaller parts for validation and testing. Later in the project, evaluating the model on synthetic images alone will no longer be sufficient. Therefore, we will generate our own test dataset with real images to approximate an evaluation of the trained model. As creating a precise test set with synchronized camera data may go beyond the scope of this project, we will create a simplified dataset using cameras that are not exactly synchronized. The cameras will be positioned similarly to the cameras used within the game engine. To still get an appropriate dataset, the person visible in the images will be asked not to move for a second so that we obtain pseudo-synchronized image pairs. The ground truth for this test dataset will be measured. Of course, this procedure cannot be considered exact, but it will give an impression of whether the model is able to bridge the gap between synthetic and real data.
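A minimal sketch of such a split is given below, assuming the synthetic samples are available as a list of label records; the 80/10/10 ratio is an illustrative assumption.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split synthetic samples into training, validation,
    and test sets; the split ratios are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```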
Approach
Requirements
- The head tracking system should use RGB images to determine the position of the head.
- The head tracking system should use single images from at least two cameras.
- The head tracking system should be based on a neural network.
- The head tracking application should be based on Python, OpenCV, PyQt and TensorFlow.
- The neural network should be based on a combination of CNN layers and LSTM cells.
- The head tracking system should be trained with synthetically generated data.
- The synthetically generated data should have a certain variability (avatar and environment).
- Image processing methods (e.g., blurring) should be tested to improve the predictions.
- The synthetic data should be generated using Unity.
- The system should be validated at least with synthetic data.
Tasks
- Literature research
- Creation of the exposé
- Familiarization with used techniques
- Definition of data exchange format
- Establishment of the basic environment for synthetic data creation
- Implementation of a method for the dynamic creation of avatars
- Implementation of a method to capture and export images from Unity
- Implementation of a method for exporting ground truth data from Unity
- Implementation of a method to randomly vary the avatars
- Implementation of a method to randomly vary the environment
- Establishment of the basic neural network
- Implementation of a method to import ground truth data into head tracking application
- Implementation of a method to import training images into head tracking application
- Implementation of batch data generator to feed neural network
- Training, validation and optimization of basic neural network
- Test of image (pre-)processing methods (e.g., blurring, noise)
- Refinement of the synthetic data production pipeline
- Refinement of neural network architecture
- Validation of system accuracy with synthetic images
- Creation of a system to acquire (unsynchronized) camera images and predict the head position
- Definition of delay between cameras
- Validation of the approximate system accuracy with (unsynchronized) camera images
- Expo preparation and presentation of the project
- Write up project results
Time Schedule
The project has a duration of 16 weeks; it began on 11.10.2018 and ends on 14.02.2019. The exact chronological planning can be seen in Figure 2.

Figure 2. Time schedule of the project
Future Work
As this work has to be seen as a basic proof of concept, there are multiple limitations which suggest future work:
- Evaluation of the accuracy of the system by the use of real-world images and real-world ground truth.
- Data generation using photorealistic avatars.
- Motion capturing using image sequences instead of single images to take advantage of motion parallax.
- Synchronization of the images from the used cameras to minimize the discrepancy between pictures.
- Addition of another neural network to estimate the pose of the head.
References
Ababsa, F., Tran, N.-T., & Charbit, M. (2017). Challenging 3D Head Tracking and Evaluation Using Unconstrained Test Data Set. Paper presented at the Information Visualisation (IV), 2017 21st International Conference.
Ballendat, T., Marquardt, N., & Greenberg, S. (2010). Proxemic interaction: designing for a proximity and orientation-aware environment. Paper presented at the ACM International Conference on Interactive Tabletops and Surfaces.
Chan, T.-H., Jia, K., Gao, S., Lu, J., Zeng, Z., & Ma, Y. (2015). PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12), 5017-5032.
Cruz-Neira, C., Sandin, D. J., & DeFanti, T. A. (1993). Surround-screen projection-based virtual reality: the design and implementation of the CAVE. Paper presented at the Proceedings of the 20th annual conference on Computer graphics and interactive techniques.
Dodgson, N. A. (2005). Autostereoscopic 3D displays. Computer(8), 31-36.
Eigen, D., & Fergus, R. (2015). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. Paper presented at the Advances in neural information processing systems.
Elhayek, A., de Aguiar, E., Jain, A., Thompson, J., Pishchulin, L., Andriluka, M., . . . Theobalt, C. (2017). MARCOnI—ConvNet-based MARker-less motion capture in outdoor and indoor scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3), 501-514.
Godard, C., Mac Aodha, O., & Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. Paper presented at the CVPR.
Greenberg, S., Marquardt, N., Ballendat, T., Diaz-Marino, R., & Wang, M. (2011). Proxemic interactions: The new ubicomp? Interactions, 18(1), 42-50.
Gupta, S., Arbeláez, P., Girshick, R., & Malik, J. (2015). Inferring 3d object pose in RGB-D images. arXiv.
Hirschmuller, H. (2008). Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 328-341.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Paper presented at the Advances in neural information processing systems.
Krutikova, O., Sisojevs, A., & Kovalovs, M. (2017). Creation of a Depth Map from Stereo Images of Faces for 3D Model Reconstruction. Procedia Computer Science, 104, 452-459.
Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. Paper presented at the 3D Vision (3DV), 2016 Fourth International Conference on.
Latoschik, M. E., Lugrin, J.-L., & Roth, D. (2016). FakeMi: a fake mirror system for avatar embodiment studies. Paper presented at the Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology.
Li, H., Lin, Z., Shen, X., Brandt, J., & Hua, G. (2015). A convolutional neural network cascade for face detection. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Liang, Z., Ding, S., & Lin, L. (2015). Unconstrained facial landmark localization with backbone-branches fully-convolutional networks. arXiv.
Liu, X., Liang, W., Wang, Y., Li, S., & Pei, M. (2016). 3D head pose estimation with convolutional neural network trained on synthetic images. Paper presented at the Image Processing (ICIP), 2016 IEEE International Conference on.
Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H.-P., . . . Theobalt, C. (2017). VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 36(4), 44.
Milgram, P., Takemura, H., Utsumi, A., & Kishino, F. (1995). Augmented reality: A class of displays on the reality-virtuality continuum. Paper presented at the Telemanipulator and telepresence technologies.
Moeslund, T. B., Hilton, A., & Krüger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2-3), 90-126.
Murphy-Chutorian, E., & Trivedi, M. M. (2009). Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 607-626.
Mühlmann, K., Maier, D., Hesser, J., & Männer, R. (2002). Calculating dense disparity maps from color stereo images, an efficient implementation. International Journal of Computer Vision, 47(1-3), 79-88.
Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. Paper presented at the BMVC.
Rozantsev, A., Lepetit, V., & Fua, P. (2015). On rendering synthetic images for training an object detector. Computer Vision and Image Understanding, 137, 24-37.
Schilit, B., Adams, N., & Want, R. (1994). Context-aware computing applications. Paper presented at the Mobile Computing Systems and Applications, 1994. Proceedings., Workshop on.
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., . . . Blake, A. (2011). Real-time human pose recognition in parts from single depth images. Paper presented at the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., . . . Moore, R. (2013). Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1), 116-124.
Sigal, L., Balan, A. O., & Black, M. J. (2010). HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1-2), 4.
Stoll, C., Hasler, N., Gall, J., Seidel, H.-P., & Theobalt, C. (2011). Fast articulated motion tracking using a sums of gaussians body model. Paper presented at the Computer Vision (ICCV), 2011 IEEE International Conference on.
Su, H., Qi, C. R., Li, Y., & Guibas, L. J. (2015). Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
Sun, Y., Liang, D., Wang, X., & Tang, X. (2015). Deepid3: Face recognition with very deep neural networks. arXiv.
Sun, Y., Wang, X., & Tang, X. (2013). Deep convolutional network cascade for facial point detection. Paper presented at the 2013 IEEE Conference on Computer Vision and Pattern Recognition.
Taigman, Y., Yang, M., Ranzato, M. A., & Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. Paper presented at the Proceedings of the IEEE conference on computer vision and pattern recognition.
Tippetts, B., Lee, D. J., Lillywhite, K., & Archibald, J. (2013). Review of stereo vision algorithms and their suitability for resource-limited systems. Journal of Real-Time Image Processing, 11(1), 5-25.
Toshev, A., & Szegedy, C. (2014). Deeppose: Human pose estimation via deep neural networks. Paper presented at the Proceedings of the IEEE conference on computer vision and pattern recognition.
Turk, M. (2014). Multimodal interaction: A review. Pattern Recognition Letters, 36, 189-195.
Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., & Schmid, C. (2017). Learning from synthetic humans. Paper presented at the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017).
Vatahska, T., Bennewitz, M., & Behnke, S. (2007). Feature-based head pose estimation from images. Paper presented at the Humanoid Robots, 2007 7th IEEE-RAS International Conference on.
Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., & Yuille, A. L. (2015). Towards unified depth and semantic prediction from a single image. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Ware, C., Arthur, K., & Booth, K. S. (1993). Fish tank virtual reality. Paper presented at the Proceedings of the INTERACT’93 and CHI’93 conference on Human factors in computing systems.
Wigdor, D., & Wixon, D. (2011). Brave NUI world: designing natural user interfaces for touch and gesture. Elsevier.
Wu, Y., Hassner, T., Kim, K., Medioni, G., & Natarajan, P. (2017). Facial landmark detection with tweaked convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Zhang, Z. (2012). Microsoft Kinect sensor and its effect. IEEE MultiMedia, 19(2), 4-10.
Contact Persons at the University Würzburg
Dr. Martin Fischbach (Primary Contact Person), Chair of Human-Computer Interaction, University of Würzburg
martin.fischbach@uni-wuerzburg.de