Human-Computer Interaction

A performance comparison of CNNs trained on synthetically generated or recorded data for gaze estimation.


This project is already assigned.

Introduction

Eye gaze tracking is a useful practice in multiple fields of HCI besides research [1, 2], such as personal security [13] or gaming [12]. Where researchers formerly had to rely on multiple devices and dedicated hardware for gaze tracking [3], more modern approaches use Deep Learning and Convolutional Neural Networks (CNNs) [4] to reduce the required hardware to a single RGB camera. According to Chen et al. [16], the reasons for this shift are manifold. Compared with the formerly more common technique of estimating gaze by illuminating the eye region with infrared (IR) light and tracking the reflection on the corneal surface, the RGB camera-based approach has several advantages. First, it works even in broad daylight, whereas IR-based tracking can be disturbed by sunlight. Second, IR-based setups require the relative position between the IR lights and the camera to be calibrated carefully. Third, because the pupil and the glint are very small, a high-resolution camera is usually needed. IR-based gaze estimation is therefore limited to indoor applications and well-equipped workstations, whereas single RGB camera-based gaze estimation promises to work outdoors and on commonly available hardware.

To train CNNs for gaze estimation, datasets of RGB images are needed. According to a review by Wang et al. [6], several prerecorded datasets dedicated to gaze estimation exist, among them UT-Multiview, MPIIGaze and EYEDIAP; this list is not complete. The UT-Multiview dataset was recorded with eight cameras at once, each focusing on one of the 50 participants holding still. The 64,000 images of this set were then used to train a random regression forest that learns the mapping from eye region to gaze direction [7]. The MPIIGaze dataset consists of 213,659 images recorded on the laptops of its 15 participants: they were shown visual cues on their screens, had to look at them, and confirmed each fixation by pressing the space bar [8]. The EYEDIAP dataset consists of videos of 16 participants performing different tasks, for example looking at a ball in front of a computer screen or at a target on the screen [9].

The downside of learning algorithms for complex tasks such as eye gaze tracking is the need for large datasets, since small amounts of data lead to overfitting [5]. The datasets above have proven sufficient in size, and the related studies [7, 8, 9] successfully trained algorithms on them. They do, however, have shortcomings. The UT-Multiview dataset, for example, consists of images showing only the faces of the participants; if the task also includes detecting the face, as in an in-the-wild approach [8], this dataset is not sufficient, because the face fills the entire image and no detection is required. Regarding the MPIIGaze dataset, the precision of the labels is questionable: the participants looked at cues and pressed the space bar, and there was no way to control for the fixational eye movements that occur naturally while a human fixates an object [17]. In addition, the participants were probably recruited within the sphere of influence of the related research institute, so a certain phenotypic similarity among them has to be expected. If we want to train a neural network that can estimate the eye gaze of as many humans as possible, a dataset with as diverse a pool of participants as possible is needed.
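The gaze labels in such datasets are typically stored as 3D direction vectors in a camera coordinate system, while appearance-based CNNs usually regress a 2D (pitch, yaw) representation. A minimal sketch of this conversion, assuming the MPIIGaze-style coordinate convention (other datasets may use different conventions), could look as follows:

```python
import numpy as np

def gaze_vector_to_angles(g):
    """Convert a 3D gaze direction vector to (pitch, yaw) angles in radians.

    Assumes an MPIIGaze-style camera coordinate convention in which the gaze
    vector points from the eye towards the target; other datasets may differ.
    """
    g = np.asarray(g, dtype=float)
    g = g / np.linalg.norm(g)          # normalize to a unit vector
    pitch = np.arcsin(-g[1])           # vertical angle
    yaw = np.arctan2(-g[0], -g[2])     # horizontal angle
    return pitch, yaw

# Example: a gaze vector pointing straight at the camera
print(gaze_vector_to_angles([0.0, 0.0, -1.0]))  # ~ (0.0, 0.0)
```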
Such a pool of participants would have to vary in height, build, age, ethnicity, clothing and pose, among other parameters. Finding enough participants to capture the maximum of visual variation in humans would be a resource-intensive and lengthy process; human-independent methods of data generation should therefore be preferred. Varol et al. [10] demonstrated that CNNs trained on synthetically generated data can achieve plausible results in the realm of pose estimation. Likewise, the synthetic data generator PeopleSansPeople, a Unity project built on the Unity Perception package [14, 15], was used in its release study to generate annotated RGB images and corresponding labels for its human assets. These images were used to pre-train a Detectron2 Keypoint R-CNN variant, which then outperformed a network pre-trained on the ImageNet database [11]. Consequently, neural network training can benefit from synthetic data. It remains to be shown whether these results transfer to eye gaze estimation.
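To use such synthetic output for training, the generated images and labels have to be assembled into a dataset. The following is only an illustrative sketch under the assumption of a simplified, hypothetical annotation file; the actual files written by PeopleSansPeople / Unity Perception follow their own capture schema and would need a dedicated parser.

```python
import json
from pathlib import Path

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class SyntheticGazeDataset(Dataset):
    """Illustrative wrapper for synthetically generated gaze data.

    Assumes a hypothetical annotations.json of the form
    [{"image": "rgb_0001.png", "gaze": [x, y, z]}, ...]; the real output of
    PeopleSansPeople / Unity Perception is structured differently.
    """

    def __init__(self, root):
        self.root = Path(root)
        with open(self.root / "annotations.json") as f:
            self.records = json.load(f)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        img = np.asarray(Image.open(self.root / rec["image"]).convert("L"),
                         dtype=np.float32) / 255.0
        gaze = np.asarray(rec["gaze"], dtype=np.float32)
        return img, gaze
```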

Goal

In this work the data generator PeopleSansPeople will be adapted and used to create a dataset of images of the 3D human assets included in PeopleSansPeople, for training a CNN that should be capable of eye gaze estimation. This network will be compared with a second CNN trained on an already existing dataset, to validate the assumption that synthetic data can yield plausible results, as in the work of Varol et al. [10]. The objectives are:

Milestones

Planned activities.

Approach

The in-the-wild approach of Zhang et al. [8] will be reproduced because it is close to a real-world application scenario involving only one RGB camera. Therefore the related MPIIGaze dataset will be used to train the second network, to which the first network will be compared.
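For this comparison, a common evaluation measure in appearance-based gaze estimation is the mean angular error between predicted and ground-truth gaze directions. A minimal sketch of this metric, assuming both are available as 3D direction vectors (2D pitch/yaw predictions would first be converted back to vectors):

```python
import numpy as np

def mean_angular_error(pred, true):
    """Mean angle in degrees between predicted and ground-truth gaze vectors.

    pred, true: arrays of shape (N, 3); each row is a gaze direction vector.
    """
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * true, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

# Example: identical directions yield an error of 0 degrees
print(mean_angular_error(np.array([[0.0, 0.0, -1.0]]),
                         np.array([[0.0, 0.0, -1.0]])))
```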

Network structure used by Zhang et al.
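As a rough orientation, a PyTorch sketch of such a structure could look like the following: a small LeNet-style backbone on a grey-scale eye image with the 2D head pose concatenated before the regression output, in the spirit of Zhang et al. [8]. The layer sizes here are assumptions for illustration, not the exact published configuration.

```python
import torch
import torch.nn as nn


class GazeNet(nn.Module):
    """Minimal sketch of an appearance-based gaze CNN (layer sizes assumed)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5),   # eye image: 1 x 36 x 60
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(50 * 6 * 12, 500), nn.ReLU())
        self.out = nn.Linear(500 + 2, 2)       # + 2D head pose -> 2D gaze angles

    def forward(self, eye_image, head_pose):
        x = self.fc(self.features(eye_image))
        return self.out(torch.cat([x, head_pose], dim=1))


# Example forward pass with a batch of 8 normalized eye images and head poses
net = GazeNet()
gaze = net(torch.randn(8, 1, 36, 60), torch.randn(8, 2))
print(gaze.shape)  # torch.Size([8, 2])
```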

REFERENCES

[1] Katsini, C., Abdrabou, Y., Raptis, G. E., Khamis, M., & Alt, F. (2020). The role of eye gaze in security and privacy applications: Survey and future HCI research directions. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1-21).
[2] Corcoran, P. M., Nanu, F., Petrescu, S., & Bigioi, P. (2012). Real-time eye gaze tracking for gaming design and consumer electronics systems. IEEE Transactions on Consumer Electronics, 58(2), 347-355.
[3] Qi, Y., Wang, Z. L., & Huang, Y. (2007). A non-contact eye-gaze tracking system for human computer interaction. In 2007 International Conference on Wavelet Analysis and Pattern Recognition (Vol. 1, pp. 68-72). IEEE.
[4] Wang, Z., Chai, J., & Xia, S. (2019). Realtime and accurate 3D eye gaze capture with DCNN-based iris and pupil segmentation. IEEE Transactions on Visualization and Computer Graphics, 27(1), 190-203.
[5] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
[6] Wang, X., Zhang, J., Zhang, H., Zhao, S., & Liu, H. (2021). Vision-based gaze estimation: A review. IEEE Transactions on Cognitive and Developmental Systems.
[7] Sugano, Y., Matsushita, Y., & Sato, Y. (2014). Learning-by-synthesis for appearance-based 3D gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1821-1828).
[8] Zhang, X., Sugano, Y., Fritz, M., & Bulling, A. (2015). Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4511-4520).
[9] Funes Mora, K. A., Monay, F., & Odobez, J. M. (2014). EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications (pp. 255-258).
[10] Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., & Schmid, C. (2017). Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[11] Ebadi, S. E., Jhang, Y. C., Zook, A., Dhakad, S., Crespi, A., Parisi, P., … & Ganguly, S. (2021). PeopleSansPeople: A synthetic data generator for human-centric computer vision. arXiv preprint arXiv:2112.09290.
[15] Borkman, S., Crespi, A., Dhakad, S., Ganguly, S., Hogins, J., Jhang, Y. C., … & Yadav, N. (2021). Unity Perception: Generate synthetic data for computer vision. arXiv preprint arXiv:2107.04259.
[16] Chen, J., & Ji, Q. (2008). 3D gaze estimation with a single camera without IR illumination. In 2008 19th International Conference on Pattern Recognition (pp. 1-4). IEEE.
[17] Martinez-Conde, S. (2006). Fixational eye movements in normal and pathological vision. Progress in Brain Research, 154, 151-176.

WEBLINKS

[12] https://gaming.tobii.com/games/
[13] https://tech.tobii.com/products/aware/
[14] https://github.com/Unity-Technologies/com.unity.perception


Contact Persons at the University of Würzburg

Marvin Thäns (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
marvin.thaens@uni-wuerzburg.de
