Less machine (=) More vision
Author: Amogh Gudi
Promotor(s): Prof. dr. ir. M.J.T. Reinders / Dr. J.C. van Gemert
University: Delft University of Technology
Year of publication: 2022
Link to repository: TU Delft Research Repository
Machines that interact with humans can do so better if they can also visually understand us, but they have limited resources to do so. The main topic of this dissertation is weighing the resources that machine vision systems consume against the accuracy they achieve. This thesis focuses on reducing the need for data, memory, and computation in real-world machine vision systems, applied to human observation and face analysis.
This dissertation tackles annotation effort by exploring how weakly-supervised object/person detectors can be improved. Findings show that prior knowledge about objects’ bounds in images helps the detector learn the spatial extent of objects using only weak image-level labels. The proposed implementation enables single-shot detection, thus improving the computational efficiency of this data-efficient method.
The thesis also demonstrates how prior knowledge about eye locations can be used to reduce the computational burden of gaze tracking: non-vital parts of the input image can be discarded without losing accuracy. Additionally, the thesis shows how geometrical relations known a priori can be exploited to project gaze onto a screen with little human annotation effort.
Findings of this dissertation further suggest that spatial structures in images can be exploited to improve the efficiency of vision tasks. The proposed solution allows detection of facial occlusions and anomalies to be learned from only a few examples. Results also indicate that this solution can be used as a loss function for unsupervised pre-training of neural networks when resources are constrained.
Lastly, this thesis showcases how prior know-how about blood-flow physiology in faces can be applied in a camera-based vital signs estimator. Even when data is available, this hand-crafted method performs better than deep learning methods, both in terms of accuracy and efficiency. At the same time, the results also reveal the pitfalls of assumptions made in the prior knowledge when exposed to more complex tasks, such as video compression noise filtering.
Through its common theme of incorporating prior knowledge, this dissertation draws attention to the costs incurred by machine vision systems to achieve high accuracy.