Deep Visual Domain Adaptation: From Synthetic Data to the Real World
Abstract
In the field of computer vision, the increasing use of convolutional neural networks (CNNs) fuels the need for ever more labeled training data. Synthetic data generated with computer graphics offers an alternative approach for fast acquisition of training data. However, synthetic data suffers from dataset bias, causing models trained on it to underperform "in the wild". In this master thesis, a survey comparing state-of-the-art domain adaptation techniques for CNNs in visual applications was conducted, accompanied by a brief look at how computer graphics can aid CNNs when real-world data is scarce. The survey concludes that many techniques are available for classification architectures, and that the same principles used in classification can be used to extend architectures for other visual applications. To add to this research, several classical domain adaptation techniques consisting of different types of fine-tuning were attempted on the CNN architecture \textit{Mask R-CNN}. The task was to predict salmon masks/silhouettes in photographs from real fish farms, with pre-training on synthetic images of salmon in a virtual fish-farming environment as a prerequisite. The synthetically pre-trained model achieved 55.8\% mAP on synthetic images but only 9.4\% mAP on real images, showing that a dataset bias was present. To adapt to the real world, the pre-trained model was given only 19 real-world fine-tuning examples, making this a few-shot domain adaptation problem. The fine-tuning techniques attempted were \textit{regular fine-tuning} and \textit{gradually opening up layers for fine-tuning from a frozen state, starting from the deepest layers}. For both techniques, it was further attempted to extend the small real-world dataset by data augmentation.
Real-world performance increased to 27.5\% mAP after regular fine-tuning, 28.5\% mAP after gradually opening up layers during fine-tuning, and 41.9\% mAP and 36.5\% mAP respectively with data augmentation. The regularly fine-tuned model achieved 56.8\% mAP on synthetic images, showing that domain-invariant features were learned. We argue that this must be due to a close overlap between the distributions of computer-graphics and real-world images in the CNN solution space. Furthermore, the results also show that data augmentation can be used as a supplement for extra performance on real-world images, although a dataset bias towards the real world may then become apparent. As for gradually opening up layers, the results suggest that this can help preserve cross-domain performance, but that overtraining at each stage can reduce performance in the wild; more research is needed to support this claim. Lastly, an unsupervised DA technique was attempted using a GAN trained for style transfer to synthesize hybrid images (with labels taken from the source domain). This failed, likely because the GAN was not designed for instance segmentation. This master thesis concludes that synthetic data used in a CNN will underperform compared to real-world data, but that domain adaptation techniques can boost performance considerably, making synthetic data a good alternative to manually labeling training data.
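The gradual-unfreezing strategy described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration (not the thesis code): a toy three-stage model stands in for the Mask R-CNN backbone/neck/head, all parameters start frozen, and stages are unfrozen deepest-first with a fresh optimizer over the currently trainable parameters at each stage.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained network: "backbone" -> "neck" -> "head".
# In the thesis setting this would be a synthetically pre-trained Mask R-CNN.
model = nn.Sequential(
    nn.Linear(8, 16),   # "backbone" (shallowest)
    nn.Linear(16, 16),  # "neck"
    nn.Linear(16, 2),   # "head" (deepest)
)

# Start from a fully frozen state.
for p in model.parameters():
    p.requires_grad = False

# Gradually open up layers for fine-tuning, starting from the deepest stage.
for stage in (model[2], model[1], model[0]):
    for p in stage.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
    # ... fine-tune for a few epochs on the small real-world set here,
    # stopping early to avoid the per-stage overtraining noted above ...
```

The stage boundaries and learning rate here are illustrative assumptions; the key idea is that each stage adds its parameters to the trainable set before the next round of few-shot fine-tuning.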