Title
Self-Supervised Representation Learning for Visual Scene Understanding
Author
Kawashti, Yomna Ahmed Ibrahim Shehata.
Preparation Committee
Researcher / Yomna Ahmed Ibrahim Shehata Kawashti
Supervisor / Mostafa Mahmoud Aref
Supervisor / Dina Reda Khattab
Publication Date
2023.
Number of Pages
130 p.
Language
English
Degree
Master's
Specialization
Computer Science (miscellaneous)
Approval Date
1/1/2023
Place of Approval
Ain Shams University - Faculty of Computer and Information Sciences - Department of Computer Science
Contents
Only 14 pages out of 130 are available for public view.

Abstract

Great advancements have been achieved in the field of computer vision over the last decade, mainly attributable to the rapid evolution of deep learning research. Despite these milestones, the deep learning framework has a significant drawback: its reliance on hand-crafted labels to achieve competitive results. This reliance has been called the bottleneck of deep learning; any given model, no matter how efficient, is only as powerful as the data available to train it. Self-supervised learning is a recently active area of research in computer vision that aims to solve this annotation problem. The self-supervised framework consists of a proxy task and a downstream task. The proxy task (usually referred to as the self-supervised task) is not the target; rather, it is a manufactured task whose purpose is to teach the deep learning backbone to generate meaningful feature representations from a large unlabeled dataset. The term self-supervision refers to the fact that the supervision comes from "pseudo" labels that are automatically derived from the data itself. This knowledge is then transferred to the downstream task, which is the target task.
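To make the pseudo-label idea concrete, the following is a minimal PyTorch sketch of one classic pretext task, rotation prediction, where the pseudo labels are simply the rotation angles applied to each unlabeled image. This particular task is an illustrative example rather than the pretext task used in this thesis, and the backbone choice here is arbitrary.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def make_rotation_batch(images: torch.Tensor):
    """images: (B, C, H, W). Returns rotated copies and pseudo labels 0..3."""
    rotations, labels = [], []
    for k in range(4):  # 0, 90, 180, 270 degrees
        rotations.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotations), torch.cat(labels)

backbone = models.resnet18(weights=None)              # any CNN backbone
backbone.fc = nn.Linear(backbone.fc.in_features, 4)   # 4 pseudo classes
criterion = nn.CrossEntropyLoss()

unlabeled = torch.randn(8, 3, 224, 224)               # stand-in for an unlabeled batch
x, pseudo_y = make_rotation_batch(unlabeled)
loss = criterion(backbone(x), pseudo_y)               # supervision without human labels
loss.backward()
```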
In this thesis, a comprehensive overview of the self-supervised learning domain is provided, with a focus on the evolution of proxy tasks from standard self-supervised learning to contrastive learning and masked image modeling. Our main proposed framework, based on mix-based augmentations, is then discussed. Additionally, two preliminary studies are presented that investigate the effects of changing several factors in the self-supervised framework, such as the pretraining dataset type and size and the backbone architecture. All our experiments were evaluated on a multilabel classification downstream task, since it requires a higher level of scene understanding than single-label classification.
In our first preliminary study, we used solving a jigsaw puzzle as the pretext task while experimenting with various CNN backbone architectures. Features were extracted from different convolutional blocks throughout each architecture and used to evaluate downstream performance on multilabel classification with the PascalVOC dataset. The second block of AlexNet achieved the highest transfer performance, at 36.17% mean average precision. This is a rank reversal relative to similar comparative studies, in which ResNet50 achieved the highest downstream performance. Additionally, features extracted from earlier layers in the network transferred better than those from later layers. We attributed this behavior to the smaller size of our pretraining dataset (Tiny ImageNet) compared to the commonly used ImageNet. This first study showed that, when working with a downscaled pretraining dataset, features extracted from shallower architectures and earlier layers have the highest transferability.
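The evaluation protocol described here can be sketched as a frozen-feature linear probe. The snippet below is a simplified illustration only; the layer cut-point, the random stand-in data, and the hyperparameters are placeholders, not the thesis's exact setup. Features from an intermediate AlexNet block are pooled and a linear multilabel classifier is scored with mean average precision.

```python
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.metrics import average_precision_score

alexnet = models.alexnet(weights=None)     # would hold jigsaw-pretrained weights
block2 = alexnet.features[:6]              # through the second conv block (illustrative cut)
block2.eval()

@torch.no_grad()
def extract(images):
    feats = block2(images)                              # frozen backbone features
    return torch.flatten(nn.AdaptiveAvgPool2d(1)(feats), 1)

images = torch.randn(16, 3, 224, 224)                   # stand-in for PascalVOC crops
targets = torch.randint(0, 2, (16, 20)).float()         # 20 PascalVOC classes, multilabel

probe = nn.Linear(extract(images).size(1), 20)          # linear probe on frozen features
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(10):                                     # tiny training loop for illustration
    logits = probe(extract(images))
    loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

scores = torch.sigmoid(probe(extract(images))).detach().numpy()
mAP = average_precision_score(targets.numpy(), scores, average="macro")
```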
In our second preliminary study, we examined the effect of extending features learned from single- versus multi-object-centric datasets to complex downstream tasks. Our experiments show that a self-supervised model pretrained on a multi-object-centric dataset such as COCO has higher transferability than one pretrained on a single-object-centric dataset such as ImageNet100. Both pretrained models were evaluated on multilabel classification using an autonomous driving dataset (CityScapes) in addition to PascalVOC. The COCO-pretrained model achieved a performance gain of approximately +1% in both downstream cases.
Moreover, we proposed a self-supervised framework that uses mix-based augmentations to enrich the images observed by the CNN backbone during the self-supervised task. Our hypothesis is that mix-based augmentations can improve downstream performance on multi-object-centric tasks such as multilabel classification. The results of both preliminary studies guided some of the design choices in the main proposed framework, such as the CNN architecture used (i.e., a shallower architecture). Our experiments using an UnMix-based modification of SimSiam achieved a performance gain of +3.7% mean average precision. The proposed model performed well on datasets of different natures (single-object-centric as well as multi-object-centric). Our experiments also showed that replacing the original images with mix-based augmented images degrades performance; in contrast, following UnMix's strategy of adding the mixed images as auxiliaries to the original ones, rather than as replacements, achieved the highest results.
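As a rough illustration of the auxiliary-mixing idea, here is a simplified PyTorch sketch of a SimSiam-style objective in which mixup images contribute an extra loss term alongside, not instead of, the original views. The encoder, predictor, mixing scheme, and loss weighting are stand-ins and should not be read as the thesis's exact implementation.

```python
import torch
import torch.nn.functional as F

def neg_cosine(p, z):
    # SimSiam-style negative cosine similarity with stop-gradient on the target
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def mixup(x, alpha=1.0):
    # convex combination of each image with a permuted batch partner
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], perm, lam

def simsiam_unmix_loss(encoder, predictor, view1, view2):
    z1, z2 = encoder(view1), encoder(view2)
    p1, p2 = predictor(z1), predictor(z2)
    base = 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))   # original SimSiam term

    mixed, perm, lam = mixup(view1)                          # auxiliary mixed view
    pm = predictor(encoder(mixed))
    # the mixed embedding should agree with both constituents, weighted by lam
    aux = lam * neg_cosine(pm, z1) + (1 - lam) * neg_cosine(pm, z1[perm])
    return base + aux                                        # auxiliary term added, not substituted

# tiny stand-in modules just to show the loss runs end to end
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
predictor = torch.nn.Linear(128, 128)
v1, v2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
simsiam_unmix_loss(encoder, predictor, v1, v2).backward()
```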