Bachelor’s thesis
I wrote my bachelor’s thesis, “Evaluating multi-stream networks for self-supervised representation learning”, at the Computer Vision & Remote Sensing Lab at TU Berlin.
A short formulation of the research question could sound something like this:
Does a neural network that learns to compress videos in a self-supervised manner improve when provided with extra visual modalities such as depth or segmentation?
Abstract
This thesis tries to answer the question: Can visual representation-learning neural networks profit from adding privileged information, such as semantic segmentation, instance segmentation, optical flow, or depth estimation, to the RGB frames? A video-based sequence-to-sequence model was implemented and trained to predict future video frames. A dataset generator built on the CARLA simulator was created and used to produce a ten-hour, multi-modal video dataset, on which the proposed model was trained in a self-supervised fashion. The results suggest that the models can indeed profit from some privileged information. However, further research is necessary, as this work shows that the results depend strongly on the model's architecture. This work also shows that weighted binary cross-entropy (WBCE) is a better loss function than mean squared error (MSE) for evaluating binary image similarities. Furthermore, Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) should not be used as metrics for this task.
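To illustrate why WBCE can behave better than MSE on binary images, here is a minimal sketch, assuming a PyTorch-style implementation and a simple inverse-frequency weighting of the foreground class; this is only an illustration of the general idea, not the exact formulation used in the thesis.

```python
import torch
import torch.nn.functional as F

def weighted_bce(pred, target, eps=1e-7):
    """Weighted binary cross-entropy for binary images.

    Assumption: foreground pixels are up-weighted by the inverse of their
    frequency in the target, so a sparse foreground is not drowned out the
    way it can be with plain MSE.
    """
    pred = pred.clamp(eps, 1.0 - eps)
    pos_frac = target.mean().clamp(min=eps)      # fraction of foreground pixels
    pos_weight = (1.0 - pos_frac) / pos_frac     # up-weight the rare class
    loss = -(pos_weight * target * torch.log(pred)
             + (1.0 - target) * torch.log(1.0 - pred))
    return loss.mean()

# Toy comparison: a mostly empty binary frame (e.g. one segmentation channel).
target = torch.zeros(1, 1, 64, 64)
target[..., 30:34, 30:34] = 1.0                  # small foreground patch
pred_blank = torch.full_like(target, 0.01)       # predicts "all background"

print("MSE :", F.mse_loss(pred_blank, target).item())    # tiny, looks "good"
print("WBCE:", weighted_bce(pred_blank, target).item())  # large, penalizes the miss
```

In this toy case the all-background prediction gets an almost negligible MSE, while the weighted BCE stays large because the missed foreground pixels dominate the loss.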