Sameni, Sepehr (2024). Efficient Self-Supervised Visual Representation Learning via Sparsity. (Thesis). Universität Bern, Bern
24sameni_s.pdf - Thesis (19MB). Available under License Creative Commons: Attribution-Noncommercial (CC-BY-NC 4.0).
Abstract
Large collections of labeled data have greatly improved the performance of Deep Neural Networks in computer vision tasks. However, the vast majority of visual data generated daily remains unlabeled, limiting the potential of supervised learning paradigms. This thesis explores novel techniques to guide deep models toward learning generalizable visual patterns without human supervision, with a particular focus on leveraging sparsity as a key principle. Our primary tool in this endeavor is the design of Self-Supervised Learning (SSL) tasks that do not require manual labeling. Beyond enabling learning from vast amounts of unlabeled data, we demonstrate how sparsity-based self-supervision can capture relevant patterns often overlooked by traditional supervised approaches. We design learning tasks that extract rich representations from various visual modalities: shape information from images, temporal dynamics from videos, and multimodal understanding from vision-language data. A common thread running through our work is the strategic application of sparsity. In contrastive learning, we show how token sparsity can enhance both computational efficiency and representation quality. For video analysis, we leverage spatio-temporal sparsity to enable efficient and scalable representation learning. In generative tasks, we demonstrate how sparse conditioning can tackle complex problems like video prediction while implicitly modeling world dynamics. Notably, our task designs follow a unifying principle: the recognition and manipulation of sparse patterns in data. The strong performance of the learned representations on downstream vision tasks such as image classification, video understanding, and multimodal reasoning validates this approach. By consistently demonstrating that a thoughtful application of sparsity can not only reduce computational demands but also improve the quality and generalizability of learned representations, this work lays a foundation for more efficient, scalable, and effective visual understanding systems. Our contributions pave the way for artificial systems with visual perception and reasoning capabilities that can better leverage the vast amounts of unlabeled visual data surrounding us.
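To make the abstract's claim about token sparsity in contrastive learning concrete, the following is a minimal illustrative sketch only, not the thesis's actual method: it randomly keeps a subset of patch tokens before a transformer encoder, so a contrastive objective runs on fewer tokens and attention cost drops roughly with the keep ratio. All class and variable names (e.g. `SparseTokenEncoder`, `keep_ratio`) are hypothetical.

```python
# Hedged sketch of token sparsity for cheaper contrastive pre-training
# (illustrative assumption, not the method described in the thesis).
import torch
import torch.nn as nn

class SparseTokenEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4, keep_ratio=0.25):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.keep_ratio = keep_ratio

    def forward(self, tokens):  # tokens: (batch, num_tokens, dim)
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        # Sample a random subset of token indices per image; the encoder then
        # processes k tokens instead of n, reducing attention cost accordingly.
        idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :k]
        kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        return self.encoder(kept).mean(dim=1)  # pooled representation

# Usage: encode two augmented views with sparse tokens, then feed the pooled
# features to any contrastive objective (e.g. InfoNCE).
if __name__ == "__main__":
    enc = SparseTokenEncoder()
    view1 = torch.randn(8, 196, 256)  # e.g. 14x14 patch tokens per image
    view2 = torch.randn(8, 196, 256)
    z1, z2 = enc(view1), enc(view2)
    print(z1.shape, z2.shape)  # (8, 256) each
```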
Item Type: | Thesis |
---|---|
Dissertation Type: | Cumulative |
Date of Defense: | 6 September 2024 |
Subjects: | 000 Computer science, knowledge & systems; 500 Science > 510 Mathematics; 600 Technology > 620 Engineering |
Institute / Center: | 08 Faculty of Science > Institute of Computer Science (INF) |
Depositing User: | Hammer Igor |
Date Deposited: | 22 Oct 2024 14:02 |
Last Modified: | 22 Oct 2024 14:02 |
URI: | https://boristheses.unibe.ch/id/eprint/5523 |