Davtyan, Aram (2024). Learning Object Interactions via Efficient and Controllable Generative Models. (Thesis). Universität Bern, Bern
Text: 24davtyan_a.pdf (Thesis, 163MB). Available under License Creative Commons: Attribution-Noncommercial (CC-BY-NC 4.0).
Abstract
Imagine a world where artificial agents can learn to understand and manipulate their environment simply by observing, much like a child exploring their surroundings. This thesis embarks on a journey towards this vision by addressing two fundamental challenges in machine learning: 1) developing efficient learning algorithms and 2) building controllable generative models that learn autonomously from visual data.

We first introduce novel optimization techniques to improve the efficiency of neural network training and inference. KOALA, a Kalman-filtering-inspired approach, enhances neural network optimization across various tasks. LOOM-CFM extends Conditional Flow Matching to improve the trade-off between sampling speed and quality in generative models. These innovations culminate in RIVER, a video prediction model that balances computational efficiency with high-quality, high-resolution output.

Building upon this, we explore controllable video generation as a key component in developing agents that can simulate and understand dynamic, interactive scenarios. Inspired by human cognition, our approach learns directly from real-world data without manual annotations, offering a more scalable and flexible solution than traditional simulation methods. We present a series of models for unsupervised, controllable video generation. GLASS introduces combined global and local action spaces for fine-grained control over object dynamics. MIRAGE enables unsupervised novel view synthesis that enhances our ability to manipulate object poses. YODA generates video sequences from sparse motion input, while CAGE, our most advanced model, integrates composition and animation for flexible video generation. These models, trained through passive observation of unannotated video datasets, demonstrate the ability to predict future outcomes from specific control inputs, generalize to unseen situations, and generate realistic, context-aware video sequences. By learning to simulate potential futures, they exhibit a form of visual reasoning that is crucial for intelligent agents.

Our work opens new avenues for unsupervised learning of complex dynamics. This thesis lays the groundwork for future research on foundation models for object dynamics and on artificial agents that can interpret and interact with the visual world in increasingly sophisticated ways.
Item Type: | Thesis
---|---
Dissertation Type: | Single
Date of Defense: | 19 December 2024
Subjects: | 000 Computer science, knowledge & systems; 500 Science > 510 Mathematics; 600 Technology > 620 Engineering
Institute / Center: | 08 Faculty of Science > Institute of Computer Science (INF)
Depositing User: | Hammer Igor
Date Deposited: | 11 Feb 2025 07:49
Last Modified: | 12 Feb 2025 07:26
URI: | https://boristheses.unibe.ch/id/eprint/5809