Tascón Morales, Sergio (2024). Spatial Awareness and Logic for Robust Visual Question Answering. (Thesis). Universität Bern, Bern
Text: 24tasconmorales_s.pdf - Thesis (67MB). Available under License Creative Commons: Attribution (CC-BY 4.0).
Abstract
In recent years, deep learning models have become an integral part of the daily lives of millions, extending their influence into specialized domains such as medicine. The integration of vision and language capabilities has notably facilitated smoother interactions between users and models. Questions and answers have long served not only as a means of interaction with machines but also as a test of their level of intelligence. In particular, questions about visual content, the subject of Visual Question Answering (VQA), provide a mechanism to probe a model’s visual understanding. This matters considerably in the medical domain, given the crucial role that trust plays in the adoption of these systems by medical professionals. However, the opaque nature of most models hinders the assessment of true visual understanding, concealing potential shortcuts and biases. Crucial aspects of reasoning, such as compositionality and consistency, are at times overlooked in favor of high overall performance. In line with this perspective, this work introduces several contributions in the areas of localized questions and consistency for VQA.

The first part of the thesis explores questions about specific image regions and proposes two distinct methods. The first employs a localized attention mechanism that integrates information about the target region through a binary mask. Localized attention allows the network to consider the contextual cues necessary for answering the question and then focus on the region specified by the user. The second extends the concept of localized questions to Multimodal Large Language Models (MLLMs) by introducing targeted visual prompting. Here, a customized visual prompt is formulated that comprises the isolated region and its contextual representation within the image.

The second part of the thesis focuses on avoiding contradictions by enhancing consistency, again through two methods. The first categorizes questions as perception or reasoning questions and uses a loss function term to penalize inconsistencies during training. The second proposes a broader interpretation of consistency in VQA based on logical relations and introduces an auxiliary method for predicting these relations. Like the first, this approach employs a loss term to enforce more consistent behavior during training.
Item Type: | Thesis |
---|---|
Dissertation Type: | Cumulative |
Date of Defense: | 15 May 2024 |
Subjects: | 000 Computer science, knowledge & systems / 100 Philosophy > 160 Logic / 500 Science > 570 Life sciences; biology / 600 Technology > 610 Medicine & health |
Institute / Center: | 04 Faculty of Medicine |
Depositing User: | Sarah Stalder |
Date Deposited: | 30 Aug 2024 14:43 |
Last Modified: | 30 Aug 2024 22:30 |
URI: | https://boristheses.unibe.ch/id/eprint/5389 |