We present FSD (From Seeing to Doing) with: Embodied-FSD Model: We develop FSD, a novel vision-language model that generates intermediate representations through spatial relationship reasoning, ...
Abstract: The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large ...