We present FSD (From Seeing to Doing) with: Embodied-FSD Model: We develop FSD, a novel vision-language model that generates intermediate representations through spatial relationship reasoning, ...
Abstract: The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results