Helping computer vision and language models understand what they see