Michele Cafagna

NLP Scientist @Okra.ai


Leiden, The Netherlands

I’m Michele [miˈkɛːle], an NLP R&D Scientist at Okra.ai working on Medical NLP. My interests lie at the intersection of Computer Vision, Natural Language Processing, Cognitive Science, and XAI, with a focus on LLMs and VLMs.

I received my PhD from the Institute of Linguistics & Language Technology at the University of Malta, supervised by Prof. Albert Gatt and co-supervised by Prof. Kees van Deemter. I was a Marie Curie PhD Fellow and an Early Stage Researcher in the NL4XAI project.

During my PhD I was a visiting researcher at Utrecht University, where I focused on multimodal grounding, and I interned at Orange Innovation Labs in Lannion, France.

Before joining Okra.ai, I worked as a Machine Learning Research Scientist at Aptus.AI, a RegTech startup in Pisa. I earned my Master’s in Computer Science and AI at the University of Pisa, completing an NLG thesis as a visiting researcher at the Center for Language and Cognition (CLCG) at the University of Groningen.

I’ve also collaborated with the ItaliaNLP Lab at the Institute of Linguistics of the National Research Council (ILC-CNR) in Pisa.

news

May 15, 2024 My PhD thesis is online 🎉 : Visually Grounded Language Generation: Data, Models and Explanations beyond Descriptive Captions.
Mar 18, 2024 Joined Okra.ai as NLP Data Scientist, 🇳🇱 🎉
Jan 16, 2024 Our paper “ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models”, accepted @ICLR 2024, Vienna, 🇦🇹 🎉
Oct 10, 2023 Reviewer for COLING-LREC2024, Torino, 🇮🇹
Sep 11, 2023 Presented the poster “HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales” @INLG 2023, Prague, 🇨🇿
Sep 4, 2023 Our paper “Interpreting Vision and Language Generative Models with Semantic Visual Priors” published in the Frontiers in AI journal
Aug 6, 2023 Reviewer for 26th European Conference on Artificial Intelligence ECAI 2023, Kraków, 🇵🇱
Jul 13, 2023 Our paper “HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales”, accepted @INLG 2023, Prague, 🇨🇿 🎉
Jun 18, 2023 Reviewer for MMNLG2023 co-located with INLG 2023, Prague, 🇨🇿
Jun 14, 2023 Reviewer for EMNLP2023, Singapore, 🇸🇬

selected publications

  1. Interpreting vision and language generative models with semantic visual priors
    Michele Cafagna, Lina M. Rojas-Barahona, Kees van Deemter, and Albert Gatt
    Frontiers in Artificial Intelligence, 2023
  2. VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
    Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
  3. HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales
    Michele Cafagna, Kees van Deemter, and Albert Gatt
    In Proceedings of the 16th International Natural Language Generation Conference, Sep 2023
  4. Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
    Michele Cafagna, Kees van Deemter, and Albert Gatt
    In Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS), Dec 2022
  5. What Vision-Language Models ‘See’ when they See Scenes
    Michele Cafagna, Kees van Deemter, and Albert Gatt
    arXiv preprint arXiv:2109.07301, Dec 2021