Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

Error distribution for different models on the VisText and Pew subsets of the proposed Chocolate dataset. Error rates are computed per sentence: an error rate of 0.4 indicates that 40% of the sentences in the generated captions contain that type of error. Because a single caption may contain multiple types of errors, a stacked bar can exceed 1.0. Even the most advanced LVLM evaluated, GPT-4V, generates captions with a high rate of factual errors.
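The per-sentence computation described in this caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the annotation list is made up, and the error-type names are borrowed from the Value, Label, and Trend Errors mentioned elsewhere on this page.

```python
from collections import Counter

# Hypothetical annotations: one set of error types per caption sentence.
# An empty set means the sentence is consistent with the chart.
sentence_errors = [
    {"value", "label", "trend"},
    {"value", "label", "trend"},
    {"label"},
    {"trend"},
    set(),
]

def error_rates(annotations):
    """Return, for each error type, the fraction of sentences that contain it."""
    counts = Counter(err for errs in annotations for err in errs)
    total = len(annotations)
    return {err: n / total for err, n in counts.items()}

rates = error_rates(sentence_errors)

# Summing the per-type rates gives the height of the stacked bar.
stacked_height = sum(rates.values())
print(rates, stacked_height)
```

Here two of five sentences contain a value error, so that rate is 0.4, while the stacked height exceeds 1.0 because the same sentence counts toward several error types.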

GPT-4V often cannot understand charts without value labels (i.e., it cannot align data points to axes). We prompted GPT-4V to caption two charts created with the Seaborn library from an underlying table sampled from the Chart-to-Text dataset, one with the bar values labeled and one without. When the value labels are absent from the chart, GPT-4V is more prone to producing non-factual captions.

Abstract

Recent advancements in Large Vision-language Models (LVLMs) have led to significant progress in generating natural language descriptions for visual content and thus enhancing various applications. One issue with these powerful models is that they sometimes produce texts that are factually inconsistent with the visual input. While there has been some effort to mitigate such inconsistencies in natural image captioning, the factuality of generated captions for structured document images, such as charts, has not received as much scrutiny, posing a potential threat to information reliability in critical applications.

This work delves into the factuality aspect by introducing a comprehensive typology of factual errors in generated chart captions. A large-scale human annotation effort provides insight into the error patterns and frequencies in captions crafted by various chart captioning models, ultimately forming the foundation of a novel dataset, Chocolate. The Chocolate dataset is split into three subsets based on the model that produces the caption: Lvlm (Large Vision-Language Model), Llm (Large Language Model), and Ft (Fine-tuned Model).

Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies. In response to this challenge, we establish the new task of Chart Caption Factual Error Correction and introduce ChartVE, a model for visual entailment that outperforms proprietary and open-source LVLMs in evaluating factual consistency. Furthermore, we propose C2TFec, an interpretable two-stage framework that excels at correcting factual errors. This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.

Leaderboard on Factual Inconsistency Detection in Chart Captioning

Kendall's Tau on the Chocolate dataset.

Model	Method	Chocolate-Lvlm	Chocolate-Llm	Chocolate-Ft
SummaC	Reference-based Text-only	-0.011	0.023	0.036
QAFactEval	Reference-based Text-only	0.064	0.045	0.054
LLaVA-1.5-13B	Large Vision-language Model	0.002	0.057	0.214
ChartLlama	Large Vision-language Model	0.010	0.065	0.141
ChartAssistant-S	Large Vision-language Model	0.015	0.020	0.036
Bard	Large Vision-language Model	-0.014	0.105	0.291
Gemini 1.5 Pro	Large Vision-language Model	0.034	0.060	0.175
GPT-4V	Large Vision-language Model	0.157	0.205	0.215
GPT-4o	Large Vision-language Model	0.250	0.244	0.305
DePlot + GPT-4	Tool-augmented Large Language Model	0.129	0.117	0.109
ChartVE	Small Vision-language Model	0.178	0.091	0.215
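For readers unfamiliar with the metric, the sketch below shows one way a Kendall's Tau correlation between a metric's scores and human factuality judgments could be computed. It assumes the tau-b variant, which handles ties; the page does not state which variant was used, and the labels and scores are made up for illustration.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: counted in neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical data: 1 = caption judged factual by annotators, 0 = not factual.
human_labels = [1, 0, 1, 0, 1]
# Hypothetical scores from a factual-consistency metric for the same captions.
metric_scores = [0.9, 0.2, 0.7, 0.4, 0.3]

tau = kendall_tau_b(human_labels, metric_scores)
print(round(tau, 3))
```

A value near 1 means the metric ranks captions the way annotators do, a value near 0 means the ranking is close to uncorrelated with human judgment, and a negative value means the metric tends to rank them in the opposite order.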

Overview

Chocolate is a benchmark for detecting and correcting factual inconsistencies in generated chart captions. It consists of captions produced by six advanced models, grouped into three subsets:

Lvlm: GPT-4V, Bard (before Gemini)
Llm (LLM-based pipeline): DePlot + GPT-4
Ft: ChartT5, MatCha, UniChart

The charts come from two datasets: VisText and the Pew split of Chart-to-Text. In total, Chocolate contains 1,187 examples. Each instance consists of a caption generated by one of the models and annotations of the factual errors in each caption sentence.
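One way to picture an instance is the sketch below. The class and field names are hypothetical and are not the released dataset schema; only the model-to-subset mapping and the per-sentence annotation idea come from the description above, and the example sentences are invented.

```python
from dataclasses import dataclass, field
from typing import List

# Subset membership of the six captioning models, as listed in the overview.
SUBSET_OF_MODEL = {
    "GPT-4V": "Lvlm",
    "Bard": "Lvlm",
    "DePlot + GPT-4": "Llm",
    "ChartT5": "Ft",
    "MatCha": "Ft",
    "UniChart": "Ft",
}

@dataclass
class AnnotatedSentence:
    text: str
    # Error types annotated for this sentence; an empty list means no error.
    error_types: List[str] = field(default_factory=list)

@dataclass
class CaptionInstance:
    chart_source: str  # "VisText" or "Pew"
    model: str
    sentences: List[AnnotatedSentence]

    @property
    def subset(self) -> str:
        return SUBSET_OF_MODEL[self.model]

    @property
    def is_factual(self) -> bool:
        # Assumption: a caption is factual only if every sentence is error-free.
        return all(not s.error_types for s in self.sentences)

# Invented example, for illustration only.
example = CaptionInstance(
    chart_source="Pew",
    model="GPT-4V",
    sentences=[
        AnnotatedSentence("The chart compares two groups."),
        AnnotatedSentence("Group A reaches 70%.", ["value"]),
    ],
)
print(example.subset, example.is_factual)
```

Deriving a caption-level label from sentence-level annotations this way is a design choice of the sketch; other aggregation rules are possible.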

The dataset is available for download on Hugging Face Datasets.

Key statistics of Chocolate.

Evaluation and Qualitative Analysis

Human evaluation results on subsets of the Chocolate dataset, comparing C2TFec and GPT-4V. C2TFec corrects significantly more errors than GPT-4V, especially Value, Label, and Trend Errors.

An example showing how decomposing the visual reasoning process into image-to-structure rendering and text-based reasoning allows C2TFec to accurately rectify errors in chart captions. Text marked in red indicates non-factual information units in the caption, whereas text marked in blue represents information units faithful to the chart. In this instance, C2TFec successfully corrects all Value and Label Errors present in the original caption. Conversely, GPT-4V fails to identify the factual inconsistencies and merely reorders the entities in the caption.
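The two-stage decomposition described in this caption can be sketched as a function that first renders the chart into a table and then rewrites the caption using text only. This is an illustrative sketch, not the C2TFec implementation: the function names are hypothetical, and both stages are replaced by toy stand-ins (a fixed made-up table and a regex-based rewrite) so the example runs on its own.

```python
import re
from typing import Callable, Dict

Table = Dict[str, float]

def correct_caption(
    chart_image: bytes,
    caption: str,
    chart_to_table: Callable[[bytes], Table],
    rectify: Callable[[Table, str], str],
) -> str:
    """Stage 1 renders the chart as a table; stage 2 rewrites the caption against it."""
    table = chart_to_table(chart_image)
    return rectify(table, caption)

def toy_chart_to_table(_image: bytes) -> Table:
    # Stand-in for an image-to-table model; returns a fixed, made-up table.
    return {"Apples": 30.0, "Pears": 45.0}

def toy_rectify(table: Table, caption: str) -> str:
    # Stand-in for the text-based reasoning step: overwrite the number that
    # follows "<label> is" with the table value for that label.
    for label, value in table.items():
        pattern = rf"({re.escape(label)} is )\d+(?:\.\d+)?"
        caption = re.sub(
            pattern, lambda m, v=value: f"{m.group(1)}{v:g}", caption
        )
    return caption

fixed = correct_caption(
    b"", "Apples is 35 and Pears is 45.", toy_chart_to_table, toy_rectify
)
print(fixed)
```

Keeping the intermediate table explicit is what makes such a pipeline inspectable: a wrong correction can be traced either to the extracted table or to the rewriting step.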

An example showing that GPT-4V cannot accurately extract tables from charts. This indicates its inability to infer the actual value of each data point within the chart.

BibTeX

@inproceedings{huang-etal-2024-lvlms,
    title = "Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning",
    author = "Huang, Kung-Hsiang  and
      Zhou, Mingyang and
      Chan, Hou Pong  and
      Fung, Yi R. and
      Wang, Zhenhailong and
      Zhang, Lingyu and
      Chang, Shih-Fu and
      Ji, Heng",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.85",
    doi = "10.18653/v1/2023.findings-acl.85",
    pages = "1314--1326",
}
