WebUIBench

Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

¹Institute of Artificial Intelligence (TeleAI), China Telecom, ²Northwestern Polytechnical University, ³Beijing Jiaotong University, ⁴Nanjing University

Evaluation taxonomy (left) and task examples (right) of WebUIBench.

Abstract

The emergence of Large Language Models (LLMs) has rapidly reshaped the landscape of software engineering. AI code generation is evolving from assisting developers to independently completing the entire development lifecycle (i.e., the AI software engineer). With the rapid advancement of generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development.

Since a model requires a confluence of multidimensional sub-capabilities to address the challenges of the various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding improvements in development efficiency. However, existing benchmarks typically focus solely on webpage-generation outcomes and fail to assess these sub-capabilities.

In this work, we draw inspiration from the principles of software engineering and propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. This paper has been accepted to ACL 2025 (Findings).

Dataset Overview and Download

WebUIBench consists of 5 categories of websites commonly visited by users: enterprise portals, background management systems, personal blogs, news sites, and e-commerce platforms.

For webpage data collection, our dataset consists of 719 full webpages and 2,488 webpage slices from 5 categories, covering a variety of resolution modes. We open-source the screenshots (.png files), source HTML code (.html files), and element information (.json files) for these webpages. Based on this, WebUIBench includes 21,793 question-answer pairs, with an average of 10.68 question-answer pairs per webpage screenshot.
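Since each webpage is released as three sibling artifacts (screenshot, HTML source, and element-info JSON), a natural first step is to pair them up by filename stem. The sketch below assumes all three files for a webpage share one stem inside a single directory; the actual release layout may differ, so adjust the paths accordingly.

```python
from pathlib import Path

def collect_webpage_records(root: str) -> dict:
    """Pair each webpage's screenshot (.png), source code (.html), and
    element information (.json) by their shared filename stem.

    NOTE: assumes a flat directory where the three files for one
    webpage share the same stem (e.g. page001.png / .html / .json).
    """
    records: dict[str, dict[str, Path]] = {}
    for path in Path(root).iterdir():
        if path.suffix.lower() not in {".png", ".html", ".json"}:
            continue
        records.setdefault(path.stem, {})[path.suffix.lstrip(".")] = path
    # Keep only webpages for which all three artifacts are present.
    return {stem: files for stem, files in records.items()
            if {"png", "html", "json"} <= files.keys()}
```

This guards against partially downloaded webpages by dropping any stem that is missing one of its three files.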

Key statistics (left) and question-answer distribution (right) of WebUIBench.

Download the dataset from: [🤗Huggingface] or [BaiduNetDisk]

🏆 Leaderboard and Evaluation Guideline


WebUI Perception: EC=Element Classification, AP=Attribute Perception, VG=Visual Grounding; HTML Programming: CEC=Code Error Correcting, CFE=Code Function Editing; WebUI-HTML Understanding: WHM=WebUI-HTML Matching, WHR=WebUI-HTML Retrieval; W2C=WebUI-to-Code.

Submission Instructions

Please follow the requirements below to submit your evaluation results:

  • File Format: Must be a .json file.
  • Sample Template: Download and refer to the example format here: Download Sample File
  • How to Submit: Send your .json file as an email attachment to zyllin@bjtu.edu.cn
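Before emailing a results file, it is worth running a quick structural sanity check. The snippet below is only a sketch: the field names `question_id` and `prediction` are hypothetical placeholders, and the authoritative schema is whatever the official sample file shows.

```python
import json

def validate_submission(path: str) -> list:
    """Return a list of structural problems found in a results .json file.

    NOTE: the expected schema here (a list of objects with hypothetical
    "question_id" and "prediction" fields) is an assumption -- always
    defer to the official sample file for the real format.
    """
    with open(path, encoding="utf-8") as f:
        try:
            data = json.load(f)
        except json.JSONDecodeError as exc:
            return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top level should be a list of answer records"]
    errors = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            errors.append(f"entry {i}: expected an object")
            continue
        for field in ("question_id", "prediction"):
            if field not in item:
                errors.append(f"entry {i}: missing '{field}'")
    return errors
```

An empty return list means the file at least parses and matches the assumed shape; it does not guarantee the answers themselves are scoreable.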

Self-Evaluation

If you prefer to run the evaluation on your own, we provide reference code and a Docker image.

BibTeX

@inproceedings{lin2025webuibench,
  title={WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code},
  author={Zhiyu Lin and Zhengda Zhou and Zhiyuan Zhao and Tianrui Wan and Yilun Ma and Junyu Gao and XueLong Li},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}