WebUIBench

Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

¹Institute of Artificial Intelligence (TeleAI), China Telecom, ²Northwestern Polytechnical University, ³Beijing Jiaotong University, ⁴Nanjing University

Evaluation taxonomy (left) and task examples (right) of WebUIBench.

Abstract

The emergence of Large Language Models (LLMs) has rapidly reshaped the landscape of software engineering. AI code generation is evolving from assisting developers to independently completing the entire development lifecycle (i.e., the AI software engineer). With the rapid advancement of generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development.

Since a model requires a confluence of multidimensional sub-capabilities to address the challenges of the various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding improvements in development efficiency. However, existing benchmarks typically focus solely on webpage-generation outcomes and fail to assess these sub-capabilities.

In this work, we draw inspiration from the principles of software engineering and propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. This paper has been accepted to ACL 2025 (Findings).

Dataset Overview and Download

WebUIBench consists of 5 categories of websites commonly visited by users: enterprise portals, background management systems, personal blogs, news sites, and e-commerce platforms.

For webpage data collection, our dataset consists of 719 full webpages and 2,488 webpage slices from 5 categories, covering a variety of resolution modes. We open-source the screenshots (.png files), source HTML code (.html files), and element information (.json files) for these webpages. Based on this, WebUIBench includes 21,793 question-answer pairs, with an average of 10.68 question-answer pairs per webpage screenshot.
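Since each webpage is released as three sibling artifacts (screenshot, HTML source, and element-info JSON), a natural first step is to pair them up by filename stem. The sketch below assumes all three files for a webpage share one stem inside a single directory; the actual release layout may differ, so adjust the paths accordingly.

```python
from pathlib import Path

def collect_webpage_records(root: str) -> dict:
    """Pair each webpage's screenshot (.png), source code (.html), and
    element information (.json) by their shared filename stem.

    NOTE: assumes a flat directory where the three files for one
    webpage share the same stem (e.g. page001.png / .html / .json).
    """
    records: dict[str, dict[str, Path]] = {}
    for path in Path(root).iterdir():
        if path.suffix.lower() not in {".png", ".html", ".json"}:
            continue
        records.setdefault(path.stem, {})[path.suffix.lstrip(".")] = path
    # Keep only webpages for which all three artifacts are present.
    return {stem: files for stem, files in records.items()
            if {"png", "html", "json"} <= files.keys()}
```

This guards against partially downloaded webpages by dropping any stem that is missing one of its three files.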

Key statistics (left) and question-answer distribution (right) of WebUIBench.

Download the dataset from: [🤗Huggingface] or [BaiduNetDisk]

🏆 Leaderboard and Evaluation Guideline


WebUI Perception: EC=Element Classification, AP=Attribute Perception, VG=Visual Grounding; HTML Programming: CEC=Code Error Correcting, CFE=Code Function Editing; WebUI-HTML Understanding: WHM=WebUI-HTML Matching, WHR=WebUI-HTML Retrieval; W2C=WebUI-to-Code.

Submission Instructions

Please follow the requirements below to submit your evaluation results:

  • File Format: Must be a .json file.
  • Sample Template: Download and refer to the example format here: Download Sample File
  • How to Submit: Send your .json file as an email attachment to zyllin@bjtu.edu.cn
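Before emailing a results file, it is worth running a quick structural sanity check. The snippet below is only a sketch: the field names `question_id` and `prediction` are hypothetical placeholders, and the authoritative schema is whatever the official sample file shows.

```python
import json

def validate_submission(path: str) -> list:
    """Return a list of structural problems found in a results .json file.

    NOTE: the expected schema here (a list of objects with hypothetical
    "question_id" and "prediction" fields) is an assumption -- always
    defer to the official sample file for the real format.
    """
    with open(path, encoding="utf-8") as f:
        try:
            data = json.load(f)
        except json.JSONDecodeError as exc:
            return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top level should be a list of answer records"]
    errors = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            errors.append(f"entry {i}: expected an object")
            continue
        for field in ("question_id", "prediction"):
            if field not in item:
                errors.append(f"entry {i}: missing '{field}'")
    return errors
```

An empty return list means the file at least parses and matches the assumed shape; it does not guarantee the answers themselves are scoreable.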

Self-Evaluation

If you prefer to run the evaluation on your own, we provide reference code and a Docker image.

BibTeX

@inproceedings{lin2025webuibench,
  title={WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code},
  author={Zhiyu Lin and Zhengda Zhou and Zhiyuan Zhao and Tianrui Wan and Yilun Ma and Junyu Gao and XueLong Li},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}