UI-TARS — Pioneering Automated GUI Interaction with Native Agents
UI-TARS is a next-generation native GUI (Graphical User Interface) agent model designed to interact seamlessly with graphical interfaces using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components — perception, reasoning, grounding, and memory — within a single vision-language model (VLM). This enables end-to-end task automation without predefined workflows or manual rules, making it a powerful tool for automating interactions across desktop, mobile, and web environments.
https://github.com/bytedance/UI-TARS
Core Features
1. Perception
- Comprehensive GUI Understanding: UI-TARS processes multimodal inputs (text, images, interactions) to build a coherent understanding of interfaces.
- Real-Time Interaction: It continuously monitors dynamic GUIs and responds accurately to changes in real-time.
2. Action
- Unified Action Space: UI-TARS standardizes action definitions across platforms (desktop, mobile, and web).
- Platform-Specific Actions: It supports additional actions like hotkeys, long press, and platform-specific gestures.
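To make the idea of a unified action space concrete, here is a minimal Python sketch of a cross-platform action record and dispatcher. The class, field, and action names are illustrative assumptions for this article, not UI-TARS's actual output schema.

```python
# Illustrative unified action space; all names here are hypothetical and
# do not reflect UI-TARS's real output format.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GUIAction:
    """A platform-agnostic GUI action with optional platform-specific fields."""
    kind: str                                   # e.g. "click", "type", "hotkey", "long_press"
    target: Optional[Tuple[int, int]] = None    # screen coordinates, when relevant
    text: Optional[str] = None                  # payload for "type"
    keys: Tuple[str, ...] = ()                  # e.g. ("ctrl", "s") for "hotkey"


def dispatch(action: GUIAction, platform: str) -> None:
    """Route a unified action to a (stubbed) platform-specific executor."""
    if action.kind == "click" and action.target:
        print(f"[{platform}] click at {action.target}")
    elif action.kind == "type" and action.text is not None:
        print(f"[{platform}] type {action.text!r}")
    elif action.kind == "hotkey" and action.keys:
        print(f"[{platform}] press {'+'.join(action.keys)}")   # desktop-style shortcut
    elif action.kind == "long_press" and action.target:
        print(f"[{platform}] long-press at {action.target}")   # mobile-style gesture
    else:
        raise ValueError(f"Unsupported action: {action}")


dispatch(GUIAction(kind="click", target=(640, 360)), platform="desktop")
dispatch(GUIAction(kind="hotkey", keys=("ctrl", "s")), platform="desktop")
```

Keeping one schema and pushing platform differences into the dispatcher is what allows a single model to emit the same action vocabulary everywhere.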
3. Reasoning
- System 1 & System 2 Reasoning: Combines fast, intuitive responses with deliberate, high-level planning for complex tasks.
- Task Decomposition & Reflection: Supports multi-step planning, reflection, and error correction for robust task execution.
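In practice, this kind of deliberate reasoning typically surfaces as a free-form thought followed by a concrete action in the model's response. Below is a hedged sketch of splitting such a response apart; the Thought/Action layout and the action syntax are assumptions made for illustration, not the official output format.

```python
import re

# Hypothetical model output; the "Thought:" / "Action:" layout is an assumption
# used for illustration, not the guaranteed UI-TARS output format.
raw_output = """Thought: The login form is visible. I should click the Sign in button
to submit the credentials I just typed.
Action: click(x=812, y=447)"""


def parse_step(text: str):
    """Split a model response into its reasoning and its action call."""
    thought_match = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.DOTALL)
    action_match = re.search(r"Action:\s*(.+)", text, re.DOTALL)
    if not (thought_match and action_match):
        raise ValueError("Response does not contain both a Thought and an Action")
    return thought_match.group(1).strip(), action_match.group(1).strip()


thought, action = parse_step(raw_output)
print("Reasoning:", thought)
print("To execute:", action)   # hand off to the action dispatcher
```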
4. Memory
- Short-Term Memory: Captures task-specific context for situational awareness.
- Long-Term Memory: Retains historical interactions and knowledge for improved decision-making.
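One simple way to realize this split is a bounded window of recent steps for short-term context plus an append-only log for long-term recall. The sketch below is an assumption about how such a memory could be wired around the model, not a description of UI-TARS internals.

```python
# Hypothetical memory wiring for a GUI agent; not UI-TARS internals.
from collections import deque
from dataclasses import dataclass


@dataclass
class Step:
    screenshot_path: str   # observation captured before acting
    thought: str           # the model's reasoning for this step
    action: str            # the action that was executed


class AgentMemory:
    """Short-term window for prompting plus an append-only long-term log."""

    def __init__(self, window: int = 5):
        self.short_term = deque(maxlen=window)  # recent steps fed back to the model
        self.long_term = []                     # full history kept for later retrieval

    def record(self, step: Step) -> None:
        self.short_term.append(step)
        self.long_term.append(step)

    def context(self) -> list:
        """Steps that should accompany the next model call."""
        return list(self.short_term)


memory = AgentMemory(window=3)
memory.record(Step("shot_001.png", "Open the settings menu", "click(x=30, y=40)"))
print(len(memory.context()), "step(s) of short-term context")
```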
Capabilities
1. Cross-Platform Interaction
UI-TARS supports desktop, mobile, and web environments with a unified action framework, making it versatile for various applications.
2. Multi-Step Task Execution
Trained on multi-step trajectories with explicit reasoning, UI-TARS can handle complex tasks such as:
- Navigating through applications.
- Filling out forms.
- Executing workflows across multiple platforms.
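Under the hood, a multi-step trajectory reduces to a perceive-reason-act loop. The sketch below shows one plausible structure for that loop; capture_screenshot, query_model, and execute are hypothetical placeholders for real screen-capture, model-inference, and input-control integrations, not functions shipped with UI-TARS.

```python
# Illustrative perceive-reason-act loop; the three helpers are hypothetical
# stand-ins for real screen-capture, model-inference, and input-control code.
MAX_STEPS = 15


def capture_screenshot() -> bytes:
    return b""  # placeholder: grab the current screen via your platform's API


def query_model(instruction: str, screenshot: bytes, history: list) -> str:
    return "Action: finished()"  # placeholder: call the deployed UI-TARS endpoint


def execute(action: str) -> None:
    print("executing:", action)  # placeholder: drive mouse/keyboard/touch events


def run_task(instruction: str) -> None:
    history = []
    for _ in range(MAX_STEPS):                                      # cap the trajectory
        screenshot = capture_screenshot()                           # perceive
        response = query_model(instruction, screenshot, history)    # reason
        history.append(response)
        if "finished" in response:                                  # model signals completion
            break
        execute(response)                                           # act


run_task("Fill in the contact form and submit it")
```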
3. Learning from Synthetic and Real Data
UI-TARS combines large-scale annotated and synthetic datasets for improved generalization and robustness, ensuring high performance in diverse scenarios.
Performance
1. Perception Capability Evaluation
UI-TARS outperforms several state-of-the-art models in benchmarks like VisualWebBench, WebSRC, and SQAshort. For example:
- UI-TARS-72B achieves 82.8 in VisualWebBench, 89.3 in WebSRC, and 88.6 in SQAshort, surpassing models like GPT-4o and Claude-3.5-Sonnet.
2. Grounding Capability Evaluation
In ScreenSpot, UI-TARS demonstrates superior grounding capabilities:
- UI-TARS-7B achieves an average grounding accuracy of 89.5, outperforming models like GPT-4o and Aguvis-72B.
3. Offline Agent Capability Evaluation
In Multimodal Mind2Web, UI-TARS excels in cross-task, cross-website, and cross-domain evaluations:
- UI-TARS-72B achieves 74.7 in cross-task element accuracy and 68.6 in step success rate.
4. Online Agent Capability Evaluation
In OSWorld (Online) and AndroidWorld (Online), UI-TARS demonstrates strong performance:
- UI-TARS-72B-DPO achieves 24.6 in OSWorld (50 steps) and 46.6 in AndroidWorld.
Deployment Options
1. Cloud Deployment
UI-TARS can be deployed via Hugging Face Inference Endpoints, which provide fast, scalable hosted inference.
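Once an endpoint is running, it can usually be called like any OpenAI-compatible chat service. The snippet below assumes the endpoint exposes an OpenAI-compatible chat route and accepts screenshots as base64 data URLs; the endpoint URL, token, and model name are placeholders to replace with your own.

```python
import base64
from openai import OpenAI  # pip install openai

# Assumptions: the Inference Endpoint exposes an OpenAI-compatible chat API and
# accepts images as base64 data URLs. Replace the URL, token, and model name.
client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1",
    api_key="hf_xxx",
)

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ui-tars",  # whatever name the endpoint is configured to serve
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Open the Settings page and enable dark mode."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)  # expected: a thought plus an action
```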
2. Local Deployment
- Transformers: Follows the same deployment process as Qwen2-VL.
- vLLM: Recommended for fast deployment and inference. Requires vllm>=0.6.1 (see the serving sketch below).
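As a hedged local example, the vLLM OpenAI-compatible server can be started from the command line and queried with the same client pattern as above; the checkpoint placeholder and server flags should be checked against the repository's deployment guide for your model size.

```python
# Start the server first (shell); verify the exact checkpoint name and flags
# against the UI-TARS deployment guide:
#   pip install "vllm>=0.6.1"
#   vllm serve <ui-tars-checkpoint> --served-model-name ui-tars
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local vLLM server

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ui-tars",  # must match --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Click the search box and type 'weather'."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```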
3. UI-TARS-Desktop
A desktop version of UI-TARS is available for local use, enabling users to automate tasks on their personal devices.
Key Advantages
- End-to-End Automation: UI-TARS eliminates the need for predefined workflows, enabling seamless task execution.
- High Accuracy: Outperforms state-of-the-art models in perception, grounding, and task execution.
- Versatility: Supports desktop, mobile, and web environments with a unified action framework.
- Scalability: Available in multiple model sizes (2B, 7B, 72B) to suit different computational resources.
Use Cases
- Web Automation: Automate tasks like form filling, data extraction, and navigation on websites.
- Desktop Automation: Perform tasks like file management, application navigation, and workflow automation.
- Mobile Automation: Automate interactions on mobile apps, such as messaging, navigation, and data entry.
Conclusion
UI-TARS represents a significant advancement in GUI automation, combining vision, language, and action capabilities into a single, powerful model. Its ability to perform complex tasks across multiple platforms with high accuracy makes it a game-changer in the field of AI-driven automation. Whether deployed in the cloud or locally, UI-TARS offers a robust and scalable solution for automating GUI interactions.
Acknowledgements
UI-TARS builds upon the foundational architecture of Qwen2-VL, a powerful vision-language model. The project also benefits from the contributions of the open-source community, whose datasets, tools, and insights have facilitated its development.
Citation
If you find UI-TARS useful in your research, please consider citing:
@article{uitars2025,
author = {Yujia Qin and Yining Ye and Junjie Fang and Haoming Wang and Shihao Liang and Shizuo Tian and Junda Zhang and Jiahao Li and Yunxin Li and Shijue Huang and Wanjun Zhong and Kuanye Li and Jiale Yang and Yu Miao and Woyu Lin and Longxiang Liu and Xu Jiang and Qianli Ma and Jingyu Li and Xiaojun Xiao and Kai Cai and Chuang Li and Yaowei Zheng and Chaolin Jin and Chen Li and Xiao Zhou and Minchao Wang and Haoli Chen and Zhaojian Li and Haihua Yang and Haifeng Liu and Feng Lin and Tao Peng and Xin Liu and Guang Shi},
title = {UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
journal = {arXiv preprint arXiv:2501.12326},
url = {https://github.com/bytedance/UI-TARS},
year = {2025}
}
Support My Work
If you found this article helpful and would like to support my work, consider contributing to my efforts. Your support will enable me to:
- Continue creating high-quality, in-depth content on AI and data science.
- Invest in better tools and resources to improve my research and writing.
- Explore new topics and share insights that can benefit the community.
Every contribution, no matter how small, makes a huge difference. Thank you for being a part of my journey!
If you found this article helpful, don’t forget to share it with your network. For more insights on AI and technology, follow me:
Connect with me on Medium:
https://medium.com/@TheDataScience-ProF