中文 | English
A local automated intelligent agent that frees your hands 🤖
Entrust your tasks to me, and enjoy a rich cup of cappuccino ☕️
By the time you return, your tasks will be silently completed 🍃
Cappuccino is a GUI Agent that controls your computer to handle tedious tasks from a single instruction: the AI generates a detailed task plan and executes it. Unlike existing solutions that parse image elements or rely on browser interfaces, cappuccino is a purely vision-based solution that works from desktop screenshots, because we believe the parsing process easily loses spatial association information.
You can use vendor APIs directly to get started quickly, or deploy LLMs on local servers for greater security. Send control instructions through Python scripts or the visual interface: cappuccino-client 🖥️.
- Local Deployment: Every part of our architecture offers open-source model options for local deployment, with information transmitted over your local LAN to protect your privacy.
- Easy to Use: We provide a React-based GUI Client to control the Agent, which is beginner-friendly.
- Scalability: The current architecture supports adding more executors to expand the Agent's capabilities.
We will support more models, optimize the agent's performance, and work on developing our own small-parameter LLM to reduce deployment costs and improve running speed.
We hope more people will follow our project or join us. We will continue to enrich the system, build a Manus-like product suitable for local deployment, and support more software operations.
Your star🌟 will be the biggest motivation for us to update!
You are welcome to join our community group to discuss the project or take part in its development.
- [2025/03/19] 🧠 The system architecture was upgraded to enable more complex tasks.
- [2025/03/09] 🖥️ We introduced cappuccino-client for easier command initiation.
- [2025/03/04] 💥 Deepseek-v3 is now supported as a planner.
- [2025/02/27] 🏆 Now you can experience cappuccino with qwen and gpt-4o.
Demo video: cappuccino_demo_windows_v1.mov
At present, the project supports deployment on Windows and macOS. Because shortcut keys and operation methods differ between systems, the experience may vary from one system to another. We will adapt to more systems in the future.
This project supports using vendor APIs or deploying LLMs locally. If you need local deployment, please use an OpenAI-compatible API service. We recommend deploying with vLLM, following the official tutorial.
For model selection, we recommend deepseek-v3 as the planner, qwen-vl-max as the dispatcher & verifier, and qwen2.5-vl-7b as the executor.
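As a quick illustration, a locally served OpenAI-compatible endpoint can be smoke-tested with the openai Python library before wiring it into cappuccino. This is only a sketch; the base URL, port, and model name below are placeholders for your own deployment:

```python
# Minimal sketch: query a local OpenAI-compatible endpoint (e.g. one served
# by vLLM). The host, port, and model name are examples, not project defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your vLLM server's address
    api_key="EMPTY",  # vLLM accepts a placeholder key unless --api-key is set
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # the model you deployed
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```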
The following operations are performed on the computer you want to control.
```bash
git clone https://github.com/GML-FMGroup/cappuccino.git
cd cappuccino
pip install -r requirements.txt
cd app
python server.py
```
You will see your local IP and a randomly generated token in the console. In this example, the IP is 192.168.0.100:

```
Generated token: 854616
Chat WebSocket: ws://192.168.0.100:8000/chat
Screenshots WebSocket: ws://192.168.0.100:8001/screenshots
```
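For example, you can watch the screenshot stream with a few lines of Python. This is a hypothetical sketch using the third-party websockets package; it only connects and prints what arrives, without assuming anything about the message format:

```python
# Hypothetical sketch: connect to the Screenshots WebSocket printed by
# server.py and report each incoming message. Requires `pip install websockets`.
import asyncio
import websockets

async def watch(url: str) -> None:
    async with websockets.connect(url) as ws:
        async for message in ws:  # messages are treated as opaque payloads
            print(f"received a message of length {len(message)}")

asyncio.run(watch("ws://192.168.0.100:8001/screenshots"))
```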
Run the following on another device to initiate network requests. You can also run it on the controlled computer itself, but our design philosophy is to send instructions from a separate device so they do not interfere with the controlled computer's operations.
- Modify the IP and token in `request_demo.py`. For example, the IP is 192.168.0.100.
- Fill in the LLM configuration, such as the API key and vendor (a hypothetical sketch of these values follows the steps).
- Run the Python file:

```bash
python request_demo.py
```
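Purely as a hypothetical illustration, the values you edit look something like this; the actual variable names in request_demo.py may differ:

```python
# Hypothetical illustration only -- the real names in request_demo.py may differ.
SERVER_IP = "192.168.0.100"  # IP printed by server.py on the controlled computer
TOKEN = "854616"             # token printed by server.py
API_KEY = "sk-..."           # your vendor API key (or a placeholder for local vLLM)
VENDOR = "dashscope"         # e.g. dashscope or openai
```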
You can find a more detailed tutorial on using the GUI Client in cappuccino-client 🖥️.
We divide Cappuccino into three parts: Model, Server, Client.
- Model: You can choose to use vendors like dashscope, openai, or a more secure local deployment.
- Server: The GUI Agent deployed on the controlled computer. It runs WebSocket services to receive instructions over the LAN and combines desktop screenshots with model interaction so the model can output plans and execution instructions.
- Client: Sends human instructions to the server through a GUI interface or Python scripts.
For the design of the GUI Agent, we divide it into four parts: 🧠Planner, 🤖Dispatcher, ✍️Executor, 🔍Verifier (a conceptual sketch of how they cooperate follows the list).
- 🧠Planner: Breaks down complex user instructions into multiple tasks for step-by-step execution.
- 🤖Dispatcher: Based on the desktop screen and the available executors, breaks each task into multiple subtasks and assigns them to the corresponding executor. Each subtask is an atomic operation, the minimum action unit for human control of a computer, such as "click xx" or "enter xx".
- ✍️Executor: Combines the desktop screen with each atomic operation to generate the parameters for script execution.
- 🔍Verifier: Determines from the desktop screen whether the corresponding task has been completed.
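To make the division of labor concrete, here is a conceptual sketch of how the four parts could compose into one loop. It is not the project's actual code, and every name in it is hypothetical; real components would call an LLM and capture real screenshots instead of returning stubs:

```python
# Conceptual sketch of the Planner -> Dispatcher -> Executor -> Verifier loop.
# All names are hypothetical stand-ins for LLM-backed components.

def take_screenshot() -> bytes:
    return b"<desktop screenshot>"  # stand-in for a real screen capture


class StubAgent:
    def plan(self, instruction: str) -> list[str]:
        # Planner: break a complex instruction into tasks
        return [f"task derived from: {instruction}"]

    def dispatch(self, task: str, screen: bytes) -> list[str]:
        # Dispatcher: break a task into atomic operations for an executor
        return [f"click the element needed for: {task}"]

    def ground(self, op: str, screen: bytes) -> dict:
        # Executor: turn an atomic operation into concrete script parameters
        return {"action": op, "x": 100, "y": 200}

    def verify(self, task: str, screen: bytes) -> bool:
        # Verifier: judge from the screen whether the task is complete
        return True


def run(agent: StubAgent, instruction: str) -> None:
    for task in agent.plan(instruction):
        done = False
        while not done:
            for op in agent.dispatch(task, take_screenshot()):
                params = agent.ground(op, take_screenshot())
                print("executing", params)  # a real agent would run a script here
            done = agent.verify(task, take_screenshot())


run(StubAgent(), "open the browser and search for cappuccino")
```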
| Planner - API | Planner - Local | Dispatcher & Verifier - API | Dispatcher & Verifier - Local | Executor - API | Executor - Local |
|---|---|---|---|---|---|
| qwen-vl-max | deepseek-v3 | qwen-vl-max | qwen2.5-vl-72b | qwen2.5-vl-7b | qwen2.5-vl-7b |
| gpt-4o | | gpt-4o | | | |
| deepseek-v3 | | | | | |
- Please ensure the model name is correct and that the vendor supports the model you select.
- Our current interface is implemented with the openai library. Please ensure your provider or local deployment actually serves the models you select.
- Model outputs are inherently unstable; if execution fails, try running again or rewording your query.