Skip to content

Cappuccino is an GUI Agent based on desktop screen. It is a Manus-like AI Agent that can be deployed locally.

License

Notifications You must be signed in to change notification settings

GML-FMGroup/cappuccino

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

☕️ cappuccino

中文 | English

A local automated intelligent agent that frees your hands 🤖

Entrust your tasks to me, and enjoy a rich cup of cappuccino ☕️

By the time you return, your tasks will be silently completed 🍃

💡 Overview

Cappuccino is a GUI Agent that can control your computer to solve tedious tasks with a simple instruction. AI will generate detailed task plans and execute them. Unlike other existing solutions that parse image elements or use browser interfaces, cappuccino is a purely visual solution based on desktop screens, as we believe the parsing process easily loses spatial association information.

You can use the API directly to get started quickly or deploy LLM on local servers for greater security. Send control instructions through Python scripts or visual interface: cappuccino-client 🖥️.

✨ Features

  • Local Deployment: Each part of our architecture provides open-source model options for local deployment, with information transmission through local LAN to protect your privacy.
  • Easy to Use: We provide a React-based GUI Client to control the Agent, which is beginner-friendly.
  • Scalability: The current architecture supports the addition of more actuators to expand the Agent's capabilities.

🤔 Future Work

We will support more models, optimize the agent's performance, and work on developing our own small-parameter LLM to reduce deployment costs and improve running speed.

We hope more people will pay attention to our project or join us. We will further enrich our system, create a Manus-like product suitable for local deployment, and adapt to more software operations.

Your star🌟 will be the biggest motivation for us to update!

cappuccino_group

Welcome to join our community exchange group to participate in the construction or exchange of projects.

📰 Update

  • [2025/03/19] 🧠 The system architecture was upgraded to enable more complex tasks.
  • [2025/03/09] 🖥️ We introduced cappuccino-client for easier command initiation.
  • [2025/03/04] 💥 Deepseek-v3 is now supported as a planner.
  • [2025/02/27] 🏆 Now you can experience cappuccino with qwen and gpt-4o.

🎥 Demo

cappuccino_demo_windows_v1.mov

👨‍💻 Quickstart

0. Hardware preparation

At present, the project supports the deployment of Windows and Mac. Due to the differences in the shortcut keys and operation methods of the system, the experience of different systems may be different. We will carry out more system adaptation in the future.

1. Model Deployment

This project supports using vendor APIs or locally deploying LLMs. If you need local deployment, please use an OpenAI-compatible API service. We recommend using vLLM for deployment, referring to the official tutorial.

For model selection, we recommend using deepseek-v3 as the planner, qwen-vl-max as the dispatcher & validator, and qwen2.5-vl-7b as the executor.

2. Server Configuration and Startup

The following operations are performed on the computer you want to control.

2.1 Clone the Repository

git clone https://github.com/GML-FMGroup/cappuccino.git
cd cappuccino

2.2 Install Dependencies

pip install -r requirements.txt

2.3 Start the Server

cd app
python server.py

You will see your local IP and randomly generated token in the console. In this example, IP is 192.168.0.100

Generated token: 854616
Chat WebSocket: ws://192.168.0.100:8000/chat
Screenshots WebSocket: ws://192.168.0.100:8001/screenshots

3. Send Instructions

Run on another device to initiate network requests. Of course, you can also run it on the controlled terminal, but our design philosophy is to use another device to send instructions to avoid affecting the computer's operations.

Method 1: Python Scripts

  1. Modify the IP and token in request_demo.py. For example, IP is 192.168.0.100.
  2. Fill in LLM configuration information like API Key, vendor, etc.
  3. Run the Python file.
python request_demo.py

Method 2: GUI Client

You can find a more detailed tutorial on using the GUI Client in cappuccino-client 🖥️.

📖 Guide

Design Architecture

We divide Cappuccino into three parts: Model, Server, Client.

  • Model: You can choose to use vendors like dashscope, openai, or a more secure local deployment.
  • Server: GUI Agent deployed on the controlled computer, enables websocket network service to receive instructions from LAN, and combines desktop screenshots with model interaction so the model can output execution instructions or plans.
  • Client: Used to send human instructions to the server through GUI Interface or Python Scripts.

For the design of GUI Agent, we mainly divide it into four parts: 🧠Planner, 🤖Dispatcher, ✍️Executor, 🔍Verifier.

  • 🧠Planner: Breaks down complex user instructions into multiple tasks for step-by-step execution.
  • 🤖Dispatcher: Combined with the functions of the desktop screen and the actuator, the task is broken down into multiple subtasks and assigned to the corresponding actuator, each subtask is an atomic operation (the minimum action unit for human control of the computer, such as: click xx, enter xx).
  • ✍️Executor: Combines desktop screen to generate parameters for script execution based on atomic operations.
  • 🔍Verifier: Determines whether corresponding tasks have been completed based on desktop screen.

Supported Models

Planner - API Planner - Local Dispatcher & Verifier - API Dispatcher & Verifier - Local Executor - API Executor - Local
qwen-vl-max deepseek-v3 qwen-vl-max qwen2.5-vl-72b qwen2.5-vl-7b qwen2.5-vl-7b
gpt-4o gpt-4o
deepseek-v3

⚠️ Notice

  • Please ensure the model name is correct and the vendor supports the model when making your selection.
  • Our current interface is implemented based on the openai library. Please ensure the provider or local deployment supports the provided models.
  • Due to the inherent instability in model outputs, if execution fails, try running again or modifying your query.

About

Cappuccino is an GUI Agent based on desktop screen. It is a Manus-like AI Agent that can be deployed locally.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages