add pretty docs #6

Merged
merged 1 commit on Feb 14, 2024
265 changes: 4 additions & 261 deletions README.md
@@ -3,270 +3,13 @@
> 🌈️ Where keebler elves and schedulers live, somewhere in the clouds, and with marshmallows

[![PyPI version](https://badge.fury.io/py/rainbow-scheduler.svg)](https://badge.fury.io/py/rainbow-scheduler)
![img/rainbow.png](img/rainbow.png)
![docs/img/rainbow.png](docs/img/rainbow.png)

This is a prototype that will use a Go [gRPC](https://grpc.io/) server/client to demonstrate multi-cluster scheduling. This won't be doing anything intelligent with respect to scheduling (but could) but instead:
This is a prototype that will use a Go [gRPC](https://grpc.io/) server/client to demonstrate multi-cluster scheduling.
For more information:

- Will expose an API that can take job requests, where a request is a simple command and resources.
- Clusters can register to it, meaning they are allowed to ask for work.
- Users will submit jobs (from anywhere) to the API, targeting a specific cluster (again, no scheduling here)
- The cluster will run a client that periodically checks for new jobs to run.
- ⭐️ [Documentation](https://converged-computing.github.io/rainbow) ⭐️

This is just a prototype that demonstrates that we can do a basic interaction from multiple places, and it obviously has a lot of room for improvement.
We can run the client alongside any flux instance that has access to this service (and is given some shared secret).
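
As a rough sketch of what that cluster-side polling client could look like in Go, consider the following. Everything here that touches the API (the `pb` import path, the client constructor, the `RequestJobs` call, and its request/response fields) is a placeholder assumption for illustration, not the project's actual generated bindings.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Placeholder import path for the bindings generated by `make proto`.
	pb "github.com/converged-computing/rainbow/pkg/api/v1"
)

func main() {
	// Connect to the rainbow scheduler (no TLS in this prototype sketch).
	conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("could not connect: %v", err)
	}
	defer conn.Close()

	// NewRainbowSchedulerClient is a placeholder for the generated constructor.
	client := pb.NewRainbowSchedulerClient(conn)

	// Periodically ask the scheduler for jobs assigned to this cluster.
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		resp, err := client.RequestJobs(ctx, &pb.RequestJobsRequest{
			Cluster: "keebler",                              // cluster we registered as
			Secret:  "54c4568a-14f2-465f-aa1e-5e6e0e3efd33", // secret returned at registration
			MaxJobs: 3,
		})
		cancel()
		if err != nil {
			log.Printf("request failed: %v", err)
			continue
		}
		log.Printf("received %d jobs to consider", len(resp.Jobs))
	}
}
```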

## Components

- The main server (and, optionally, a client) is implemented in Go, here
- Under [python](python) we also have a client that is intended to run from a flux instance, another scheduler, or anywhere really. We haven't implemented the same server in its entirety because it's assumed that if you plan to run a server, Go is the better choice (and we will provide a container for it). That said, the skeleton is there, but mostly unimplemented.

## Development

### proto

We are using [Protocol Buffers](https://developers.google.com/protocol-buffers/) "Protobuf" to define the API (how the payloads are shared and the methods for communication between client and server). These are defined in [api/v1/sample.proto](api/v1/sample.proto).
You can read more about Protobuf [here](https://github.com/golang/protobuf); I first saw and used it with fluence and am still fairly new to it.

```shell
make proto
```

That will download protoc and needed tools into a local "bin" and then generate the bindings.
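
For orientation, generated gRPC bindings are typically wired into a server along these lines; the `pb` import path, service name, and generated helper names below are illustrative assumptions rather than this project's actual code.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"

	// Placeholder import path for the bindings generated by `make proto`.
	pb "github.com/converged-computing/rainbow/pkg/api/v1"
)

// server is an illustrative implementation of the generated service interface;
// embedding the generated Unimplemented* type keeps it forward compatible.
type server struct {
	pb.UnimplementedRainbowSchedulerServer
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	s := grpc.NewServer()

	// RegisterRainbowSchedulerServer is a placeholder for the generated helper
	// that attaches our implementation to the gRPC server.
	pb.RegisterRainbowSchedulerServer(s, &server{})

	log.Println("server listening:", lis.Addr())
	if err := s.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}
```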

## Getting Started

### Setup

Ensure you have your dependencies:

```bash
make tidy
```

In two terminals, start the server in one:

```bash
make server
```
```console
go run cmd/server/server.go
2024/02/12 19:38:58 creating 🌈️ server...
2024/02/12 19:38:58 ✨️ creating rainbow.db...
2024/02/12 19:38:58 rainbow.db file created
2024/02/12 19:38:58 create cluster table...
2024/02/12 19:38:58 cluster table created
2024/02/12 19:38:58 create jobs table...
2024/02/12 19:38:58 jobs table created
2024/02/12 19:38:58 starting scheduler server: rainbow v0.1.0-draft
2024/02/12 19:38:58 server listening: [::]:50051
```

### Register

And then mock a registration:

```bash
make register
```
```console
go run cmd/rainbow/rainbow.go register
2024/02/12 22:17:43 🌈️ starting client (localhost:50051)...
2024/02/12 22:17:43 registering cluster: keebler
2024/02/12 22:17:43 status: REGISTER_SUCCESS
2024/02/12 22:17:43 secret: 54c4568a-14f2-465f-aa1e-5e6e0e3efd33
2024/02/12 22:17:43 token: 67e0f258-96c3-4d88-8253-287a95653138
```

In the above:

- `token` is what is given to clients so they can submit jobs to your cluster
- `secret` is a secret just for your cluster / instance / place that receives jobs, and it is needed to request and receive them!

You'll see this from the server:

```console
2024/02/12 22:17:43 📝️ received register: keebler
2024/02/12 22:17:43 SELECT count(*) from clusters WHERE name = 'keebler': (0)
2024/02/12 22:17:43 INSERT into clusters (name, token, secret) VALUES ("keebler", "67e0f258-96c3-4d88-8253-287a95653138", "54c4568a-14f2-465f-aa1e-5e6e0e3efd33"): (1)
```

In the above, we provide a cluster name (keebler), it is registered to the database, and a token, secret, and status are returned. Note that if we want to submit a job to the "keebler" cluster, from anywhere, we need this token! Let's try that next.
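
If you wanted to drive the same registration from Go rather than the CLI, a minimal sketch (reusing a connection and client like the polling example above, and assuming placeholder request/response field names) could look like:

```go
// register asks the scheduler to add a cluster, returning the token (handed to
// job submitters) and the secret (kept by the cluster to request its jobs).
// The request/response field names here are illustrative assumptions.
func register(ctx context.Context, client pb.RainbowSchedulerClient, name, sharedSecret string) (token, secret string, err error) {
	resp, err := client.Register(ctx, &pb.RegisterRequest{
		Name:   name,         // e.g. "keebler"
		Secret: sharedSecret, // shared secret required to register
	})
	if err != nil {
		return "", "", err
	}
	return resp.Token, resp.Secret, nil
}
```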

### Submit Job

To submit a job, we need the client `token` associated with a cluster.

```bash
# Look at help
go run ./cmd/rainbow/rainbow.go submit --help
```
```
usage: rainbow submit [-h|--help] [-s|--secret "<value>"] [-n|--nodes
<integer>] [-t|--tasks <integer>] [-c|--command "<value>"]
[--job-name "<value>"] [--host "<value>"] [--cluster-name
"<value>"]

Submit a job to a rainbow scheduler

Arguments:

-h --help Print help information
--token Client token to submit jobs with.. Default:
chocolate-cookies
-n --nodes Number of nodes to request. Default: 1
-t --tasks Number of tasks to request (per node? total?)
-c --command Command to submit. Default: chocolate-cookies
--job-name Name for the job (defaults to first command)
--host Scheduler server address (host:port). Default:
localhost:50051
--cluster-name Name of cluster to register. Default: keebler
```

Let's try doing that.

```bash
go run ./cmd/rainbow/rainbow.go submit --token "712747b7-b2a9-4bea-b630-056cd64856e6" --command hostname
```
```console
2024/02/11 21:43:17 🌈️ starting client (localhost:50051)...
2024/02/11 21:43:17 submit job: hostname
2024/02/11 21:43:17 status:SUBMIT_SUCCESS
```

Hooray! On the server log side we see...

```console
SELECT * from clusters WHERE name LIKE "keebler" LIMIT 1: keebler
2024/02/11 21:43:17 📝️ received job hostname for cluster keebler
```
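
A hypothetical Go version of the same submission, again assuming placeholder message and field names for the generated API and the client setup from the earlier sketch, might look like:

```go
// submit sends a simple job (a command plus resource counts) to the named
// cluster, authenticated with that cluster's client token. The message and
// field names are illustrative assumptions.
func submit(ctx context.Context, client pb.RainbowSchedulerClient, token string) error {
	resp, err := client.SubmitJob(ctx, &pb.SubmitJobRequest{
		Token:   token,      // client token returned at registration
		Cluster: "keebler",  // target cluster
		Name:    "hostname", // job name defaults to the first command word
		Command: "hostname",
		Nodes:   1,
	})
	if err != nil {
		return err
	}
	log.Println("status:", resp.Status)
	return nil
}
```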

Now we have a job in the database, and it's targeted at a specific cluster.
We can next (as the cluster) request to receive some maximum number of jobs. Let's
emulate that.

### Request Jobs

> Also List Jobs

We are now pretending to be the cluster that originally registered, and we want to request some maximum number of jobs
to look at. This doesn't mean we have to run them, but we want to ask for some small set to consider for running.
Right now this just does a query for the count, but in the future we can have actual filters / query parameters
for the jobs (nodes, time, etc.) that we want to ask for. Have some fun and submit a few jobs above, and then request
to see them:

```console
$ go run ./cmd/rainbow/rainbow.go request --request-secret 3cc06871-0990-4dc2-94d5-eec653c5d7a0 --cluster-name keebler --max-jobs 3
2024/02/12 23:29:59 🌈️ starting client (localhost:50051)...
2024/02/12 23:29:59 request jobs: 3
2024/02/12 23:29:59 🌀️ Found 3 jobs!
2024/02/12 23:29:59 1 : {"id":1,"cluster":"keebler","name":"hostname","nodes":1,"tasks":0,"command":"hostname"}
2024/02/12 23:29:59 2 : {"id":2,"cluster":"keebler","name":"sleep","nodes":1,"tasks":0,"command":"sleep 10"}
2024/02/12 23:29:59 3 : {"id":3,"cluster":"keebler","name":"dinosaur","nodes":1,"tasks":0,"command":"dinosaur things"}
```

And on the server side:

```console
2024/02/12 23:27:29 SELECT * from clusters WHERE name LIKE "keebler" LIMIT 1: keebler
2024/02/12 23:27:29 🌀️ requesting 3 max jobs for cluster keebler
```

Note that if you don't define the max jobs (so it is essentially 0) you will get all jobs. This is akin to listing jobs.
Awesome! Next we can put that logic in a flux instance (starting from the Python gRPC client) and then have Flux
accept some number of them. The response back to the rainbow scheduler will be those to accept, which will then be removed from the database. For another day.
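
Since the jobs come back as JSON summaries like the ones shown above, a small Go struct is enough to decode them on the cluster side. This is just a sketch based on the fields visible in that output:

```go
package main

import (
	"encoding/json"
	"log"
)

// Job mirrors the fields visible in the JSON job summaries shown above.
type Job struct {
	ID      int    `json:"id"`
	Cluster string `json:"cluster"`
	Name    string `json:"name"`
	Nodes   int    `json:"nodes"`
	Tasks   int    `json:"tasks"`
	Command string `json:"command"`
}

func main() {
	raw := `{"id":1,"cluster":"keebler","name":"hostname","nodes":1,"tasks":0,"command":"hostname"}`
	var job Job
	if err := json.Unmarshal([]byte(raw), &job); err != nil {
		log.Fatal(err)
	}
	log.Printf("job %d on %s runs %q on %d node(s)", job.ID, job.Cluster, job.Command, job.Nodes)
}
```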


### Accept Jobs

A derivative of the above is to request and accept jobs. This can be done with the example client above by adding `--accept N`.

```console
$ go run ./cmd/rainbow/rainbow.go request --request-secret 3cc06871-0990-4dc2-94d5-eec653c5d7a0 --cluster-name keebler --max-jobs 3 --accept 1
```
```console
2024/02/13 12:29:29 🌀️ Found 3 jobs!
2024/02/13 12:29:29 1 : {"id":1,"cluster":"keebler","name":"hostname","nodes":1,"tasks":0,"command":"hostname"}
2024/02/13 12:29:29 2 : {"id":2,"cluster":"keebler","name":"sleep","nodes":1,"tasks":0,"command":"sleep 10"}
2024/02/13 12:29:29 3 : {"id":3,"cluster":"keebler","name":"dinosaur","nodes":1,"tasks":0,"command":"dinosaur things"}
2024/02/13 12:29:29 ✅️ Accepting 1 jobs!
2024/02/13 12:29:29 1
2024/02/13 12:29:29 status:RESULT_TYPE_SUCCESS
```

What this does is randomly select from the set you receive and send back a response to the server accepting them; the accepted identifiers are then removed from the database. The server shows the following:

```console
2024/02/13 12:29:29 🌀️ accepting 1 for cluster keebler
2024/02/13 12:29:29 DELETE FROM jobs WHERE cluster = 'keebler' AND idJob in (1): (1)
```

The logic you would expect is there: you can't accept more than the number available.
You could try asking for a high max-jobs value again and see that there is one fewer job than before; it was deleted from the database.
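
A cluster-side sketch of that random selection and accept step (reusing the `Job` type from the decoding sketch above, and assuming `math/rand`, the same `context`/`pb` setup, and a placeholder `Accept` RPC) might look like:

```go
// pickAndAccept randomly chooses up to n of the returned jobs and tells the
// scheduler to accept them, so their identifiers get removed from the database.
// The Accept RPC and its fields are illustrative assumptions; requires "math/rand".
func pickAndAccept(ctx context.Context, client pb.RainbowSchedulerClient, jobs []Job, n int) error {
	if n > len(jobs) {
		n = len(jobs) // never accept more than were returned
	}
	ids := make([]int32, 0, n)
	for _, i := range rand.Perm(len(jobs))[:n] {
		ids = append(ids, int32(jobs[i].ID))
	}
	_, err := client.Accept(ctx, &pb.AcceptJobsRequest{
		Cluster: "keebler",
		Jobids:  ids,
	})
	return err
}
```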

## Development

### Build

You can build the binaries:

```console
$ make build
mkdir -p /home/vanessa/Desktop/Code/rainbow/bin
GO111MODULE="on" go build -o /home/vanessa/Desktop/Code/rainbow/bin/rainbow cmd/rainbow/rainbow.go
GO111MODULE="on" go build -o /home/vanessa/Desktop/Code/rainbow/bin/rainbow-scheduler cmd/server/server.go
```

Note that the `rainbow-scheduler` starts the server, and `rainbow` is the set of client commands.

```console
$ ls bin/
protoc-gen-go protoc-gen-go-grpc rainbow rainbow-scheduler
```

They are placed in the local bin, as shown above.

### Python

To build the Python gRPC bindings, ensure you have grpcio-tools installed:

```bash
pip install grpcio-tools
```

Then:

```bash
make python
```

Next, `cd` into [python/v1](python/v1) and follow the README instructions there.


## Container Images

We provide make commands to build:

- **ghcr.io/converged-computing/rainbow-scheduler**: the scheduler (the `rainbow` client and `rainbow-scheduler` binaries in an ubuntu base, intended to be run as the scheduler image)
- **ghcr.io/converged-computing/rainbow-flux**: the client (includes flux) for interacting with a scheduler.

Both images above have both binaries; the second just has flux added. We can add more schedulers or other entities that can
accept jobs as needed. You can build in any of the following ways:

```bash
# both images, default registry
make docker

# scheduler
make docker-ubuntu

# client with flux
make docker-flux

# customize the registry for any command above
REGISTRY=vanessa make docker
```

Further instructions will be added for running these containers in the next round of work - likely we will have a basic kind setup that demonstrates the orchestration.

## TODO

Empty file added docs/.nojekyll
Empty file.
60 changes: 52 additions & 8 deletions docs/README.md
@@ -1,15 +1,59 @@
# Multi-Cluster Proof of Concept
# Rainbow Scheduler

We can design a "tiny" setup of a more production setup as a proof of concept. Namely, we want to show that it's possible to submit jobs (from anywhere) that are directed to run on different clusters. We want to untangle this work from requiring specific workflow tools that might add additional toil or error, and direct development in a direction that makes things ultiamtely harder. That should be fairly easy to do I think.
The rainbow scheduler is a combined scheduler and client to allow for multi-cluster scheduling, meaning submission and management of jobs across environments. It is currently in a prototype state.

![img/rainbow-scheduler.png](img/rainbow-scheduler.png)
## Prototype Design

Our current design does not have a scheduler yet, and simply:

In the above:
- Exposes an API that can take job requests, where a request is a simple command and resources.
- Clusters can register to it, meaning they are allowed to ask for work.
- Users will submit jobs (from anywhere) to the API, targeting a specific cluster (again, no scheduling here)
- The cluster will run a client that periodically checks for new jobs to run.

- The **"scheduler"** can be thought of like a rabbitmq (or other task) queue, but with bells and whistles, and under our control. It will eventually have a scheduler that has high level information about clusters, but to start is just a simple database and endpoints to support job submission and registration. For registration, a secret is required, and then a cluster-specific token sent back for subsequent requests. This will need to be further hardened but is OK for a dummy proof of concept.
- Any **Flux instance** is allowed to hit the register endpoint and request to register with a specific cluster identifier (A or B in the diagram above) and is required to provide the secret. It receives back a token that can be used for subsequent requests. For this first dummy prototype, we will have a simple loop running in the instance that checks the scheduler for jobs assigned to it.
- Any **standalone client** (including the flux instances themselves) can then submit jobs, and request them to be run on any known cluster. This means that instance A can submit to B (and vice versa) and the standalone client can submit to A or B.
This is currently a prototype that demonstrates that we can do a basic interaction from multiple places, and it obviously has a lot of room for improvement.
We can run the client alongside any flux instance that has access to this service (and is given some shared secret).

The reason I want to prototype the above is that we will want a simple design to test with additional compatibility metadata, and (when actual scheduler bindings are ready) we can add a basic graph to the scheduler above. As we develop we can harden the endpoints / authentication, etc.
For more details on the design, see [design.md](design.md)

## Components

- The main server (and, optionally, a client) is implemented in Go, here
- Under [python](https://github.com/converged-computing/rainbow/tree/main/python/v1) we also have a client that is intended to run from a flux instance, another scheduler, or anywhere really. We haven't implemented the same server in its entirety because it's assumed that if you plan to run a server, Go is the better choice (and we will provide a container for it). That said, the skeleton is there, but mostly unimplemented.
- See [examples](https://github.com/converged-computing/rainbow/tree/main/docs/examples) for basic documentation and ways to deploy (containers and Kubernetes with kind, for example).

## Setup

Ensure you have your dependencies:

```bash
make tidy
```

In two terminals, start the server in one:

```bash
make server
```
```console
go run cmd/server/server.go
2024/02/12 19:38:58 creating 🌈️ server...
2024/02/12 19:38:58 ✨️ creating rainbow.db...
2024/02/12 19:38:58 rainbow.db file created
2024/02/12 19:38:58 create cluster table...
2024/02/12 19:38:58 cluster table created
2024/02/12 19:38:58 create jobs table...
2024/02/12 19:38:58 jobs table created
2024/02/12 19:38:58 starting scheduler server: rainbow v0.1.0-draft
2024/02/12 19:38:58 server listening: [::]:50051
```

Note that we also provide [containers](https://github.com/orgs/converged-computing/packages?repo_name=rainbow) for running the scheduler, or a client with Flux. For more advanced examples, continue reading commands below or check out our [examples](https://github.com/converged-computing/rainbow/tree/main/docs/examples).

## Commands

Read more about the commands shown above [here](commands.md#commands).

## Development

Read our [developer guide](developer.md).
41 changes: 41 additions & 0 deletions docs/_coverpage.md
@@ -0,0 +1,41 @@

![logo](img/rainbow-circle-small.png)

# Rainbow Scheduler <small>docs</small>

- Register clusters to accept jobs
- Submit jobs to the rainbow scheduler
- Poll from clusters to accept and run

<style>
section.cover .cover-main > p:last-child a:last-child {
background-color: #ffffff;
color: black !important;
}

.github-corner svg {
color: #fff;
fill: #224852 !important;
}

section.cover .cover-main>p:last-child a {
border: 1px solid #ffffff !important;
color: white !important;
}

section.cover .cover-main {
margin: 20px 16px 0;
}

.cover {
background: linear-gradient(to left bottom, hsl(182.12deg 57.05% 29.22%) 0%,hsl(250.96deg 50.34% 71.57%) 100%) !important;
color: white;
}

.cover-main span {
color: whitesmoke !important;
}
</style>

[GitHub](https://github.com/converged-computing/rainbow)
[Get Started](#rainbow-scheduler)