add pretty docs #6

Merged
merged 1 commit on Feb 14, 2024
265 changes: 4 additions & 261 deletions README.md
@@ -3,270 +3,13 @@
> 🌈️ Where keebler elves and schedulers live, somewhere in the clouds, and with marshmallows

[![PyPI version](https://badge.fury.io/py/rainbow-scheduler.svg)](https://badge.fury.io/py/rainbow-scheduler)
![img/rainbow.png](img/rainbow.png)
![docs/img/rainbow.png](docs/img/rainbow.png)

This is a prototype that will use a Go [gRPC](https://grpc.io/) server/client to demonstrate multi-cluster scheduling. This won't be doing anything intelligent with respect to scheduling (but could) but instead:
This is a prototype that will use a Go [gRPC](https://grpc.io/) server/client to demonstrate multi-cluster scheduling.
For more information:

- Will expose an API that can take job requests, where a request is a simple command and resources.
- Clusters can register to it, meaning they are allowed to ask for work.
- Users will submit jobs (from anywhere) to the API, targeting a specific cluster (again, no scheduling here)
- The cluster will run a client that periodically checks for new jobs to run.
- ⭐️ [Documentation](https://converged-computing.github.io/rainbow) ⭐️

This is just a prototype that demonstrates that we can do a basic interaction from multiple places, and it obviously has a lot of room for improvement.
We can run the client alongside any flux instance that has access to this service (and is given some shared secret).
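
As a rough sketch of what that cluster-side polling client could look like in Go, consider the following. Everything here that touches the API (the `pb` import path, the client constructor, the `RequestJobs` call, and its request/response fields) is a placeholder assumption for illustration, not the project's actual generated bindings.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Placeholder import path for the bindings generated by `make proto`.
	pb "github.com/converged-computing/rainbow/pkg/api/v1"
)

func main() {
	// Connect to the rainbow scheduler (no TLS in this prototype sketch).
	conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("could not connect: %v", err)
	}
	defer conn.Close()

	// NewRainbowSchedulerClient is a placeholder for the generated constructor.
	client := pb.NewRainbowSchedulerClient(conn)

	// Periodically ask the scheduler for jobs assigned to this cluster.
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		resp, err := client.RequestJobs(ctx, &pb.RequestJobsRequest{
			Cluster: "keebler",                              // cluster we registered as
			Secret:  "54c4568a-14f2-465f-aa1e-5e6e0e3efd33", // secret returned at registration
			MaxJobs: 3,
		})
		cancel()
		if err != nil {
			log.Printf("request failed: %v", err)
			continue
		}
		log.Printf("received %d jobs to consider", len(resp.Jobs))
	}
}
```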

## Components

- The main server (and, optionally, a client) is implemented in Go, here
- Under [python](python) we also have a client that is intended to run from a flux instance, another scheduler, or anywhere really. We haven't implemented the same server in its entirety because it's assumed that if you plan to run a server, Go is the better choice (and we will provide a container for it). That said, the skeleton is there, but mostly unimplemented.

## Development

### proto

We are using [Protocol Buffers](https://developers.google.com/protocol-buffers/) "Protobuf" to define the API (how the payloads are shared and the methods for communication between client and server). These are defined in [api/v1/sample.proto](api/v1/sample.proto).
You can read more about Protobuf [here](https://github.com/golang/protobuf); I first saw and used it with fluence and am still fairly new to it.

```shell
make proto
```

That will download protoc and needed tools into a local "bin" and then generate the bindings.
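
For orientation, generated gRPC bindings are typically wired into a server along these lines; the `pb` import path, service name, and generated helper names below are illustrative assumptions rather than this project's actual code.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"

	// Placeholder import path for the bindings generated by `make proto`.
	pb "github.com/converged-computing/rainbow/pkg/api/v1"
)

// server is an illustrative implementation of the generated service interface;
// embedding the generated Unimplemented* type keeps it forward compatible.
type server struct {
	pb.UnimplementedRainbowSchedulerServer
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	s := grpc.NewServer()

	// RegisterRainbowSchedulerServer is a placeholder for the generated helper
	// that attaches our implementation to the gRPC server.
	pb.RegisterRainbowSchedulerServer(s, &server{})

	log.Println("server listening:", lis.Addr())
	if err := s.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}
```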

## Getting Started

### Setup

Ensure you have your dependencies:

```bash
make tidy
```

In two terminals, start the server in one:

```bash
make server
```
```console
go run cmd/server/server.go
2024/02/12 19:38:58 creating 🌈️ server...
2024/02/12 19:38:58 ✨️ creating rainbow.db...
2024/02/12 19:38:58 rainbow.db file created
2024/02/12 19:38:58 create cluster table...
2024/02/12 19:38:58 cluster table created
2024/02/12 19:38:58 create jobs table...
2024/02/12 19:38:58 jobs table created
2024/02/12 19:38:58 starting scheduler server: rainbow v0.1.0-draft
2024/02/12 19:38:58 server listening: [::]:50051
```

### Register

And then mock a registration:

```bash
make register
```
```console
go run cmd/rainbow/rainbow.go register
2024/02/12 22:17:43 🌈️ starting client (localhost:50051)...
2024/02/12 22:17:43 registering cluster: keebler
2024/02/12 22:17:43 status: REGISTER_SUCCESS
2024/02/12 22:17:43 secret: 54c4568a-14f2-465f-aa1e-5e6e0e3efd33
2024/02/12 22:17:43 token: 67e0f258-96c3-4d88-8253-287a95653138
```

In the above:

- `token` is what is given to clients so they can submit jobs to your cluster
- `secret` is a secret just for your cluster / instance / place that receives jobs, and it is needed to request and receive them!

You'll see this from the server:

```console
2024/02/12 22:17:43 📝️ received register: keebler
2024/02/12 22:17:43 SELECT count(*) from clusters WHERE name = 'keebler': (0)
2024/02/12 22:17:43 INSERT into clusters (name, token, secret) VALUES ("keebler", "67e0f258-96c3-4d88-8253-287a95653138", "54c4568a-14f2-465f-aa1e-5e6e0e3efd33"): (1)
```

In the above, we provide a cluster name (keebler), it is registered to the database, and a token, secret, and status are returned. Note that if we want to submit a job to the "keebler" cluster, from anywhere, we need this token! Let's try that next.
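
If you wanted to drive the same registration from Go rather than the CLI, a minimal sketch (reusing a connection and client like the polling example above, and assuming placeholder request/response field names) could look like:

```go
// register asks the scheduler to add a cluster, returning the token (handed to
// job submitters) and the secret (kept by the cluster to request its jobs).
// The request/response field names here are illustrative assumptions.
func register(ctx context.Context, client pb.RainbowSchedulerClient, name, sharedSecret string) (token, secret string, err error) {
	resp, err := client.Register(ctx, &pb.RegisterRequest{
		Name:   name,         // e.g. "keebler"
		Secret: sharedSecret, // shared secret required to register
	})
	if err != nil {
		return "", "", err
	}
	return resp.Token, resp.Secret, nil
}
```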

### Submit Job

To submit a job, we need the client `token` associated with a cluster.

```bash
# Look at help
go run ./cmd/rainbow/rainbow.go submit --help
```
```
usage: rainbow submit [-h|--help] [-s|--secret "<value>"] [-n|--nodes
<integer>] [-t|--tasks <integer>] [-c|--command "<value>"]
[--job-name "<value>"] [--host "<value>"] [--cluster-name
"<value>"]

Submit a job to a rainbow scheduler

Arguments:

-h --help Print help information
--token Client token to submit jobs with.. Default:
chocolate-cookies
-n --nodes Number of nodes to request. Default: 1
-t --tasks Number of tasks to request (per node? total?)
-c --command Command to submit. Default: chocolate-cookies
--job-name Name for the job (defaults to first command)
--host Scheduler server address (host:port). Default:
localhost:50051
--cluster-name Name of cluster to register. Default: keebler
```

Let's try doing that.

```bash
go run ./cmd/rainbow/rainbow.go submit --token "712747b7-b2a9-4bea-b630-056cd64856e6" --command hostname
```
```console
2024/02/11 21:43:17 🌈️ starting client (localhost:50051)...
2024/02/11 21:43:17 submit job: hostname
2024/02/11 21:43:17 status:SUBMIT_SUCCESS
```

Hooray! On the server log side we see...

```console
SELECT * from clusters WHERE name LIKE "keebler" LIMIT 1: keebler
2024/02/11 21:43:17 📝️ received job hostname for cluster keebler
```
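
A hypothetical Go version of the same submission, again assuming placeholder message and field names for the generated API and the client setup from the earlier sketch, might look like:

```go
// submit sends a simple job (a command plus resource counts) to the named
// cluster, authenticated with that cluster's client token. The message and
// field names are illustrative assumptions.
func submit(ctx context.Context, client pb.RainbowSchedulerClient, token string) error {
	resp, err := client.SubmitJob(ctx, &pb.SubmitJobRequest{
		Token:   token,      // client token returned at registration
		Cluster: "keebler",  // target cluster
		Name:    "hostname", // job name defaults to the first command word
		Command: "hostname",
		Nodes:   1,
	})
	if err != nil {
		return err
	}
	log.Println("status:", resp.Status)
	return nil
}
```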

Now we have a job in the database, and it's targeted at a specific cluster.
We can next (as the cluster) request to receive some maximum number of jobs. Let's
emulate that.

### Request Jobs

> Also List Jobs

We are now pretending to be the cluster that originally registered, and we want to request some maximum number of jobs
to look at. This doesn't mean we have to run them, but we want to ask for some small set to consider for running.
Right now this just does a query for the count, but in the future we can have actual filters / query parameters
for the jobs (nodes, time, etc.) that we want to ask for. Have some fun and submit a few jobs above, and then request
to see them:

```console
$ go run ./cmd/rainbow/rainbow.go request --request-secret 3cc06871-0990-4dc2-94d5-eec653c5d7a0 --cluster-name keebler --max-jobs 3
2024/02/12 23:29:59 🌈️ starting client (localhost:50051)...
2024/02/12 23:29:59 request jobs: 3
2024/02/12 23:29:59 🌀️ Found 3 jobs!
2024/02/12 23:29:59 1 : {"id":1,"cluster":"keebler","name":"hostname","nodes":1,"tasks":0,"command":"hostname"}
2024/02/12 23:29:59 2 : {"id":2,"cluster":"keebler","name":"sleep","nodes":1,"tasks":0,"command":"sleep 10"}
2024/02/12 23:29:59 3 : {"id":3,"cluster":"keebler","name":"dinosaur","nodes":1,"tasks":0,"command":"dinosaur things"}
```

And on the server side:

```console
2024/02/12 23:27:29 SELECT * from clusters WHERE name LIKE "keebler" LIMIT 1: keebler
2024/02/12 23:27:29 🌀️ requesting 3 max jobs for cluster keebler
```

Note that if you don't define the max jobs (so it is essentially 0) you will get all jobs. This is akin to listing jobs.
Awesome! Next we can put that logic in a flux instance (starting from the Python gRPC client) and then have Flux
accept some number of them. The response back to the rainbow scheduler will be those to accept, which will then be removed from the database. For another day.
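
Since the jobs come back as JSON summaries like the ones shown above, a small Go struct is enough to decode them on the cluster side. This is just a sketch based on the fields visible in that output:

```go
package main

import (
	"encoding/json"
	"log"
)

// Job mirrors the fields visible in the JSON job summaries shown above.
type Job struct {
	ID      int    `json:"id"`
	Cluster string `json:"cluster"`
	Name    string `json:"name"`
	Nodes   int    `json:"nodes"`
	Tasks   int    `json:"tasks"`
	Command string `json:"command"`
}

func main() {
	raw := `{"id":1,"cluster":"keebler","name":"hostname","nodes":1,"tasks":0,"command":"hostname"}`
	var job Job
	if err := json.Unmarshal([]byte(raw), &job); err != nil {
		log.Fatal(err)
	}
	log.Printf("job %d on %s runs %q on %d node(s)", job.ID, job.Cluster, job.Command, job.Nodes)
}
```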


### Accept Jobs

A derivative of the above is to request and accept jobs. This can be done with the example client above by adding `--accept N`.

```console
$ go run ./cmd/rainbow/rainbow.go request --request-secret 3cc06871-0990-4dc2-94d5-eec653c5d7a0 --cluster-name keebler --max-jobs 3 --accept 1
```
```console
2024/02/13 12:29:29 🌀️ Found 3 jobs!
2024/02/13 12:29:29 1 : {"id":1,"cluster":"keebler","name":"hostname","nodes":1,"tasks":0,"command":"hostname"}
2024/02/13 12:29:29 2 : {"id":2,"cluster":"keebler","name":"sleep","nodes":1,"tasks":0,"command":"sleep 10"}
2024/02/13 12:29:29 3 : {"id":3,"cluster":"keebler","name":"dinosaur","nodes":1,"tasks":0,"command":"dinosaur things"}
2024/02/13 12:29:29 ✅️ Accepting 1 jobs!
2024/02/13 12:29:29 1
2024/02/13 12:29:29 status:RESULT_TYPE_SUCCESS
```

What this does is randomly select from the set you receive and send back a response to the server accepting them; the accepted identifiers are then removed from the database. The server shows the following:

```console
2024/02/13 12:29:29 🌀️ accepting 1 for cluster keebler
2024/02/13 12:29:29 DELETE FROM jobs WHERE cluster = 'keebler' AND idJob in (1): (1)
```

The logic you would expect is there: you can't accept more than the number available.
You could try asking for a high max-jobs value again and see that there is one fewer job than before; it was deleted from the database.
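
A cluster-side sketch of that random selection and accept step (reusing the `Job` type from the decoding sketch above, and assuming `math/rand`, the same `context`/`pb` setup, and a placeholder `Accept` RPC) might look like:

```go
// pickAndAccept randomly chooses up to n of the returned jobs and tells the
// scheduler to accept them, so their identifiers get removed from the database.
// The Accept RPC and its fields are illustrative assumptions; requires "math/rand".
func pickAndAccept(ctx context.Context, client pb.RainbowSchedulerClient, jobs []Job, n int) error {
	if n > len(jobs) {
		n = len(jobs) // never accept more than were returned
	}
	ids := make([]int32, 0, n)
	for _, i := range rand.Perm(len(jobs))[:n] {
		ids = append(ids, int32(jobs[i].ID))
	}
	_, err := client.Accept(ctx, &pb.AcceptJobsRequest{
		Cluster: "keebler",
		Jobids:  ids,
	})
	return err
}
```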

## Development

### Build

You can build the binaries:

```console
$ make build
mkdir -p /home/vanessa/Desktop/Code/rainbow/bin
GO111MODULE="on" go build -o /home/vanessa/Desktop/Code/rainbow/bin/rainbow cmd/rainbow/rainbow.go
GO111MODULE="on" go build -o /home/vanessa/Desktop/Code/rainbow/bin/rainbow-scheduler cmd/server/server.go
```

Note that the `rainbow-scheduler` starts the server, and `rainbow` is the set of client commands.

```console
$ ls bin/
protoc-gen-go protoc-gen-go-grpc rainbow rainbow-scheduler
```

They are placed in the local bin, as shown above.

### Python

To build the Python gRPC bindings, ensure you have grpcio-tools installed:

```bash
pip install grpcio-tools
```

Then:

```bash
make python
```

Next, `cd` into [python/v1](python/v1) and follow the README instructions there.


## Container Images

We provide make commands to build:

- **ghcr.io/converged-computing/rainbow-scheduler**: the scheduler (the `rainbow` client and `rainbow-scheduler` binaries in an ubuntu base, intended to be run as the scheduler image)
- **ghcr.io/converged-computing/rainbow-flux**: the client (includes flux) for interacting with a scheduler.

Both images above have both binaries; the second just has flux added. We can add more schedulers or other entities that can
accept jobs as needed. You can build in any of the following ways:

```bash
# both images, default registry
make docker

# scheduler
make docker-ubuntu

# client with flux
make docker-flux

# customize the registry for any command above
REGISTRY=vanessa make docker
```

Further instructions will be added for running these containers in the next round of work - likely we will have a basic kind setup that demonstrates the orchestration.

## TODO

Empty file added docs/.nojekyll
Empty file.
60 changes: 52 additions & 8 deletions docs/README.md
@@ -1,15 +1,59 @@
# Multi-Cluster Proof of Concept
# Rainbow Scheduler

We can design a "tiny" setup of a more production setup as a proof of concept. Namely, we want to show that it's possible to submit jobs (from anywhere) that are directed to run on different clusters. We want to untangle this work from requiring specific workflow tools that might add additional toil or error, and direct development in a direction that makes things ultiamtely harder. That should be fairly easy to do I think.
The rainbow scheduler is a combined scheduler and client to allow for multi-cluster scheduling, meaning submission and management of jobs across environments. It is currently in a prototype state.

![img/rainbow-scheduler.png](img/rainbow-scheduler.png)
## Prototype Design

Our current design does not have a scheduler yet, and simply:

In the above:
- Exposes an API that can take job requests, where a request is a simple command and resources.
- Clusters can register to it, meaning they are allowed to ask for work.
- Users will submit jobs (from anywhere) to the API, targeting a specific cluster (again, no scheduling here)
- The cluster will run a client that periodically checks for new jobs to run.

- The **"scheduler"** can be thought of like a rabbitmq (or other task) queue, but with bells and whistles, and under our control. It will eventually have a scheduler that has high level information about clusters, but to start is just a simple database and endpoints to support job submission and registration. For registration, a secret is required, and then a cluster-specific token sent back for subsequent requests. This will need to be further hardened but is OK for a dummy proof of concept.
- Any **Flux instance** is allowed to hit the register endpoint and request to register with a specific cluster identifier (A or B in the diagram above) and is required to provide the secret. It receives back a token that can be used for subsequent requests. For this first dummy prototype, we will have a simple loop running in the instance that checks the scheduler for jobs assigned to it.
- Any **standalone client** (including the flux instances themselves) can then submit jobs, and request them to be run on any known cluster. This means that instance A can submit to B (and vice versa) and the standalone client can submit to A or B.
This is currently a prototype that demonstrates that we can do a basic interaction from multiple places, and it obviously has a lot of room for improvement.
We can run the client alongside any flux instance that has access to this service (and is given some shared secret).

The reason I want to prototype the above is that we will want a simple design to test with additional compatibility metadata, and (when actual scheduler bindings are ready) we can add a basic graph to the scheduler above. As we develop we can harden the endpoints / authentication, etc.
For more details on the design, see [design.md](design.md)

## Components

- The main server (and, optionally, a client) is implemented in Go, here
- Under [python](https://github.com/converged-computing/rainbow/tree/main/python/v1) we also have a client that is intended to run from a flux instance, another scheduler, or anywhere really. We haven't implemented the same server in its entirety because it's assumed that if you plan to run a server, Go is the better choice (and we will provide a container for it). That said, the skeleton is there, but mostly unimplemented.
- See [examples](https://github.com/converged-computing/rainbow/tree/main/docs/examples) for basic documentation and ways to deploy (containers and Kubernetes with kind, for example).

## Setup

Ensure you have your dependencies:

```bash
make tidy
```

In two terminals, start the server in one:

```bash
make server
```
```console
go run cmd/server/server.go
2024/02/12 19:38:58 creating 🌈️ server...
2024/02/12 19:38:58 ✨️ creating rainbow.db...
2024/02/12 19:38:58 rainbow.db file created
2024/02/12 19:38:58 create cluster table...
2024/02/12 19:38:58 cluster table created
2024/02/12 19:38:58 create jobs table...
2024/02/12 19:38:58 jobs table created
2024/02/12 19:38:58 starting scheduler server: rainbow v0.1.0-draft
2024/02/12 19:38:58 server listening: [::]:50051
```

Note that we also provide [containers](https://github.com/orgs/converged-computing/packages?repo_name=rainbow) for running the scheduler, or a client with Flux. For more advanced examples, continue reading commands below or check out our [examples](https://github.com/converged-computing/rainbow/tree/main/docs/examples).

## Commands

Read more about the commands shown above [here](commands.md#commands).

## Development

Read our [developer guide](developer.md).
41 changes: 41 additions & 0 deletions docs/_coverpage.md
@@ -0,0 +1,41 @@

![logo](img/rainbow-circle-small.png)

# Rainbow Scheduler <small>docs</small>

- Register clusters to accept jobs
- Submit jobs to the rainbow scheduler
- Poll from clusters to accept and run

<style>
section.cover .cover-main > p:last-child a:last-child {
background-color: #ffffff;
color: black !important;
}

.github-corner svg {
color: #fff;
fill: #224852 !important;
}

section.cover .cover-main>p:last-child a {
border: 1px solid #ffffff !important;
color: white !important;
}

section.cover .cover-main {
margin: 20px 16px 0;
}

.cover {
background: linear-gradient(to left bottom, hsl(182.12deg 57.05% 29.22%) 0%,hsl(250.96deg 50.34% 71.57%) 100%) !important;
color: white;
}

.cover-main span {
color: whitesmoke !important;
}
</style>

[GitHub](https://github.com/converged-computing/rainbow)
[Get Started](#rainbow-scheduler)