Skip to content

p-doom/datawhale

Repository files navigation

datawhale

License Discord

Datawhale is an open-source replacement to Datadog and Weights & Biases, combining system- (GPU/CPU/I/O/Networking) and experiment-level (loss, lr, norms, gradients, optimizer state, activations, attention maps) observability (metrics/logs) for AI research.

Datawhale uses an established open-source software stack (Grafana, Loki, Prometheus, node_exporter, dgcm_exporter) and can be seen as a configuration template for this stack.

Datawhale is designed to accommodate workflows from individual researchers working locally or on academic clusters (often without sudo rights) to entire research orgs with clusters in the cloud.

Installation

Datawhale supports single-server as well as multi-server configurations. Remote clusters usually require the multi-server installation path, such that jobs can be monitored across multiple nodes from a central location. Datawhale distinguishes between one server and multiple clients. A familiar analogy are the login (server) and compute (client) nodes on an HPC cluster.

Installing datawhale is as simple as cloning the repository and running setup.sh:

git clone https://github.com/p-doom/datawhale.git
# server installation
bash scripts/setup.sh ROLE=server MODE=standalone
# client installation
bash scripts/setup.sh ROLE=client MODE=standalone

We currently only support the standalone installation path. The server installation downloads prebuilt binaries of Grafana, Loki and Prometheus, while the client installation downloads node_exporter, and, depending on the availability, dcgm_exporter or nvml_exporter. Notably, this does not require package manager access, sudo rights, or cluster-side Docker support, and can thus be run on any (academic|on-prem|cloud) cluster. Docker-based installation support is on the roadmap. We currently only support amd64-linux, but the repository should be easily extendable to other architectures and operating systems.

You can run datawhale using deploy.sh:

bash scripts/deploy.sh ROLE=<server|client> MODE=standalone

About

Observability tooling for AI research. 0 egress.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published