This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

docs: add new policy external check and rework some structure. #105

Merged
merged 2 commits into from Nov 13, 2019
7 changes: 6 additions & 1 deletion README.md
@@ -2,7 +2,12 @@

[![Build Status](https://travis-ci.org/jrasell/sherpa.svg?branch=master)](https://travis-ci.org/jrasell/sherpa) [![Go Report Card](https://goreportcard.com/badge/github.com/jrasell/sherpa)](https://goreportcard.com/report/github.com/jrasell/sherpa) [![GoDoc](https://godoc.org/github.com/jrasell/sherpa?status.svg)](https://godoc.org/github.com/jrasell/sherpa)

Sherpa is a job scaler for [HashiCorp Nomad](https://www.nomadproject.io/) and aims to be highly flexible so it can support a wide range of architectures and budgets.
Sherpa is a highly available, fast, and flexible horizontal job scaler for [HashiCorp Nomad](https://www.nomadproject.io/). It is capable of running in a number of different modes to suit different requirements, and can scale based on Nomad resource metrics or external sources.

### Features
* __Scale jobs based on Nomad resource consumption and external metrics:__ The Sherpa autoscaler can use a mixture of Nomad resource checks and external metric values to make scaling decisions. Both are optional to provide flexibility. Jobs can also be scaled via the CLI and API, either manually or by using webhooks sent from external applications such as Prometheus Alertmanager (see the sketch after this list).
* __Highly available and fault tolerant:__ Sherpa performs leadership locking and quick fail-over, allowing multiple instances to run safely. During availability issues or deployments, Sherpa servers gracefully handle leadership changes, resulting in uninterrupted scaling.
* __Operator friendly:__ Sherpa is designed to be easy to understand and work with as an operator. Scaling state in particular can contain metadata, providing insight into exactly why a scaling activity took place. A simple UI is also available to provide an easy method of checking scaling activities.
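
As an illustration of the webhook-driven workflow, an external system can request a scaling action with a single HTTP call. The sketch below is illustrative only: the address assumes Sherpa's default bind configuration, and the `/v1/scale/out` path and query parameter are assumptions, so check the API documentation for the exact route.
```
# Hypothetical webhook-style request asking Sherpa to scale out the "cache"
# group of the "example" job. The endpoint path and query parameter are
# assumptions for illustration; see the API documentation for the real route.
$ curl -X POST "http://127.0.0.1:8000/v1/scale/out/example?group=cache"
```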

## Download & Install

45 changes: 11 additions & 34 deletions docs/README.md
@@ -1,38 +1,15 @@
# Sherpa Documentation

Sherpa is a fast and flexible job scaler for [HashiCorp Nomad](https://www.nomadproject.io/) capable of running in a number of different modes to suit your needs.

## Table of contents
1. [API](./api) documentation
1. [CLI](./commands) documentation
1. [Sherpa server](./configuration) configuration documentation
1. [Guides](./guides) to provide additional information on Sherpa behaviour and workflows

## Server Run Modes

The Sherpa server can be configured in a number of ways to provide flexibility and scalability across Nomad deployments. For all the configuration options, please take a look at the [server command](./commands/server.md) documentation. The sections below outline the Sherpa server run variations and suggest where each is most viable.

### Autoscaling Run Types

Sherpa can perform autoscaling or act as a proxy for the CLI or other external sources. Each run mode has different pros and cons as detailed below.

#### Scaling Proxy

Sherpa can act as a scaling proxy, taking requests via the scaling API endpoints and performing the required actions. The actions can be triggered by external sources such as [Prometheus AlertManager](https://prometheus.io/docs/alerting/alertmanager/), where rules are configured on telemetry data points. When an alert is triggered, the system can then send a Sherpa API request via webhooks. This solution is the most scalable, delegating the resource analysis to systems designed for this type of work.
Sherpa is a highly available, fast, and flexible horizontal job scaler for [HashiCorp Nomad](https://www.nomadproject.io/). It is capable of running in a number of different modes to suit different requirements, and can scale based on Nomad resource metrics or external sources.

### Key Features
* __Scale jobs based on Nomad resource consumption and external metrics:__ The Sherpa autoscaler can use a mixture of Nomad resource checks and external metric values to make scaling decisions. Both are optional to provide flexibility. Jobs can also be scaled via the CLI and API, either manually or by using webhooks sent from external applications such as Prometheus Alertmanager.
* __Highly available and fault tolerant:__ Sherpa performs leadership locking and quick fail-over, allowing multiple instances to run safely. During availability issues or deployments, Sherpa servers gracefully handle leadership changes, resulting in uninterrupted scaling.
* __Operator friendly:__ Sherpa is designed to be easy to understand and work with as an operator. Scaling state in particular can contain metadata, providing insight into exactly why a scaling activity took place. A simple UI is also available to provide an easy method of checking scaling activities.

#### Built-in Autoscaler

The built-in autoscaler is ideal for smaller, development, or cost-limited setups. It runs on an internal timer and will assess the resource usage of all job groups which have an active scaling policy. It is important to remember that the internal autoscaler will put additional load onto the Nomad servers. This is because analysing the memory and CPU consumption of a job requires X Nomad API calls.

### Policy Run Types

Policies are a method of controlling how and when job groups are autoscaled. When using the built-in autoscaler, strict checking is enabled, which means job groups will only be scaled if they have an associated scaling policy.

#### API Policy Engine

Scaling policies can be written, updated, and deleted via the API and CLI. These policies are then stored in whichever of the available [backends](./guides/policies.md#policy-storage-backend) has been enabled. The in-memory backend is not suitable for any environment other than development, as the policies are lost when Sherpa is stopped. The Consul backend is ideal for non-dev environments, and policies will persist across Sherpa restarts.

#### Nomad Job Meta Policy Engine

Sherpa can also pull scaling policies from Nomad jobs via the [meta stanza](https://www.nomadproject.io/docs/job-specification/meta.html). Job groups that you wish to be scalable should be configured with the appropriate keys and values. Sherpa will automatically read these and update its scaling table when changes occur. It is useful to note that meta policies must be configured inside the job group stanza; if the meta keys are configured at the job level, they will not be applied to all groups within the job.
## Table of contents
1. [API](./api) documentation.
1. [CLI](./commands) documentation.
1. [Sherpa server](./configuration) configuration documentation.
1. [Guides](./guides) provide in-depth information on Sherpa behaviour, configuration, and workflows.
1. [Demos](./demos) provide a number of self-contained examples to run through, allowing for a better understanding of running Sherpa.
5 changes: 5 additions & 0 deletions docs/demos/README.md
@@ -0,0 +1,5 @@
# Sherpa Demos

The Sherpa demos provide a guided way to learn and run Sherpa. Each demo is designed to run through a specific scenario or feature, and is hopefully enjoyable.

* The [quick-start demo](./quick-start.md) is the most basic demo and aims to give users an introduction to Sherpa.
79 changes: 79 additions & 0 deletions docs/demos/quick-start.md
@@ -0,0 +1,79 @@
# Quick Start Demo
This demo aims to have you running Sherpa and scaling a job in under 5 minutes on your local machine. This will allow you to get an understanding of how the application works, and also hopefully a sense of the power Sherpa can provide to your Nomad cluster.

In order to run the demo, you will need [Consul](https://www.consul.io/downloads.html), [Nomad](https://www.nomadproject.io/downloads.html) and [Sherpa](https://github.com/jrasell/sherpa/releases) downloaded and available to run locally. Once you have these available, you should start Consul and Nomad in local dev mode:
```
$ consul agent -dev
$ nomad agent -dev
```

Once the services start up, you can start the Sherpa process. In a production environment you would ideally run Sherpa as a Nomad service; for simplicity, we will run the raw binary here. The startup flags tell Sherpa to perform the following actions:
* use Consul as its storage backend
* run the internal autoscaler which will perform resource utilisation checks and trigger scaling if required
* allow policies to be configured via the API
* run the Sherpa web UI

Once started, the Sherpa API will be available at http://127.0.0.1:8000, the default bind configuration.
```
$ sherpa server --storage-consul-enabled --autoscaler-enabled --policy-engine-api-enabled --log-level=debug --ui
```
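
Before continuing, you can optionally check that the server is responding. This is a sketch only: the `/v1/system/health` path is an assumption, so consult the API documentation for the exact endpoint.
```
# Hypothetical health check against the default bind address; the
# /v1/system/health path is an assumption based on the API documentation.
$ curl http://127.0.0.1:8000/v1/system/health
```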

With Sherpa successfully started, we need to run a job on Nomad which we can scale. Nomad provides an [init](https://www.nomadproject.io/docs/commands/job/init.html) utility to write out an example job which is ideal for this situation.
```
$ nomad init
```

The job is configured to run only 1 instance of the Redis container. In order to raise this count and allow us to demonstrate scaling in, we can use this handy sed command:
```
$ sed "s/count = 1/count = 4/g" example.nomad > example-new.nomad
```

We can then register the job on the Nomad cluster:
```
$ nomad run example-new.nomad
```
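
You can confirm the job registered with the desired count using the standard Nomad CLI before continuing:
```
# Show the status of the example job; the cache task group should report
# 4 running allocations once the deployment completes.
$ nomad job status example
```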

Once the Nomad job is running, we need to configure a scaling policy for the job and group so the autoscaler has something to evaluate. The policy mostly uses default values, apart from `MinCount`, which we set to a low value. You can write the policy using the Sherpa CLI or API, depending on your preference.
```
$ curl -X POST --data '{"Enabled":true,"MinCount":1}' http://127.0.0.1:8000/v1/policy/example/cache
```
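
To confirm the policy was stored, you can read it back. Assuming the policy endpoint also serves GET requests on the same path (an assumption; see the API documentation), this might look like:
```
# Hypothetical read-back of the policy written above; GET support on this
# exact path is an assumption based on REST symmetry with the POST above.
$ curl http://127.0.0.1:8000/v1/policy/example/cache
```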

From this point on the autoscaler will run every 60s (the default period) and assess the resource consumption of the example job. If it believes the job groups are over- or under-utilised it will suggest a scaling action. If the scaling action does not break any configured thresholds, the updated job specification will be submitted to Nomad. Over the next 3 minutes you should see 3 scale-in events. You can track these via the Sherpa logs, or via the Sherpa UI which is available at http://localhost:8000/ui.
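
A simple way to watch the scale-in events land, alongside the Sherpa logs and UI, is to poll the job status and watch the allocation count step down from 4 towards the `MinCount` of 1:
```
# Poll the example job every 30 seconds; the number of running allocations
# for the cache group should decrease as each scale-in event is actioned.
$ watch -n 30 nomad job status example
```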

### Important Log Entries

Understanding the Sherpa logs also helps in understanding the process and feature set. Below are a number of excerpts from the Sherpa logs, each with a short description explaining it.

Sherpa performs leadership elections to ensure only one Sherpa instance performs critical tasks such as running the autoscaler. Shortly after startup you should see that the Sherpa instance has obtained cluster leadership and started the protected sub-processes.
```
10:38AM INF HTTP server successfully listening addr=127.0.0.1:8000
10:38AM DBG server received leader update message leader-msg="obtained leadership"
10:38AM INF started scaling state garbage collector handler
10:38AM INF starting Sherpa internal auto-scaling engine
```

The autoscaler logs information about the calculations it makes, to help in understanding the internals. In this log we see the usage percentages are low and that the autoscaler suggests we scale the job group in. This is successfully triggered, as the scaling request does not break any configured thresholds.
```
4:17PM DBG resource utilisation calculation cpu-usage-percentage=8 group=cache job=example mem-usage-percentage=0
4:17PM DBG added group scaling request job=example scaling-req={"count":1,"direction":"in","group":"cache"}
4:17PM INF successfully triggered autoscaling of job evaluation-id=2680e261-d651-3687-4404-fe6f674a50dd id=893e075d-a64d-4415-bd23-c6f73aa4f98f job=example
```

In this set of log lines, we can see that the autoscaler again suggests that the job group be scaled in. It is found, however, that scaling the job group in would break the minimum threshold, so the request is not actioned.
```
10:40AM DBG resource utilisation calculation cpu-usage-percentage=11 group=cache job=example mem-usage-percentage=0
10:40AM DBG added group scaling request job=example scaling-req={"count":1,"direction":"in","group":"cache"}
10:40AM DBG scaling action will break job group minimum threshold group=cache job=example
```

When shutting down, the Sherpa server will perform a number of safety tasks. These include waiting for any in-flight autoscaling processes to finish, and shutting down the leadership process, allowing another instance to take over quickly.
```
4:17PM DBG autoscaler still has in-flight workers, will continue to check
4:17PM DBG exiting autoscaling thread as a result of shutdown request
4:17PM INF successfully drained autoscaler worker pool
4:17PM INF shutting down leadership handler cluster-member-id=ab9e3278-5a18-4965-9e83-ce97e9423e8f cluster-name=sherpa-2e651291-161d-4758-a12d-72294088214c
4:17PM DBG shutting down periodic leader refresh cluster-member-id=ab9e3278-5a18-4965-9e83-ce97e9423e8f cluster-name=sherpa-2e651291-161d-4758-a12d-72294088214c
4:17PM DBG shutting down leader elections cluster-member-id=ab9e3278-5a18-4965-9e83-ce97e9423e8f cluster-name=sherpa-2e651291-161d-4758-a12d-72294088214c
4:17PM INF successfully shutdown server and sub-processes
4:17PM INF HTTP server has been shutdown: http: Server closed
```
92 changes: 10 additions & 82 deletions docs/guides/README.md
@@ -1,84 +1,12 @@
# Sherpa Guides

Welcome to the Sherpa guides. The guides provide examples for common Sherpa workflows and actions for both users and operators of Nomad clusters running Sherpa.

## Quick Start Demo

This demo aims to have you running Sherpa and scaling a job in under 5 minutes on your local machine. This will allow you to get an understanding of how the application works, and also hopefully a sense of the power Sherpa can provide to your Nomad cluster.

In order to run the demo, you will need [Consul](https://www.consul.io/downloads.html), [Nomad](https://www.nomadproject.io/downloads.html) and [Sherpa](https://github.com/jrasell/sherpa/releases) downloaded and available to run locally. Once you have these available, you should start Consul and Nomad in local dev mode:
```
$ consul agent -dev
$ nomad agent -dev
```

Once the services start up, you can start the Sherpa process. In a production environment you would ideally run Sherpa as a Nomad service; for simplicity, we will run the raw binary here. The startup flags tell Sherpa to perform the following actions:
* use Consul as its storage backend
* run the internal autoscaler which will perform resource utilisation checks and trigger scaling if required
* allow policies to be configured via the API
* run the Sherpa web UI

Once started, the Sherpa API will be available at http://127.0.0.1:8000, the default bind configuration.
```
$ sherpa server --storage-consul-enabled --autoscaler-enabled --policy-engine-api-enabled --log-level=debug --ui
```

With Sherpa successfully started, we need to run a job on Nomad which we can scale. Nomad provides an [init](https://www.nomadproject.io/docs/commands/job/init.html) utility to write out an example job which is ideal for this situation.
```
$ nomad init
```

The job is configured to run only 1 instance of the Redis container. In order to raise this count and allow us to demonstrate scaling in, we can use this handy sed command:
```
$ sed "s/count = 1/count = 4/g" example.nomad > example-new.nomad
```

We can then register the job on the Nomad cluster:
```
$ nomad run example-new.nomad
```

Once the Nomad job is running, we need to configure a scaling policy for the job and group so the autoscaler has something to evaluate. The policy mostly uses default values, apart from `MinCount`, which we set to a low value. You can write the policy using the Sherpa CLI or API, depending on your preference.
```
$ curl -X POST --data '{"Enabled":true,"MinCount":1}' http://127.0.0.1:8000/v1/policy/example/cache
```

From this point on the autoscaler will run every 60s (the default period) and assess the resource consumption of the example job. If it believes the job groups are over- or under-utilised it will suggest a scaling action. If the scaling action does not break any configured thresholds, the updated job specification will be submitted to Nomad. Over the next 3 minutes you should see 3 scale-in events. You can track these via the Sherpa logs, or via the Sherpa UI which is available at http://localhost:8000/ui.

### Important Log Entries

Understanding the Sherpa logs also helps in understanding the process and feature set. Below are a number of excerpts from the Sherpa logs, each with a short description explaining it.

Sherpa performs leadership elections to ensure only one Sherpa instance performs critical tasks such as running the autoscaler. Shortly after startup you should see that the Sherpa instance has obtained cluster leadership and started the protected sub-processes.
```
10:38AM INF HTTP server successfully listening addr=127.0.0.1:8000
10:38AM DBG server received leader update message leader-msg="obtained leadership"
10:38AM INF started scaling state garbage collector handler
10:38AM INF starting Sherpa internal auto-scaling engine
```

The autoscaler logs information about the calculations it makes, to help in understanding the internals. In this log we see the usage percentages are low and that the autoscaler suggests we scale the job group in. This is successfully triggered, as the scaling request does not break any configured thresholds.
```
4:17PM DBG resource utilisation calculation cpu-usage-percentage=8 group=cache job=example mem-usage-percentage=0
4:17PM DBG added group scaling request job=example scaling-req={"count":1,"direction":"in","group":"cache"}
4:17PM INF successfully triggered autoscaling of job evaluation-id=2680e261-d651-3687-4404-fe6f674a50dd id=893e075d-a64d-4415-bd23-c6f73aa4f98f job=example
```

In this set of log lines, we can see that the autoscaler again suggests that the job group be scaled in. It is found, however, that scaling the job group in would break the minimum threshold, so the request is not actioned.
```
10:40AM DBG resource utilisation calculation cpu-usage-percentage=11 group=cache job=example mem-usage-percentage=0
10:40AM DBG added group scaling request job=example scaling-req={"count":1,"direction":"in","group":"cache"}
10:40AM DBG scaling action will break job group minimum threshold group=cache job=example
```

When shutting down, the Sherpa server will perform a number of safety tasks. These include waiting for any in-flight autoscaling processes to finish, and shutting down the leadership process, allowing another instance to take over quickly.
```
4:17PM DBG autoscaler still has in-flight workers, will continue to check
4:17PM DBG exiting autoscaling thread as a result of shutdown request
4:17PM INF successfully drained autoscaler worker pool
4:17PM INF shutting down leadership handler cluster-member-id=ab9e3278-5a18-4965-9e83-ce97e9423e8f cluster-name=sherpa-2e651291-161d-4758-a12d-72294088214c
4:17PM DBG shutting down periodic leader refresh cluster-member-id=ab9e3278-5a18-4965-9e83-ce97e9423e8f cluster-name=sherpa-2e651291-161d-4758-a12d-72294088214c
4:17PM DBG shutting down leader elections cluster-member-id=ab9e3278-5a18-4965-9e83-ce97e9423e8f cluster-name=sherpa-2e651291-161d-4758-a12d-72294088214c
4:17PM INF successfully shutdown server and sub-processes
4:17PM INF HTTP server has been shutdown: http: Server closed
```
Welcome to the Sherpa guides. The guides are designed to provide examples for common Sherpa workflows and actions, for both users and operators of Nomad clusters running Sherpa.

## Table of contents
1. [Storage](./storage.md) details the different backends available for storing Sherpa state.
1. [High availability](./high-availability.md) details running Sherpa in a highly available manner using leadership locking.
1. [Scaling policies](./policies.md) details the policies which control the core scaling functionality of Sherpa.
1. [Autoscaler](./autoscaler.md) details the process which assesses whether a job group requires scaling, based on the metrics and thresholds configured within the scaling policy.
1. [Scaling state](./scaling-state.md) details the state stored as a result of a scaling activity.
1. [Web UI](./ui.md) details the simple user interface available for Sherpa.
1. [Telemetry](./telemetry.md) details all available metric data-points for Sherpa and their meanings.