Add terraform scripts to support alibaba cloud ACK deployment (#436)
* Add aliyun ACK module

Signed-off-by: Aylei <rayingecho@gmail.com>

* Add terraform scripts to deploy tidb operator and tidb cluster to
Alibaba Cloud Kubernetes

* Fix empty vswitch ids and in-consistent region

Signed-off-by: Aylei <rayingecho@gmail.com>

* Fix node taints and node labels

Signed-off-by: Aylei <rayingecho@gmail.com>

* Fix attach node, skip formatting local SSD

Signed-off-by: Aylei <rayingecho@gmail.com>

* Add cloud shell hint and fix helm trigger

Signed-off-by: Aylei <rayingecho@gmail.com>

* Address review comments

Signed-off-by: Aylei <rayingecho@gmail.com>

* Set pd and tikv storage based on instance type
aylei authored and tennix committed May 6, 2019
1 parent 6598b4d commit eebd686
Showing 23 changed files with 1,580 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -54,6 +54,8 @@ Choose one of the following tutorials:

* [Deploy TiDB by launching an AWS EKS cluster](./docs/aws-eks-tutorial.md)

* [Deploy TiDB Operator and TiDB Cluster on Alibaba Cloud Kubernetes](./deploy/alicloud/README.md)

* [Deploy TiDB in the minikube cluster](./docs/minikube-tutorial.md)

## User guide
6 changes: 6 additions & 0 deletions deploy/alicloud/.gitignore
@@ -0,0 +1,6 @@
.terraform/
credentials/
terraform.tfstate
terraform.tfstate.backup
.terraform.tfstate.lock.info
rendered/
103 changes: 103 additions & 0 deletions deploy/alicloud/README-CN.md
@@ -0,0 +1,103 @@
# Deploy TiDB Operator and a TiDB cluster on Alibaba Cloud

## Requirements

- [aliyun-cli](https://github.com/aliyun/aliyun-cli) >= 3.0.15, with [aliyun-cli configured](https://www.alibabacloud.com/help/doc-detail/90766.htm?spm=a2c63.l28256.a3.4.7b52a893EFVglq)
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl) >= 1.12
- [helm](https://github.com/helm/helm/blob/master/docs/install.md#installing-the-helm-client) >= 2.9.1
- [jq](https://stedolan.github.io/jq/download/) >= 1.6
- [terraform](https://learn.hashicorp.com/terraform/getting-started/install.html) 0.11.*

> You can use Alibaba Cloud's [Cloud Shell](https://shell.aliyun.com) service for these steps; all of the tools are preinstalled and configured there.

## Overview

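With the default configuration, the scripts create:

- A new VPC
- An ECS instance as a bastion host
- A managed ACK (Alibaba Cloud Kubernetes) cluster with the following worker nodes:
  - An auto-scaling group of 2 ECS instances (1 core, 1 GB). The default auto-scaling group of a managed Kubernetes cluster must contain at least two instances, which host system services such as CoreDNS
  - An auto-scaling group of 3 `ecs.i2.xlarge` instances for PD
  - An auto-scaling group of 3 `ecs.i2.2xlarge` instances for TiKV
  - An auto-scaling group of 2 ECS instances (16 cores, 32 GB) for TiDB
  - An auto-scaling group of 1 ECS instance (4 cores, 8 GB) for the monitoring components
- A 500 GB cloud disk for storing monitoring data

All instances outside the default auto-scaling group are deployed across availability zones. An auto-scaling group keeps the number of healthy instances equal to the desired count, so when a node or even an entire availability zone fails, it automatically creates new instances to keep the service available.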

## Setup

Set the target region and Alibaba Cloud credentials (you can also enter them when the `terraform` command prompts for them):
```shell
export TF_VAR_ALICLOUD_REGION=<YOUR_REGION>
export TF_VAR_ALICLOUD_ACCESS_KEY=<YOUR_ACCESS_KEY>
export TF_VAR_ALICLOUD_SECRET_KEY=<YOUR_SECRET_KEY>
```

Install with Terraform:

```shell
$ git clone https://github.com/pingcap/tidb-operator
$ cd tidb-operator/deploy/alicloud
$ terraform init
$ terraform apply
```

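The installation takes about 5 to 10 minutes. When it completes, it prints key information about the cluster (run `terraform output` to see it again). You can then operate the cluster with `kubectl` and `helm`: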

```shell
$ export KUBECONFIG=$PWD/credentials/kubeconfig_<cluster_name>
$ kubectl version
$ helm ls
```

Then connect to the TiDB cluster through the bastion host to test it:

```shell
$ ssh -i credentials/bastion-key.pem root@<bastion_ip>
$ mysql -h <tidb_slb_ip> -P <tidb_port> -u root
```

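## Upgrade the TiDB cluster

Set the `tidb_version` variable in `variables.tf` and run `terraform apply` to upgrade.

## Scale the TiDB cluster horizontally

Set `tikv_count` or `tidb_count` in `variables.tf` and run `terraform apply` to scale the TiDB cluster horizontally.

## Destroy the cluster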

```shell
$ terraform destroy
```

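> Note: the cloud disk mounted by the monitoring components must be deleted manually.

## Monitoring

Visit `<monitor_endpoint>` to view the Grafana dashboards.

> For security, if you already have a VPN for the VPC or plan to set one up, it is strongly recommended to set `monitor_slb_network_type` to `intranet` to block public access to the monitoring service.

## Customize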

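By default, the Terraform scripts create a new VPC. To use an existing VPC, set `vpc_id` in `variables.tf`. Note that with an existing VPC, Kubernetes nodes are not deployed in availability zones that have no vswitch.

For security, the SLB of the TiDB service is exposed only to the intranet, so the default configuration also creates a bastion host for operations. The bastion host has mysql-cli and sysbench installed for use and testing. If you do not need a bastion host, turn it off with the `create_bastion` variable in `variables.tf`.

Instance specifications can be defined in two ways:

1. By declaring the instance type name
2. By declaring the instance configuration, such as CPU core count and memory size

Because Alibaba Cloud offers different instance types in different regions, and some types can be sold out, we recommend the second way for a more portable definition. You can find the relevant settings in `variables.tf`.

As a special case, because PD and TiKV nodes require local SSD storage, the scripts do not allow declaring the instance type name for PD and TiKV directly. Instead, set `*_instance_type_family` to choose the instance type family for PD or TiKV (only the three families with local SSDs are allowed), then filter for a suitable type by memory size.

For more customization options, see the `variables.tf` file in the project.

## Limitations

Currently, the pod CIDR, service CIDR, and node instance types cannot be changed after the cluster is created.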
109 changes: 109 additions & 0 deletions deploy/alicloud/README.md
@@ -0,0 +1,109 @@
# Deploy TiDB Operator and TiDB Cluster on Alibaba Cloud Kubernetes

[中文](README-CN.md)

## Requirements

- [aliyun-cli](https://github.com/aliyun/aliyun-cli) >= 3.0.15 and [configure aliyun-cli](https://www.alibabacloud.com/help/doc-detail/90766.htm?spm=a2c63.l28256.a3.4.7b52a893EFVglq)
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl) >= 1.12
- [helm](https://github.com/helm/helm/blob/master/docs/install.md#installing-the-helm-client) >= 2.9.1
- [jq](https://stedolan.github.io/jq/download/) >= 1.6
- [terraform](https://learn.hashicorp.com/terraform/getting-started/install.html) 0.11.*

> You can use the Alibaba Cloud [Cloud Shell](https://shell.aliyun.com) service, which has all the tools pre-installed and properly configured.

## Overview

The default setup will create:

- A new VPC
- An ECS instance as a bastion machine
- A managed ACK (Alibaba Cloud Kubernetes) cluster with the following ECS instance worker nodes:
  - An auto-scaling group of 2 instances (1c1g) as the mandatory ACK workers for system services such as CoreDNS
  - An auto-scaling group of 3 `ecs.i2.xlarge` instances for PD
  - An auto-scaling group of 3 `ecs.i2.2xlarge` instances for TiKV
  - An auto-scaling group of 2 instances (16c32g) for TiDB
  - An auto-scaling group of 1 instance (4c8g) for monitoring components

In addition, the monitoring node mounts a 500 GB cloud disk as its data volume. All instances except the mandatory ACK workers span multiple availability zones to provide cross-AZ high availability.

Each auto-scaling group maintains the desired number of healthy instances, so the cluster can recover automatically from a node failure or even an availability zone failure.

## Setup

Configure the target region and credentials (you can also enter these values when the `terraform` command prompts for them):
```shell
export TF_VAR_ALICLOUD_REGION=<YOUR_REGION>
export TF_VAR_ALICLOUD_ACCESS_KEY=<YOUR_ACCESS_KEY>
export TF_VAR_ALICLOUD_SECRET_KEY=<YOUR_SECRET_KEY>
```
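
Before applying the stack, it can help to confirm that the required tools are on `PATH` and that the three variables above are set. The following is a minimal sketch, not part of the scripts in this directory: the helper names `missing_tools` and `unset_tf_vars` are hypothetical, and tool versions are not checked.

```shell
# Hypothetical pre-flight helpers; they only report problems and change nothing.

# Print the name of every given tool that is not found on PATH.
# Versions are NOT checked.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print each TF_VAR_* variable from this section that is unset or empty.
unset_tf_vars() {
    for name in TF_VAR_ALICLOUD_REGION TF_VAR_ALICLOUD_ACCESS_KEY TF_VAR_ALICLOUD_SECRET_KEY; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}
```

Running `missing_tools aliyun kubectl helm jq terraform` and `unset_tf_vars` should both print nothing when the environment is ready. Terraform maps an environment variable named `TF_VAR_<name>` to the input variable `<name>`, which is why any variable left unset here is asked for at the prompt instead.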

Apply the stack:

```shell
$ git clone https://github.com/pingcap/tidb-operator
$ cd tidb-operator/deploy/alicloud
$ terraform init
$ terraform apply
```

`terraform apply` takes 5 to 10 minutes to create the whole stack. Once it completes, you can interact with the ACK cluster using `kubectl` and `helm`:

```shell
$ export KUBECONFIG=$PWD/credentials/kubeconfig_<cluster_name>
$ kubectl version
$ helm ls
```
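
If you do not remember the cluster name, the kubeconfig path can be located by its prefix. This is a small sketch that assumes the file lives under `credentials/` as shown above; `find_kubeconfig` is a hypothetical helper, and it deliberately refuses to guess when several kubeconfig files are present.

```shell
# Print the path of the kubeconfig file under <dir>/credentials, but only
# when exactly one file matches; otherwise print nothing and return non-zero.
# <dir> defaults to the current directory.
find_kubeconfig() {
    dir=${1:-$PWD}
    # Reuse the function's positional parameters to hold the glob matches.
    set -- "$dir"/credentials/kubeconfig_*
    if [ "$#" -eq 1 ] && [ -f "$1" ]; then
        echo "$1"
    else
        return 1
    fi
}
```

From `deploy/alicloud`, something like `export KUBECONFIG=$(find_kubeconfig)` would then replace typing the cluster name by hand.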

Then you can connect to the TiDB cluster via the bastion instance:

```shell
$ ssh -i credentials/bastion-key.pem root@<bastion_ip>
$ mysql -h <tidb_slb_ip> -P <tidb_port> -u root
```

## Monitoring

Visit `<monitor_endpoint>` to view the Grafana dashboards.

> It is strongly recommended to set `monitor_slb_network_type` to `intranet` for security if you already have a VPN connected to your VPC or plan to set one up.

## Upgrade TiDB cluster

To upgrade the TiDB cluster, set the `tidb_version` variable in `variables.tf` to a higher version and run `terraform apply`.
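
Since the new value is expected to be higher than the current one, a quick comparison before applying can catch an accidental downgrade. The sketch below is only an illustration: `version_gt` is a hypothetical helper, it relies on the `-V` (version sort) option of GNU `sort`, and it does not order pre-release suffixes such as `-rc.1` the way semantic versioning does. The version strings in the comment are made-up examples.

```shell
# Succeed when version $1 is strictly higher than version $2.
# Assumes GNU sort with -V (version sort) is available.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example with made-up versions:
#   version_gt v2.1.10 v2.1.9 && echo "ok to upgrade"
```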

## Scale TiDB cluster

To scale the TiDB cluster, set `tikv_count` or `tidb_count` in `variables.tf` to the desired count, and then run `terraform apply`.

## Destroy

```shell
$ terraform destroy
```

> Note: After destroying the cluster, you have to manually delete the cloud disk used by the monitoring node if you no longer need it.

## Customize

By default, the Terraform scripts create a new VPC. To use an existing VPC instead, set `vpc_id`. Note that with an existing VPC, Kubernetes nodes are only created in availability zones that already have a vswitch.

An ECS instance is also created by default as a bastion machine for connecting to the TiDB cluster, because the TiDB service is only exposed to the intranet. The bastion instance has mysql-cli and sysbench installed to help you use and test TiDB.

If you do not need to access TiDB from the internet, you can disable the creation of the bastion instance by setting `create_bastion` to false in `variables.tf`.

The worker node instance types are also configurable. There are two ways to configure them:

1. By specifying the instance type ID
2. By specifying capacity, such as the instance CPU count and memory size

Because Alibaba Cloud offers different instance types in different regions, it is recommended to specify the capacity instead of a particular type. You can configure both in `variables.tf`; note that an instance type overrides the capacity settings.

PD and TiKV instances are an exception: because PD and TiKV require local SSDs, you cannot specify an instance type for them. Instead, choose a type family among `ecs.i1`, `ecs.i2` and `ecs.i2g`, each of which has one or more local NVMe SSDs, and select a type within that family by specifying `instance_memory_size`.
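
If the family name comes from a script or a CI variable, it can be checked against the three allowed values before it reaches Terraform. This is a minimal sketch; `is_local_ssd_family` is a hypothetical helper, and the list it accepts is copied from the paragraph above rather than queried from Alibaba Cloud.

```shell
# Succeed only for the type families listed above as having local NVMe SSDs.
is_local_ssd_family() {
    case "$1" in
        ecs.i1|ecs.i2|ecs.i2g) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_local_ssd_family ecs.i2 || echo "unsupported family"` prints nothing, while a general-purpose family would be rejected.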

For more customization options, please refer to `variables.tf`.

## Limitations

You cannot change the pod CIDR, service CIDR, or worker instance types once the cluster is created.

40 changes: 40 additions & 0 deletions deploy/alicloud/ack/data.tf
@@ -0,0 +1,40 @@
data "alicloud_zones" "all" {
  network_type = "Vpc"
}

data "alicloud_vswitches" "default" {
  vpc_id = "${var.vpc_id}"
}

data "alicloud_instance_types" "default" {
  availability_zone = "${lookup(data.alicloud_zones.all.zones[0], "id")}"
  cpu_core_count = "${var.default_worker_cpu_core_count}"
}

# Workaround map to list transformation, see stackoverflow.com/questions/43893295
data "template_file" "vswitch_id" {
  count = "${var.vpc_id == "" ? 0 : length(data.alicloud_vswitches.default.vswitches)}"
  template = "${lookup(data.alicloud_vswitches.default.0.vswitches[count.index], "id")}"
}

# Get cluster bootstrap token
data "external" "token" {
  depends_on = ["alicloud_cs_managed_kubernetes.k8s"]

  # Terraform use map[string]string to unmarshal the result, transform the json to conform
  program = ["bash", "-c", "aliyun --region ${var.region} cs POST /clusters/${alicloud_cs_managed_kubernetes.k8s.id}/token --body '{\"is_permanently\": true}' | jq \"{token: .token}\""]
}

data "template_file" "userdata" {
  template = "${file("${path.module}/templates/user_data.sh.tpl")}"
  count = "${length(var.worker_groups)}"

  vars {
    pre_userdata = "${lookup(var.worker_groups[count.index], "pre_userdata", var.group_default["pre_userdata"])}"
    post_userdata = "${lookup(var.worker_groups[count.index], "post_userdata", var.group_default["post_userdata"])}"
    open_api_token = "${lookup(data.external.token.result, "token")}"
    node_taints = "${lookup(var.worker_groups[count.index], "node_taints", var.group_default["node_taints"])}"
    node_labels = "${lookup(var.worker_groups[count.index], "node_labels", var.group_default["node_labels"])}"
    region = "${var.region}"
  }
}
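
The `jq` filter in the `external` data source above exists because Terraform expects the program to print a single JSON object whose values are all strings. A rough sketch of that reduction in isolation follows; `flatten_token` is a hypothetical wrapper, and the sample response in the comment is made up for illustration rather than taken from the real API.

```shell
# Reduce a JSON response on stdin to a flat object with only the "token"
# field, which is the shape the external data source can unmarshal.
# -c prints the result on one line.
flatten_token() {
    jq -c '{token: .token}'
}

# Example with a made-up response:
#   echo '{"token": "abc123", "expired": 1557100000}' | flatten_token
```

Non-string values, such as the numeric field in the sample, would otherwise make the data source fail to decode the result, so dropping everything except `token` keeps the output within that contract.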