How to add a new custom table on ClickHouse
Skip this part if you already have access to the PyTorch Dev Infra ClickHouse cluster on https://console.clickhouse.cloud
For metamates, go to https://console.clickhouse.cloud and log in with your Meta email. The portal uses SSO, so you just need to follow the steps in your browser to request access. We grant read-only access by default.
The first thing to do is to take a look at https://clickhouse.com/docs/en/sql-reference/statements/create/table to get familiar with CH SQL syntax. Note that there are several available databases on our cluster; some of the most important ones are:
- The `default` database, which includes all GitHub events.
- The `benchmark` database, for all benchmark and metrics data.
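Before designing a new schema, it can help to browse what already exists. A few quick queries you can run in the CH Cloud console (the database and table names below are the ones mentioned on this page):

```sql
-- List the databases visible to your account
SHOW DATABASES;

-- List tables in the benchmark database
SHOW TABLES FROM benchmark;

-- Inspect the schema of an existing table
SHOW CREATE TABLE benchmark.oss_ci_benchmark_v3;
```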
If you need a new database, please reach out to us via https://fb.workplace.com/groups/4571909969591489 (for metamates) or create an issue and book an OH with us at https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours (for external partners).
All ClickHouse tables are backed by either S3 for immutable records like metrics or DynamoDB for mutable records like GitHub events. In both cases, you will need to:
- Write down your `CREATE TABLE` query with the schema and submit it for review, i.e. https://github.com/pytorch/test-infra/pull/5839. Once it's approved, you can create the table yourself using the CH Cloud console if you have the necessary permission, or ping the reviewer to create it for you.
  - An important note is that a table backed by S3 requires a `_meta Tuple(bucket String, key String)` column, which is used later by the replicator to write the S3 path.
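As a sketch of what such a `CREATE TABLE` query can look like, here is a minimal S3-backed table with the required `_meta` column. The table name, the other columns, and the ordering key are hypothetical placeholders, not a real schema from our cluster:

```sql
-- Hypothetical example of an S3-backed table definition
CREATE TABLE benchmark.my_new_table (
    `timestamp` UInt64,
    `name` String,
    `value` Float32,
    -- Required for S3-backed tables: the replicator writes the S3 path here
    `_meta` Tuple(bucket String, key String)
) ENGINE = MergeTree
ORDER BY (name, timestamp);
```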
- For immutable records on S3, make sure that the workflow that uploads the data has the permission to do so with the following process:
  - Consult with PyTorch Dev Infra if you need a new S3 bucket, and submit a PR to create one at https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/runners/s3_bucket.tf if needed.
  - There are two routes for the next step; you only need one. The first approach is a bit simpler.
    - Submit two PRs to grant the runner write permission to the bucket: one for Meta runners, i.e. https://github.com/pytorch-labs/pytorch-gha-infra/pull/533, and one for LF runners, i.e. https://github.com/pytorch/ci-infra/pull/296.
    - The alternative approach is to create a new OIDC role, or edit an existing one, to grant the same permission, i.e. https://github.com/pytorch-labs/pytorch-gha-infra/pull/358. Then the new role can be used in your workflow, i.e. https://github.com/pytorch/executorch/pull/2449.
  - Grant CH the permission to read from the bucket, i.e. https://github.com/pytorch-labs/pytorch-gha-infra/pull/536. This is not needed if the bucket is public.
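Once CH has read access, you can verify it from the CH Cloud console with the `s3` table function before wiring up the replicator. This is a sketch: the bucket and path below are placeholders, and a private bucket would additionally need `extra_credentials` with the role ARN (the one used on our cluster appears in the full example at the end of this page):

```sql
-- Sanity check: can ClickHouse read objects from the bucket?
-- Bucket and path are hypothetical placeholders.
select count()
from s3(
    'https://my-bucket.s3.amazonaws.com/path/to/data/*.json',
    'JSONEachRow'
);
```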
- For mutable records on DynamoDB, a similar process applies, but we will need to create a new DynamoDB table instead:
  - Submit a PR to create a new DynamoDB table at https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/runners/dynamo.tf; remember to enable streaming by setting `stream_enabled` to `true`.
  - Submit a PR to create a new OIDC role with the permission to write to the table, i.e. https://github.com/pytorch-labs/pytorch-gha-infra/pull/422.
  - Use the new role in your workflow, i.e. https://github.com/pytorch/pytorch/pull/129544, and start writing into DynamoDB.
After the records are available on S3 or DynamoDB, the next step is to update the replicator lambdas to make them available on the corresponding CH table. There are two replicators at the moment:
- The S3 replicator is at https://github.com/pytorch/test-infra/tree/main/aws/lambda/clickhouse-replicator-s3. Here is an example https://github.com/pytorch/test-infra/pull/5921, which defines a mapping from the S3 path to the corresponding CH table. The S3 replicator, however, won't delete records on CH because we assume that they are immutable.
- Similarly, the DynamoDB replicator is at https://github.com/pytorch/test-infra/tree/main/aws/lambda/clickhouse-replicator-dynamo with a mapping from DynamoDB to the CH table. This replicator handles all insert, update, and delete events from DynamoDB.
This part is not yet in Terraform, so it is currently done manually via the AWS Cloud console by going to the lambda's Configuration / Triggers tab. Please reach out to the team if you need help here.
You can manually test the insert on CH Cloud console to make sure that the replicator works. Here is an example query for S3:
```sql
insert into
  benchmark.oss_ci_benchmark_v3
select
  *,
  (
    'ossci-benchmarks',
    'v3/pytorch/pytorch/11862562618/33067186757/add_loop_inductor_dynamic_gpu.json'
  ) as _meta
from
  s3(
    'https://ossci-benchmarks.s3.amazonaws.com/v3/pytorch/pytorch/11862562618/33067186757/add_loop_inductor_dynamic_gpu.json',
    'JSONEachRow',
    '
      `timestamp` UInt64,
      `schema_version` String,
      `name` String,
      `repo` String,
      `head_branch` String,
      `head_sha` String,
      `workflow_id` UInt64,
      `run_attempt` UInt32,
      `job_id` UInt64,
      `servicelab_experiment_id` UInt64,
      `servicelab_trial_id` UInt64,
      `runners` Array(
        Tuple(
          name String,
          type String,
          cpu_info String,
          cpu_count UInt32,
          mem_info String,
          avail_mem_in_gb UInt32,
          gpu_info String,
          gpu_count UInt32,
          gpu_mem_info String,
          avail_gpu_mem_in_gb UInt32,
          extra_info Map(String, String)
        )
      ),
      `benchmark` Tuple(
        name String,
        mode String,
        dtype String,
        extra_info Map(String, String)
      ),
      `model` Tuple(
        name String,
        type String,
        backend String,
        origins Array(String),
        extra_info Map(String, String)
      ),
      `inputs` Map(
        String,
        Tuple(dtype String, extra_info Map(String, String))
      ),
      `dependencies` Map(
        String,
        Tuple(
          `repo` String,
          `branch` String,
          `sha` String,
          `version` String,
          extra_info Map(String, String)
        )
      ),
      `metric` Tuple(
        name String,
        benchmark_values Array(Float32),
        target_value Float32,
        extra_info Map(String, String)
      )
    ',
    'gzip',
    extra_credentials(
      role_arn = 'arn:aws:iam::308535385114:role/clickhouse_role'
    )
  )
```
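After running the insert, a quick sanity check (a sketch, using the table and S3 key from the example above) confirms that the rows landed with the expected `_meta` value:

```sql
-- Count the rows replicated from that specific S3 object
select count()
from benchmark.oss_ci_benchmark_v3
where tupleElement(_meta, 'key') = 'v3/pytorch/pytorch/11862562618/33067186757/add_loop_inductor_dynamic_gpu.json';
```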