This example deploys the Secured Data Warehouse blueprint module, the required "external harness" needed to deploy it, and Dataflow Pipelines that process sensitive data using the Secured Data Warehouse infrastructure.
In the external harness we have:
- The creation of four GCP projects needed by the Secured Data Warehouse:
  - Data Ingestion project.
  - Non-confidential Data project.
  - Data Governance project.
  - Confidential Data project.
- The creation of an external Artifact Registry project for the Dataflow flex templates and the build of the templates themselves, including:
  - A Docker Artifact registry.
  - A Python Artifact registry.
  - A Python Pub/Sub to BigQuery DLP de-identification Dataflow Flex Template.
  - A Python BigQuery to BigQuery DLP re-identification Dataflow Flex Template.
- The creation of two VPC networks to deploy Dataflow Pipelines, one in the Data Ingestion project and another one in the Confidential Data project, each network having:
  - A VPC network with one subnetwork.
  - A set of firewall rules.
  - The required DNS configuration for Google Private Access.
- The configuration of Log Sinks in all projects, with the creation of a related Logging bucket in the Data Governance project.
- The Cloud KMS infrastructure for the creation of a `wrapped_key` and `crypto_key` pair (see the sketch after this list):
  - A Cloud KMS Keyring.
  - A Cloud KMS key encryption key (KEK).
  - A token encryption key (TEK) for DLP Templates.
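For illustration, the keyring and key encryption key created in the harness correspond to standard Cloud KMS resources, roughly like the Terraform sketch below. The names, project, and location shown are placeholders, not the harness module's actual code; the `wrapped_key` itself is produced later by the `wrapped_key.sh` script described in the requirements, which generates random bytes and wraps them with the KEK.

```hcl
# Illustrative sketch only; the harness defines its own names and settings.
resource "google_kms_key_ring" "kek_keyring" {
  project  = "sdw-data-gov-xxxx" # placeholder: data governance project ID
  name     = "kek-keyring"
  location = "us-east4"
}

resource "google_kms_crypto_key" "kek" {
  name     = "kek-key"
  key_ring = google_kms_key_ring.kek_keyring.id
  purpose  = "ENCRYPT_DECRYPT"
}
```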
Custom names can be provided for the five projects created in this example. If custom names are not provided, the names of the projects will be:
- `ext-harness`
- `sdw-data-ing`
- `sdw-data-gov`
- `sdw-non-conf`
- `sdw-conf`

A random suffix will be added to the end of these names to create the project IDs and prevent collisions with existing projects.
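For illustration only, this naming pattern is the kind of thing commonly produced with the `random` provider; the snippet below is a sketch of the pattern, not the example's actual implementation.

```hcl
# Sketch of the naming pattern; the example's own project creation logic may differ.
resource "random_string" "suffix" {
  length  = 4
  special = false
  upper   = false
}

locals {
  # e.g. "sdw-data-gov-a1b2"
  data_governance_project_id = "sdw-data-gov-${random_string.suffix.result}"
}
```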
In the deployment of the Secured Data Warehouse and the Dataflow Pipelines we have:
- The deployment of the main module itself.
- The creation of two Dataflow Pipelines using the same Dataflow template, which can do both de-identification and re-identification using the `wrapped_key` and `crypto_key` pair created in the harness.
- The creation of a Data Catalog taxonomy and policy tags representing security levels.
- The creation of a BigQuery table with column-level security enabled using the Data Catalog policy tags.
The example waits 10 minutes between the deployment of the de-identification Dataflow Pipeline and the start of the re-identification Dataflow Pipeline, giving the first job time to deploy, process the 10k records, and write to the BigQuery table that the second job will process.
The re-identification step is typically a separate, deliberate action (with change control) that re-identifies the data and limits who can read it, but it is executed automatically in this example to showcase the BigQuery security controls.
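One common way to express such a wait in Terraform is the `time_sleep` resource from the `hashicorp/time` provider. The sketch below only illustrates the idea; the module names and path in it are hypothetical, not the example's actual resource names.

```hcl
# Hypothetical sketch: pause 10 minutes after the de-identification pipeline
# is created before creating the re-identification pipeline.
resource "time_sleep" "wait_deid_job_execution" {
  create_duration = "600s" # 10 minutes

  depends_on = [module.regional_deid_pipeline] # hypothetical module name
}

module "regional_reid_pipeline" {
  source = "./modules/reid-pipeline" # hypothetical path

  # ... pipeline inputs omitted ...

  depends_on = [time_sleep.wait_deid_job_execution]
}
```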
Note: To deploy this example, you must also have an existing project in which the Service Account used to deploy this example is created and granted the required IAM roles. In accordance with the separation of concerns principle, this project should not be in the same folder used to deploy the Secured Data Warehouse. You can use the Project Factory module and the IAM module in combination to provision a service account with the necessary roles applied.
This example will be deployed at the `us-east4` location. To deploy in another location, change the local `location` in the example `main.tf` file.
By default, the Secured Data Warehouse module has an Organization Policy that only allows the creation of resources in `us-locations`. To deploy in other locations, update the input `trusted_locations` with the appropriate location in the call to the main module, as sketched below.
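For reference, the relevant pieces of `main.tf` would look roughly like the sketch below; the exact layout of the example's `main.tf` may differ, the module block is abbreviated, and the `trusted_locations` value is only an example of the kind of entry expected.

```hcl
locals {
  location = "us-east4" # change this local to deploy in another location
}

module "secured_data_warehouse" {
  source = "../.." # path to the main module as used from the examples directory (assumption)

  # ... other inputs omitted ...

  # Example value only; check the module documentation for the exact
  # format expected by trusted_locations.
  trusted_locations = ["us-locations", "europe-west1"]
}
```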
- Rename the `tfvars` file by running `mv terraform.example.tfvars terraform.tfvars` and update `terraform.tfvars` with values from your environment (see the sample below).

  ```
  mv terraform.example.tfvars terraform.tfvars
  ```

- Run `terraform init` to get the plugins.

  ```
  terraform init
  ```

- Run `terraform plan` and review the plan.

  ```
  terraform plan
  ```

- Run `terraform apply` to apply the infrastructure build.

  ```
  terraform apply
  ```

- Run `terraform destroy` to clean up your environment. The input `delete_contents_on_destroy` must have been set to `true` in the original `apply` for the `terraform destroy` command to work.

  ```
  terraform destroy
  ```
To be able to see the resources protected by the VPC Service Controls Perimeters in the Google Cloud Console, you need to add your user to the variable `perimeter_additional_members` in the `terraform.tfvars` file.
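For reference, a filled-in `terraform.tfvars` might look like the following. All values are placeholders that you must replace with your own; the full set of variables is listed in the Inputs table below.

```hcl
# terraform.tfvars (illustrative values only)
access_context_manager_policy_id = "123456789"
billing_account                  = "XXXXXX-YYYYYY-ZZZZZZ"
org_id                           = "123456789012"
folder_id                        = "123456789012" # numeric folder ID (format assumption)
terraform_service_account        = "sdw-deployer@your-sa-project.iam.gserviceaccount.com"

data_analyst_group           = "data-analysts@example.com"
data_engineer_group          = "data-engineers@example.com"
network_administrator_group  = "network-admins@example.com"
security_administrator_group = "security-admins@example.com"
security_analyst_group       = "security-analysts@example.com"

delete_contents_on_destroy = true

perimeter_additional_members = [
  "user:your-name@example.com",
]
```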
The sample data used in this example is a public BigQuery table that belongs to the IRS 990 public dataset.
The de-identification Dataflow Pipeline will apply these DLP crypto-based tokenization transformations to encrypt the data:
- Deterministic encryption transformation:
  - `ein`.
  - `street`.
  - `name`.
- Format-preserving encryption transformation:
  - `state`.
This example creates a Data Catalog taxonomy to enable BigQuery column-level access controls.
The taxonomy has four levels: Confidential, Private, Sensitive, and Public; access to a higher level also grants access to the lower levels:
- 4_Confidential: Most sensitive data classification. Significant damage to enterprise.
  - `US_EMPLOYER_IDENTIFICATION_NUMBER`.
- 3_Private: Data meant to be private. Likely to cause damage to enterprise.
  - `PERSON_NAME`.
  - `STREET_ADDRESS`.
  - `US_STATE`.
- 2_Sensitive: Data not meant to be public.
  - `INCOME_AMT`.
  - `REVENUE_AMT`.
- 1_Public: Data is freely accessible to the public.
  - Untagged fields.
By default, no user has access to read data protected with column-level security. If a user needs access, the Fine-Grained Reader role must be added to the appropriate user or group.
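If you manage that grant with Terraform, one possible sketch uses the `google_data_catalog_policy_tag_iam_member` resource; the policy tag reference and member below are placeholders, and depending on your provider version this resource may require the `google-beta` provider.

```hcl
# Grants BigQuery fine-grained read access on a single policy tag.
resource "google_data_catalog_policy_tag_iam_member" "fine_grained_reader" {
  # Placeholder policy tag resource name from the taxonomy created by the example.
  policy_tag = "projects/sdw-data-gov-xxxx/locations/us-east4/taxonomies/1234/policyTags/5678"
  role       = "roles/datacatalog.categoryFineGrainedReader"
  member     = "group:data-analysts@example.com"
}
```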
These sections describe requirements for running this example.
Install the following dependencies:
- Google Cloud SDK version 357.0.0 or later.
- Terraform version 0.13.7 or later.
- curl version 7.68.0 or later.
The standalone example uses a Python script to generate the `wrapped_key` that will be used in the DLP template to encrypt and decrypt the information. This script runs using the Application Default Credentials, so it is important to ensure that they are correctly configured. To acquire yours:
```
gcloud auth application-default login
```
To avoid errors in the deployment, you must also ensure that the `gcloud` command is not using a project that has been deleted or is unavailable. We recommend unsetting the project:
```
gcloud config unset project
```
To provision the resources of this example, create a privileged service account for which no service account key can be created. In addition, consider using Cloud Monitoring to alert on this service account's activity. Grant the following roles to the service account:
- Organization level:
  - Access Context Manager Admin: `roles/accesscontextmanager.policyAdmin`
  - Billing Account User: `roles/billing.user`
  - Organization Policy Administrator: `roles/orgpolicy.policyAdmin`
  - Organization Administrator: `roles/resourcemanager.organizationAdmin`
- Folder level:
  - Compute Network Admin: `roles/compute.networkAdmin`
  - Logging Admin: `roles/logging.admin`
  - Project Creator: `roles/resourcemanager.projectCreator`
  - Project Deleter: `roles/resourcemanager.projectDeleter`
  - Project IAM Admin: `roles/resourcemanager.projectIamAdmin`
  - Service Usage Admin: `roles/serviceusage.serviceUsageAdmin`
As an alternative to granting the service account the Billing Account User role in the organization, it is possible to grant it directly on the billing account. You can run the following `gcloud` command to assign the Billing Account User role to the service account:
```
export SA_EMAIL=<YOUR-SA-EMAIL>
export BILLING_ACCOUNT=<YOUR-BILLING-ACCOUNT>

gcloud beta billing accounts add-iam-policy-binding "${BILLING_ACCOUNT}" \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/billing.user"
```
You can use the Project Factory module and the IAM module in combination to provision a service account with the necessary roles applied.
The user of this service account must have the Service Account User and Service Account Token Creator roles to impersonate the service account.
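If you prefer to manage this binding with Terraform instead of the IAM module, a minimal sketch (with placeholder values) could look like this:

```hcl
# Allows the named user to impersonate the Terraform service account.
resource "google_service_account_iam_member" "impersonation" {
  for_each = toset([
    "roles/iam.serviceAccountUser",
    "roles/iam.serviceAccountTokenCreator",
  ])

  # Placeholder service account and user.
  service_account_id = "projects/your-sa-project/serviceAccounts/sdw-deployer@your-sa-project.iam.gserviceaccount.com"
  role               = each.value
  member             = "user:your-name@example.com"
}
```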
You can run the following commands to assign roles to the service account:
```
export ORG_ID=<YOUR-ORG-ID>
export FOLDER_ID=<YOUR-FOLDER-ID>
export SA_EMAIL=<YOUR-SA-EMAIL>

gcloud organizations add-iam-policy-binding ${ORG_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/resourcemanager.organizationAdmin"

gcloud organizations add-iam-policy-binding ${ORG_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/accesscontextmanager.policyAdmin"

gcloud organizations add-iam-policy-binding ${ORG_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/billing.user"

gcloud organizations add-iam-policy-binding ${ORG_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/orgpolicy.policyAdmin"

gcloud resource-manager folders \
add-iam-policy-binding ${FOLDER_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/compute.networkAdmin"

gcloud resource-manager folders \
add-iam-policy-binding ${FOLDER_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/logging.admin"

gcloud resource-manager folders \
add-iam-policy-binding ${FOLDER_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/resourcemanager.projectCreator"

gcloud resource-manager folders \
add-iam-policy-binding ${FOLDER_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/resourcemanager.projectDeleter"

gcloud resource-manager folders \
add-iam-policy-binding ${FOLDER_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/resourcemanager.projectIamAdmin"

gcloud resource-manager folders \
add-iam-policy-binding ${FOLDER_ID} \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/serviceusage.serviceUsageAdmin"
```
Note: The script `wrapped_key.sh` will grant the `cloudkms.cryptoOperator` role to the Terraform Service Account during the KMS `wrapped_key` creation. This role gives the `wrapped_key.sh` script permission to use the Generate Random Bytes functionality, which is necessary to generate the token that will be used in the KMS `wrapped_key`. The `cloudkms.cryptoOperator` role will be removed from the Terraform Service Account once the KMS `wrapped_key` has been created.
The following APIs must be enabled in the project where the service account was created:
- Access Context Manager API: `accesscontextmanager.googleapis.com`
- App Engine Admin API: `appengine.googleapis.com`
- Cloud Billing API: `cloudbilling.googleapis.com`
- Cloud Build API: `cloudbuild.googleapis.com`
- Cloud Key Management Service (KMS) API: `cloudkms.googleapis.com`
- Cloud Pub/Sub API: `pubsub.googleapis.com`
- Cloud Resource Manager API: `cloudresourcemanager.googleapis.com`
- Compute Engine API: `compute.googleapis.com`
- Dataflow API: `dataflow.googleapis.com`
- Identity and Access Management (IAM) API: `iam.googleapis.com`
- BigQuery API: `bigquery.googleapis.com`
You can run the following `gcloud` command to enable these APIs in the service account project:
```
export PROJECT_ID=<SA-PROJECT-ID>

gcloud services enable \
accesscontextmanager.googleapis.com \
appengine.googleapis.com \
bigquery.googleapis.com \
cloudbilling.googleapis.com \
cloudbuild.googleapis.com \
cloudkms.googleapis.com \
pubsub.googleapis.com \
cloudresourcemanager.googleapis.com \
compute.googleapis.com \
dataflow.googleapis.com \
iam.googleapis.com \
--project ${PROJECT_ID}
```
These outputs can be interesting for all users:
- Blueprint type
- Data ingestion dataflow bucket name
- Confidential data project id
- Data governance project id
- Data ingestion project id
- Non-confidential data project id
- Service account: dataflow controller service account
These outputs can be interesting for the data analyst group:
- Bigquery non confidential table
- Bigquery confidential table
- Confidential data project id
- Non-confidential data project id
- Data ingestion dataflow bucket name
- Taxonomy
These outputs can be interesting for the data engineer group:
- Data ingestion topic name
- Pubsub writer service account email
- Storage writer service account email
- Tek wrapping keyring
These outputs can be interesting for the network administrator group:
- Confidential network name
- Confidential network self link
- Confidential subnets self link
- Data ingestion network name
- Data ingestion network self link
- Data ingestion subnets self link
These outputs can be interesting for the security administrator group:
- Cmek bigquery crypto key
- Cmek confidential bigquery crypto key
- Cmek data ingestion crypto key
- Cmek reidentification crypto key
- Confidential data perimeter name
- Data governance perimeter name
- Data ingestion service perimeter name
- Tek wrapping keyring
These outputs can be interesting for the security analyst group:
- Centralized logging bucket name
- Taxonomy display name
If you encounter problems during the apply execution, check the Troubleshooting Guide.
Name | Description | Type | Default | Required |
---|---|---|---|---|
access_context_manager_policy_id | The id of the default Access Context Manager policy. Can be obtained by running `gcloud access-context-manager policies list --organization YOUR-ORGANIZATION_ID --format="value(name)"`. | `string` | n/a | yes |
billing_account | The billing account id associated with the projects, e.g. XXXXXX-YYYYYY-ZZZZZZ. | `string` | n/a | yes |
confidential_data_project_name | Custom project name for the confidential data project. | `string` | `""` | no |
data_analyst_group | Google Cloud IAM group that analyzes the data in the warehouse. | `string` | n/a | yes |
data_engineer_group | Google Cloud IAM group that sets up and maintains the data pipeline and warehouse. | `string` | n/a | yes |
data_governance_project_name | Custom project name for the data governance project. | `string` | `""` | no |
data_ingestion_project_name | Custom project name for the data ingestion project. | `string` | `""` | no |
delete_contents_on_destroy | (Optional) If set to true, delete all the tables in the dataset when destroying the resource; otherwise, destroying the resource will fail if tables are present. | `bool` | `true` | no |
folder_id | The folder to deploy in. | `string` | n/a | yes |
network_administrator_group | Google Cloud IAM group that reviews network configuration. Typically, this includes members of the networking team. | `string` | n/a | yes |
non_confidential_data_project_name | Custom project name for the non-confidential data project. | `string` | `""` | no |
org_id | The numeric organization id. | `string` | n/a | yes |
perimeter_additional_members | The list of members to be added to the perimeter access list. To be able to see the resources protected by the VPC Service Controls, your user must be in this list. The service accounts created by this module do not need to be added to this list. Entries must be in the standard GCP form: `user:email@email.com` or `serviceAccount:my-service-account@email.com`. | `list(string)` | n/a | yes |
security_administrator_group | Google Cloud IAM group that administers security configurations in the organization (org policies, KMS, VPC service perimeter). | `string` | n/a | yes |
security_analyst_group | Google Cloud IAM group that monitors and responds to security incidents. | `string` | n/a | yes |
template_project_name | Custom project name for the template project. | `string` | `""` | no |
terraform_service_account | The email address of the service account that will run the Terraform code. | `string` | n/a | yes |
Name | Description |
---|---|
bigquery_confidential_table | The bigquery table created for the confidential project. |
bigquery_non_confidential_table | The bigquery table created for the non confidential project. |
blueprint_type | Type of blueprint this module represents. |
centralized_logging_bucket_name | The name of the bucket created for storage logging. |
cmek_bigquery_crypto_key | The Customer Managed Crypto Key for the BigQuery service. |
cmek_bigquery_crypto_key_name | The Customer Managed Crypto Key name for the BigQuery service. |
cmek_confidential_bigquery_crypto_key | The Customer Managed Crypto Key for the confidential BigQuery service. |
cmek_confidential_bigquery_crypto_key_name | The Customer Managed Crypto Key name for the confidential BigQuery service. |
cmek_data_ingestion_crypto_key | The Customer Managed Crypto Key for the data ingestion crypto boundary. |
cmek_data_ingestion_crypto_key_name | The Customer Managed Crypto Key name for the data ingestion crypto boundary. |
cmek_keyring_name | The Keyring name for the KMS Customer Managed Encryption Keys. |
cmek_reidentification_crypto_key | The Customer Managed Crypto Key for the reidentification crypto boundary. |
cmek_reidentification_crypto_key_name | The Customer Managed Crypto Key name for the reidentification crypto boundary. |
confidential_data_dataflow_bucket_name | The name of the bucket created for dataflow in the confidential data pipeline. |
confidential_data_perimeter_name | Access context manager service perimeter name. |
confidential_data_project_id | The ID of the project created for confidential datasets and tables. |
confidential_dataset | The bigquery dataset created for confidential data. |
confidential_network_name | The name of the confidential VPC being created. |
confidential_network_self_link | The URI of the confidential VPC being created. |
confidential_subnets_self_link | The self-links of confidential subnets being created. |
data_governance_perimeter_name | Access context manager service perimeter name. |
data_governance_project_id | The ID of the project created for data governance. |
data_ingestion_bucket_name | The name of the bucket created for the data ingestion pipeline. |
data_ingestion_dataflow_bucket_name | The name of the bucket created for dataflow in the data ingestion pipeline. |
data_ingestion_network_name | The name of the data ingestion VPC being created. |
data_ingestion_network_self_link | The URI of the data ingestion VPC being created. |
data_ingestion_project_id | The ID of the project created for the data ingestion pipeline. |
data_ingestion_service_perimeter_name | Access context manager service perimeter name. |
data_ingestion_subnets_self_link | The self-links of data ingestion subnets being created. |
data_ingestion_topic_name | The topic created for data ingestion pipeline. |
dataflow_controller_service_account_email | The regional de-identification pipeline service account. |
non_confidential_data_project_id | The ID of the project created for non-confidential data. |
non_confidential_dataset | The bigquery dataset created for non-confidential data. |
pubsub_writer_service_account_email | The PubSub writer service account email. Should be used to write data to the PubSub topics the data ingestion pipeline reads from. |
regional_deid_pipeline_job_id | The unique ID of this job. |
regional_reid_pipeline_job_id | The unique ID of this job. |
storage_writer_service_account_email | The Storage writer service account email. Should be used to write data to the buckets the data ingestion pipeline reads from. |
taxonomy_display_name | The name of the taxonomy. |
tek_wrapping_keyring | The name of the TEK wrapping keyring. |
template_project_id | The ID of the flex template project. |
terraform_service_account | Service account used in the Standalone example. |