dataculpa-bigquery

Google BigQuery connector for Data Culpa - monitor data quality in Google BigQuery automatically with Data Culpa Validator

Install

Clone the repo (or download just dc-bigquery.py).
Install the Python dependencies (Python 3):

pip install google-cloud-bigquery google-cloud-bigquery-storage python-dotenv dataculpa-client

Create a BigQuery service account and an access key. The connector reads the standard GOOGLE_APPLICATION_CREDENTIALS environment variable to find the path to the access key JSON file. The service account needs permission to read from the data sets you want to monitor.
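Before running the connector, it can help to confirm that the environment variable points at a usable key file. The following is a minimal sketch, not part of dc-bigquery.py: the function name check_credentials_file is hypothetical, and it only inspects the file locally using the "type" and "client_email" fields found in service account key files. It does not contact BigQuery or verify permissions.

```python
import json
import os
from pathlib import Path


def check_credentials_file(env=os.environ):
    """Return (ok, message) describing the key file named by
    GOOGLE_APPLICATION_CREDENTIALS. Hypothetical helper, local checks only."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return False, "GOOGLE_APPLICATION_CREDENTIALS is not set"

    key_file = Path(path)
    if not key_file.is_file():
        return False, f"key file not found: {key_file}"

    try:
        key = json.loads(key_file.read_text())
    except (OSError, ValueError) as exc:
        # ValueError covers both invalid JSON and undecodable bytes.
        return False, f"key file is not readable JSON: {exc}"

    # Service account key files carry "type": "service_account".
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return False, "key file does not look like a service account key"

    return True, f"service account: {key.get('client_email', '<unknown>')}"


if __name__ == "__main__":
    ok, message = check_credentials_file()
    print(("OK: " if ok else "ERROR: ") + message)
```

A passing check only means the file is present and well formed; whether the account can actually read your data sets is a separate question that needs a real connection to answer.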

Configure

Run dc-bigquery.py --init example.yaml to generate a template YAML file, then fill in the connection coordinates. The YAML file never contains secrets and is safe to put into source control.
Once you have edited the YAML file, run dc-bigquery.py --test example.yaml to test the connection to BigQuery, the BigQuery permissions, and the connection to the Data Culpa Validator controller.

Invocation

Run dc-bigquery.py --run example.yaml to ingest data into Data Culpa Validator.

The dc-bigquery.py script is intended to be invoked from cron or another orchestration system. You can run it as frequently as you wish. To isolate collections or data sets from one another, run separate instances, each with its own YAML configuration file.
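One way to drive several configuration files from a single cron entry is a small wrapper script. The sketch below is an illustration, not part of this repository: build_command and run_all are hypothetical names, the script is assumed to sit in the working directory, and the wrapper assumes dc-bigquery.py exits with a non-zero status on failure, which this README does not confirm.

```python
import subprocess
import sys


def build_command(config_path, script="dc-bigquery.py", mode="--run"):
    """Build the argument list for one connector invocation."""
    return [sys.executable, script, mode, config_path]


def run_all(config_paths, runner=subprocess.run):
    """Run the connector once per config file and return the configs
    whose invocation reported a non-zero exit status (an assumption
    about how dc-bigquery.py signals failure)."""
    failed = []
    for config_path in config_paths:
        result = runner(build_command(config_path))
        if result.returncode != 0:
            failed.append(config_path)
    return failed


if __name__ == "__main__":
    # Config file names are taken from the command line.
    for path in run_all(sys.argv[1:]):
        print(f"ingest failed for {path}", file=sys.stderr)
```

Saved under a name of your choosing, the wrapper could be called from cron with the YAML files as arguments. Each configuration runs in its own process, so a failure in one does not stop the others.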

Support and Future Improvements

There are many improvements we are considering for this module. You can get in touch by writing to hello@dataculpa.com or opening issues in this repository.
