Parquet files containing sensitive information can be protected by the modular encryption mechanism, which encrypts and authenticates the file data and metadata while preserving regular Parquet functionality (columnar projection, predicate pushdown, encoding and compression).
Existing data protection solutions (such as flat encryption of files, in-storage encryption, or use of an encrypting storage client) can be applied to Parquet files, but have various security or performance issues. An encryption mechanism, integrated in the Parquet format, allows for an optimal combination of data security, processing speed and encryption granularity.
- Protect Parquet data and metadata by encryption, while enabling selective reads (columnar projection, predicate push-down).
- Implement "client-side" encryption/decryption (storage client). The storage server must not see plaintext data, metadata or encryption keys.
- Leverage authenticated encryption that allows clients to check integrity of the retrieved data - making sure the file (or file parts) have not been replaced with a wrong version, or tampered with otherwise.
- Enable different encryption keys for different columns and for the footer.
- Allow for partial encryption - encrypt only column(s) with sensitive data.
- Work with all compression and encoding mechanisms supported in Parquet.
- Support multiple encryption algorithms, to account for different security and performance requirements.
- Enable two modes for metadata protection:
  - full protection of file metadata
  - partial protection of file metadata that allows legacy readers to access unencrypted columns in an encrypted file.
- Minimize overhead of encryption - in terms of size of encrypted files, and throughput of write/read operations.
The Parquet writer generates a DEK (data encryption key) for each plaintext chunk to be encrypted, encrypts the plaintext chunk, then sends the DEK to the KMS (key management service) to be wrapped by the chosen KEK (key encryption key). The KMS returns the wrapped DEK to the Parquet writer, which stores the wrapped DEK alongside the corresponding ciphertext chunk.
To read a ciphertext chunk, the Parquet reader sends the corresponding wrapped DEK to the KMS, which unwraps it and returns the DEK to the Parquet reader. The reader decrypts the ciphertext chunk with the DEK.
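The write and read flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyKms`, `write_chunk` and `read_chunk` are hypothetical names, the KMS lives in memory rather than behind HTTP, and `_toy_cipher` is a deliberately insecure stand-in for an authenticated cipher such as AES-GCM, used here only to keep the example dependency-free.

```python
import hashlib
import secrets
from typing import Dict, Tuple


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """Insecure stand-in cipher: XOR data with a SHAKE-256 keystream.

    XOR is its own inverse, so the same call encrypts and decrypts.
    Real code would use an authenticated cipher (e.g. AES-GCM) instead.
    """
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))


class ToyKms:
    """Hypothetical in-memory KMS: KEKs are created and kept inside it."""

    def __init__(self) -> None:
        self._keks: Dict[str, bytes] = {}

    def create_kek(self, kek_id: str) -> None:
        self._keks[kek_id] = secrets.token_bytes(32)

    def wrap(self, kek_id: str, dek: bytes) -> bytes:
        # Encrypt the DEK under the chosen KEK.
        return _toy_cipher(self._keks[kek_id], dek)

    def unwrap(self, kek_id: str, wrapped_dek: bytes) -> bytes:
        # Recover the DEK; a real KMS would check the caller's privileges first.
        return _toy_cipher(self._keks[kek_id], wrapped_dek)


def write_chunk(kms: ToyKms, kek_id: str, plaintext: bytes) -> Tuple[bytes, bytes]:
    """Writer side: fresh DEK per chunk, encrypt, then have the KMS wrap the DEK."""
    dek = secrets.token_bytes(32)
    ciphertext = _toy_cipher(dek, plaintext)
    wrapped_dek = kms.wrap(kek_id, dek)
    # Only the wrapped DEK is stored next to the ciphertext, never the DEK itself.
    return wrapped_dek, ciphertext


def read_chunk(kms: ToyKms, kek_id: str, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    """Reader side: ask the KMS to unwrap the DEK, then decrypt the chunk."""
    dek = kms.unwrap(kek_id, wrapped_dek)
    return _toy_cipher(dek, ciphertext)
```

The point of the design is that the KEK never leaves the KMS: a reader who can fetch the file but cannot get the KMS to unwrap the stored DEK only ever sees ciphertext.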
Configure Git hooks:
    pixi run -- pre-commit-install
Launch the KMS (key management service):
    pixi run -- serve
Explore the KMS' OpenAPI specification. Try POSTing the JSON payload
    {
      "key": "rlCLtKLrH/b9GZbuZaneQB6yU6vp8tlC1R2LINMYYrM="
    }
to one of the wrap endpoints and then try unwrapping the result via the corresponding unwrap endpoint at various privilege levels. To set a privilege level, click the "Authorize" button and set the value of the x-api-key request header to INTERNAL, CONFIDENTIAL or RESTRICTED. PUBLIC does not require the x-api-key request header. (plaintext < PUBLIC < INTERNAL < CONFIDENTIAL < RESTRICTED)
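The same request can also be built from a script instead of the OpenAPI page. The sketch below only constructs the request: `build_kms_request` is a hypothetical helper, and the URL must be supplied by the caller, because the exact wrap and unwrap paths are whatever the KMS' OpenAPI specification lists. The `key` payload field and the `x-api-key` header are the ones described above.

```python
import json
import urllib.request
from typing import Optional


def build_kms_request(
    url: str, key_b64: str, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST request carrying {"key": ...} for a wrap or unwrap endpoint.

    `url` is an assumption: pass an endpoint taken from the OpenAPI page.
    `api_key` is the privilege level (INTERNAL, CONFIDENTIAL or RESTRICTED);
    leave it as None for PUBLIC, which needs no x-api-key header.
    """
    body = json.dumps({"key": key_b64}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        headers["x-api-key"] = api_key
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

With the KMS running, the request could be sent with `urllib.request.urlopen(request)`. The shape of the response is not assumed here, so inspect it against the OpenAPI specification before parsing it.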
Write an encrypted Parquet dataset with columns of varying privilege levels to the dataset directory.
    pixi run -- write
Read the entire dataset from the dataset directory.
    pixi run -- read
Edit read_encrypted_parquet.py and experiment with different combinations of KMS_ACCESS_TOKEN and COLUMNS to project. The default is:
KMS_ACCESS_TOKEN = WrappingKeyId.RESTRICTED
COLUMNS = [
    "id",  # minimum required privilege: none (plaintext)
    "date_of_birth",  # minimum required privilege: INTERNAL
    "first_name",  # minimum required privilege: CONFIDENTIAL
    "last_name",  # minimum required privilege: CONFIDENTIAL
    "social_security_number",  # minimum required privilege: RESTRICTED
]
RESTRICTED is the highest privilege level and may decrypt all columns, which is why projecting all columns earlier was successful.
Note that id is the only plaintext column, and no access token is required to project it (i.e. KMS_ACCESS_TOKEN = None).
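To predict which projections should succeed for a given token, the privilege ordering can be modelled as a small lookup. This is a sketch, not the repository's implementation: `Privilege`, `MIN_PRIVILEGE` and `readable_columns` are hypothetical names, and it assumes that a token at one level may decrypt every column whose minimum level is the same or lower, with a missing token treated as PUBLIC.

```python
from enum import IntEnum
from typing import List, Optional


class Privilege(IntEnum):
    """Ordering from the text: plaintext < PUBLIC < INTERNAL < CONFIDENTIAL < RESTRICTED."""

    PLAINTEXT = 0
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Minimum privilege per column, copied from the default COLUMNS comments.
MIN_PRIVILEGE = {
    "id": Privilege.PLAINTEXT,
    "date_of_birth": Privilege.INTERNAL,
    "first_name": Privilege.CONFIDENTIAL,
    "last_name": Privilege.CONFIDENTIAL,
    "social_security_number": Privilege.RESTRICTED,
}


def readable_columns(token: Optional[Privilege]) -> List[str]:
    """Return the columns a reader holding `token` should be able to project."""
    # Assumption: no token behaves like PUBLIC, which needs no x-api-key header.
    effective = Privilege.PUBLIC if token is None else token
    return [
        column
        for column, required in MIN_PRIVILEGE.items()
        if effective >= required
    ]
```

For example, `readable_columns(None)` yields only `id`, while `readable_columns(Privilege.RESTRICTED)` yields all five columns, matching the behaviour described above.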
Please note that in a real deployment KEKs should be narrowly scoped (e.g. project-specific), periodically rotated, and gated behind IAM (Identity and Access Management) controls that are more secure than static API keys.
Examples of production-grade KMS include: