Evaluate removing uid by storing tags in segments #197
After it has been determined whether this change should be made and dev/delta-metadata has been merged, the schemas used for the various …
To determine if this change should be made, I wrote a small script that adds tags to existing data sets. The script assumes that each time series is stored in a separate Apache Parquet file. The name of the file is added as the first tag column, and for each additional tag column, the script computes the SHA-512 hash of the previous tag and uses it as that column's value:
```python
import glob
import sys
import hashlib

import pyarrow
from pyarrow import parquet

if len(sys.argv) < 3:
    print("usage: tag-scaler.py data_folder tag_columns")
    sys.exit(1)

data_folder = sys.argv[1]
tag_columns = int(sys.argv[2])

for file_path in glob.glob(data_folder + "/*.parquet"):
    table = parquet.read_table(file_path)
    file_name = file_path[file_path.rfind("/") + 1 : file_path.rfind(".")]

    # The file name is used as the first tag and each additional tag is the
    # SHA-512 hash of the previous tag.
    tag = file_name
    for i in range(0, tag_columns):
        tag_name = "tag_" + str(i)
        tags = table.num_rows * [tag]
        tags_array = pyarrow.array(tags)
        table = table.append_column(tag_name, tags_array)

        tag_bytes = bytes(tag, "UTF-8")
        tag = hashlib.sha512(tag_bytes).hexdigest()

    # Write the table with the added tag columns to a new Apache Parquet file.
    output_path = f"{file_name}_{tag_columns}.parquet"
    parquet.write_table(table, output_path,
                        column_encoding='PLAIN',
                        compression='zstd',
                        use_dictionary=False,
                        write_statistics=False)
```

The raw data set is 6.7 GB as Apache Parquet files compressed with Snappy. The script was executed with this data set and different numbers of tag columns. The results are shown in the table below:
The results show that storing tags in the Apache Parquet files increases the amount of storage required by at most ~10% in the worst case. Thus, it seems reasonable to move the tags to the Apache Parquet files with segments for the benefits described in the previous comments. However, the overhead may be larger than preferred in the short term as the tags have to be stored for each field column. If this becomes an issue, it may be worth redesigning the segment storage format so all segments for a time interval and set of tags are stored together in a single Apache Parquet file, despite that file then containing data for different field columns. This should also reduce the number of files read during query processing and thus make query processing more efficient.
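For illustration, below is a minimal sketch of what the schema of such a combined Apache Parquet file could look like. It is only a sketch: the segment columns (`start_time`, `end_time`, and `values`) are simplified placeholders rather than ModelarDB's actual compressed segment schema, two tag columns are assumed, and the hypothetical `field_column` column identifies which field column each segment stores values for.

```python
import pyarrow

# Hypothetical schema for a single Apache Parquet file that stores all
# segments for a time interval and set of tags. The segment columns are
# simplified placeholders, not ModelarDB's actual compressed segment schema.
combined_segment_schema = pyarrow.schema([
    ("field_column", pyarrow.uint16()),       # Field column the segment stores values for.
    ("tag_0", pyarrow.string()),              # One column per tag in the model table.
    ("tag_1", pyarrow.string()),
    ("start_time", pyarrow.timestamp("ms")),
    ("end_time", pyarrow.timestamp("ms")),
    ("values", pyarrow.binary()),             # Compressed representation of the values.
])
```

Since the tags and the field column are stored in the file itself, each segment remains self-contained even though segments for different field columns share one file.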
A way to remedy #187 may be to store tags in segments, if it does not significantly increase the amount of space and bandwidth used, and then remove `univariate_id`. However, this change needs to be evaluated in-depth first, as the schema for compressed segments would no longer be the same for model tables with tags, and it may increase the amount of storage and bandwidth required. Removing `univariate_id`s has multiple benefits in addition to fixing #187, e.g., it makes each segment self-contained, which reduces the complexity of data transfer and query processing, and it removes the limit on the number of columns in a model table. The problem with each model table having a different schema for compressed segments could be solved by adding each model table's compressed segment schema to `ModelTableMetadata`.
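As a rough sketch of that last idea, using Python and pyarrow instead of ModelarDB's actual Rust types and with simplified placeholder segment columns, each model table's compressed segment schema could be derived from its tag columns and stored with the rest of its metadata in a structure corresponding to `ModelTableMetadata`:

```python
from dataclasses import dataclass

import pyarrow


@dataclass
class ModelTableMetadata:
    # Hypothetical Python stand-in for ModelarDB's ModelTableMetadata,
    # extended with the model table's compressed segment schema.
    name: str
    tag_column_names: list
    compressed_segment_schema: pyarrow.Schema


def create_model_table_metadata(name, tag_column_names):
    # The compressed segment schema depends on the model table's tag columns,
    # so it is computed once and stored with the rest of the metadata. The
    # non-tag columns are simplified placeholders, not ModelarDB's actual
    # compressed segment schema.
    columns = [(tag, pyarrow.string()) for tag in tag_column_names]
    columns += [
        ("start_time", pyarrow.timestamp("ms")),
        ("end_time", pyarrow.timestamp("ms")),
        ("values", pyarrow.binary()),
    ]
    return ModelTableMetadata(name, tag_column_names, pyarrow.schema(columns))


# A model table with two tag columns gets a schema with five columns, while a
# model table without tags only gets the three segment columns.
wind_turbine = create_model_table_metadata("wind_turbine", ["park", "turbine"])
```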