
Evaluate removing uid by storing tags in segments #197

Open
skejserjensen opened this issue Aug 8, 2024 · 2 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@skejserjensen
Contributor

A way to remedy #187 may be to store tags in segments and then remove univariate_id, provided it does not significantly increase the amount of space and bandwidth used. However, this change needs to be evaluated in-depth first, as the schema for compressed segments would no longer be the same for all model tables (model tables with tags would have extra columns) and it may increase the amount of storage and bandwidth required. Removing univariate_ids has multiple benefits in addition to fixing #187, e.g., it makes each segment self-contained, which reduces the complexity of data transfer and query processing, and it removes the limit on the number of columns in a model table. The problem of each model table having a different schema for compressed segments could be solved by adding each model table's compressed segment schema to ModelTableMetadata.
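
As a rough sketch of the last point, the per-model-table schema for compressed segments could be derived by appending the model table's tag columns to the columns shared by all segments and storing the result in ModelTableMetadata. The column names and types below are illustrative assumptions, not the actual columns used by ModelarDB-RS:

import pyarrow

# Columns shared by compressed segments for all model tables (an illustrative
# subset, not the exact columns used by ModelarDB-RS).
SHARED_SEGMENT_COLUMNS = [
    ("model_type_id", pyarrow.int8()),
    ("start_time", pyarrow.timestamp("ms")),
    ("end_time", pyarrow.timestamp("ms")),
    ("values", pyarrow.binary()),
]

def compressed_segment_schema(tag_column_names):
    # A model table's compressed segment schema is the shared columns followed
    # by one string column per tag column.
    columns = SHARED_SEGMENT_COLUMNS + [(name, pyarrow.string()) for name in tag_column_names]
    return pyarrow.schema(columns)

# For example, a model table with the hypothetical tag columns "park" and
# "turbine" would store this schema in its ModelTableMetadata.
schema = compressed_segment_schema(["park", "turbine"])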

@skejserjensen
Contributor Author

After it has been determined whether this change should be made and dev/delta-metadata has been merged, the schemas used for the various RecordBatches in ModelarDB should be updated so they do not use unsigned integers, and the number of schemas should be reduced to the fewest possible. While using unsigned integers for values like column indices may make the semantics clearer, it does not seem to be worth the extra complexity of converting back and forth between unsigned integers in memory and signed integers on disk.
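
As a minimal sketch of that conversion overhead, assuming pyarrow and a column index column (not ModelarDB's actual schemas):

import pyarrow

# In memory, a column index could be kept as an unsigned 16-bit integer, ...
column_indices = pyarrow.array([0, 1, 2], type=pyarrow.uint16())

# ... but it must be cast to a signed integer before being written to storage
# that only supports signed integers, and cast back after being read.
column_indices_on_disk = column_indices.cast(pyarrow.int16())
column_indices_read_back = column_indices_on_disk.cast(pyarrow.uint16())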

@skejserjensen
Contributor Author

To determine if this change should be made, I wrote a small script that adds tags to existing data sets. The script assumes that each time series is stored in a separate Apache Parquet file. The name of the file is added as the first tag column, and for each additional tag column, the script computes the SHA-512 hash of the previous tag column's value and uses that as the next tag column. This ensures that all data points of a time series are assigned the same tags and that the generated tags are reproducible. The data points with tags are written to one Apache Parquet file per time series using a configuration as close as possible to the one ModelarDB-RS uses for segments. In summary, the script is designed to be a worst-case test: the number of segments should be much smaller than the number of data points, the strings produced by sha512() are generally longer than tags observed in real-life data sets, and the number of tag columns can be set higher than what generally appears in real-life data sets. The full script can be seen below:

import glob
import sys
import hashlib

import pyarrow
from pyarrow import parquet


if len(sys.argv) < 3:
    print("usage: tag-scaler.py data_folder tag_columns")
    sys.exit(1)

data_folder = sys.argv[1]
tag_columns = int(sys.argv[2])

for file_path in glob.glob(data_folder + "/*.parquet"):
    table = parquet.read_table(file_path)

    # The file name without its extension is used as the first tag value.
    file_name = file_path[file_path.rfind("/") + 1: file_path.rfind(".")]

    tag = file_name
    for i in range(0, tag_columns):
        # Assign the same tag value to all data points in the time series.
        tag_name = "tag_" + str(i)
        tags = table.num_rows * [tag]
        tags_array = pyarrow.array(tags)
        table = table.append_column(tag_name, tags_array)

        # The next tag column's value is the SHA-512 hash of the current one
        # so the generated tags are reproducible.
        tag_bytes = bytes(tag, "UTF-8")
        tag = hashlib.sha512(tag_bytes).hexdigest()

    # Write the table using a configuration as close as possible to the one
    # ModelarDB-RS uses for segments.
    output_path = f"{file_name}_{tag_columns}.parquet"
    parquet.write_table(table, output_path,
                        column_encoding='PLAIN',
                        compression='zstd',
                        use_dictionary=False,
                        write_statistics=False)
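
For reference, the script is executed with the folder containing the Apache Parquet files and the number of tag columns to generate as arguments, e.g., python tag-scaler.py data 16 for the largest configuration in the table below (the folder name data is just a placeholder).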

The raw data set is 6.7G as Apache Parquet files compressed with Snappy. The script was executed with this data set and different numbers of tag columns to generate. The results are shown in the table below:

Number of Tag Columns    Size
0                        4.9G
1                        4.9G
2                        4.9G
4                        5.0G
8                        5.1G
16                       5.4G
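
For reference, the ~10% figure quoted below follows directly from this table when the size with zero tag columns (4.9G) is used as the baseline. A minimal check, assuming the sizes can be compared as plain floating-point numbers:

# Sizes from the table above, indexed by the number of tag columns.
sizes = {0: 4.9, 1: 4.9, 2: 4.9, 4: 5.0, 8: 5.1, 16: 5.4}
baseline = sizes[0]

for tag_columns, size in sizes.items():
    overhead = (size - baseline) / baseline * 100
    print(f"{tag_columns} tag columns: +{overhead:.1f}%")

# The worst case, 16 tag columns, adds (5.4 - 4.9) / 4.9 ≈ 10.2%.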

The results show that storing tags in the Apache Parquet files increases the amount of storage required by at most ~10%, even in this worst case. Thus, it seems reasonable to move tags to the Apache Parquet files with segments to gain the benefits described in the previous comments. However, the overhead may be larger than preferred in the short term because the tags have to be stored for each field column. If this becomes an issue, it may be worth redesigning the segment storage format to store all segments for a time interval and set of tags together in a single Apache Parquet file, despite that file then containing data for different columns. This should also reduce the number of files read during query processing and thus make query processing more efficient.
