
Evaluate removing uid by storing tags in segments #197

Open
skejserjensen opened this issue Aug 8, 2024 · 2 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@skejserjensen
Contributor

A way to remedy #187 may be to store tags in segments and then remove univariate_id, provided it does not significantly increase the amount of space and bandwidth used. However, this change needs to be evaluated in-depth first, as the schema for compressed segments would no longer be the same for all model tables (model tables with tags would have extra columns) and it may increase the amount of storage and bandwidth required. Removing univariate_ids has multiple benefits in addition to fixing #187, e.g., it makes each segment self-contained, which reduces the complexity of data transfer and query processing, and it removes the limit on the number of columns in a model table. The problem of each model table having a different schema for compressed segments could be solved by adding each model table's compressed segment schema to ModelTableMetadata.
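
As a rough sketch of the last point, the per-model-table schema for compressed segments could be derived by appending the model table's tag columns to the columns shared by all segments and storing the result in ModelTableMetadata. The column names and types below are illustrative assumptions, not the actual columns used by ModelarDB-RS:

import pyarrow

# Columns shared by compressed segments for all model tables (an illustrative
# subset, not the exact columns used by ModelarDB-RS).
SHARED_SEGMENT_COLUMNS = [
    ("model_type_id", pyarrow.int8()),
    ("start_time", pyarrow.timestamp("ms")),
    ("end_time", pyarrow.timestamp("ms")),
    ("values", pyarrow.binary()),
]

def compressed_segment_schema(tag_column_names):
    # A model table's compressed segment schema is the shared columns followed
    # by one string column per tag column.
    columns = SHARED_SEGMENT_COLUMNS + [(name, pyarrow.string()) for name in tag_column_names]
    return pyarrow.schema(columns)

# For example, a model table with the hypothetical tag columns "park" and
# "turbine" would store this schema in its ModelTableMetadata.
schema = compressed_segment_schema(["park", "turbine"])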

@skejserjensen
Contributor Author

After it has been determined whether this change should be made and dev/delta-metadata has been merged, the schemas used for the various RecordBatches in ModelarDB should be updated so they do not use unsigned integers, and the number of schemas should be reduced to the fewest possible. While using unsigned integers for values like column indices may make the semantics clearer, it does not seem to be worth the extra complexity of converting back and forth between unsigned integers in memory and signed integers on disk.
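
As a minimal sketch of that conversion overhead, assuming pyarrow and a column index column (not ModelarDB's actual schemas):

import pyarrow

# In memory, a column index could be kept as an unsigned 16-bit integer, ...
column_indices = pyarrow.array([0, 1, 2], type=pyarrow.uint16())

# ... but it must be cast to a signed integer before being written to storage
# that only supports signed integers, and cast back after being read.
column_indices_on_disk = column_indices.cast(pyarrow.int16())
column_indices_read_back = column_indices_on_disk.cast(pyarrow.uint16())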

@skejserjensen
Contributor Author

To determine if this change should be made, I wrote a small script that adds tags to existing data sets. The script assumes that each time series is stored in a separate Apache Parquet file. The name of the file is added as the first tag column, and for each additional tag column, the script computes the SHA-512 hash of the previous tag column's value and uses that as the next tag column. This ensures that all data points of a time series are assigned the same tags and that the generated tags are reproducible. The data points with tags are written to one Apache Parquet file per time series using a configuration as close as possible to the one ModelarDB-RS uses for segments. In summary, the script is designed to be a worst-case test: the number of segments should be much smaller than the number of data points, the strings produced by sha512() are generally longer than tags observed in real-life data sets, and the number of tag columns can be set higher than what generally appears in real-life data sets. The full script can be seen below:

import glob
import sys
import hashlib

import pyarrow
from pyarrow import parquet


if len(sys.argv) < 3:
    print("usage: tag-scaler.py data_folder tag_columns")
    sys.exit(1)

data_folder = sys.argv[1]
tag_columns = int(sys.argv[2])

for file_path in glob.glob(data_folder + "/*.parquet"):
    table = parquet.read_table(file_path)

    # The file name without its extension is used as the first tag value.
    file_name = file_path[file_path.rfind("/") + 1: file_path.rfind(".")]

    tag = file_name
    for i in range(0, tag_columns):
        # Assign the same tag value to all data points in the time series.
        tag_name = "tag_" + str(i)
        tags = table.num_rows * [tag]
        tags_array = pyarrow.array(tags)
        table = table.append_column(tag_name, tags_array)

        # The next tag column's value is the SHA-512 hash of the current one
        # so the generated tags are reproducible.
        tag_bytes = bytes(tag, "UTF-8")
        tag = hashlib.sha512(tag_bytes).hexdigest()

    # Write the table using a configuration as close as possible to the one
    # ModelarDB-RS uses for segments.
    output_path = f"{file_name}_{tag_columns}.parquet"
    parquet.write_table(table, output_path,
                        column_encoding='PLAIN',
                        compression='zstd',
                        use_dictionary=False,
                        write_statistics=False)
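
For reference, the script is executed with the folder containing the Apache Parquet files and the number of tag columns to generate as arguments, e.g., python tag-scaler.py data 16 for the largest configuration in the table below (the folder name data is just a placeholder).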

The raw data set is 6.7G as Apache Parquet files compressed with Snappy. The script was executed with this data set and different numbers of tag columns to generate. The results are shown in the table below:

Number of Tag Columns    Size
0                        4.9G
1                        4.9G
2                        4.9G
4                        5.0G
8                        5.1G
16                       5.4G
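
For reference, the ~10% figure quoted below follows directly from this table when the size with zero tag columns (4.9G) is used as the baseline. A minimal check, assuming the sizes can be compared as plain floating-point numbers:

# Sizes from the table above, indexed by the number of tag columns.
sizes = {0: 4.9, 1: 4.9, 2: 4.9, 4: 5.0, 8: 5.1, 16: 5.4}
baseline = sizes[0]

for tag_columns, size in sizes.items():
    overhead = (size - baseline) / baseline * 100
    print(f"{tag_columns} tag columns: +{overhead:.1f}%")

# The worst case, 16 tag columns, adds (5.4 - 4.9) / 4.9 ≈ 10.2%.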

The results show that storing tags in the Apache Parquet files increases the amount of storage required by at most ~10%, even in this worst case. Thus, it seems reasonable to move tags to the Apache Parquet files with segments to gain the benefits described in the previous comments. However, the overhead may be larger than preferred in the short term because the tags have to be stored for each field column. If this becomes an issue, it may be worth redesigning the segment storage format to store all segments for a time interval and set of tags together in a single Apache Parquet file, despite that file then containing data for different columns. This should also reduce the number of files read during query processing and thus make query processing more efficient.
