Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1892] [Bug] Non-standard abbreviations for kibibyte, etc #477

Closed
2 tasks done
dbeatty10 opened this issue Jan 24, 2023 · 0 comments · Fixed by #482
Closed
2 tasks done

[CT-1892] [Bug] Non-standard abbreviations for kibibyte, etc #477

dbeatty10 opened this issue Jan 24, 2023 · 0 comments · Fixed by #482
Labels
type:bug Something isn't working

Comments

@dbeatty10
Copy link
Contributor

dbeatty10 commented Jan 24, 2023

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Context

IEC 80000-13

Binary units of measurement express the size of data more accurately. When you compare the size of 100 KB to 100 KiB, the difference is relatively small, 2.35%. However, this difference grows as the size of the data values increases. When you compare the size of 100 TB to 100 TiB, the difference is 9.06% [1].

image

Current Behavior

Current code:

def format_bytes(self, num_bytes):
if num_bytes:
for unit in ["Bytes", "KB", "MB", "GB", "TB", "PB"]:
if abs(num_bytes) < 1024.0:
return f"{num_bytes:3.1f} {unit}"
num_bytes /= 1024.0
num_bytes *= 1024.0
return f"{num_bytes:3.1f} {unit}"
else:
return num_bytes

Expected Behavior

If we use logic like num_bytes /= 1024.0, then I'd expect it to align with IEC 80000-13 and use the following standardized abbreviations instead:

for unit in ["Bytes", "KiB", "MiB", "GiB", "TiB", "PiB"]:

(See the chart here for where I got the official abbreviations from.)

Alternatively, the current abbreviations for decimal multi-byte units could be kept and these substitutions could be made instead:

num_bytes /= 1000.0

and:

num_bytes *= 1000.0

Steps To Reproduce

Didn't actually reproduce -- just read the code!

Relevant log output

No response

Environment

- OS: N/A
- Python: N/A
- dbt-core: >= 1.0.0
- dbt-bigquery: >= 1.0.0

Additional Context

label: tech_debt

Today I Learned (TIL), format_bytes and format_rows_number are defined in dbt-core!
https://github.com/dbt-labs/dbt-core/blob/7b464b8a4957ec7969f19234020e110be1987923/core/dbt/utils.py#L456-L473

Context for the duplication

Removing them from dbt-core was mentioned here and here as part of the discussion for #48.

It feels to me like one of the following should have happened along with #48:

  1. a simultaneous PR should have removed format_bytes and format_rows_number from dbt-core; or
  2. #48 should have been added to dbt-core instead of dbt-bigquery.

As it stands, the implementations in dbt-core are unused (to the best that we can tell), but would probably need to be maintained to preserve backwards compatibility in case some unknown adapter is using them.

Proposal

Going forward, I'd propose that we choose one (and only one) of the following:

  1. Keep separate implementations in dbt-core and dbt-bigquery (and fix the KB vs. KiB issue in both)
  2. Remove the implementations in dbt-core (so they exist only in dbt-bigquery)
  3. Completely move the implementation in dbt-bigquery back into dbt-core
@dbeatty10 dbeatty10 added type:bug Something isn't working triage:product labels Jan 24, 2023
@github-actions github-actions bot changed the title [Bug] Non-standard abbreviations for kibibyte, etc [CT-1892] [Bug] Non-standard abbreviations for kibibyte, etc Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant