Report bytes processed after running BigQuery SQL #14
Comments
I would really like to see something like this as well. I am investigating a strategy to monitor and alert our org about query costs whilst using dbt.
Thanks for opening this issue @tnightengale! I'm super into this idea (see also dbt-labs/dbt-core#401). The questions here have always been 1) when does this code run and 2) how is the information rendered. I'm definitely into the idea of recording the bytes billed in the structured logs and CLI output. I also really like the idea of fetching the estimated bytes billed on BQ before queries are run, but I'm not sure how/where they should be rendered out to users. Do y'all have any opinions on this?
@drewbanin I think the request for this information is already made roughly here, but the relevant bit of the response isn't surfaced anywhere. I think the structured log bit might be fairly straightforward, just printing the bytes_billed at the time of querying. But the total bytes billed during a dbt run would need storing and calculating somewhere new. I love your idea of estimated bytes billed - this is also available through the BigQuery API by making a dry_run request - I see there is already a very basic WIP PR for that here: dbt-labs/dbt-core#1567
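For anyone who wants to poke at this outside of dbt, here's a minimal sketch of reading those figures off a finished query job with the google-cloud-bigquery Python client (the project ID and query are placeholders):

```python
# Minimal sketch: read bytes processed/billed from a finished BigQuery job.
# Assumes the google-cloud-bigquery client library; project and SQL are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID

query_job = client.query("SELECT word FROM `bigquery-public-data.samples.shakespeare`")
query_job.result()  # wait for the job to finish

# Both figures are populated on the completed job object.
print(f"bytes processed: {query_job.total_bytes_processed}")
print(f"bytes billed:    {query_job.total_bytes_billed}")
```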
I can't tell if this is possible already, but it would be helpful to get this information prior to a run to estimate cost for a query, like I can do via the UI. This would essentially mimic the dry-run estimate the UI shows before a query is executed. My workaround currently is copying/pasting compiled SQL into the UI, which feels less-than-elegant.
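A rough sketch of that workaround without the copy/paste, assuming the compiled SQL sits under dbt's usual `target/compiled/...` output path and using the BigQuery client's dry-run mode (the file path and project ID are placeholders):

```python
# Sketch: estimate bytes for a compiled dbt model via a BigQuery dry run.
# The path below is a placeholder for whatever `dbt compile` wrote out.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID

with open("target/compiled/my_project/models/my_model.sql") as f:
    sql = f.read()

# dry_run=True validates the query and returns the bytes estimate without running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print(f"Estimated bytes processed: {job.total_bytes_processed}")
```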
@ryantuck this is pretty doable if the run is for a single model in a built-out dataset, but it becomes really challenging in the more general case. What happens if:
How can dbt leverage BQ's dry run capabilities to return realistic byte counts in cases like these? Seems to me like we'd either need to acknowledge the major caveats, or add some severe guardrails to make sure people don't misinterpret the outputs.
@drewbanin oh, that's a good call-out. Two technical solutions come immediately to mind:
Haven't thought those through too much, so I might be missing something crucial :)
Would it be possible to get the run time of the job too, as a second dimension? Happy to help contribute.
Actually, I'm not sure it should be part of dbt's remit: since this is isolated to BigQuery, it seems like a preference feature of a particular data warehouse. It's actually pretty straightforward to get this data out if you have Cloud Billing audit logs enabled; Google provides an example query for exactly this. You can then just put a model in dbt to build that table and you're good to go: query it as much as you want and set alerts on top of it if runtime slots exceed x seconds, bytes billed exceed a threshold, etc.
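If you don't have the audit-log export set up, a similar per-job breakdown can be pulled from BigQuery's `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view. A hedged sketch (the region qualifier, project ID, lookback window, and the $5/TiB on-demand rate are assumptions to adjust):

```python
# Sketch: summarise recent query cost per user from INFORMATION_SCHEMA.
# Region qualifier, project and the 7-day window are assumptions - adjust to taste.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID

sql = """
SELECT
  user_email,
  COUNT(*)                                  AS queries,
  SUM(total_bytes_billed)                   AS total_bytes_billed,
  SUM(total_bytes_billed) / POW(2, 40) * 5  AS approx_usd_on_demand
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY total_bytes_billed DESC
"""

for row in client.query(sql).result():
    print(row.user_email, row.total_bytes_billed, round(row.approx_usd_on_demand, 2))
```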
At least for us switching to dbt, not having this is a step backwards. We have Cloud Billing audit logs enabled, but we very much value being able to see the cost of a job in the same place as the output of the job itself.
Using the audit log is also our current approach. Actually, recently BigQuery released …
I've got this script to parse the log file and print out the total:

```python
import re

# Read a saved dbt log (e.g. the output of `dbt run` piped to log.txt)
with open('./log.txt', 'r') as f:
    logs = f.read()

# dbt's BigQuery output lines end with e.g. "(... rows, 1.2 GB processed)"
bytes_scanned = re.findall(r', (.*?) processed', logs)

units = {"Bytes": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def parse_size(size):
    """Convert a string like '1.2 GB' into a number of bytes."""
    number, unit = [string.strip() for string in size.split()]
    return int(float(number) * units[unit])

def sizeof_fmt(num, suffix='B'):
    """Format a byte count using binary (KiB/MiB/...) units."""
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

total_bytes = list(map(parse_size, bytes_scanned))
print(sizeof_fmt(sum(total_bytes)))
```
Given that this information has been available in the adapter response for a while, I'm going to transfer this into the dbt-bigquery repo. However, this will require an adjustment to the shared method.
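In the meantime, because the adapter response for each model already lands in `target/run_results.json`, a per-run total can be pulled from there. A rough sketch, assuming your dbt-bigquery version populates `bytes_processed` in `adapter_response`:

```python
# Sketch: sum bytes processed across a dbt run from run_results.json.
# Assumes dbt-bigquery populates adapter_response.bytes_processed (it may be
# absent on older versions or for skipped/errored nodes).
import json

with open("target/run_results.json") as f:
    run_results = json.load(f)

total = 0
for result in run_results["results"]:
    bytes_processed = (result.get("adapter_response") or {}).get("bytes_processed")
    if bytes_processed:
        total += bytes_processed
        print(f"{result['unique_id']}: {bytes_processed / 10**9:.2f} GB")

print(f"Total: {total / 10**12:.3f} TB processed (~USD ${total / 2**40 * 5:.2f} on-demand)")
```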
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Still seems worth doing, bot, please reopen. Similar to the Python script above, here's a quick hack in JS (expects the log to be piped to it; e.g. save it as `./bin/dbt-log-summary.mjs`, make it executable, and pipe a saved log to it as shown in the header comment):

````js
#!/usr/bin/env node
/**
 * Hack to extract data from `dbt` log data and provide summary stats.
 *
 * Capture the dbt logs and then pipe them to this, e.g.
 *
 * ```sh
 * dbt run | tee dbt.log
 * ./bin/dbt-log-summary.mjs < dbt.log
 * ```
 */
import fs from 'fs'

const PROCESSED_LINE_RE = /created sql.*model ([a-zA-Z0-9_\.]+).*\(([1-9\.]+)\s*([a-zA-Z]*) rows, ([0-9\.]+)\s*([a-zA-Z]+) processed\).*in ([0-9\.]+)\s*([a-zA-Z]*)/
const COST_PER_TiB = 5

// Unit multipliers for row counts (k/m), durations (s) and data volumes.
const multipliers = {
  '': 1,
  k: 1_000,
  m: 1_000_000,
  s: 1,
  KB: 1_000,
  MB: 1_000_000,
  GB: 1_000_000_000,
  TB: 1_000_000_000_000,
  KiB: 1_024,
  MiB: 1_024 ** 2,
  GiB: 1_024 ** 3,
  TiB: 1_024 ** 4,
}
const multiply = x => multipliers[x]
const sum = arr => arr.reduce((acc, v) => acc + v, 0)
const round = (value, places) => Math.round((value + Number.EPSILON) * (10 ** places)) / (10 ** places)
const format = (value, places = 2) => round(value, places).toLocaleString('en-AU')
const simpleFormat = num => num.toLocaleString('en-AU', { notation: 'compact' })

// Pipe from stdin expected
const stdin = fs.readFileSync(0, 'utf8')
const lines = stdin.split('\n')

// Parse each matching log line into { model, rows, data (bytes), duration (s) }.
const processed = lines
  .map(line => {
    const [
      , model, _rows, rowsUnits, _data, dataUnits, _duration, durationUnits,
    ] = line.match(PROCESSED_LINE_RE) || []
    if (!model) return
    const multiples = [rowsUnits, dataUnits, durationUnits].map(multiply)
    const [rows, data, duration] = [_rows, _data, _duration].map(Number).map((v, i) => v * multiples[i])
    return {
      model,
      _rows, rowsUnits, rows,
      _data, dataUnits, data,
      _duration, durationUnits, duration,
      line,
    }
  })
  .filter(Boolean)

if (!processed.length) {
  console.warn(`No data extracted from ${lines.length} line(s)!`)
  console.log(stdin)
  process.exit(1)
}

// Sanity-check that every parsed field is a usable number.
const assert = key => (item) => {
  const val = item[key]
  if (Number.isNaN(val) || val === undefined) {
    throw new Error(`${key} is invalid in ${JSON.stringify(item)}`)
  }
}
;['data', 'rows', 'duration'].map(key => processed.map(assert(key)))

const totalData = sum(processed.map(p => p.data))
const totalRows = sum(processed.map(p => p.rows))
const totalDuration = sum(processed.map(p => p.duration))

console.log(`Appx. cost: USD $${round(totalData / multipliers.TiB * COST_PER_TiB, 2)}`)
console.log(`Total data: ${simpleFormat(totalData)} bytes (${format(totalData / multipliers.TiB)} TiB)`)
console.log(`Total rows: ${simpleFormat(totalRows)}`)
console.log(
  `Total duration: ${simpleFormat(totalDuration)} seconds (${simpleFormat(Math.floor(totalDuration / 60))} minute(s), ${round(totalDuration % 60, 0)} second(s))`,
  '\n* execution time (some concurrent, not start to end time)'
)

console.log('\nmost data')
console.log(processed.sort((a, b) => b.data - a.data).slice(0, 10).map(r => r.line).join('\n'))
console.log('\nmost rows')
console.log(processed.sort((a, b) => b.rows - a.rows).slice(0, 10).map(r => r.line).join('\n'))
console.log('\nlongest duration')
console.log(processed.sort((a, b) => b.duration - a.duration).slice(0, 10).map(r => r.line).join('\n'))
````
There's also no logging at all of this data usage from tests, which is a potential cost iceberg (especially with BQ, and some of the tests having had …)
Describe the feature
As BigQuery bills users based on the volume of data processed, it's fairly important for users to be aware of this while developing models and pipelines. In the BQ UI this is fairly clearly presented after each query run, and the same information is available in the `jobs.query` API response. You can find `totalBytesProcessed` documented in the linked API documentation. It would be helpful to see this output in the JSON logging, and also summarised at the end of a run with the default log formatter.
Describe alternatives you've considered
If you have enabled Export to BigQuery within your Google Cloud billing project, it is possible to query how much data the service account behind your dbt user has processed. This could then be queried and monitored outside of dbt, but it would be much more helpful to have an indication of query size in real time - this is an important optimisation when building BigQuery pipelines.
Who will this benefit?
This is very important for all users of BigQuery who have enough data stored to have a reasonable spend, but who are not so large that they have purchased commitments (flat fee / all you can eat) from Google. Even then, it's still best practice to pay attention to how much data you are processing...