Report bytes processed after running BigQuery SQL #14
Comments
I would really like to see something like this as well. I am investigating a strategy to monitor and alert our org about query costs whilst using dbt.
Thanks for opening this issue @tnightengale! I'm super into this idea (see also dbt-labs/dbt-core#401). The questions here have always been 1) when does this code run and 2) how is the information rendered. I'm definitely into the idea of recording the bytes billed in the structured logs and CLI output. I also really like the idea of fetching the estimated bytes billed on BQ before queries are run, but I'm not sure how/where they should be rendered out to users. Do y'all have any opinions on this?
@drewbanin I think the request for this information is already made roughly here, but the relevant bit of the response isn't surfaced anywhere. I think the structured log bit might be fairly straightforward, just printing the bytes_billed at the time of querying. But the total bytes billed during a dbt run would need storing and calculating somewhere new. I love your idea of estimated bytes billed - this is also available through the BigQuery API by making a dry_run request - I see there is already a very basic WIP PR for that here: dbt-labs/dbt-core#1567
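For anyone who wants to poke at this outside of dbt, here's a minimal sketch of reading those figures off a finished query job with the google-cloud-bigquery Python client (the project ID and query are placeholders):

```python
# Minimal sketch: read bytes processed/billed from a finished BigQuery job.
# Assumes the google-cloud-bigquery client library; project and SQL are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID

query_job = client.query("SELECT word FROM `bigquery-public-data.samples.shakespeare`")
query_job.result()  # wait for the job to finish

# Both figures are populated on the completed job object.
print(f"bytes processed: {query_job.total_bytes_processed}")
print(f"bytes billed:    {query_job.total_bytes_billed}")
```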
I can't tell if this is possible already, but it would be helpful to get this information prior to a run to estimate cost for a query, like I can do via the UI. This would essentially mimic the dry-run estimate the UI shows before a query is executed. My workaround currently is copying/pasting compiled SQL into the UI, which feels less-than-elegant.
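A rough sketch of that workaround without the copy/paste, assuming the compiled SQL sits under dbt's usual `target/compiled/...` output path and using the BigQuery client's dry-run mode (the file path and project ID are placeholders):

```python
# Sketch: estimate bytes for a compiled dbt model via a BigQuery dry run.
# The path below is a placeholder for whatever `dbt compile` wrote out.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID

with open("target/compiled/my_project/models/my_model.sql") as f:
    sql = f.read()

# dry_run=True validates the query and returns the bytes estimate without running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print(f"Estimated bytes processed: {job.total_bytes_processed}")
```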
@ryantuck this is pretty doable if the run is for a single model in a built-out dataset, but it becomes really challenging in the more general case. What happens if:
How can dbt leverage BQ's dry run capabilities to return realistic byte counts in cases like these? Seems to me like we'd either need to acknowledge the major caveats, or add some severe guardrails to make sure people don't misinterpret the outputs.
@drewbanin oh, that's a good call-out. Two technical solutions come immediately to mind:
Haven't thought those through too much, so I might be missing something crucial :)
Would it be possible to get the run time of the job too, as a second dimension? Happy to help contribute.
Actually, I'm not sure it should be part of dbt's remit: since this is isolated to BigQuery, it seems like a preference feature of a particular data warehouse. It's actually pretty straightforward to get this data out if you have Cloud Billing audit logs enabled; Google provides an example query for exactly this. You can then just put a model in dbt to build that table and you're good to go: query it as much as you want and set alerts on top of it if runtime slots exceed x seconds, bytes billed exceed a threshold, etc.
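If you don't have the audit-log export set up, a similar per-job breakdown can be pulled from BigQuery's `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view. A hedged sketch (the region qualifier, project ID, lookback window, and the $5/TiB on-demand rate are assumptions to adjust):

```python
# Sketch: summarise recent query cost per user from INFORMATION_SCHEMA.
# Region qualifier, project and the 7-day window are assumptions - adjust to taste.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID

sql = """
SELECT
  user_email,
  COUNT(*)                                  AS queries,
  SUM(total_bytes_billed)                   AS total_bytes_billed,
  SUM(total_bytes_billed) / POW(2, 40) * 5  AS approx_usd_on_demand
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY total_bytes_billed DESC
"""

for row in client.query(sql).result():
    print(row.user_email, row.total_bytes_billed, round(row.approx_usd_on_demand, 2))
```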
At least for us switching to dbt, not having this is a step backwards. We have Cloud Billing audit logs enabled, but we very much value being able to see the cost of a job in the same place as the output of the job itself.
Using the audit log is also our current approach. Actually, recently BigQuery released …
I've got this script to parse the log file and print out the total:

```python
import re

# Read a saved dbt log (e.g. the output of `dbt run` piped to log.txt)
with open('./log.txt', 'r') as f:
    logs = f.read()

# dbt's BigQuery output lines end with e.g. "(... rows, 1.2 GB processed)"
bytes_scanned = re.findall(r', (.*?) processed', logs)

units = {"Bytes": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def parse_size(size):
    """Convert a string like '1.2 GB' into a number of bytes."""
    number, unit = [string.strip() for string in size.split()]
    return int(float(number) * units[unit])

def sizeof_fmt(num, suffix='B'):
    """Format a byte count using binary (KiB/MiB/...) units."""
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

total_bytes = list(map(parse_size, bytes_scanned))
print(sizeof_fmt(sum(total_bytes)))
```
Given that this information has been available in the adapter response for a while, I'm going to transfer this into the dbt-bigquery repo. However, this will require an adjustment to the shared method.
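In the meantime, because the adapter response for each model already lands in `target/run_results.json`, a per-run total can be pulled from there. A rough sketch, assuming your dbt-bigquery version populates `bytes_processed` in `adapter_response`:

```python
# Sketch: sum bytes processed across a dbt run from run_results.json.
# Assumes dbt-bigquery populates adapter_response.bytes_processed (it may be
# absent on older versions or for skipped/errored nodes).
import json

with open("target/run_results.json") as f:
    run_results = json.load(f)

total = 0
for result in run_results["results"]:
    bytes_processed = (result.get("adapter_response") or {}).get("bytes_processed")
    if bytes_processed:
        total += bytes_processed
        print(f"{result['unique_id']}: {bytes_processed / 10**9:.2f} GB")

print(f"Total: {total / 10**12:.3f} TB processed (~USD ${total / 2**40 * 5:.2f} on-demand)")
```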
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Still seems worth doing, bot, please reopen. Similar to the Python script above, here's a quick hack in JS (expects the log to be piped to it; e.g. save it as `./bin/dbt-log-summary.mjs`, make it executable, and pipe a saved log to it as shown in the header comment):

````js
#!/usr/bin/env node
/**
 * Hack to extract data from `dbt` log data and provide summary stats.
 *
 * Capture the dbt logs and then pipe them to this, e.g.
 *
 * ```sh
 * dbt run | tee dbt.log
 * ./bin/dbt-log-summary.mjs < dbt.log
 * ```
 */
import fs from 'fs'

const PROCESSED_LINE_RE = /created sql.*model ([a-zA-Z0-9_\.]+).*\(([1-9\.]+)\s*([a-zA-Z]*) rows, ([0-9\.]+)\s*([a-zA-Z]+) processed\).*in ([0-9\.]+)\s*([a-zA-Z]*)/
const COST_PER_TiB = 5

// Unit multipliers for row counts (k/m), durations (s) and data volumes.
const multipliers = {
  '': 1,
  k: 1_000,
  m: 1_000_000,
  s: 1,
  KB: 1_000,
  MB: 1_000_000,
  GB: 1_000_000_000,
  TB: 1_000_000_000_000,
  KiB: 1_024,
  MiB: 1_024 ** 2,
  GiB: 1_024 ** 3,
  TiB: 1_024 ** 4,
}
const multiply = x => multipliers[x]
const sum = arr => arr.reduce((acc, v) => acc + v, 0)
const round = (value, places) => Math.round((value + Number.EPSILON) * (10 ** places)) / (10 ** places)
const format = (value, places = 2) => round(value, places).toLocaleString('en-AU')
const simpleFormat = num => num.toLocaleString('en-AU', { notation: 'compact' })

// Pipe from stdin expected
const stdin = fs.readFileSync(0, 'utf8')
const lines = stdin.split('\n')

// Parse each matching log line into { model, rows, data (bytes), duration (s) }.
const processed = lines
  .map(line => {
    const [
      , model, _rows, rowsUnits, _data, dataUnits, _duration, durationUnits,
    ] = line.match(PROCESSED_LINE_RE) || []
    if (!model) return
    const multiples = [rowsUnits, dataUnits, durationUnits].map(multiply)
    const [rows, data, duration] = [_rows, _data, _duration].map(Number).map((v, i) => v * multiples[i])
    return {
      model,
      _rows, rowsUnits, rows,
      _data, dataUnits, data,
      _duration, durationUnits, duration,
      line,
    }
  })
  .filter(Boolean)

if (!processed.length) {
  console.warn(`No data extracted from ${lines.length} line(s)!`)
  console.log(stdin)
  process.exit(1)
}

// Sanity-check that every parsed field is a usable number.
const assert = key => (item) => {
  const val = item[key]
  if (Number.isNaN(val) || val === undefined) {
    throw new Error(`${key} is invalid in ${JSON.stringify(item)}`)
  }
}
;['data', 'rows', 'duration'].map(key => processed.map(assert(key)))

const totalData = sum(processed.map(p => p.data))
const totalRows = sum(processed.map(p => p.rows))
const totalDuration = sum(processed.map(p => p.duration))

console.log(`Appx. cost: USD $${round(totalData / multipliers.TiB * COST_PER_TiB, 2)}`)
console.log(`Total data: ${simpleFormat(totalData)} bytes (${format(totalData / multipliers.TiB)} TiB)`)
console.log(`Total rows: ${simpleFormat(totalRows)}`)
console.log(
  `Total duration: ${simpleFormat(totalDuration)} seconds (${simpleFormat(Math.floor(totalDuration / 60))} minute(s), ${round(totalDuration % 60, 0)} second(s))`,
  '\n* execution time (some concurrent, not start to end time)'
)

console.log('\nmost data')
console.log(processed.sort((a, b) => b.data - a.data).slice(0, 10).map(r => r.line).join('\n'))
console.log('\nmost rows')
console.log(processed.sort((a, b) => b.rows - a.rows).slice(0, 10).map(r => r.line).join('\n'))
console.log('\nlongest duration')
console.log(processed.sort((a, b) => b.duration - a.duration).slice(0, 10).map(r => r.line).join('\n'))
````
There's also no logging at all of this data usage from tests, which is a potential cost iceberg (especially with BQ, and some of the tests having had …)
Describe the feature
As BigQuery bills users based on the volume of data processed, it's fairly important for users to be aware of this while developing models and pipelines. In the BQ UI this is fairly clearly presented after each query run, and the same information is available in the `jobs.query` API response. You can find `totalBytesProcessed` documented in the linked API documentation. It would be helpful to see this output in the JSON logging, and also summarised at the end of a run with the default log formatter.
Describe alternatives you've considered
If you have enabled Export to BigQuery within your Google Cloud billing project, it is possible to query how much data the service account behind your dbt user has processed. This could then be queried and monitored outside of dbt, but it would be much more helpful to have an indication of query size in real time - this is an important optimisation when building BigQuery pipelines.
Who will this benefit?
This is very important for all users of BigQuery who have enough data stored to have a reasonable spend, but who are not so large that they have purchased commitments (flat fee / all you can eat) from Google. Even then, it's still best practice to pay attention to how much data you are processing...