This repository has been archived by the owner on Feb 12, 2022. It is now read-only.
Our current stats gathering is way too simplistic - it only keeps a per-client-connection cache of the min and max key of each table. Instead, we should:
- have a system table that stores the stats
- create a coprocessor that updates the stats during compaction (i.e. using the `preCompactSelection`, `postCompactSelection`, `preCompact`, and `postCompact` methods)
- keep a kind of histogram: the key boundary at every N bytes within a region. Perhaps we can do a delta update on minor compaction and a complete update on major compaction.
- keep the min key/max key of a table in the stats table too
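The histogram idea above can be sketched independently of HBase: a collector that consumes key/values in sorted key order (as a compaction scanner would) and records a guidepost key each time roughly N bytes have passed, while also tracking the min/max key for the stats table. The class and method names here (`GuidepostCollector`, `add`) are illustrative, not an existing API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the equal-depth "key boundary every N bytes" idea.
public class GuidepostCollector {
    private final long bytesPerGuidepost;
    private long bytesSinceLastGuidepost = 0;
    private final List<byte[]> guideposts = new ArrayList<>();
    private byte[] minKey = null;
    private byte[] maxKey = null;

    public GuidepostCollector(long bytesPerGuidepost) {
        this.bytesPerGuidepost = bytesPerGuidepost;
    }

    // Called once per key/value, in sorted key order.
    public void add(byte[] key, int valueSize) {
        if (minKey == null) minKey = key;
        maxKey = key;
        bytesSinceLastGuidepost += key.length + valueSize;
        if (bytesSinceLastGuidepost >= bytesPerGuidepost) {
            guideposts.add(key);          // boundary key for this N-byte chunk
            bytesSinceLastGuidepost = 0;
        }
    }

    public List<byte[]> getGuideposts() { return guideposts; }
    public byte[] getMinKey() { return minKey; }
    public byte[] getMaxKey() { return maxKey; }

    public static void main(String[] args) {
        GuidepostCollector c = new GuidepostCollector(100);
        for (int i = 0; i < 50; i++) {
            byte[] key = String.format("row%05d", i).getBytes(StandardCharsets.UTF_8);
            c.add(key, 12);   // 8-byte key + 12-byte value = 20 bytes per row
        }
        System.out.println(c.getGuideposts().size()); // prints 10
    }
}
```

On a minor compaction only the compacted files' guideposts would need recomputing (the delta update), while a major compaction sees every key/value and can rebuild the whole list.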
Wow, that HyperLogLog is pretty interesting - thanks for the pointer. For stats, we're calculating them at major compaction, where a full pass is made through the data anyway, so I don't think it'll help there. But for COUNT DISTINCT and SELECT DISTINCT, it could definitely be useful.
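For reference, here is a toy HyperLogLog in plain Java (no Phoenix/HBase code) showing why it only yields an approximate cardinality - the registers keep the maximum leading-zero rank per bucket, so the original values are not recoverable. The hash function and precision choice are illustrative, not what any real library uses.

```java
// Toy HyperLogLog: estimates the number of distinct values seen.
public class SimpleHyperLogLog {
    private final int p;            // precision: 2^p registers
    private final int m;
    private final byte[] registers;

    public SimpleHyperLogLog(int p) {
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // 64-bit FNV-1a followed by a finalizer mix; illustrative only.
    private static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    public void add(String value) {
        long x = hash(value);
        int idx = (int) (x >>> (64 - p));                // top p bits pick a register
        long rest = x << p;                              // remaining bits
        int rank = Long.numberOfLeadingZeros(rest) + 1;  // position of first 1-bit
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    public long cardinality() {
        double alpha = 0.7213 / (1.0 + 1.079 / m);
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double est = alpha * m * m / sum;
        if (est <= 2.5 * m && zeros > 0) {
            est = m * Math.log((double) m / zeros);      // linear counting for small counts
        }
        return Math.round(est);
    }

    public static void main(String[] args) {
        SimpleHyperLogLog hll = new SimpleHyperLogLog(12);
        for (int i = 0; i < 10000; i++) hll.add("value-" + i);
        System.out.println(hll.cardinality()); // close to 10000, within a few percent
    }
}
```

With 2^12 registers the sketch is 4 KB regardless of how many distinct values it has seen, which is what makes it attractive for per-column stats.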
It will only give the cardinality, not the unique values themselves. I'm wondering whether we can implement a combination of HyperLogLog and a Bloom filter over the column values to determine the strategy for aggregating the data. If so, that would be awesome.
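The combination suggested here could pair a per-column HyperLogLog (approximate cardinality) with a Bloom filter (approximate membership). A minimal Bloom filter sketch, again with illustrative names and parameters rather than any existing API:

```java
import java.util.BitSet;

// Toy Bloom filter: answers "possibly present" / "definitely absent".
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 64-bit finalizer mix to spread the input hash bits.
    private static long mix(long h) {
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // Derive the i-th bit index via double hashing.
    private int index(long h, int i) {
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String value) {
        long h = mix(value.hashCode());
        for (int i = 0; i < numHashes; i++) bits.set(index(h, i));
    }

    public boolean mightContain(String value) {
        long h = mix(value.hashCode());
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h, i))) return false;  // definitely absent
        }
        return true;                                   // possibly present
    }

    public static void main(String[] args) {
        SimpleBloomFilter bf = new SimpleBloomFilter(1 << 16, 3);
        bf.add("col-value-a");
        System.out.println(bf.mightContain("col-value-a")); // prints true
    }
}
```

A negative answer from the filter is definite while a positive may be a false positive, which is exactly complementary to HyperLogLog's cardinality-only answer: together they could inform an aggregation strategy without storing the values themselves.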