-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Switch from normal sampling to random sampler for Index data visualizer table #144646
Conversation
Pinging @elastic/ml-ui (:ml) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any safety checks for small data?
newProbability === Infinity || | ||
numSampled / initialDefaultProbability < 1e7 | ||
) { | ||
if (numSampled === 0 || newProbability === Infinity || numSampled < 5) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is 5
chosen here? It seems to me it should be much bigger.
Also, numSampled === 0
already implies numSampled < 5
, so that equals 0
check is redundant.
isTopValuesSampled: | ||
field.cardinality >= SAMPLER_TOP_TERMS_THRESHOLD || samplerShardSize > 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems wrong now? I don't think you use SAMPLER_TOP_TERMS_THRESHOLD
anylonger when determining if a thing is sampled or not.
…nges and overallStats is refetched
For total documents we recalculate now an estimated number based on the sampling. Do we aim for doing the same for all other data too in the table so that for example the documents stats would also be estimated full numbers recalculated from the sampling probability? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data Discovery changes LGTM 👍
Stats in Sidebars' field popover and Field Statistics table match.
Currently the counts in the expanded rows are not consistent: The doc stats add up to the sampled doc count, whereas the top values add up to the total doc count. We should be consistent here - either using the total count or the sampled count in both places. The previous approached used the sampled total. |
@@ -97,13 +99,55 @@ interface Props { | |||
} | |||
|
|||
export const ChoroplethMap: FC<Props> = ({ stats, suggestion }) => { | |||
const { fieldName, isTopValuesSampled, topValues, topValuesSamplerShardSize } = stats!; | |||
const { | |||
services: { data }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, could be
services: { data }, | |
services: { data: { fieldFormats } }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
(#144646)
const docsPercent = | ||
valueCount !== undefined && sampleCount !== undefined | ||
? roundToDecimalPlace((valueCount / sampleCount) * 100) | ||
: 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's quite a lot of logic here, a comment explaining what it's doing would be useful.
Do we definitely want to be showing 0%
if value valueCount
or sampleCount
are undefined? should we instead not show this item at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
(#144646)
const { stats } = config; | ||
const { | ||
services: { data }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit,
services: { data }, | |
services: { data: { fieldFormats } }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
(#144646)
|
||
// If field exists is docs but we don't have count stats then don't show | ||
// Otherwise if field doesn't appear in docs at all, show 0% | ||
const docsCount = | ||
const valueCount = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just notices this is the same logic as before, could it be moved to a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
(#144646)
@@ -36,8 +36,7 @@ interface Props { | |||
onAddFilter?: (field: DataViewField | string, value: string, type: '+' | '-') => void; | |||
} | |||
|
|||
function getPercentLabel(docCount: number, topValuesSampleSize: number): string { | |||
const percent = (100 * docCount) / topValuesSampleSize; | |||
function _getPercentLabel(percent: number): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably doesn't need an underscore prefix. We sometimes do this to signify a function is private when inside another function or class, but it is clear that this is not exported so will be private to the module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
(#144646)
const totalDocuments = stats.totalDocuments ?? 0; | ||
const topValuesOtherCountPercent = | ||
1 - (topValues ? topValues.reduce((acc, bucket) => acc + bucket.percent, 0) : 0); | ||
const topValuesOtherCount = Math.floor(topValuesOtherCountPercent * (sampleCount || 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although it doesn't matter logically here, for consistency it might be better to still use ??
because the point of the check is to look for undefined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
(#144646)
export type FieldStatsEmbeddableSamplerOption = | ||
typeof EMBEDDABLE_SAMPLER_OPTION[keyof typeof EMBEDDABLE_SAMPLER_OPTION]; | ||
|
||
export function isRandomSamplingOption(arg: SamplingOption): arg is RandomSamplingOption { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these guards might be better living next to the types in common/types/field_stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
(#144646)
} | ||
|
||
export function buildAggregationWithSamplingOption( | ||
aggs: any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this any
be replaced by a correct type?
maybe Record<string, estypes.AggregationsAggregationContainer>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
(#144646)
} | ||
|
||
export function buildSamplerAggregation( | ||
aggs: any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with this any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
* Wraps the supplied aggregations in a random sampler aggregation. | ||
*/ | ||
export function buildRandomSamplerAggregation( | ||
aggs: any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with this any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 80872e0
@peteharverson Thanks for catching that. The numbers were adding up to the sampled # of docs for the numerical and boolean top stats, but not for string/keyword top terms. This is because the buckets' doc_count numbers returning from elasticsearch itself was not for the values sampled. I've added a fix for that by scaling the numbers down to match the # sampled docs (so
@jgowdyelastic Good point! I've added debouncing here |
if (setSamplingProbability) { | ||
setSamplingProbability(closestProbability / 100); | ||
} | ||
updateSamplingProbability(closestProbability / 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be more performant to put all of the logic in this onChange
function inside the function that is wrapped in the debounce. There's no need to be calculating the closestProbability
on every change if it's just going to be discarded.
Also there is a useful hook called useDebounce
which might work well here.
You could put the e.currentTarget.value
in a temporary state variable e.g. newProbability
and then useDebounce
could watch for changes in that variable.
id="xpack.dataVisualizer.dataGrid.field.topValues.calculatedFromSampleDescription" | ||
defaultMessage="Calculated from sample of {topValuesSamplerShardSize} documents per shard" | ||
id="xpack.dataVisualizer.dataGrid.field.topValues.calculatedFromSampleRecordsLabel" | ||
defaultMessage="Calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 7d7ba42
(#144646)
{isTopValuesSampled ? ( | ||
<FormattedMessage | ||
id="xpack.dataVisualizer.dataGrid.fieldExpandedRow.choroplethMapTopValues.calculatedFromSampleRecordsLabel" | ||
defaultMessage="Calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here 7d7ba42
(#144646)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested latest changes and LGTM.
Switch of position of the random sampler controls cog icon to the right looks good - this is consistent now with the position of the settings control in Discover and prevents the control jumping around as the slider is moved and the chart is reloaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine merge upstream |
💚 Build Succeeded
Metrics [docs]Module Count
Async chunks
Unknown metric groupsESLint disabled in files
ESLint disabled line counts
References to deprecated APIs
Total ESLint disabled count
History
To update your PR or re-run it, just comment with: cc @qn895 |
const multiplier = | ||
count > sampleCount ? get(aggregations, [...aggsPath, 'probability'], 1) : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same concerns here as above.
// Sampler agg will yield doc_count that's bigger than the actual # of sampled records | ||
// because it uses the stored _doc_count if available | ||
// https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-doc-count-field.html | ||
// therefore we need to correct it by multiplying by the sampled probability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this is doing what you are commenting.
I think this is scaling the count back DOWN via the probability (random_sampler not only uses _doc_count
, but scales sampled counting statistics by the inverted probability). In doing this, you ensure that doc count is near the sampledCount, but this has nothing to do with _doc_count.
const topValues = topValuesBuckets.map((bucket) => ({ | ||
...bucket, | ||
doc_count: sampledCount | ||
? Math.floor(bucket.doc_count * (sampledCount / realNumberOfDocuments)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to have a comment here explaining sampledCount / realNumberOfDocuments
that this was done to "scale back down" doc_count to be lower than the sampled number of documents.
The sum of the bucket values is indeed to handle the _doc_count
weirdness.
Summary
This PR switches the currently sampling method from normal sampling to random sampler for Index data visualizer table. It also lowers the threshold for when the sampling can be done in the interest of speed.
Checklist
Delete any items that are not applicable to this PR.
Risk Matrix
Delete this section if it is not applicable to this PR.
Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.
When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:
For maintainers