-
Notifications
You must be signed in to change notification settings - Fork 397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aggregated LOCOs of SmartTextVectorizer outputs #308
Conversation
Codecov Report
@@ Coverage Diff @@
## master #308 +/- ##
==========================================
+ Coverage 86.51% 86.53% +0.01%
==========================================
Files 329 329
Lines 10599 10617 +18
Branches 340 546 +206
==========================================
+ Hits 9170 9187 +17
- Misses 1429 1430 +1
Continue to review full report at Codecov.
|
core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala
Outdated
Show resolved
Hide resolved
@mweilsalesforce is it ready for review? |
|
core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala
Outdated
Show resolved
Hide resolved
(avgRecordInsightRatio + featureImportanceRatio) < 0.8, | ||
"The ratio of feature strengths between important and other features should be similar to the ratio of " + | ||
"feature importances from Spark's RandomForest") | ||
} | ||
|
||
it should "aggregate text derived features" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michaelweilsalesforce @Jauntbox do these test have to be that verbose & explicit? such large tests seem barely readable to comprehend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the verbose come from the creation of the dataset. Not sure we need to move this part at the beginning of the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When looking on a test - it has to be easy to understand what are the inputs, outcomes and assertions that are verified in the test; everything else have to elsewhere.
Oopsie. I forgot to add the same logic for SmartTextMapVectorizer. Working on it. |
core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala
Outdated
Show resolved
Hide resolved
/** | ||
* These are the name of the stages we want to perform an aggregation of the LOCO results over ferived features | ||
*/ | ||
private val smartTextClassName = classOf[SmartTextVectorizer[_]].getSimpleName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be a custom text vectorizer, if yes shouldn't we handle that too ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately custom Text Vectorizer cannot necessary return hashes as output. This aggregation is super limited and doesn't cover all the use cases yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks.
val indexToExamine = baseScore.length match { | ||
case 0 => throw new RuntimeException("model does not produce scores for insights") | ||
case 1 => 0 | ||
case 2 => 1 | ||
case n if (n > 2) => baseResult.prediction.toInt | ||
case n if n > 2 => baseResult.prediction.toInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not following - why prediction value becomes an index?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to return the LOCO score of the predicted class.
Let's say for a row the LOCOs are 0 -> LOCO_0, 1 -> LOCO_1, 2 -> LOCO_2
, and the model predicts the class 1
on this row. Then we want to return LOCO_1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add some docs to explain it, cause it looks weird to me.
…ogrifAI into mw/Aggregation
@Jauntbox please have a look |
core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala
Show resolved
Hide resolved
core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala
Outdated
Show resolved
Hide resolved
} | ||
// Update the aggregation map | ||
for {name <- rawName} { | ||
val (indices, array) = aggregationMap.getOrElse(name, (Array.empty[Int], Array.empty[Double])) | ||
aggregationMap.update(name, (indices :+ i, sumArrays(array, diffToExamine))) | ||
val key = name + "_" + history.parentFeatureStages.mkString(",") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the key value can be quite large is do concatenate all the parent stages. what's the rationale behind it? @mweilsalesforce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible to apply different transformations on a same feature.
After thought, maybe we should aggregate all the derived features even if 2 different transformations were applied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok fixed
positiveMaxHeap.dequeue() | ||
// Let's check the indicator value and descriptor value | ||
// If those values are empty, the field is likely to be a derived text feature (hashing tf output) | ||
if (textFeatureIndices.contains(oldInd) && history.indicatorValue.isEmpty && history.descriptorValue.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check here to make sure we don't accidentally pick up text features that were determined to be categorical by the smart text vectorizer? If so, we should probably make an easier way to tell what transformations were applied to the text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only smartVectorizer. Any feature with indicator/descriptor values and derived from a text transformation is likely to be easily interpreted once LOCO is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, please @Jauntbox approve and merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Related issues
LOCOs of derived text features of smartTextVectorizer are hard to interpret (no indicator value).
Describe the proposed solution
Aggregate all these LOCOs (except null indicator column) from the same text feature by the mean.
Further improvement PR : add other ways to aggregate (max, min, ...). This will become a param.