Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregated LOCOs of SmartTextVectorizer outputs #308

Merged
merged 37 commits into from
Jun 3, 2019
Merged

Conversation

michaelweilsalesforce
Copy link
Contributor

@michaelweilsalesforce michaelweilsalesforce commented May 8, 2019

Related issues
LOCOs of derived text features of smartTextVectorizer are hard to interpret (no indicator value).

Describe the proposed solution
Aggregate all these LOCOs (except null indicator column) from the same text feature by the mean.

Further improvement PR : add other ways to aggregate (max, min, ...). This will become a param.

@codecov
Copy link

codecov bot commented May 8, 2019

Codecov Report

Merging #308 into master will increase coverage by 0.01%.
The diff coverage is 98.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #308      +/-   ##
==========================================
+ Coverage   86.51%   86.53%   +0.01%     
==========================================
  Files         329      329              
  Lines       10599    10617      +18     
  Branches      340      546     +206     
==========================================
+ Hits         9170     9187      +17     
- Misses       1429     1430       +1
Impacted Files Coverage Δ
...e/op/stages/impl/insights/RecordInsightsLOCO.scala 95.12% <98.07%> (-0.2%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 154e188...57bcf6b. Read the comment docs.

@michaelweilsalesforce michaelweilsalesforce changed the title Aggregated texts for LOCO Aggregated LOCOs of SmartTextVectorizer outputs May 10, 2019
@tovbinm
Copy link
Collaborator

tovbinm commented May 10, 2019

@mweilsalesforce is it ready for review?

@michaelweilsalesforce
Copy link
Contributor Author

@mweilsalesforce is it ready for review?
Yes, let's have a first round of review/feedback.

(avgRecordInsightRatio + featureImportanceRatio) < 0.8,
"The ratio of feature strengths between important and other features should be similar to the ratio of " +
"feature importances from Spark's RandomForest")
}

it should "aggregate text derived features" in {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelweilsalesforce @Jauntbox do these test have to be that verbose & explicit? such large tests seem barely readable to comprehend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the verbose come from the creation of the dataset. Not sure we need to move this part at the beginning of the class.

Copy link
Collaborator

@tovbinm tovbinm May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When looking on a test - it has to be easy to understand what are the inputs, outcomes and assertions that are verified in the test; everything else have to elsewhere.

@michaelweilsalesforce
Copy link
Contributor Author

michaelweilsalesforce commented May 14, 2019

Oopsie. I forgot to add the same logic for SmartTextMapVectorizer. Working on it.

@tovbinm tovbinm requested a review from Jauntbox May 16, 2019 17:53
/**
* These are the name of the stages we want to perform an aggregation of the LOCO results over ferived features
*/
private val smartTextClassName = classOf[SmartTextVectorizer[_]].getSimpleName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be a custom text vectorizer, if yes shouldn't we handle that too ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately custom Text Vectorizer cannot necessary return hashes as output. This aggregation is super limited and doesn't cover all the use cases yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks.

val indexToExamine = baseScore.length match {
case 0 => throw new RuntimeException("model does not produce scores for insights")
case 1 => 0
case 2 => 1
case n if (n > 2) => baseResult.prediction.toInt
case n if n > 2 => baseResult.prediction.toInt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not following - why prediction value becomes an index?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to return the LOCO score of the predicted class.

Let's say for a row the LOCOs are 0 -> LOCO_0, 1 -> LOCO_1, 2 -> LOCO_2, and the model predicts the class 1 on this row. Then we want to return LOCO_1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add some docs to explain it, cause it looks weird to me.

@tovbinm
Copy link
Collaborator

tovbinm commented May 30, 2019

@Jauntbox please have a look

}
// Update the aggregation map
for {name <- rawName} {
val (indices, array) = aggregationMap.getOrElse(name, (Array.empty[Int], Array.empty[Double]))
aggregationMap.update(name, (indices :+ i, sumArrays(array, diffToExamine)))
val key = name + "_" + history.parentFeatureStages.mkString(",")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the key value can be quite large is do concatenate all the parent stages. what's the rationale behind it? @mweilsalesforce

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to apply different transformations on a same feature.
After thought, maybe we should aggregate all the derived features even if 2 different transformations were applied

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok fixed

positiveMaxHeap.dequeue()
// Let's check the indicator value and descriptor value
// If those values are empty, the field is likely to be a derived text feature (hashing tf output)
if (textFeatureIndices.contains(oldInd) && history.indicatorValue.isEmpty && history.descriptorValue.isEmpty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check here to make sure we don't accidentally pick up text features that were determined to be categorical by the smart text vectorizer? If so, we should probably make an easier way to tell what transformations were applied to the text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only smartVectorizer. Any feature with indicator/descriptor values and derived from a text transformation is likely to be easily interpreted once LOCO is done.

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please @Jauntbox approve and merge

Copy link
Contributor

@sanmitra sanmitra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@tovbinm tovbinm merged commit cc84919 into master Jun 3, 2019
@tovbinm tovbinm deleted the mw/Aggregation branch June 3, 2019 22:27
This was referenced Jul 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants