Aggregated LOCOs of SmartTextVectorizer outputs #308

michaelweilsalesforce · 2019-05-08T22:46:43Z

Related issues
LOCOs of the text features derived by SmartTextVectorizer are hard to interpret because the derived columns have no indicator value.

Describe the proposed solution
Aggregate the LOCOs of all columns derived from the same text feature (except the null indicator column) by taking their mean.

Follow-up PR: add other ways to aggregate (max, min, ...). The aggregation method will become a param.
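A minimal sketch of the proposed mean aggregation, for illustration only. `DerivedLoco` and `LocoMeanSketch` are hypothetical names that do not appear in the PR, and each LOCO is simplified to a single `Double`, whereas the real stage works on score arrays.

```scala
// Hypothetical, simplified stand-in for one derived column and its LOCO score.
// This is not TransmogrifAI's actual column metadata type.
final case class DerivedLoco(parentFeature: String, isNullIndicator: Boolean, loco: Double)

object LocoMeanSketch {
  /** Mean LOCO per parent text feature, skipping null indicator columns. */
  def aggregateByMean(columns: Seq[DerivedLoco]): Map[String, Double] =
    columns
      .filterNot(_.isNullIndicator)   // null indicator columns stay out of the mean
      .groupBy(_.parentFeature)       // one group per raw text feature
      .map { case (feature, cols) => feature -> cols.map(_.loco).sum / cols.size }
}
```

With two hashed columns of the same feature scoring 0.25 and 0.75, plus a null indicator column, the sketch reports a single value of 0.5 for that feature and ignores the null indicator.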

codecov · 2019-05-08T23:03:06Z

Codecov Report

Merging #308 into master will increase coverage by 0.01%.
The diff coverage is 98.07%.

@@            Coverage Diff             @@
##           master     #308      +/-   ##
==========================================
+ Coverage   86.51%   86.53%   +0.01%     
==========================================
  Files         329      329              
  Lines       10599    10617      +18     
  Branches      340      546     +206     
==========================================
+ Hits         9170     9187      +17     
- Misses       1429     1430       +1

Impacted Files	Coverage Δ
...e/op/stages/impl/insights/RecordInsightsLOCO.scala	`95.12% <98.07%> (-0.2%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 154e188...57bcf6b.

tovbinm · 2019-05-10T19:56:44Z

@mweilsalesforce is it ready for review?

michaelweilsalesforce · 2019-05-10T19:58:39Z

> @mweilsalesforce is it ready for review?
Yes, let's have a first round of review/feedback.

tovbinm · 2019-05-10T21:13:11Z

core/src/test/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCOTest.scala

      (avgRecordInsightRatio + featureImportanceRatio) < 0.8,
      "The ratio of feature strengths between important and other features should be similar to the ratio of " +
        "feature importances from Spark's RandomForest")
  }

+  it should "aggregate text derived features" in {


@michaelweilsalesforce @Jauntbox do these tests have to be that verbose and explicit? Such large tests are hard to read and comprehend.

Some of the verbosity comes from creating the dataset. I'm not sure we need to move that part to the beginning of the class.

When looking at a test, it has to be easy to understand which inputs, outcomes, and assertions are being verified; everything else has to live elsewhere.

michaelweilsalesforce · 2019-05-14T00:01:05Z

Oopsie. I forgot to add the same logic for SmartTextMapVectorizer. Working on it.

sanmitra · 2019-05-24T21:32:06Z

core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala

+  /**
+   * These are the names of the stages whose derived features we want to aggregate the LOCO results over
+   */
+  private val smartTextClassName = classOf[SmartTextVectorizer[_]].getSimpleName


Can there be a custom text vectorizer? If yes, shouldn't we handle that too?

Unfortunately, a custom text vectorizer does not necessarily return hashes as output. This aggregation is very limited and doesn't cover all the use cases yet.

ok, thanks.

tovbinm · 2019-05-24T22:46:10Z

core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala

    val indexToExamine = baseScore.length match {
      case 0 => throw new RuntimeException("model does not produce scores for insights")
      case 1 => 0
      case 2 => 1
-      case n if (n > 2) => baseResult.prediction.toInt
+      case n if n > 2 => baseResult.prediction.toInt


I am not following: why does the prediction value become an index?!

We want to return the LOCO score of the predicted class.

Let's say that for a row the LOCOs are 0 -> LOCO_0, 1 -> LOCO_1, 2 -> LOCO_2, and the model predicts class 1 on this row. Then we want to return LOCO_1.

Let's add some docs to explain it, because it looks weird to me.
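The index selection discussed above can be sketched as a standalone function. This is an illustration under assumptions, not the code from the PR: `IndexToExamineSketch` is a hypothetical name, the prediction is passed in as a plain `Double` rather than read from `baseResult`, and the comments on the one-score and two-score cases are a reading of the diff rather than something the thread states.

```scala
object IndexToExamineSketch {
  /**
   * Picks which entry of the score vector the LOCO difference is reported for.
   * Hypothetical standalone version of the match shown in the diff.
   */
  def indexToExamine(baseScore: Array[Double], prediction: Double): Int =
    baseScore.length match {
      case 0 => throw new RuntimeException("model does not produce scores for insights")
      case 1 => 0                 // a single score: there is only one entry to examine
      case 2 => 1                 // two scores: examine the second entry (assumed to be the positive class)
      case _ => prediction.toInt  // more than two scores: examine the entry of the predicted class
    }
}
```

For scores `Array(0.1, 0.7, 0.2)` and a prediction of `1.0`, the function returns index 1, which matches the LOCO_1 example in the reply above.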

tovbinm · 2019-05-30T18:00:57Z

@Jauntbox please have a look

tovbinm · 2019-06-03T17:38:37Z

core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala

        }
        // Update the aggregation map
        for {name <- rawName} {
-          val (indices, array) = aggregationMap.getOrElse(name, (Array.empty[Int], Array.empty[Double]))
-          aggregationMap.update(name, (indices :+ i, sumArrays(array, diffToExamine)))
+          val key = name + "_" + history.parentFeatureStages.mkString(",")


The key can get quite large if we concatenate all the parent stages. What's the rationale behind it? @mweilsalesforce

It is possible to apply different transformations to the same feature.
On second thought, maybe we should aggregate all the derived features even if two different transformations were applied.
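To make the trade-off in this thread concrete, here is a small sketch of the two key choices. It is an assumption-laden illustration: `AggregationKeySketch`, the tuple layout of `columns`, and the body of `sumArrays` are hypothetical (the diff names `sumArrays` but does not show its implementation), and all LOCO diff arrays are assumed to have the same length.

```scala
import scala.collection.mutable

object AggregationKeySketch {
  /** Element-wise sum; an empty left side acts as the zero element. */
  def sumArrays(left: Array[Double], right: Array[Double]): Array[Double] =
    if (left.isEmpty) right.clone() else left.zip(right).map { case (a, b) => a + b }

  /**
   * Accumulates (column indices, summed LOCO diffs) per key.
   * Each element of `columns` is (column index, raw feature name, parent stage names, LOCO diff).
   */
  def aggregate(
    columns: Seq[(Int, String, Seq[String], Array[Double])],
    keyWithStages: Boolean
  ): Map[String, (Array[Int], Array[Double])] = {
    val aggregationMap = mutable.Map.empty[String, (Array[Int], Array[Double])]
    for ((i, name, stages, diff) <- columns) {
      // With stages in the key, columns produced by different transformations stay separate
      val key = if (keyWithStages) name + "_" + stages.mkString(",") else name
      val (indices, sum) = aggregationMap.getOrElse(key, (Array.empty[Int], Array.empty[Double]))
      aggregationMap.update(key, (indices :+ i, sumArrays(sum, diff)))
    }
    aggregationMap.toMap
  }
}
```

With `keyWithStages = true`, two columns of the same raw feature that went through different stages land in two entries; with `keyWithStages = false` they collapse into one entry keyed by the raw feature name, which is the behavior suggested in the reply above.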

Jauntbox · 2019-06-03T18:54:01Z

core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala

-          positiveMaxHeap.dequeue()
+      // Let's check the indicator value and descriptor value
+      // If those values are empty, the field is likely to be a derived text feature (hashing tf output)
+      if (textFeatureIndices.contains(oldInd) && history.indicatorValue.isEmpty && history.descriptorValue.isEmpty) {


Is this check here to make sure we don't accidentally pick up text features that were determined to be categorical by the smart text vectorizer? If so, we should probably make an easier way to tell what transformations were applied to the text.

Not only SmartTextVectorizer. Any feature that is derived from a text transformation and has indicator or descriptor values is likely to be easy to interpret once LOCO is computed.
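The check in the diff above can be read as a small predicate. The sketch below is hypothetical: `ColumnHistorySketch` is a trimmed-down stand-in for the column history object, and the `Option[String]` field types and the `Set[Int]` type of `textFeatureIndices` are assumptions, since the thread does not show them.

```scala
// Hypothetical, trimmed-down stand-in for the per-column history used in the diff.
final case class ColumnHistorySketch(
  indicatorValue: Option[String],
  descriptorValue: Option[String])

object TextAggregationFilterSketch {
  /**
   * A column is aggregated only if it comes from a text feature and carries
   * neither an indicator value nor a descriptor value (likely a hashing output).
   */
  def shouldAggregate(
    index: Int,
    textFeatureIndices: Set[Int],
    history: ColumnHistorySketch
  ): Boolean =
    textFeatureIndices.contains(index) &&
      history.indicatorValue.isEmpty &&
      history.descriptorValue.isEmpty
}
```

Under this reading, a text column with an indicator or descriptor value keeps its own LOCO entry, while unlabeled columns of the same text feature are folded together.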

tovbinm

LGTM, please @Jauntbox approve and merge

sanmitra

LGTM.

Aggregated texts

ae0c05c

michaelweilsalesforce added the work in progress label May 8, 2019

michaelweilsalesforce requested review from leahmcguire and tovbinm as code owners May 8, 2019 22:46

Merge branch 'master' into mw/Aggregation

dae1c76

salesforce-cla bot added the cla:signed label May 8, 2019

gerashegalov reviewed May 9, 2019

michaelweilsalesforce and others added 2 commits May 10, 2019 10:13

Merge branch 'master' into mw/Aggregation

bfac6dd

sum array

f63be09

michaelweilsalesforce removed the work in progress label May 10, 2019

michaelweilsalesforce changed the title ~~Aggregated texts for LOCO~~ Aggregated LOCOs of SmartTextVectorizer outputs May 10, 2019

michaelweilsalesforce requested a review from sanmitra May 10, 2019 19:41

tovbinm reviewed May 10, 2019

michaelweilsalesforce and others added 3 commits May 13, 2019 11:21

Merge branch 'master' into mw/Aggregation

fb5c849

getSimpleName fix

4ba11d8

Merge branch 'mw/Aggregation' of https://github.com/salesforce/Transm…

f847850

…ogrifAI into mw/Aggregation

Added functionality for SmartTextMapVectorizer

1d1afc5

tovbinm requested a review from Jauntbox May 16, 2019 17:53

tovbinm and others added 4 commits May 16, 2019 10:53

Merge branch 'master' into mw/Aggregation

754891c

Adding comments

797222d

Merge branch 'master' into mw/Aggregation

79467ba

Merge branch 'master' into mw/Aggregation

250e1fd

sanmitra reviewed May 24, 2019

sanmitra reviewed May 24, 2019

tovbinm reviewed May 24, 2019

fixes

7e13053

tovbinm and others added 10 commits May 28, 2019 11:13

cleanup

33dfd4e

avoid .value on sparse vector

c37484a

use breeze directly

6763b64

fixin

9033712

ff

b427857

added MinMaxHeap class

44075b2

Merge branch 'master' into mw/Aggregation

92c160a

Merge branch 'master' into mw/Aggregation

5302039

Take top(2k) for PositiveAndNegative

46ea9e0

Merge branch 'mw/Aggregation' of https://github.com/salesforce/Transm…

c8c0cf9

…ogrifAI into mw/Aggregation

tovbinm approved these changes May 30, 2019

mweilsalesforce added 2 commits May 30, 2019 11:02

Fix Merge Conflict

21937db

Adding comment for Multiclass case

f964818

sanmitra reviewed May 30, 2019

sanmitra reviewed May 30, 2019

mweilsalesforce added 2 commits June 3, 2019 09:39

Aggregation based on type and not on stage anymore

ad1f32a

Fix Scalastyle

75fad54

tovbinm reviewed Jun 3, 2019

tovbinm and others added 4 commits June 3, 2019 10:45

use feature.typename

5df59b5

Stages won't appear in key names + fix bug of same key names in diffe…

2510762

…rent maps

Merge branch 'mw/Aggregation' of https://github.com/salesforce/Transm…

2e1628a

…ogrifAI into mw/Aggregation

Fix merge conflict

57bcf6b

Jauntbox reviewed Jun 3, 2019

tovbinm approved these changes Jun 3, 2019

sanmitra approved these changes Jun 3, 2019

tovbinm merged commit cc84919 into master Jun 3, 2019

tovbinm deleted the mw/Aggregation branch June 3, 2019 22:27

This was referenced Jul 10, 2019

0.6.0 Release #360

Closed

0.6.0 release #364

Merged
