Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FormAnalyzer] scoring password hints #720

Merged
merged 21 commits into from
Jan 22, 2025
Merged

Conversation

dbajpeyi
Copy link
Collaborator

@dbajpeyi dbajpeyi commented Dec 16, 2024

Reviewer: @alistairjcbrown
Asana: https://app.asana.com/0/1203822806345703/1208965540697669/f

Description

  1. Use password hints to score form, when signals are weak.
  2. Moves input attribute evaluation in evaluateForm instead of in the constructor. This way all inputs get evaluated for.
  3. Hindustan times login page gets regressed., but it's a net win.

Steps to test

https://sleeper.com/create?type=league is the site that was originally broken.
Score before:
image

Score after:
Screenshot 2024-12-20 at 16 17 27

@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from 201c856 to 0640b6f Compare December 19, 2024 14:00
@dbajpeyi dbajpeyi changed the title wip: scoring password hints towards signup [FormAnalyzing] scoring password hints and sibling headers Dec 19, 2024
@dbajpeyi dbajpeyi changed the title [FormAnalyzing] scoring password hints and sibling headers [FormAnalyzer] scoring password hints and sibling headers Dec 19, 2024
@dbajpeyi dbajpeyi marked this pull request as ready for review December 20, 2024 15:05
@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from 5efa4cd to 71d51da Compare December 20, 2024 15:06
@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from 71d51da to a110052 Compare December 20, 2024 15:07
@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from a110052 to 7402c0d Compare December 20, 2024 15:11
@dbajpeyi dbajpeyi changed the title [FormAnalyzer] scoring password hints and sibling headers [FormAnalyzer] scoring password hints Dec 20, 2024
@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from 83b7d1c to d494250 Compare December 20, 2024 15:53
@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from d494250 to cb29187 Compare December 20, 2024 15:55
Copy link
Member

@alistairjcbrown alistairjcbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just playing with these changes in the unit tests — added a comment on the regex as I don’t think it’s doing what we want it to

@@ -229,6 +229,15 @@ class FormAnalyzer {
});
}

evaluatePasswordHints() {
if (this.form.textContent) {
const hasPasswordHints = safeRegexTest(this.matching.getDDGMatcherRegex('passwordHintsRegex'), this.form.textContent, 200);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 200 characters enough, given this.form.textContent will also contain any whitespace?

Example: appleid_change-password.html — contains a set of "Password Requirements"
The text content of this is 1816 characters, the majority of which is whitespace

Copy link
Collaborator Author

@dbajpeyi dbajpeyi Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably normalize the text content. Lemme give that a shot, the 200 limit was strictly for sleeper.com, but we will get probably way more wins with normalized text and extending the limit a little bit.

Copy link
Collaborator Author

@dbajpeyi dbajpeyi Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some string replacement in textContent, and increased limit to 500 which seems reasonable, I was able to make it work. aa812e8

PS: this change improved couple of our test forms already 😃 Which is very cool!

@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from 441e383 to 9a396d9 Compare January 8, 2025 18:43
@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from 9a396d9 to 5272584 Compare January 9, 2025 12:30
@dbajpeyi
Copy link
Collaborator Author

dbajpeyi commented Jan 9, 2025

@alistairjcbrown Some tests I did for the regexes. I captured the textContent from the code, and tested it against the regex on https://regex101.com:

mylo_signup.html

Screenshot 2025-01-09 at 13 33 12

appleid_change-password.html

Screenshot 2025-01-09 at 13 34 22

sleeper.com

Screenshot 2025-01-09 at 13 35 03

@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch 2 times, most recently from 8305138 to ed37372 Compare January 10, 2025 11:14
Copy link
Member

@alistairjcbrown alistairjcbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple of questions, and a thought on the regex based on the matcher screenshots you’d added (thanks!)

@@ -230,6 +230,16 @@ class FormAnalyzer {
});
}

evaluatePasswordHints() {
const textContent = this.form.textContent?.replace(/\s+/g, ' ');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any risk here that this causes a performance issue when text content is very large?

Copy link
Collaborator Author

@dbajpeyi dbajpeyi Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially! Actually using this existing utility:

const removeExcessWhitespace = (string = '') => {

I've added a 1000 cutoff limit, had to modify the utility a bit for custom cutoff limit.

66bfaad#diff-29e05a441db72f7ca04df5e51bdccfcdebc988497045a25fd89fc587247c4ed2R836

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbajpeyi Do we have a sense if 1000 is a solid number to rely on? (too big, too small, just right?)

In #720 (comment) it looks like the Apple form has 1800+ characters (mostly whitespace), so 1000 might be too conservative given it won't be run on that form. However, this question comes out of not having a good sense of where that balance lies between performance and usefulness -- do we have data we can lean on? (e.g. we've a bunch of forms in this repo -- what character size are they generally? Or do we have a rule of thumb on the largest size of string we want to run a regex on based on how often this code executes?)

Copy link
Collaborator Author

@dbajpeyi dbajpeyi Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removeExcessWhitespaces call from the test run:

With 1000 limit, it is being called ~196/555 (35%) of the forms.
With 1800 limit, it is is being called ~233/555 (41%) of our forms in the suite, this is only just a bit more.

I a tiny bit difference in test suite run time (~10.5 -> 11.4s). I won't say it's that bad though.

The hints are not scored with 1000 limit for the apple form but that one is scored fairly strongly already (score: 9), and the other forms have still improved with 1000 (1 new and 2 existing forms). This for me seems like a good enough improvement, and I rather want to keep on the lower side. If we see more examples, we can increase the threshold to accommodate more. But I think 1000 is a good threshold for now, as it covers most forms that can be improved, without much overhead.

We generally cut texts to TEXT_LENGTH_CUTOFF currently which is set to 100. But in this case, I think we want to go longer. I don't see a very specific reason for the 100 number, I think it was arbitrarily set.

Copy link
Collaborator Author

@dbajpeyi dbajpeyi Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave it another thought, and some more test run and I noticed a tiny bit slower tests vs main. I am sticking to 750 now - which still works for the 3 forms I mentioned, 642d018.

@@ -308,7 +308,7 @@
{ "html": "weblogin_utoronto_ca_login.html", "generated": true, "title": "weblogin | University of Toronto", "comment": "rank: 1280" },
{ "html": "customerportal_mastercard_com_login.html", "generated": true, "title": "Sign in - Mastercard", "comment": "rank: 1362" },
{ "html": "www_khanacademy_org_login.html", "generated": true, "title": "Khan Academy", "comment": "rank: 1379" },
{ "html": "accounts_hindustantimes_com_login.html", "generated": true, "title": "Livemint Subscription Plan and Pricing", "comment": "rank: 1467" },
{ "html": "accounts_hindustantimes_com_login.html", "generated": true, "title": "Livemint Subscription Plan and Pricing", "comment": "rank: 1467", "expectedFailures": ["emailAddress", "emailAddress"] },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Do we have context on why this regressed?

Copy link
Collaborator Author

@dbajpeyi dbajpeyi Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I got confused a bit, because an extra emailAddress failure was triggered, when I merged 9b719c5, although it didn't break any other PRs on rebase.

But I get it now, two parts to it:

  1. First one seems to have happened (already when this PR was created), because change where evaluateElAttributes is called
  2. Second one seems to be happening, because now we merged 9b719c5, so probably it caused some additional regression. Basically the contact form used to be identities.emailAdress but now is username:
Screenshot 2025-01-10 at 14 56 47

This attribute gets skipped from the score:
Screenshot 2025-01-10 at 14 59 34

Let me see if we can compensate for these changes in evaluateElAttribute.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sadly it won't be super simple to fix, as the conservative regex doesn't have "contact" in it. I mean we can update that, but beats the purpose of keeping it conservative.

I think we can just move on here at the moment, because the other attributes here - inpLogin, and the login URL make it even harder to score it correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First one happened (already when this PR was created), because change where evaluateElAttributes is called

Should we revert that change? Or does it add something more benfitial than the regression it causes?

Second one is happening, because now we merged 9b719c5

I'm not sure I'm following -- 9b719c5 is in main and we don't have this regression there, right? How does that change impact this branch?

Copy link
Collaborator Author

@dbajpeyi dbajpeyi Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we revert that change? Or does it add something more benfitial than the regression it causes?

Yes, I've reverted it now 729735b. I think the impact of it is not clear, and I can explore evaluating attributes of ALL inputs in a different PR at a different time.

I'm not sure I'm following -- 9b719c5 is in main and we don't have this regression there, right? How does that change impact this branch?

Yes, but that change affects this PR only, because of the fact that in this PR we started evaluating attributes for ALL inputs, not just the first one. Anyway as mentioned, I've reverted it now since the impact of evaluating attributes of all inputs is not clear. I think there was a form that it made a bit better, but still not worth it with the regression.

@dbajpeyi dbajpeyi force-pushed the dbajpeyi/feature/password-hints branch from 787fafd to 3e9b49a Compare January 10, 2025 13:07
@alistairjcbrown
Copy link
Member

Good sync up call — we flagged up that there’s an issue with the test setup. We’re using innerHTML to inject the form into the page, but inner HTML retains the whitespace when textContent is called (whereas in the wild, the whitespace is removed). What this means is that the character limit we’re setting will filter most of the forms out in test, but not in production 😱

Let’s see if we can update how our tests work to better align with a site in the wild — I’d love it if we could figure out a way to 1-1 align with what we’re seeing in production, but if not, a quick a dirty way may be to just remove all whitespace at the start of the line + new lines; it won’t be 100% but may be close enough.

@dbajpeyi
Copy link
Collaborator Author

dbajpeyi commented Jan 17, 2025

@alistairjcbrown I tried using the https://github.com/kangax/html-minifier but quickly gave up as it failed parsing some of the html, and didn't wanna dig in more.

I've added a simpler approach that replaces newlines and spaces between tags. 3760990. I checked our forms that are effected, and the we do get fairly close to reality.

Also, this way we are down to 200 and covering apple form too. Lemme know what you think. I think getting to a real world 1-1 is hard, but we're closer.

*/
const cleanFormHTML = (html) => {
// Collapse both whitespace and newlines between tags, preserving content within tags
return html.replace(/>\s*[\r\n]+\s*</g, '><');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d be worried about removing all whitespace between tags which might miss some new lines but also collapse words together we don’t expect.

Using the Apple password work as an example — document.querySelector('form.modal-form').textContent

https://account.apple.com/account/manage (copy is slightly different from our stored form)

PasswordLast updated 12 November 2020Change your password current passwordStrength:0%Password Requirements:at least 8 charactersat least 1 numberat least 1 uppercase letterat least 1 lowercase letterAt least one special characterAvoid using a password that you use with other websites or which might be easy for someone else to guess. new password confirm new passwordSign out of Apple devices and websites associated with your Apple Account. Learn more

Using the above

PasswordLast updated 11 March 2023Change your passwordcurrent
                    passwordStrength:0%.Password Requirements:at least 8 charactersat least 1 numberat least 1 uppercase letterat least 1 lowercase letterAvoid using a password that you use with other websites or which might be easy for someone else to guess.new
                            passwordconfirm
                    new passwordSign out of Apple devices and websites associated with your Apple ID. Learn more

From a quick stab at this, I think we can get pretty good results — remove the leading whitespace, remove new lines between elements, and then replace all remaining new lines with a space. It’s not 100%, any other enhancements we could add?

PasswordLast updated 11 March 2023Change your passwordcurrent passwordStrength:0%.Password Requirements:at least 8 charactersat least 1 numberat least 1 uppercase letterat least 1 lowercase letterAvoid using a password that you use with other websites or which might be easy for someone else to guess.new passwordconfirm new passwordSign out of Apple devices and websites associated with your Apple ID. Learn more
Suggested change
return html.replace(/>\s*[\r\n]+\s*</g, '><');
return html.
.replace(/^\s*/mg, "") // Remove leading whitespace on each line
.replace(/>[\r\n]+</mg, "><") // Remove new lines which exist between elements
.replace(/[\r\n]+/mg, " "); // Replace any existing new lines with a space

@GioSensation
Copy link
Member

Can we try to see if using XMLSerializer solves the whitespace problem? Or maybe DOMPaser. Haven't tried them myself, but they might do the job for us.

Otherwise you can copy here the whitespace removal method we have in removeExcessWhitespace with the difference that you don't need the safeguard for size because this only runs in the test suite. I'd prefer if we could use some official API like the ones mentioned above, though, to avoid changing the contents of the forms too much.

@dbajpeyi
Copy link
Collaborator Author

dbajpeyi commented Jan 22, 2025

Neither XMLSerializer's (.serializeToString) or JSDom/DOMParser (.serialize) seem do what we want want after a quick dab on it. I am moving with @alistairjcbrown's suggestion (included in d73a9c8), I think that gets us closest to what we want.

Also discussed with @alistairjcbrown that removeExcessWhitespace is rather meant to be a general utility for all texts (based on business logic, rather than test logic), so having something more specific for HTML is what we need here.

Copy link
Member

@alistairjcbrown alistairjcbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :shipit:

@dbajpeyi dbajpeyi merged commit 0587536 into main Jan 22, 2025
1 check passed
@dbajpeyi dbajpeyi deleted the dbajpeyi/feature/password-hints branch January 22, 2025 16:45
graeme added a commit to duckduckgo/BrowserServicesKit that referenced this pull request Jan 31, 2025
Task/Issue URL:
https://app.asana.com/0/1209268905749226/1209268905749226
Autofill Release:
https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/16.2.0


## Description
Updates Autofill to version
[16.2.0](https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/16.2.0).

### Autofill 16.2.0 release notes
## What's Changed
* [Matching] Update regex to match for 'confirm password' by @dbajpeyi
in duckduckgo/duckduckgo-autofill#719
* Update password-related json files (2025-01-06) by @daxmobile in
duckduckgo/duckduckgo-autofill#730
* [FormAnalyzer] Use conservative regex for attributes by @dbajpeyi in
duckduckgo/duckduckgo-autofill#735
* [FormAnalyzer] Don't flip scores on buttons with specific text by
@dbajpeyi in duckduckgo/duckduckgo-autofill#725
* [FormAnalyzer] add external link check for submit buttons by @dbajpeyi
in duckduckgo/duckduckgo-autofill#736
* [FormAnalyzer] scoring password hints by @dbajpeyi in
duckduckgo/duckduckgo-autofill#720
* [Matching] only allow matching on password types by @dbajpeyi in
duckduckgo/duckduckgo-autofill#740
* [Autofill utils] Form wrapper failsafe documentation by @dbajpeyi in
duckduckgo/duckduckgo-autofill#741
* [Form] phone and credit card for recategorization by @dbajpeyi in
duckduckgo/duckduckgo-autofill#733


**Full Changelog**:
duckduckgo/duckduckgo-autofill@16.1.0...16.2.0

## Steps to test
This release has been tested during autofill development. For smoke test
steps see [this
task](https://app.asana.com/0/1198964220583541/1200583647142330/f).

Co-authored-by: dbajpeyi <3018923+dbajpeyi@users.noreply.github.com>
Co-authored-by: Graeme Arthur <2030310+graeme@users.noreply.github.com>
CDRussell pushed a commit to duckduckgo/Android that referenced this pull request Feb 3, 2025
Task/Issue URL:
https://app.asana.com/0/1209268903907280/1209268903907280
Autofill Release:
https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/16.2.0


## Description
Updates Autofill to version
[16.2.0](https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/16.2.0).

### Autofill 16.2.0 release notes
## What's Changed
* [Matching] Update regex to match for 'confirm password' by @dbajpeyi
in duckduckgo/duckduckgo-autofill#719
* Update password-related json files (2025-01-06) by @daxmobile in
duckduckgo/duckduckgo-autofill#730
* [FormAnalyzer] Use conservative regex for attributes by @dbajpeyi in
duckduckgo/duckduckgo-autofill#735
* [FormAnalyzer] Don't flip scores on buttons with specific text by
@dbajpeyi in duckduckgo/duckduckgo-autofill#725
* [FormAnalyzer] add external link check for submit buttons by @dbajpeyi
in duckduckgo/duckduckgo-autofill#736
* [FormAnalyzer] scoring password hints by @dbajpeyi in
duckduckgo/duckduckgo-autofill#720
* [Matching] only allow matching on password types by @dbajpeyi in
duckduckgo/duckduckgo-autofill#740
* [Autofill utils] Form wrapper failsafe documentation by @dbajpeyi in
duckduckgo/duckduckgo-autofill#741
* [Form] phone and credit card for recategorization by @dbajpeyi in
duckduckgo/duckduckgo-autofill#733


**Full Changelog**:
duckduckgo/duckduckgo-autofill@16.1.0...16.2.0

## Steps to test
This release has been tested during autofill development. For smoke test
steps see [this
task](https://app.asana.com/0/1198964220583541/1200583647142330/f).

Co-authored-by: dbajpeyi <3018923+dbajpeyi@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants