-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FormAnalyzer] scoring password hints #720
Conversation
201c856
to
0640b6f
Compare
5efa4cd
to
71d51da
Compare
71d51da
to
a110052
Compare
a110052
to
7402c0d
Compare
83b7d1c
to
d494250
Compare
d494250
to
cb29187
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just playing with these changes in the unit tests — added a comment on the regex as I don’t think it’s doing what we want it to
src/Form/FormAnalyzer.js
Outdated
@@ -229,6 +229,15 @@ class FormAnalyzer { | |||
}); | |||
} | |||
|
|||
evaluatePasswordHints() { | |||
if (this.form.textContent) { | |||
const hasPasswordHints = safeRegexTest(this.matching.getDDGMatcherRegex('passwordHintsRegex'), this.form.textContent, 200); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 200 characters enough, given this.form.textContent
will also contain any whitespace?
Example: appleid_change-password.html
— contains a set of "Password Requirements"
The text content of this is 1816 characters, the majority of which is whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably normalize the text content. Lemme give that a shot, the 200 limit was strictly for sleeper.com, but we will get probably way more wins with normalized text and extending the limit a little bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With some string replacement in textContent
, and increased limit to 500 which seems reasonable, I was able to make it work. aa812e8
PS: this change improved couple of our test forms already 😃 Which is very cool!
441e383
to
9a396d9
Compare
9a396d9
to
5272584
Compare
@alistairjcbrown Some tests I did for the regexes. I captured the mylo_signup.html![]() appleid_change-password.html![]() sleeper.com![]() |
8305138
to
ed37372
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a couple of questions, and a thought on the regex based on the matcher screenshots you’d added (thanks!)
src/Form/FormAnalyzer.js
Outdated
@@ -230,6 +230,16 @@ class FormAnalyzer { | |||
}); | |||
} | |||
|
|||
evaluatePasswordHints() { | |||
const textContent = this.form.textContent?.replace(/\s+/g, ' '); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any risk here that this causes a performance issue when text content is very large?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially! Actually using this existing utility:
duckduckgo-autofill/src/Form/matching.js
Line 836 in 85b0b7b
const removeExcessWhitespace = (string = '') => { |
I've added a 1000 cutoff limit, had to modify the utility a bit for custom cutoff limit.
66bfaad#diff-29e05a441db72f7ca04df5e51bdccfcdebc988497045a25fd89fc587247c4ed2R836
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbajpeyi Do we have a sense if 1000 is a solid number to rely on? (too big, too small, just right?)
In #720 (comment) it looks like the Apple form has 1800+ characters (mostly whitespace), so 1000 might be too conservative given it won't be run on that form. However, this question comes out of not having a good sense of where that balance lies between performance and usefulness -- do we have data we can lean on? (e.g. we've a bunch of forms in this repo -- what character size are they generally? Or do we have a rule of thumb on the largest size of string we want to run a regex on based on how often this code executes?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removeExcessWhitespaces
call from the test run:
With 1000 limit, it is being called ~196/555 (35%) of the forms.
With 1800 limit, it is is being called ~233/555 (41%) of our forms in the suite, this is only just a bit more.
I a tiny bit difference in test suite run time (~10.5 -> 11.4s). I won't say it's that bad though.
The hints are not scored with 1000 limit for the apple form but that one is scored fairly strongly already (score: 9), and the other forms have still improved with 1000 (1 new and 2 existing forms). This for me seems like a good enough improvement, and I rather want to keep on the lower side. If we see more examples, we can increase the threshold to accommodate more. But I think 1000 is a good threshold for now, as it covers most forms that can be improved, without much overhead.
We generally cut texts to TEXT_LENGTH_CUTOFF
currently which is set to 100. But in this case, I think we want to go longer. I don't see a very specific reason for the 100 number, I think it was arbitrarily set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave it another thought, and some more test run and I noticed a tiny bit slower tests vs main
. I am sticking to 750
now - which still works for the 3 forms I mentioned, 642d018.
test-forms/index.json
Outdated
@@ -308,7 +308,7 @@ | |||
{ "html": "weblogin_utoronto_ca_login.html", "generated": true, "title": "weblogin | University of Toronto", "comment": "rank: 1280" }, | |||
{ "html": "customerportal_mastercard_com_login.html", "generated": true, "title": "Sign in - Mastercard", "comment": "rank: 1362" }, | |||
{ "html": "www_khanacademy_org_login.html", "generated": true, "title": "Khan Academy", "comment": "rank: 1379" }, | |||
{ "html": "accounts_hindustantimes_com_login.html", "generated": true, "title": "Livemint Subscription Plan and Pricing", "comment": "rank: 1467" }, | |||
{ "html": "accounts_hindustantimes_com_login.html", "generated": true, "title": "Livemint Subscription Plan and Pricing", "comment": "rank: 1467", "expectedFailures": ["emailAddress", "emailAddress"] }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Do we have context on why this regressed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I got confused a bit, because an extra emailAddress
failure was triggered, when I merged 9b719c5, although it didn't break any other PRs on rebase.
But I get it now, two parts to it:
- First one seems to have happened (already when this PR was created), because change where evaluateElAttributes is called
- Second one seems to be happening, because now we merged 9b719c5, so probably it caused some additional regression. Basically the contact form used to be
identities.emailAdress
but now is username:
data:image/s3,"s3://crabby-images/acb87/acb871d83df07469aad7629adcdf979881a39d87" alt="Screenshot 2025-01-10 at 14 56 47"
This attribute gets skipped from the score:
Let me see if we can compensate for these changes in evaluateElAttribute
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sadly it won't be super simple to fix, as the conservative regex doesn't have "contact" in it. I mean we can update that, but beats the purpose of keeping it conservative.
I think we can just move on here at the moment, because the other attributes here - inpLogin
, and the login URL make it even harder to score it correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First one happened (already when this PR was created), because change where
evaluateElAttributes
is called
Should we revert that change? Or does it add something more benfitial than the regression it causes?
Second one is happening, because now we merged
9b719c5
I'm not sure I'm following -- 9b719c5
is in main
and we don't have this regression there, right? How does that change impact this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we revert that change? Or does it add something more benfitial than the regression it causes?
Yes, I've reverted it now 729735b. I think the impact of it is not clear, and I can explore evaluating attributes of ALL inputs in a different PR at a different time.
I'm not sure I'm following -- 9b719c5 is in main and we don't have this regression there, right? How does that change impact this branch?
Yes, but that change affects this PR only, because of the fact that in this PR we started evaluating attributes for ALL inputs, not just the first one. Anyway as mentioned, I've reverted it now since the impact of evaluating attributes of all inputs is not clear. I think there was a form that it made a bit better, but still not worth it with the regression.
787fafd
to
3e9b49a
Compare
Good sync up call — we flagged up that there’s an issue with the test setup. We’re using Let’s see if we can update how our tests work to better align with a site in the wild — I’d love it if we could figure out a way to 1-1 align with what we’re seeing in production, but if not, a quick a dirty way may be to just remove all whitespace at the start of the line + new lines; it won’t be 100% but may be close enough. |
@alistairjcbrown I tried using the https://github.com/kangax/html-minifier but quickly gave up as it failed parsing some of the html, and didn't wanna dig in more. I've added a simpler approach that replaces newlines and spaces between tags. 3760990. I checked our forms that are effected, and the we do get fairly close to reality. Also, this way we are down to 200 and covering apple form too. Lemme know what you think. I think getting to a real world 1-1 is hard, but we're closer. |
src/Form/input-classifiers.test.js
Outdated
*/ | ||
const cleanFormHTML = (html) => { | ||
// Collapse both whitespace and newlines between tags, preserving content within tags | ||
return html.replace(/>\s*[\r\n]+\s*</g, '><'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d be worried about removing all whitespace between tags which might miss some new lines but also collapse words together we don’t expect.
Using the Apple password work as an example — document.querySelector('form.modal-form').textContent
https://account.apple.com/account/manage
(copy is slightly different from our stored form)
PasswordLast updated 12 November 2020Change your password current passwordStrength:0%Password Requirements:at least 8 charactersat least 1 numberat least 1 uppercase letterat least 1 lowercase letterAt least one special characterAvoid using a password that you use with other websites or which might be easy for someone else to guess. new password confirm new passwordSign out of Apple devices and websites associated with your Apple Account. Learn more
Using the above
PasswordLast updated 11 March 2023Change your passwordcurrent
passwordStrength:0%.Password Requirements:at least 8 charactersat least 1 numberat least 1 uppercase letterat least 1 lowercase letterAvoid using a password that you use with other websites or which might be easy for someone else to guess.new
passwordconfirm
new passwordSign out of Apple devices and websites associated with your Apple ID. Learn more
From a quick stab at this, I think we can get pretty good results — remove the leading whitespace, remove new lines between elements, and then replace all remaining new lines with a space. It’s not 100%, any other enhancements we could add?
PasswordLast updated 11 March 2023Change your passwordcurrent passwordStrength:0%.Password Requirements:at least 8 charactersat least 1 numberat least 1 uppercase letterat least 1 lowercase letterAvoid using a password that you use with other websites or which might be easy for someone else to guess.new passwordconfirm new passwordSign out of Apple devices and websites associated with your Apple ID. Learn more
return html.replace(/>\s*[\r\n]+\s*</g, '><'); | |
return html. | |
.replace(/^\s*/mg, "") // Remove leading whitespace on each line | |
.replace(/>[\r\n]+</mg, "><") // Remove new lines which exist between elements | |
.replace(/[\r\n]+/mg, " "); // Replace any existing new lines with a space |
Can we try to see if using Otherwise you can copy here the whitespace removal method we have in |
Neither Also discussed with @alistairjcbrown that removeExcessWhitespace is rather meant to be a general utility for all texts (based on business logic, rather than test logic), so having something more specific for HTML is what we need here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Task/Issue URL: https://app.asana.com/0/1209268905749226/1209268905749226 Autofill Release: https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/16.2.0 ## Description Updates Autofill to version [16.2.0](https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/16.2.0). ### Autofill 16.2.0 release notes ## What's Changed * [Matching] Update regex to match for 'confirm password' by @dbajpeyi in duckduckgo/duckduckgo-autofill#719 * Update password-related json files (2025-01-06) by @daxmobile in duckduckgo/duckduckgo-autofill#730 * [FormAnalyzer] Use conservative regex for attributes by @dbajpeyi in duckduckgo/duckduckgo-autofill#735 * [FormAnalyzer] Don't flip scores on buttons with specific text by @dbajpeyi in duckduckgo/duckduckgo-autofill#725 * [FormAnalyzer] add external link check for submit buttons by @dbajpeyi in duckduckgo/duckduckgo-autofill#736 * [FormAnalyzer] scoring password hints by @dbajpeyi in duckduckgo/duckduckgo-autofill#720 * [Matching] only allow matching on password types by @dbajpeyi in duckduckgo/duckduckgo-autofill#740 * [Autofill utils] Form wrapper failsafe documentation by @dbajpeyi in duckduckgo/duckduckgo-autofill#741 * [Form] phone and credit card for recategorization by @dbajpeyi in duckduckgo/duckduckgo-autofill#733 **Full Changelog**: duckduckgo/duckduckgo-autofill@16.1.0...16.2.0 ## Steps to test This release has been tested during autofill development. For smoke test steps see [this task](https://app.asana.com/0/1198964220583541/1200583647142330/f). Co-authored-by: dbajpeyi <3018923+dbajpeyi@users.noreply.github.com> Co-authored-by: Graeme Arthur <2030310+graeme@users.noreply.github.com>
Task/Issue URL: https://app.asana.com/0/1209268903907280/1209268903907280 Autofill Release: https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/16.2.0 ## Description Updates Autofill to version [16.2.0](https://github.com/duckduckgo/duckduckgo-autofill/releases/tag/16.2.0). ### Autofill 16.2.0 release notes ## What's Changed * [Matching] Update regex to match for 'confirm password' by @dbajpeyi in duckduckgo/duckduckgo-autofill#719 * Update password-related json files (2025-01-06) by @daxmobile in duckduckgo/duckduckgo-autofill#730 * [FormAnalyzer] Use conservative regex for attributes by @dbajpeyi in duckduckgo/duckduckgo-autofill#735 * [FormAnalyzer] Don't flip scores on buttons with specific text by @dbajpeyi in duckduckgo/duckduckgo-autofill#725 * [FormAnalyzer] add external link check for submit buttons by @dbajpeyi in duckduckgo/duckduckgo-autofill#736 * [FormAnalyzer] scoring password hints by @dbajpeyi in duckduckgo/duckduckgo-autofill#720 * [Matching] only allow matching on password types by @dbajpeyi in duckduckgo/duckduckgo-autofill#740 * [Autofill utils] Form wrapper failsafe documentation by @dbajpeyi in duckduckgo/duckduckgo-autofill#741 * [Form] phone and credit card for recategorization by @dbajpeyi in duckduckgo/duckduckgo-autofill#733 **Full Changelog**: duckduckgo/duckduckgo-autofill@16.1.0...16.2.0 ## Steps to test This release has been tested during autofill development. For smoke test steps see [this task](https://app.asana.com/0/1198964220583541/1200583647142330/f). Co-authored-by: dbajpeyi <3018923+dbajpeyi@users.noreply.github.com>
Reviewer: @alistairjcbrown
Asana: https://app.asana.com/0/1203822806345703/1208965540697669/f
Description
evaluateForm
instead of in the constructor. This way all inputs get evaluated for.Steps to test
https://sleeper.com/create?type=league is the site that was originally broken.
data:image/s3,"s3://crabby-images/e4956/e49569f650ededa450d33ae38f7c78352a4ca2e2" alt="image"
Score before:
Score after:
data:image/s3,"s3://crabby-images/a9ca0/a9ca00cc45ad27904edf7e10cfdb148a94803508" alt="Screenshot 2024-12-20 at 16 17 27"