
Re-evaluate CleanLinks default whitelisted domains and patterns #20

Closed
jawz101 opened this issue Jun 6, 2018 · 13 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

jawz101 commented Jun 6, 2018

Skip Links Matching with
\/ServiceLogin|imgres\?|searchbyimage\?|watch%3Fv|auth\?client_id|signup|bing\.com\/widget|oauth|openid\.ns|\.mcstatic\.com|sVidLoc|[Ll]ogout|submit\?url=|magnet:|google\.com\/recaptcha\/


Remove from Links
(?:ref|aff)\\w*|utm_\\w+|(?:merchant|programme|media)ID

I can't even get this regex to work.
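For reference, here is one way to sanity-check that pattern against query-parameter names (a standalone sketch, not CleanLinks' actual code; `strip_tracking` and the URLs are illustrative). Note the doubled backslashes in the stored pattern: `\\w` matches a literal backslash followed by `w`, which may be exactly why it never fires.

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Single-backslash form of the pattern above; the doubled backslashes
# (\\w) would match a literal backslash, likely why it never fires.
PARAM_RE = re.compile(r'^(?:(?:ref|aff)\w*|utm_\w+|(?:merchant|programme|media)ID)$')

def strip_tracking(url):
    """Drop any query parameter whose name matches PARAM_RE."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not PARAM_RE.match(k)]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_tracking('https://example.com/item?utm_source=feed&affid=7&id=42'))
# → https://example.com/item?id=42
```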

I also like ClearURLs' implementation of a mechanism to update the ruleset from GitHub. And reviewing their rules, we're clearly missing a ton.


Skip Domains
accounts.google.com,docs.google.com,translate.google.com,login.live.com,plus.google.com,twitter.com,static.ak.facebook.com,www.linkedin.com,www.virustotal.com,account.live.com,admin.brightcove.com,www.mywot.com,webcache.googleusercontent.com,web.archive.org,accounts.youtube.com,signin.ebay.com

Looking at Diego's commits, these rules are all at least 4 years old. Sites change. I'd rather start with a clean slate and see what is applicable today.

Cimbali (Owner) commented Jun 6, 2018

Most login domains seem to still work the same (at least Google and eBay, maybe more); the others we should indeed re-evaluate. Let's collect in this thread all domains/regexes that are still valid or no longer valid.

Rulesets sound good in theory, but I'm not sure I can commit to updating the list regularly. Plus you still have to allow all users to whitelist to their own taste. So let's just adjust the defaults for now; I'll think about it.

About removing tracking elements from links: this is not really the core functionality of CleanLinks, which rather removes redirections. This does cause some problems, e.g. if we whitelist a website we also stop cleaning the tracking elements in its links, even though we might just want to see a redirect page without being tracked. Anyhow, if this does not work, please open another issue about it.

@Cimbali Cimbali changed the title readdress the Link Cleaning defaults Re-evaluate CleanLinks default whitelisted domains and patterns Jun 6, 2018
@Cimbali Cimbali added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jun 6, 2018
alekksander commented Jun 8, 2018

But in practice, isn't the tracking id usually removed when the link is prevented from redirecting?
I have to admit I might not fully understand what ClearURLs does, but if I do, it rather complements CleanLinks (as well as other privacy-concerned add-ons such as uBlock, uMatrix, etc.) rather than duplicating its functionality. Is that so?

Cimbali (Owner) commented Jun 8, 2018

We remove tracking ids from every link we examine, whether we remove a redirection or not. However, when a link is whitelisted, by pattern or domain, we leave it as is, including tracking ids.

ClearURLs seems to do a more in-depth job of removing tracking ids than we do, as they have a bunch of (domain-specific) rules while we just have one regex matching parameters. So the add-ons are in essence complementary even if there is some slight overlap: they block some domains like uMatrix does, we strip some ids like they do, etc.

jawz101 (Author) commented Jun 8, 2018

That's what gets me. I wish I could rely on a single extension to handle URL stuff. I didn't even mention the slew of other complementary or duplicative extensions all trying to do similar things. It's to the point where I really don't understand the differences among them.

jawz101 (Author) commented Jun 8, 2018

HTTPS Everywhere
Google Search Link Fix
Skip Redirect
Google Redirects Fixer & Tracking Remover
Redirector
Link Cleaner
Don't Track Me Google
Neat URL
Pure URL
Tracking Token Stripper
Remove Redirect
Link Redirect Trace
Open Link Directly (No Redirect)
ClearURLs
Untrack Me
Consistent HTTPS
Universal Bypass
Canonical
Referer Control
Referer Modifier

...and a bunch of extensions to convert text links to clickable links. I'm tired of listing them.

and then a bunch of about:config preferences:
browser.urlbar.decodeURLsOnCopy
network.IDN_show_punycode
network.http.referer.XOriginPolicy
network.http.referer.XOriginTrimmingPolicy
network.http.referer.defaultPolicy
network.http.referer.defaultPolicy.pbmode
network.http.referer.hideOnionSource
network.http.referer.spoofSource
network.http.referer.trimmingPolicy
network.http.sendRefererHeader
security.fileuri.strict_origin_policy

I just wish I could use the smallest number of extensions that still covers as much as makes sense.

jawz101 (Author) commented Jun 8, 2018

I'm not putting this on you, @Cimbali . It just seems like it's confusing that so many people have tried to achieve something with handling links and URLs. It would be nice if so many ideas were put into one solution.

For example, I can't tell whether controlling the referrer amounts to the same thing as stripping utm parameters off of a URL. I just don't understand the differences among all of these things.

It's like, if there were one extension that acted as the engine to do a bunch of these things, the community could then figure out how to build the rule set. I guess I hoped to have this talk with @diegocr one day.

Cimbali (Owner) commented Jun 8, 2018

Tbh I just resurrected an add-on I used to use to remove redirects, handling the rest with HTTPS Everywhere & uMatrix.

I mean, you do have a point, but there are 2 ways around it:

  • either you get a huge extension that does everything, but that's hard to maintain,
  • or every extension does one thing and does it well, in which case they're complementary.

I think there aren't a lot of extensions that remove redirects as we do, so we seem to be gearing more towards the second option here, even though you're right that it's not a great way to build a strong community that can contribute rules etc.

Cimbali (Owner) commented Jun 9, 2018

Here's a quick overview of what the add-ons you cite do:

| Add-on | Removes tracking parameters | Removes redirects | Type of rules | Notes |
|---|---|---|---|---|
| Google Search Link Fix | | partial | Site-specific: google, yandex | Uses unobfuscated URL present before click |
| Google Redirects Fixer & Tracking Remover | yes | partial | Site-specific: google | On request only |
| Link Cleaner | partial | partial | Site-specific: facebook, steam and reddit for redirects, item pages of aliexpress and amazon, utm_* parameters | |
| Don't Track Me Google | | partial | Site-specific: google | |
| Neat URL | yes | | | |
| LeanURL | partial | | Parameter-specific: utm_* | |
| Pure URL | partial | | Parameter-specific: utm_* and site-specific: facebook, yandex | |
| Tracking Token Stripper | partial | | Parameter-specific: utm_* | |
| Skip Redirect | | yes | | Rewriting URLs in page |
| Remove Redirect | | yes | | Intercepting requests |
| Open Link Directly (No Redirect) | | partial | Site-specific: google, yahoo | |
| ClearURLs | yes | | Gitlab-hosted rules file | |
| Untrack Me | yes | | | Not open source? |
| CleanLinks | partial | yes | Parameter-specific: utm_* | |
| Referer Control | | | | Modifies HTTP referer headers |
| Referer Modifier | | | | Modifies HTTP referer headers |
| HTTPS Everywhere | | | Online rules | Upgrades http requests to https |
| Consistent HTTPS | | | | Stops downgrading https to http on the same domain |
| Canonical | | | | Manually redirects to canonical link in page (if it exists: link[rel="canonical"]) |
| Universal Bypass | | | Hardcoded | Automatically redirects from URL shorteners |
| Link Redirect Trace | | | | Lists/traces redirects |
| Redirector | | | | User-defined redirects |

I think Neat URL, Lean URL and Pure URL are forks of each other and mostly share the same codebase. They most probably identify target URLs in the same way we do.
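The "identify target URLs" step these redirect-removing add-ons share can be sketched as scanning query parameters for a value that is itself a full URL (a simplified illustration, not any add-on's actual code; `embedded_target` is a made-up name):

```python
from urllib.parse import urlsplit, parse_qsl

def embedded_target(url):
    """Return the first query-parameter value that is itself a full URL, if any."""
    for _, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        # parse_qsl already percent-decodes values, so an embedded
        # https%3A%2F%2F... target comes out as a plain https:// URL.
        if value.startswith(('http://', 'https://')):
            return value
    return None

print(embedded_target('https://www.google.com/url?q=https%3A%2F%2Fexample.com%2F&sa=D'))
# → https://example.com/
```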

jawz101 (Author) commented Jun 9, 2018

Thank you so much for that breakdown. You threw CleanLinks into the comparison but didn't mark that it removes tracking parameters (e.g. utm_)? I thought it did; well, at least there's a rule for it, right?

If so, I wonder if I can go through some of those similar add-ons and try figuring out a more thorough regex pattern.

Cimbali (Owner) commented Jun 9, 2018

I just misaligned some columns. Fixed now :)

I'm wondering how many sites really have a <link rel="canonical"> tag, because that's definitely something we can leverage to auto-detect useless parameters.

Any differences between the URL and the canonical link can be recorded, and stripped on subsequent visits.
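That auto-detection idea could be prototyped roughly like this (a sketch of the proposal, not existing CleanLinks code; `strippable_params` and the URLs are illustrative): compare the visited URL's query parameters against the href of the page's `link[rel="canonical"]` and record the names the canonical form drops.

```python
from urllib.parse import urlsplit, parse_qsl

def strippable_params(page_url, canonical_url):
    """Parameter names present in the visited URL but absent from the canonical link."""
    page_params = dict(parse_qsl(urlsplit(page_url).query))
    canon_params = dict(parse_qsl(urlsplit(canonical_url).query))
    return sorted(set(page_params) - set(canon_params))

print(strippable_params(
    'https://www.example.com/product/123?colour=red&utm_campaign=spring',
    'https://www.example.com/product/123?colour=red'))
# → ['utm_campaign']
```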

jawz101 (Author) commented Jun 9, 2018

I know Amazon uses them. I've tried that add-on in the past and it was changing links. I actually use smile.amazon.com so I can donate to the Electronic Frontier Foundation, but when I clicked on product pages it was redirecting me to the regular amazon.com URL.

Cimbali (Owner) commented Jun 12, 2018

In 5d71d2a I've separated the parameter parsing from the redirect cleaning, so we should be able to improve on that from there, e.g. adding per-domain lists of parameters to strip.
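One shape such per-domain lists could take (the names and structure here are illustrative, not the actual CleanLinks data; a `'*'` key holds the global defaults and site rules also match subdomains):

```python
# Hypothetical per-domain strip rules; '*' holds the global defaults.
STRIP_RULES = {
    '*': ['utm_source', 'utm_medium', 'utm_campaign'],
    'example.com': ['ref_', 'tag'],
}

def params_to_strip(domain):
    """Global parameters plus any rules matching the domain or a parent domain."""
    params = list(STRIP_RULES.get('*', []))
    for site, extra in STRIP_RULES.items():
        if site != '*' and (domain == site or domain.endswith('.' + site)):
            params.extend(extra)
    return params

print(sorted(params_to_strip('www.example.com')))
# → ['ref_', 'tag', 'utm_campaign', 'utm_medium', 'utm_source']
```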

Cimbali (Owner) commented Mar 12, 2020

As it stands, I've imported a number of rules into the new default ruleset, from our previous default rules and from ClearURLs.

The following domains were fully whitelisted and no longer are:

docs.google.com
plus.google.com
twitter.com
static.ak.facebook.com
www.virustotal.com
admin.brightcove.com
www.mywot.com
www.linkedin.com

The other domains in the old skipdom options are handled more precisely by the new rules.
Similarly, these regex patterns are dropped:

paths =
imgres?
watch%3Fv
auth?client_id
signup
bing.com/widget
openid.ns
.mcstatic.com
sVidLoc
[Ll]ogout
submit?url=
^https?://www.amazon.[a-z.]+/.*/voting/cast/

Again, in most cases another rule is probably overriding these, if they are necessary.

Finally, exceptions in ClearURLs apply to the whole rule, whereas we whitelist individual query parameters. Therefore the following exceptions have not been integrated, and I'm posting them here for reference:

| ClearURLs rule | TLD | Domain | Path | Parameters |
|---|---|---|---|---|
| facebook | .* | facebook | /plugins/ | |
| facebook | .* | facebook | /dialog/share | |
| facebook | .* | facebook | /groups/member_bio/bio_dialog/ | |
| facebook | .* | facebook | /photo.php$ | |
| facebook | .* | facebook | /ajax/ | |
| facebook | .* | facebook | /privacy/specific_audience_selector_dialog/ | |
| facebook | .* | facebook | /photo/download/ | |
| global | .* | facebook | /groups/member_bio/bio_dialog/ | |
| global | .* | facebook | /login_alerts/ | |
| global | .* | facebook | /should_add_browser/ | |
| global | .* | facebook | /ajax/ | |
| amazon | .* | amazon | ^/gp.*/redirector.html/ | |
| amazon | .* | amazon | ^/hz/reviews-render/ajax/ | |
| amazon | .* | amazon | ^/gp.*/cart/ajax-update.html/ | |
| amazon | .* | amazon | ^/message-us$ | |
| google | .* | mail.google | ^/mail/u/0 | |
| google | .* | google | ^(/upload)?/drive/ | |
| google | .* | docs.google | / | |
| google | .* | accounts.google | | |
| google | .* | hangouts.google | ^/webchat | zx |
| google | .* | client-channel.google | ^/client-channel | zx |
| google | .* | google | ^/complete/search$ | gs_[a-za-z] |
| google | .* | google | ^/s$ | gs_[a-za-z] |
| google | .* | news.google | ^/ | hl |
| google | .* | google | ^/setprefs$ | hl |
| google | .* | google | ^/appsactivity/ | |
| google | .* | google | ^/aclk$ | |
| google | .* | drive.google | ^/videoplayback | |
| global | .* | myaccount.google | | |
| global | .* | ([a-za-z0-9-.]*.)?amazon | /message-us$ | |
| global | .ca | tangerine | | |
| global | .cc | clastarti | | |
| global | .cc | streamguard | | |
| global | .com | (authorization.)?td | ^ | |
| global | .com | bugtracker.fairphone | | ref[_]? |
| global | .com | cloudflare | | |
| global | .com | gcsip | | ref[_]? |
| global | .com | gitlab | | |
| global | .com | login.meijer | ^/.*$ | ref |
| global | .com | sso.serverplan | ^/manage2fa/check$ | ref |
| global | .com | support.steampowered | ^ | |
| global | .de | cyberport | /adscript.php | |
| global | .de | kreditkarten-banking.lbb | | |
| global | .* | git..* | /.*/-/branches$ | |
| global | .* | git..* | /commit/.*/pipelines$ | |
| global | .io | prismic | | |
| global | .net | ([a-za-z0-9-.]*.)?v-player | /player.aspx$ | |
| global | .net | battle | ^/login | |
| global | .net | tweakers | ^/ext/lt.dsp$ | ref[_]? |
| global | .nl | privacy.vakmedianet | ^ | ref |
| global | .org | matrix | ^/_matrix/ | |
| global | .* | * | ^/refs/switch | ref[_]? |
| global | .ru | tinkoff | | |
| bing | .* | bing | ^/ws/redirect/ | |
| indeed | .com | indeed | ^/rc/clk | |
| mozilla | .org | mozilla | ^/api | |
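The difference in exception semantics can be illustrated with a toy cleaner (simplified; this is neither add-on's real code and `clean` is a made-up name): a ClearURLs-style exception short-circuits the whole rule and leaves the URL untouched, while a parameter whitelist only spares individual keys.

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def clean(url, strip_params, whitelist_params=(), skip_url_patterns=()):
    # ClearURLs-style exception: one matching pattern skips the whole rule.
    if any(re.search(p, url) for p in skip_url_patterns):
        return url
    # CleanLinks-style whitelist: strip parameters, sparing listed names only.
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in strip_params or k in whitelist_params]
    return urlunsplit(parts._replace(query=urlencode(kept)))

url = 'https://news.example.com/?hl=en&utm_source=x'
print(clean(url, {'hl', 'utm_source'}, whitelist_params={'hl'}))
# → https://news.example.com/?hl=en  (only hl is spared)
print(clean(url, {'hl', 'utm_source'}, skip_url_patterns=[r'^https://news\.']))
# → the URL unchanged (whole rule skipped)
```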
