
Re-evaluate CleanLinks default whitelisted domains and patterns #20

Closed
jawz101 opened this issue Jun 6, 2018 · 13 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

jawz101 commented Jun 6, 2018

Skip Links Matching with
\/ServiceLogin|imgres\?|searchbyimage\?|watch%3Fv|auth\?client_id|signup|bing\.com\/widget|oauth|openid\.ns|\.mcstatic\.com|sVidLoc|[Ll]ogout|submit\?url=|magnet:|google\.com\/recaptcha\/


Remove from Links
(?:ref|aff)\\w*|utm_\\w+|(?:merchant|programme|media)ID

I can't even get this regex to work.
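For reference, here is one way to sanity-check that pattern against query-parameter names (a standalone sketch, not CleanLinks' actual code; `strip_tracking` and the URLs are illustrative). Note the doubled backslashes in the stored pattern: `\\w` matches a literal backslash followed by `w`, which may be exactly why it never fires.

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Single-backslash form of the pattern above; the doubled backslashes
# (\\w) would match a literal backslash, likely why it never fires.
PARAM_RE = re.compile(r'^(?:(?:ref|aff)\w*|utm_\w+|(?:merchant|programme|media)ID)$')

def strip_tracking(url):
    """Drop any query parameter whose name matches PARAM_RE."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not PARAM_RE.match(k)]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_tracking('https://example.com/item?utm_source=feed&affid=7&id=42'))
# → https://example.com/item?id=42
```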

I also like ClearURLs' implementation of a mechanism to update the ruleset from GitHub. And reviewing their rules, we're clearly missing a ton.


Skip Domains
accounts.google.com,docs.google.com,translate.google.com,login.live.com,plus.google.com,twitter.com,static.ak.facebook.com,www.linkedin.com,www.virustotal.com,account.live.com,admin.brightcove.com,www.mywot.com,webcache.googleusercontent.com,web.archive.org,accounts.youtube.com,signin.ebay.com

Looking at Diego's commits, these rules are all at least 4 years old. Sites change. I'd rather start with a clean slate and see what is applicable today.

Cimbali (Owner) commented Jun 6, 2018

Most login domains seem to still work the same (at least Google and eBay, maybe more); the others we should indeed re-evaluate. Let's collect in this thread all domains/regexes that are still valid or no longer valid.

Rulesets sound good in theory, but I'm not sure I can commit to updating the list regularly. Plus you still have to allow all users to whitelist to their own taste. So let's just adjust the defaults for now; I'll think about it.

About removing tracking elements from links: this is not really the core functionality of CleanLinks, which rather removes redirections. This does cause some problems, e.g. if we whitelist a website we also stop cleaning the tracking elements in its links, even though we might just want to see a redirect page without being tracked. Anyhow, if this does not work, please open another issue about it.

@Cimbali Cimbali changed the title readdress the Link Cleaning defaults Re-evaluate CleanLinks default whitelisted domains and patterns Jun 6, 2018
@Cimbali Cimbali added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jun 6, 2018
alekksander commented Jun 8, 2018

But in practice, isn't the tracking id usually removed when the link is prevented from redirecting?
I have to admit I might not fully understand what ClearURLs does, but if I do, it rather complements CleanLinks (as well as other privacy-concerned add-ons such as uBlock, uMatrix, etc.) rather than duplicating its functionality. Is that so?

Cimbali (Owner) commented Jun 8, 2018

We remove tracking ids from every link we examine, whether we remove a redirection or not. However, when a link is whitelisted, by pattern or domain, we leave it as is, including tracking ids.

ClearURLs seems to do a more in-depth job of removing tracking ids than we do, as they have a bunch of (domain-specific) rules while we just have one regex matching parameters. So the add-ons are in essence complementary even if there is some slight overlap: they block some domains like uMatrix does, we strip some ids like they do, etc.

jawz101 (Author) commented Jun 8, 2018

That's what gets me. I wish I could rely on a single extension to handle URL stuff. I didn't even mention the slew of other complementary or duplicative extensions all trying to do similar things. It's to the point where I really don't understand the differences among them.

jawz101 (Author) commented Jun 8, 2018

HTTPS Everywhere
Google Search Link Fix
Skip Redirect
Google Redirects Fixer & Tracking Remover
Redirector
Link Cleaner
Don't Track Me Google
Neat URL
Pure URL
Tracking Token Stripper
Remove Redirect
Link Redirect Trace
Open Link Directly (No Redirect)
ClearURLs
Untrack Me
Consistent HTTPS
Universal Bypass
Canonical
Referer Control
Referer Modifier

...and a bunch of extensions to convert text links to clickable links. I'm tired of listing them.

and then a bunch of about:config preferences:
browser.urlbar.decodeURLsOnCopy
network.IDN_show_punycode
network.http.referer.XOriginPolicy
network.http.referer.XOriginTrimmingPolicy
network.http.referer.defaultPolicy
network.http.referer.defaultPolicy.pbmode
network.http.referer.hideOnionSource
network.http.referer.spoofSource
network.http.referer.trimmingPolicy
network.http.sendRefererHeader
security.fileuri.strict_origin_policy

I just wish I could use the smallest number of extensions that still covers as much as makes sense.

jawz101 (Author) commented Jun 8, 2018

I'm not putting this on you, @Cimbali . It just seems like it's confusing that so many people have tried to achieve something with handling links and URLs. It would be nice if so many ideas were put into one solution.

For example, I can't tell whether controlling the referrer amounts to the same thing as stripping utm parameters off of a URL. I just don't understand the differences among all of these things.

It's like, if there were one extension that acted as the engine to do a bunch of these things, the community could then figure out how to build the rule set. I guess I hoped to have this talk with @diegocr one day.

Cimbali (Owner) commented Jun 8, 2018

Tbh I just resurrected an add-on I used to use to remove redirects, handling the rest with HTTPS Everywhere & uMatrix.

I mean, you do have a point, but there are 2 ways around it:

  • either you get a huge extension that does everything, but that's hard to maintain,
  • or every extension does one thing and does it well, in which case they're complementary.

I think there aren't a lot of extensions that remove redirects as we do, so we seem to be gearing more towards the second option here, even though you're right that it's not a great way to build a strong community that can contribute rules etc.

Cimbali (Owner) commented Jun 9, 2018

Here's a quick overview of what the add-ons you cite do:

| Add-on | Removes tracking parameters | Removes redirects | Type of rules | Notes |
|---|---|---|---|---|
| Google Search Link Fix | | partial | Site-specific: google, yandex | Uses unobfuscated URL present before click |
| Google Redirects Fixer & Tracking Remover | yes | partial | Site-specific: google | On request only |
| Link Cleaner | partial | partial | Site-specific: facebook, steam and reddit for redirects, item pages of aliexpress and amazon, utm_* parameters | |
| Don't Track Me Google | | partial | Site-specific: google | |
| Neat URL | yes | | | |
| LeanURL | partial | | Parameter-specific: utm_* | |
| Pure URL | partial | | Parameter-specific: utm_* and site-specific: facebook, yandex | |
| Tracking Token Stripper | partial | | Parameter-specific: utm_* | |
| Skip Redirect | | yes | | Rewriting URLs in page |
| Remove Redirect | | yes | | Intercepting requests |
| Open Link Directly (No Redirect) | | partial | Site-specific: google, yahoo | |
| ClearURLs | yes | | Gitlab-hosted rules file | |
| Untrack Me | yes | | | Not open source? |
| CleanLinks | partial | yes | Parameter-specific: utm_* | |
| Referer Control | | | | Modifies HTTP referer headers |
| Referer Modifier | | | | Modifies HTTP referer headers |
| HTTPS Everywhere | | | Online rules | Upgrades http requests to https |
| Consistent HTTPS | | | | Stops downgrading https to http on the same domain |
| Canonical | | | | Manually redirects to canonical link in page (if it exists: link[rel="canonical"]) |
| Universal Bypass | | | Hardcoded | Automatically redirects from URL shorteners |
| Link Redirect Trace | | | | Lists/traces redirects |
| Redirector | | | | User-defined redirects |

I think Neat URL, Lean URL and Pure URL are forks of each other and mostly share the same codebase. They most probably identify target URLs in the same way we do.
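The "identify target URLs" step these redirect-removing add-ons share can be sketched as scanning query parameters for a value that is itself a full URL (a simplified illustration, not any add-on's actual code; `embedded_target` is a made-up name):

```python
from urllib.parse import urlsplit, parse_qsl

def embedded_target(url):
    """Return the first query-parameter value that is itself a full URL, if any."""
    for _, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        # parse_qsl already percent-decodes values, so an embedded
        # https%3A%2F%2F... target comes out as a plain https:// URL.
        if value.startswith(('http://', 'https://')):
            return value
    return None

print(embedded_target('https://www.google.com/url?q=https%3A%2F%2Fexample.com%2F&sa=D'))
# → https://example.com/
```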

jawz101 (Author) commented Jun 9, 2018

Thank you so much for that breakdown. You threw CleanLinks into the comparison but didn't mark that it removes tracking parameters (e.g. utm_)? I thought it did; well, at least there's a rule for it, right?

If so, I wonder if I can go through some of those similar add-ons and try figuring out a more thorough regex pattern.

Cimbali (Owner) commented Jun 9, 2018

I just misaligned some columns. Fixed now :)

I'm wondering how many sites really have a <link rel="canonical"> tag, because that's definitely something we can leverage to auto-detect useless parameters.

Any differences between the URL and the canonical link can be recorded, and stripped on subsequent visits.
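That auto-detection idea could be prototyped roughly like this (a sketch of the proposal, not existing CleanLinks code; `strippable_params` and the URLs are illustrative): compare the visited URL's query parameters against the href of the page's `link[rel="canonical"]` and record the names the canonical form drops.

```python
from urllib.parse import urlsplit, parse_qsl

def strippable_params(page_url, canonical_url):
    """Parameter names present in the visited URL but absent from the canonical link."""
    page_params = dict(parse_qsl(urlsplit(page_url).query))
    canon_params = dict(parse_qsl(urlsplit(canonical_url).query))
    return sorted(set(page_params) - set(canon_params))

print(strippable_params(
    'https://www.example.com/product/123?colour=red&utm_campaign=spring',
    'https://www.example.com/product/123?colour=red'))
# → ['utm_campaign']
```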

jawz101 (Author) commented Jun 9, 2018

I know Amazon uses them. I've tried that add-on in the past and it was changing links. I actually use smile.amazon.com so I can donate to the Electronic Frontier Foundation, but when I clicked on product pages it was redirecting me to the regular amazon.com URL.

Cimbali (Owner) commented Jun 12, 2018

In 5d71d2a I've separated the parameter parsing from the redirect cleaning, so we should be able to improve on that from there, e.g. adding per-domain lists of parameters to strip.
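One shape such per-domain lists could take (the names and structure here are illustrative, not the actual CleanLinks data; a `'*'` key holds the global defaults and site rules also match subdomains):

```python
# Hypothetical per-domain strip rules; '*' holds the global defaults.
STRIP_RULES = {
    '*': ['utm_source', 'utm_medium', 'utm_campaign'],
    'example.com': ['ref_', 'tag'],
}

def params_to_strip(domain):
    """Global parameters plus any rules matching the domain or a parent domain."""
    params = list(STRIP_RULES.get('*', []))
    for site, extra in STRIP_RULES.items():
        if site != '*' and (domain == site or domain.endswith('.' + site)):
            params.extend(extra)
    return params

print(sorted(params_to_strip('www.example.com')))
# → ['ref_', 'tag', 'utm_campaign', 'utm_medium', 'utm_source']
```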

Cimbali (Owner) commented Mar 12, 2020

As it stands, I've imported a number of rules into the new default ruleset, from our previous default rules and from ClearURLs.

The following domains were fully whitelisted and no longer are:

docs.google.com
plus.google.com
twitter.com
static.ak.facebook.com
www.virustotal.com
admin.brightcove.com
www.mywot.com
www.linkedin.com

The other domains in the old skipdom options are handled more precisely by the new rules.
Similarly, these regex patterns are dropped:

paths =
imgres?
watch%3Fv
auth?client_id
signup
bing.com/widget
openid.ns
.mcstatic.com
sVidLoc
[Ll]ogout
submit?url=
^https?://www.amazon.[a-z.]+/.*/voting/cast/

Again, in most cases another rule is probably overriding these, if they are necessary.

Finally, exceptions in ClearURLs apply to the whole rule, whereas we whitelist individual query parameters. Therefore the following exceptions have not been integrated, and I'm posting them here for reference:

| ClearURLs rule | TLD | Domain | Path | Parameters |
|---|---|---|---|---|
| facebook | .* | facebook | /plugins/ | |
| facebook | .* | facebook | /dialog/share | |
| facebook | .* | facebook | /groups/member_bio/bio_dialog/ | |
| facebook | .* | facebook | /photo.php$ | |
| facebook | .* | facebook | /ajax/ | |
| facebook | .* | facebook | /privacy/specific_audience_selector_dialog/ | |
| facebook | .* | facebook | /photo/download/ | |
| global | .* | facebook | /groups/member_bio/bio_dialog/ | |
| global | .* | facebook | /login_alerts/ | |
| global | .* | facebook | /should_add_browser/ | |
| global | .* | facebook | /ajax/ | |
| amazon | .* | amazon | ^/gp.*/redirector.html/ | |
| amazon | .* | amazon | ^/hz/reviews-render/ajax/ | |
| amazon | .* | amazon | ^/gp.*/cart/ajax-update.html/ | |
| amazon | .* | amazon | ^/message-us$ | |
| google | .* | mail.google | ^/mail/u/0 | |
| google | .* | google | ^(/upload)?/drive/ | |
| google | .* | docs.google | / | |
| google | .* | accounts.google | | |
| google | .* | hangouts.google | ^/webchat | zx |
| google | .* | client-channel.google | ^/client-channel | zx |
| google | .* | google | ^/complete/search$ | gs_[a-za-z] |
| google | .* | google | ^/s$ | gs_[a-za-z] |
| google | .* | news.google | ^/ | hl |
| google | .* | google | ^/setprefs$ | hl |
| google | .* | google | ^/appsactivity/ | |
| google | .* | google | ^/aclk$ | |
| google | .* | drive.google | ^/videoplayback | |
| global | .* | myaccount.google | | |
| global | .* | ([a-za-z0-9-.]*.)?amazon | /message-us$ | |
| global | .ca | tangerine | | |
| global | .cc | clastarti | | |
| global | .cc | streamguard | | |
| global | .com | (authorization.)?td | ^ | |
| global | .com | bugtracker.fairphone | | ref[_]? |
| global | .com | cloudflare | | |
| global | .com | gcsip | | ref[_]? |
| global | .com | gitlab | | |
| global | .com | login.meijer | ^/.*$ | ref |
| global | .com | sso.serverplan | ^/manage2fa/check$ | ref |
| global | .com | support.steampowered | ^ | |
| global | .de | cyberport | /adscript.php | |
| global | .de | kreditkarten-banking.lbb | | |
| global | .* | git..* | /.*/-/branches$ | |
| global | .* | git..* | /commit/.*/pipelines$ | |
| global | .io | prismic | | |
| global | .net | ([a-za-z0-9-.]*.)?v-player | /player.aspx$ | |
| global | .net | battle | ^/login | |
| global | .net | tweakers | ^/ext/lt.dsp$ | ref[_]? |
| global | .nl | privacy.vakmedianet | ^ | ref |
| global | .org | matrix | ^/_matrix/ | |
| global | .* | * | ^/refs/switch | ref[_]? |
| global | .ru | tinkoff | | |
| bing | .* | bing | ^/ws/redirect/ | |
| indeed | .com | indeed | ^/rc/clk | |
| mozilla | .org | mozilla | ^/api | |
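The difference in exception semantics can be illustrated with a toy cleaner (simplified; this is neither add-on's real code and `clean` is a made-up name): a ClearURLs-style exception short-circuits the whole rule and leaves the URL untouched, while a parameter whitelist only spares individual keys.

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def clean(url, strip_params, whitelist_params=(), skip_url_patterns=()):
    # ClearURLs-style exception: one matching pattern skips the whole rule.
    if any(re.search(p, url) for p in skip_url_patterns):
        return url
    # CleanLinks-style whitelist: strip parameters, sparing listed names only.
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in strip_params or k in whitelist_params]
    return urlunsplit(parts._replace(query=urlencode(kept)))

url = 'https://news.example.com/?hl=en&utm_source=x'
print(clean(url, {'hl', 'utm_source'}, whitelist_params={'hl'}))
# → https://news.example.com/?hl=en  (only hl is spared)
print(clean(url, {'hl', 'utm_source'}, skip_url_patterns=[r'^https://news\.']))
# → the URL unchanged (whole rule skipped)
```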
