Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsafe parsing of URLs #8747

Closed
ghost opened this issue Sep 23, 2016 · 14 comments
Closed

Unsafe parsing of URLs #8747

ghost opened this issue Sep 23, 2016 · 14 comments
Labels
url Issues and PRs related to the legacy built-in url module.

Comments

@ghost
Copy link

ghost commented Sep 23, 2016

  • Version: 0.10.36
  • Platform: Mac OS X
  • Subsystem: url.parse

Some cases in which I would expect url.parse to behave differently:

IPv4 address

url.parse('1.2.3.4') = {
  protocol: null,
  slashes: null,
  auth: null,
  host: null,
  port: null,
  hostname: null,  // <- should be 1.2.3.4
  hash: null,
  search: null,
  query: null,
  pathname: '1.2.3.4', // <- no pathname
  path: '1.2.3.4', // <- no path
  href: '1.2.3.4'
}

IPv4 address with port

url.parse('1.2.3.4:8080') = {
  protocol: '1.2.3.4:',
  slashes: null,
  auth: null,
  host: '8080', // host is 1.2.3.4 not 8080
  port: null, // no port
  hostname: '8080', // host is 1.2.3.4 not 8080
  hash: null,
  search: null,
  query: null,
  pathname: null,
  path: null,
  href: '1.2.3.4:8080'
}

IPv6 address

url.parse('::1') = {
  protocol: null,
  slashes: null,
  auth: null,
  host: null, // should be ::1
  port: null,
  hostname: null, // should be ::1
  hash: null,
  search: null,
  query: null,
  pathname: '::1', // no pathname
  path: '::1', // no path
  href: '::1'
}

IPv6 address with port

url.parse('::1:8080') = {
  protocol: null,
  slashes: null,
  auth: null,
  host: null, // should be ::1
  port: null, // should be 8080
  hostname: null, // should be ::1
  hash: null,
  search: null,
  query: null,
  pathname: '::1:8080', // no pathname
  path: '::1:8080', // no path
  href: '::1:8080'
}

Hostname

url.parse('asdf.com') = {
  protocol: null,
  slashes: null,
  auth: null,
  host: null, // should be asdf.com
  port: null,
  hostname: null, // should be asdf.com
  hash: null,
  search: null,
  query: null,
  pathname: 'asdf.com', // no pathname
  path: 'asdf.com', // no path
  href: 'asdf.com'
}

Hostname with port

url.parse('asdf.com:8080') = {
  protocol: 'asdf.com:',
  slashes: null,
  auth: null,
  host: '8080',
  port: null,
  hostname: '8080',
  hash: null,
  search: null,
  query: null,
  pathname: null,
  path: null,
  href: 'asdf.com:8080'
}
@jasnell
Copy link
Member

jasnell commented Sep 23, 2016

Yes, there are a number of known issues with the algorithm in url.parse(). It's unlikely that those are going to get fully resolved, however. There is another effort to introduce a new, more robust URL parser that is based on the WHATWG URL Parsing spec that should hopefully be introduced as an experimental feature in the near future.

@jasnell jasnell added the url Issues and PRs related to the legacy built-in url module. label Sep 23, 2016
@ghost
Copy link
Author

ghost commented Sep 23, 2016

In short, if you omit auth (e.g: user:pass@) or the protocol (protocol://) it doesn't work correctly.

@ghost ghost changed the title Unsafe parsing Unsafe parsing of URLs Sep 23, 2016
@ghost
Copy link
Author

ghost commented Sep 23, 2016

As a workaround I am going to use https://github.com/jsdom/whatwg-url

@bnoordhuis
Copy link
Member

It's simpler than that. Without a protocol: prefix, url.parse cannot know whether the input is a URL or a file path and it will conservatively assume the latter.

@lpinca
Copy link
Member

lpinca commented Sep 23, 2016

In the browser you'll get an error with those URLs if you don't provide a base URL:

> new URL('1.2.3.4');
Uncaught TypeError: Failed to construct 'URL': Invalid URL(…)

@ghost
Copy link
Author

ghost commented Sep 23, 2016

@bnoordhuis: no.

If you do:

url.parse('user:pass@hostname.com') = {
  protocol: 'user:',
  slashes: null,
  auth: 'pass',
  host: 'hostname.com',
  port: null,
  hostname: 'hostname.com',
  hash: null,
  search: null,
  query: null,
  pathname: null,
  path: null,
  href: 'user:pass@hostname.com'
}

It is parsed differently, and there's still an absence of protocol: prefix.
This is what I said.

@lpinca
Copy link
Member

lpinca commented Sep 23, 2016

user: is the protocol.

@ghost
Copy link
Author

ghost commented Sep 23, 2016

user: is not the protocol, it is part of the authentication segment of the URI delimited by @.
In this case it is being recognized as the protocol, but that behavior is not correct.

url.parse should put user:pass as the value for auth.

@bnoordhuis
Copy link
Member

You misunderstand @lpinca. user: is the protocol, it's right there in the output. :-) Try prefixing it with e.g. ftp:// - only then does it get parsed as an auth section.

@gibfahn
Copy link
Member

gibfahn commented Sep 24, 2016

@arboreal84 If you're willing to create a PR to update this info in the documentation so that other people don't run into the same issue then that'd be great!

@ChALkeR
Copy link
Member

ChALkeR commented Sep 24, 2016

@arboreal84 The current url implementation has some inconsistencies and will be likely replaced (complemented) with a browser-compatible WHATWG URL implementation: nodejs/node-eps#28.

That said, note that most of your examples are in fact behaving exactly as they should and are consistent with WHATWG URL or are not valid URLs at all and should fail in WHATWG URL.

For example, if user:pass@hostname.com is treated as a URL, user: is the protocol.
I.e., in browsers:

> new URL('user:pass@hostname.com')
URL {
  hash: ""
  host: ""
  hostname: ""
  href: "user:pass@hostname.com"
  origin: "user://"
  password: ""
  pathname: "pass@hostname.com"
  port: ""
  protocol: "user:"
  search: ""
  searchParams: URLSearchParams
  username: ""
  __proto__: URL
}

You should consult URL specification probably.

@ChALkeR
Copy link
Member

ChALkeR commented Sep 24, 2016

More specifically, per spec:

  • 1.2.3.4 — not an URL.
  • 1.2.3.4:8080 — not an URL.
  • ::1 — not an URL.
  • ::1:8080 — not an URL.
  • asdf.com — not an URL.
  • asdf.com:8080 — URL, protocol: asdf.com:.
  • user:pass@hostname.com — URL, protocol: user:.

@silverwind
Copy link
Contributor

silverwind commented Sep 24, 2016

Yep, the protocol is pretty much mandatory for parsing URLs with either the url or whatwg-url modules. If you want to parse protocoll-less strings, prepend and remove the protocol yourself or use something like url-parse-lax.

@ghost
Copy link
Author

ghost commented Jul 13, 2017

As a follow up to this issue: now the url module provides a class that implements the WHATWG standard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
url Issues and PRs related to the legacy built-in url module.
Projects
None yet
Development

No branches or pull requests

6 participants