Fix for challenge where part of formula is stored in a div #206

Closed
wants to merge 8 commits into from
74 changes: 63 additions & 11 deletions cfscrape/__init__.py
@@ -7,10 +7,14 @@

from requests.sessions import Session

from collections import OrderedDict

try:
from urlparse import urlparse
from urlparse import urlunparse
except ImportError:
from urllib.parse import urlparse
from urllib.parse import urlunparse

__version__ = "1.9.7"

@@ -24,8 +28,6 @@
"Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
]

DEFAULT_USER_AGENT = random.choice(DEFAULT_USER_AGENTS)

BUG_REPORT = """\
Cloudflare may have changed their technique, or there may be a bug in the script.

@@ -45,12 +47,13 @@

class CloudflareScraper(Session):
def __init__(self, *args, **kwargs):
self.delay = kwargs.pop("delay", 8)
self.default_delay = 8
self.delay = kwargs.pop("delay", self.default_delay)
super(CloudflareScraper, self).__init__(*args, **kwargs)

if "requests" in self.headers["User-Agent"]:
# Set a random User-Agent if no custom User-Agent has been set
self.headers["User-Agent"] = DEFAULT_USER_AGENT
self.headers["User-Agent"] = random.choice(DEFAULT_USER_AGENTS)

def is_cloudflare_challenge(self, resp):
return (
@@ -61,6 +64,19 @@ def is_cloudflare_challenge(self, resp):
)

def request(self, method, url, *args, **kwargs):
self.headers = (

Right here you are overwriting the headers that are already present. For example, with a debugger stopped just above this line, self.headers already exists and has contents:

> /app/sickrage/lib/cfscrape/__init__.py(70)request()
     69         self.headers = (
---> 70             OrderedDict(
     71                 [

ipdb> self.headers
{'Connection': 'keep-alive', u'Content-Type': u'application/json', u'Accept-Encoding': u'gzip,deflate', 'Accept': '*/*', u'User-Agent': u'SickChill.CE.1/(Linux; 4.18.20-unRAID; 469f26e1-6018-11e9-9ad7-0242ac110002)'}

For my request to be successful, I need the Content-Type header to remain. Other than that, the commit works well; hopefully we can get this in soon.

On second thought, could you explain this change? It also changes some other headers that are needed. I also assume that, for the servers that respect it, we don't want to limit them to English. The 'Accept' header may also vary.

Contributor Author

@CodyWoolaver These headers seemed to work best with Cloudflare, but I agree that breaking support for custom headers doesn't make sense. Can you share a site where you have problems with my current version?

The following gist contains a version which tries to restore support for custom headers by using the best-working headers only until Cloudflare has been bypassed:
https://gist.github.com/lukele/ce188004545192c0d92064e85138f0ab

@CodyWoolaver CodyWoolaver Apr 16, 2019

Perhaps you could set those headers specifically for interacting with Cloudflare and preserve the interactions with the intended site. Alternatively, you could merge in the specific headers you have picked, so that the intended headers still make it through.

The site that this caused issues for was actually a local install of Deluge and its web API. There is a hard requirement to have Content-Type when interacting with it (no idea why). The client accessing it is https://github.com/SickChill/SickChill/tree/master/lib/cfscrape and uses a copy of cfscrape.

Let me know if I can help in any way.

Contributor Author

> Perhaps you could set those headers for specifically interacting with cloudflare and preserve the interactions with the intended site.

That's basically what I'm doing in the gist. The only problem is that Cloudflare did require, for example, that the User-Agent not change between requests, so merging might be more problematic. Did you try the gist by any chance?

That makes sense: the User-Agent should be preserved, as it is a value that should not change frequently. Is there any risk in picking a couple of headers and keeping them fixed?

For example, force the values of User-Agent, Connection, and Upgrade-Insecure-Requests to be consistent and do kwargs['headers'].update(<contents>). That way the User-Agent and other important headers remain, and we can still pass in custom headers.

I have not had a chance to test it out yet, and I won't be in a position to do so for several hours. I can get back to you then.
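
One way to read the suggestion above is as a merge rather than a replacement. The following is a minimal sketch under that assumption, not code from the PR or the gist: `merge_required_headers` and the sample values are hypothetical, and which headers Cloudflare actually needs kept fixed is still an open question in this thread.

```python
from collections import OrderedDict


def merge_required_headers(existing, required):
    """Return a copy of `existing` with `required` applied on top.

    Hypothetical helper for illustration; not part of cfscrape.
    """
    merged = OrderedDict(existing)
    # HTTP header names are case-insensitive, so remove any differently
    # cased copy of a required header before applying the required value.
    required_names = {name.lower() for name in required}
    for name in list(merged):
        if name.lower() in required_names:
            del merged[name]
    merged.update(required)
    return merged


# Sample values only: a caller-supplied Content-Type plus headers the
# scraper wants to keep stable between requests.
existing = {
    "Content-Type": "application/json",
    "connection": "keep-alive",
    "User-Agent": "SickChill.CE.1",
}
required = OrderedDict([
    ("User-Agent", existing["User-Agent"]),  # keep the session's User-Agent
    ("Connection", "close"),
    ("Upgrade-Insecure-Requests", "1"),
])

merged = merge_required_headers(existing, required)
print(dict(merged))
```

The caller's Content-Type survives, while the headers picked for Cloudflare take precedence. The input mapping is copied, so the headers that were passed in are not modified.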

I would also like to mention that I completely removed lines 67-79 of this PR in the code running at home and was able to get through without a problem.

@CodyWoolaver CodyWoolaver Apr 22, 2019

Any more movement on this @lukele? The proposed fix you provided did work; however, I still believe we should not completely ignore headers that are passed in. Having a set of headers that we require when talking to Cloudflare makes sense, but it should be an update of the object, not a replacement of it.
@Anorov do you have any input on this? I know some people are looking for a more prominent fix, rather than trying to cherry-pick this PR :)

Contributor Author

@CodyWoolaver sorry for the late reply. It's best to have a look at cloudscraper (https://github.com/VeNoMouS/cloudscraper). It's based on cfscrape but more actively maintained for now. It also includes some good fixes for the captcha challenge that is sometimes presented.

OrderedDict(
[
('User-Agent', self.headers['User-Agent']),
('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
('Accept-Language', 'en-US,en;q=0.5'),
('Accept-Encoding', 'gzip, deflate'),
('Connection', 'close'),
('Upgrade-Insecure-Requests', '1')
]
)
)

resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)

# Check if Cloudflare anti-bot is on
@@ -83,9 +99,9 @@ def solve_cf_challenge(self, resp, **original_kwargs):
headers["Referer"] = resp.url

try:
params["s"] = re.search(r'name="s"\svalue="(?P<s_value>[^"]+)', body).group('s_value')
params["jschl_vc"] = re.search(r'name="jschl_vc" value="(\w+)"', body).group(1)
params["pass"] = re.search(r'name="pass" value="(.+?)"', body).group(1)
params["s"] = re.search(r'name="s"\svalue="(?P<s_value>[^"]+)', body).group('s_value')
except Exception as e:
# Something is wrong with the page.
# This may indicate Cloudflare has changed their anti-bot
@@ -96,20 +112,28 @@ def solve_cf_challenge(self, resp, **original_kwargs):
# Solve the Javascript challenge
params["jschl_answer"] = self.solve_challenge(body, domain)

# Check if the default delay has been overridden. If not, use the delay required by
# cloudflare.
if self.delay == self.default_delay:
try:
self.delay = float(re.search(r"submit\(\);\r?\n\s*},\s*([0-9]+)", body).group(1)) / float(1000)
except Exception:
pass

# Requests transforms any request into a GET after a redirect,
# so the redirect has to be handled manually here to allow for
# performing other types of requests even as the first request.
method = resp.request.method
cloudflare_kwargs["allow_redirects"] = False

end_time = time.time()
time.sleep(self.delay - (end_time - start_time)) # Cloudflare requires a delay before solving the challenge
# Cloudflare requires a delay before solving the challenge
time.sleep(self.delay - (end_time - start_time))

redirect = self.request(method, submit_url, **cloudflare_kwargs)

redirect_location = urlparse(redirect.headers["Location"])
if not redirect_location.netloc:
redirect_url = "%s://%s%s" % (parsed_url.scheme, domain, redirect_location.path)
redirect_url = urlunparse((parsed_url.scheme, domain, redirect_location.path, redirect_location.params, redirect_location.query, redirect_location.fragment))
return self.request(method, redirect_url, **original_kwargs)
return self.request(method, redirect.headers["Location"], **original_kwargs)

@@ -120,8 +144,15 @@ def solve_challenge(self, body, domain):
except Exception:
raise ValueError("Unable to identify Cloudflare IUAM Javascript on website. %s" % BUG_REPORT)

js = re.sub(r"a\.value = (.+ \+ t\.length(\).toFixed\(10\))?).+", r"\1", js)
js = re.sub(r"\s{3,}[a-z](?: = |\.).+", "", js).replace("t.length", str(len(domain)))
js = re.sub(r"a\.value = (.+\.toFixed\(10\);).+", r"\1", js)
# Match code that accesses the DOM and remove it, but without stripping too much.
try:
solution_name = re.search(r"s,t,o,p,b,r,e,a,k,i,n,g,f,\s*(.+)\s*=", js).group(1)
match = re.search(r"(.*};)\n\s*(t\s*=(.+))\n\s*(;%s.*)" % solution_name, js, re.M | re.I | re.DOTALL).groups()
js = match[0] + match[-1]
except Exception:
raise ValueError("Error parsing Cloudflare IUAM Javascript challenge. %s" % BUG_REPORT)
js = js.replace("t.length", str(len(domain)))

# Strip characters that could be used to exit the string context
# These characters are not currently used in Cloudflare's arithmetic snippet
@@ -130,9 +161,30 @@ def solve_challenge(self, body, domain):
if "toFixed" not in js:
raise ValueError("Error parsing Cloudflare IUAM Javascript challenge. %s" % BUG_REPORT)

# 2019-03-20: Cloudflare sometimes stores part of the challenge in a div which is later
# added using document.getElementById(x).innerHTML, so it is necessary to simulate that
# method and value.
try:
# Find the id of the div in the javascript code.
k = re.search(r"k\s+=\s+'([^']+)';", body).group(1)
# Find the div with that id and store its content.
val = re.search(r'<div(.*)id="%s"(.*)>(.*)</div>' % (k), body).group(3)
except Exception:
# If not available, either the code has been modified again, or the old
# style challenge is used.
k = ''
val = ''

# Use vm.runInNewContext to safely evaluate code
# The sandboxed code cannot use the Node.js standard library
js = "console.log(require('vm').runInNewContext('%s', Object.create(null), {timeout: 5000}));" % js
# Add the atob method, which is now used by Cloudflare's code but is not available in all node versions.
simulate_document_js = 'var document= {getElementById: function(x) { return {innerHTML:"%s"};}}' % (val)
atob_js = 'var atob = function(str) {return Buffer.from(str, "base64").toString("binary");}'
# t is not defined, so we have to define it and set it to the domain name.
js = '%s;%s;var t="%s";%s' % (simulate_document_js,atob_js,domain,js)
buffer_js = "var Buffer = require('buffer').Buffer"
# Pass Buffer into the new context, so it is available for atob.
js = "%s;console.log(require('vm').runInNewContext('%s', {'Buffer':Buffer,'g':String.fromCharCode}, {timeout: 5000}));" % (buffer_js, js)

try:
result = subprocess.check_output(["node", "-e", js]).strip()
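
The div lookup introduced in the 2019-03-20 hunk above can be tried in isolation. This is a minimal sketch that reuses the two regular expressions from the diff. The function name `extract_hidden_div` and the sample page are made up for illustration, and the `re.escape` call is an addition that is not in the PR.

```python
import re


def extract_hidden_div(body):
    """Return (div_id, inner_html), or ('', '') if no such div is found.

    Hypothetical helper for illustration; the PR does this inline.
    """
    try:
        # Find the id of the div in the JavaScript code.
        div_id = re.search(r"k\s+=\s+'([^']+)';", body).group(1)
        # Find the div with that id and keep its content.
        # re.escape is an addition here, so the id is matched literally.
        pattern = r'<div(.*)id="%s"(.*)>(.*)</div>' % re.escape(div_id)
        inner_html = re.search(pattern, body).group(3)
    except AttributeError:
        # re.search returned None: old-style challenge or changed markup.
        return "", ""
    return div_id, inner_html


# Made-up sample page; real challenge pages differ.
sample = """
<script>
  var s,t,o,p,b,r,e,a,k,i,n,g,f, k = 'cf-dn-abc';
</script>
<div style="display:none;visibility:hidden;" id="cf-dn-abc">+((!+[]+!![]))</div>
"""

print(extract_hidden_div(sample))
print(extract_hidden_div("<html></html>"))
```

Catching only `AttributeError` keeps the fallback limited to the case where a pattern does not match, whereas the diff catches every `Exception`.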