Cache is not working correctly #86

Closed
rock321987 opened this issue May 30, 2018 · 6 comments
rock321987 commented May 30, 2018

Cache seems to work incorrectly (or not at all 😁). To simulate the process, I used the following piece of code:

from reppy.robots import Robots
from reppy.ttl import HeaderWithDefaultPolicy
from expiringdict import ExpiringDict

policy = HeaderWithDefaultPolicy(default=1800, minimum=600)
robotsCache = ExpiringDict(max_len=62000, max_age_seconds=4800)

urlRobot = 'http://example.com/robots.txt'
url = 'http://example.com/some/path/'
USER_AGENT = 'my-user-agent'


for i in range(1, 10):
    robots = Robots.fetch(urlRobot, ttl_policy=policy)
    print(robots.allowed(url, USER_AGENT))

print('HeaderWithDefaultPolicy ends here')

for i in range(1, 10):
    if urlRobot in robotsCache:
        robots = robotsCache[urlRobot]
    else:
        robots = Robots.fetch(urlRobot)
        robotsCache[urlRobot] = robots
    
    print(robots.allowed(url, USER_AGENT))

The speed difference between the two cases is clearly visible. The cache used by the library is making a URL request on every iteration. I disconnected my internet while it was running the HeaderWithDefaultPolicy loop, and it froze; when I killed the process, it exited with a urllib error. So it was making network calls and the cache was not working.
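For what it's worth, a quick way to see the per-request cost is to time each iteration; a minimal sketch, reusing the same URL, user agent, and policy as in the snippet above:

import time
from reppy.robots import Robots
from reppy.ttl import HeaderWithDefaultPolicy

# Rough timing sketch to make the per-iteration cost visible.
policy = HeaderWithDefaultPolicy(default=1800, minimum=600)
urlRobot = 'http://example.com/robots.txt'
url = 'http://example.com/some/path/'
USER_AGENT = 'my-user-agent'

for i in range(1, 10):
    start = time.perf_counter()
    robots = Robots.fetch(urlRobot, ttl_policy=policy)
    # Each iteration prints the allowed() result and how long the fetch took.
    print(robots.allowed(url, USER_AGENT), round(time.perf_counter() - start, 3))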

Correct me if I am wrong.

Also, when will this be available in Python 3.6?

dlecocq commented May 30, 2018

The top-level Robots class doesn't do any caching itself, but two classes exist for that: RobotsCache and AgentCache from reppy.cache. The first caches a representation of the entire robots.txt response, and the second only caches the rules relevant to a particular user agent:

from reppy.cache import RobotsCache
cache = RobotsCache(capacity=100)
cache.allowed('http://example.com/foo/bar', 'my-user-agent')

from reppy.cache import AgentCache
cache = AgentCache(agent='my-user-agent', capacity=100)
cache.allowed('http://example.com/foo/bar')

More details are available under the caching section.
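Applied to your loop, a minimal sketch might look like the following (reusing the URL and user agent from your snippet): the first allowed() call fetches robots.txt, and later calls within the TTL should be served from the cache rather than refetched.

from reppy.cache import RobotsCache

# One shared cache for all checks; capacity is the number of cached entries.
cache = RobotsCache(capacity=100)

url = 'http://example.com/some/path/'
USER_AGENT = 'my-user-agent'

for i in range(1, 10):
    # allowed() takes the page URL; the cache resolves the robots.txt URL itself.
    print(cache.allowed(url, USER_AGENT))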

Regarding Python 3.6: it appears that 3.6 is included in the Travis build matrix, so it should work for you today on 3.6.

b4hand commented May 30, 2018

Yes, like @dlecocq said, Robots.fetch always fetches a URL. If you want caching behavior, you need to use one of the cache classes. Also, I can confirm that Python 3.6 works, as I have used it with reppy.

b4hand closed this as completed May 30, 2018
rock321987 (Author) commented

Thank you both for the prompt reply.

i) setup.py doesn't mention Python 3.6, and I was also getting an error during installation, which is why I got confused. I fixed that and it works fine. You could add 3.6 to setup.py.

ii) @b4hand If Robots.fetch always fetches the URL, then what's the use of HeaderWithDefaultPolicy? From what I understood, HeaderWithDefaultPolicy caches the object and checks the cache before fetching the URL. If the use case is different, it would be great if you could explain.

@dlecocq Also, I needed time-based caching, which is why I was using HeaderWithDefaultPolicy. As that didn't work out, I had to use ExpiringDict. RobotsCache doesn't provide a time-based cache. I know it can be hacked to work like that, but if there is a time-based cache, please point it out.

b4hand commented May 31, 2018

The policy is intended to be used with the cache classes. It doesn't change the fact that fetch always fetches the robots.txt file. If you look at the code, the call to requests.get is unconditional in the fetch method:

with closing(requests.get(url, *args, **kwargs)) as res:

DEFAULT_TTL_POLICY probably doesn't belong on the Robots class. I think that's a relic from before the policy logic was split out into separate classes. I can see where the confusion arises, but that's simply not how these classes are intended to be used. If you want reppy to cache, you must use one of the Cache objects.
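In other words, a time-based policy is wired into the cache rather than into fetch. A minimal sketch of how that combination might look, assuming the RobotsCache constructor accepts a ttl_policy keyword argument (check the constructor in reppy.cache if unsure):

from reppy.cache import RobotsCache
from reppy.ttl import HeaderWithDefaultPolicy

# Assumption: the cache constructor takes a ttl_policy keyword; the policy then
# controls how long each fetched robots.txt entry is kept before it is refetched.
policy = HeaderWithDefaultPolicy(default=1800, minimum=600)
cache = RobotsCache(capacity=100, ttl_policy=policy)

print(cache.allowed('http://example.com/some/path/', 'my-user-agent'))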

b4hand commented May 31, 2018

In fact, the methods ttl, expires, and expired probably all belong on the ExpiringObject class rather than the main Robots class, since Robots doesn't actually use or rely on those fields.

b4hand commented May 31, 2018

Given the above analysis, it seems pretty clear to me that a refactoring would be desirable, but it would also be an interface-breaking change. I may take a look and see how impactful that change would be, but it's probably not a high priority right now.
