Cache is not working correctly #86

Closed
rock321987 opened this issue May 30, 2018 · 6 comments
rock321987 commented May 30, 2018

Cache seems to work incorrectly (or not at all 😁). To simulate the process, I used the following piece of code:

from reppy.robots import Robots
from reppy.ttl import HeaderWithDefaultPolicy
from expiringdict import ExpiringDict

policy = HeaderWithDefaultPolicy(default=1800, minimum=600)
robotsCache = ExpiringDict(max_len=62000, max_age_seconds=4800)

urlRobot = 'http://example.com/robots.txt'
url = 'http://example.com/some/path/'
USER_AGENT = 'my-user-agent'


for i in range(1, 10):
    robots = Robots.fetch(urlRobot, ttl_policy=policy)
    print(robots.allowed(url, USER_AGENT))

print('HeaderWithDefaultPolicy ends here')

for i in range(1, 10):
    if urlRobot in robotsCache:
        robots = robotsCache[urlRobot]
    else:
        robots = Robots.fetch(urlRobot)
        robotsCache[urlRobot] = robots
    
    print(robots.allowed(url, USER_AGENT))

The speed difference between the two cases is clearly visible. The cache used by the library is making a URL request on every iteration. I disconnected my internet while it was running the HeaderWithDefaultPolicy loop, and it froze; when I killed the process, it exited with a urllib error. So it was making network calls and the cache was not working.
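For what it's worth, a quick way to see the per-request cost is to time each iteration; a minimal sketch, reusing the same URL, user agent, and policy as in the snippet above:

import time
from reppy.robots import Robots
from reppy.ttl import HeaderWithDefaultPolicy

# Rough timing sketch to make the per-iteration cost visible.
policy = HeaderWithDefaultPolicy(default=1800, minimum=600)
urlRobot = 'http://example.com/robots.txt'
url = 'http://example.com/some/path/'
USER_AGENT = 'my-user-agent'

for i in range(1, 10):
    start = time.perf_counter()
    robots = Robots.fetch(urlRobot, ttl_policy=policy)
    # Each iteration prints the allowed() result and how long the fetch took.
    print(robots.allowed(url, USER_AGENT), round(time.perf_counter() - start, 3))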

Correct me if I am wrong.

Also, when will this be available in Python 3.6?

dlecocq commented May 30, 2018

The top-level Robots class doesn't do any caching itself, but two classes exist for that: RobotsCache and AgentCache from reppy.cache. The first caches a representation of the entire robots.txt response, and the second only caches the rules relevant to a particular user agent:

from reppy.cache import RobotsCache
cache = RobotsCache(capacity=100)
cache.allowed('http://example.com/foo/bar', 'my-user-agent')

from reppy.cache import AgentCache
cache = AgentCache(agent='my-user-agent', capacity=100)
cache.allowed('http://example.com/foo/bar')

More details are available under the caching section.
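Applied to your loop, a minimal sketch might look like the following (reusing the URL and user agent from your snippet): the first allowed() call fetches robots.txt, and later calls within the TTL should be served from the cache rather than refetched.

from reppy.cache import RobotsCache

# One shared cache for all checks; capacity is the number of cached entries.
cache = RobotsCache(capacity=100)

url = 'http://example.com/some/path/'
USER_AGENT = 'my-user-agent'

for i in range(1, 10):
    # allowed() takes the page URL; the cache resolves the robots.txt URL itself.
    print(cache.allowed(url, USER_AGENT))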

Regarding Python 3.6: it appears that 3.6 is included in the Travis build matrix, so it should work for you today on 3.6.

b4hand commented May 30, 2018

Yes, like @dlecocq said, Robots.fetch always fetches a URL. If you want caching behavior, you need to use one of the cache classes. Also, I can confirm that Python 3.6 works, as I have used it with reppy.

b4hand closed this as completed May 30, 2018
rock321987 (Author) commented

Thank you both for the prompt reply.

i) setup.py doesn't mention Python 3.6, and I was also getting an error during installation, which is why I got confused. I fixed that and it works fine. You could add 3.6 to setup.py.

ii) @b4hand If Robots.fetch always fetches the URL, then what's the use of HeaderWithDefaultPolicy? From what I understood, HeaderWithDefaultPolicy caches the object and checks the cache before fetching the URL. If the use case is different, it would be great if you could explain.

@dlecocq Also, I needed time-based caching, which is why I was using HeaderWithDefaultPolicy. As that didn't work out, I had to use ExpiringDict. RobotsCache doesn't provide a time-based cache. I know it can be hacked to work like that, but if there is a time-based cache, please point it out.

b4hand commented May 31, 2018

The policy is intended to be used with the cache classes. It doesn't change the fact that fetch always fetches the robots.txt file. If you look at the code, the call to requests.get is unconditional in the fetch method:

with closing(requests.get(url, *args, **kwargs)) as res:

DEFAULT_TTL_POLICY probably doesn't belong on the Robots class. I think that's a relic from before the policy logic was split out into separate classes. I can see where the confusion arises, but that's simply not how these classes are intended to be used. If you want reppy to cache, you must use one of the Cache objects.
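In other words, a time-based policy is wired into the cache rather than into fetch. A minimal sketch of how that combination might look, assuming the RobotsCache constructor accepts a ttl_policy keyword argument (check the constructor in reppy.cache if unsure):

from reppy.cache import RobotsCache
from reppy.ttl import HeaderWithDefaultPolicy

# Assumption: the cache constructor takes a ttl_policy keyword; the policy then
# controls how long each fetched robots.txt entry is kept before it is refetched.
policy = HeaderWithDefaultPolicy(default=1800, minimum=600)
cache = RobotsCache(capacity=100, ttl_policy=policy)

print(cache.allowed('http://example.com/some/path/', 'my-user-agent'))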

b4hand commented May 31, 2018

In fact, the methods ttl, expires, and expired probably all belong on the ExpiringObject class rather than the main Robots class, since Robots doesn't actually use or rely on those fields.

b4hand commented May 31, 2018

Given the above analysis, it seems pretty clear to me that a refactoring would be desirable, but it would also be an interface-breaking change. I may take a look and see how impactful that change would be, but it's probably not a high priority right now.
