Caching of response data is implemented via CacheControl, a library that extends requests. To allow flexible use of the caching mechanism, the use of cached data is controlled by a cache policy, which can be specified either for the whole scraper or for a specific request.
The following policies are supported:
'http' will perform response caching and validation according to HTTP semantics, i.e. in the way that a browser would do it. This requires the server to set accurate cache control headers, which many applications fail to do.
'none' will disable caching entirely and always contact the server for up-to-date information.
'force' will always use the cached data without checking the server for updated pages. This is useful in debug mode, but dangerous when used in production.
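To make the 'http' policy more concrete, the sketch below shows the kind of freshness check that HTTP caching semantics imply. This is an illustration only, not scrapekit's or CacheControl's actual implementation; both the function name and the simplified handling of Cache-Control directives are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(headers, now=None):
    """Illustrative check: may a cached response be re-used without
    re-contacting the server? Based only on the Cache-Control max-age
    and Date headers; real HTTP caching considers many more directives."""
    now = now or datetime.now(timezone.utc)
    directives = [d.strip() for d in headers.get('Cache-Control', '').split(',')]
    # no-store / no-cache force revalidation with the server.
    if 'no-store' in directives or 'no-cache' in directives:
        return False
    max_age = None
    for d in directives:
        if d.startswith('max-age='):
            max_age = int(d.split('=', 1)[1])
    # Without an explicit lifetime and a Date header, do not re-use.
    if max_age is None or 'Date' not in headers:
        return False
    age = (now - parsedate_to_datetime(headers['Date'])).total_seconds()
    return age < max_age
```

A server that omits accurate headers would always fail this check, which is why the 'http' policy depends on well-behaved applications.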
Per-request cache settings
While caching will usually be configured on a scraper-wide basis, it can also be set for individual (GET) requests by passing a 'cache' argument set to one of the policy names:
import scrapekit

scraper = scrapekit.Scraper('example')

# No caching:
scraper.get('http://google.com', cache='none')

# Cache according to HTTP semantics:
scraper.get('http://google.com', cache='http')

# Force re-use of data, even if it is stale:
scraper.get('http://google.com', cache='force')
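Since 'force' is convenient while developing but dangerous in production, one possible pattern (an assumption about usage, not a scrapekit feature) is to derive the policy from an environment variable and pass it to each request:

```python
import os

def cache_policy(default='http'):
    """Hypothetical helper: pick a cache policy from the environment.
    The SCRAPER_CACHE variable name is an illustrative choice, not a
    scrapekit convention."""
    policy = os.environ.get('SCRAPER_CACHE', default)
    if policy not in ('http', 'none', 'force'):
        raise ValueError('unknown cache policy: %s' % policy)
    return policy
```

A request would then read `scraper.get(url, cache=cache_policy())`, so switching the whole run to stale data during debugging is a matter of exporting `SCRAPER_CACHE=force`.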