API documentation

The following documentation aims to present the internal API of the library. While it is possible to use all of these classes directly, following the usage patterns detailed in the rest of the documentation is advised.

Basic Scraper

class scrapekit.core.Scraper(name, config=None, report=False)

Scraper application object which handles resource management for a variety of related functions.
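
For example, a scraper instance might be created like this (a sketch; the name and the config keys shown are illustrative only):

    from scrapekit.core import Scraper

    # 'demo' is an arbitrary name; the config dict is optional and its
    # keys depend on the options used by the deployment.
    scraper = Scraper('demo', config={'threads': 4}, report=True)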

Session()

Create a pre-configured requests session instance that can be used to run HTTP requests. This instance will potentially be cached, or a stub, depending on the configuration of the scraper.
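
A minimal sketch of using the session directly (the URL is a placeholder):

    session = scraper.Session()
    # The returned object behaves like a regular requests session.
    res = session.get('http://example.com/index.html')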

get(url, **kwargs)

HTTP GET via requests.

See: http://docs.python-requests.org/en/latest/api/#requests.get

head(url, **kwargs)

HTTP HEAD via requests.

See: http://docs.python-requests.org/en/latest/api/#requests.head

post(url, **kwargs)

HTTP POST via requests.

See: http://docs.python-requests.org/en/latest/api/#requests.post

put(url, **kwargs)

HTTP PUT via requests.

See: http://docs.python-requests.org/en/latest/api/#requests.put
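
These helpers accept the same keyword arguments as the corresponding requests functions. A short sketch (URL and payload are placeholders):

    # GET a page through the scraper's configured session.
    res = scraper.get('http://example.com/items?page=1')

    # POST form data; extra keyword arguments are passed through to requests.
    res = scraper.post('http://example.com/search', data={'q': 'books'})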

report()

Generate a static HTML report for the last runs of the scraper from its log file.

task(fn)

Decorate a function as a task in the scraper framework. This enables the function to be queued and executed in a separate thread, so that the scraper can run asynchronously.
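
A minimal sketch of a decorated task, assuming a scraper instance as above (the URL is a placeholder):

    @scraper.task
    def fetch(url):
        # Runs in a worker thread when queued through the task system.
        doc = scraper.get(url).html()
        return doc.findtext('.//title')

    # Queue a first item and wait for the queue to drain.
    fetch.run('http://example.com/')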

Tasks and threaded execution

This module holds a simple system for the multi-threaded execution of scraper code. This can be used, for example, to split a scraper into several stages and to have multiple elements processed at the same time.

The goal of this module is to handle simple multi-threaded scrapers, while making it easy to upgrade to a queue-based setup using Celery later.
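
As a sketch of such a staged setup, two tasks can be connected so that each item yielded by the first is processed concurrently by the second (URLs and element paths are placeholders):

    @scraper.task
    def get_index():
        doc = scraper.get('http://example.com/list.html').html()
        for link in doc.findall('.//a'):
            yield link.get('href')

    @scraper.task
    def get_page(url):
        doc = scraper.get(url).html()
        print(doc.findtext('.//title'))

    # Each URL yielded by get_index is queued as an argument to get_page.
    pipeline = get_index | get_page
    pipeline.run()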

class scrapekit.tasks.Task(scraper, fn, task_id=None)

A task is a decorator on a function which helps to manage the execution of that function in a multi-threaded, queued context.

After a task has been applied to a function, it can either be used in the normal way (by calling it directly), through a simple queue (using the queue method), or in pipeline mode (using chain, pipe and run).

chain(other_task)

Add a chain listener to the execution of this task. Whenever an item has been processed by the task, the registered listener task will be queued to be executed with the output of this task.

Can also be written as:

    pipeline = task1 > task2

pipe(other_task)

Add a pipe listener to the execution of this task. The output of this task is required to be an iterable. Each item in the iterable will be queued as the sole argument to an execution of the listener task.

Can also be written as:

    pipeline = task1 | task2

queue(*args, **kwargs)

Schedule a task for execution. The task call (and its arguments) will be placed on the queue and processed asynchronously.

run(*args, **kwargs)

Queue a first item to execute, then wait for the queue to be empty before returning. This should be the default way of starting any scraper.

wait()

Wait for task execution in the current queue to be complete (i.e. for the queue to be empty). If queue is called without a subsequent wait, no processing will occur.
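
For example, several items can be queued and then awaited explicitly (a sketch, re-using the fetch task from above):

    for url in ['http://example.com/a', 'http://example.com/b']:
        fetch.queue(url)

    # Block until every queued task has been processed.
    fetch.wait()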

class scrapekit.tasks.TaskManager(threads=10)

The TaskManager is a singleton that manages the threads used to parallelize processing and the queue that holds the current set of prepared tasks.

put(task, args, kwargs)

Add a new item to the queue. An item is a task and the arguments needed to call it.

Do not call this directly; use Task.queue/Task.run instead.

wait()

Wait for each item in the queue to be processed. If this is not called, the main thread will end immediately and none of the tasks assigned to the threads will be executed.

HTTP caching and parsing

class scrapekit.http.PolicyCacheController(cache=None, cache_etags=True, serializer=None)

Switch the caching mode based on the caching policy provided by the request, which in turn can be given at request time or through the scraper configuration.

class scrapekit.http.ScraperResponse

A modified scraper response that can parse the content into HTML, XML, JSON or a BeautifulSoup instance.

html()

Create an lxml-based HTML DOM from the response. The tree will not have a root, so all queries need to be relative (i.e. start with a dot).

json(**kwargs)

Create a JSON object from the response.

xml()

Create an lxml-based XML DOM from the response. The tree will not have a root, so all queries need to be relative (i.e. start with a dot).
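
A short sketch of parsing responses in the different formats (URLs are placeholders):

    # HTML: queries must be relative, i.e. start with a dot.
    doc = scraper.get('http://example.com/').html()
    headings = doc.findall('.//h1')

    # JSON: parse the response body into Python data structures.
    data = scraper.get('http://example.com/api/items.json').json()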

class scrapekit.http.ScraperSession

Sub-class of the requests session, used to introduce additional state into sessions and responses.

scrapekit.http.make_session(scraper)

Instantiate a session with the desired configuration parameters, including the cache policy.

Exceptions and Errors

exception scrapekit.exc.DependencyException

Triggered when an operation would require the installation of further dependencies.

exception scrapekit.exc.ParseException

Triggered when parsing an HTTP response into the desired format (e.g. an HTML DOM, or JSON) is not possible.
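
For instance, a scraper may want to skip responses that cannot be parsed (a sketch; the URL is a placeholder):

    from scrapekit.exc import ParseException

    try:
        data = scraper.get('http://example.com/api/items.json').json()
    except ParseException:
        # The response body could not be parsed as JSON; skip this item.
        data = None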

exception scrapekit.exc.ScraperException

Generic scraper exception, the base for all other exceptions.

class scrapekit.exc.WrappedMixIn(wrapped)

Mix-in for wrapped exceptions.