scrapekit: get the data you need, fast.

Many websites expose large amounts of data, and scraping it lets you build useful tools, services and analyses on top of it. This can often be done with a simple Python script that uses few external libraries.

As your script grows, however, you will want to add more advanced features, such as caching of the downloaded pages, multi-threading to fetch many pieces of content at once, and logging to get a clear sense of which data failed to parse.

Scrapekit provides a set of tools that help with these tasks, while also offering simple ways to structure your scraper. This helps you produce fast, reliable and well-structured scraper scripts.
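To make these three features concrete, here is a minimal standard-library sketch of what a hand-rolled scraper might otherwise have to contain: an in-memory cache, a thread pool, and logging of failures. It is not scrapekit's implementation. The names CachedFetcher, scrape_all, fake_fetch and parse_title are made up for illustration, and fake_fetch stands in for a real HTTP download so that the sketch runs offline.

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("scraper-sketch")


class CachedFetcher:
    """Keep page bodies in memory so a URL already fetched is not fetched again."""

    def __init__(self, fetch):
        self.fetch = fetch  # hypothetical callable: url -> page text
        self.cache = {}
        self.lock = threading.Lock()

    def get(self, url):
        with self.lock:
            if url in self.cache:
                return self.cache[url]
        body = self.fetch(url)  # download outside the lock so threads can overlap
        with self.lock:
            self.cache[url] = body
        return body


def scrape_all(urls, fetcher, parse, workers=4):
    """Fetch and parse URLs on a thread pool, logging failures instead of stopping."""

    def work(url):
        try:
            return parse(fetcher.get(url))
        except Exception:
            log.exception("failed to scrape %s", url)
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(work, urls))


def fake_fetch(url):
    # Stand-in for a real download; URLs ending in /broken simulate an error.
    if url.endswith("/broken"):
        raise IOError("simulated network error")
    return "<h2>%s</h2>" % url.rsplit("/", 1)[-1]


def parse_title(page):
    # Strip the surrounding <h2> tags produced by fake_fetch.
    return page[len("<h2>"):-len("</h2>")]


fetcher = CachedFetcher(fake_fetch)
results = scrape_all(
    ["http://example.com/a", "http://example.com/broken", "http://example.com/b"],
    fetcher,
    parse_title,
)
print(results)
```

The failing URL is logged with its traceback and yields None, while the other pages are still processed. Only successful downloads end up in the cache.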

Example

Below is a simple scraper for postings on Craigslist. This will use multiple threads and request caching by default.

import scrapekit
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

scraper = scrapekit.Scraper('craigslist-sf-boats')

@scraper.task
def scrape_listing(url):
    doc = scraper.get(url).html()
    print(doc.find('.//h2[@class="postingtitle"]').text_content())


@scraper.task
def scrape_index(url):
    doc = scraper.get(url).html()

    for listing in doc.findall('.//a[@class="hdrlnk"]'):
        listing_url = urljoin(url, listing.get('href'))
        scrape_listing.queue(listing_url)

scrape_index.run('https://sfbay.craigslist.org/boo/')

By default, this saves cache data to the working directory, in a folder called data.
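The index task in the example resolves each link's href against the index URL with urljoin before queueing it. The short sketch below shows how that resolution behaves, using the Python 3 location of the function (urllib.parse). The href values are invented for illustration and are not taken from a real Craigslist page.

```python
from urllib.parse import urljoin

index_url = 'https://sfbay.craigslist.org/boo/'

# Hypothetical href values, as they might appear in <a> tags on the index page.
absolute_path = urljoin(index_url, '/boo/1234.html')  # path starting at the site root
relative_path = urljoin(index_url, '5678.html')       # path relative to the index page
full_url = urljoin(index_url, 'https://example.com/other.html')  # already absolute

print(absolute_path)
print(relative_path)
print(full_url)
```

Because urljoin handles all three forms, the scraper does not need to check whether a listing link is relative or absolute.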

Reporting

Upon completion, the scraper will also generate an HTML report that presents information about each task run within the scraper.

Screenshot of an example report: http://cl.ly/image/1J2o2T43422e/Screen%20Shot%202014-08-26%20at%2015.58.03.png

This behaviour can be disabled by passing report=False to the constructor of the scraper.

Contributors

scrapekit is written and maintained by Friedrich Lindenberg. It was developed as an outcome of scraping projects for the African Network of Centers for Investigative Reporting (ANCIR), supported by a Knight International Journalism Fellowship from the International Center for Journalists (ICFJ).
