scrapekit: get the data you need, fast.
Many web sites expose a wealth of data, and scraping it can help you build useful tools, services and analyses on top of that data. This can often be done with a simple Python script and a few external libraries.
As your script grows, however, you will want to add more advanced features, such as caching of downloaded pages, multi-threading to fetch many pieces of content at once, and logging to get a clear sense of which data failed to parse.
Scrapekit provides a set of tools that help with these tasks, while also offering simple ways to structure your scraper. This helps you produce fast, reliable and well-structured scraper scripts.
Example
Below is a simple scraper for postings on Craigslist. This will use multiple threads and request caching by default.
import scrapekit
from urlparse import urljoin  # on Python 3: from urllib.parse import urljoin

scraper = scrapekit.Scraper('craigslist-sf-boats')

@scraper.task
def scrape_listing(url):
    # Fetch an individual posting and print its title.
    doc = scraper.get(url).html()
    print(doc.find('.//h2[@class="postingtitle"]').text_content())

@scraper.task
def scrape_index(url):
    # Fetch the index page and queue a scrape_listing task for each posting.
    doc = scraper.get(url).html()
    for listing in doc.findall('.//a[@class="hdrlnk"]'):
        listing_url = urljoin(url, listing.get('href'))
        scrape_listing.queue(listing_url)

scrape_index.run('https://sfbay.craigslist.org/boo/')
By default, this saves cache data to the working directory, in a folder called data.
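If you need the cache elsewhere, or want to tune threading and caching, settings can be supplied when the scraper is created. The snippet below is a minimal sketch only: the config keyword argument and the key names (data_path, threads, cache_policy) are assumptions about scrapekit's configuration mechanism, so check the configuration documentation for the exact names.

import scrapekit

# A hedged sketch; the keys below are assumed, not confirmed API.
scraper = scrapekit.Scraper('craigslist-sf-boats', config={
    'data_path': '/tmp/scrapekit-data',  # assumed: directory for cached responses
    'threads': 10,                       # assumed: size of the worker thread pool
    'cache_policy': 'http',              # assumed: honour HTTP caching headers
})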
Reporting
Upon completion, the scraper will also generate an HTML report that presents information about each task run within the scraper.

This behaviour can be disabled by passing report=False to the constructor of the scraper.
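For example, to run the scraper without generating an HTML report:

# Disable report generation for this scraper.
scraper = scrapekit.Scraper('craigslist-sf-boats', report=False)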
Contributors
scrapekit is written and maintained by Friedrich Lindenberg. It was developed as an outcome of scraping projects for the African Network of Centers for Investigative Reporting (ANCIR), supported by a Knight International Journalism Fellowship from the International Center for Journalists (ICFJ).