scrapekit: get the data you need, fast.¶
Many web sites expose a great amount of data, and scraping it can help you build useful tools, services and analysis on top of that data. This can often be done with a simple Python script, using only a few external libraries.
As your script grows, however, you will want to add more advanced features, such as caching of the downloaded pages, multi-threading to fetch many pieces of content at once, and logging to get a clear sense of which data failed to parse.
Scrapekit provides a set of tools that help with these tasks, while also offering simple ways to structure your scraper. This helps you produce fast, reliable and well-structured scraper scripts.
Below is a simple scraper for postings on Craigslist. This will use multiple threads and request caching by default.
    import scrapekit
    from urlparse import urljoin

    scraper = scrapekit.Scraper('craigslist-sf-boats')

    @scraper.task
    def scrape_listing(url):
        doc = scraper.get(url).html()
        print(doc.find('.//h2[@class="postingtitle"]').text_content())

    @scraper.task
    def scrape_index(url):
        doc = scraper.get(url).html()
        for listing in doc.findall('.//a[@class="hdrlnk"]'):
            listing_url = urljoin(url, listing.get('href'))
            scrape_listing.queue(listing_url)

    scrape_index.run('https://sfbay.craigslist.org/boo/')
By default, the scraper saves its cache data to a folder in the working directory.
Upon completion, the scraper will also generate an HTML report that presents information about each task run within the scraper.
This behaviour can be disabled by passing report=False to the constructor of Scraper.
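For instance, a scraper created without report generation (assuming the report keyword described above) looks like this:

```python
import scrapekit

# Disable the HTML run report, as described above; caching and
# multi-threading behaviour are unaffected by this setting.
scraper = scrapekit.Scraper('craigslist-sf-boats', report=False)
```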
- Installation Guide
- Using tasks
- Utility functions
- API documentation
scrapekit is written and maintained by Friedrich Lindenberg. It was developed as an outcome of scraping projects
for the African Network of Centers for Investigative Reporting (ANCIR), supported by a Knight International
Journalism Fellowship from the International
Center for Journalists (ICFJ).