Quickstart¶
Welcome to the scrapekit quickstart tutorial. In the following section, I’ll show you how to write a simple scraper using the functions in scrapekit.
Like many people, I’ve had a lifelong, hidden desire to become a sailboat captain. To help me live the dream, we’ll start by scraping Craigslist boat sales in San Francisco.
Getting started¶
First, let’s make a simple Python module, e.g. in a file called scrape_boats.py.
import scrapekit
scraper = scrapekit.Scraper('craigslist-sf-boats')
The first thing we’ve done is to instantiate a scraper and give it a name. The name will later be used to configure the scraper and to read its log output. Next, let’s scrape our first page:
from urlparse import urljoin
@scraper.task
def scrape_index(url):
    doc = scraper.get(url).html()
    next_link = doc.find('.//a[@class="button next"]')
    if next_link is not None:
        # make an absolute url.
        next_url = urljoin(url, next_link.get('href'))
        scrape_index.queue(next_url)
scrape_index.run('https://sfbay.craigslist.org/boo/')
This code will cycle through all the pages of listings, as long as a Next link is present.
The key aspect of this snippet is the notion of a task. Each scrapekit scraper is broken up into many small tasks, ideally one for fetching each web page. Tasks are executed in parallel to speed up the scraper. To do that, task functions aren’t called directly; instead, they are placed on a queue (see scrape_index.queue above). Like normal functions, they can still receive arguments - in this case, the URL to be scraped.
At the end of the snippet, we’re calling scrape_index.run. Unlike a simple queueing operation, this will tell the scraper to queue a task and then wait for all tasks to be executed.
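To make the difference concrete, here is a minimal sketch contrasting the two calls; the scrape_page task and the .//title selector are made up purely for illustration and are not part of the scraper we are building:

@scraper.task
def scrape_page(url):
    # fetch one page and print its <title>; this runs on a worker thread.
    doc = scraper.get(url).html()
    print(doc.findtext('.//title'))

# queue() only schedules the task and returns immediately:
scrape_page.queue('https://sfbay.craigslist.org/boo/')

# run() schedules the task and then blocks until every queued task
# (including tasks queued by other tasks) has finished:
scrape_page.run('https://sfbay.craigslist.org/boo/')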
Scraping details¶
Now that we have a basic task to scrape the index of listings, we might want to download each listing’s page and get some data from it. To do this, we can extend our previous script:
import scrapekit
from urlparse import urljoin

scraper = scrapekit.Scraper('craigslist-sf-boats')

@scraper.task
def scrape_listing(url):
    doc = scraper.get(url).html()
    print(doc.find('.//h2[@class="postingtitle"]').text_content())

@scraper.task
def scrape_index(url):
    doc = scraper.get(url).html()
    for listing in doc.findall('.//a[@class="hdrlnk"]'):
        listing_url = urljoin(url, listing.get('href'))
        scrape_listing.queue(listing_url)

    next_link = doc.find('.//a[@class="button next"]')
    if next_link is not None:
        # make an absolute url.
        next_url = urljoin(url, next_link.get('href'))
        scrape_index.queue(next_url)

scrape_index.run('https://sfbay.craigslist.org/boo/')
This basic scraper could be extended to extract more information from each listing page, and to save that information to a set of files or to a database.
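As a rough sketch of such an extension, the version of scrape_listing below pulls the title and price from each page and appends them to a CSV file. The .//span[@class="price"] selector and the results.csv file name are assumptions made for illustration, not something guaranteed by Craigslist or scrapekit:

import csv

@scraper.task
def scrape_listing(url):
    doc = scraper.get(url).html()
    title = doc.find('.//h2[@class="postingtitle"]').text_content().strip()
    # the price selector is a guess about the listing markup; adjust as needed.
    price_el = doc.find('.//span[@class="price"]')
    price = price_el.text_content().strip() if price_el is not None else ''
    save_listing(url, title, price)

def save_listing(url, title, price):
    # tasks run on several threads, so a real scraper would guard this with a
    # lock or write to a database instead of a shared file; non-ASCII titles
    # may also need explicit encoding under Python 2.
    with open('results.csv', 'a') as fh:
        csv.writer(fh).writerow([url, title, price])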
Configuring the scraper¶
As you may have noticed, Craigslist is sometimes a bit slow. You might want to configure your scraper to cache responses, or to use a different number of simultaneous threads when retrieving data. The simplest way to set this up is through environment variables:
$ export SCRAPEKIT_CACHE_POLICY="http"
$ export SCRAPEKIT_DATA_PATH="data"
$ export SCRAPEKIT_THREADS=10
This will instruct scrapekit to cache requests according to the rules of HTTP (using headers like Cache-Control to determine what to cache and for how long), and to save downloaded data in a directory called data in the current working path. We’ve also instructed the tool to use 10 threads when scraping data.
If you want to make these decisions at run-time, you could also pass them into the constructor of your Scraper:
import scrapekit
config = {
    'threads': 10,
    'cache_policy': 'http',
    'data_path': 'data'
}
scraper = scrapekit.Scraper('demo', config=config)
For details on all available settings and their meaning, check out the configuration documentation.