Taking things apart can help you understand how they’re put together. Web scraping is one of my favorite ways to take apart a digital product in order to better understand just what it’s made of. Sure, the data one can uncover through web scraping is useful, and I’ve used it in many ways, including prototyping interactions, content audits, and general discovery, but the activity of configuring a web crawler forces one to examine the content on a given site with an eye towards its underlying structure and metadata. To put it another way, once you’ve done the work to crawl a website, you might not even need the data you uncover.1

There are some off-the-rack tools out there that crawl websites to help with content analysis, and some of them are quite good. I’ve heard good things about Blaze, and I know some folks who are pretty fond of CAT. Rolling your own web crawler, though, offers greater flexibility and functionality than the off-the-rack tools can. Content strategists tend not to count programming as a core competency, but one needn’t be an experienced programmer to get a web crawler working. A number of crawling frameworks exist that can help even novices build a custom crawler.

In this guide, we’ll use Scrapy, a Python-based web crawling and scraping framework, to build a web crawler, with an eye towards using it to automate some of the work involved in a content inventory. Scrapy allows users to define spiders using a minimal amount of code, while leaving the mechanics of actually sending the HTTP requests to the hidden parts of the framework. We’ll also cover getting your development environment set up on a Mac. (In many ways, getting everything installed is the trickiest part of all this.)

I’ve written this guide with content strategists (i.e. not professional programmers) in mind. As such, it goes into detail on a number of things that more experienced developers might read and say, “Duh.” This is exactly the point. Getting all this working for the first time can be a little tricky, especially for folks who don’t use the command line that much, but if you follow these instructions, things ought to work out.

This guide is divided into 3 sections:

  1. Environment Setup
  2. Your First Spider
  3. Extending Your Spider

1. Environment Setup

Scrapy has ample official installation documentation, and it’s definitely worth reading through. However, Scrapy’s documentation doesn’t cover all the steps necessary from the point of view of someone who doesn’t already have a development environment set up. (If you already work with Python, this’ll be a cakewalk, and if you don’t, installation should still be manageable.) Nor does it cover just the things one needs to know to get up and running, because it isn’t written with a specific kind of project in mind. This guide is specifically focused on using Scrapy to generate CSV files that can be used as a basis for content inventory and audit activities. As such, it doesn’t concern itself with a lot of the features and functionality that Scrapy’s more comprehensive documentation does.

I also cover some of the specific pitfalls I’ve encountered (or seen others encounter). This isn’t the only way to do this, but it’s worked for me, and I’ve successfully used a similar guide to help colleagues get set up for crawling.

Tools

Scrapy requires Python 2.7, and Macs come with Python preinstalled. You can check your version with python --version, and if you want, you can use this installation to run Scrapy.

However, you should really install your own distribution of Python. This way you can ensure that you’ve got the latest version, and you can also make sure that anything else you install doesn’t interfere with what’s installed by default. The installation process is also a good way of getting comfortable with your environment. The rest of this section covers getting Python and some other necessary tools installed. (This guide owes a lot to this guide to installing scientific Python.)

Xcode

While you won’t actually be using Xcode, Xcode is a dependency of some of the software you will be using. It’s also a free download from Apple Developer. If you don’t already have an Apple Developer account, you’ll have to create one, but it’s tied to your Apple/iTunes ID, so it’s a quick registration. Xcode is a pretty big download (a few gigabytes), so it may take a while. Also important are the Xcode Command Line Tools. If you’re running OS X 10.9 or later, you’ll have to get these from Apple Developer, too.
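
Depending on your setup, you may also be able to trigger the Command Line Tools installation from the terminal rather than downloading the package by hand; this command usually pops up a small installer:

xcode-select --install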

Homebrew

Homebrew is a Mac-specific package manager. It makes installing Python (and lots of other things) much easier. It’s worth installing Homebrew first. To do so, paste this into your shell:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Once Homebrew is installed, you should be ready to install Python. However, you should also make sure there’s nothing else you need to do by running

brew doctor

and reading the results. brew doctor gives good information, and in clear language.

Installing Python

Since you’ve gotten Homebrew installed, installing Python will be a snap.

brew install python

This will install pip, another package manager (which you’ll also need). After the installation has run, if you run brew doctor you may see a line about changing your PATH variable; it should offer the following one-line command to fix it:

echo export PATH='/usr/local/bin:$PATH' >> ~/.bash_profile

After you’ve entered that command to edit your .bash_profile, enter which python. You should see:

/usr/local/bin/python

You can also run python --version and see, instead of the system-installed Python 2.7.5, a newer version (probably 2.7.8).

Installing Scrapy

Scrapy can be installed with a number of package managers; you’ll have pip from installing Python with Homebrew:

pip install Scrapy

If installation fails, you may need to add sudo to the beginning of the command (sudo pip install Scrapy) and enter your password.

One common installation error involves not having lxml installed. If the errors you’re seeing mention lxml, you can just run

pip install lxml

If Scrapy installed correctly, you should be able to check with which scrapy to see the path:

/usr/local/bin/scrapy

If all this has worked, you’re ready to write your first spider.

2. Your First Spider

Once you’ve successfully gotten Scrapy installed, you’ve accomplished the hardest part. Now all that’s left to do is to create the actual spider. Scrapy will create most of the files you’ll need; some boilerplate can help with the rest. Before getting to that, let’s discuss what we want our output to look like, so we can gather requirements for the spider.

This exercise is about creating a scaffolding for a content inventory. Generally, a content inventory takes the form of a spreadsheet, with individual rows for elements of content, and columns for different aspects of each element. With that in mind, we’ll plan on making Scrapy output a CSV file that we can easily import into Excel. (For other uses, it’s easy to make Scrapy output JSON.)

For this example, we’ll use the Python 2.7 tutorial at Python.org. A quick note on crawling ethics and cautions is probably in order before we go any further. Turning a crawler loose on a site to read all of its content uses server resources, and may be considered a violation of the site’s terms of service or, if applicable, any license agreement you’ve agreed to. You probably won’t run into any trouble if you crawl judiciously, but you never know. Because we generally use crawlers to get information from clients’ sites, you should be in the clear, but keep these kinds of things in mind, especially if you’re not working on a site for a client. You may also want to familiarize yourself with the robots.txt file and its use. While it seems that the robots.txt file isn’t legally binding, it’s best not to be a jerk. (You could easily get your IP address blacklisted, even if you don’t suffer any legal consequences.)
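
One concrete way to be a good citizen is to slow your crawler down and have it respect robots.txt. Both are standard Scrapy settings; the values below are just a reasonable starting point, not something this example strictly requires. They go in settings.py, which we’ll set up in a moment:

ROBOTSTXT_OBEY = True    # skip anything the site's robots.txt disallows
DOWNLOAD_DELAY = 2       # wait a couple of seconds between requests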

Create Project

Using the shell, navigate to the directory where you want to keep your project files, and use the command

scrapy startproject pythonsite

to create a project named pythonsite. (You can choose to call your project whatever you want.) Scrapy will create a project directory containing most of the necessary files and tell you that you can generate a spider if you like. I’ve found it’s usually easier to just make the spider myself, so that’s what I suggest in this guide.

Go ahead and cd into the project directory. In the directory, you’ll find the following files:

.
├── pythonsite
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

One thing to note: yes, that’s a second directory called pythonsite inside the first directory. The outer directory holds the project-level configuration (scrapy.cfg), while the inner one is the Python package where your code lives. Just know that this is normal, and don’t worry about it. From here, you can start making some changes to the various files to set up your spider.

settings.py

Settings is where you specify the output file. For the purpose of getting this all into a spreadsheet, the easiest format to work with is a CSV. To specify this, open settings.py and add the lines

FEED_FORMAT="csv"
FEED_URI="file:output.csv"

This will put the results from your crawl into a CSV named output.csv. If you like, you can go ahead and change that name to be something more specific to your project, but I tend to just go with output.csv for all my crawlers. The choice is yours.
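
If you ever want JSON instead (as mentioned above), the same pair of settings handles it; a minimal sketch:

FEED_FORMAT="json"
FEED_URI="file:output.json"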

items.py

Items is where you define the fields that will hold the items you scrape from each page. Scrapy handles these items much like a Python dictionary (i.e. using key/value pairs). In items.py, you just need to give the field names. It’s best to use names that you’ll understand here. For example, if you wanted to capture the url and title of each page, you might replace the Item object in items.py with:

class PythonsiteItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

Note that you can delete the pass from the last line of the Item object. Again, the Item object is just creating a place to put this data; it’s not defining how the spider captures the information.

Creating a Spider

Scrapy doesn’t create a spider for you unless you use the genspider command. As I said before, though, it’s probably easier to make your own. To do that, create a file inside the spiders directory and give it a name. The file should not have the same name as the project, because the spider module can conflict with the project package when Scrapy imports it, so if your project is called pythonsite, don’t call your spider pythonsite.py. The name doesn’t really matter beyond that, since you won’t be typing it again, so something like pyspider.py will work.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from pythonsite.items import PythonsiteItem

class PythonsiteSpider(CrawlSpider):
    name = "pycrawler"  # the name you'll use to run the spider from the shell
    start_urls = ['https://docs.python.org/2/index.html']

    # follow links matching the allow pattern, and hand each fetched page to parse_item
    rules = (
        Rule(
            LinkExtractor(deny=(), allow=('docs.python.org/2/tutorial/',), restrict_xpaths=('*',)),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        sel = Selector(response)    # wrap the response so we can run selectors against it
        i = PythonsiteItem()        # one item per page; becomes one row in the CSV
        i['url'] = response.url
        i['title'] = sel.xpath('/html/head/title/text()').extract()
        return i

There are a lot of things going on in the sample spider code provided here. Let’s walk through them bit by bit.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

These two lines specify that the file should be run with Python and that it uses Unicode/UTF-8 encoding for the text. They just need to be included.

import scrapy
from scrapy.selector import Selector
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from pythonsite.items import PythonsiteItem

You must import the classes your spider needs to run. The example uses:

  • Selector, which is used to define which HTML elements to select.
  • CrawlSpider, the class of spider that will crawl from page to page, and Rule, which allows one to set rules about which links get followed or ignored. (A simpler spider, scrapy.Spider, does not programmatically crawl multiple pages, but can be useful if you just want to scrape elements from urls you already know about.)
  • LinkExtractor, which the rules use to find the links on each page.
  • PythonsiteItem, the item defined in items.py. This last import has to be changed to reflect your own project name and item name.

Next, let’s look at our spider:

class PythonsiteSpider(CrawlSpider):
    name = "pycrawler"
    start_urls = ['https://docs.python.org/2/index.html']
    
    rules = (
        Rule(
            LinkExtractor(deny=(), allow=('docs.python.org/2/tutorial/',), restrict_xpaths=('*',)),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = PythonsiteItem()
        i['url'] = response.url
        i['title'] = sel.xpath('/html/head/title/text()').extract()
        return i

Here, we’ve extended the CrawlSpider class and given it a name, pycrawler (which we’ll use to run the spider from the shell). We’ve also defined a rule for this spider. The allow pattern in a rule is treated as a regular expression; here it limits the spider to urls containing docs.python.org/2/tutorial/. Rules are powerful, especially if you choose to write them with RegEx, but there’s no need for anything fancy in this example. What is important is that your rule specifies the callback method (here, parse_item) and whether to follow links on each page. When a rule has a callback, follow defaults to false, so we set follow=True explicitly to keep the spider moving from page to page.
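
To give a sense of what a slightly fancier rule might look like, here’s a sketch that stays inside the tutorial but skips anything with genindex in the url or ending in .pdf; the deny patterns are purely illustrative, not paths I know to exist on the site:

    rules = (
        Rule(
            LinkExtractor(
                allow=(r'docs\.python\.org/2/tutorial/',),
                deny=(r'genindex', r'\.pdf$'),
            ),
            callback='parse_item',
            follow=True,
        ),
    )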

def parse_item(self, response):
    sel = Selector(response)
    i = PythonsiteItem()
    i['url'] = response.url
    i['title'] = sel.xpath('/html/head/title/text()').extract()
    return i

Finally, we have our parse_item method, which will be called for each page the spider fetches. There, we wrap the response in a Selector and assign it to a variable, sel, to make it easier to call later, create an instance of our PythonsiteItem object and name it i, and then set about assigning values to the various fields of the item. Notably, you’ll see that we get the title by calling an xpath selector on our sel object. Then, on the last line, we return the item. That’s it!
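
As a small preview of the next section, here’s how you might pull one more piece of data, the page’s meta description, into the item; this assumes you’ve also added a description field to items.py:

    # inside parse_item, alongside url and title
    i['description'] = sel.xpath('//meta[@name="description"]/@content').extract()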

Using Your Spider

While still in the project directory in the shell, use the command

scrapy crawl pycrawler

This will start the spider and log its progress to the terminal. Let it run. If you’ve defined everything properly, it will finish in a few seconds. Open up that CSV and look at it: you should find one column with urls and one column with page titles. You’ve successfully created your first spider.

3. Extending Your Spider

Understanding Selectors

Even just getting a comprehensive list of urls and titles can be helpful, but it’s easy to extend your spider. All you need to do is add additional fields to your item in items.py and specify the elements to capture in your parse_item method. Maybe you want to capture a list of the PDF files linked from a given page: you can do that with an xpath along the lines of sel.xpath('//a[contains(@href, ".pdf")]/@href').extract(). Maybe you want to capture the text in h1 tags: you can get that with sel.css('h1::text').extract(). In almost all cases, you’ll be using selectors.
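
Put together, those two changes might look something like this; the pdfs and h1 field names are just my own choices:

class PythonsiteItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    pdfs = scrapy.Field()   # links to PDF files found on the page
    h1 = scrapy.Field()     # text of the page's h1 tags

and, in parse_item:

    i['pdfs'] = sel.xpath('//a[contains(@href, ".pdf")]/@href').extract()
    i['h1'] = sel.css('h1::text').extract()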

Selectors are pretty intuitive to work with, especially if you’ve ever used CSS. However, there are a few tools that are helpful in understanding how selectors work.

The first, Selector Gadget, is a Chrome bookmarklet and plug-in that allows you to get the xpath or css selector value of any on-screen item by pointing and clicking. It’s a great thing to have in your toolbox.

The second tool is the Scrapy shell. It allows one to load a single web page, and then to try multiple selectors. It’s a great way to build your crawler, and I recommend using it to get a sense of the composition of pages before running your spider.
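
To open the Scrapy shell against a single page, pass it a url:

scrapy shell 'https://docs.python.org/2/tutorial/index.html'

Once the prompt opens, you can try selectors against the loaded page; depending on your version of Scrapy, you’ll have a ready-made sel object (as in our spider), or you can call response.xpath and response.css directly:

>>> sel.xpath('/html/head/title/text()').extract()
>>> sel.css('h1::text').extract()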

Other Uses for Your Spider

Once you realize you can capture whatever on-page data you want, opportunities to make use of this data often seem to present themselves. One thing that I’ve found is particularly useful is capturing some existing data elements to use in the prototyping process. Typically, the pieces that are most useful are:

  • Product information, in ecommerce contexts
  • Profile information, when the product involves a people search
  • All metadata that might be used in faceted search or browse. Obviously, what’s available will be specific to a given site.
  • All body text, which can then be analyzed with natural language processing tools (e.g. NLTK; see the sketch after this list).
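
To give a flavor of that last item, here’s a minimal sketch of counting up the vocabulary on a page with NLTK, assuming the scraped body text is sitting in a Python string called body_text (you’ll need to run nltk.download('punkt') once before word_tokenize will work):

import nltk

tokens = nltk.word_tokenize(body_text)
freq = nltk.FreqDist(word.lower() for word in tokens if word.isalpha())
print freq.most_common(20)   # the twenty most frequent words on the page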

Having this information, designers are in a better position to prototype interactions which make use of it. Because their prototypes will be more realistic, they can yield more valuable results when tested with users.

Caveats and Extensions

Out of the box, Scrapy doesn’t capture anything rendered with JavaScript; it just captures what’s in the HTML the server sends. If the pages you need to scrape make extensive use of AJAX or other JavaScript, you can use Selenium and PhantomJS to prerender each page and then substitute the rendered result for the response.
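
One way to wire that up is with a downloader middleware that renders each page in PhantomJS before Scrapy sees it. Here’s a rough sketch, assuming you’ve run pip install selenium and installed PhantomJS; the file name (middlewares.py) and class name are my own choices, and spinning up a browser for every request is slow, so treat this as a starting point rather than a finished solution:

# pythonsite/middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver

class PhantomJSMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        driver.get(request.url)       # let PhantomJS execute the page's JavaScript
        body = driver.page_source
        driver.quit()
        # hand Scrapy an ordinary response built from the rendered markup
        return HtmlResponse(request.url, body=body.encode('utf-8'), encoding='utf-8', request=request)

You’d then enable it in settings.py with something like:

DOWNLOADER_MIDDLEWARES = {'pythonsite.middlewares.PhantomJSMiddleware': 543}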

For complicated text processing, you may also find Beautiful Soup a little more versatile than Scrapy’s built-in selectors. I generally don’t integrate Beautiful Soup with my crawlers unless I need to, but it’s a great option to have.
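
For instance, inside parse_item you could hand the raw page to Beautiful Soup and grab all of the readable text in one go; a sketch, assuming you’ve run pip install beautifulsoup4 and added a body_text field to items.py:

from bs4 import BeautifulSoup

# then, inside parse_item:
soup = BeautifulSoup(response.body)
i['body_text'] = soup.get_text()   # the page's text, with the markup stripped out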

  1. I’ve found the converse to be true, too: if you just have the data from a crawl, it probably won’t help all that much unless you did the legwork to customize the crawler. In such cases, you’ve got an answer, but you don’t know what the question is. This is why I tend to derive little value from technically capable off-the-rack solutions like Screaming Frog or CAT, both of which work really well but leave me a little light on insights.