Problem: Discovery activities like content inventories and content audits can take up a lot of time without delivering actionable insights.

Insight: Turning content into data makes it possible to analyze it.

Content audits are traditionally one of the most tedious deliverables in the UX toolkit to create. Perhaps most troublingly, the insights an audit will deliver often aren’t clear until a significant amount of time and effort has already been spent: until one knows what content exists, it’s impossible to begin evaluating that content.

To expedite and improve content discovery activities, I developed Python-based web crawlers. Adapting my technology stack to the content in question allowed me to quickly capture the elements germane to the project at hand, and to isolate details like third-party metadata, available file downloads, and site-specific CSS usage.
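As a rough illustration of the kind of extraction described above, here is a minimal sketch that uses only the Python standard library. It is not the crawler code used on these projects: the names PageInventory, inventory_page, and DOWNLOAD_EXTENSIONS are hypothetical, the list of download extensions is an assumption, and fetching pages and following links are left out so that the parsing step stays in focus.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: which file extensions count as "file downloads".
DOWNLOAD_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".zip")


class PageInventory(HTMLParser):
    """Collects metadata, download links, and CSS class usage from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.meta = {}
        self.downloads = []
        self.css_classes = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Tally every CSS class name that appears on the page.
        for name in (attrs.get("class") or "").split():
            self.css_classes[name] += 1
        # Record <meta name=...> and <meta property=...> tags (e.g. Open Graph).
        if tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.meta[key] = attrs.get("content") or ""
        # Record links whose path ends in one of the download extensions.
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            if urlparse(url).path.lower().endswith(DOWNLOAD_EXTENSIONS):
                self.downloads.append(url)


def inventory_page(html, base_url):
    """Parse one page of HTML and return a plain-dict record for later analysis."""
    parser = PageInventory(base_url)
    parser.feed(html)
    return {
        "url": base_url,
        "meta": parser.meta,
        "downloads": parser.downloads,
        "css_classes": dict(parser.css_classes),
    }
```

Calling inventory_page(html, url) on each fetched page yields one record per page, and those records can be written out as rows in a spreadsheet or CSV file to form the inventory.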

Once the relevant data has been collected, it’s a straightforward task to derive insights that can help shape conversations.
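To make that concrete, the sketch below shows one simple analysis: given a list of crawled URLs, it computes the share of content in each file format. The function names format_of and format_shares are hypothetical, the sample URLs are invented for illustration, and treating extensionless URLs as HTML pages is an assumption that would need checking against a real site.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def format_of(url):
    """Return the lowercase file extension of a URL's path.

    Assumption: a URL with no extension is an ordinary HTML page.
    """
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix.lstrip(".") or "html"


def format_shares(urls):
    """Map each format to its fraction of the crawled URLs."""
    counts = Counter(format_of(url) for url in urls)
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()} if total else {}


# Invented sample data, for illustration only.
crawled = [
    "https://example.com/about",
    "https://example.com/files/brochure.pdf",
    "https://example.com/files/Report.PDF",
    "https://example.com/products/",
]
shares = format_shares(crawled)
print(f"PDF share: {shares.get('pdf', 0):.0%}")
```

A breakdown like this is quick to produce once the inventory exists, and a single figure from it can be enough to start a conversation about how content is published.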

While I used these techniques whenever relevant, particular wins included:

  • Helping a client discover that more than 60% of the content on their corporate website was locked away in PDFs rather than published in a native format.

  • Extending the above techniques to automate the bulk of a major content migration and replatforming, reducing the time required for data entry by over 80%.

Following the success of these techniques, I conducted a series of internal workshops to teach colleagues how to craft custom web crawlers. I later gave a talk building on these activities at the 2016 Intelligent Content Conference.