
No More Walls: The Screen Scrapers’ Manifesto

If you think about it, Web 1.0 was all about walls.

Everybody had a site, and sites rarely shared data or content. That really wasn’t the intent. All you had in terms of tooling was hyperlinks.

Enabling technologies like RSS and AJAX gave rise to the Web 2.0 era and the dynamic changed.

Sites were no longer isolated silos of content; they could be connected and mashed up to provide additional value, and even delight.

But the dawn of Web 2.0 was the early 2000s. Here we are in the early 2010s and many sites still haven’t embraced the open notion of Web 2.0.

We Need More Services

Few sites have the kind of service-oriented architecture that empowers openness and enables mashup-ability.

Sure, the big-name sites have this kind of architecture. Facebook, Twitter, Flickr, Amazon, etc. all have callable APIs. That’s why you’ll find their content shared across the web and in native apps for various platforms.

Just think about it: there are Twitter clients for every conceivable device: even the Commodore 64!

Devices <3 APIs

The bottom line is we live in a world of devices and devices love programmatic APIs.

But for every API-accessible site, there are hundreds that aren’t.

If your site doesn’t have an API, you run the risk of being left behind as the internet becomes an internet of things.

Most sites with frequently updated content at least have RSS feeds, which goes a long way.

But what happens when a site exposes the most recent 25 items and you want to access item #26?
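To make the limitation concrete, here’s a minimal sketch (in JavaScript, since a JS version of the scraping library is mentioned below) of pulling item titles out of an RSS document with a regular expression. The feed XML is a made-up stand-in, not any real site’s feed; a real one would be fetched over HTTP:

```javascript
// A stand-in RSS feed. Note the site decides how many <item>s to expose;
// anything older than the cutoff simply isn't in the document.
const feed = `
<rss version="2.0"><channel>
  <item><title>Post 25</title><link>http://example.com/25</link></item>
  <item><title>Post 24</title><link>http://example.com/24</link></item>
</channel></rss>`;

// Pull every <title> inside an <item>. Fine for a sketch; a real
// parser should also handle CDATA sections and XML entities.
const titles = [...feed.matchAll(/<item>.*?<title>(.*?)<\/title>/gs)]
  .map(m => m[1]);

console.log(titles);
```

However deep the site’s archive goes, `titles` only ever contains what the feed chose to expose — which is exactly the wall being described.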

Think of all that content that’s locked away behind walls.

Ten years ago, that sort of thing was acceptable. Today, not so much.

So, what do we do?

Tear Down the Walls!

The best thing to do, if you’re a site owner, is to tear down the walls you’ve inadvertently put up.

If you’re a consumer of a site, you have two options: ask the site owner to join us in the 21st century, or screen scrape the content.

I faced this problem with the Texts From Last Night Windows Phone App. As far as I could tell, the site had no explicit API for getting at the site’s content.

I had to roll up my sleeves, dig into the HTML, and write my own screen scraper.

That got me thinking about better ways to scrape data, package it, make it available, and wrap it with a bow.

I wrote a library and have since re-used it and refined it, so that I can pull down data from just about any target site quickly and easily.
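The library itself hasn’t been published yet, so none of what follows is its real API — just a generic sketch of the core idea: treat the page’s own markup as the API, find the repeating container, and capture the piece you care about. The class name and HTML here are invented for illustration:

```javascript
// Hypothetical page markup; in practice this string would come from
// an HTTP GET of the target site.
const html = `
<div class="entry"><p>Text one.</p></div>
<div class="entry"><p>Text two.</p></div>`;

// Scrape the inner text of every repeating "entry" block.
// A regex is enough for a sketch; production scrapers should use a
// forgiving HTML parser, since real-world markup is rarely this clean.
function scrapeEntries(page) {
  return [...page.matchAll(/<div class="entry"><p>(.*?)<\/p><\/div>/gs)]
    .map(m => m[1]);
}

console.log(scrapeEntries(html));
```

The fragility is the trade-off: when the site owner changes the markup, the “API” you built breaks — one more reason a real, versioned API beats scraping for everyone involved.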

Texts From Last Night took weeks to create, but DreamDictionary took about 2 hours from start to finish.

The library is written in C# and will be available soon on CodePlex.

A JavaScript version is nearly complete.

New Risks. New Opportunities

Content owners who don’t create an API miss out on a lot of opportunities, from building brand awareness to monetizing their API.

Sure, there are risks in exposing your content in easily consumable chunks, but the upside is huge. And here’s the bad news: once your material is on the public web, you’ve already lost control over it.

If control is your top concern, you’ve already lost that fight.

If you *really* want to protect your IP, protect it explicitly. Just don’t assume no RSS and no API equals security. You won’t keep everyone out, but you’ll keep most people out.

Reassess the situation and reframe it to make it work for you in the internet of things.

Think about it: Twitter now has broad reach and mindshare, in large part due to the ease of use of its Web API.

The Message

So, here’s my message to content owners and developers:

If you don’t provide us developers with an API, we will make one for ourselves.

To the developers, the web is your oyster: go forth and fetch data!

