Online Since 1995


Contents tagged with Screen Scraper

  • No More Walls: The Screen Scrapers' Manifesto

    Tags: Screen Scraper, Web 2.0


    If you think about it, Web 1.0 was all about walls.

    Everybody had a site, and sites rarely shared data or content. That really wasn’t the intent. All you had in terms of tooling was hyperlinks.

    Enabling technologies like RSS and AJAX gave rise to the Web 2.0 era and the dynamic changed.

    Sites were no longer isolated silos of content; they could be connected and mashed up to provide additional value and even delight.

    But the dawn of Web 2.0 was the early 2000s. Here we are in the early 2010s and many sites still haven’t embraced the open notion of Web 2.0.

    We Need More Services

    Few sites have an architectural style that empowers openness and enables mashup-ability.

    Sure, the big-name sites have this kind of architecture. Facebook, Twitter, Flickr, Amazon, etc. all have callable APIs. That’s why you’ll find their content shared across the web and in native apps for various platforms.

    Just think about it: there are Twitter clients for every conceivable device: even the Commodore 64!

    Devices <3 APIs

    The bottom line is we live in a world of devices and devices love programmatic APIs.

    But for every API-accessible site, there are hundreds that aren’t.

    If your site doesn’t have an API, you run the risk of being left behind as the internet becomes an internet of things.

    Most sites with frequently updated content at least have RSS feeds, which goes a long way.

    But what happens when a site exposes the most recent 25 items and you want to access item #26?

    Think of all that content that’s locked away behind walls.

    Ten years ago, that sort of thing was acceptable. Today, not so much.

    So, what do we do?

    Tear Down the Walls!

    The best thing to do, if you're a site owner, is to tear down the walls you've inadvertently put up.

    If you’re a consumer of a site, you have two options: ask the site owner to join us in the 21st century or screen scrape the content.

    I faced this problem with the Texts From Last Night Windows Phone App. As far as I could tell, the site had no explicit API for getting at the site’s content.

    I had to roll up my sleeves, dig into the HTML, and write my own screen scraper.

    That got me thinking about better ways to scrape data, package it, make it available, and wrap it with a bow.

    I wrote a library and have since re-used it and refined it, so that I can pull down data from just about any target site quickly and easily. 

    Texts From Last Night took weeks to create, but DreamDictionary took about 2 hours from start to finish.

    This library will be available soon on CodePlex and was written in C#.

    A JavaScript version is nearly complete.

    New Risks. New Opportunities

    Content owners who don’t create an API miss out on a lot of opportunities: from building brand awareness to monetization of their API.

    Sure, there are risks in exposing your content in easily consumable chunks, but the upside is huge. And here’s the bad news: once your material is on the public web, you’ve already lost control over it.

    If control is your top concern, you’ve already lost that fight.

    If you *really* want to protect your IP, protect it explicitly. Just don’t assume no RSS and no API equals security. You won’t keep everyone out, but you’ll keep most people out.

    Reassess the situation and reframe it to make it work for you in the internet of things.

    Think about it: Twitter now has broad reach and mindshare, in large part due to the ease of use of its Web API.

    The Message

    So, here’s my message to content owners and developers:

    If you don’t provide us developers with an API, we will make one for ourselves.

    To the developers, the web is your oyster: go forth and fetch data!


  • Update to Texts From Last Night WP7 App on the Way

    Tags: TFLN, Screen Scraper


    Yesterday, my Texts From Last Night app for Windows Phone stopped working when the folks at Texts From Last Night changed their HTML.

    I quickly found the problem and submitted a fix.

    If they were trying to deter me, they failed.

    More than likely, it was a design change. They added a new way to share an item on Facebook.

    If there were an API, there would have been no problem at all.

    Maybe they should join the rest of us in 2011 and expose a public API.

    Maybe they need to read the Screen Scrapers’ Manifesto.

    The new version 1.2 of the app will include this fix as well as the ability to save texts to favorites.

    UPDATE: Version 1.2 has been approved and is in the marketplace!


  • Introducing the Screen Scraping Utility Kit

    Tags: Screen Scraper, MSDN


    Yesterday, I posted the Screen Scraping Utility Kit to MSDN.

    Here’s why you’ll want to take a look at it:

    1. You deal with unstructured data.
    2. You have legacy data stored in HTML and want to normalize it or put it into objects.
    3. You want to create an app for a web site and you (or they) don’t have an API.

    Late last year, I posted the Screen Scrapers’ Manifesto, which was meant to be a teaser for this very utility kit.

    This is how the utility kit works: it provides a base class that wraps up the WebClient class and an easier means to parse values out of text.

    There’s also a way to offset the risks inherent to screen scraping: changes in the target site.

    In this first post, let’s talk about text processing utilities.

    All this magic happens in the ExtensionMethods class in the Utils directory, in the FindElement and FindElements methods.

    As you can see, the code below isn’t too complex, but it saves you a lot of work.

            // Convenience overload for callers that don't need the end position.
            public static string FindElement(this string sourceString, string startDelimeter, List<string> endDelimeters)
            {
                int endPoint;
                return FindElement(sourceString, startDelimeter, endDelimeters, out endPoint);
            }

            // Returns the text between startDelimeter and the end delimiter located by
            // FindEndElement (not shown in this post). endPoint receives the index of
            // that end delimiter, or -1 when startDelimeter is not found.
            public static string FindElement(this string sourceString, string startDelimeter, List<string> endDelimeters, out int endPoint)
            {
                var startPoint = sourceString.IndexOf(startDelimeter);

                if (startPoint == -1)
                {
                    endPoint = -1;
                    return null;
                }

                // Skip past the start delimiter so it is not part of the result.
                var adjustedstartPoint = startPoint + startDelimeter.Length;
                endPoint = FindEndElement(sourceString, endDelimeters, adjustedstartPoint);

                var length = (endPoint - adjustedstartPoint);
                var subString = sourceString.Substring(adjustedstartPoint, length);

                return subString;
            }
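
    FindEndElement is not shown in the listing above. As a rough sketch of what the pair might do, here is the same idea in JavaScript. The helper findEndElement and its behavior (return the earliest position at which any end delimiter occurs, or -1 if none does) are my assumptions for illustration, not the kit's actual code, and the sample HTML is made up.

```javascript
// Hypothetical helper: earliest index (at or after fromIndex) at which any of
// the end delimiters occurs, or -1 if none is found. Assumed behavior.
function findEndElement(source, endDelimiters, fromIndex) {
    let best = -1;
    for (const delimiter of endDelimiters) {
        const index = source.indexOf(delimiter, fromIndex);
        if (index !== -1 && (best === -1 || index < best)) {
            best = index;
        }
    }
    return best;
}

// Returns the text between startDelimiter and the nearest end delimiter,
// or null when either one is missing.
function findElement(source, startDelimiter, endDelimiters) {
    const startPoint = source.indexOf(startDelimiter);
    if (startPoint === -1) {
        return null;
    }
    // Skip past the start delimiter so it is not part of the result.
    const adjustedStartPoint = startPoint + startDelimiter.length;
    const endPoint = findEndElement(source, endDelimiters, adjustedStartPoint);
    if (endPoint === -1) {
        return null; // guard: avoids taking a substring with no end
    }
    return source.substring(adjustedStartPoint, endPoint);
}

// Made-up sample page fragment.
const html = "<h2>Events</h2><ul><li>First</li></ul><h2>Births</h2>";
console.log(findElement(html, "<h2>Events</h2>", ["<h2>", "<h3>"]));
// prints "<ul><li>First</li></ul>"
```

    Returning null when no end delimiter is found is a design choice in this sketch: it lets the caller treat "no start" and "no end" the same way.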
     

     

    If you’re a seasoned developer, you have no doubt already written code to pull values out of strings. Think of all the code you’ve had to write for that. Now look at this bit of code, which uses the extension methods to keep things compact.

    var eventsToday = this.RawText.FindElement("Events</span></h2>", new List<string>() { "<h2>" });

    This takes the HTML from a typical “events that occurred on this day” page and pulls out only the events portion of the HTML. To split out the individual events, we’ll use the other method, FindElements:

            // Returns every substring found between startDelimeter and an end delimiter.
            public static List<string> FindElements(this string sourceString, string startDelimeter, List<string> endDelimeters)
            {
                List<string> returnList = new List<string>();

                int currentPoint = 0;
                int endPoint;

                string workingString = sourceString;

                while (workingString.Length > 1)
                {
                    // Drop everything before the end of the previous match.
                    workingString = workingString.Substring(currentPoint);

                    if (workingString.Length == 1)
                    {
                        break;
                    }

                    string element = workingString.FindElement(startDelimeter, endDelimeters, out endPoint);

                    // No further start delimiter: we are done.
                    if (element == null)
                    {
                        break;
                    }

                    // endPoint is relative to workingString, which is re-sliced on the next pass.
                    currentPoint = endPoint;
                    returnList.Add(element);
                }
                return returnList;
            }

     

    This means you can extract each of the events with one line of code:

    var events = eventsToday.FindElements("<li>", new List<string>() { "</li>" });

     

    The line of code above breaks apart the items on the events list into a List of type string. From there, you can slice and dice the data more.
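
    For comparison, here is how the same splitting step might look in JavaScript. This is only a sketch under my own assumptions: it is not the kit's code, it walks the string with a single cursor instead of re-slicing it on every pass, it stops when no end delimiter is found, and the sample list is made up.

```javascript
// Collects every substring found between startDelimiter and the nearest of
// the endDelimiters, scanning left to right with a single cursor.
function findElements(source, startDelimiter, endDelimiters) {
    const results = [];
    if (startDelimiter.length === 0) {
        return results; // an empty start delimiter might never advance the cursor
    }
    let cursor = 0;
    while (cursor < source.length) {
        const startPoint = source.indexOf(startDelimiter, cursor);
        if (startPoint === -1) {
            break; // no more elements
        }
        const contentStart = startPoint + startDelimiter.length;

        // Find the nearest end delimiter after the start of the content.
        let endPoint = -1;
        for (const delimiter of endDelimiters) {
            const index = source.indexOf(delimiter, contentStart);
            if (index !== -1 && (endPoint === -1 || index < endPoint)) {
                endPoint = index;
            }
        }
        if (endPoint === -1) {
            break; // unterminated element: stop rather than guess
        }

        results.push(source.substring(contentStart, endPoint));
        cursor = endPoint; // resume at the end delimiter, as the C# version does
    }
    return results;
}

// Made-up sample list.
const eventsHtml = "<ul><li>first event</li><li>second event</li></ul>";
console.log(findElements(eventsHtml, "<li>", ["</li>"]));
// prints [ 'first event', 'second event' ]
```

    The cursor approach avoids copying the remainder of the string on each iteration, which may matter on large pages.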

    But the ugly parts are done for you and you can focus more on the big picture.

    Now, if only there were an easier way to grab text from the web.

    Stay tuned for my next post. ;)


  • Extension Methods in JavaScript

    Tags: Windows 8, Screen Scraper, JavaScript


    One of the nicest features of C# 3.0 and beyond is extension methods.

    I am porting my Screen Scraping Utility Kit to JavaScript. The kit makes extensive use of this language feature to attach the FindElement and FindElements methods to string.

    Creating extension methods in C# is fairly trivial, as you can see below.

        public static string FindElement(this string sourceString, string startDelimeter, List<string> endDelimeters)
        {
            int endPoint;

            return FindElement(sourceString, startDelimeter, endDelimeters, out endPoint);
        }

     

    For a while, I wondered how to do this in JavaScript. About two weeks ago, I had my “aha” moment.

    As it turns out, it’s pretty simple:

        // Attach to the string object
        String.prototype.findElement = findElement;

        function findElement(startMarker, endMarker) {

            // use this keyword to reference target object
            //   Just like C#!
            var startPoint = this.indexOf(startMarker);

            // code removed for brevity

            return result;

        }

     

    Basically, you attach a method by adding it to the object’s prototype. For more on prototypes in JavaScript, read this tutorial on the subject.
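
    Since the listing above leaves out the body, here is a small complete sketch of the technique. The body is my assumption of what a single-end-marker version could look like, not the actual port. It also attaches the method with Object.defineProperty rather than plain assignment, a swapped-in choice that keeps the new method non-enumerable.

```javascript
// Sketch of a complete prototype method. The (startMarker, endMarker)
// signature mirrors the listing above; the body is assumed for illustration.
function findElement(startMarker, endMarker) {
    // `this` is the target string. It may arrive as a String wrapper object
    // (sloppy mode) or a primitive (strict mode); String(this) handles both.
    const text = String(this);

    const startPoint = text.indexOf(startMarker);
    if (startPoint === -1) {
        return null; // start marker not present
    }

    const contentStart = startPoint + startMarker.length;
    const endPoint = text.indexOf(endMarker, contentStart);
    if (endPoint === -1) {
        return null; // end marker not present
    }

    return text.substring(contentStart, endPoint);
}

// Attach to the string prototype. defineProperty makes the method
// non-enumerable by default, so it does not show up in for...in loops.
Object.defineProperty(String.prototype, "findElement", {
    value: findElement,
    writable: true,
    configurable: true,
});

console.log("<b>hello</b>".findElement("<b>", "</b>")); // prints "hello"
```

    Once attached, every string in the page picks up the method, which is convenient but also means it can collide with other scripts that extend String.prototype under the same name.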