Introducing the Screen Scraping Utility Kit

Yesterday, I posted the Screen Scraping Utility Kit to MSDN.

Here’s why you’ll want to take a look at it:

  1. You deal with unstructured data.
  2. You have legacy data stored in HTML and want to normalize it or put it into objects.
  3. You want to create an app for a web site and you (or they) don’t have an API.

In a post from late last year, I published the Screen Scraper’s Manifesto, which was meant to be a teaser for this very utility kit.

This is how the utility kit works: it provides a base class that wraps up the WebClient class and an easier means to parse values out of text.
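To give a rough idea of the shape of that base class, here is a minimal sketch. The class and member names below are illustrative assumptions, not the kit’s actual API; the real base class wraps more than this.

```csharp
using System.Net;

// Hypothetical sketch of a WebClient-wrapping base class.
// Names (ScraperBase, RawText, Load) are placeholders, not the kit's API.
public abstract class ScraperBase
{
    // Cached raw HTML of the downloaded page, ready for parsing.
    protected string RawText { get; private set; }

    // Download the page once; derived scrapers then parse RawText.
    public void Load(string url)
    {
        using (var client = new WebClient())
        {
            RawText = client.DownloadString(url);
        }
    }
}
```

A derived class would call Load once, then run the parsing extension methods described below against RawText.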

The kit also offers a way to offset the main risk inherent in screen scraping: changes to the target site.

In this first post, let’s talk about text processing utilities.

All this magic happens in the ExtensionMethods class in the Utils directory, in the FindElement and FindElements methods.

As you can see, the code below isn’t too complex, but it saves you a lot of work.

        public static string FindElement(this string sourceString, string startDelimeter, List<string> endDelimeters)
        {
            int endPoint;
            return FindElement(sourceString, startDelimeter, endDelimeters, out endPoint);
        }

        public static string FindElement(this string sourceString, string startDelimeter, List<string> endDelimeters, out int endPoint)
        {
            var startPoint = sourceString.IndexOf(startDelimeter);
            if (startPoint == -1)
            {
                endPoint = -1;
                return null;
            }

            // Skip past the start delimiter itself.
            var adjustedStartPoint = startPoint + startDelimeter.Length;
            endPoint = FindEndElement(sourceString, endDelimeters, adjustedStartPoint);
            var length = endPoint - adjustedStartPoint;
            return sourceString.Substring(adjustedStartPoint, length);
        }

If you’re a seasoned developer, you have no doubt already written code to pull values out of strings. Think of all the code you’ve had to write for that. Now look at this bit of code, which uses the extension methods above to keep things compact.

var eventsToday = this.RawText.FindElement("Events</span></h2>", new List<string>() { "<h2>" });

This takes the HTML from a typical “events occurred on this day” page and pulls out only the events portion of the HTML. To split out the individual events, we’ll use the other method: FindElements.

        public static List<string> FindElements(this string sourceString, string startDelimeter, List<string> endDelimeters)
        {
            var returnList = new List<string>();
            int currentPoint = 0;
            int endPoint;
            string workingString = sourceString;
            while (workingString.Length > 1)
            {
                // Trim off everything before the last match and keep scanning.
                workingString = workingString.Substring(currentPoint);
                if (workingString.Length == 1)
                    break;
                string element = workingString.FindElement(startDelimeter, endDelimeters, out endPoint);
                if (element == null)
                    break;
                returnList.Add(element);
                currentPoint = endPoint;
            }
            return returnList;
        }

This means you can extract each of the events with one line of code:

var events = eventsToday.FindElements("<li>", new List<string>() { "</li>" });

The line of code above breaks the items on the events list apart into a List&lt;string&gt;. From there, you can slice and dice the data further.
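For example, each extracted item will still contain markup of its own, so you can apply FindElement again to clean it up. The snippet below is a hypothetical illustration, assuming each list item wraps its text in an anchor tag; the actual page structure will vary.

```csharp
// Hypothetical: pull the link text out of one event item.
// Assumes the item looks like: <a href="...">Some event text</a> ...
var firstEvent = events[0];
var linkText = firstEvent.FindElement(">", new List<string>() { "</a>" });
```

The same start/end-delimiter pattern composes at every level of the markup, which is the point of the kit.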

But the ugly parts are done for you and you can focus more on the big picture.

Now, if only there were an easier way to grab text from the web.

Stay tuned for my next post. ;)