I’m currently working on porting an app of mine from the phone to Windows 8 Metro.
Since I have a lot more real estate to play with on Windows 8 than on the phone, and do so love a good technical challenge, I decided to add more features to the Win8 version.
Many of these features are user-specific and require logging into the site. Since there’s no publicly documented API, I’ve been doing all this via screen scraping.
This approach has been pretty successful on the phone with my handy little library and embodies the spirit of my Screen Scraper’s Manifesto: if you don’t provide us developers with an API, we will make our own.
The best way to get a feel for how a site works is to fire up Fiddler and sniff the HTTP traffic going back and forth.
You’ll see that Fiddler recorded my activity: there’s a POST to a login page with an HTTP 302 response from the server. Immediately after, there’s a GET request for the “my account” screen.
In case you’re not up to speed on HTTP response codes, 302 means “redirect.” That makes sense: once I log in, I get redirected to another page. Many sites take this approach.
It’s actually the 302 response from the server that sends down the authentication cookie. Remember this little nugget for later.
Originally, I tried using WinJS.xhr to send a POST request with my credentials and get the cookie.
Getting WinJS.xhr to make a POST call is easy enough, but it never got the cookie, and the response HTML was the HTML for the “my account” page.
Let’s take a closer look at the 302 response.
When browsers talk with servers, this is actually what’s being passed back and forth: HTTP Headers and HTTP content.
The “Set-Cookie” line is where you get the authentication cookie. That’s what I was after.
The first line, “HTTP/1.1 302 Moved Temporarily,” and the last line, “Location,” tell the browser that the content it’s looking for has moved and where it should go next.
When browsing the web, that’s the behavior you want. For screen scraping, that’s not always what you want.
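To make the status line and header layout concrete, here is a minimal sketch of pulling the status code and headers out of the raw head of a response. It assumes CRLF line endings and one header per line. The class and method names (RawResponse, ParseStatusCode, ParseHeaders) are hypothetical and only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class RawResponse
{
    // Reads the numeric status code from the first line,
    // e.g. "HTTP/1.1 302 Moved Temporarily" yields 302.
    public static int ParseStatusCode(string rawHead)
    {
        string statusLine = rawHead.Split(new[] { "\r\n" }, StringSplitOptions.None)[0];
        // Split into at most three parts: version, code, reason phrase.
        string[] parts = statusLine.Split(new[] { ' ' }, 3);
        return int.Parse(parts[1]);
    }

    // Collects "Name: value" lines into a case-insensitive dictionary.
    // Simplification: if a header repeats, the last occurrence wins.
    public static Dictionary<string, string> ParseHeaders(string rawHead)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string[] lines = rawHead.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);

        // Start at 1 to skip the status line.
        for (int i = 1; i < lines.Length; i++)
        {
            int colon = lines[i].IndexOf(':');
            if (colon <= 0)
                continue;

            string name = lines[i].Substring(0, colon).Trim();
            string value = lines[i].Substring(colon + 1).Trim();
            headers[name] = value;
        }
        return headers;
    }
}
```

Given a 302 response, ParseStatusCode returns 302, and the dictionary from ParseHeaders holds both the Set-Cookie value and the Location the browser would normally jump to. A scraper that reads these itself gets to decide whether to follow the redirect.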
It turns out that WinJS.xhr automatically “follows” the redirect and comes back with the results of the “my account” page. And that response didn’t resend the cookie.
For reference, here’s what that response looks like:
In short, there was no way to get to the data I wanted with WinJS.xhr.
Looks like I needed to dig a little deeper.
We’re not in .NET Kansas Anymore
What I needed was something a little lower level that would speak raw, unfiltered HTTP: an object that would not behave like a browser. I remembered that there were some .NET 3.x objects that did just that. But WinRT is not exactly .NET.
I remembered that the property I needed was AllowAutoRedirect, which, when set to false, stops the client from behaving like a browser and automatically making the next request.
After some digging, I discovered that the HttpClientHandler object in WinRT has a similar property. Now, I was onto something.
HttpClientHandler needs to be used in conjunction with HttpClient.
It’s HttpClient that’s actually going to make the call out to the server.
Here’s a code snippet:
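// Requires: using System.Net.Http; using System.Net.Http.Headers;
// Runs inside an async method.
HttpClientHandler handler = new HttpClientHandler();
handler.UseDefaultCredentials = true;
// Don't follow the 302, so the response carrying Set-Cookie is the one we get back.
handler.AllowAutoRedirect = false;
HttpClient client = new HttpClient(handler);
// CreateDataString builds the form-encoded request body.
HttpContent httpContent = new StringContent(CreateDataString(userName, password));
httpContent.Headers.ContentType = new MediaTypeHeaderValue("application/x-www-form-urlencoded");
HttpResponseMessage response = await client.PostAsync(LOGIN_URI, httpContent);
var headerString = response.Headers.ToString();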
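The snippet calls a CreateDataString helper that the post doesn't show. A minimal sketch of what such a helper might look like follows. It assumes the login form uses fields named "username" and "password"; those names are guesses, and the real ones would come from the POST body captured in Fiddler. The LoginForm class name is also hypothetical.

```csharp
using System;
using System.Net;

public static class LoginForm
{
    // Builds an application/x-www-form-urlencoded body.
    // ASSUMPTION: the field names "username" and "password" are placeholders;
    // use whatever names the site's login form actually posts.
    public static string CreateDataString(string userName, string password)
    {
        // UrlEncode escapes characters such as '&' and '=' so they
        // cannot be mistaken for field separators.
        return "username=" + WebUtility.UrlEncode(userName)
             + "&password=" + WebUtility.UrlEncode(password);
    }
}
```

Encoding each value matters because a password containing "&" or "=" would otherwise split into extra form fields on the server side.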
The headerString contains all the HTTP headers condensed into a string that looks something like this:
"Transfer-Encoding: chunked\r\nConnection: keep-alive\r\nDate: Wed, 13 Jun 2012 09:44:55 GMT\r\nSet-Cookie: AUTH_COOKIE=a2334xxxaseas; expires=Wed, 27-Jun-2012 09:44:55 GMT;path=/;
From here, it’s pretty easy to extract the contents of the AUTH_COOKIE (not its real name).
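One way to do that extraction might look like the sketch below. It assumes the header string is CRLF-separated, as in the sample above, with one cookie per Set-Cookie line. CookieExtractor and ExtractCookieValue are hypothetical names, not part of the original app.

```csharp
using System;

public static class CookieExtractor
{
    // Returns the value of the named cookie from a CRLF-separated header string,
    // or null if no matching Set-Cookie header is found.
    public static string ExtractCookieValue(string headerString, string cookieName)
    {
        const string headerPrefix = "Set-Cookie:";
        string[] lines = headerString.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            if (!line.StartsWith(headerPrefix, StringComparison.OrdinalIgnoreCase))
                continue;

            // The first "name=value" pair is the cookie itself;
            // everything after the first ';' is attributes (expires, path, ...).
            string firstPair = line.Substring(headerPrefix.Length).Split(';')[0].Trim();
            int equalsIndex = firstPair.IndexOf('=');
            if (equalsIndex <= 0)
                continue;

            string name = firstPair.Substring(0, equalsIndex);
            if (string.Equals(name, cookieName, StringComparison.Ordinal))
                return firstPair.Substring(equalsIndex + 1);
        }
        return null;
    }
}
```

Calling ExtractCookieValue(headerString, "AUTH_COOKIE") on the sample string would hand back just the cookie value, ready to attach to later requests. An alternative that skips string parsing is to call response.Headers.TryGetValues("Set-Cookie", out values) on the HttpResponseMessage and work with the individual header values directly.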