Web-scraping pages that are protected by Touchstone

MIT Touchstone is a sign-on service that allows members of the MIT community to log in to participating websites.

Programmers sometimes need their programs to interact with web pages, and often use command-line tools to fetch pages and inspect the page data. This is commonly called web-scraping.

This page has tips and guidelines for how to get data from Touchstone-protected pages. Feel free to edit and add your own tips.

Overview

When a web browser requests a Touchstone-protected page, the hosting server checks the browser's cookies to decide whether it is logged in. If it is not, the server redirects the browser to a login page.

This means that in order to web-scrape a Touchstone-protected page, your software will need to be able to handle cookies, and possibly handle redirects.
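As a minimal sketch of the cookie and redirect handling described above, the following uses only the Python standard library. The helper names make_opener and fetch are made up for this example; they are not part of Touchstone or tscurl.py.

```python
import http.cookiejar
import urllib.request


def make_opener():
    """Return (opener, jar): an opener that keeps cookies and follows 30x redirects."""
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor stores cookies from every response and sends them
    # back on later requests. build_opener() adds HTTPRedirectHandler by default.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, url, data=None):
    """GET url (or POST when data is bytes) and return (final_url, body_text)."""
    with opener.open(url, data=data, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        # geturl() is the address after any redirects were followed.
        return resp.geturl(), resp.read().decode(charset, errors="replace")
```

Reusing the same opener for every request is what carries the login cookies from one step to the next. Comparing the URL returned by fetch with the URL you asked for is a simple way to notice that you were redirected.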

Basic flow

When your program tries to load the desired page, you may be redirected through a number of login pages. Here is a flow you should be prepared for:

  1. Try to load the page you want.
  2. Handle a possible WAYF redirect.
  3. Handle a possible IDP redirect.
  4. Handle a possible SAML redirect.
  5. Check to see if you got what you wanted.
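The flow above could be driven by a small loop like the sketch below. It is only an illustration: classify and run_flow are made-up names, the fetch and handlers arguments are helpers you would write yourself, and treating any page that contains the text "SAMLResponse" as the SAML step is an assumption based on the standard SAML form field name, not something confirmed for Touchstone.

```python
from urllib.parse import urlsplit

WAYF_HOST = "wayf.mit.edu"  # host of the WAYF address mentioned on this page
IDP_HOST = "idp.mit.edu"    # host of the IDP address mentioned on this page


def classify(url, body):
    """Guess which login step a fetched page belongs to."""
    host = urlsplit(url).hostname or ""
    # Look for the SAML form first (assumption: it may be served by the IDP host).
    if "SAMLResponse" in body:
        return "saml"
    if host == WAYF_HOST:
        return "wayf"
    if host == IDP_HOST:
        return "idp"
    return "done"


def run_flow(start_url, fetch, handlers, max_steps=10):
    """Handle login pages until a non-login page comes back.

    fetch(url) and handlers[step](url, body) are caller-supplied (hypothetical)
    functions that each return a (final_url, body) pair.
    """
    url, body = fetch(start_url)
    for _ in range(max_steps):
        step = classify(url, body)
        if step == "done":
            return url, body
        url, body = handlers[step](url, body)
    raise RuntimeError("login did not finish within %d steps" % max_steps)
```

The step limit is there so that a failed login (for example, a wrong password that keeps returning the IDP page) ends with an error instead of looping forever.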

More details about each step

Try to load the page you want.

Request the page. If your browser has cookies showing that it is logged in, you should get the page. Otherwise, the server may use a 30x redirect header to redirect the browser to a login page.

Handle a possible WAYF redirect

If the server needs you to log in, it may redirect you to a "WAYF" ("Where are you from") server to ask you to choose your identity provider from a list.

The MIT WAYF page will have an address starting with http://wayf.mit.edu/ and usually shows "Please choose your account provider". Choices include "MIT Kerberos account (or MIT web certificate)", "Touchstone Collaboration account", etc.

Your code can detect the choices on the WAYF page and submit an appropriate POST request.
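One way to detect the choices is to parse the form with Python's html.parser module, as in the sketch below. It assumes the WAYF page offers its providers as options of a single select element inside the first form; the real page layout and field names may differ, so inspect the page before relying on this. FormParser and build_wayf_post are made-up names for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collect the first form's action, its input values, and its select options."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}    # input name -> value
        self.options = {}   # select name -> list of option values
        self._in_form = False
        self._select = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and self.action is None:
            self.action, self._in_form = a.get("action") or "", True
        elif not self._in_form:
            return  # ignore anything outside the first form
        elif tag == "input" and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.options[self._select] = []
        elif tag == "option" and self._select:
            self.options[self._select].append(a.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False
        elif tag == "select":
            self._select = None


def build_wayf_post(page_url, html, wanted):
    """Return (post_url, post_body), choosing the option whose value contains wanted."""
    parser = FormParser()
    parser.feed(html)
    data = dict(parser.fields)
    for name, values in parser.options.items():
        matches = [v for v in values if wanted in v]
        if not matches:
            raise ValueError("no provider option matching %r" % wanted)
        data[name] = matches[0]
    # The form action may be relative, so resolve it against the page address.
    return urljoin(page_url, parser.action), urlencode(data).encode("ascii")
```

The returned body can be sent as the data of a POST request to the returned URL, using the same cookie-aware client as the rest of the flow.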

Note that some services don't allow you to choose a provider through WAYF, and will automatically direct you to the MIT IDP.

Handle a possible IDP redirect

Once the identity provider is known, you will be directed to a server with login options for that IDP. For example, if you POSTed a form requesting the MIT identity provider, you would receive a response directing you to https://idp.mit.edu/idp/Authn/MIT, which gives you options to log in with a web certificate, with a Kerberos username and password, or with Kerberos tickets.

Your code can detect the choices on the IDP page, and submit an appropriate POST request.
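For the username/password option, a rough sketch is shown below. It does not assume any particular field names: it fills the first text input and the first password input it finds and passes hidden inputs through unchanged. This is a heuristic, and because the IDP page offers several login options you may need to restrict it to the right form. InputCollector and build_password_post are made-up names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class InputCollector(HTMLParser):
    """Record (type, name, value) for every named input element on the page."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            self.inputs.append((kind, a["name"], a.get("value") or ""))


def build_password_post(html, username, password):
    """Return a POST body with the username and password filled in."""
    collector = InputCollector()
    collector.feed(html)
    data, user_set, pass_set = {}, False, False
    for kind, name, value in collector.inputs:
        if kind == "hidden":
            data[name] = value              # pass hidden state through unchanged
        elif kind == "text" and not user_set:
            data[name], user_set = username, True
        elif kind == "password" and not pass_set:
            data[name], pass_set = password, True
    if not (user_set and pass_set):
        raise ValueError("no username/password inputs found on this page")
    return urlencode(data).encode("utf-8")
```

Avoid hard-coding a password in a script. Reading it at run time, for example with getpass.getpass() from the standard library, keeps it out of the source file.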

Handle a possible SAML redirect

The IDP server should prepare a final page that contains a web form with data about the logged-in user. The data on that page should be POSTed back to your original server.

Your code should detect the SAML page and POST its form data to the address given in the form.
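A sketch of that detection follows. It assumes the page uses the standard SAML HTTP POST layout: a form with a hidden input named SAMLResponse, usually alongside one named RelayState. SamlFormParser and parse_saml_page are made-up names for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class SamlFormParser(HTMLParser):
    """Collect the first form's action and every hidden input on the page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = a.get("action") or ""
        elif tag == "input" and (a.get("type") or "").lower() == "hidden" and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""


def parse_saml_page(html):
    """Return (action_url, post_body) for a SAML POST page, or None for other pages."""
    parser = SamlFormParser()
    parser.feed(html)
    if "SAMLResponse" not in parser.fields or not parser.action:
        return None
    # urlencode percent-encodes the "+" and "=" characters of the base64 value.
    return parser.action, urlencode(parser.fields).encode("ascii")
```

html.parser decodes character references in attribute values, so an action written as https&#x3a;//... comes out as a normal URL. A browser would submit this form automatically with JavaScript; a scraper has to send the POST itself.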

Check to see if you got what you wanted

If everything worked, you should be redirected back to the original web server, but you may not land on the page you originally requested. This is up to the web server. Some servers keep track of your original request and try to fulfill it for you. Other servers send you back to a generic welcome screen, and you need to repeat your original request.

Check to see if you got what you wanted, and repeat your original request (with cookies) if needed.
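The check and the single retry could look like the sketch below. The names same_page and ensure_page are made up, and fetch stands for whatever function you use to request a page with your stored cookies and get back a (final_url, body) pair.

```python
from urllib.parse import urlsplit


def same_page(wanted_url, final_url):
    """Compare scheme, host, path and query, ignoring any #fragment."""
    a, b = urlsplit(wanted_url), urlsplit(final_url)
    return (a.scheme, a.hostname, a.path or "/", a.query) == (
        b.scheme, b.hostname, b.path or "/", b.query)


def ensure_page(wanted_url, final_url, body, fetch):
    """Return body if we landed on wanted_url, otherwise request it once more."""
    if same_page(wanted_url, final_url):
        return body
    # We were logged in but sent somewhere else, so repeat the original request.
    retry_url, retry_body = fetch(wanted_url)
    if not same_page(wanted_url, retry_url):
        raise RuntimeError("still redirected to %s; login may have failed" % retry_url)
    return retry_body
```

Retrying only once avoids hammering the server when the login did not actually succeed.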

Other references

Add your own tips here.

On Athena, the consult locker has a tool named "tscurl.py" that does most of the above steps for you. Feel free to read its code, or submit improvements to the Athena consultants.

tscurl.py usage:

  add consult
  tscurl.py URL


Last Modified:

November 02, 2013
