Getting a logon sequence with SSL, portal redirects, etc. to be reliably repeatable
The captured authentication sequence from the proxy server for a project I am working on looks roughly as follows:
http://URL Transaction complete
http://URL/portal Transaction complete
Layering ssl over existing socket
http://URL/portal Transaction complete
Layering ssl over existing socket
https://www.wslx.URL/login.cgi?WslIP=address&back=OtherURL Transaction complete
https://www.wslx.URL/auth.cgi Transaction complete
http://URL/portal Transaction complete
http://URL/portal Transaction complete
http://URL/portal/sso/sso.asp Transaction complete
http://URL/portal/admin/dologin.asp Transaction complete
In general, must I copy all of these into the scraping session and parameterize the userid and password to get a reliably repeatable logon sequence? Also, what is the significance of "Layering ssl over existing socket", and must it be copied into the scraping session? Any assistance in this area would be much appreciated. Once I can log in reliably, the rest of the data extraction seems to work very well.
Thanks!
"Layering ssl over existing
"Layering ssl over existing socket" is simply what screen-scraper says in order to notify you that it's proxying over an https transaction. If it's left in your list of proxy'd pages, you can safely ignore it when recreating the login process.
screen-scraper will also handle standard redirects all by itself. That being the case, you likely won't need to make 10 separate scrapeable files. I'd start with the first one, run your session, and see what page it ends up on. You'll likely get one or two pages into the process just from the standard redirects that screen-scraper follows.
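As for parameterizing the userid and password, here's a minimal sketch of the kind of script you might attach to run "Before scraping session begins". It's Interpreted Java (screen-scraper's default scripting language); the variable names, values, and the scrapeable file name are hypothetical, and your login file's POST parameters would reference the variables as ~#USERNAME#~ and ~#PASSWORD#~:

    // Set the credentials as session variables so the login
    // scrapeable file's POST parameters can reference them
    // as ~#USERNAME#~ and ~#PASSWORD#~ (hypothetical names).
    session.setVariable( "USERNAME", "myUserId" );
    session.setVariable( "PASSWORD", "myPassword" );

    // Request only the first scrapeable file; screen-scraper
    // follows the standard redirects from there on its own.
    session.scrapeFile( "Login" );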
You'll likely find that the website tries to put you on a roller coaster. For example, after one or two redirects, you may have to examine the source of the page on which you stopped, and see if there is a redirect built into a javascript timer. For instance, logging into hiring.monster.com required me to do a postback onto one of the pages during the process, wait a couple of redirects, and then it would sit on a page for a hard-coded 5 seconds before javascript would then direct me into another set of 4 redirects.
In the event that you find these JS redirects, you can either hard-code the redirect URL into another scrapeable file for the session to continue with, or be more flexible about it and use an extractor pattern on the page source to pull out the JS redirect URL.
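For the extractor pattern route, the pattern text might look something like this against the page source (the token name is hypothetical):

    window.location = "~@REDIRECT_URL@~"

Check "Save in session variable" on the token, and then a script run "After file is scraped" can hand the URL off to a scrapeable file whose URL field is simply ~#REDIRECT_URL#~:

    // Pull out the javascript redirect URL the extractor
    // pattern saved, log it, and request the file whose URL
    // field references it (hypothetical names throughout).
    redirectUrl = session.getVariable( "REDIRECT_URL" );
    session.log( "Following JS redirect to: " + redirectUrl );
    session.scrapeFile( "JS redirect" );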
So, the short answer is: no, in general you only need a fraction of the scrapeable files shown in a big login process like that. However, you'll never know exactly how many until you build them one by one, running the session periodically and watching the automatic redirects (that screen-scraper is handling for you) in the log as you go.