How to interpret cookies and headers and how to use them?
I am scraping the site https://committing.efanniemae.com/eCommitting/eCommitting (Fannie Mae Oak Bank). I created a proxy session, a scraping session, and created a scrapeable file for each transaction. When I ran the scraping session, it failed right away. I think it failed because cookies and headers changed between the proxy and scraping sessions. I haven't been able to successfully handle this yet.
In the scraping session, the first scrapeable file is based on the first transaction. Under Last Request, "Compare with proxy transaction" showed one cookie for the proxy transaction but none for the scrapeable file. In Headers, it showed fields called "Connection" and "TE" that were not present in the scrapeable file. A couple of other fields were present but had different values. There were also fields named "Cookie" and "Cookie2" that were present in the proxy transaction but not in the scrapeable file.
Cookies
Proxy session:
$Version=1
JSESSIONID=ng11LncT8LGvtmK07qGJh76MpfQkr5X1fGPvl4LJhn2t4N2JCNT0!-1253707218!1525431447
Headers:
Cookie: JSESSIONID=ng11LncT8LGvtmK07qGJh76MpfQkr5X1fGPvl4LJhn2t4N2JCNT0!-1253707218!1525431447
Cookie2: $Version=1
Connection: Keep-Alive, TE
The "Raw Response" for the transaction is as follows:
HTTP/1.1 200 OK
Content-Type: text/html
Cache-Control: no-store
Date: Mon, 22 Mar 2010 17:19:23 GMT
Pragma: no-cache
Set-Cookie: JSESSIONID=zTVnLnmbHT2R1LdFJGFjTkV0LgnNPJgY0LhdRdSbNXFTYrvHyMxd!-1253707218!1525431447; path=/; HttpOnly; secure
Connection: Keep-Alive
Transfer-Encoding: chunked
Keep-Alive: timeout=15, max=150
Cache-Control: no-cache="set-cookie"
Server: Apache
X-Powered-By: Servlet/2.4 JSP/2.0
Some of the Key / Value pairs in the transaction request for Headers:
Cookie JSESSIONID=ng11LncT8LGvtmK07qGJh76MpfQkr5X1fGPvl4LJhn2t4N2JCNT0!-1253707218!1525431447
Cookie2 $Version=1
Host committing.efanniemae.com
Parts of the "Raw response"
HTTP/1.1 200 OK
Pragma: no-cache
Cache-Control: no-store
Date: Tue, 23 Mar 2010 13:32:10 GMT
Transfer-Encoding: chunked
X-Powered-By: Servlet/2.4 JSP/2.0
Set-Cookie: JSESSIONID=CgT9LyChdSyBSFmWmQW3pCXV2QSBpsCnMpVJQgBypKF6G5vKlclC!1525431447!-1253707218; path=/; HttpOnly; secure
Content-Type: text/html
Server: Apache
Cache-Control: no-cache="set-cookie"
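For readers comparing the two responses above, here is a minimal standalone sketch (plain Python, run outside screen-scraper) that splits each Set-Cookie value into its name, value, and attributes. The helper name parse_set_cookie is made up for this illustration. It suggests that the server issues a different JSESSIONID on each visit while the attributes stay the same, so differing cookie values between sessions are not, by themselves, an error.

```python
def parse_set_cookie(header_value):
    """Split a Set-Cookie header value into (name, value, attributes)."""
    parts = [part.strip() for part in header_value.split(";")]
    name, _, value = parts[0].partition("=")
    attributes = {}
    for part in parts[1:]:
        key, _, attr_value = part.partition("=")
        # Flag attributes such as HttpOnly and secure carry no value.
        attributes[key.lower()] = attr_value if attr_value else True
    return name, value, attributes


# The two Set-Cookie values quoted in the raw responses above.
first_header = (
    "JSESSIONID=zTVnLnmbHT2R1LdFJGFjTkV0LgnNPJgY0LhdRdSbNXFTYrvHyMxd"
    "!-1253707218!1525431447; path=/; HttpOnly; secure"
)
second_header = (
    "JSESSIONID=CgT9LyChdSyBSFmWmQW3pCXV2QSBpsCnMpVJQgBypKF6G5vKlclC"
    "!1525431447!-1253707218; path=/; HttpOnly; secure"
)

first_name, first_value, first_attrs = parse_set_cookie(first_header)
second_name, second_value, second_attrs = parse_set_cookie(second_header)

print(first_name == second_name)    # same cookie name
print(first_value == second_value)  # different session value
print(first_attrs)
```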
----------------
I attempted to set the JSESSIONID and $Version cookies by creating a script and adding it to the scrapeable file that is based on the first transaction. I ran it before the file and also after it, but there was no effect. When I looked at the results of the scraping session, the cookies had not been created or set. The script:
session.setCookie("https://committing.efanniemae.com/eCommitting/eCommitting", "JSESSIONID", "ng11LncT8LGvtmK07qGJh76MpfQkr5X1fGPvl4LJhn2t4N2JCNT0!-1253707218!1525431447");
session.setCookie("https://committing.efanniemae.com/eCommitting/eCommitting", "$Version", "1");
I tried this script and it didn't do anything either:
currentURL = scrapeableFile.getCurrentURL();
session.setCookie(currentURL, "JSESSIONID", "ng11LncT8LGvtmK07qGJh76MpfQkr5X1fGPvl4LJhn2t4N2JCNT0!-1253707218!1525431447");
session.setCookie(currentURL, "$Version", "1");
Questions:
Where do you get the name of the URL to put in the setCookie first parameter?
Is this syntax correct?
Shouldn't you see the new cookies that the script created in the Last Response of the scraping session?
I can't seem to create cookies and set their values -- from what I posted, what do you think is wrong?
What exactly are "Headers"? What are the key / value pairs for Headers? Can you set them?
What is the correct approach to finding cookies and Headers that need to be set and then setting them?
(Raw Response from proxy session transaction 1 is attached.)
Attachment | Size |
---|---|
RawResponse.txt | 13.34 KB |
Gary,
You shouldn't need to set any of these cookies. In a few tests I ran screen-scraper was handling the JSESSIONID cookie on its own. You may need to include a scrapeable file before the scrapeable file that is missing the cookie. The scrapeable file starting off your scraping session won't have a cookie in the request unless you set it manually.
Should you need to set one manually be sure to include only the domain (not the URL) of the site setting the cookie. For example...
session.setCookie("committing.efanniemae.com", "JSESSIONID", session.getVariable("myCookieValue"));
Notice how I'm not indicating that it is an https site. That is handled automatically. Also, you'll almost always want to scrape the value of your cookie from the preceding page versus hard-coding it.
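As a rough illustration of the "domain, not URL" point, the standalone Python sketch below (not a screen-scraper script) pulls the host name out of the full URL. The function name cookie_domain is hypothetical; the result is the kind of value that would go in the first argument of the setCookie call shown above.

```python
from urllib.parse import urlsplit


def cookie_domain(url):
    """Return only the host part of a URL, without scheme or path."""
    return urlsplit(url).hostname


domain = cookie_domain("https://committing.efanniemae.com/eCommitting/eCommitting")
print(domain)
```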
However, in your case, I don't think you should be manually setting your cookies.
Regarding HTTP headers more generally, typically the only ones you'll ever need to worry about are cookies, anything in the POST payload, referrer (rarely), and even more rarely the content type (in case screen-scraper thinks the response is binary when it's not). Otherwise you're safe to ignore the other headers.
Try starting your proxy transaction from the very starting point of the site. I'll typically open my browser to about:blank, set my proxy settings, turn on screen-scraper's proxy then enter the address I want to visit. To ensure you have a new session it's a good idea to clear your cookies and restart your browser.
-Scott
Login fails, Cookies set.
I took your advice and made a bit of progress but am not there yet. I created the proxy session as you described. Then I created a scrapeable file for every transaction. It still didn't work when I ran the scraping session. It fails quite early -- the login fails. Then it merely re-displays the login page for every scrapeable file after that. (I am able to login manually with the same login credentials.)
I noticed that the cookie values were different between the proxy session and the scraping session, so I set the cookies using the syntax you provided for setting them. I looked at "Compare with proxy transaction" and the scripts that I added had set the cookie values identically between the proxy and scraping sessions. But it didn't matter. The problem happened early -- it did not accept the login. The parameters for scrapeable file #8, which is based on transaction #8, show the userID and password keys with the correct values. The Last Response for scrapeable file 8 displays the login page, and so do files 9, 10, etc.
First I used IE for the proxy session, then I used Opera. IE complained about the website's security certificate. I told it to keep going. Opera allowed me to tell it that this was a safe site. They both created some "Error" transactions that I deleted. Opera said "The server's certificate chain is incomplete, and the signers are not registered. Accept?" I accepted and saved that setting.
I created a script that sets the cookies and told the scraping session to run it before the session begins. I also created another script and ran it before a scrapeable file. It matches the new values for the $Version and JSESSIONID cookies that appeared in later proxy transactions.
Should I create a scrapeable file that is not based on any proxy transaction and set its sequence number to 1, then run a script tied to it? For what purpose? Is that better than running a script tied to the scraping session itself? Is it OK to create a scrapeable file for every transaction?
When I review the scraping session, some of the "Last Response", "Display Response in Browser" scrapeable files display this message on top: "To help protect your security, Internet Explorer has restricted the webpage from running scripts or ActiveX controls that could access your computer. Click here for options."
Your response seems to indicate that it is not a problem to have the values of a cookie like JSESSION differ between the proxy session and the scraping session.
Also, is there a way to remove a cookie? I noticed that sometimes the proxy transaction showed just one of the 2 cookies that appear throughout the proxy session transactions. My script makes both of them appear all of the time.
Is it possible that the web site being scraped is detecting the proxy and doing something to reject the login? I started with a clean browser, created the proxy session, then created a scrapeable file for every transaction. Based on your experience how would you proceed at this point?
There is a similar situation with another website I'm trying to scrape. It works fine for a couple of hours, then it fails. In that case I assume that the web program must be looking at something, perhaps a date/time in a cookie. I'm guessing that I'll have to find whatever is changing and set it to something that the web program accepts.
Gary,
It's always best to take things one step at a time. By this I mean, focus on getting from the first page to the next page and no farther until it works.
I set up a scrape using this as the URL of the first scrapeable file. No GET or POST parameters. Just the URL.
https://committing.efanniemae.com/eCommitting/eCommitting
The second item in my proxy transaction list that was not a css or javascript file had the same url but with the following post parameters.
actionResource=LoginSubmit
originalSessionId=FbdCLy7BqKQnwBVVFtG8GzDy2dJV2DLBQdhb97lyxppsLtqjnbTc%21-1920279331%211275366514%211265826657688
submittedPage=Login
userID=[omitted]
password=[omitted]
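A small aside on the originalSessionId value above: it is URL-encoded, so each %21 stands for an exclamation mark. The standalone Python sketch below (not a screen-scraper script) decodes it and encodes it again, which shows that the value has the same "id!number!number" shape as the JSESSIONID cookies quoted earlier in the thread.

```python
from urllib.parse import quote, unquote

# The originalSessionId POST parameter exactly as listed above.
encoded = (
    "FbdCLy7BqKQnwBVVFtG8GzDy2dJV2DLBQdhb97lyxppsLtqjnbTc"
    "%21-1920279331%211275366514%211265826657688"
)

decoded = unquote(encoded)             # %21 becomes "!"
re_encoded = quote(decoded, safe="")   # "!" becomes %21 again

print(decoded)
print(re_encoded == encoded)
```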
Because HTTP is by design a "stateless" protocol, web developers use certain techniques to keep track of the current "state" of each user. The two most common techniques are cookies and GET/POST data. Because each session must be unique, the corresponding cookie and/or GET/POST data will be unique.
If you try to use session data from a previous visit the server will likely kick you back to the login page because it does not have that value stored as a currently active session.
Most often, screen-scraper will automatically handle the passing of session cookies. The one exception is when a cookie is being set in the browser using Javascript (this is very rare, but you would look for "document.cookie" in some Javascript). For get/post session data it is necessary to scrape the corresponding value(s) from the previous page and pass them as session variables to the next page.
After you run your scrape, view the last response of the first scrapeable file. Search for the word "session". Note the hidden form field "originalSessionId". Create an extractor pattern for the value of that hidden field, let's call the token, "originalSessionId" and set it as a session variable.
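To make the idea of pulling the hidden field out of the first response concrete, here is a standalone Python sketch. It is not a screen-scraper extractor pattern; inside screen-scraper you would define the pattern in the tool as described above. The sample markup, the class name HiddenFieldParser, and the value EXAMPLE-SESSION-VALUE are all invented for the illustration, since the real page source is not shown in this thread.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of hidden <input> fields."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


# Hypothetical login form; the real markup may differ.
sample_html = """
<form method="post" action="/eCommitting/eCommitting">
  <input type="hidden" name="originalSessionId" value="EXAMPLE-SESSION-VALUE">
  <input type="text" name="userID">
  <input type="password" name="password">
</form>
"""

parser = HiddenFieldParser()
parser.feed(sample_html)
original_session_id = parser.hidden_fields["originalSessionId"]
print(original_session_id)
```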
Now, move to your second scrapeable file and click on the parameters tab. Delete the long wacky value of the originalSessionId parameter and replace it with ~#originalSessionId#~. Using the tilde-pound like this will reference the value of a session variable by that same name.
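The substitution that the tilde-pound token stands for can be pictured with the standalone Python sketch below. This only illustrates the idea and is not how screen-scraper implements it. The names fill_tokens and session_variables are hypothetical, the session value is a placeholder, and leaving unknown tokens untouched is an assumption of this sketch.

```python
import re

# Matches tokens of the form ~#name#~
TOKEN_PATTERN = re.compile(r"~#(\w+)#~")


def fill_tokens(template, session_variables):
    """Replace each ~#name#~ token with the session variable of that name."""

    def replace(match):
        name = match.group(1)
        # Assumption: a token with no matching variable is left as-is.
        return session_variables.get(name, match.group(0))

    return TOKEN_PATTERN.sub(replace, template)


session_variables = {"originalSessionId": "EXAMPLE-SESSION-VALUE"}
parameters = {
    "actionResource": "LoginSubmit",
    "originalSessionId": "~#originalSessionId#~",
    "submittedPage": "Login",
}

resolved = {key: fill_tokens(value, session_variables) for key, value in parameters.items()}
print(resolved)
```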
Finally, run your scraping session. When finished, view the Last Response of the second scrapeable file. Click on the Display Response in Browser button. If all went well you should see something other than the login screen.
Just to reiterate an earlier point, click on the Last Request tab of your second scrapeable file. Note the JSESSIONID cookie automatically set, ready to be passed to the server. Also view the Last Response of the first scrapeable file and you'll see screen-scraper happily acknowledging the presence of the JSESSIONID cookie handed to it from the server.
For pages farther on down the trough you'll do something very similar. It's always a good idea to continually re-scrape the value of your session parameter even if (in theory) it should not change during your session. It may be that with the other site that keeps kicking you off you just need to be sure you're passing a session value for each request.
The warning you're seeing in IE about "running scripts or ActiveX controls" could just be a formerly woefully neglectful browser now being overly anxious about protecting you from innocuous behavior by the site. Try tuning down (or off) your phishing filter or better yet use Firefox and find yourself forgetting that Internet Explorer was ever conceived of.
We've developed screen-scraper to automatically handle many of the goings-on over HTTP so you won't have to. By default, try trusting that it will and only intervene if it proves that it can't.
-Scott
Repeats login
Thanks for the detailed information. I would guess that you could do this quickly. Unfortunately I added the originalSession stuff and it still pops up the Login again. Probably something small is tripping me up but it looks like I'm not going to succeed here. By the way, how did you happen to choose originalSession? Which variables would be candidates for this treatment?
Gary,
Below is a link to your scraping session that I modified slightly to include the script that sets the cookie manually.
http://community.screen-scraper.com/files/support/xfer/ASDA.zip
That should work. Now, when you go to view the last response in your browser you're going to get a blank page. That doesn't mean that it's the wrong page, just that something is causing it not to display in your browser. Try disabling any onload calls and setting base href= in between the head tags.
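One way to try the two suggestions above on a saved copy of a response is sketched below in standalone Python. The function name prepare_for_local_view is hypothetical, the base URL is a parameter you would supply yourself, and the regular expressions are a rough approximation that only handles quoted onload attributes.

```python
import re


def prepare_for_local_view(html, base_url):
    """Strip quoted onload attributes and add a <base href> inside <head>."""
    without_onload = re.sub(
        r"""\sonload\s*=\s*(".*?"|'.*?')""",
        "",
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )
    base_tag = '<base href="%s">' % base_url
    # Insert the base tag right after the opening <head> tag (not <header>).
    return re.sub(
        r"(<head(?:\s[^>]*)?>)",
        lambda match: match.group(1) + base_tag,
        without_onload,
        count=1,
        flags=re.IGNORECASE,
    )


page = '<html><head><title>t</title></head><body onload="init()">x</body></html>'
cleaned = prepare_for_local_view(page, "https://example.com/")
print(cleaned)
```

With a base href in place, relative links to stylesheets and images resolve against the original site rather than the local file.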
Otherwise, just scan the document for what you want to extract and have at it.
The site is definitely trying to obfuscate any scraping attempts, so you may want to prepare yourself for future hurdles to leap.
-Scott
Correct link?
This doesn't seem right. Did you send the wrong link? That link is for ASDA, which seems to be a grocery store. (We're looking at the Fannie Mae Oak Bank site.)
Gary,
I apologize. I confused your post with a different one I was helping with.
I'm guessing the reason it is not working for you has to do with the first scrapeable file you're using. For my test, I set the first page to be simply:
https://committing.efanniemae.com/eCommitting/eCommitting
With no get/post data being passed. Doing this gives the server a chance to set a cookie and redirect you (if necessary) as though it is the first time you've come to the site. When proxying be sure to clear your cookies and cache so it really is as though it's your first time coming to the site.
Have a look at the following. You'll see that I'm trying to keep the test as simple as possible: Start at the front door and attempt to log in. Be sure to add your login credentials under the parameters tab of the first scrapeable file.
http://community.screen-scraper.com/files/support/xfer/Frank-Gary.zip
Once the first cookie is set screen-scraper should continue to respond to and pass back any subsequent cookies, provided they are not being set in Javascript.
-Scott
Achieved scraping BLISS on login. Continuing on...
I can't tell you how much I appreciate your posts that added pieces to the puzzle. The scraping session finally logged in. YEAH, it logged in! Next I'll add scrapeable files to do the rest of it.
What exactly do you mean by respond to and pass back subsequent cookies? Do you mean that if a cookie is created in a transaction, each following scrapeable file should have a script that sets the same value to the cookie?
Gary,
That's good news.
Regarding cookies, I actually meant the opposite. Once you get the ball rolling by setting a cookie manually, the server will respond back with that cookie, screen-scraper will automatically pick it up and the hand-off back and forth will happen automatically within screen-scraper.
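The hand-off described above can be pictured with a toy cookie store, sketched here in standalone Python. This is only an illustration of the general mechanism and not screen-scraper's actual implementation. The class name CookieJarSketch and both cookie values are made up.

```python
class CookieJarSketch:
    """Toy cookie store: remembers Set-Cookie values and replays them."""

    def __init__(self):
        self.cookies = {}

    def store_from_response(self, response_headers):
        # Pick up every Set-Cookie header and keep only the name=value pair.
        for name, value in response_headers:
            if name.lower() == "set-cookie":
                pair = value.split(";", 1)[0]
                cookie_name, _, cookie_value = pair.strip().partition("=")
                self.cookies[cookie_name] = cookie_value

    def request_header(self):
        # Build the Cookie header to send with the next request.
        return "; ".join(f"{name}={value}" for name, value in self.cookies.items())


jar = CookieJarSketch()

# Step 1: the cookie is set once by hand to get the ball rolling.
jar.cookies["JSESSIONID"] = "manually-set-value"
first_request = jar.request_header()

# Step 2: the server answers with its own Set-Cookie, which is picked up.
jar.store_from_response([
    ("Content-Type", "text/html"),
    ("Set-Cookie", "JSESSIONID=server-issued-value; path=/; HttpOnly; secure"),
])
second_request = jar.request_header()

print(first_request)
print(second_request)
```

After the first response, no further manual step is needed in this sketch: whatever the server sends is what goes out on the next request.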
-Scott