KeepAlive Process
Hi,
I am trying to scrape a website to obtain job information. I start the Proxy and record the session. I have created my scraping session with scripts and scrapeable files. But when I try to Run Scraping Session it doesn't scrape the information. Instead I get the initial search page for the site. In the Last Response for the Search Results scrapeable file I get the following information:
IgnorePortalRegisteredURL: 1
X-Powered-By: ASP.NET
Transfer-Encoding: chunked
Content-Encoding: gzip
Connection: close
X-Powered-By: Servlet/2.4 JSP/2.0
PortalRegisteredURL: https://employment.wellsfargo.com/psc/PSEA/APPLICANT_NW/HRMS/c/HRS_HRS.HRS_APP_SCHJOB.GBL
Content-Type: text/html; CHARSET=UTF-8
Set-Cookie: PS_TOKENEXPIRE=12_Dec_2008_16:12:40_GMT; path=/; secure
Date: Fri, 12 Dec 2008 16:12:47 GMT
Set-Cookie: BIGipServerPS_89_WebLogic_Pool=4229584906.25627.0000; expires=Fri 12-Dec-2008 16:43:36 GMT; path=/
UsesPortalRelativeURL: true
Server: Apache/2.2.9 (Unix) DAV/2 mod_ssl/2.8.31 OpenSSL/0.9.8h
Expires: Thu, 01 Dec 1994 16:00:00 GMT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir='ltr' lang='en' xmlns="http://www.w3.org/1999/xhtml">
<!-- Copyright (c) 2000, 2007, Oracle. All rights reserved. -->
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<script type="text/javascript" language='JavaScript'>
var totalTimeoutMilliseconds = 2073600000;
var warningTimeoutMilliseconds = 2073600000;
var timeOutURL = 'https://employment.wellsfargo.com/psc/PSEA/APPLICANT_NW/HRMS/?cmd=expire';
var timeoutWarningPageURL = 'https://employment.wellsfargo.com/psc/PSEA/APPLICANT_NW/HRMS/s/WEBLIB_TIMEOUT.PT_TIMEOUTWARNING.FieldFormula.IScript_TIMEOUTWARNING';
</script>
The Job Detail scrapeable file then doesn't pull up anything either, and it states that no matches were made for the Extractor Patterns. When I look at Display Response in Browser it brings up a blank page. The following is what is displayed in the Job Details Last Response:
Server: Apache/2.2.9 (Unix) DAV/2 mod_ssl/2.8.31 OpenSSL/0.9.8h
Cache-Control: no-cache
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Set-Cookie: PS_TOKEN=; path=/; secure
X-Powered-By: ASP.NET
Set-Cookie: BIGipServerPS_89_WebLogic_Pool=4229584906.25627.0000; expires=Fri 12-Dec-2008 16:43:38 GMT; path=/
Set-Cookie: PS_LOGINLIST=-1; path=/; secure
X-Powered-By: Servlet/2.4 JSP/2.0
Date: Fri, 12 Dec 2008 16:12:49 GMT
Content-Length: 212
Set-Cookie: PS_TOKENEXPIRE=-1; path=/; secure
Set-Cookie: ExpirePage=; path=/; secure
Content-Type: text/html; CHARSET=utf-8
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title></title>
</head>
<body onload="self.close()">
</body>
</html>
Is there a KeepAlive option that I can use - is there a sample of it somewhere? The previous programmer seemed to have a problem with this issue on this site as well but didn't get a chance to resolve the issue - so it is up to me now. Any help you can provide would be greatly appreciated, a link, sample code, etc.
Thank you,
Time out errors
Dear needhelpplease,
While using screen-scraper I've run into a variety of timeout errors. Most of the time I develop an extractor pattern that matches the timeout page, and when it matches I go through the process of logging back into the site. It looks like there is actually an "OK" button on the timeout warning page. If you proxy what gets submitted when you press that "OK" button, you could see what information you need to pass to keep your session current. If you can simply keep your session current, you might avoid the headache of signing back in and trying to pick up where you left off.
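To make the detection idea concrete, here is a rough sketch in plain Python (not screen-scraper's own scripting API). It only uses markers visible in the Job Details response quoted above: the `PS_TOKEN=` cookie being set to an empty value and the `<body onload="self.close()">` page. The function name `session_expired` and its arguments are made up for illustration, and I'm assuming those two markers really do signal an expired session on this site.

```python
import re

# Markers copied from the Job Details Last Response quoted in this thread.
# Assumption: an emptied PS_TOKEN cookie or a self-closing body means the
# server has dropped the session.
CLOSE_WINDOW_BODY = re.compile(
    r"<body[^>]*onload=[\"']self\.close\(\)[\"']", re.IGNORECASE
)
CLEARED_TOKEN = re.compile(r"^PS_TOKEN=\s*;", re.IGNORECASE)


def session_expired(set_cookie_values, body):
    """Return True when a response looks like the expired-session page.

    set_cookie_values: list of Set-Cookie header values (strings).
    body: the response body as text.
    """
    token_cleared = any(CLEARED_TOKEN.match(v.strip()) for v in set_cookie_values)
    window_closed = bool(CLOSE_WINDOW_BODY.search(body))
    return token_cleared or window_closed
```

When a check like this comes back true, that would be the point to re-run the login or search request before retrying the page that failed.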
Otherwise I'd save a variable in session scope that would allow me to identify exactly where I was along my scraping path and then try to devise a way to get back there without wasting time going through what you've already been through. Perhaps page numbers or job listing numbers could be something you could match against to get back to the spot you were at.
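A minimal sketch of that checkpoint idea, again in plain Python rather than screen-scraper script. The dictionary stands in for session scope, and the key `LAST_PAGE_DONE` and both function names are hypothetical. It assumes the results can be addressed by page number.

```python
def mark_page_done(session, page_number):
    """Record the last results page that was fully scraped."""
    session["LAST_PAGE_DONE"] = page_number


def pages_to_visit(session, total_pages):
    """Return the page numbers still to scrape, skipping completed ones."""
    start = session.get("LAST_PAGE_DONE", 0) + 1
    return list(range(start, total_pages + 1))


# Example: pages 1 and 2 were finished before the session expired.
session = {}
mark_page_done(session, 1)
mark_page_done(session, 2)
remaining = pages_to_visit(session, 5)  # pages 3, 4 and 5 are left
```

After logging back in, you would loop over only the remaining pages instead of starting again from page 1. The same pattern works with job listing numbers if the site does not expose page numbers.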
Any other questions or thoughts?