Leaving files open
We are calling screen-scraper from within a web application running on Tomcat, so it is a long-running process. We are scraping
files from sites and downloading them using session.downloadFile.
Some of our scraping sessions download many documents, and we are finding that after the files are downloaded they are still open, so eventually we get a "Too many open files" exception. My question is: why are the files being left in an open state after download? Is this a bug in session.downloadFile?
Allow me to refer you to
Allow me to refer you to this FAQ: http://community.screen-scraper.com/faq#80n1313
rhelsen49, I'll bet this
rhelsen49,
I'll bet this issue is related to the other one you're having, where your downloads complete but throw an error at the same time. It may be that a file handle is being left open for each attempted download.
A possible workaround would be to call HTTP Client directly in your script (rather than using a screen-scraper method). Here is some sample code we use to issue a HEAD request for a file and check the Content-Type found in the header.
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;
import org.apache.commons.httpclient.protocol.Protocol;
import org.apache.commons.httpclient.contrib.ssl.EasySSLProtocolSocketFactory;

urlString = session.getVariable( "URL_TEMP" );
session.setVariable( "URL_TEMP", null );

session.log( "Checking content-type for: " + urlString );

// Create a HEAD method instance.
HeadMethod method = new HeadMethod( urlString );

// Provide a custom retry handler if necessary.
method.getParams().setParameter
(
    HttpMethodParams.RETRY_HANDLER,
    new DefaultHttpMethodRetryHandler( 3, false )
);

try
{
    HttpClient client = new HttpClient();
    session.setProxySettingsOnHttpClient( client, client.getHostConfiguration() );

    HostConfiguration hostConfiguration = new HostConfiguration();
    try
    {
        URL url = new URL( urlString );
        if( url.toString().startsWith( "https" ) )
        {
            // Use the "easy" SSL socket factory so self-signed certificates don't fail.
            Protocol easyHTTPS = new Protocol( "https", new EasySSLProtocolSocketFactory(), 443 );
            hostConfiguration.setHost( url.getHost(), 443, easyHTTPS );
        }
        else
        {
            hostConfiguration.setHost( url.getHost() );
        }
    }
    catch( MalformedURLException mfue )
    {
        session.log( "MalformedURLException: " + mfue );
    }

    // Execute the method against the host configuration built above.
    int statusCode = client.executeMethod( hostConfiguration, method );
    if( statusCode != HttpStatus.SC_OK )
    {
        throw new Exception( "Received status code: " + statusCode );
    }

    // Retrieve just the Content-Type header value.
    String contentType = method.getResponseHeader( "Content-Type" ).getValue();
    session.log( "contentType: " + contentType );
    session.setVariable( "CONTENT_TYPE", contentType );
}
catch( Exception e )
{
    throw e;
}
finally
{
    // Always release the connection so its handle is freed.
    method.releaseConnection();
}
Hopefully this will give you something to work with.
-Scott
Chunked followup
Hi Scott,
Thanks for the sample, but when I try to download the file by calling getResponseBodyAsStream and buffering the whole response in memory, the stream seems to be empty. I think HttpClient tries to consume the response in its entirety so it can reuse the connection, but it's finding an issue with the content (maybe malformed) and can't reuse the connection. Any thoughts or additional workarounds?
Thanks,
Rick
Rick, Am I correct in
Rick,
Am I correct in assuming that this problem is related to the other issue you're having with the files being corrupt? Let us know if you need any further help beyond what Todd has sent you over email.
If you have found a solution, it would be generous of you to post it here for the rest of the community (once the dust settles around you, of course).
-Scott