OutOfMemoryError

I get an OutOfMemoryError when using the method scrapeableFile.saveFileOnRequest( filePath ).

My max memory allocation in settings is set at 1024 MB.

My session runs fine without the saveFileOnRequest method, but if I try to save the files (which are .jpg) via saveFileOnRequest, the memory jumps to 100% within a couple of seconds.

Here is the sequence:
Category Page (after each pattern match) -> Product Page

Product Page (after each pattern match) -> Details Page
Product Page (once if pattern match) -> Next Page

Next Page (after each pattern match) -> Details Page
Next Page (once if pattern match) -> Next Page

Details Page(once if pattern match) -> Image Page

Image Page is where saveFileOnRequest is being called, in a script that runs before the file is scraped.
This is where the memory error occurs.

Is it a huge file you're

Is it a huge file you're trying to save? That might explain it. I've had one case where I was trying to download a zipped file that was several hundred megabytes (if not more), and it needed more memory than I had.

If that could be the issue, let me know if I can look at it so I can see if there's a workaround.

File Size is 1-5mb

The file size is only 1-5 MB.

Test-Scraper, scrapeableFile.

Test-Scraper,

scrapeableFile.saveFileOnRequest should only be called from a script that gets invoked by the scrapeable file whose content you want to download. You should not call the method from a script that gets invoked after an extractor pattern match.
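For illustration, here is a minimal sketch of the kind of script that would be attached to the image scrapeable file and set to run "Before file is scraped" (the output path and the "id" session variable are just assumptions for this example, not details from your scraping session):

// Runs as a "Before file is scraped" script on the image scrapeable file.
// The output path and the "id" session variable name are placeholders.
String id = "" + session.getVariable( "id" );
scrapeableFile.saveFileOnRequest( "c:/images/" + id + ".jpg" );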

Also, are you sure you need to use scrapeableFile.saveFileOnRequest? We recommend using the session.downloadFile method unless the data to be downloaded requires that POST data be passed in the request.

-Scott

Still confused

Basically I have a scrapeable file named image_page that has a script called save_file_on_request. This script is called before the file is scraped, and it contains the scrapeableFile.saveFileOnRequest method. The scrapeable file itself is passed a GET parameter named id; this is the variable that changes with each image.

I tested it and it works with only some IDs.

For example:
It works if I pass it id=1 and it downloads fine.
If I pass it id=2, the memory shoots up to 100%.
If I pass it id=3 it downloads fine, and so on.

I've tried putting the URL straight into the browser and it was able to download fine, even the image that made the memory jump to 100%.

I then tried to use the session.downloadFile method, but it gave me this error: "ERROR: Failed to retrieve the file: http://example. The error message was: Circular redirect to 'http://example:80/index.cfm'."

The URL looks like this:
http://example.com/index.cfm?param1="value1"&param2="id", where param1 is constant and param2 changes.
When the URL of the image is entered into the web browser, a popup appears asking whether I want to save the image. I think this is why session.downloadFile doesn't work, unless there is a way around it.

It looks like there should be some POST data, but the proxy doesn't show it.
I took a look at the last request of the image_page and there is the usual stuff and a cookie as follows:
Cookie: CFID=%var1%; CFTOKEN=%var2%; LAST_LOGIN=%var3%, where each %var% contains data.

Some additional notes:
This site requires a login.
The images are <5 MB in size.
I am able to scrape the rest of the site fine; it's just the image part that fails.

So after all of this I'm still confused about why it works for some but not for others. Any help is appreciated.

Test_Scraper, If you're using

Test-Scraper,

If you're using either Professional or Enterprise Edition, I recommend you switch to session.downloadFile. This is because it's likely you don't need to pass anything in the POST payload of your request for the image. You can confirm this by proxying the site using Charles Proxy, which is a bit more sophisticated at detecting XMLHttpRequests (which may be the POST data you're suspicious of).

If there is no POST data in the request, then the session.downloadFile approach is a bit different from scrapeableFile.saveFileOnRequest.

Instead of using a scrapeable file, you'll download your file in a script that is usually called "After each pattern match" of an extractor pattern.

In the script you'll just reconstruct the URL to the image file like so:

session.downloadFile("http://example.com/index.cfm?param1=value1&param2=" + session.getv("id"), "c:/" + session.getv("id") + ".jpg",3,true);

The last parameter works only in Enterprise Edition. It tells screen-scraper to download the file in its own thread, which allows your scraping session to carry on while each image is downloaded.

On a possibly related note: in your example URL you have double quotes around the values in the query string. Double quotes are unusual to find in a URL and are technically illegal (they should be URL encoded). Did you use them in your example by mistake? If not, it's not likely that they are causing the problem, but stranger things have happened.
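As an aside, if any of the real parameter values ever do contain characters that need escaping, they can be encoded in the script before the URL is built. This is just a generic Java sketch with a made-up value, not something taken from your scraping session:

import java.net.URLEncoder;

// Hypothetical value; in practice it would come from a session variable.
String rawValue = "some value with spaces";
String encoded = URLEncoder.encode( rawValue, "UTF-8" );
session.log( "Encoded parameter: " + encoded );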

Let me know if you're using Basic Edition and we can walk through some scenarios to try and troubleshoot this.

-Scott

Enterprise Edition

Scott,

I have the Enterprise Edition.

I just ran it through Charles, and it does not include POST data.

The quotes in the example were just for illustration; the actual URL does not have quotes.

When I use session.downloadFile("http://example.com/index.cfm?param1=value1&param2=" + session.getv("id"), "c:/" + session.getv("id") + ".jpg"); it gives me this error:

ERROR: Failed to retrieve the file: http://example/index.cfm?param1=value1&param2=2714. The error message was: Circular redirect to 'http://example:80/index.cfm'.

The URL of the image is somehow redirecting me back to the home page, I think.

That is the reason why I tried the scrapeableFile.saveFileOnRequest method.
It works, but as I posted earlier, it only works for some images; for others the memory goes up to 100% and it gives me the heap memory error.

Next, I would recommend using

Next, I would recommend using the "Compare with Proxy Transaction" button under the Last Response tab. You'll need to do this for a run where the image is being requested as a scrapeable file.

If you had previously proxied your site with the "Don't record binary files" flag set then you'll want to proxy your site again with that unchecked.

Once you're comparing your transaction with its counterpart proxy transaction you'll be looking to see if your scraping session is passing cookies correctly and that the correct referrer is being used.

If the cookies are not correct then you can use session.setCookie to manually set them. If your referrer is not correct, simply request the correct referrer as a scrapeable file prior to calling session.downloadFile.
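For example, here is a minimal sketch of setting the cookies mentioned earlier by hand (the domain and the source of the values are assumptions for illustration, not details from your site; the real values would come from the proxy transaction):

// Domain and values are placeholders; copy the real ones from the proxy transaction.
session.setCookie( "example.com", "CFID", "" + session.getVariable( "CFID" ) );
session.setCookie( "example.com", "CFTOKEN", "" + session.getVariable( "CFTOKEN" ) );
session.setCookie( "example.com", "LAST_LOGIN", "" + session.getVariable( "LAST_LOGIN" ) );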

-Scott

Still not working

Scott,

I compared it with the proxy transaction and the referrer and cookies are correct.

The image page has this in the Last Response tab:

HTTP/1.1 200 OK
Content-Type: application/unknown
Transfer-Encoding: chunked
Date: Thu, 10 Feb 2011 00:25:42 GMT
Server: Apache/2.2.14 (Unix) mod_ssl/2.2.14 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 JRun/4.0
Content-Disposition: attachment; filename=12345_rgb.jpg
Expires: {ts '2011-02-09 19:25:42'}

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">






[Binary Data]

The url is as follows:
http://site.com/index.cfm?go=product_files.show_content&file_id=1111

It is still redirecting me back to the login page.

This is the error:
ERROR: Failed to retrieve the file: http://site.com/index.cfm?go=product_files.show_content&file_id=1111. The error message was: Circular redirect to 'http://site.com:80/index.cfm'.

Test-Scraper, Please send me

Test-Scraper,

Please send me an email (scottw [@] screen-scraper). You may be experiencing a bug in HTTPClient but we will need to reproduce it in order to know for sure.

Thanks,
Scott

Test_Scraper, I apologize for

Test_Scraper,

I apologize for the delay. This has been fixed and is available in the latest alpha release.

Follow these instructions to update your copy to the latest alpha release.

The solution includes a notice when one attempts to wrap the HTML of a page which has at least one continuous line of HTML too long to wrap. The notice says, "Unable to wrap text for this page, part of the HTML contains a string that is too long".

Thanks,
Scott