Next page and memory loss

I've just managed to get my first site scraped and all is well, except that SS almost chokes itself as the scrape runs. It's a real estate listings site with around 2500 listings that take about 3 hours to run. I have read the post regarding the caching of Next pages at http://blog.screen-scraper.com/2008/07/07/large-data/ but the navigation on this site is different and contains neither a "next batch" link nor a "max pages" value.

The navigation looks like: << 1 2 3 4 Page 5 6 7 8 9 >> where "<<" and ">>" are the previous and next page links respectively, and "Page 5" is the active page.

I have used the approach in tutorial 2 and am only caching the Next page. I am using "Save to session variable" as sparingly as I can and have very few items in the running cache.

Any hints on how to release memory?

Best,
Johan

While the technique found in

While the technique found in the tutorial works, it is not well suited to hundreds of pages. I have written a sample script that iterates more efficiently. The reasoning is explained in the description on that page, so you can follow exactly what is different. In short, it has to do with avoiding recursion.

Let me know if you need any pointers about how to use the script:

http://community.screen-scraper.com/Next_Page_Memory_Conscious
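In case it helps to see the shape of the idea, here is a minimal sketch in Interpreted Java. The names OFFSET and HAS_NEXT_PAGE, the scrapeable file name "Search results", and the step values are placeholders for illustration, not the ones in the actual sample script. A single controlling script loops over the pages and calls the results file from a flat loop, so no call stack builds up from page to page.

// Hypothetical controlling script; run it once to drive the whole scrape.
// Paging is done in a flat loop instead of having the results page call
// itself, so memory from earlier pages can be released as the loop goes.
int initialOffset = 0;   // offset of the first page (assumed)
int offsetStep = 10;     // listings per results page (assumed)
int maxPages = 250;      // safety cap (assumed)

for (int page = 0; page < maxPages; page++)
{
    int offset = initialOffset + (page * offsetStep);
    session.setVariable("OFFSET", String.valueOf(offset));
    session.log("Requesting results at offset " + offset);

    // "Search results" stands in for the real scrapeable file name.
    session.scrapeFile("Search results");

    // An extractor pattern on the results page could set HAS_NEXT_PAGE
    // only when the ">>" link is present; stop once it goes missing.
    if (session.getVariable("HAS_NEXT_PAGE") == null)
        break;
    session.setVariable("HAS_NEXT_PAGE", null);
}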

Tried your example on a new

Tried your example on a new site that has a very long run of pages in a row - around 2500 - but I keep running into memory problems. I think I have read almost every post written about this on the forum but still find no way of plugging the memory leak.

Some other things known to

Some other things known to cause memory issues are websites that pass an ungodly amount of POST data back and forth... If that's the case, those POST values are usually unimportant to the site and can be removed. This is particularly true of POST values that consist of senseless blocks of HTML!

Recursion is usually the enemy in memory battles... are there any other spots in the scrape's general flow that call scrapeable files from other scrapeable files? Sometimes it's hard to avoid, but it's best to steer clear of it wherever possible.
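For anyone following along, the recursive shape being described usually looks something like this sketch (Interpreted Java; the scrapeable file name and session variable are placeholders). A script attached to the results page fires after a "next" match and calls the same scrapeable file again, so each page is scraped inside the previous one and nothing can be released until the whole chain unwinds.

// Hypothetical "next page" script attached to the results page itself.
// Each call happens inside the one before it, so every page adds another
// frame (and its state) to the stack; this is the pattern to avoid.
if (session.getVariable("NEXT_PAGE_URL") != null)
{
    session.setVariable("NEXT_PAGE_URL", null);
    session.scrapeFile("Search results");   // re-invokes the same file
}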

Also, do you notice that the memory fills up only after a certain area of the scrape starts to run?

Tim

Thanks for the answer

Thanks for the answer Tim.

There aren't any large amounts of HTML being sent back and forth in this scrape. It's pretty similar to other scrapes that run fine without any memory problems. I haven't noticed any specific area filling up the memory; it ticks up steadily until 100% is reached and the scrape chokes.

/Johan

I did receive your note via

I did receive your note via email. I will take a quick look at it and try to figure out what the culprit might be.

Tim

That's really appreciated! On

That's really appreciated!

On the same subject, I am using the "memory conscious next page" on another scrape I am working on, but for some reason the offset does not increase as the pages are scraped. I am using the example script and have only changed the initialOffset and offsetStep.

What could I be missing this time?

/Johan

Thanks for that Tim!

Thanks for that Tim!

just a left field idea...

just a left field idea... you're not running this session in the workbench with no limit on the log screen, are you? Especially if you've got the log set to debug mode, this will generate a HUGE amount of data that is kept in memory. Set the max lines to 100, or a couple of thousand if you want to be able to look back. This shouldn't be an issue in server mode, though, since it dumps the log straight to a file and doesn't need to hold it in memory...

I've managed to hit 100% memory usage (500 MB allocated) in about half an hour to an hour when I forget to limit the log size.

Thanks for the hint Shadders

Thanks for the hint Shadders, but it's not that. Maybe it is still related to the next-page thing. I'm trying to get that "memory conscious" page rolling, but it won't move on through the pages for some reason.

/Johan

The solution

After spending countless hours hunting down the memory monster, I tried turning on "Reject cookies" and voila! Everything returned to normal and the scrape is running fine.