Appending data set to existing Token Variable
This appears in the log when running a scraping session:
FullProductDetailsPage_01: Extracting data for pattern "MainGrab"
FullProductDetailsPage_01: Saving data set "MainGrab" into a session variable.
FullProductDetailsPage_01: Appending data set "MainGrab" to an existing data set.
FullProductDetailsPage_01: Number of data records for data set "MainGrab" is now 2 <---- This is the problem
FullProductDetailsPage_01: The following data elements were found:
When I run my sessions I use the special token ~@DATARECORD@~ named "MainGrab" to define the section holding the data I'm after, and then use sub-extractor patterns to get the details. You'll note in the log above that it says it is appending the info to "MainGrab", and on the next line it says there are 2 data records for the data set "MainGrab". This number increases by 1 with each new scrape. The site I am scraping has over 270,000 products and I'm not sure how much of a performance hit this is costing me. Don't I really want to overwrite the existing data? It seems to be functioning, but as the scrape gets into the tens of thousands of items, is the above really adding a couple of hundred lines of HTML to the "MainGrab" record set with each product and keeping some bloated monstrosity of a token in memory to access the data I'm after?
The scrape does appear to get slower with each item.
I am also missing the last 2 checkboxes under the advanced tab, for filtering duplicates and caching data. I have a registered copy of Professional (license fee paid).
Your log makes it look like you've got the first checkbox on the advanced tab (on your MainGrab pattern) checked, which reads "Automatically save the data set generated by this extractor pattern in a session variable".
Normally you don't want this, so it is off by default. Can you verify whether you have it checked or unchecked?
Those two checkboxes that are dimmed out should only light up when the above checkbox is marked.
Normally the way an extractor pattern works is fairly simple, and is closely related to 'scope' in programming. (Not sure what your programming background is -- if you need clarification, just ask away.) Assuming that you're hitting that scrapeableFile for each page of search results, the whole dataSet is wiped out at the end of each page request. The dataSet is only meant to be available until the pattern is completely done matching. Once it moves on to the next pattern, it discards the dataSet. screen-scraper expects you to do something with each record of the dataSet on your own.
That being the case, screen-scraper shouldn't (by default) save all that data throughout the entire scrape.
So, as long as you're writing your data to a CSV, or saving it to a database, or sending it to some website, or whatever, then you certainly don't want to have that first Advanced-tab checkbox marked.
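For a concrete example, here's a rough sketch (Interpreted Java) of the kind of script you could run after each match of the MainGrab pattern (so that "dataRecord" is in scope) to append each record to a CSV as it's extracted. The file path and the token names PRODUCT_NAME and PRICE are just placeholders -- substitute whatever your sub-extractor tokens are actually called:

import java.io.FileWriter;

// dataRecord holds the values from the sub-extractor tokens of the current match.
FileWriter out = new FileWriter("products.csv", true);  // true = append to the file
out.write(dataRecord.get("PRODUCT_NAME") + "," + dataRecord.get("PRICE") + "\n");
out.close();

That way each record is safely on disk the moment it's matched, and nothing needs to accumulate in memory.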
As for the scrape slowing down as it goes on, it might have something to do with how you iterate pages. If you are simply detecting the link to the "next page" of search results, and then calling a script on that pattern which just runs the same scrapeableFile again, you might have a problem. The simple explanation is that if you're recursively calling the same scrapeableFile, all efforts to conserve memory will be rendered useless, and you will likely run into problems on a site as big as the one you want to scrape.
I'm probably going to make a page on this site covering this topic, since it's an easy thing to fall into. The topic itself is covered here:
http://community.screen-scraper.com/node/1091
It's specifically dealing with page iteration that goes in offsets (0, 20, 40, etc), but can easily be adapted into something simpler.
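Just to give you the flavor of it, here's a bare-bones sketch (Interpreted Java) of a control script that iterates by offset and calls the scrapeable file itself each time through the loop. The file name "Search results", the OFFSET variable, and the upper bound of 540 are all placeholders you'd adapt to your own session:

// Assumes the "Search results" scrapeable file references ~#OFFSET#~ in its URL or parameters.
for (int offset = 0; offset <= 540; offset += 20)
{
    session.setVariable("OFFSET", String.valueOf(offset));
    session.scrapeFile("Search results");
}

In a real scrape you'd typically break out of the loop when a "no more results" condition is detected rather than hard-coding the upper bound.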
I'll stop there to let you have a look at those things. If there's any trouble or further questions, by all means let me know!
Tim
Unchecking Box helped & Loop saved memory
Thanks Tim. Unchecking the box stopped it from adding to the data set. It appears to be writing a new record with each "MainGrab", which is what I wanted.
As for the memory gobble, I went to the linked page and while it does address how to iterate by 20... (oddly enough the exact number I need)... I'm not sure how this addresses memory conservation.
Anyway, one thing at a time. I'll address my need to increment a variable by 20 without having to read the values from a text file. I hope you will soon post further tips on memory.
Update: I used a simple loop similar to what you linked and it did wonders for memory. The memory usage indicator at the bottom of the log page was running at over 40% when I was reading values from 3 text files while running 3 simultaneous scrapes. Now it's at 17%, and processor usage has dropped from the mid-70s to roughly half that.
Question: Does the memory usage indicator reflect a percentage of the memory I have allocated in the settings, or of total system memory?
Just made that page on the memory indicator
http://community.screen-scraper.com/documentation/misc/memory_usage_indicator
That last part about the memory percentage is a great question, and as a new feature it hasn't been properly documented yet.
It reflects only the memory used out of the memory you've allocated to Java. By default screen-scraper reserves 256MB of RAM for the JVM. So while your operating system may always show the full 256MB as used, screen-scraper is reporting how much of that allocation is actually in use.
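If you ever want to watch the same numbers from inside a script, this snippet (plain Java plus session.log, so it should drop into any screen-scraper script) reports how much of the allocated heap is actually in use:

// Standard JVM calls -- nothing screen-scraper-specific except the logging.
Runtime rt = Runtime.getRuntime();
long usedMB = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
long maxMB = rt.maxMemory() / (1024 * 1024);
session.log("Heap in use: " + usedMB + "MB of " + maxMB + "MB allocated to the JVM");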
The simple loop shown in that other thread lets you scrape a page and then release it from memory.
If you just keep calling a script "After file is scraped" to execute the same scrapeableFile, those pages will never get released from memory until the whole chain is finished.
Maybe this semi-visual representation will help:
// the non-loop "recursive" approach:
search results for category "A"
|- next results
|- next results
|- next results
|- next results
search results for category "B"
|- next results
|- next results
|- next results
|- next results
|- next results
|- next results
// Now here's the for-loop "iterative" approach, via a single control script:
search results for category "A"
next results
next results
next results
next results
search results for category "B"
next results
next results
next results
next results
next results
next results
The point is that the iterative approach doesn't "stack". Instead, it's able to release memory as soon as it's done with a page. The non-iterative "recursive" approach can't actually let go of all those pages until the last one finishes.
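In script terms, the "stacking" version is usually nothing more than a script run "After file is scraped" (or on the "next page" pattern match) that re-invokes the same file, so every call sits and waits on the one it spawned:

// The pattern to avoid on a large site: each page stays on the call stack
// until the very last one returns.
session.scrapeFile("Search results");

A control-script loop like the sketch in my earlier reply does the same work, but returns from each page before requesting the next, so the memory for each page can be reclaimed as soon as it's been handled.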
Hope that helps!
Tim