Increasing sub-extractor patterns = decreasing reliability?
Hi there,
In one of my scrapes, I pass data via a DATARECORD to a set of sub-extractor patterns. Because the site I'm scraping a) may provide data in 2 languages (English/French) and b) scrambles the information I'm trying to extract among differing arrangements/configurations, I've set up an increasing number of sub-extractor patterns to handle each arrangement/configuration. (I'm at 14 patterns and counting...) What I've noticed is that as I add more sub-extractor patterns, SS becomes less stable.
More specifically, when I add a new pattern, SS usually does one of the following:
- freeze (requires forced quit & restart)
- refuse to display the Edit Token dialog on a right-click (I may see the context menu for a second)
- garble text in the sub-extractor pattern so that as I add/modify tokens within the pattern, previous text either disappears or is replaced by tokens where they should not be (this is rare)
I've found that the cut-off point for my scrapes seemed to be around 5-6 sub-extractor patterns. Once I passed this limit, the problems I described above happened more often. As a result, adding the last few new patterns to this particular scrape has been very difficult and time-consuming.
A couple of additional notes that might be helpful:
- I've seen the same behavior on my Mac G4 Powerbook and Windows desktop, so I don't think it's platform-related
- SS has 512MB dedicated to it on both systems, although the overall RAM differs (1GB on Windows, 2GB on Mac)
- the SS reliability problems seem to occur right after you click "Add sub-extractor pattern"; once the sub-extractor pattern is entered, it works perfectly
- I'm running SS 5.5, basic edition
On a related topic, is there another way of handling cases with a lot of different data permutations/combinations, one that would either reduce the number of sub-extractor patterns or avoid them altogether?
Thanks in advance for any help you can provide!
Regards,
Justin
I've made extractors with lots and lots of sub-extractors, so I'm surprised you have this issue. What version of screen-scraper are you on? Is there anything in the error log?
As for the logistics, the situation you describe is the very reason sub-extractors are there, so it sounds like you're on the right track if we can just keep you from locking up.
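For what it's worth, here is a rough sketch of the general idea in plain Python rather than in screen-scraper itself. It is not screen-scraper's API, and the field labels and names (Price/Prix, Colour/Couleur, `SUB_PATTERNS`, `extract_fields`) are made up for illustration. The assumption is that each small pattern targets one field and accepts either language's label, so the order of the fields within the record stops mattering and you may need fewer patterns than one per arrangement.

```python
import re

# Hypothetical sub-patterns: each one targets a single field and accepts
# either the English or the French label, so field order is irrelevant.
SUB_PATTERNS = [
    re.compile(r"(?:Price|Prix)\s*:\s*(?P<price>[\d.,]+)"),
    re.compile(r"(?:Colou?r|Couleur)\s*:\s*(?P<color>\w+)"),
]


def extract_fields(record):
    """Apply every sub-pattern to one data record and merge the named groups."""
    fields = {}
    for pattern in SUB_PATTERNS:
        match = pattern.search(record)
        if match:
            # Only fields that actually matched end up in the result.
            fields.update(match.groupdict())
    return fields
```

Calling `extract_fields("Couleur: rouge | Prix: 12,50")` and `extract_fields("Price: 9.99 | Color: red")` should both yield a `price` and a `color` entry, even though the language and the field order differ.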
Answer to your questions
Hi Jason,
Thanks for the reply. In response to your questions:
- the version I'm using is SS 5.5 Basic on both Mac & PC;
- when I look at the error logs, I see issues with authentication, but I'd expect to see these errors as our corp. firewall is pretty anal when it comes to letting SS out of our network...that said, our IT has supposedly fixed this problem.
Here's a question: when you create a new extractor or sub-extractor pattern, does SS immediately try to apply it to the last response's HTML in the background, or does it wait until you click Test Pattern?
Regards,
Justin
You need to click the test button.