How might I deal with two identical patterns?
apm,
I agree that it's not the prettiest way to go about it but it works. I can't spend much more time on it at the moment. If anyone else out there has any ideas please feel free to share.
-Scott
How might I deal with two identical patterns?
This didn't really work in my situation, again because your example is premised (as far as I can tell) on a consistently different layout for your data. I just can't see how to make it work in my situation. I ended up making four different sub-extractor patterns, one for each possible heading that can follow the "Subject" heading (~@SUBJECT@~, ~@SUBJECT2@~, ~@SUBJECT3@~, and ~@SUBJECT4@~). That way the results can be limited by forcing the recognized pattern to stop at any one of the four possible headings.
The problem here is that occasionally some of the data that's in ~@SUBJECT@~ as a session variable will also be saved in ~@SUBJECT3@~ as a session variable, just because of the structure of the pages.
So then when I write the data to a file I have the script write to the file depending on what data is available (first I check that the same data hasn't been saved in a session variable twice):
if( ( session.getVariable( "IMPRINT_SUBJECT" ) != null ) )
{
out.write( "\nimprint_subject: " + session.getVariable( "IMPRINT_SUBJECT" ) );
}
if( ( session.getVariable( "IMPRINT_SUBJECT2" ) != null ) )
{
out.write( "\nimprint_subject: " + session.getVariable( "IMPRINT_SUBJECT2" ) );
}
if( ( session.getVariable( "IMPRINT_SUBJECT3" ) != null ) )
{
out.write( "\nimprint_subject: " + session.getVariable( "IMPRINT_SUBJECT3" ) );
}
if( ( session.getVariable( "IMPRINT_SUBJECT4" ) != null ) )
{
out.write( "\nimprint_subject: " + session.getVariable( "IMPRINT_SUBJECT4" ) );
}
This probably isn't the prettiest way to do it, but it seems to work.
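The duplicate check mentioned above is not shown in the snippet, so here is a minimal sketch of one way it might look. It is plain Java, not screen-scraper's API: a hypothetical Map stands in for the session variables, and the class and method names (SubjectWriter, buildOutput) are made up for illustration. Inside a real script the values would presumably come from session.getVariable( ... ) as in the code above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class SubjectWriter {
    // The four variable names used in the post above.
    static final String[] KEYS = {
        "IMPRINT_SUBJECT", "IMPRINT_SUBJECT2", "IMPRINT_SUBJECT3", "IMPRINT_SUBJECT4"
    };

    /** Builds the output text, skipping missing values and values already written. */
    static String buildOutput(Map<String, String> vars) {
        Set<String> seen = new LinkedHashSet<>();
        StringBuilder out = new StringBuilder();
        for (String key : KEYS) {
            String value = vars.get(key);
            // Set.add returns false when the same text was already captured
            // under an earlier variable, so duplicates are dropped here.
            if (value != null && seen.add(value)) {
                out.append("\nimprint_subject: ").append(value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the session variables.
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("IMPRINT_SUBJECT", "Swindlers and swindling");
        vars.put("IMPRINT_SUBJECT2", "Great Britain -- History");
        vars.put("IMPRINT_SUBJECT3", "Swindlers and swindling"); // duplicate capture
        System.out.print(buildOutput(vars));
    }
}
```

Looping over the variable names keeps the null check and the duplicate check in one place, so adding a fifth heading would only mean adding one more name to the array.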
How might I deal with two identical patterns?
apm,
Yes, there is a way. It's a little complex but very useful (nay, necessary) for cases like this.
I created an example scraping session for a previous poster who had a similar need. May I point you to the posting and the example scraping session? Please give it a try and report back with any questions.
Forum posting:
http://www.screen-scraper.com/forum/phpBB2/viewtopic.php?p=3275
Sample scraping session:
http://www.screen-scraper.com/support/examples/Manual-Extraction-Example_Scraping-Session.zip
Thanks,
Scott
How might I deal with two identical patterns?
Hi Scott,
I think the problem is this: "When using sub-extractor patterns only the first match will be used. That is, even if a sub-extractor pattern could match multiple times, only the data corresponding to the first match will be extracted."
What happens is that screen-scraper grabs the first "Subject" row with the data
    Swindlers and swindling
but ignores the second one with the data
    Great Britain -- History
    Great Britain -- Kings and rulers
    Scotland -- History
because the columns labelled "Subject" are identical. I wanted to try just grabbing everything down to the next category, for example "Location", but the next category might be "Location" or it might be "Provenance", depending on what information is in the particular record, so I can't tell the extractor pattern where to stop.
This is my sub-extractor pattern:
Subject
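For anyone reading along, the idea of "grab everything down to whichever heading comes next, and do it for every Subject row" can be sketched outside screen-scraper with an ordinary regular expression. This is only an illustration in plain java.util.regex, not screen-scraper's extractor pattern syntax. It assumes the page has already been reduced to text with one heading or value per line, and that "Subject", "Location" and "Provenance" are the only headings that can follow; the class name SubjectBlocks is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubjectBlocks {
    // Capture the text after "Subject" up to the next known heading or the end
    // of the input. DOTALL lets '.' span line breaks, and the lookahead checks
    // for the next heading without consuming it, so the following "Subject"
    // row can still be matched on the next find().
    static final Pattern BLOCK = Pattern.compile(
        "Subject\\s+(.*?)(?=\\s*(?:Subject|Location|Provenance)\\b|\\s*$)",
        Pattern.DOTALL);

    /** Returns the contents of every "Subject" block, in page order. */
    static List<String> extract(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {           // keep going past the first match
            blocks.add(m.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Simplified stand-in for the record described above; the value under
        // "Provenance" is a placeholder.
        String page = "Subject\nSwindlers and swindling\n"
            + "Subject\nGreat Britain -- History\n"
            + "Great Britain -- Kings and rulers\nScotland -- History\n"
            + "Provenance\n(placeholder)";
        for (String block : extract(page)) {
            System.out.println("[" + block + "]");
        }
    }
}
```

One caveat with this approach: a subject value that itself contains one of the heading words would be cut short, so the heading list would need to be anchored more tightly against the real markup.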
How might I deal with two identical patterns?
apm,
Sorry I let this one go so long without responding to it. We have a potential solution posted on our site, in the "Using sub-extractor patterns" section here:
http://www.screen-scraper.com/support/docs/using_extractor_patterns.php
Please let us know if this helps or not.
-Scott