Varying numbers of images
Hi
I'm having problems scraping the images from a results page. The problem is that there is not always the same number of photos. For example one result could show:
another could be:
I've tried writing out an extractor pattern for each possibility, but seeing as some pages could have 10 photos, this seems a very long-winded and inefficient way of doing it. I'm sure I must be missing something.
Is there a way I can write an extractor pattern which would look something like this:
And if there are 1, 2, 5 or 10 photos, it would capture them all, and either allow me to assign each one a different variable name, or store it as a long string, which I can then sort through afterwards and split?
I may be asking too much! I hope somebody can help!
Thanks
Ben
Varying numbers of images
Good news! Just let us know if we can help in the future.
Best wishes,
Todd
Varying numbers of images
That's great Todd thanks!
I used that with photoDataRecord.get( "photo" ); and it works perfectly.
Thanks for your help and patience!
Ben
Varying numbers of images
Hi Ben,
If you need to keep track of the index of the photo, you'd actually need to run the script after the pattern has been applied (i.e., not after each pattern application, but "After pattern is applied"). The script would look something like this, though:
for( i = 0; i < dataSet.getNumDataRecords(); i++ )
{
photoDataRecord = dataSet.getDataRecord( i );
// ... do something with the record using the index "i" ...
}
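To make the idea concrete, here is a minimal, self-contained sketch of that loop storing each photo under its own indexed name. It is only an illustration: the list of maps stands in for the dataSet, the returned map stands in for the session (in a real script you would call session.setVariable( name, value ) rather than put), and the class name, method name and sample URLs are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhotoIndexSketch {
    // Hypothetical stand-in: "dataSet" is a plain list of records, and the
    // returned map plays the role of the session's variables.
    static Map<String, Object> storeByIndex(List<Map<String, String>> dataSet) {
        Map<String, Object> sessionVariables = new HashMap<>();
        for (int i = 0; i < dataSet.size(); i++) {
            Map<String, String> photoDataRecord = dataSet.get(i);
            // Build the name from the index: photo0, photo1, photo2, ...
            sessionVariables.put("photo" + i, photoDataRecord.get("photo"));
        }
        return sessionVariables;
    }

    public static void main(String[] args) {
        // Made-up sample data: three extracted photos.
        List<Map<String, String>> dataSet = new ArrayList<>();
        for (String url : new String[] {"a.jpg", "b.jpg", "c.jpg"}) {
            Map<String, String> record = new HashMap<>();
            record.put("photo", url);
            dataSet.add(record);
        }
        Map<String, Object> vars = storeByIndex(dataSet);
        System.out.println(vars.get("photo0") + " " + vars.get("photo2")); // prints "a.jpg c.jpg"
    }
}
```

Note that the index starts at 0, so the first photo ends up as photo0 rather than photo1.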
What do you think? Will that work?
Best wishes,
Todd
Varying numbers of images
Thanks very much for your help Todd. I'm nearly there. My last question is: can I work out the data record number each time I run the script?
For example I would like to store datarecord i as strPhotoi
or session.setVariable( "photoi" ) = datarecord(i)
Or would it be easier to store it as a long string and split it up afterwards?
Thanks!
Ben
Varying numbers of images
Hi Ben,
That's good news. That was definitely the easier of the two options :)
I would recommend creating a script that gets invoked "After each pattern application" for the extractor pattern that pulls the images. That is, that script will get invoked each time an image is extracted. You would then refer to a given image in your script like this:
dataRecord.get( "photo" );
You could then write the value to a file, or whatever else you need to do. The "photo" value corresponds to the name of your extractor pattern token (~@photo@~). The key to remember is that the script will get invoked for each image individually, rather than for all of them at once, after the data extraction has occurred. It's actually quite similar to the method we demonstrate in our third tutorial (here). You can view a sample script that would likely be somewhat similar to what you're doing here: here.
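As a rough illustration of the "invoked once per image" behaviour, here is a small, self-contained sketch. It is not the screen-scraper API itself: the map stands in for the dataRecord the script receives, the list stands in for wherever you collect the values (a file, a database, a session variable), and the class name, method name and sample URLs are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class PerMatchScriptSketch {
    // Plays the role of a script run "After each pattern application":
    // it sees exactly one record (one image) per call.
    static void afterEachPatternApplication(Map<String, String> dataRecord, List<String> photos) {
        String photo = dataRecord.get("photo"); // "photo" matches the ~@photo@~ token name
        if (photo != null) {
            photos.add(photo);
        }
    }

    public static void main(String[] args) {
        List<String> photos = new ArrayList<>();
        // Simulate the pattern matching three images on a page (made-up URLs).
        for (String url : new String[] {"a.jpg", "b.jpg", "c.jpg"}) {
            afterEachPatternApplication(Collections.singletonMap("photo", url), photos);
        }
        // One long string that can be split again later, if that is preferred.
        System.out.println(String.join(",", photos)); // prints "a.jpg,b.jpg,c.jpg"
    }
}
```

Because each call only sees one record, anything that needs all of the photos together has to accumulate them somewhere between calls, as the list does here.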
Please just let me know if I can help further.
Kind regards,
Todd
Varying numbers of images
Hi Todd
Thanks for this. The first method you suggested would be fine.
Using img src="~@photo@~"
returns a dataset of the images that I need to store in a database. How do I store each of those data records as a separate variable? I have tried doing this without a script, but the session variable only picks up the last instance of the image.
I guess I need to use a script?
From reading the documentation I thought this would work:
session.setVariable( "photo1" )=dataSet.getDataRecord( 1 )
session.setVariable( "photo2" )=dataSet.getDataRecord( 2 )
But I get the error "Object doesn't support this action".
However, I tried this just for testing purposes:
session.setVariable( "photo1" )=dataSet.getNumDataRecords()
It worked.
I have a feeling I'm making this much more complicated than it needs to be.
Could you offer any help?!
Thanks
Ben
Varying numbers of images
Hi Ben,
There are a couple of ways I can think of to handle this:
1. If you don't need to be able to distinguish between the groups of photos, you could just use your extractor pattern () to grab them all.
2. If you need to know which photo came from which group, you'd need to perform the extraction in a script. With this, I'm assuming that you might have multiple groups of photos on the same page. That is, you could have this:
and this:
And you need to be able to distinguish between the first photo found in the first group, and the three photos found in the second group. In such a case you'd first need an extractor pattern that would match a photo group. In this case, that would probably look like this:
You would then use your other extractor pattern:
within a script to pull the data out. The key would be to use the session.extractData method (here).
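To show the shape of that two-step approach, here is a self-contained sketch that uses plain java.util.regex in place of screen-scraper's extractor patterns. It does not call session.extractData; the regular expressions, the div markup, the class and method names, and the sample page are all assumptions made up for the illustration. The outer pattern finds each photo group, and the inner pattern is then applied only to the text of that group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupedPhotoSketch {
    // Hypothetical markup: each group of photos sits in <div class="photos">...</div>.
    private static final Pattern GROUP =
            Pattern.compile("<div class=\"photos\">(.*?)</div>", Pattern.DOTALL);
    private static final Pattern PHOTO = Pattern.compile("<img src=\"([^\"]*)\"");

    // Returns one inner list per photo group, in page order.
    static List<List<String>> extractGroups(String html) {
        List<List<String>> groups = new ArrayList<>();
        Matcher groupMatcher = GROUP.matcher(html);
        while (groupMatcher.find()) {
            List<String> photos = new ArrayList<>();
            // Second step: apply the photo pattern to this group's text only.
            Matcher photoMatcher = PHOTO.matcher(groupMatcher.group(1));
            while (photoMatcher.find()) {
                photos.add(photoMatcher.group(1));
            }
            groups.add(photos);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up page: one photo in the first group, three in the second.
        String html = "<div class=\"photos\"><img src=\"a.jpg\"></div>"
                + "<div class=\"photos\"><img src=\"b.jpg\"><img src=\"c.jpg\"><img src=\"d.jpg\"></div>";
        System.out.println(extractGroups(html)); // prints [[a.jpg], [b.jpg, c.jpg, d.jpg]]
    }
}
```

In a real scraping session the inner step would go through the second extractor pattern rather than a hand-written regex, but the nesting is the same: match the group first, then extract the photos from within it.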
Hopefully that's enough to get the ball rolling. Feel free to reply back if I can give any other suggestions.
Kind regards,
Todd Wilson