Multiple Page Scraping Logic

Hello All,

I need help in getting started with screen-scraper. What I'm trying to do is log onto one page which contains a list of names and for each name, a list of days. The days are actually links to other pages which have the detail information I want to scrape.

I'm using http://www.drf.com/results/rindex.html as my sample page. This is not the actual site I need scraped, but its format is very similar: a list of items (tracks) with a variable number of links (days) associated with each item.

From the tutorial and the examples I've seen here's what I think should be done.
a. Log on to the above page
b. Scrape the entire line for each track into a variable, e.g. "Hollywood Park 07, 06, 05, ..."
c. Set up a sub-extractor pattern for the "Day"
d. Write a script to go through the variable and save each day into an array
e. Write an additional script to take that array and link to each of the pages (should it be in the same script as "d" above?)
f. Scrape the detail page of data & create records
g. Return to step "b" to get the next entire line into a variable

Can anyone set me straight on the logic to get this accomplished?

I'll be going through the tutorials once again and looking at examples. I know I'll have more questions, such as how to navigate to a new page and whether I should save the "day" or the "page link" for each day in my array.

Are there any complete examples of the above functionality I need?

Thanks,

tazer98

Multiple Page Scraping Logic Array of pages???

Hi,

If I understand what you're after, I think you'd actually need to modify your approach a bit. I'm assuming you'll need to retain the association between the track and the dates. If that's not the case, this would be simpler, but here's what I'd recommend if you need to keep that association:

1. Log in to the site.
2. Extract each row (meaning each track along with all of its dates).
3. Create a script that will extract any relevant data from the URL (e.g., 07). Note that this would require using scrapeableFile.extractData.
4. For each URL extraction, request the corresponding details page, using a URL like this: http://www.drf.com/results/~#DIT#~/~#FILE#~?rn=~#RN#~.
5. Extract and save any data from the details page.
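
To make the flow of those steps concrete, here is a rough sketch in plain Python rather than in screen-scraper itself. It is only an illustration of the logic: the sample markup, the link format, and the names ROW_PATTERN, LINK_PATTERN, extract_rows and detail_requests are all made up for the example, and the real index page will almost certainly need different patterns. The point is just that each row is extracted first, and the day links are then pulled out of that row, so the track stays attached to every details-page request.

```python
import re

# Hypothetical markup standing in for the index page; the real page will differ.
SAMPLE_INDEX = """
<tr><td>Hollywood Park</td><td><a href="/results/hol/r0707.html">07</a>, <a href="/results/hol/r0706.html">06</a></td></tr>
<tr><td>Belmont Park</td><td><a href="/results/bel/r0707.html">07</a></td></tr>
"""

# Step 2: one match per row (a track plus all of its day links).
ROW_PATTERN = re.compile(
    r"<tr><td>(?P<track>[^<]+)</td><td>(?P<links>.*?)</td></tr>", re.S
)
# Step 3: applied to a single row, one match per day link.
LINK_PATTERN = re.compile(r'<a href="(?P<href>[^"]+)">(?P<day>[^<]+)</a>')


def extract_rows(html):
    """Return a list of (track, [(day, href), ...]) tuples, one per row."""
    rows = []
    for row in ROW_PATTERN.finditer(html):
        days = [
            (link.group("day"), link.group("href"))
            for link in LINK_PATTERN.finditer(row.group("links"))
        ]
        rows.append((row.group("track").strip(), days))
    return rows


def detail_requests(html, base="http://www.drf.com"):
    """Step 4: yield one details-page request per day, tagged with its track."""
    for track, days in extract_rows(html):
        for day, href in days:
            yield {"track": track, "day": day, "url": base + href}


requests_to_make = list(detail_requests(SAMPLE_INDEX))
for request in requests_to_make:
    print(request["track"], request["day"], request["url"])
```

Step 5 would then fetch each url and save whatever is extracted from the details page alongside the track and day carried in the same record.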

If you don't need to retain the association between the track and the date (e.g., because you could just get that information from the details page), the task becomes quite a bit simpler. You could just use the extractor pattern from step 3 to grab all of the URL parts, then scrape the details page for each one.
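
As a rough illustration of that simpler variant, again in plain Python rather than screen-scraper: a single pattern runs over the whole page, and each match fills in the tokens of the URL from step 4. The link format in SAMPLE_INDEX is invented so that it lines up with that URL, and PARTS_PATTERN, TOKEN_PATTERN and fill_template are names made up for this sketch, so treat it as an assumption-laden example rather than what the real page requires.

```python
import re

# URL from step 4, with its ~#NAME#~ tokens left exactly as written there.
URL_TEMPLATE = "http://www.drf.com/results/~#DIT#~/~#FILE#~?rn=~#RN#~"

# Hypothetical links; the real hrefs on the index page may look different.
SAMPLE_INDEX = """
<a href="/results/hol/day07.html?rn=1">07</a>
<a href="/results/bel/day06.html?rn=2">06</a>
"""

# One pattern over the whole page: no per-row step, so no track association.
PARTS_PATTERN = re.compile(
    r'href="/results/(?P<DIT>[^/"]+)/(?P<FILE>[^?"]+)\?rn=(?P<RN>[^"]+)"'
)
TOKEN_PATTERN = re.compile(r"~#(\w+)#~")


def fill_template(template, values):
    """Replace each ~#NAME#~ token with values["NAME"]."""
    return TOKEN_PATTERN.sub(lambda match: values[match.group(1)], template)


detail_urls = [
    fill_template(URL_TEMPLATE, match.groupdict())
    for match in PARTS_PATTERN.finditer(SAMPLE_INDEX)
]
for url in detail_urls:
    print(url)
```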

Kind regards,

Todd Wilson