Forum structure traversal question
We need to scrape a forum site whose forum list links to each forum at a URL of the form www.siteaddress/forum_identifier.html. However, each page of threads within an individual forum (after the first page) has a URL of the form www.siteaddress/forum_identifier_site_identifier_index_number.html.
The problem is that the site identifier is not available until the first page of the forum has been scraped. I think I can get this to work by scraping a file with the first structure as its URL, extracting the site_identifier from it, and then calling a script that requests a second file with the second structure.
It seems there should be a more efficient way to do this, however. Is there a better approach?
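For reference, here is a rough sketch of the two-step approach described above. It assumes the first page of a forum contains at least one pagination link of the second form, and that the site identifier itself contains no underscores. The function names and the sample values are made up for illustration.

```python
import re

def find_site_identifier(first_page_html, forum_identifier):
    """Return the site identifier found in a pagination link, or None.

    Assumes links look like <forum_identifier>_<site_identifier>_<index>.html
    and that the site identifier has no underscores (an assumption).
    """
    pattern = re.compile(
        r'href="[^"]*?' + re.escape(forum_identifier) + r'_([^_"/]+)_(\d+)\.html"'
    )
    match = pattern.search(first_page_html)
    return match.group(1) if match else None

def page_url(base, forum_identifier, site_identifier, index):
    """Build the URL of a later thread page from its three parts."""
    return f"{base}/{forum_identifier}_{site_identifier}_{index}.html"
```

Step one fetches www.siteaddress/forum_identifier.html and passes its HTML to find_site_identifier; step two loops over index numbers and requests each page_url result.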
Most forums have a main page
Most forums have a main page with links to each forum, and each of those has links to threads. Normally I'd just scrape the links rather than use a file, since the threads will be pretty dynamic.
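A minimal sketch of the link-following idea, using only the Python standard library. Instead of building the second-form URLs by hand, it collects whatever links each page exposes and follows the ones that belong to the forum. The names LinkCollector, extract_links, and crawl_forum are hypothetical, and the fetch and is_forum_page arguments are placeholders you would supply yourself (for example, fetch could wrap urllib.request.urlopen).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))

def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links

def crawl_forum(start_url, fetch, is_forum_page):
    """Follow links breadth-first from start_url.

    fetch(url) returns the page HTML as a string; is_forum_page(url)
    decides which links are worth following. Both are caller-supplied.
    Returns the set of URLs visited.
    """
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        for link in extract_links(fetch(url), url):
            if is_forum_page(link) and link not in seen:
                queue.append(link)
    return seen
```

Because the pagination links already contain the site identifier, the crawler never needs to know it in advance. A real run should also add a delay between requests and respect the site's robots.txt.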