Scraping multiple pages by changing a URL parameter
I would like to scrape the same details from each page of a site. The site uses dynamic URL parameters to display pages, e.g. www.sample.com/page.aspx?id=1 etc. I've tried following the e-commerce tutorial but can't figure out where to put the script that iterates through the pages. My script basically creates an int, counts from 1 to 1000, and puts this into the URL.
Any links to similar scrapes or any help in general would be much appreciated.
dhayz2000,
You can call your script from the scraping session level. Under the General tab of your scraping session (blue cog), under the scripts area you can add your script. If you have a loop happening inside your script you can simply call your script "Before scraping session begins".
For your scrapeable file, be sure to check the box under the Properties tab that says, "This scrapeable file will be invoked manually from a script". Then, in your script, from within your loop, be sure to call session.scrapeFile("My Scrapeable File").
Your script might look something like this:
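// Count from 1 to 1000 and scrape one page per id.
for( int i = 1; i <= 1000; i++ )
{
    // Make the current id available to the scrapeable file.
    session.setVariable( "id", i );
    session.scrapeFile( "My Scrapeable File" );
}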
And the URL of your scrapeable file might look like this, with the "id" session variable embedded as a token:
http://www.sample.com/page.aspx?id=~#id#~
I noticed that your sample URL includes an "aspx" file extension. It's possible that you may not be able to link directly to each URL. If you have any trouble, have a look at Scraping ASP.Net sites.
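To see what the loop above ends up requesting, here is a minimal plain-Java sketch that runs outside screen-scraper. It assumes the scrapeable file's URL carries a ~#id#~ placeholder for the "id" session variable. The class PageUrlLoop and the method buildUrls are hypothetical names made up for this illustration and are not part of the screen-scraper API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of screen-scraper: it only mimics how a
// placeholder in the URL is replaced by the current value of "id".
final class PageUrlLoop {
    // Assumed placeholder for the "id" session variable.
    static final String ID_TOKEN = "~#id#~";

    // Builds one URL per id in [first, last] by substituting the placeholder.
    static List<String> buildUrls(String template, int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            urls.add(template.replace(ID_TOKEN, Integer.toString(i)));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Template based on the sample URL from the question.
        String template = "http://www.sample.com/page.aspx?id=~#id#~";
        for (String url : buildUrls(template, 1, 3)) {
            System.out.println(url);
        }
    }
}
```

Each pass through the loop corresponds to one call to session.scrapeFile in the real script, with the id substituted into the URL before the request is made.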
-Scott
Hi Scott,
Thanks for that information, worked well.
I am trying to output each page's details to a text file. I'm using the following script (from a tutorial):
try
{
session.log( "Writing data to a file." );
// Open up the file to be appended to.
out = new FileWriter( "C:\\Users\\xx\\Desktop\\xx.txt", true );
// Write out the data to the file.
out.write( "Name: " + dataRecord.get( "NAME" ) + "\t" );
out.write( "Email: " + dataRecord.get( "EMAIL" ) );
out.write( "\n" );
// Close up the file.
out.close();
}
catch( Exception e )
{
session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}
I added this to the extractor pattern's Scripts section, set to run "After each pattern match".
The problem is that only the final page's result ends up in the file, not each page's results. Do you know how I can change it to write every page's results to the text file?
Thanks in advance.
Got it working - it was nothing to do with the above script, that works fine. The problem was that my extractor pattern was not set correctly. Thanks again for your help with the original problem.
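For anyone checking the same thing: the file-writing script accumulates one line per match because the second FileWriter argument (true) opens the file in append mode. Below is a minimal standalone sketch of that behaviour in plain Java. The class RecordAppender, the method appendRecord, the temp file, and the sample names and addresses are all made up for illustration. In screen-scraper the values would come from dataRecord instead.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical standalone version of the file-writing script.
final class RecordAppender {
    // Appends one tab-separated line. Passing true as the second FileWriter
    // argument opens the file in append mode, so earlier lines are kept.
    static void appendRecord(Path file, String name, String email) throws IOException {
        try (FileWriter out = new FileWriter(file.toFile(), true)) {
            out.write("Name: " + name + "\t");
            out.write("Email: " + email);
            out.write("\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the output file on the desktop.
        Path file = Files.createTempFile("scrape", ".txt");
        appendRecord(file, "Alice", "alice@example.com");
        appendRecord(file, "Bob", "bob@example.com");
        List<String> lines = Files.readAllLines(file);
        System.out.println(lines.size() + " lines written to " + file);
    }
}
```

If only one line ever shows up in the real file, the writer is probably being called only once, which points at the extractor pattern rather than the script.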
You're welcome. Glad you got it working.