incremental URL scraping

I know this relates somewhat to the first two tutorials, but I've been having a fairly difficult time trying to set it up, so I thought I would ask here and see if anyone had any input.

Basically, I would like to scrape some info from a site based on incremental URLs, e.g. http://www.ccc.com/item.php?kasi=00001. I want the scraper to get a few pieces of information (book name, author, price, etc.) from each page from kasi=00001 to kasi=00200.

But there's one more thing I would like to do. Each URL above imports a short excerpt from the book from a different URL into a Flash file. The text is just stored in a different location with the same number, e.g. http://www.ccc.com/item.php?kasi=00001 imports some text from http://www.ccc.com/item.php?txt=00001.

Ideally I would like to be able to tabulate the book name, author, price, and short excerpt all together if possible.

Any advice as to how I should go about doing that?

Thanks.

tr3online on 01/30/2008 at 9:49 pm

screen-scraper public support

tr3online,

The iterating loop gets a bit more complicated with the inclusion of the prefix letter. We would be happy to offer you a free quote for doing the project for you. Below is a very rough sketch of what I had in mind if you'd like to give it a try.

The idea is to loop through an array of the letters you need, in the order you need them. Inside the loop for a given letter you would then loop through a defined sequence of numbers.

For each number you would dispatch to two different scrapeable files -- one for the general book data and the other for the excerpt text. Both scrapeable files would contain extractor patterns to match the text you're after and each would write an output with the unique alphanumeric string (e.g. "A00001") as a common element to enable proper sorting after the fact.
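The role of the common identifier can be sketched in plain Java, outside screen-scraper. This is only an illustration under assumptions: the class name RecordMerger, the method merge, and the sample values are hypothetical placeholders, and it assumes each scrapeable file's output has already been read into a map keyed by the identifier (e.g. "A00001").

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the screen-scraper API.
final class RecordMerger {
    // Joins the general book data and the excerpts on their shared
    // identifier. The TreeMap keeps the rows sorted by identifier.
    static Map<String, String> merge(Map<String, String> general,
                                     Map<String, String> excerpts) {
        Map<String, String> rows = new TreeMap<>();
        for (Map.Entry<String, String> entry : general.entrySet()) {
            // An identifier with no excerpt gets an empty excerpt column.
            String excerpt = excerpts.getOrDefault(entry.getKey(), "");
            rows.put(entry.getKey(), entry.getValue() + "\t" + excerpt);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Placeholder values standing in for scraped output.
        Map<String, String> general = new HashMap<>();
        general.put("A00002", "name-2\tauthor-2\tprice-2");
        general.put("A00001", "name-1\tauthor-1\tprice-1");

        Map<String, String> excerpts = new HashMap<>();
        excerpts.put("A00001", "excerpt-1");

        for (Map.Entry<String, String> row : merge(general, excerpts).entrySet()) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```

Because both outputs carry the same identifier, the order in which the two files were scraped does not matter when the rows are combined.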

When the inner loop hits its threshold it will roll back up to the outer loop which will increment to the next letter and start it all again.

Roughly, something like this...

// Create an array containing the needed letters in the
// sequence you desire.
String[] alphabet = { "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z" };

// Loop through each letter one-by-one. Use "<" rather than "<="
// so the index never runs past the end of the array.
for (int a = 0; a < alphabet.length; a++)
{
    session.log("***Current letter is " + alphabet[a]);

    // For the current letter loop through the range of numbers
    // (00001 to 00200).
    for (int i = 1; i <= 200; i++)
    {
        // Prepend the "i" iterator with the necessary zeros so the
        // numeric part is always 5 digits. Together with the letter
        // that gives 6 characters, as in "A00001".
        String number = String.valueOf(i);
        while (number.length() < 5)
        {
            number = "0" + number;
        }

        // Combine the current letter (not its index) with the padded number.
        String sn = alphabet[a] + number;

        // Store the value under the name "sn" so the scrapeable files can use it.
        session.setVariable("sn", sn);

        // Each of these scrapeable files will utilize a
        // variable GET parameter that will reference
        // the value of the session variable "sn".
        session.scrapeFile("generalTextURL");
        session.scrapeFile("excerptTextURL");
    }
}
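The zero-padding step in the loop above can be tried on its own in plain Java before wiring it into a script. This is a minimal sketch, assuming a five-digit numeric part. The class name SerialNumbers and the method build are hypothetical names used only for illustration, and String.format is standard Java rather than anything specific to screen-scraper.

```java
// Hypothetical helper, not part of the screen-scraper API.
final class SerialNumbers {
    // Builds an identifier such as "A00001" from a letter prefix and a
    // counter. "digits" is the width of the numeric part (5 for "00001").
    static String build(String prefix, int number, int digits) {
        // The format "%0Nd" left-pads a decimal number with zeros to N characters.
        return prefix + String.format("%0" + digits + "d", number);
    }

    public static void main(String[] args) {
        System.out.println(build("A", 1, 5));   // prints A00001
        System.out.println(build("B", 200, 5)); // prints B00200
    }
}
```

Keeping the width as a parameter means the same helper still works if the site turns out to use a different number of digits.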

I hope this helps.

-Scott

swilsonmc on 02/01/2008 at 3:05 pm

[quote="swilsonmc"]tr3online,

I can give you a better explanation if you can tell me whether you are scraping the incremental URLs from the site and visiting them one-by-one "after each pattern application" OR if you're utilizing a loop in a script to iterate through the known numbers and scraping the URL that way.

Also, if they're feeding in the book excerpts from an additional URL does that URL contain only the text that you're after or does it also contain HTML and other yuck that you don't necessarily want?

-Scott[/quote]

Thanks for the reply, Scott.

Ideally I would like to iterate through the known URLs with a loop (00001 to 00200). The URLs are actually prefixed with a letter, e.g. A00001, so that'll change the code a bit.

The book excerpts come from an additional URL with the same number but a different name before the query string (kasi?sn=A00001 vs. surl?sn=A00001). It is pretty much straight text, with the exception of the starting code that plugs the text into the Flash file. Examples of this are test1=36&test2=[excerpt] and test1=33&test2=[excerpt].
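To make that response format concrete, here is a minimal Java sketch of pulling the excerpt out of such a body. It rests on assumptions that would need checking against the real responses: that the excerpt is everything after "test2=", and that the text is not URL-encoded. The class name ExcerptParser and the method extractExcerpt are hypothetical.

```java
// Hypothetical helper, not part of the screen-scraper API.
final class ExcerptParser {
    // Returns the text following "test2=" in a body such as
    // "test1=36&test2=[excerpt]", or an empty string if the marker is absent.
    static String extractExcerpt(String body) {
        String marker = "test2=";
        int start = body.indexOf(marker);
        if (start < 0) {
            return "";
        }
        // Assumption: the excerpt runs to the end of the body.
        return body.substring(start + marker.length()).trim();
    }

    public static void main(String[] args) {
        // Placeholder body in the shape described above.
        System.out.println(extractExcerpt("test1=36&test2=Sample excerpt text"));
    }
}
```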

Appreciate any help you can give me.

Thanks again!

tr3online on 01/31/2008 at 8:09 pm

tr3online,

Also, if they're feeding in the book excerpts from an additional URL does that URL contain only the text that you're after or does it also contain HTML and other yuck that you don't necessarily want?

-Scott

swilsonmc on 01/31/2008 at 3:27 pm
