The simplest newbie question

I've followed the tutorial for Hello World and have managed to apply a pattern to a page with many email addresses (that's what I want to extract). I know the pattern and it is only working on the first instance of an email address encountered. Soooooo.... here is the dumb question.... how do you get the script or program to repeat itself over and over again on the same page of data to extract more than one occurrence of the pattern?

I would think this is an obvious question after using the Hello World tutorial but I cannot find the answer.

Thanks!


bulgin,

Without seeing the page and the extractor pattern in question, my guess is that you need to modify your extractor pattern so that it matches all email addresses.

It's most effective when you use a regular expression. Fortunately for all of us, we have on staff a kind of self-made regex wizard who has come up with a pretty darn good email regex. He says it's not foolproof, but it's been tested and should be a good start. There may be more complete ones out there on the Net, but this one is nice and simple.

Add this to your extractor pattern token (double-click the token to edit it). Then, when you click "Apply Pattern to Last Scraped Data", it should match all of the email addresses on the page.

[\w\.-]+@[\w\.-]+\.\w{2,}
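If you want to sanity-check the regex outside the tool, here is a minimal sketch in plain Python. It is not part of the scraping tool itself; the function name `find_emails` and the sample text are made up for illustration. It only shows that the expression, applied to a whole block of text, returns every address and not just the first one.

```python
import re

# The same expression as above, compiled once for reuse.
EMAIL_RE = re.compile(r"[\w\.-]+@[\w\.-]+\.\w{2,}")

def find_emails(text):
    """Return every substring of `text` that matches the email regex (hypothetical helper)."""
    # findall returns all non-overlapping matches, in order of appearance.
    return EMAIL_RE.findall(text)

# Made-up sample text, for illustration only.
sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
print(find_emails(sample))
```

As noted, the regex is deliberately simple, so it can accept strings that are not valid addresses and miss unusual ones.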

If it does not match as you expect, then you'll need to look closely at the text on the page that makes up the email addresses and the HTML surrounding that text. Make sure the site isn't employing some technique to thwart screen-scraping of email addresses (a common practice, often poorly employed), like displaying the address using JavaScript. Often you'll still be able to scrape the address; it just may require a little extra work.

Once you have it matching all the addresses, you would write a script to do what you want with them (e.g. write them to a file) and set that script to run "After each pattern application". This will cause the script to be called each time the pattern matches successfully.
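To make the "called once per match" idea concrete, here is a rough standalone Python sketch of the same flow. It does not use the tool's scripting interface; `on_match` and `extract_to_file` are hypothetical names, and the output path is whatever you pass in. The loop plays the role of the pattern being applied repeatedly, and `on_match` stands in for the script that runs after each successful match.

```python
import re

EMAIL_RE = re.compile(r"[\w\.-]+@[\w\.-]+\.\w{2,}")

def on_match(address, out):
    """Stand-in for the per-match script: append one address to the open file."""
    out.write(address + "\n")

def extract_to_file(text, path):
    """Call on_match once for every address found in `text`; return the match count."""
    count = 0
    # Append mode, so repeated runs add to the file instead of overwriting it.
    with open(path, "a", encoding="utf-8") as out:
        for match in EMAIL_RE.finditer(text):
            on_match(match.group(0), out)
            count += 1
    return count
```

Calling `extract_to_file(page_text, "emails.txt")` would then write one address per line, where `page_text` is assumed to hold the scraped page as a string.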

Hope this helps,

Scott