Beginner help.

Hi,
I am learning to use screen-scraper and have encountered a problem.

Some of the results in a table are links and some are just plain text. For example:

1. John
2. Tom
3. Greg
4. Brian

How do I get rid of the HTML elements?

steelaz on 02/04/2008 at 12:01 pm

steelaz,

If you're using either the professional or enterprise edition of screen-scraper, the easiest way is to check the "Strip HTML" box under the advanced tab for the extractor pattern token in question.

Otherwise, you'll need to make three separate extractor tokens: one for the possible opening link tag (the one carrying the "href"), one for the person's name, and one for the possible closing link tag. For the first and last tokens, set the regular expression to "Ignore HTML tags" ((?:\s*<[^>]*>\s*)*), and set the name token's regular expression to "Non-HTML tags" ([^<>]*).

You'll notice the ending "*" in each of these regular expressions. It means "match zero or more times", so the pattern matches whether or not the link tags around the name are present.
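
If you want to try those two regular expressions outside of screen-scraper, here is a minimal sketch in Python. It is not how screen-scraper works internally; it only shows how the optional-tag groups behave. The sample HTML, the URLs in it, and the names ROW and extract_rows are made up for illustration.

```python
import re

# The regular expressions quoted in the reply.
IGNORE_HTML_TAGS = r"(?:\s*<[^>]*>\s*)*"  # zero or more tags, with optional whitespace
NON_HTML_TAGS = r"[^<>]*"                 # a run of characters containing no angle brackets
NUMBER = r"\d*"

# Rough regex equivalent of the three-token approach:
# number, optional opening tag, name, optional closing tag, then "<br />".
ROW = re.compile(
    rf"(?P<number>{NUMBER})\. "
    rf"(?P<start_href>{IGNORE_HTML_TAGS})"
    rf"(?P<name>{NON_HTML_TAGS})"
    rf"(?P<end_href>{IGNORE_HTML_TAGS})<br />"
)

def extract_rows(html):
    """Return (number, name) pairs, whether or not the name is wrapped in a link."""
    return [(m.group("number"), m.group("name")) for m in ROW.finditer(html)]

# Hypothetical input: some names are links, some are plain text.
SAMPLE = (
    '1. <a href="/people/john">John</a><br />\n'
    '2. Tom<br />\n'
    '3. <a href="/people/greg">Greg</a><br />\n'
    '4. Brian<br />\n'
)

if __name__ == "__main__":
    for number, name in extract_rows(SAMPLE):
        print(number, name)
```

Because both tag groups end in "*", rows 2 and 4 match with empty tag groups, while rows 1 and 3 match with the link tags absorbed into them.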

Here is an example that will work in all editions:

~@number@~. ~@start_href@~~@name@~~@end_href@~<br />

...and an example for professional or enterprise when you select "Strip HTML" under the advanced tab:

~@number@~. ~@name@~<br />

Set the "number" token regex to "Number" (\d*).
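
For comparison, the "Strip HTML" approach can be imitated outside of screen-scraper by capturing everything up to the line break and removing the tags afterwards. This is only a rough sketch in Python and assumes that "Strip HTML" simply drops anything that looks like a tag; the real option may do more. The names strip_html and extract_names and the sample input are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")

def strip_html(value):
    """Remove anything that looks like an HTML tag, then trim whitespace."""
    return TAG.sub("", value).strip()

# Rough regex equivalent of the two-token pattern:
# the name token captures everything up to "<br />", tags included.
ROW = re.compile(r"(?P<number>\d*)\. (?P<name>.*?)<br />")

def extract_names(html):
    """Return the cleaned-up name from every row."""
    return [strip_html(m.group("name")) for m in ROW.finditer(html)]

# Hypothetical input: one linked name, one plain name.
SAMPLE = (
    '1. <a href="/people/john">John</a><br />\n'
    '2. Tom<br />\n'
)

if __name__ == "__main__":
    print(extract_names(SAMPLE))
```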

Hope this helps.

-Scott

swilsonmc on 02/04/2008 at 1:44 pm
