screen-scraper help

How can I extract the information from inside this HTML:

<div class="article-text">
....Information..
</div>

when there are other HTML tags inside it? If it is possible, how can I ignore these tags in the extracted information? I use screen-scraper Basic. Sorry for the bad English.

botzoboy on 11/24/2008 at 7:05 am

screen-scraper public support

Capture the right Info

If I understand you correctly, you would like to capture some information, but you can't be sure that the information will be free of HTML tags. If you make the extractor pattern very specific

~@STUFF@~

you could capture the HTML tags in between. If the text you capture contains <> tags, you can use a regular expression in a script to remove each tag and replace it with a space. The Professional and Enterprise editions have a simple checkbox for this, but in Basic it can be done with a script. Just save the token as a session variable and edit the string.
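To make the idea concrete, here is a minimal sketch in plain Java (not screen-scraper specific) of stripping tags with a regular expression and replacing each one with a space. The class name, method name, and sample string are made up for illustration; in a screen-scraper script the input would be the value you saved in the session variable.

```java
public class TagStripper {
    // Replaces every <...> run with a space, then collapses runs of
    // whitespace so that adjacent tags do not leave double spaces behind.
    public static String stripTags(String html) {
        if (html == null) {
            return null;
        }
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        // Hypothetical captured text; the real page markup may differ.
        String captured = "<b>Indicatii:</b><br />Domeniul de utilizare";
        System.out.println(stripTags(captured));
        // prints: Indicatii: Domeniul de utilizare
    }
}
```

Replacing with a space rather than an empty string keeps words apart when a tag such as a line break was the only thing separating them.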

Hope that helps

scraper on 11/24/2008 at 6:40 pm

help again

The HTML code is this:

ANTIFUNGOL, solutie sub forma de spray

Substanta activa: clotrimazolum

Ultima actualizare: 02 : 01 : 2007

Indicatii:
Domeniul de utilizare al solutiei de Antifungol cuprinde infectii cu ciuperci.

Contraindicatii:
Sensibilitatea fata de clotrimazol..

Administrare:
Daca nu exista alta indicatie, se pulverizeaza in strat subtire de 2-3 ori pe zi cu Antifungol.

Efecte adverse:
Ocazional pot aparea iritatii ale pielii (de ex. inrosire pasagera, usturime sau intepaturi).

Compozitie:
1 ml solutie contine 10 mg clotrimazol; polietilenglicol; 2-propanolol.

I made this extractor pattern:

~@DATARECORD@~

Then I made these sub-extractor patterns:

~@TITLE@~

~@SUBSTANTA@~

These ones are working.
Then I made this sub-extractor pattern:

~@STUFF@~

which is not working. Any suggestions on how I can write the information from STUFF to an Excel document?

botzoboy on 11/25/2008 at 4:55 am

still need help

First of all, I cannot capture the STUFF information. I don't know why: maybe because of the tags inside it, maybe because STUFF contains a lot of information, or maybe I am doing something wrong. And if it is possible to capture this information, what are the steps to save the token in a session variable and edit the string to remove those tags? Thanks in advance!

botzoboy on 11/25/2008 at 3:09 am

You've definitely got the right idea. My guess is that it is failing to match the end of the main extractor pattern, the part which is like

That might be too general.

Could you post a link to the site that you're scraping, so that we can see the actual HTML? That could help us figure out a good extractor pattern.

Again, it looks like you've almost got it, but something's a little off.

timv on 12/01/2008 at 2:15 pm

hi again

It's OK now, I can write out STUFF. The web page I want to scrape is http://www.emedonline.ro/cuprins/medicamente/view.tag.php/1/ (all details for all products under all letters). But now I have other problems:
- The Excel sheet I write out contains tags inside STUFF. Can you tell me the detailed steps to remove them in SS Basic?
- The Excel sheet I write out has more columns than the number of sub-extractor patterns (in my case 3).
Thanks in advance for your advice.

botzoboy on 12/03/2008 at 3:53 am

Before you write out your data to excel, use this command in an Interpreted Java script:

// Strips HTML tags from the named session variable, if it has a value.
void removeTags(String varName)
{
    Object value = session.getVariable(varName);
    if (value != null)
        session.setVariable(varName, value.toString().replaceAll("<[^>]*>", ""));
}
removeTags("VARIABLE_NAME_GOES_HERE");

You can have as many variables cleaned up as you need, by adding extra removeTags("VAR_NAME") lines to the end.

I wrote this assuming that you're dealing with session variables. If you are using a dataRecord instead, use this:

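// Strips HTML tags from the named value in the current dataRecord, if present.
// DataRecord is map-like: read with get, store with put.
void removeTags(String varName)
{
    Object value = dataRecord.get(varName);
    if (value != null)
        dataRecord.put(varName, value.toString().replaceAll("<[^>]*>", ""));
}
removeTags("VARIABLE_NAME_GOES_HERE");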

As for having too many columns: you probably have commas in your data, which break each value up into multiple columns, because the comma is the field delimiter. The way to fix that (if that's in fact your problem) is to write your data to a CSV with quotes around your values.
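For illustration, quoting could look roughly like the plain Java sketch below. The class and method names are hypothetical, and it assumes the usual CSV convention that a double quote inside a value is written as two double quotes.

```java
public class CsvQuote {
    // Wraps a value in double quotes and doubles any embedded quotes,
    // so commas inside the value no longer split it into extra columns.
    public static String quote(String value) {
        if (value == null) {
            return "\"\"";
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins the values into a single comma-separated line.
    public static String toCsvLine(String... values) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(quote(values[i]));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine("ANTIFUNGOL, solutie sub forma de spray", "clotrimazolum"));
        // prints: "ANTIFUNGOL, solutie sub forma de spray","clotrimazolum"
    }
}
```

With the title quoted, its comma stays inside one cell and the row comes out with two columns instead of three.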

Let me know if this helps!

Tim

timv on 12/04/2008 at 2:47 pm

:(

Thank you for your support, but it seems that I cannot get it to work. I put the code you gave me in an Interpreted Java script and ran it before writing out the file, but the result is the same (I cannot write the information from STUFF to an Excel file without HTML tags). Anyway, thank you again!

botzoboy on 12/08/2008 at 9:50 am

I would just make sure that the script is running just before you write to your file. The process is pretty simple.

If you can't figure that out, and if you're using session variables, then you can add this line to the top of your write-to-file script:

session.executeScript( "the name of the script i told you to make" );

This makes sure the tag-removal script runs just before you write your data to a file.

timv on 12/08/2008 at 2:06 pm

nop

I give up, it is still not working. The log says: The "executeScript" method is not available in this edition of screen-scraper.
I am a newbie at screen-scraper, so I am probably making some mistakes. I want to thank you for your support, and sorry for wasting your time.

botzoboy on 12/10/2008 at 5:57 am

I'm sorry, I had entirely forgotten that the executeScript method is not in the Basic edition of SS. No need to apologize for anything! Please let me/us know if there is anything else you need help with.

Tim

timv on 12/11/2008 at 12:46 pm
