Capture the right Info
If I understand you correctly, you would like to capture some information, but you can't be sure that the information will be free of HTML tags. If you make the extractor pattern very specific around the ~@STUFF@~ token, you could capture the HTML tags in between. If the text you capture contains <> tags, you can create a regular expression in a script which will remove everything inside those tags and replace it with a space. Professional and Enterprise have a very simple checkbox to do this, but it can also be done with a script. Just save the token as a session variable and edit the string.
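To make the "replace the tags with a space" idea concrete, here is a minimal sketch in plain Java. It is not screen-scraper specific: the class name TagStripper, the method name stripTags and the sample markup are made up for illustration, since the real HTML of the page is not shown in this thread.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Matches anything that looks like a tag: '<', any run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Matches one or more whitespace characters, used to collapse the gaps left behind.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces every tag with a space, then collapses repeated whitespace and trims the ends.
    public static String stripTags(String html) {
        if (html == null) {
            return null;
        }
        String spaced = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(spaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical captured value; the actual markup on the site may differ.
        String captured = "<b>Indicatii:</b><br/>Domeniul de utilizare";
        System.out.println(stripTags(captured)); // prints "Indicatii: Domeniul de utilizare"
    }
}
```

Replacing each tag with a space rather than an empty string keeps words on either side of a tag such as <br/> from running together. A simple pattern like this will not cope with a '>' inside an attribute value, so treat it as a rough cleanup rather than a full HTML parser.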
Hope that helps
help again
The HTML code is this:
ANTIFUNGOL, solutie sub forma de spray
Indicatii:
Domeniul de utilizare al solutiei de Antifungol cuprinde infectii cu ciuperci.
Contraindicatii:
Sensibilitatea fata de clotrimazol..
Administrare:
Daca nu exista alta indicatie, se pulverizeaza in strat subtire de 2-3 ori pe zi cu Antifungol.
Efecte adverse:
Ocazional pot aparea iritatii ale pielii (de ex. inrosire pasagera, usturime sau intepaturi).
Compozitie:
1 ml solutie contine 10 mg clotrimazol; polietilenglicol; 2-propanolol.
I made this extractor pattern:
Then I made these sub-extractor patterns:
~@TITLE@~
These are working.
And then I made this sub-extractor pattern:
which is not working. Any suggestions on how I can write the information from STUFF to an Excel document?
still need help
First of all, I cannot capture the STUFF information. I don't know why: maybe because of the tags inside it, maybe because STUFF contains a lot of information, or maybe I am doing something wrong. And if it is possible to capture this information, what are the steps to save the token in a session variable and edit the string to remove those tags? Thanks in advance!
You've definitely got the right idea. My guess is that it is failing to match the end of the main extractor pattern, the part which is like
That might be too general.
Could you post a link to the site that you're scraping, so that we can see the actual HTML? That could help us figure out a good extractor pattern.
Again, it looks like you've almost got it, but something's a little off.
hi again
It's OK now, I can write out STUFF. The web page I want to scrape is this: http://www.emedonline.ro/cuprins/medicamente/view.tag.php/1/ (all details from all products from all letters). But now I have other problems:
- The Excel sheet I write out contains tags inside STUFF. Can you tell me the detailed steps to remove them in SS Basic?
- The Excel sheet I write out has more columns than the number of sub-extractor patterns (in my case 3).
Thanks in advance for your advice.
Before you write out your data to Excel, use this command in an Interpreted Java script:
// Strips anything that looks like an HTML tag from a session variable.
void removeTags(String varName)
{
    Object value = session.getVariable(varName);
    if (value != null)
        session.setVariable(varName, value.toString().replaceAll("<[^>]*>", ""));
}

removeTags("VARIABLE_NAME_GOES_HERE");
You can have as many variables cleaned up as you need, by adding extra removeTags("VAR_NAME") lines to the end.
I wrote this assuming that you're dealing with session variables. If you are using a dataRecord instead, use this:
// Strips anything that looks like an HTML tag from a dataRecord value.
void removeTags(String varName)
{
    Object value = dataRecord.get(varName);
    if (value != null)
        dataRecord.set(varName, value.toString().replaceAll("<[^>]*>", ""));
}

removeTags("VARIABLE_NAME_GOES_HERE");
As for having too many columns: you probably have commas in your data, which break a single value up into multiple columns, because the comma is the column delimiter. The way to fix that (if that is in fact your problem) is to write your data to a CSV with quotes around your values.
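As a rough illustration of the quoting, here is a small plain Java sketch. The class name CsvQuote and its methods are invented for this example and are not part of screen-scraper; the idea is only to wrap every value in double quotes and to double any quote character already inside a value, which is the usual CSV convention.

```java
public class CsvQuote {
    // Wraps a value in double quotes and doubles any embedded double quotes.
    public static String quote(String value) {
        if (value == null) {
            return "\"\"";
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins the quoted values with commas to form one CSV line.
    public static String toCsvLine(String... values) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(quote(values[i]));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // The comma inside the first value no longer starts a new column.
        System.out.println(toCsvLine("ANTIFUNGOL, solutie sub forma de spray", "Indicatii"));
        // prints "ANTIFUNGOL, solutie sub forma de spray","Indicatii"
    }
}
```

With every value quoted like this, a spreadsheet program should read three sub-extractor values as three columns, whatever commas they contain.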
Let me know if this helps!
Tim
:(
Thank you for your support, but it seems that I cannot handle it. I put the code you gave me in an Interpreted Java script and ran it before writing out the file, but the result is the same (I cannot write the information from STUFF to an Excel file without the HTML tags). Anyway, thank you again!
I would just make sure that the script is running just before you write to your file. The process is pretty simple...
If you can't figure that out, and if you're using session variables, then you can add this line to the top of your write-to-file script:
session.executeScript( "the name of the script i told you to make" );
This makes sure the script runs just before you write your data to a file.
nop
I give up, it is still not working. The log says: The "executeScript" method is not available in this edition of screen-scraper.
I am a newbie in screen-scraper, so I am probably making some mistakes. I want to thank you for your support, and sorry for wasting your time.
I'm sorry-- I actually had forgotten entirely that the executeScript method was not in the basic edition of SS. No need to apologize for anything! Please let me/us know if there is anything else you need help with.
Tim