Ignoring HTML in scraping session results

Hi,

I am able to scrape web pages using screen-scraper, but because there are variations between the pages I am scraping (from the same site), I have to make the sub-extractor patterns wider than I'd like to ensure I don't miss data from a paragraph that may contain lots of p tags, for example.

The result is that when I get my data, I have to do a lot of cleaning up (for example, going through with find-and-replace and deleting HTML tags like h1, div, br, etc.).

Is there any way to make screen-scraper ignore any HTML it picks up when running an extractor pattern and only produce the text you see on the web page?

Thanks

If you have screen-scraper

If you have screen-scraper enterprise edition, you can go to the extractor pattern > token properties > advanced tab, and check the box to strip HTML. For the other editions you need to do that in a script. I have a script I use a lot to write data to a CSV, and it uses this:

String fixString(String value)
{
        if (value != null)
        {
                value = value.replaceAll("<[^<>]*>", " ");
                value = value.replaceAll("\\s{2,}", " ");
                value = value.trim();
        }
        return (value==null ? "" : value);
}

Note that I replace all HTML tags with a space, collapse any runs of multiple spaces, and trim the leading/trailing whitespace.
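If you want to try the cleanup outside screen-scraper first, here is a minimal, self-contained sketch of the same logic in plain Java (the class name and sample HTML are just for illustration):

```java
// Standalone demo of the fixString cleanup above: strip tags, collapse
// whitespace, trim. Plain Java, runnable outside screen-scraper.
public class FixStringDemo
{
    static String fixString(String value)
    {
        if (value != null)
        {
            value = value.replaceAll("<[^<>]*>", " "); // replace each tag with a space
            value = value.replaceAll("\\s{2,}", " ");  // collapse runs of whitespace
            value = value.trim();                      // drop leading/trailing space
        }
        return (value == null ? "" : value);
    }

    public static void main(String[] args)
    {
        String html = "<h1>Acme Corp</h1><p>Leeds,<br/> UK</p>";
        System.out.println(fixString(html)); // prints: Acme Corp Leeds, UK
    }
}
```

One caveat: this simple regex treats anything between angle brackets as a tag and does not decode HTML entities like &amp;amp;, but for typical scraped fields it is usually enough.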

Still not getting it to work on v6 basic screen-scraper

I am using this to write to CSV; it runs after each pattern match. Is it not working because of this?

FileWriter out = null;

try
{
        session.log( "Writing data to a file." );

        // Open up the file to be appended to.
        out = new FileWriter( "data-from-files-in-local-folder.txt", true );

        // Write out the data to the file.
        out.write( "START" + "\t" );
        out.write( scrapeableFile.getCurrentURL() + "\t" );
        out.write( dataRecord.get( "NAME" ) + "\t" );
        out.write( dataRecord.get( "LOCALITY" ) + "\t" );
        out.write( "END" + "\n" );

        // Close up the file.
        out.close();
}
catch( Exception e )
{
        session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}

Where to put this script on basic edition of screen scraper

Hi Jason,

I've just had a go at applying this script and I can't get it to work.

I have it added as a script, with "where to run" set to "before pattern is applied".

Then I have my write-data-to-CSV script set to run "after each pattern is applied".

I've tried different combinations too, with no luck.

Please help,

I have the basic edition, v6.0, and I enter the script as interpreted Java.

You run this script after

You run this script after each pattern match

dataRecord.put("URL", scrapeableFile.getCurrentURL());

// Fix format issues
String fixString(String value)
{
        if (value != null)
        {
                value = value.replaceAll("<[^<>]*>", " ");
                value = value.replaceAll("\\s{2,}", " ");
                value = value.trim();
        }
        return (value==null ? "" : value);
}

// Set name of file to write to
// outputFile = "output/" + session.getName() + "_" + sutil.getCurrentDate("yyyy-MM-dd") + ".csv";
outputFile = "output/" + session.getName() + ".csv";

// Set columns to write
// Will look for tokens of same name using usual naming convention
String[] names = {
        "Start",
        "URL",
        "Name",
        "Locality",
        "End"
};

try
{
        File file = new File(outputFile);
        fileExists = file.exists();
       
        // Open up the file to be appended to
        out = new FileWriter(outputFile, true);
        session.log("Writing data to a file");
        if (!fileExists)
        {
                // Write headers
                for (i=0; i<names.length; i++)
                {
                        out.write(fixString(names[i]));
                        if (i<names.length-1)
                                out.write("\t");
                }
                out.write("\n");
        }
               
        // Write columns
        for (i=0; i<names.length; i++)
        {
                var = names[i];
                var = var.toUpperCase();
                var = var.replaceAll("\\s", "_");
                out.write(fixString(dataRecord.get(var)));
                if (i<names.length-1)
                        out.write("\t");
        }
        out.write( "\n" );

        // Close up the file
        out.close();
}

catch( Exception e )
{
        session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}

// Add to controller
session.addToNumRecordsScraped(1);

Thanks Jason,

Just had a chance to check this.

I ran it and it said "The "addToNumRecordsScraped" method is not available in this edition of screen-scraper."

I have the basic version of Screen Scraper 6.0. Is there any way for this to work without the "addToNumRecordsScraped" method?

Thanks for helping!

It should still run despite

It should still run despite that warning; however, you can comment out that line. It does nothing functional in the script.

Still not working, here are the notices I get

Thanks Jason,

I have taken

// Add to controller
session.addToNumRecordsScraped(1);

out of the script, and I get two notices.

1) Sorry, tidying HTML failed. Returning the original HTML.
2) The token "NAME" in sub-extractor pattern #1 has no regular expression.

I have ~@NAME@~, and when I test the pattern, it shows up as the text that sits between this HTML in the web page.

and it works when I just use the script.

Any ideas where I'm going wrong?

FileWriter out = null;

try
{
        session.log( "Writing data to a file." );

        // Open up the file to be appended to.
        out = new FileWriter( "data-from-files-in-local-folder.txt", true );

        // Write out the data to the file.
        out.write( "START" + "\t" );
        out.write( scrapeableFile.getCurrentURL() + "\t" );
        out.write( dataRecord.get( "NAME" ) + "\t" );
        out.write( dataRecord.get( "LOCALITY" ) + "\t" );
        out.write( dataRecord.get( "INDUSTRY" ) + "\t" );
        out.write( dataRecord.get( "HEADLINE" ) + "\t" );
        out.write( dataRecord.get( "CURRENT" ) + "\t" );
        out.write( dataRecord.get( "STARTDATE" ) + "\t" );
        out.write( dataRecord.get( "PAST" ) + "\t" );
        out.write( dataRecord.get( "SUMMARY" ) + "\t" );
        out.write( dataRecord.get( "EXPERIENCE" ) + "\t" );
        out.write( dataRecord.get( "EDSUM" ) + "\t" );
        out.write( dataRecord.get( "EDUCATION" ) + "\t" );
        out.write( "END" + "\n" );

        // Close up the file.
        out.close();
}
catch( Exception e )
{
        session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}

Neither of those notices are

Neither of those notices are a big deal.

If you go to the scrapeableFile > advanced tab, there is a box to select tidy. You could disable it; it's failing anyhow, so you may as well save the processor the effort of trying.

If you go to the extractor pattern and the NAME token, then open the properties window for it, you'll see there is no regex in there. 90% of tokens should have a regex, but sometimes not having one is good too.
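To illustrate why a regex on the token matters, here is a plain-Java sketch with hypothetical HTML (the token syntax aside, the underlying matching is ordinary regex): an unconstrained token behaves like a greedy wildcard and can run past the tag you meant to stop at, while something like `[^<>]*` cannot cross a tag boundary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Why giving a token a regular expression helps: an unconstrained token acts
// like a greedy .*, while [^<>]* has to stop at the next tag.
public class TokenRegexDemo
{
    public static void main(String[] args)
    {
        String page = "<span class=\"name\">Jane Doe</span><span>Other</span>";

        // Greedy wildcard: runs past the first closing tag.
        Matcher greedy = Pattern.compile("<span class=\"name\">(.*)</span>").matcher(page);
        if (greedy.find())
            System.out.println(greedy.group(1)); // Jane Doe</span><span>Other

        // Constrained: cannot contain < or >, so it stops at the tag boundary.
        Matcher bounded = Pattern.compile("<span class=\"name\">([^<>]*)</span>").matcher(page);
        if (bounded.find())
            System.out.println(bounded.group(1)); // Jane Doe
    }
}
```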

Here's everything

Hi Jason,

I made the adjustments, and it's great to save on the processor effort, but it's still not working. It prints out everything with the HTML.

Here is everything I am doing:

IMPORT SCRIPT (interpreted java, run before scraping session begins)

File inputFile = new File( "urls.txt" );

// These two objects are needed to read the file.
FileReader in = new FileReader( inputFile );
BufferedReader buffRead = new BufferedReader( in );

// Read the file in line-by-line. Each line in the text file
// will contain a search term.
while( ( nextURL = buffRead.readLine() ) != null )
{
        // Set a session variable corresponding to the URL.
        session.setVariable( "URL", nextURL );

        // Get search results for this particular URL.
        session.scrapeFile( "scraping-files-in-local-folder" );
}

// Close up the objects to indicate we're done reading the file.
in.close();
buffRead.close();

THEN SCRIPTS THAT RUN ON THE EXTRACTOR PATTERN TAB.
FIRST, YOUR SCRIPT TO RUN AFTER EACH PATTERN MATCH (INTERPRETED JAVA)
SECOND THE WRITE DATA TO A FILE SCRIPT I POSTED

I HAVE ONE MAIN EXTRACTOR PATTERN

THEN A FEW SUB EXTRACTOR PATTERNS LIKE ~@NAME@~, Job description~@LOCALITY@~

I updated the script I posted

I updated the script I posted on 07/02/2014. Paste it and try again.

This is the output

Starting scraper.
Running scraping session: scraping-files-in-local-folder
Processing scripts before scraping session begins.
Processing script: "importing-urls-from-txt-file"
Scraping file: "scraping-files-in-local-folder"
Scraping local file: /Users/me/Downloads/11332939.html
scraping-files-in-local-folder: Processing scripts before all pattern applications.
scraping-files-in-local-folder: Extracting data for pattern "scraping pattern"
scraping-files-in-local-folder: The following data elements were found:
scraping pattern--DataRecord 0:

THEN ALL THE HTML FROM THE WEBPAGE (THERE'S A LOT)

scraping-files-in-local-folder: scraping pattern: Processing scripts after a pattern application.
Processing script: "New Script"
Writing data to a file
The token "NAME" in sub-extractor pattern #1 has no regular expression.
scraping-files-in-local-folder: scraping pattern: Processing scripts once if pattern matches.
scraping-files-in-local-folder: scraping pattern: Processing scripts after all pattern applications.
Processing scripts after scraping session has ended.
Processing scripts always to be run at the end.
Scraping session "scraping-files-in-local-folder" finished.

Not working

Thanks Jason for all the help, but it's still giving me lots of HTML like br, li, span ...

You must be getting sick of this.

I'm out of ideas, other than emailing you the file to look at and the webpage I'm trying to scrape.

I think I gave you everything that I'm doing but I may have missed something out, I don't know ...

I've sent you all the scripts, when and where they run.

I hope this isn't a dead end, as if we can get it to work it would be an amazing time saver for what I do.

Thanks

I attached a scraping session

I attached a scraping session to this thread. If you download and import it, when you run it you will see the CURRENT_IP in the log has HTML, but in the "output/Test instance.csv" file it has been stripped.

Hope it helps.

This is what I get when I run the session

Starting scraper.
Running scraping session: Test instance
Processing scripts before scraping session begins.
Processing script: "Write status to log"
=================== Log Variables with Message ===============
screen-scraper Instance Information
=================== Variables being monitored ================
Java Vendor : Apple Inc.
Java Version : 1.6.0_65
OS Architecture : x86_64
OS Name : Mac OS X
OS Version : 10.7.5
SS Connection Timeout : 180 seconds
SS Edition : Basic
SS Extractor Timeout : 30000 milliseconds
SS Max Concurrent Scraping Sessions : null
SS Maximum Memory : 256 MB
SS Memory Use : 32%
SS Run Mode : Workbench
SS Version : 6.0
======== Message logged at: 07/22/2014 12:10:17.42 CEST ========
Scraping file: "IP address"
IP address: Resolved URL: http://www.icanhazip.com
IP address: Sending request.
IP address: Processing scripts before all pattern applications.
IP address: Extracting data for pattern "Get IP address"
IP address: The pattern did not find any matches.
IP address: Get IP address: Processing scripts once if no matches.
IP address: Get IP address: Processing scripts after all pattern applications.
IP address: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Processing scripts after scraping session has ended.
Processing scripts always to be run at the end.
Scraping session "Test instance" finished.

The big problem here is IP

The big problem here is
IP address: The pattern did not find any matches.
Can you go to http://www.icanhazip.com in a browser?

Two questions but fixed now THANKS!

Hi Jason,

Okay, not sure what I did but I fixed it, and it now works on my localhost.

I gave up trying externally online.

I'm not going to try to work it out, as it's been many long evenings trying to get it to work, and it does now ...

So 1000 thanks for making my life lots quicker with this file.

Two questions though.

I have changed the write-to-file script to include this, to write the localhost URL:

out.write("\n");
}

// Write columns
for (i=0; i<names.length; i++)
{
        var = names[i];
        var = var.toUpperCase();
        var = var.replaceAll("\\s", "_");
        out.write(fixString(dataRecord.get(var)));
        if (i<names.length-1)
                out.write("\t");
}
out.write(scrapeableFile.getCurrentURL() + "\t");
out.write("END");
out.write("\n");

// Close up the file
out.close();
}

The thing is, it writes it twice for some reason. Not sure if I put it in the right position.

Second,

If there are URLs in the list I import to scrape that are wrong, the way I used to do it would still have written START and END, with (I think) NULL in the missing fields on that line, so when I compared the list of imported URLs with the final list, they would be the same length and match up.

Now if there is a missing/wrong URL it skips it and goes on to write the next line, which will mean sorting things out after the scrape is done.

How do I get it to write something in each line?

Here is the old csv writing script that did this


FileWriter out = null;

try
{
session.log( "Writing data to a file." );

// Open up the file to be appended to.
out = new FileWriter( "data-from-files-in-local-folder.txt", true );

// Write out the data to the file.

out.write( "START" + "\t" );
out.write(scrapeableFile.getCurrentURL() + "\t" );
out.write( dataRecord.get( "NAME" ) + "\t" );
out.write( dataRecord.get( "LOCALITY" ) + "\t" );
out.write( "END" + "\n" );

// Close up the file.
out.close();
}
catch( Exception e )
{
session.log( "An error occurred while writing the data to a file: " + e.getMessage() );
}
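One possible way to get the old placeholder behavior back (this is my own assumption, not something from Jason's script): change fixString's fallback so a token that didn't match writes a visible marker instead of an empty string. For whole rows that never match at all, you would additionally need a small script on the extractor pattern's "Once if no matches" trigger (that trigger appears in the session logs above) to append a placeholder row. Here is the fixString variant as a standalone sketch:

```java
// Hypothetical variant of fixString: null or empty fields come back as a
// visible "NULL" marker so every row keeps the same number of filled columns.
public class FixStringWithPlaceholder
{
    static String fixString(String value)
    {
        if (value != null)
        {
            value = value.replaceAll("<[^<>]*>", " "); // strip tags
            value = value.replaceAll("\\s{2,}", " ");  // collapse whitespace
            value = value.trim();
        }
        // "NULL" marks a field the extractor pattern did not capture.
        return (value == null || value.isEmpty()) ? "NULL" : value;
    }

    public static void main(String[] args)
    {
        System.out.println(fixString(null));           // NULL
        System.out.println(fixString("<p>Leeds</p>")); // Leeds
    }
}
```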

Tried all over weekend before writing this

Hi Jason,

Thanks for posting the file.

I deleted all the scripts, folders ... in my screen-scraper and did a fresh import of your file.

I ran it and it didn't create a .csv file, or anything, in the output folder.

I've tried everything over the weekend, fiddling here and there, and all I can get it to do is create a .csv in the output folder with the words "Current IP", after creating a data record and sub-pattern for a random bit of HTML.

I run the scraping session, and afterward when I test the pattern I can see what I want to scrape, but what gets output is only the name of the pattern.

I changed your file to have, as the main extractor pattern, html xmlns ~@DATARECORD@~ /html, and scraped from avpgalaxy.net

and as sub: strong>em>Alien Isolation em~@Current@~
and other things so you can see the html in this reply.

Any ideas? I also searched my Mac for any other .csv in case it was being saved elsewhere, but it's not; anyway, I can see the .csv file that was saved, so it can't be that.

I really gave it a try and tried lots of different things, but when it does output, it's only the name of the pattern, not the actual pattern match.