screen-scraper support for licensed users

Questions and answers regarding the use of screen-scraper. Only licensed Professional and Enterprise Edition users can post; anyone can read. Licensed users, please contact support with your registered email address for access. This forum is monitored closely by screen-scraper staff. Posts are generally responded to within one business day.

Running multiple instances of screen-scraper on one machine

I've created a batch file that runs scrapes daily on my computer via the command prompt at scheduled times. I frequently have other scrapes running in the GUI version of screen-scraper because of scripts that require breakpoints and other troubleshooting.

No complete URL showing on product page

Hi - the hard-of-thinking Jason here.

On the page I am scraping there is no complete URL in the HTML, and the image I wish to download is referenced by its filename only, with no path.

Is there a way of determining the URL when it is not on the page, and then saving it alongside the data in a CSV file?

If this isn't possible, can I somehow save the URL from an earlier page (the results page) that I am scraping?

The site I am trying to scrape isn't very well put together!
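
I'm imagining something like this rough sketch might be the way to go (where "FILENAME" is my token holding the bare file name, and the base path and variable names are just guesses):

// Rebuild the full image URL from an assumed base path; the directory
// below is hypothetical and would need to match the actual site
String base = "http://www.example.com/images/";
dataRecord.put("IMAGE_URL", base + dataRecord.get("FILENAME"));

// Alternatively, grab the URL of an earlier page (e.g. the results
// page) into a session variable so it can be written out with the data
session.setVariable("RESULTS_URL", scrapeableFile.getCurrentURL());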

Thanks in advance

Jason

Showing error as "An input/output error occurred while connecting to https//... The message was peer not authenticated." in log

The scraping session works locally, but shows the following error.
I tried checking the "Use only SSL version 3" checkbox under the "Advanced" tab, but it still shows the same error.
Please suggest how to rectify this issue.

Starting scraper.
Running scraping session: Derby_Scraping_Session
Processing scripts before scraping session begins.
Processing script: "Derby - Init Script"
Scraping file: "Derby - Select Search Page"
Derby - Select Search Page: Resolved URL: https://eplanning.derby.gov.uk/acolnet/planningpages02/acolnetcgi.gov
Derby - Select Search Page: Sending request.

Server Timeouts

During lengthy extraction sessions (running over several days), I find that some servers I am querying will stop responding or refuse to return location results for a while, then re-establish a normal connection after a set amount of time. The page I am scraping either comes back completely blank, or returns HTML without the location results. Meanwhile, my scraping session continues to iterate through the zip codes without noticing that anything is wrong, so I end up with gaps of missing data in my extracted file.
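
What I would like is something along these lines in an "After file is scraped" script (a rough sketch; the retry limit and the "Location Results" file name are placeholders):

// If no extractor patterns matched, back off and re-scrape the file
// a few times before giving up and moving on
Integer retries = (Integer) session.getVariable("RETRIES");
if (retries == null) retries = new Integer(0);

if (scrapeableFile.noExtractorPatternsMatched() && retries.intValue() < 3) {
    session.log("No location results; pausing before retrying.");
    session.setVariable("RETRIES", new Integer(retries.intValue() + 1));
    session.pause(60000); // wait one minute before retrying
    session.scrapeFile("Location Results");
} else {
    session.setVariable("RETRIES", new Integer(0));
}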

Data Extraction Timeout

Is there a method or other way to set the Data Extraction Timeout for a specific scraping session or scrapeable file?

Is it possible to use regex back references in Mappings?

I have a pattern I would like to remove some characters from, such as this: "BuyPrice":"274 900 $",
The "Â " in the pattern (a UTF-8 non-breaking space mis-read as Latin-1) should be removed to leave only "274900".

I was thinking something like FROM: ([0-9]*) ([0-9]*) TO: \1\2 TYPE: RegEx might work in the mapping section for the token, but that does not appear to work.

Is it possible to use this sort of RegEx back reference substitution in Tokens?

Suggestions for a better way to handle this?
I was thinking perhaps a script cleanup just prior to writing out the data, if the token regex cannot handle the transformation; something like the sketch below.
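
This is roughly what I had in mind, run after the pattern match and just before the record is written ("BUYPRICE" standing in for my token name):

// Strip everything that is not a digit: regular spaces, the
// mis-encoded non-breaking spaces ("Â "), and the currency sign
String raw = (String) dataRecord.get("BUYPRICE");
if (raw != null) {
    dataRecord.put("BUYPRICE", raw.replaceAll("[^0-9]", ""));
}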

REST API appears to be locking up when issuing a rest?action=run_scraping_session

Updated to 6.0.46a, running in server mode.
Tried issuing a REST call for rest?action=run_scraping_session.
The wget command seems to hang and never return.
This is new behavior; in 6.0.45a the call returned immediately.

The session does appear to start; we just have to Ctrl-C the command line to get the terminal session back for entering the next command.

Is there something new in 46a we should know about, or is this a bug?

Reading in from a pipe-delimited text file as opposed to CSV

When trying to read in from a pipe (|) delimited text file, the scraper can't understand the value delimiter and will only work with a comma separator. Any idea why this is? I have tried changing the file encoding as well as the character set on the scrapeable file. Here's my failing read-in code:

import java.io.*;

////////////////////////////////////////////
session.setVariable("INPUT_FILE", "D:/xxxxx.txt");
////////////////////////////////////////////

BufferedReader buffer = new BufferedReader(new FileReader((String) session.getVariable("INPUT_FILE")));
String line;
while ((line = buffer.readLine()) != null) {
    // Escape the pipe: an unescaped "|" is regex alternation in
    // String.split and would split between every character
    String[] values = line.split("\\|");
    session.log("Read " + values.length + " values: " + line);
}
buffer.close();

Saving tidied data using sutil.tidyDataRecord

I have gotten the sutil tidier to tidy my DataRecord, but I am unsure how to save the data in a CSV file (as instructed below). It still saves the original, untidied record.

I call this script after each pattern match for my DATARECORD:

DataRecord tidied = sutil.tidyDataRecord(dataRecord);

// Run code here to save the tidied record

Then I use the standard csv writer as below:

// Retrieve CsvWriter from session variable
writer = session.getv( "WRITER" );

// Write dataRecord to the file (headers already set)
writer.write(dataRecord);
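
Do I just need to hand the writer the tidied copy instead, like this?

// Write the tidied copy rather than the original record?
writer.write(tidied);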

SOAP API Client Issues

As far as I can see from responses to questions about the SOAP API, it is recommended to use the REST API instead; however, we need to use the SOAP API, since the REST API does not provide some of the capabilities we require. So I have been implementing a Java SOAP client and have run into several major issues.

Our ideal implementation would be to use JAX-WS to generate the client-side Java classes from the WSDL. This is not possible with the current SOAPInterface WSDL, as the rpc/encoded style is not supported by JAX-WS.
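
The workaround we are considering is to skip the generated classes altogether and build the envelope by hand with SAAJ. A minimal sketch of the idea; the endpoint URL, namespace, operation name, and session name below are assumptions for illustration, not the actual SOAPInterface contract:

import javax.xml.namespace.QName;
import javax.xml.soap.*;

// Build an RPC-style request by hand with SAAJ, since JAX-WS cannot
// consume the rpc/encoded WSDL. All names below are illustrative.
SOAPConnection connection = SOAPConnectionFactory.newInstance().createConnection();
SOAPMessage request = MessageFactory.newInstance().createMessage();

SOAPBody body = request.getSOAPPart().getEnvelope().getBody();
SOAPBodyElement op = body.addBodyElement(
        new QName("urn:SOAPInterface", "runScrapingSession"));
op.addChildElement("sessionName").addTextNode("My_Scraping_Session");
request.saveChanges();

// Endpoint URL is an assumption for illustration
SOAPMessage response = connection.call(request, "http://localhost:8779/soap");
connection.close();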