screen-scraper support for licensed users
Running multiple instances of screen-scraper on one machine
I've created a batch file that runs scrapes daily on my computer from the command prompt at scheduled times. I frequently have other scrapes running in the GUI version of screen-scraper, because their scripts require breakpoints and other troubleshooting.
No complete URL showing on product page
Hi - the hard-of-thinking Jason here.
On the page I am scraping there is no complete URL in the HTML to scrape, and the image I wish to download is referenced by filename only, with no path.
Is there a way of determining the full URL when it is not on the page, and then saving it alongside the data in a CSV file?
If this isn't possible, can I somehow save the URL from an earlier page (the results page) that I am scraping?
The site I am trying to scrape isn't very well put together!
Thanks in advance
Jason
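One common approach when a page references an image by bare filename is to resolve that filename against the URL of the page it appeared on, which you already know at scrape time. A minimal sketch in plain Java, assuming the image lives in the same directory as the page (the example URL and filename below are placeholders, not taken from the actual site):

```java
import java.net.URI;

public class UrlResolver {
    // Resolve a bare image filename against the URL of the page it appeared on.
    // URI.resolve replaces the last path segment of the base URL with the
    // relative reference, which matches how a browser would fetch the image.
    public static String resolve(String pageUrl, String imageFile) {
        return URI.create(pageUrl).resolve(imageFile).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL and image filename for illustration only.
        System.out.println(resolve("http://www.example.com/products/item42.html", "photo.jpg"));
        // prints "http://www.example.com/products/photo.jpg"
    }
}
```

The resolved string can then be stored in a session variable and written to the CSV alongside the rest of the record. If the image actually lives in a different directory, you would need to find that path once (for example in the page's `<img>` tag markup or an earlier results page) and use it as the base instead.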
Log shows error "An input/output error occurred while connecting to https//... The message was peer not authenticated."
The scraping session works locally, but shows the following error.
I tried checking the "Use only SSL version 3" checkbox under the "Advanced" tab, but it still shows the same error.
Please suggest how to rectify this issue.
Starting scraper.
Running scraping session: Derby_Scraping_Session
Processing scripts before scraping session begins.
Processing script: "Derby - Init Script"
Scraping file: "Derby - Select Search Page"
Derby - Select Search Page: Resolved URL: https://eplanning.derby.gov.uk/acolnet/planningpages02/acolnetcgi.gov
Derby - Select Search Page: Sending request.
Server Timeouts
During lengthy extraction sessions (running over several days), I find that some of the servers I am querying will stop responding to my requests for a while, or refuse to return location results, and then re-establish a normal connection after some time has passed. The page I am scraping either comes back completely blank, or returns HTML without the location results. Meanwhile, my scraping session keeps running and iterating through the zip codes without noticing that anything is wrong, so I end up with gaps of missing data in my extracted file.
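In a screen-scraper script this would typically mean checking after each request whether the extractor patterns actually matched before moving on to the next zip code, and retrying (or logging the zip code as failed) when the page came back blank. The sketch below shows the retry-with-back-off pattern in plain Java, with a hypothetical "flaky" fetcher standing in for the real request; it is an illustration of the pattern, not screen-scraper API:

```java
import java.util.function.Supplier;

public class RetryFetcher {
    // Retry a fetch until it returns a non-empty body, with a linear back-off
    // between attempts; returns null when every attempt came back blank.
    public static String fetchWithRetry(Supplier<String> fetch, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String body = fetch.get();
            if (body != null && !body.isEmpty()) {
                return body;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis * attempt); // back off longer each time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        // Returning null lets the caller log this zip code as missing,
        // instead of silently leaving a gap in the output file.
        return null;
    }

    // Hypothetical demo fetcher that returns a blank body the first
    // `failures` times it is called, then "results".
    public static Supplier<String> flaky(int failures) {
        int[] calls = {0};
        return () -> {
            calls[0]++;
            return calls[0] <= failures ? "" : "results";
        };
    }
}
```

The same idea applies whether the failure is a blank page or HTML without the results block: the validity check just needs to look for a marker that only appears on a good page.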
Data Extraction Timeout
Is there a method or other way to set the Data Extraction Timeout for a specific Session or Scrapeable file?
Is it possible to use back reference RegEx in Mappings?
I have a pattern I would like to remove some characters from such as this: "BuyPrice":"274Â 900 $",
The "Â " in the pattern should be removed, leaving only "274900".
I was thinking something like FROM: ([0-9]*)Â ([0-9]*) TO: \1\2 TYPE: RegEx might work in the mapping section for the token.
But that does not appear to work.
Is it possible to use this sort of RegEx back-reference substitution in tokens?
Any suggestions for a better way to handle this?
I was thinking perhaps a script could clean the value up just prior to writing out the data, if the token RegEx cannot handle the transformation.
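A script run after each pattern match can do this cleanup reliably (screen-scraper scripts are Interpreted Java). One thing to watch: in Java's `replaceAll`, back-references in the replacement string are written `$1$2`, not `\1\2`, which may be why the `\1\2` form failed. A sketch, using `\u00C2\u00A0` for the mis-encoded "Â " pair (a stray Â followed by a non-breaking space, which is what UTF-8 text read as Latin-1 typically produces):

```java
public class PriceCleaner {
    // Back-reference form: $1$2 keeps the two digit groups and drops the
    // "Â " (U+00C2 + U+00A0) between them.
    public static String clean(String raw) {
        return raw.replaceAll("([0-9]+)\u00C2\u00A0([0-9]+)", "$1$2");
    }

    // Blunter but more robust: strip everything that is not a digit,
    // which also removes the trailing " $".
    public static String digitsOnly(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }
}
```

If the separator varies from page to page, the `digitsOnly` form is the safer choice, since it does not depend on matching the exact garbled byte sequence.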
REST API appears to be locking up when issuing a rest?action=run_scraping_session
Updated to 6.0.46a running in Server mode.
Tried issuing a REST call for rest?action=run_scraping_session
The wget command seems to hang and never return.
This is new behavior; in 6.0.45a the call returns immediately.
The session does appear to start; we just have to Ctrl-C at the command line to get the terminal session back so we can enter the next command.
Is there something new in 6.0.46a we should know about, or is this a bug?
Reading in from a pipe-delimited text file as opposed to CSV
When trying to read in from a pipe (|) delimited text file, the scraper can't understand the value delimiter and will only work with a comma separator. Any idea why this is? I have also tried changing the file encoding and the character set on the scrapeable file. Here's my failing read-in:
import java.io.*;
////////////////////////////////////////////
session.setVariable("INPUT_FILE", "D:/xxxxx.txt");
////////////////////////////////////////////
BufferedReader buffer = new BufferedReader(new FileReader(session.getVariable("INPUT_FILE")));
String line = "";
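The likely culprit is not the file encoding but the split itself: `String.split` takes a regular expression, and `|` is the regex alternation operator, so an unescaped `split("|")` matches the empty string at every position and splits the line into single characters. Escaping the pipe makes it a literal delimiter. A self-contained illustration:

```java
public class PipeSplit {
    // Unescaped "|" is regex alternation: the pattern matches the empty
    // string everywhere, producing one piece per character.
    public static String[] splitWrong(String line) {
        return line.split("|");
    }

    // "\\|" escapes the pipe so it is treated as a literal delimiter.
    public static String[] splitRight(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        System.out.println(splitRight("alpha|beta|gamma").length); // prints 3
    }
}
```

So in the read loop following the snippet above, `line.split("\\|")` should behave exactly like `line.split(",")` does for a CSV.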
Saving tidied data using sutil.tidydatarecord
I have got the sutil tidier to tidy my data record, but I am unsure how to save the tidied data to a CSV file (as instructed below). It still saves the original, untidied record.
I call this script after each pattern match for my DATARECORD:
DataRecord tidied = sutil.tidyDataRecord(dataRecord);
// Run code here to save the tidied record
Then I use the standard csv writer as below:
// Retrieve CsvWriter from session variable
writer = session.getv( "WRITER" );
// Write dataRecord to the file (headers already set)
writer.write(dataRecord);
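The snippet above writes the original `dataRecord`, not the tidied copy. Assuming `sutil.tidyDataRecord` returns a new, cleaned record rather than modifying the one passed in (which is what the assignment to `tidied` suggests), the fix is to pass that returned copy to the writer, i.e. `writer.write(tidied);` and to do the tidying and the write in the same script so `tidied` is still in scope (a local variable from one script is not visible in another unless you store it in a session variable). A self-contained analog of the copy-versus-original pitfall, using a plain `Map` as a stand-in for the screen-scraper `DataRecord`:

```java
import java.util.HashMap;
import java.util.Map;

public class TidyExample {
    // Stand-in for a tidier like sutil.tidyDataRecord: returns a cleaned
    // *copy*, leaving the original record untouched.
    public static Map<String, String> tidy(Map<String, String> record) {
        Map<String, String> tidied = new HashMap<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            tidied.put(e.getKey(), e.getValue().trim());
        }
        return tidied;
    }
}
```

Writing the original record after calling a tidier like this saves the untrimmed values, which is exactly the symptom described.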
SOAP API Client Issues
As far as I can see from responses to questions about the SOAP API, the recommendation is to use the REST API instead; however, we need the SOAP API, since the REST API does not provide some of the capabilities we require. So I have been working on a Java SOAP client, and I have run into several major issues.
Our ideal implementation would be to use JAX-WS to generate the client-side Java classes from the WSDL. This is not possible with the current SOAPInterface WSDL, as the rpc/encoded style is not supported by JAX-WS.
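When JAX-WS refuses an rpc/encoded WSDL, the usual workarounds are either to use a stack that still supports rpc/encoded (such as Apache Axis 1.x) or to hand-build the SOAP envelopes and POST them over plain HTTP, skipping generated stubs entirely. A minimal sketch of the hand-built approach; the operation name and namespace below are placeholders, not the actual SOAPInterface contract:

```java
public class SoapEnvelope {
    // Hand-build a SOAP 1.1 request envelope for an rpc-style operation,
    // bypassing JAX-WS code generation entirely. The caller POSTs the
    // result with Content-Type "text/xml" and the appropriate SOAPAction.
    public static String build(String operation, String namespace) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\""
            + " xmlns:api=\"" + namespace + "\">"
            + "<soapenv:Body><api:" + operation + "/></soapenv:Body>"
            + "</soapenv:Envelope>";
    }
}
```

This trades the convenience of generated classes for control over the wire format, which is sometimes the only practical option against an rpc/encoded service.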