Server Timeouts
During lengthy extraction sessions (running over several days), I find that some of the servers I'm querying will stop responding or refuse to return location results for a while, then re-establish a normal connection after some amount of time has passed. The page I am scraping either comes back completely blank, or returns HTML without the location results. Meanwhile, my scraping session keeps iterating through the zip codes without noticing that anything is wrong, so I end up with gaps of missing data in my extracted file.
I've used the variable session pause to mimic human clicks as best I can, and I've tried clearing all cookies and pausing for a few minutes whenever no extractor pattern matches, in the hope that this would fix the problem. Sometimes I write the zip code out to a file when there is no pattern match so I can track where my iteration missed (sketched below), but these methods are clumsy at best, very time consuming, and not very effective. The cookies don't seem to be the problem most of the time, and the session pause works only about half the time. Re-scraping the pages for the missed zip codes doesn't solve the problem either, because the server does the same thing during the second attempt at gathering the data.
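For reference, my logging workaround looks roughly like this, run after each file is scraped. This is only a sketch: the session variable ZIP and the output path are just illustrative, and noExtractorPatternsMatched() is my assumption for the right "no pattern matched" check, so verify it against your screen-scraper API docs.

    // Sketch of my current workaround, run after each file is scraped.
    // The session variable "ZIP" and the output path are illustrative;
    // noExtractorPatternsMatched() is assumed to be the right check.
    import java.io.FileWriter;

    if ( scrapeableFile.noExtractorPatternsMatched() )
    {
        // Record the zip code so I can re-scrape it later.
        FileWriter out = new FileWriter( "missed_zips.txt", true );
        out.write( session.getVariable( "ZIP" ) + "\n" );
        out.close();

        // Pause a few minutes in case the server is blocking me.
        session.pause( 180000 );
    }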
I need some other way of knowing if and when a server times out or refuses my connection. I want to know when the server's response has changed to something out of the norm, and I want to be able to distinguish that from an extractor pattern not matching simply because there is no location data for that zip code. If I can tell when the server has changed its response or is refusing my connection, I can save a lot of time. I have no idea why or how this server timeout and/or temporary blocking takes place; I just know it happens, especially when I run extractions for several days with thousands of requests.
Are there any built-in features of screen-scraper that can detect changes in the server response, or detect whether a server is refusing connections or timing out? What if the server returns no error codes in its response? Is there any sort of "refresh" I can perform to reset the connection, or to make the server treat me as a fresh new client, without involving proxy servers?
Thanks in advance for your advice and opinions.
This is the bane of scraping
This is the bane of scraping large datasets. The easiest thing to do is, on the scraping session > advanced tab, increase the max retries. This works great if the response eventually comes back, but if it gets through all the retries with no result, you have to dig through the log to find it.
You can get more control with API methods like wasErrorOnRequest and the Input/Output error check.
You can create a script that runs after each file is scraped and checks for each of those conditions. You can then set a pause and try again, or just record that it failed and move on ... whatever you need for the site's particular issues.
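For example, something along these lines as an "After file is scraped" script. This is only a sketch: wasErrorOnRequest is the call under discussion here, but the scrapeable file name "Search results", the ZIP and RETRY_COUNT session variables, and the pause length are placeholders to adapt, and session.pause/session.scrapeFile should be checked against your version's API.

    // "After file is scraped" script -- a sketch, not production code.
    // "Search results", "ZIP", "RETRY_COUNT", and the pause length are
    // placeholders you would adapt to your own scraping session.
    import java.io.FileWriter;

    Integer tries = (Integer) session.getVariable( "RETRY_COUNT" );
    if ( tries == null )
        tries = new Integer( 0 );

    if ( scrapeableFile.wasErrorOnRequest() )
    {
        // The server answered, but with a 4xx/5xx status: likely a
        // temporary block rather than a zip code with no locations.
        if ( tries.intValue() < 3 )
        {
            session.setVariable( "RETRY_COUNT", new Integer( tries.intValue() + 1 ) );
            session.log( "Server error; pausing before retry " + ( tries.intValue() + 1 ) );
            session.pause( 300000 );                 // back off five minutes
            session.scrapeFile( "Search results" );  // request the same file again
        }
        else
        {
            // Give up on this zip code and record it for a later pass.
            FileWriter out = new FileWriter( "failed_zips.txt", true );
            out.write( session.getVariable( "ZIP" ) + "\n" );
            out.close();
            session.setVariable( "RETRY_COUNT", new Integer( 0 ) );
        }
    }
    else
    {
        // Clean response: reset the counter for the next zip code.
        session.setVariable( "RETRY_COUNT", new Integer( 0 ) );
    }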
Input/Output Error
Thank you for your prompt reply and the reference to those API calls. Just curious: what does the Input/Output Error check look for when determining whether there was an error? I guess what I am asking is, how is it different from wasErrorOnRequest?
ErrorOnRequest means you got
wasErrorOnRequest means you got a valid HTTP response, but with a status that is not 20x or 30x. If you get a 403 or 500, it triggers.
The other means there was no response at all, so we don't know whether the server got the request, or whether it tried to respond ... just nothing.
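In script terms, the distinction might be handled roughly like this. A sketch only: getStatusCode() and getContentAsString() are my assumptions for the relevant accessors, so check the names against your screen-scraper API documentation.

    // Rough branch on the two failure modes described above.
    if ( scrapeableFile.wasErrorOnRequest() )
    {
        // A valid HTTP response arrived, but with an error status
        // such as 403 or 500.
        session.log( "HTTP error status: " + scrapeableFile.getStatusCode() );
    }
    else if ( scrapeableFile.getContentAsString().trim().length() == 0 )
    {
        // No usable response at all: the connection timed out or
        // dropped before the server said anything.
        session.log( "No response from server (possible I/O error)." );
    }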
Works great.
I tried wasErrorOnRequest, and it worked great. It stopped my scrape when there was a server timeout, which saved me a huge headache: now I know when the error was occurring, and I don't have to double back and check the validity of my results as carefully. Thank you!