Screen Scraper 7 hangs when scraping NGX website

We have been running this scrape for a few years without problems, until we upgraded to version 7. Now when we run the scrape it just hangs on the download (the exact message is "NGX - Download & Parse: Requesting URL:"). CPU and memory usage become stagnant after about 10 minutes, with zero CPU usage for at least an hour. I have increased the memory allocation to 1024 MB, but that seems to have no effect (usage still hovers around 360 MB).

Any suggestions?

Thank you for your help so far. We have upgraded to version 7.0.8a and we are now able to download the file, but it still hangs when it attempts to scrape it. It seems to reach about 385 MB of RAM and then gives up (i.e. 0 CPU for more than an hour).

Do you have anything in the log or the error.log? It sounds like you're running out of memory, so you might pepper some scripts with:

log.log("Memory: " + sutil.getMemoryUsage());

Especially if you have a loop around where it runs out of memory. If that is the case, you could increase the memory allocation or look for inefficiencies.
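To illustrate the pattern being suggested: inside screen-scraper you would call log.log("Memory: " + sutil.getMemoryUsage()) from a script, but the same "log memory every N records inside the loop" idea can be sketched in standalone Python using the stdlib tracemalloc module as a stand-in for sutil.getMemoryUsage() (the loop body here is just a placeholder for real extraction work):

```python
# Standalone sketch of "pepper the loop with memory logging".
# Assumption: tracemalloc stands in for screen-scraper's
# sutil.getMemoryUsage(); the list append stands in for scraped records.
import tracemalloc

tracemalloc.start()
rows = []
for i in range(10000):
    rows.append("record-%d" % i)        # simulate accumulating scraped data
    if i % 2500 == 0:                   # log only every N records, so logging stays cheap
        current, peak = tracemalloc.get_traced_memory()
        print("row %5d: current=%d bytes, peak=%d bytes" % (i, current, peak))
tracemalloc.stop()
```

Logging only every N iterations keeps the overhead low while still showing whether memory climbs steadily (a leak or inefficiency) or plateaus before the hang.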

The problem is that it never makes it to my scripts. It starts to do the extraction and never comes back. I have increased the memory to 1024MB and it seems to have had no effect. The log output is:

Starting scraper.
Running scraping session: NGX
Processing scripts before scraping session begins.
Processing script: "!Initialize Global Variables"
Executing script: "!Set Current Date".
Processing script: "@NGX - Init"
Scraping file: "NGX - Download & Parse"
NGX - Download & Parse: Requesting URL:

and that is as far as it gets. I have left it running for over 2 hours with the memory sitting around 380 MB and CPU at 0 for most of that time. The machine running this has 16 GB of RAM and had 9.7 GB free at the time.

Let me know if you need any more information.

I think I see the problem. The page is 685.07 KB, and if I go to it in Firefox and view source, it takes forever to load. If I do the same in Chrome, it crashes.

So when I made a scrapeableFile for that page, I went to the advanced tab and set it to "do not tidy HTML", so the scrape won't have to wait for tidying.

Next, I set the extractor timeout to 300 seconds and the memory to 512 MB (256 MB wasn't enough). Also note that I changed the logging level, because writing the log takes time that I don't want to spend.

My log looks like this, and you can see my scrape attached.

Starting scraper.
Running scraping session: Test NGX
Processing scripts before scraping session begins.
Processing script: "Log init"
=================== Log Variables with Message ===============
screen-scraper Instance Information
=================== Static Values ================
Java Vendor: Oracle Corporation
Java Version: 1.8.0_66
OS Architecture: amd64
OS Name: Windows 7
OS Version: 6.1
Scrape HTTP Client: AsyncScrapingHttpClient2
SS Connection Timeout: 180 seconds
SS Edition: Enterprise
SS Extractor Timeout: 300000 milliseconds
SS Max Concurrent Scraping Sessions: 5
SS Maximum Memory: 512 MB
SS Memory Use: 27%
SS Run Mode: Workbench
SS Version: 7.0.7a
======== Message logged at: 04/14/2017 09:41:28.752 MDT ========
>>>Found 57065
Scraping session elapsed running time: 1 minute 47 seconds
Processing scripts always to be run at the end.
Scraping session "Test NGX" finished.

Thanks for the suggestions. It looks like "do not tidy HTML" and changing the extractor timeout did the trick.

We tried the session->Advanced tab options and we got the same results.

We also tried to update, but we received a message that says 'An error occurred while checking for updates.' We are able to write to the folder the application resides in and we have an internet connection. We have also tried the python method, but we get errors about the print statements needing parentheses (we are using 3.6.0; which version should we be using?).

The links on your website for manually downloading the update also do not work. Is there another way to get it?


Because we were changing servers, the updates were down for a while yesterday. They should be working now.

On the session > advanced tab there is now a selector for the HTTP client. I would try the other options there.

It might also be good to update to the latest alpha as we're still modifying the HTTP clients.