Odd behaviour in SS, hanging, delete_me file, etc...

Greetings, I've been using Screen Scraper for almost a year now (version 3.0), and I've never had much of a problem with the application itself. The occasional bit of weird behavior has always been fixable with a restart of SS. Anyway, my latest job was to scrape a fairly basic site, found here: http://www.moribundcult.com/Merchant2/merchant.mvc?Screen=CTGY&Store_Cod... for all of its letters.

Upon gathering the navigation with the proxy, I clicked to generate a scrapeable file. Nothing happened after a minute. I clicked a few more times; still nothing. About 20 minutes later, around 10 instances of the scrapeable file showed up. OK, so maybe I was just being impatient? I went on as normal and worked out my extractor patterns and scripts. That went well, so I started the scrape: it resolves the URL, sends the request, and then hangs forever. Ending the scraping session doesn't seem to do anything, and the last entry in the log is still "Sending Request" with no mention of it being canceled. I tried again: same deal. Restarted SS: same deal.

Also, after attempting to scrape this site, common tasks in SS, such as marking a file as being manually invoked by a script, will often not work. I restarted my computer: no change. I re-installed Screen Scraper and re-entered all the data for the website: no change. The weirdest thing is that Screen Scraper has performed fine on the two new site scrapes I've done since attempting this one (as long as I close SS before working on the new site after tinkering with the problem site).

Since then I have noticed a file called "delete_me.htm" in my Screen Scraper root folder, which contains the HTML from the site I was trying to scrape.

This definitely seems to be an oddity, and I simply cannot fathom what the problem is. Any ideas?

Content size of page too big

Jeffreydean1,

I believe this is happening because the page's content is too large (631,768 KB). We've implemented a fix for cases where the content of the POST values is too large, but we'll need to do the same for the page content as well.

A workaround would be simply to construct a scrapeable file manually. Because this page does not contain any POST data, you can just copy and paste the entire URL into the scrapeable file's URL field. The GET parameters on the query string will be interpreted there just as they would be if entered individually under the Parameters tab.
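For example, since the goal is to hit the category page for each letter, an Interpreted Java script along these lines could drive that manually created scrapeable file. This is only a sketch: it assumes a scrapeable file named "Category page" whose URL field contains the full pasted URL, with the letter portion replaced by the embedded session variable ~#LETTER#~ (the scrapeable file name and the exact parameter carrying the letter are assumptions, since the full URL is truncated above).

    // Loop over the letters and scrape the category page for each one.
    String letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    for (int i = 0; i < letters.length(); i++) {
        // The URL field references this session variable as ~#LETTER#~.
        session.setVariable("LETTER", String.valueOf(letters.charAt(i)));
        // Invoke the manually created scrapeable file once per letter.
        session.scrapeFile("Category page");
    }

You'd also want to mark "Category page" as being invoked manually from a script so it only runs when the loop calls it.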

Let us know if that works, and we'll work on a fix in the meantime.

Thanks,
Scott

A few thoughts...

First, you mentioned that you are using version 3.0. I suggest upgrading to version 4.0 by following the instructions in our FAQ on the subject: http://community.screen-scraper.com/FAQ/Version4. You may even wish to upgrade to the latest alpha version. See http://community.screen-scraper.com/FAQ/NoUpdates for doing that. Having said that...I also had issues with creating scrapeable files after proxying the site. I am running the latest screen-scraper (4.0.18a), XP, and Firefox 3.0. After viewing the source of the page, I believe that the site is mostly done by hand, which often leads to malformed HTML, something that can cause screen-scraper to hiccup. Also, we had another developer here try the site out on his Linux machine, proxying the site using Opera as his browser. He too had the same result. We will continue to have a look at the site and see if I can find a solution for you.