Webpage error breaks rest of scrape
This is kind of an odd error, so I'm going to describe it first, and if that isn't clear, I'll attach a sample scrape to demonstrate. First, go to http://www.familydollar.com/pages/store-locator.aspx. Then type in 56686 for the zip code and click Search. The site returns an error page. You can then click the back button or the link that says "Go back to site" to get back to the locator. At that point you can enter a different zip code, such as 56687, and it works fine.
My problem is not the error itself. It's that once my scrape hits that error, every later request of the same kind returns the error as well, even though it normally wouldn't. If I stop the scrape and immediately restart it from the next query, it works just fine until it hits the error again, and then the error shows up on every request from then on.
I tried clearing the cookies and session variables when the error comes, and trying to scrape the original page again to get the referrer right, but to no avail. What is it that makes the difference when I stop and start the scrape again? It's the same order of queries...
Let me know if this doesn't make sense and I can send an example scrape.
I follow your comment, but
I follow your comment, but when I go to http://www.familydollar.com/pages/store-locator.aspx and put in zip 56687 I don't get an error; it shows a good result with 5 stores.
Chris, I was able to
Chris,
I was able to replicate the error you describe and I recorded it in the proxy. I'll write up a bug on it and have it looked at.
Thanks,
Scott
Chris, I'm finding that the
Chris,
I'm finding that the error is happening for 56686 every time (regardless of whether I use the proxy). I don't think this has anything to do with screen-scraper and is just an anomaly on the site.
Not sure what to suggest other than skipping zips that return that error.
-Scott
Hey Scott, The 56686 error is
Hey Scott,
The 56686 error is not the problem. We're fine skipping zip codes that return that error. The problem is that in screen-scraper, once that error is hit, all zip codes get that error, until the scrape is restarted. If you're just browsing, then you can just search another zip code and it works fine. Does that make sense? We don't care about the individual error, but rather what it does to the rest of our scrape.
Thanks,
Chris
Chris, Could you send me your
Chris,
Could you send me your scraping session?
Thanks,
Scott
Chris, When scraping .NET
Chris,
When scraping .NET sites, it's important to note the sequence in which pages are requested. For example, at familydollar.com, each time you submit a zip code to search for, you do so from the store locator search page.
This is important with .NET sites because the search results page makes use of the values of VIEWSTATE, REQUESTDIGEST and other post parameters from the previous page. Your scraping session was set up to extract the VIEWSTATE, REQUESTDIGEST and other post parameters from the search page once and then reuse those values for every zip code you searched for.
I modified your scrape to extract fresh values of VIEWSTATE, REQUESTDIGEST and the other post parameters for each search submission.
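For anyone following along outside screen-scraper, here is a rough Python sketch of the same idea: re-read the hidden form fields from the search page before every submission instead of caching them once. This is not the modified scraping session itself. The hidden field names a page actually uses, the `zip` field name, and the `fetch_search_page` / `submit_search` callables are all assumptions for illustration, so check them against the real form.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return every hidden form field found in the given page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_search_post(search_page_html, zip_code, zip_field="zip"):
    """Build a POST body from the hidden fields of *this* copy of the page.

    zip_field is a hypothetical name; use whatever the real form posts.
    """
    fields = extract_hidden_fields(search_page_html)
    fields[zip_field] = zip_code
    return urlencode(fields)


def scrape_zips(zip_codes, fetch_search_page, submit_search, zip_field="zip"):
    """Request the search page again before each zip code.

    fetch_search_page() returns the search page HTML, and
    submit_search(body) posts the encoded body and returns the response.
    Both are supplied by the caller (hypothetical helpers).
    """
    results = {}
    for zip_code in zip_codes:
        html = fetch_search_page()  # fresh hidden fields every time
        body = build_search_post(html, zip_code, zip_field)
        results[zip_code] = submit_search(body)
    return results
```

The point of `scrape_zips` is the order of requests: search page, then submission, repeated per zip code, so a bad response for one zip never leaves stale values behind for the next one.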
For more info on scraping .NET sites, please see "Scraping ASP.NET Sites" on our blog.
-Scott
Hey Scott, thanks for this.
Hey Scott, thanks for this. I had tried something similar: when the scrape hit the error, it would go back to the initial search page, but I would also clear the cookies or do something else that messed it up. I guess I just didn't have the right combination of fixes in place. Lesson learned :)