HTTP Error - Invalid query
I'm scraping a site where the results are delivered over multiple pages. When I step through the site using the proxy server and Firefox, everything is fine until I click the Next Page link; then I get a blank screen. If I don't use a proxy server, the Next Page link works fine. I captured the URL using Live Headers, but when I try to use it to get results, I get the following message in the log: An HTTP error occurred while connecting to [URL]. The message was: Invalid query.
The site uses GET not POST, so everything is in the address line. It should be easy, but I can't figure out how to get past the first page of results.
Any ideas?
HTTP Error - Invalid query
Hi,
Aside from a lack of better error handling (which I've just added in), I believe this is actually functioning as designed. I'll explain a bit.
Given a URL with embedded session variables, such as
http://www.foo.com/default.asp?bar=~#BAR#~
screen-scraper will resolve those session variables, but will not attempt to URL-encode them. In the case above, if the session variable "BAR" held the value "this and that", screen-scraper would resolve the URL to
http://www.foo.com/default.asp?bar=this and that
This is an invalid URL, since space characters are disallowed. It would instead need to be encoded like so:
http://www.foo.com/default.asp?bar=this+and+that
Now, you might ask why screen-scraper doesn't just encode the values automatically. In answer, consider the following URL:
http://www.~#MY_DOMAIN#~.com/
Or even
~#URL#~
In both of these cases, if screen-scraper were to URL-encode the values of the embedded session variables, it could produce an invalid URL.
Getting back to the original question, I think there are a few good options to resolve the problem:
1. Remove the GET parameters from the URL, and put them instead under the "Parameters" tab (as proposed by fnirt). Here screen-scraper will safely URL-encode any of the values automatically.
2. Leave the URL as is, but encode any values in a script "Before the file is scraped" (as proposed by Scott). That way the session variables will already be encoded when the URL gets generated.
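As an illustration of option 2, here's a minimal sketch of the encoding step. It's written as standalone Java so it can run on its own; in an actual screen-scraper script the value would come from and go back to a session variable (e.g., via the session object), which is assumed rather than shown here.

```java
import java.net.URLEncoder;

public class EncodeVar {
    // Encode a raw value so it is safe to embed in a query string.
    // URLEncoder turns spaces into "+" and other reserved characters
    // into %XX escapes.
    static String encode(String raw) throws Exception {
        return URLEncoder.encode(raw, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // In a screen-scraper script, this value would come from the
        // "BAR" session variable and be stored back before the request.
        String bar = "this and that";
        System.out.println(encode(bar)); // this+and+that
    }
}
```

With the value encoded up front, the URL resolves to the valid form shown earlier (`bar=this+and+that`) instead of the raw one.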
Hopefully this clarifies things. Just let us know if we can provide further detail.
Kind regards,
Todd
HTTP Error - Invalid query
I might be misunderstanding, but...
I've found that trying to get to the URL:
http://www.foo.com/default.asp?bar=this+and+that
..
When I enter it in the URL box as
http://www.foo.com/default.asp?bar=~#BAR#~
(assuming BAR is a session variable containing "this and that")
screen-scraper tries to request it LITERALLY as
http://www.foo.com/default.asp?bar=this and that
But if the URL is
http://www.foo.com/default.asp
And I create a GET parameter of "bar" with the value of ~#BAR#~, it resolves it properly encoded.
I use this trick often. Quite often. Hoping I'm not exploiting a bug! :)
HTTP Error - Invalid query
jclerie,
The error you're experiencing is related to a previous issue with how screen-scraper handled certain characters in a URI (in this case it's probably the pipe character between "DPPO" & "Dental"). This issue is resolved in a later release of screen-scraper. I was able to scrape the URL in question using 3.0.66a but received the error message when trying it in the 3.0 basic edition.
Only the professional edition has the option of upgrading to alpha versions. If you are running the professional edition you can upgrade by:
1. Within the workbench, click on the wrench icon to open the settings window.
2. Check the box next to where it says "Allow upgrading to unstable versions".
3. Close the settings window.
4. Click on the Options menu and select "Check for updates".
5. Follow the instructions presented to complete the upgrade.
Now, you should be able to scrape the URL without the previous error.
If you're not running the professional edition, then you'll want to manually URL-encode the URL in a script before using it in a scrapeable file. There are two ways you can do this.
1. If I'm right that the pipe character "|" is causing the error, then do a replace-all on that one character like this:
url = url.replaceAll( "\\|", "%7C" );
2. If you want to be more thorough and take care of any bad characters you come across, apply the Java class [url=http://java.sun.com/j2se/1.4.2/docs/api/java/net/URLEncoder.html]URLEncoder[/url] to the URL's parameter values and it will handle all of the characters that would cause that error.
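For reference, here's a minimal sketch of both approaches. The URL is a stand-in (www.example.com), not the actual Aetna query:

```java
import java.net.URLEncoder;

public class InvalidQueryFix {
    // Option 1: replace just the offending pipe character with its
    // percent-encoded form. "\\|" is a regex escape for a literal "|".
    static String encodePipe(String url) {
        return url.replaceAll("\\|", "%7C");
    }

    // Option 2: encode a single parameter value with URLEncoder. Note that
    // URLEncoder should be applied to individual values rather than the
    // whole URL, or the "://" and "?" would get encoded as well.
    static String encodeValue(String value) throws Exception {
        return URLEncoder.encode(value, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String url = "http://www.example.com/search.do?product=DPPO|Dental PPO";
        System.out.println(encodePipe(url));
        System.out.println(encodeValue("DPPO|Dental PPO")); // DPPO%7CDental+PPO
    }
}
```

Option 1 is safe to run on the complete URL since it touches only the pipe; option 2 is the more general fix but needs to be applied value by value.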
Please let us know how this goes.
Thanks,
Scott
HTTP Error - Invalid query
The URL is http://www.aetna.com/index.htm. I choose the Find a Doctor link on the right.
Here's what the log shows:
Next Page - County: Sending request.
Next Page - County: An HTTP error occurred while connecting to 'http://www.aetna.com/docfind/provSummarySearch.do?state=FL&langpref=en&sortOrder=ASC&button_flag=S&groups=100&geo1=county&county=Lee&search_cat=dall&provider_category=dental&secureStatus=Y&site_id=docfind&sortBy=name&product=DPPO|Dental%20PPO&lastProvRow=100'. The message was: Invalid query.
HTTP Error - Invalid query
jclerie,
I don't have any suggestions off hand. There are a few too many variables to consider. Could you provide the URL in question or perhaps more information from the proxy log?
What version of screen-scraper are you using?
Thanks,
Scott
So what happens if I want to do the opposite?
Just stumbled upon this thread.
There are brackets in my URL, and they need to stay in there.
But because of the version I have, the URL is encoded.
Any way to get the reverse effect... make SS actually leave the [] in the URL?
Hi, I did some investigating
Hi,
I did some investigating on this one, and it turns out that there are certain characters that are illegal in a URL. When you use these characters in your web browser, the browser simply handles encoding them behind the scenes. Within screen-scraper we use an HTTP library called HttpClient, which tends to conform to the HTTP specification relatively strictly. We've found ourselves having to do things from time to time to make it a bit more forgiving. This is one such case. In screen-scraper, if you embed characters like square brackets (i.e., []) in a URL, internally we actually replace those with their encoded equivalents. If we don't, HttpClient generates an error and disallows the request. In fact, you can check sections 2.2 through 2.4.3 of the URI specification (RFC 2396) for details on which characters are disallowed: http://www.faqs.org/rfcs/rfc2396.html.
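As an illustration of the substitution described (not screen-scraper's actual internals, and with a made-up example URL), it amounts to something like:

```java
public class BracketFix {
    // Replace literal square brackets with their percent-encoded
    // equivalents (%5B and %5D). Per the URI spec, the encoded form
    // should be treated identically by the server.
    static String encodeBrackets(String url) {
        return url.replace("[", "%5B").replace("]", "%5D");
    }

    public static void main(String[] args) {
        System.out.println(encodeBrackets("http://www.example.com/item[1]"));
        // -> http://www.example.com/item%5B1%5D
    }
}
```

Since the encoded and literal forms are equivalent on the wire, the brackets being escaped shouldn't by itself break a scrape.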
Using the escaped characters *should* be equivalent to using the actual characters. Have you verified that the difficulty you've found in scraping that page isn't due to something else, such as a missing cookie?
Kind regards,
Todd