response content type

Hello,

I have a scraper that spiders links discovered on arbitrary web sites. I try to filter out obvious URLs that I don't want to spider, e.g. ones that end in .doc, .pdf, etc. However, it is sometimes unavoidable: I still hit the occasional binary file, and screen-scraper tries to scrape it.

Is there a way to tell screen-scraper to fail fast if the content type of the response is something other than "text/xxx"?

The only other option I can think of is to connect to the URL manually in my script via HttpURLConnection first and check the header myself, so I'll know whether to try to scrape it or not... but that is a redundant call.

Thanks

Sounds like you just need to

Sounds like you just need to make an HTTP HEAD request against the page to retrieve the headers, then parse the response to determine whether you should make an HTTP GET and scrape its contents. screen-scraper has http://community.screen-scraper.com/API/makeHEADRequest to help you out with it.
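
For illustration, here is a rough sketch of that header check in plain Java using HttpURLConnection (the class mentioned in the question), not the screen-scraper API. The class name ContentTypeCheck, the helper names, and the 5-second timeouts are made up for the example, and it assumes that anything whose Content-Type does not start with "text/" should be skipped. Treat it as a starting point rather than a tested solution.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper class; none of these names come from screen-scraper.
final class ContentTypeCheck {

    // True when the Content-Type header value names a text/* media type.
    // A missing header is treated as "not text" so unknown responses are skipped.
    public static boolean isTextContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        return contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
    }

    // Sends a HEAD request, so only the response headers come back, no body.
    public static boolean looksLikeText(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // assumed timeout values, adjust as needed
            conn.setReadTimeout(5000);
            return isTextContentType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ContentTypeCheck <url>
        if (args.length > 0) {
            System.out.println(args[0] + " -> " + looksLikeText(args[0]));
        }
    }
}
```

Two caveats: some servers reject HEAD or answer it with different headers than they would send for GET, and types such as application/xhtml+xml would be skipped by this test even though they are scrapeable, so the prefix check may need widening.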

I can't think of another way

I can't think of another way to do it either ... there's no way to analyze the headers until they are requested.

I would most likely do as you suggest: make an initial request with http://community.screen-scraper.com/API/setMaxResponseLength set so that you get only a tiny response back, check the headers, and if the content type is okay, make the request again to get the full page.

Hmmm... once I started

Hmmm... once I started looking at scrapeableFile, I noticed [get/set]ForceNonBinary. This makes it sound like screen-scraper does take some account of whether the HTTP response is binary or not. What does it do differently if the response is binary? In my case it looks like it still tries to run the extractors on it.

Back to the original idea: I was thinking of just coding up an HttpURLConnection request in a script and checking the header from there. However, your suggestion of scrapeableFile.setMaxResponseLength makes me think I might be over-coding it, since it implies I could use scrapeableFile to make the request. I wasn't sure, though, because I don't see methods on scrapeableFile for interrogating the response... the getContentType method appears to deal with the outbound request, not the headers coming back from the remote server.

Thanks!