403 Error received when trying to download images

My goal is to download multiple gif images from a particular site.

I have an extractor pattern as follows:

a href="/members/product_images/filestore/~@GIF_IMAGE_END_URL@~.GIF" onclick="return true" class="reddownload"

The token is stored as a session variable.

I wrote the following script to handle the downloading (it runs after each pattern match - not sure if this is correct):

session.downloadFile( "http://www.kwikeesystems.com/members/product_images/filestore/" + session.getVariable("GIF_IMAGE_END_URL") + ".GIF", "C://Product_Images//" + session.getVariable("PRODUCT_UPC") + ".GIF", 3);

The PRODUCT_UPC variable is stored from a previous extraction.

When I run the scrape this is what I see in the log:

Product Details: Extracting data for pattern "Image Download"
Product Details: The following data elements were found:
Image Download--DataRecord 0:
GIF_IMAGE_END_URL=0/0/0/0/0/0/1/1/2/0/2/2/2/2/1/1/2/0/0/0/12000/00193/00193CF
Storing this value in a session variable.
Product Details: Processing scripts after a pattern application.
Processing script: "Download Product Images"
Attempting a file download with the following maximum number of attempts: 3
ERROR: Failed to retrieve the file: http://www.kwikeesystems.com/members/product_images/filestore/0/0/0/0/0/.... The server returned a status code of 403.
The file download failed. Making another attempt...

I am not sure why I am getting a 403 error as I also have a script that downloads a text file from the same page as the images and it is successful in doing this. Also, if you copy and paste http://www.kwikeesystems.com/members/product_images/filestore/0/0/0/0/0/... in your browser it takes you to the image, with no 403 error.

I am new to screen-scraping. Can anyone help me solve this issue?

Thanks!

403

The site is blocking you. See the Wikipedia page on HTTP status codes for more info on what each code means.

Workaround?

I understand what 403 means. What I don't understand is why I get it when downloading the image but not when downloading the text file, which is linked from the same page, although it sits in a different folder.

Also, how come I can copy and paste it into the browser and access the image without being logged into the site?

There must be a reason for this and I am sure someone knows a workaround. Anyone?

In your download script,

In your download script, write out the URL to the image. Make sure that URL gets you to the image when you feed it to the browser. I bet that the URL is either incomplete or pointing you to a page that isn't the actual image.
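
As a rough illustration of the check suggested above, here is a minimal standalone Java sketch that builds the image URL the same way the download script does and prints it, so it can be pasted into a browser. The class and method names are made up for the example, and the sample token value is the one shown in the log earlier in the thread. Inside a screen-scraper script you would presumably write the string to the scraper's log instead of using System.out.

```java
final class UrlCheck {
    // Builds the URL exactly as the original download script does.
    static String imageUrl(String endUrl) {
        return "http://www.kwikeesystems.com/members/product_images/filestore/"
                + endUrl + ".GIF";
    }

    public static void main(String[] args) {
        // Sample token value copied from the log earlier in the thread.
        String endUrl = "0/0/0/0/0/0/1/1/2/0/2/2/2/2/1/1/2/0/0/0/12000/00193/00193CF";
        // Print the complete URL so it can be tested by hand in a browser.
        System.out.println("Downloading from: " + imageUrl(endUrl));
    }
}
```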

Example

Thanks for the response. Would you be able to provide me with an example of exactly what you mean?

Relative Link

I see what you mean by the full absolute URL. In my case that would be the following:

"http://www.kwikeesystems.com/members/product_images/filestore/a/0/0/0/0/0/0/1/2/2/1/0/2/0/0/1/0/2/2/1/2/59290/57218/57218CL.GIF"

However, my script was written as follows:

session.downloadFile( "http://www.kwikeesystems.com"+session.getVariable("GIF_IMAGE_END_URL")+".GIF", "C://Product_Images//" + session.getVariable("PRODUCT_UPC") + ".GIF", 3);

Shouldn't that work as it points the browser to the full URL?

I need to download the image for every product, and the image path changes from product to product. That is why I have an extractor pattern that pulls the part of the path that changes; I append it to the http part, which does not change, and add .GIF at the end. I do the same thing for the text file on the same details page, and that download works. The image download fails even though the code is the same.

I really appreciate your help with this as everything in my current scrape works. I just need to download the images and I am good to go.

Thanks!

I see a couple

I see a couple problems

  • You're pointing to a URL that would end in .GIF twice.
  • You have double forward slashes in the file path. Only backslashes need to be doubled, because the backslash is the escape character in a string. Forward slashes don't need escaping, so you can just use a single slash.
  • session.downloadFile(
    "http://www.kwikeesystems.com" + session.getVariable("GIF_IMAGE_END_URL")+".GIF",
    "C://Product_Images//" + session.getVariable("PRODUCT_UPC") + ".GIF", 3);

    Should look like:
    session.downloadFile(
    "http://www.kwikeesystems.com" + session.getVariable("GIF_IMAGE_END_URL"),
    "/Product_Images/" + session.getVariable("PRODUCT_UPC") + ".GIF", 3);

Still not working

Jason,

I tried the above and it did not work.

Here is what my extractor pattern looks like:

[span class="filedescription"]GIF[/span][a href="~@GIF_IMAGE_END_URL@~" onclick="return true" class="reddownload"][br /]
(Download now)[/a]

Here is the HTML code as copied from Last Response:

[tr]
[td width="19" bgcolor="#c5c8c9" valign="top"][input type="checkbox" name="item#100" value="/members/product_images/filestore/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF|KELLOGG COMPANY|CARR'S|CRACKERS|CHEESE MELTS 5 OZ BOX|CL|5929034237|0|/ccs/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF" /] [/td]
[td bgcolor="#c5c8c9" align="top"][span class="filedescription">GIF (Download now)[/a][/td]
[/tr]

Also, here is the code I use to successfully download the txt file:

session.downloadFile( "http://www.kwikeesystems.com/data_files/" + session.getVariable("PRODUCT_FILE"), "C://Product_Text_Files//" + session.getVariable("PRODUCT_UPC") + ".TXT", 3);

The double slashes worked here.

Let me know if you need more info. Again I appreciate all your help. Can't wait to solve this.

In your last response, see

In your last response, see how value starts with a slash? That refers back to the root, so your image isn't in:

http://www.kwikeesystems.com/data_files/members/product_images/filestore/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF|KELLOGG COMPANY|CARR'S|CRACKERS|CHEESE MELTS 5 OZ BOX|CL|5929034237|0|/ccs/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF

It should be in:
http://www.kwikeesystems.com/members/product_images/filestore/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF|KELLOGG%20COMPANY|CARR'S|CRACKERS|CHEESE%20MELTS%205%20OZ%20BOX|CL|5929034237|0|/ccs/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF
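
To see why the leading slash matters, here is a small standalone Java sketch using java.net.URI. It resolves a root-relative link and a plain relative link against the data_files folder. The class name and the file names sample.GIF and sample.TXT are placeholders, not real files on the site.

```java
import java.net.URI;

final class RootRelativeDemo {
    // Resolves href against base the way a browser resolves a link.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.kwikeesystems.com/data_files/";
        // A leading slash goes back to the site root, so data_files is dropped.
        System.out.println(resolve(base, "/members/product_images/sample.GIF"));
        // Without the leading slash the link stays inside data_files.
        System.out.println(resolve(base, "sample.TXT"));
    }
}
```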

If you are using the Enterprise Edition, you could go to your extractor pattern, open the properties on the "PRODUCT_FILE" token, and on the Advanced tab check "resolve relative URL to absolute URL". The variable would then hold the full URL, so your code would just be:
session.downloadFile(session.getVariable("PRODUCT_FILE"),
"C:/Product_Text_Files/" + session.getVariable("PRODUCT_UPC") + ".TXT",
3);

Also, I would urge you to not use the double forward slashes as that may be unpredictable.
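
For reference, here is a minimal Java sketch of the two ways to write the local path that avoid doubled forward slashes. The class and method names are invented for the example and the UPC value is just sample data. Whether a given version of screen-scraper treats the doubled forward slashes differently is not something this sketch can show.

```java
final class PathSlashDemo {
    // Backslash is the escape character, so each "\\" in source is one backslash.
    static String withBackslashes(String upc) {
        return "C:\\Product_Images\\" + upc + ".GIF";
    }

    // Forward slashes need no escaping and are accepted by Java file APIs on Windows.
    static String withForwardSlashes(String upc) {
        return "C:/Product_Images/" + upc + ".GIF";
    }

    public static void main(String[] args) {
        System.out.println(withBackslashes("5929034237"));
        System.out.println(withForwardSlashes("5929034237"));
    }
}
```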

Tried and then got this error

An error occurred while processing the script: Download Product Images
The error message was: IllegalArgumentException (line 4): Invalid uri 'http://www.kwikeesystems.com/members/product_images/filestore/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF|KELLOGG COMPANY|CARR'S|CRACKERS|CHEESE MELTS 5 OZ BOX|CL|5929034237|0|/ccs/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF': escaped absolute path not valid-- Method Invocation session.downloadFile

Not sure what this means. Any thoughts?

I do not have the enterprise edition. And the "product file" token is for the text file, which downloads fine. In my last response I only used it as an example of how a similar script works fine, but when used for the images it does not work as expected.

Again, thanks for your help.

They have some weird

They have some weird characters in there (spaces and pipes, going by the error message). You will need to encode the string.

You can use the Java URL encoder to do it.

Example?

Sorry, I am new at this. Could you possibly show me how to use the encoder in my script? Thanks!

url =

url = "http://www.kwikeesystems.com" + session.getVariable("GIF_IMAGE_END_URL") + ".GIF";
url = URLEncoder.encode(url, "UTF-8");
session.downloadFile(url, "C://Product_Images//" + session.getVariable("PRODUCT_UPC") + ".GIF", 3);

This is what I got

Product Details: Applying extractor pattern: Image Download
Product Details: Extracting data for pattern "Image Download"
Product Details: The following data elements were found:
Image Download--DataRecord 0:
GIF_IMAGE_END_URL=/members/product_images/filestore/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/1/2/2/0/59290/20832/20832CL.GIF|KELLOGG COMPANY|CARR'S|CRACKERS|ROSEMARY 5 OZ BOX|CL|5929020832|0|/ccs/a/0/0/0/0/0/0/2/0/0/0/0/1/2/1/2/0/1/2/2/0/59290/20832/20832CL
Storing this value in a session variable.
Product Details: Processing scripts after a pattern application.
Processing script: "Download Product Images"
Attempting a file download with the following maximum number of attempts: 3
Sorry the URL indicated to download the file is invalid: http%3A%2F%2Fwww.kwikeesystems.com%2Fmembers%2Fproduct_images%2Ffilestore%2Fa%2F0%2F0%2F0%2F0%2F0%2F0%2F2%2F0%2F0%2F0%2F0%2F1%2F2%2F1%2F2%2F0%2F1%2F2%2F2%2F0%2F59290%2F20832%2F20832CL.GIF%7CKELLOGG+COMPANY%7CCARR%27S%7CCRACKERS%7CROSEMARY+5+OZ+BOX%7CCL%7C5929020832%7C0%7C%2Fccs%2Fa%2F0%2F0%2F0%2F0%2F0%2F0%2F2%2F0%2F0%2F0%2F0%2F1%2F2%2F1%2F2%2F0%2F1%2F2%2F2%2F0%2F59290%2F20832%2F20832CL.GIF
The file download failed. Making another attempt...

Right ... I encoded too much

url = "http://www.kwikeesystems.com" + URLEncoder.encode(session.getVariable("GIF_IMAGE_END_URL"), "UTF-8") + ".GIF";
session.downloadFile(url, "C://Product_Images//" + session.getVariable("PRODUCT_UPC") + ".GIF", 3);

Sorry about that. Don't want to encode the http:// and stuff.
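
To make the over-encoding visible, here is a small standalone Java sketch that runs a whole URL through java.net.URLEncoder. URLEncoder is meant for form values, so it escapes the colon and every slash, which is why the log above shows http%3A%2F%2F. The class name, method name, and the short sample URL are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class OverEncodingDemo {
    // Encodes the entire string, including the scheme and the slashes.
    static String encodeAll(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http%3A%2F%2Fwww.kwikeesystems.com%2Fmembers%2FA+B%7CC
        System.out.println(encodeAll("http://www.kwikeesystems.com/members/A B|C"));
    }
}
```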

In fact, I may still be

In fact, I may still be encoding too much ... since I can't log into the site I can't test. You may need to do something like:

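url = "http://www.kwikeesystems.com";
String[] urlStuff = session.getVariable("GIF_IMAGE_END_URL").toString().split("/");
for (i=0; i<urlStuff.length; i++)
{
   // The leading slash produces an empty first piece; skip it to avoid "//"
   if (urlStuff[i].length() == 0) continue;
   // URLEncoder writes a space as "+", but in a URL path it has to be %20
   part = URLEncoder.encode(urlStuff[i], "UTF-8").replace("+", "%20");
   url += "/" + part;
}
session.downloadFile(url, "C://Product_Images//" + session.getVariable("PRODUCT_UPC") + ".GIF", 3);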
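
Since none of this could be tested against the site, here is a standalone Java sketch of the same per-segment idea that can be run outside screen-scraper. It also rests on one assumption that the thread does not confirm: going by the checkbox value in the HTML posted earlier, the token seems to capture a pipe-delimited string whose first field is the image path, so the sketch keeps only the text before the first "|". The class and method names are invented, and the sample value is a shortened copy of the one from that HTML. Whether this clears the 403 is unknown.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class ImageUrlBuilder {
    // Keeps only the text before the first '|' (assumed to be the image path).
    static String firstField(String extracted) {
        int bar = extracted.indexOf('|');
        return bar < 0 ? extracted : extracted.substring(0, bar);
    }

    // Encodes each path segment on its own so the slashes are left alone.
    static String encodePath(String path) {
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) continue; // a leading slash yields an empty piece
            try {
                // URLEncoder writes spaces as "+"; a URL path needs %20 instead.
                out.append('/').append(
                        URLEncoder.encode(segment, "UTF-8").replace("+", "%20"));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e); // UTF-8 is always supported
            }
        }
        return out.toString();
    }

    static String buildUrl(String host, String extracted) {
        return host + encodePath(firstField(extracted));
    }

    public static void main(String[] args) {
        // Shortened sample of the extracted value shown earlier in the thread.
        String extracted = "/members/product_images/filestore/a/0/0/0/0/0/0/2/0/0/0/0"
                + "/1/2/1/2/0/2/0/0/2/59290/34237/34237CL.GIF|KELLOGG COMPANY|CARR'S";
        System.out.println(buildUrl("http://www.kwikeesystems.com", extracted));
    }
}
```

If the token is later narrowed so that it captures only the path, firstField simply returns its input unchanged.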