I'm trying to scrape an HTML form that requires the user to type in text shown in an image. Can screen-scraper handle this?
This is known as a CAPTCHA mechanism, and is intended to discourage automated form submissions. There are essentially two ways of working with these:
- Often a site will use a poorly implemented CAPTCHA whose text can be determined without actually reading the image. For example, the site may have only four or five images and simply cycle through them. By looking at the names of the images, or at a hash of each one, one could determine what the corresponding text will be. The text could then be used to populate the appropriate HTML form.
- If the CAPTCHA isn't readable that way, you will need to scrape the session ID and then download the image in order to read it. From there, there are a few possibilities:
  - In workbench mode, you can make a pop-up that will prompt you to fill in the response. This is only a good solution if you won't have many CAPTCHAs to solve and the scrape is fairly short.
  - There are a number of third-party services, such as DeathByCAPTCHA, to which you can subscribe. You submit the image to the service, and it responds with the text to fill in.
  - If the text is fairly clean, one can use ImageMagick to clean up the image and then attempt to OCR it with an OCR engine such as Tesseract.
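
As a rough illustration of the first approach (a site that cycles through a handful of images), the hash lookup could be sketched as below. This is plain Python rather than a screen-scraper script, and the function names, sample bytes, and CAPTCHA texts are all made up for the example. In practice you would build the table once by downloading each image the site rotates through and reading it by hand.

```python
import hashlib


def image_digest(image_bytes: bytes) -> str:
    """Return a SHA-256 hex digest that identifies one exact image file."""
    return hashlib.sha256(image_bytes).hexdigest()


def build_lookup(samples):
    """Build a digest -> text table from (image_bytes, text) pairs.

    Hypothetical helper: the pairs are collected by hand beforehand.
    """
    return {image_digest(data): text for data, text in samples}


def solve_captcha(image_bytes: bytes, lookup: dict):
    """Return the known text for this image, or None if it is unfamiliar."""
    return lookup.get(image_digest(image_bytes))


# Stand-in bytes; a real scrape would use the downloaded image contents.
known = build_lookup([
    (b"fake-image-1", "x7Kp2"),
    (b"fake-image-2", "mQ94z"),
])

print(solve_captcha(b"fake-image-2", known))  # known image: its stored text
print(solve_captcha(b"never-seen", known))    # unknown image: None
```

A result of None means the site served an image that is not in the table, which is the cue to fall back to one of the other options above. Note that this only works if the site serves byte-identical files each time; an image that is regenerated or re-compressed per request will hash differently.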