Executing a ScrapeFile from an External Program
I want to execute a scraping session from an external program, either through SOAP or by linking to the .NET DLLs. I can't figure out how to do the following:
1. Start a scraping session from the external program;
2. Pause the session after running the first couple of scrapefiles and extract the data gathered so far;
3. Based on the extracted data from #2, let the external program get more user input through a web-based form (e.g. the website may ask a 'secret question' that requires human input before scraping can proceed to the following pages);
4. Continue the scraping session from where #2 left off, but with the additional user input taken from #3 (e.g. the answer to the 'secret question').
The SOAP and .NET APIs only seem to have functions to initialize a scraping session and execute it. There seems to be no function for running a specific scrapefile within the session. session.scrapeFile("") seems to do the job for internal scripts, but there seems to be no way to execute it from an external program. I also can't find any SOAP or .NET function call that will let me execute a script externally. Presumably, if there were a way to call a script from outside, I could make that script call session.scrapeFile().
Please advise.
You should be able to look at tutorial #4 for step 1: http://community.screen-scraper.com/Tutorial_4_Page_1
For step 2, you would build the pause into the scrape. You can use session.pause() for that.
Steps 3 and 4 are more difficult. There isn't a way to pass the session back to your web application and then resume based on input. You can do a little work to emulate that functionality, though. You could make your own web form that accepts data from the scrape, and insert a scrapeable file that fills in that form; the scrape then acts upon the response. For example, if you have the secret questions and answers in a database, the scrape finds the question on the site and submits it to your own web form. The web form looks up the answer and displays it, and you scrape that answer and continue with the scrape.
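To make the web-form idea a bit more concrete, here is a minimal sketch in plain Java of a lookup page that a scrapeable file could request. Everything in it is an assumption for illustration only: the AnswerLookup class, the /answer path, the "question" parameter, and the in-memory map standing in for your database are all hypothetical, and none of it is part of screen-scraper. It uses the HTTP server bundled with the JDK (com.sun.net.httpserver) and does no HTML escaping.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class AnswerLookup {
    // Hypothetical stand-in for the database of secret questions and answers.
    static final Map<String, String> ANSWERS = Map.of(
            "What is your pet's name?", "Rex");

    // Finds the "question" parameter in a raw query string and returns the
    // stored answer, or an empty string if the question is missing or unknown.
    static String answerFor(String rawQuery) {
        if (rawQuery == null) {
            return "";
        }
        for (String pair : rawQuery.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("question")) {
                String question = URLDecoder.decode(kv[1], StandardCharsets.UTF_8);
                return ANSWERS.getOrDefault(question, "");
            }
        }
        return "";
    }

    // Starts the lookup page on the given port (assumed endpoint: /answer).
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/answer", exchange -> {
            String answer = answerFor(exchange.getRequestURI().getRawQuery());
            // Wrap the answer in a marker element so an extractor pattern can find it.
            byte[] body = ("<span id=\"answer\">" + answer + "</span>")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

With this running via AnswerLookup.start(8080), the scrape would request something like http://localhost:8080/answer?question=... with the scraped question URL-encoded, and an extractor pattern would pick the answer out of the span. As written it only covers predefined questions.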
Thanks for the response. But wouldn't the above suggestion only work if I have a database of secret questions with pre-defined answers? What if the question really requires a human to answer it?
You're right, the database approach does assume predefined questions and answers. But when launching from a web application, the only time you can set variables is when the scrape is started.
We have devised a script that would get a scraped CAPTCHA image, and pop that up on the local machine, wait for input, and pass that back. If you can make something like that work, here is the script:
/*
Takes the session variable CAPTCHA_URL, generates a user input window, then saves the output to CAPTCHA_TEXT.
*/
import java.io.File;
import javax.swing.ImageIcon;
import javax.swing.JOptionPane;

// Download the CAPTCHA image to a uniquely named local file.
cfile = "captcha_image_" + System.currentTimeMillis();
session.log( "CAPTCHA_URL: " + session.getVariable( "CAPTCHA_URL" ) );
session.log( "CAPTCHA image file: " + cfile );
session.downloadFile( session.getVariable( "CAPTCHA_URL" ), cfile );
imageIcon = new ImageIcon( cfile );

// Prompt the user for the text in the image.
// showInputDialog returns null if the user cancels the dialog.
response = JOptionPane.showInputDialog
(
    null,
    "Enter the text in the image",
    "CAPTCHA Image",
    JOptionPane.QUESTION_MESSAGE,
    imageIcon,
    null,
    null
);
session.log( "User response: " + response );
session.setVariable( "CAPTCHA_TEXT", response );

// Release the icon, then delete the image now that we no longer need it.
imageIcon = null;
new File( cfile ).delete();
System.gc();
And also, how do I externally "resume" a session.pause'd scraping session?