When running, the proxy server listens on a specified port for incoming HTTP requests from your web browser. Upon receiving a request from your browser, the proxy server records it, then sends it along to the server for which it was intended. When that server responds, the proxy server receives the response, records it as well, then sends it along to your web browser.
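The record-and-forward cycle described above can be sketched in a few lines of Python. This is only a toy model, not screen-scraper's implementation: `RecordingProxy`, `handle`, and `fake_server` are hypothetical names made up for illustration, and plain strings stand in for real HTTP messages.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RecordingProxy:
    """Toy model of a recording proxy: log each message, then pass it on."""

    # Hypothetical callable standing in for the remote server.
    forward: Callable[[str], str]
    log: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, request: str) -> str:
        self.log.append(("request", request))    # record the browser's request
        response = self.forward(request)         # send it to the intended server
        self.log.append(("response", response))  # record the server's reply
        return response                          # hand the reply back to the browser


def fake_server(request: str) -> str:
    # Hypothetical stand-in for a remote web server.
    return "200 OK for " + request


proxy = RecordingProxy(forward=fake_server)
reply = proxy.handle("GET /index.html")
```

After the call, `proxy.log` holds the request followed by the response, which is the same pairing the proxy server presents as a recorded transaction.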
screen-scraper's proxy server allows you to view HTTP requests and responses as they pass between your web browser and remote servers. When scraping web sites, you need to pay attention to a few more details than you typically worry about when surfing, such as HTTP headers and POST data. The proxy server makes all of these details visible to you.
One of the headaches of scraping information from sites that use HTTPS is that it's not always easy to tell what's getting passed back and forth in the way of cookies, POST data, etc. Even if you put a proxy server in the way that lets you view the requests and responses, the information is encrypted as it's leaving your browser and as it's leaving the web server that responds to the request. screen-scraper gets around this problem by using its own temporary certificate to encrypt traffic between itself and the browser, and then encrypting each request before sending it up to the server. As a result, your browser will issue a warning about the certificate that screen-scraper returned. You can safely accept the certificate and be assured that all your traffic is encrypted.
We've also used other proxy software, such as Charles proxy, for handling SSL sites. They have additional features to allow the browser to trust the certificates so you don't see a warning in the browser. We've also added an import proxy session feature so you can import a JSON Session File from Charles and use it in screen-scraper to build a scraping session.
This feature is only available to Professional and Enterprise editions of screen-scraper.
screen-scraper has the ability to act as a proxy while in server mode. Combined with the ability to execute scripts, this functionality opens up many possibilities for how you use screen-scraper. More information about how to go about using screen-scraper in this capacity is available on our using scripts with the proxy server page.
First, create a proxy session to organize your interactions with the specific web sites.
Configuring a web browser to use a proxy server is generally pretty straightforward, but varies somewhat for each browser. We have provided instructions on how to set up different browsers:
Assuming you've configured everything and set up a proxy session, from here you should be able to start up the proxy server by selecting your proxy session in the objects tree and then clicking on the Start Proxy Server button in the general tab. Now just surf the pages that you want to record.
After you've surfed a bit with your web browser, click on the progress tab. From here you can view all of the HTTP and HTTPS requests and responses logged by the proxy server. Clicking on a transaction brings up its details in the lower pane.
If you are using Internet Explorer 7, you have to adjust your security settings. To do this, open Internet Options in the menu and, under the security tab, change the security level to medium.
If the security settings are not updated, you will see an error page when accessing a site that uses HTTPS encryption.
IE domain mismatch warning
This warning occurs because screen-scraper is using a temporary certificate for encryption that will not match the URL that you are accessing. You can safely ignore this warning by clicking Continue to this website. This practice is, however, not recommended.
Most browsers have recently started preventing you from accepting the certificate. As a workaround you can use another proxy service, such as Charles, to proxy SSL sites by installing their root authority certificate on your server (which prevents an error from showing up while using them). You can then import the proxy data into screen-scraper for building your scrape.
If you normally use an external proxy server when connecting to the internet (on your local area network, for example), you'll need to specify this information in screen-scraper's external proxy settings before you can run the proxy server.
This feature has been deprecated and by default is not available in the workbench interface. To enable proxy scripting please add AllowProxyScripting=true to your resource/conf/screen-scraper.properties file and restart screen-scraper.
screen-scraper has the ability to run custom-made scripts while the proxy server is running (more information on starting and stopping the server is available). This allows you to set up blacklists, filter web pages, or otherwise manipulate browser requests and server responses. It is recommended that you read about managing and using scripts before continuing.
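As a rough illustration of the kind of logic a blacklist script might contain, here is a minimal Python sketch of a host check. It deliberately uses no screen-scraper objects: the function name `is_blacklisted`, the `BLACKLISTED_HOSTS` set, and the example hosts are all hypothetical, and how the result would be applied to a request depends on the proxy objects described in the API documentation.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist; replace with the hosts you want to block.
BLACKLISTED_HOSTS = {"ads.example.com", "tracker.example.net"}


def is_blacklisted(url: str, blacklist=BLACKLISTED_HOSTS) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blacklisted."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check the host itself and each parent domain, e.g. a.b.c, then b.c, then c.
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))
```

Matching on parent domains means that a subdomain such as `img.ads.example.com` is caught by the `ads.example.com` entry, which is usually what a blacklist is expected to do.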
The scripts tab is used to associate scripts with a proxy. Depending on when you decide to run your script, certain built-in objects will be in scope that are unique to the proxy environment.
screen-scraper offers a few objects that you can work with in a script in the proxy environment. See the variable scope section and/or API documentation for more details.
Depending on when a script gets run, different variables may be in scope (available). The table that follows specifies what variables will be in scope depending on when a given script is run.
| When Script is Run | proxySession in scope | request in scope | response in scope |
|---|---|---|---|
| Beginning of proxy session | X | | |
| Before HTTP request | X | X | |
| After HTTP request | X | X | |
| Before HTTP response | X | X | X |
| After HTTP response | X | X | X |
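The scope rules in the table can also be written down as a small lookup, which may be handy as a checklist when deciding where to attach a script. This is only a Python transcription of the table; the `SCOPE` mapping and the `in_scope` helper are hypothetical names and are not part of screen-scraper.

```python
# Variables in scope at each trigger point, transcribed from the table above.
SCOPE = {
    "Beginning of proxy session": {"proxySession"},
    "Before HTTP request": {"proxySession", "request"},
    "After HTTP request": {"proxySession", "request"},
    "Before HTTP response": {"proxySession", "request", "response"},
    "After HTTP response": {"proxySession", "request", "response"},
}


def in_scope(when: str, variable: str) -> bool:
    """Return True if the named variable is available when the script runs."""
    return variable in SCOPE[when]
```

For example, `in_scope("Before HTTP request", "response")` is false, so a script that needs the response has to be attached to one of the two HTTP response trigger points.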
One of the best ways to fix errors is to simply watch the proxy session log (under the log tab in the proxy session) and the error.log file (located in the log directory of screen-scraper's install directory) for script errors. When a problem arises in executing a script screen-scraper will output a series of error-related statements to the logs. Often a good approach in debugging is to build your script bit by bit, running it frequently to ensure that it runs without errors as you add each piece.