Proxy Sessions
Overview
A proxy session in screen-scraper is a record of the requests and responses that pass between a browser and a proxy server. It is useful for learning how to scrape a site, and it is also where you configure screen-scraper's proxy server. For more information, see our documentation about using the proxy server.
Managing Proxy Sessions
Adding
- Select New Proxy Session from the menu.
- Click on the globe in the button bar.
- Right-click on a folder in the objects tree and select New Proxy Session.
- Use the keyboard shortcut Ctrl-J.
Removing
- Press the Delete key when the proxy session is selected in the objects tree.
- Right-click on the proxy session in the objects tree and select Delete.
- Click the Delete button in the proxy session general tab.
Proxy Session: General Tab
General Tab
- Start Proxy Server: Starts the proxy server and records the requests and responses that go through it.
- Delete: Removes the proxy session from screen-scraper
- Name: The name used to refer to the proxy session
- Port: The port that this proxy session should connect to in the proxy server
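Any HTTP client, not only a browser, can be pointed at the proxy server so that its traffic is recorded by the proxy session. The following is a minimal sketch using Python's standard library. The host `localhost` and port `8777` are assumptions for illustration; substitute the value shown in the Port field. Only plain HTTP is configured here.

```python
import urllib.request

# Assumed values for illustration: use the Port set on the General tab.
PROXY_HOST = "localhost"
PROXY_PORT = 8777


def proxy_settings(host, port):
    """Build the proxy mapping understood by urllib's ProxyHandler."""
    return {"http": f"http://{host}:{port}"}


def build_proxied_opener(host=PROXY_HOST, port=PROXY_PORT):
    """Return an opener that sends HTTP requests through the proxy server."""
    handler = urllib.request.ProxyHandler(proxy_settings(host, port))
    return urllib.request.build_opener(handler)


opener = build_proxied_opener()
# With the proxy server started, a call such as
# opener.open("http://www.example.com/") would pass through it.
```

No request is sent until `opener.open(...)` is called, so the opener can be built before the proxy server is started.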
Proxy Session: Progress Tab
Progress Tab
- Clear All Transactions: Remove all of the transaction records currently in the list.
- Find (professional and enterprise editions only): Search the transactions for a text string.
- Detect JS Cookies: Show cookies that were not set by the server.
For this button to work correctly, clear your browser cookies before the proxy session starts recording transactions. Otherwise, cookies that already exist in the browser would be mistaken for JavaScript cookies.
- Filter out less useful transactions (professional and enterprise editions only): When checked, files that are unlikely to contain desired information do not show up in the transactions list. This includes such things as JavaScript and CSS files.
- Don't record binary files: When checked, screen-scraper leaves binary files, such as images and other media, out of the list of transactions under the progress tab. This makes it easier to find the transactions you want without looking through everything that passes through the proxy server.
Transactions not included in the list are still recorded to the proxy session log.
- HTTP Transactions: A log of each of the transactions that has taken place (except for binary files if you have selected not to log them).
- #: The order in which the requests were initiated.
- Note: Editable field to help keep track of the transactions. When transactions are turned into scrapeable files, the note becomes the initial name of the scrapeable file.
- URL: The requested URL of the transaction.
- Status: Indication of the current state of the transaction.
When a transaction is selected, more information regarding the request and response is displayed.
Request Sub-tab
- Display Raw Request: Displays the whole request as it was sent to the server.
- Generate scrapeable file in: Creates scrapeable files in the specified scraping session for each of the selected transactions. The names of the scrapeable files are the text specified in the note section of each transaction.
- Request Line: The first line of the request.
- Headers: Any additional headers specified in the request.
- POST Data: All POST data that was sent along with the request.
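To make the relationship between these fields concrete, the sketch below splits a raw HTTP request into its request line, headers, and POST data. It illustrates the structure of an HTTP request only and is not screen-scraper code; the function name and the sample request are hypothetical.

```python
def split_raw_request(raw):
    """Split a raw HTTP request into (request line, headers, POST data)."""
    # A blank line separates the head of the request from its body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Repeated header names would overwrite each other in this simple dict.
        headers[name.strip()] = value.strip()
    return request_line, headers, body


# Hypothetical sample request, as it might appear under Display Raw Request.
raw_request = (
    "POST /login HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 21\r\n"
    "\r\n"
    "user=demo&pass=secret"
)

request_line, headers, post_data = split_raw_request(raw_request)
print(request_line)
print(headers["Host"])
print(post_data)
```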
Response Sub-tab
- Display Raw Response: Displays the whole response as it came from the server.
- Display Response in Browser: Opens your system's default browser and displays the contents of the response as they would appear when passed through a browser.
- Status Line: HTTP status of the transaction.
- Headers: Headers sent along with the response from the server.
- Content: The content of the response with the status line and headers removed.
Detect JS Cookie
Overview
screen-scraper automatically keeps track of cookies set by the server. It does not, however, catch cookies that are set by JavaScript. This avoids the time it would take to scrape every JavaScript file when, most of the time, those files contain nothing that matters.
This means that you have to set any cookies added by JavaScript yourself, using the setCookie method. To help you find where JavaScript cookies are being set, we have added a Detect JS Cookies button to the proxy session progress tab.
For this button to work correctly, clear your browser cookies before the proxy session starts recording transactions. Otherwise, cookies that already exist in the browser would be mistaken for JavaScript cookies.
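The idea behind this kind of detection can be sketched as follows: walk the recorded transactions in order, remember every cookie name the server set with a Set-Cookie header, and flag any cookie the browser sends that the server never set. This is an illustration of the concept under the assumption that the browser started with no cookies; it is not necessarily how screen-scraper implements the button, and the function name and data layout are hypothetical.

```python
def find_js_cookies(transactions):
    """Return names of cookies the browser sent that no response had set.

    Each transaction is a dict with 'request_headers' and 'response_headers',
    each a list of (name, value) tuples, given in the order recorded.
    """
    server_set = set()   # cookie names seen in Set-Cookie headers so far
    suspects = []        # cookie names that were probably set by JavaScript
    for tx in transactions:
        for name, value in tx["request_headers"]:
            if name.lower() != "cookie":
                continue
            for pair in value.split(";"):
                cookie_name = pair.split("=", 1)[0].strip()
                if (cookie_name
                        and cookie_name not in server_set
                        and cookie_name not in suspects):
                    suspects.append(cookie_name)
        for name, value in tx["response_headers"]:
            if name.lower() == "set-cookie":
                server_set.add(value.split("=", 1)[0].strip())
    return suspects


# Hypothetical recording: the server sets JSESSIONID, the page's script adds "tracker".
recorded = [
    {"request_headers": [("Host", "www.example.com")],
     "response_headers": [("Set-Cookie", "JSESSIONID=abc; Path=/")]},
    {"request_headers": [("Cookie", "JSESSIONID=abc; tracker=42")],
     "response_headers": []},
]

js_cookies = find_js_cookies(recorded)
print(js_cookies)
```

A cookie flagged this way is one you would then set yourself with the setCookie method.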
Proxy Session: Scripts Tab
This feature has been deprecated and by default is not available in the workbench interface. To enable proxy scripting, add AllowProxyScripting=true to your resource/conf/screen-scraper.properties file and restart screen-scraper.
You are unlikely to use this tab unless you are running screen-scraper as a proxy in server mode.
Scripts Tab (enterprise edition only)
- Add Script: Adds a script association to filter requests and/or responses on the Proxy Server.
- Script Name: Specifies which script should be run.
- Sequence: The order in which the scripts should be run.
- When to Run: When the proxy server should request to run the script.
- Enabled: A flag to determine which scripts should be run and which shouldn't be.
Proxy Session: Log Tab
Log Tab
- Clear Log: Erase the current contents of the proxy session log.
If you are troubleshooting scripts that are not working the way you expected, the log can give you clues as to where problems might exist. Likewise, you can have your scripts write to the log to help identify what they are doing. If you have chosen to filter out binary files and/or less useful transactions, a log of those transactions will be available here.
The proxy session log is not saved in the workbench. If you close screen-scraper, you will lose the current contents of the proxy session log.
Proxy Session: Advanced Tab
Advanced Tab
- Key store file path: The path to a JKS file that contains the certificates required for this scrape
- Key store password: The password used when generating the JKS file
Some web sites require you to supply a client certificate, which you would have been given previously, in order to access them. This feature allows you to access this type of site while using screen-scraper.
For more info see our blog entry on the topic.