Developing first scrape for my work

Hi there-

First off, I want to commend you on what a great product this is! I have written screen-scraping programs in the past via MS Access and it was a PAIN! I am an engineer for FedEx and I am trying to use screen-scraper to pull some employee scanning information from our internal website each day.

Here are the problems I am running into:

1. Lost on how to write a good extractor to get the data I want (went through the tutorials and still lost...HA HA)
2. How to write a script to output the results so I can get them into MS Access
3. How to loop through different variables (loop through different employee numbers from a list)

Here is the HTML code from the website (I only want to extract the information below where it says "CONs Scanning Breakdown"; I need all the detail information enclosed within the "

2. You have a few options when saving scraped data to a database. Here is our FAQ on the topic.

http://community.screen-scraper.com/FAQ/Database

3. Assuming you mean to loop through different employee numbers as input into your scraping session, you could modify the script, "Initialize--Input from CSV".
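To make points 2 and 3 concrete, here is a rough standalone Python sketch of the same idea outside screen-scraper: read employee numbers from a CSV file, run one scrape per number, and write the rows to a CSV file that MS Access can import. It is only an illustration under assumptions. The function names, the employee numbers, and `fake_scrape` (a stand-in for running the scraping session) are all hypothetical and are not part of the "Initialize--Input from CSV" script or the screen-scraper API.

```python
import csv
import io


def read_employee_numbers(csv_file):
    """Return the first column of every non-empty row as an employee number."""
    return [row[0].strip() for row in csv.reader(csv_file) if row and row[0].strip()]


def write_rows(out_file, employee, rows):
    """Append one CSV line per scraped row, prefixed with the employee number."""
    writer = csv.writer(out_file)
    for row in rows:
        writer.writerow([employee, *row])


def fake_scrape(employee):
    # Hypothetical stand-in for running the scraping session for one employee.
    # It always returns the first breakdown row from the sample page.
    return [("80305", "GDLR", "304379757298", "AKE32633FX", "62")]


# In-memory files keep the sketch self-contained; real code would use open().
employees_csv = io.StringIO("111111\n222222\n\n")  # made-up employee numbers
output = io.StringIO()

for emp in read_employee_numbers(employees_csv):
    write_rows(output, emp, fake_scrape(emp))

result = output.getvalue()
print(result)
```

Keeping the employee number as the first column of every output row means the imported Access table can be filtered or joined by employee later.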

Feel free to post any other questions.

-Scott

swilsonmc on 12/21/2009 at 11:31 am

I tried using that extractor

I tried using that extractor pattern, but it does not seem to be working. Any other ideas? I think the problem is that sometimes one of those fields can be blank for an employee, and then the pattern gives me the wrong information. Other than that, I got the loop and the output to work, but I think the extractor could be my problem. What information do I need to share to show you?

jsncochran on 01/04/2010 at 3:37 am

Don't forget to use regex

jsncochran,

It is rare that you would NOT want to use some regex for an extractor pattern token. In this case I recommend using the pre-loaded "Non-HTML tags" regex (meaning "match anything that is not an HTML tag") for all five tokens. This way, if a field has no value, the token will not return data from other fields.
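The effect can be sketched outside screen-scraper with plain Python regular expressions. This is only an illustration under two assumptions: that a token with no regex behaves roughly like a lazy "match anything" (`.*?`), and that "Non-HTML tags" behaves roughly like `[^<>]*`. The HTML is a trimmed copy of the sample page, and the second breakdown row has its DEST cell blanked as a hypothetical case.

```python
import re

# Trimmed sample: the 3-column summary table, then two breakdown rows.
# The blank DEST cell in the last row is a made-up example.
HTML = """
<table> <tr> <th>SCAN</th> <th>ORG</th> <th>TOTAL</th> </tr>
<tr> <td>CONS</td> <td>80305</td> <td>1323</td> </tr> </table>
<h3>CONs Scanning Breakdown</h3>
<table> <tr> <th>ORG</th> <th>DEST</th> <th>CONS#</th> <th>CONTAINER</th> <th>TOTAL</th> </tr>
<tr> <td>80305</td> <td>GDLR</td> <td>304379757298</td> <td>AKE32633FX</td> <td>62</td> </tr>
<tr> <td>80305</td> <td></td> <td>304379757302</td> <td>AKE35821FX</td> <td>81</td> </tr>
</table>
"""


def row_pattern(token):
    """Build a five-cell row pattern where every cell uses the same token regex."""
    cells = r"\s*".join(f"<td>({token})</td>" for _ in range(5))
    return re.compile(r"<tr>\s*" + cells + r"\s*</tr>", re.DOTALL)


loose = row_pattern(r".*?")      # assumed behaviour of a token with no regex
strict = row_pattern(r"[^<>]*")  # assumed equivalent of "Non-HTML tags"

loose_rows = loose.findall(HTML)
strict_rows = strict.findall(HTML)

print("loose: ", loose_rows[0])
print("strict:", strict_rows)
```

With the loose tokens, the first match starts in the three-column summary table and its third token swallows markup until five cells can be satisfied, so the values land in the wrong fields. With the tag-excluding tokens, no token can cross a tag boundary, so only the real five-cell rows match and the blank cell comes back as an empty string.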

-Scott

swilsonmc on 01/04/2010 at 5:08 pm

Hi Scott- I am kinda

Hi Scott-

I am kinda confused. How would I do that? I am completely lost on that one.

jsncochran on 01/04/2010 at 6:16 pm

Edit Token

jsncochran,

In the workbench, select a scrapeable file that you know has extractor patterns and click the Extractor Patterns tab. For any given extractor pattern token (identified by ~@@~), either double-click the token, or highlight the text between the "@" symbols, right-click, and choose Edit Token. In the dialog that pops up you'll see an area labeled "Regular Expressions". Click on the drop-down menu and scroll to "Non-HTML Tags". Close the dialog window.

Repeat the above for all five of your tokens in the extractor pattern we've been discussing.

-Scott

swilsonmc on 01/05/2010 at 11:40 am

" below where it says "CONs Scanning Breakdown"

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Server: Apache/2.0.59 (Unix) mod_python/3.1.3 Python/2.3.4
Content-Type: text/html
Date: Mon, 14 Dec 2009 17:34:33 GMT

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Employee Activity Report: Results</title>
<link rel="stylesheet" type="text/css" title="css" href="/css/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-ind/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-lax/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-ewr/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-anc/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-ord/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-afw/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-gso/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-oak/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-alpha/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-can/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-cdg/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-cgn/hoss.css" />
</head>
<body>
<div class="header_box"> <h1 class="header_text">Employee Activity Report: Results</h1> </div>
<div class="menu_box"><label class="menu_category">Maintenance</label>
<ul class="menu_section">
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/org/o_master.psp">Org Codes</a></li>
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/splits/splitmopmenu.psp">Split Groups</a></li>
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/splits/splitgrouppick.psp">Splits</a></li>
</ul>
<label class="menu_category">Reports</label>
<ul class="menu_section">
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/report/ear_master.psp">Employee Activity</a></li>
</ul>
<label class="menu_category">Administrative</label>
<ul class="menu_section">
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/index.psp">Home</a></li>
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/log_out.psp">Log Out</a></li>
</ul>
</div>
<div class="body_box"><!--=====================================================================-->
<h1>12/11 @ 00:00 -- 12/11 @ 23:59</h1>
<hr />
<h3>Activity by Scan Type</h3>
<table>
<tr> <th>SCAN</th> <th>ORG</th> <th>TOTAL</th> </tr>
<tr> <td>CONS</td> <td>80305</td> <td>1323</td> </tr>
</table>
<h3>CONs Scanning Breakdown</h3>
<table>
<tr> <th>ORG</th> <th>DEST</th> <th>CONS#</th> <th>CONTAINER</th> <th>TOTAL</th> </tr>
<tr> <td>80305</td> <td>GDLR</td> <td>304379757298</td> <td>AKE32633FX</td> <td>62</td> </tr>
<tr> <td>80305</td> <td>GDLR</td> <td>304379757302</td> <td>AKE35821FX</td> <td>81</td> </tr>
<tr> <td>80305</td> <td>MEMH</td> <td>304379757346</td> <td>AMJ40262FX</td> <td>203</td> </tr>
<tr> <td>80305</td> <td>GDLR</td> <td>304379757357</td> <td>AMJ2967FX</td> <td>122</td> </tr>
<tr> <td>80305</td> <td>MEMH</td> <td>304379757611</td> <td>AKE30335FX</td> <td>115</td> </tr>
<tr> <td>80305</td> <td>MEMH</td> <td>304379757736</td> <td>AMJ0924FX</td> <td>120</td> </tr>
<tr> <td>80305</td> <td>MEMH</td> <td>304379757791</td> <td>AMJ3058FX</td> <td>127</td> </tr>
<tr> <td>80305</td> <td>MEMH</td> <td>304379757806</td> <td>SAA20519FX</td> <td>124</td> </tr>
<tr> <td>80305</td> <td>MEMH</td> <td>304379757894</td> <td>AKE40389FX</td> <td>87</td> </tr>
<tr> <td>80305</td> <td>MEMH</td> <td>304379757997</td> <td>AMJ47356FX</td> <td>282</td> </tr>
</table>
<!--=====================================================================-->
</div>
</body>
</html>

jsncochran on 12/14/2009 at 12:27 pm

jsncochran, Sorry for the

jsncochran,

Sorry for the late reply. Addressing your issues:

1. I took your sample HTML and created a simple scraping session with it. I was able to create one extractor pattern that looks like this to extract only the data below the "CONs Scanning Breakdown" section.
~@ORD@~	~@DEST@~	~@CONS@~	~@CONTAINER@~	~@TOTAL@~
