Developing first scrape for my work

Hi there-

First off, I want to commend you on what a great product this is! I have written screen-scraping programs in the past via MS Access and it was a PAIN! I am an engineer for FedEx and I am trying to use screen-scraper to pull some employee scanning information from our internal website each day.

Here are the problems I am running into:

1. I am lost on how to write a good extractor to get the data I want (I went through the tutorials and am still lost...HA HA)
2. How to write a script to output the results so I can get them into MS Access
3. How to loop through different variables (for example, different employee numbers from a list)

Here is the HTML code from the website. I only want to extract the information below where it says "CONs Scanning Breakdown"; I need all the detail information enclosed within the "

2. You have a few options when saving scraped data to a database. Here is our FAQ on the topic.

http://community.screen-scraper.com/FAQ/Database

3. Assuming you mean to loop through different employee numbers as input into your scraping session, you could modify the script, "Initialize--Input from CSV".
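To make points 2 and 3 concrete, here is a rough sketch of the overall flow: read employee numbers from a CSV file, scrape once per employee, and write the rows out as CSV, which MS Access can import as a text file. It is written in plain Python rather than as a screen-scraper script, so treat it only as an outline of the logic. The employee numbers, the `scrape_employee` stand-in, and the added `EMPLOYEE` column are all made up for illustration.

```python
import csv
import io

# Output columns: the employee number plus the five table columns.
FIELDS = ["EMPLOYEE", "ORG", "DEST", "CONS", "CONTAINER", "TOTAL"]


def read_employee_numbers(handle):
    """Return the non-empty values from the first column of a CSV stream."""
    numbers = []
    for row in csv.reader(handle):
        if row and row[0].strip():
            numbers.append(row[0].strip())
    return numbers


def scrape_employee(employee_number):
    """Hypothetical stand-in for the real scrape; returns one canned row."""
    return [
        {"ORG": "80305", "DEST": "GDLR", "CONS": "304379757298",
         "CONTAINER": "AKE32633FX", "TOTAL": "62"},
    ]


def run(employee_handle, output_handle):
    """Scrape every employee in the input and write the rows as CSV."""
    writer = csv.DictWriter(output_handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for number in read_employee_numbers(employee_handle):
        for row in scrape_employee(number):
            writer.writerow({"EMPLOYEE": number, **row})


# In-memory stand-ins for the input and output files (made-up employee numbers).
employees = io.StringIO("1001\n1002\n\n1003\n")
output = io.StringIO()
run(employees, output)
csv_text = output.getvalue()
print(csv_text)
```

With real files you would pass `open("employees.csv", newline="")` and `open("results.csv", "w", newline="")` instead of the in-memory buffers; both file names are placeholders.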

Feel free to post any other questions.

-Scott

I tried using that extractor

I tried using that extractor pattern, but it does not seem to be working. Any other ideas? I think the problem is that sometimes one of those fields can be blank for an employee, and then it gives me the wrong information. I got the loop and the output to work, so I think the extractor is my remaining problem. What information do I need to share to show you?

Don't forget to use regex

jsncochran,

It is rare that you would NOT want to use some regex for an extractor pattern token. In this case I recommend using the pre-loaded "Non-HTML tags" regex (meaning "match anything that is not an HTML tag") for all five tokens. That way, if a field has no value, the token will not pick up data from other fields.
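To see why this matters, here is a small sketch in plain Python (not screen-scraper itself). It assumes the "Non-HTML tags" regex behaves roughly like `[^<>]*`, and it models an unrestricted token as `.+?`; both are assumptions for illustration. The first row has a blank DEST cell, which I made up by emptying a cell from the sample page.

```python
import re

# Two rows in the shape of the sample table; the first has a blank DEST cell.
ROWS = (
    "<tr>\n<td>80305</td>\n<td></td>\n<td>304379757298</td>\n"
    "<td>AKE32633FX</td>\n<td>62</td>\n</tr>\n"
    "<tr>\n<td>80305</td>\n<td>GDLR</td>\n<td>304379757302</td>\n"
    "<td>AKE35821FX</td>\n<td>81</td>\n</tr>"
)


def build_pattern(token_regex):
    """Build a five-cell row pattern where every token uses token_regex."""
    cell = r"<td>(" + token_regex + r")</td>"
    return re.compile(r"\s*".join([cell] * 5), re.DOTALL)


loose = build_pattern(r".+?")      # token may run across tags
strict = build_pattern(r"[^<>]*")  # token may not contain < or >

loose_match = loose.search(ROWS)
strict_match = strict.search(ROWS)

print("loose DEST: ", repr(loose_match.group(2)))
print("strict DEST:", repr(strict_match.group(2)))
```

The loose version fills the DEST token with markup and the CONS number from the next cell, while the strict version returns an empty DEST and keeps the other four values in their own fields.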

-Scott

Hi Scott- I am kinda

Hi Scott-

I am kinda confused. How would I do that? I am completely lost on that one.

Edit Token

jsncochran,

In the workbench, select a scrapeable file that you know has extractor patterns and click the Extractor Patterns tab. For any given extractor pattern token (identified by the ~@ ... @~ delimiters), either double-click the token, or highlight the text between the "@" symbols, right-click, and choose Edit Token. In the dialog that pops up you'll see an area labeled "Regular Expressions". Click on the drop-down menu, scroll to "Non-HTML Tags", and select it. Close the dialog window.

Repeat the above for all five of your tokens in the extractor pattern we've been discussing.

-Scott

" below where it says "CONs Scanning Breakdown"

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Server: Apache/2.0.59 (Unix) mod_python/3.1.3 Python/2.3.4
Content-Type: text/html
Date: Mon, 14 Dec 2009 17:34:33 GMT

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Employee Activity Report: Results</title>
<link rel="stylesheet" type="text/css" title="css" href="/css/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-ind/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-lax/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-ewr/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-anc/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-ord/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-afw/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-gso/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-oak/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-alpha/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-can/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-cdg/hoss.css" />
<link rel="stylesheet" type="text/css" title="css" href="/htr-cgn/hoss.css" />
</head>
<body>
<div class="header_box">
<h1 class="header_text">Employee Activity Report: Results</h1>
</div>

<div class="menu_box"><label class="menu_category">Maintenance</label>

<ul class="menu_section">
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/org/o_master.psp">Org Codes</a></li>

<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/splits/splitmopmenu.psp">Split Groups</a></li>

<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/splits/splitgrouppick.psp">Splits</a></li>
</ul>

<label class="menu_category">Reports</label>

<ul class="menu_section">
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/report/ear_master.psp">Employee Activity</a></li>
</ul>

<label class="menu_category">Administrative</label>

<ul class="menu_section">
<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/index.psp">Home</a></li>

<li><a class="menu_link" href="http://psr-anc-cluster.lhsprod.fedex.com/log_out.psp">Log Out</a></li>
</ul>
</div>

<div class="body_box"><!--=====================================================================-->
<h1>12/11 @ 00:00 -- 12/11 @ 23:59</h1>

<hr />
<h3>Activity by Scan Type</h3>

<table>
<tr>
<th>SCAN</th>
<th>ORG</th>
<th>TOTAL</th>
</tr>

<tr>
<td>CONS</td>
<td>80305</td>
<td>1323</td>
</tr>
</table>

<h3>CONs Scanning Breakdown</h3>

<table>
<tr>
<th>ORG</th>
<th>DEST</th>
<th>CONS#</th>
<th>CONTAINER</th>
<th>TOTAL</th>
</tr>

<tr>
<td>80305</td>
<td>GDLR</td>
<td>304379757298</td>
<td>AKE32633FX</td>
<td>62</td>
</tr>

<tr>
<td>80305</td>
<td>GDLR</td>
<td>304379757302</td>
<td>AKE35821FX</td>
<td>81</td>
</tr>

<tr>
<td>80305</td>
<td>MEMH</td>
<td>304379757346</td>
<td>AMJ40262FX</td>
<td>203</td>
</tr>

<tr>
<td>80305</td>
<td>GDLR</td>
<td>304379757357</td>
<td>AMJ2967FX</td>
<td>122</td>
</tr>

<tr>
<td>80305</td>
<td>MEMH</td>
<td>304379757611</td>
<td>AKE30335FX</td>
<td>115</td>
</tr>

<tr>
<td>80305</td>
<td>MEMH</td>
<td>304379757736</td>
<td>AMJ0924FX</td>
<td>120</td>
</tr>

<tr>
<td>80305</td>
<td>MEMH</td>
<td>304379757791</td>
<td>AMJ3058FX</td>
<td>127</td>
</tr>

<tr>
<td>80305</td>
<td>MEMH</td>
<td>304379757806</td>
<td>SAA20519FX</td>
<td>124</td>
</tr>

<tr>
<td>80305</td>
<td>MEMH</td>
<td>304379757894</td>
<td>AKE40389FX</td>
<td>87</td>
</tr>

<tr>
<td>80305</td>
<td>MEMH</td>
<td>304379757997</td>
<td>AMJ47356FX</td>
<td>282</td>
</tr>
</table>

<!--=====================================================================-->
</div>
</body>
</html>

jsncochran, Sorry for the

jsncochran,

Sorry for the late reply. Addressing your issues:

1. I took your sample HTML and created a simple scraping session with it. I was able to create a single extractor pattern, shown below, that extracts only the data under the "CONs Scanning Breakdown" section.

~@ORD@~ ~@DEST@~ ~@CONS@~ ~@CONTAINER@~ ~@TOTAL@~
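For anyone who wants to try the same idea outside screen-scraper, here is a rough Python equivalent. It assumes each token sits inside one `<td>` cell of a row (the pattern as shown above lists only the tokens), and it skips everything before the "CONs Scanning Breakdown" heading. The page text is a shortened excerpt of the sample, and `extract_breakdown` is just an illustrative name.

```python
import re

# Shortened excerpt of the sample page: one unrelated table, then the target table.
PAGE = """
<h3>Activity by Scan Type</h3>
<table>
<tr>
<td>CONS</td>
<td>80305</td>
<td>1323</td>
</tr>
</table>
<h3>CONs Scanning Breakdown</h3>
<table>
<tr>
<td>80305</td>
<td>GDLR</td>
<td>304379757298</td>
<td>AKE32633FX</td>
<td>62</td>
</tr>
<tr>
<td>80305</td>
<td>GDLR</td>
<td>304379757302</td>
<td>AKE35821FX</td>
<td>81</td>
</tr>
</table>
"""

# Token names copied from the extractor pattern above.
TOKENS = ["ORD", "DEST", "CONS", "CONTAINER", "TOTAL"]

# One <td> cell per token; each token may not contain < or >.
ROW = re.compile(
    r"\s*".join(r"<td>(?P<%s>[^<>]*)</td>" % name for name in TOKENS)
)


def extract_breakdown(page):
    """Return one dict per row found below the breakdown heading."""
    _, found, section = page.partition("<h3>CONs Scanning Breakdown</h3>")
    if not found:
        return []
    return [match.groupdict() for match in ROW.finditer(section)]


records = extract_breakdown(PAGE)
for record in records:
    print(record)
```

Splitting the page at the heading first keeps the three-column "Activity by Scan Type" table out of the results even if its layout later changes to five columns.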