Not sure if sub-extractors are the way to go for this
Hi, I need to refresh my ScreenScraper knowledge because I last used it a long time ago. Could you tell me the best way to scrape this kind of HTML:
<tr>
<td>Title 1</td>
<td>Amount 1</td>
<td class='catcell'><small>Category 1<br />
Category 2<br />
Category 3<br />
</small></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>Title 2</td>
<td>Amount 2</td>
<td class='catcell'><small>Category 1<br />
Category 2<br />
</small></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>Title 3</td>
<td>Amount 3</td>
<td class='catcell'></td>
</tr>
In order to have:
Title 1
Amount 1
Category 1, Category 2, Category 3
Title 2
Amount 2
Category 1, Category 2
Title 3
Amount 3
As you can see, the difficulty comes from the fact that each record sometimes has several categories and sometimes has none.
Thank you,
Boga
What you need is scrapeableFile.extractData. I think the example on the page should get you started.
It might, however, be easier to parse the categories in a script. It depends on how you want the data stored in the end, though.
Hey, thanks. I'm not sure what you mean about it maybe being easier to parse the categories in a script.
The end result will be written to a database and I think that it will be best to write one database row per category, so in my example I would want to end up with this in the database table:
-------  --------  ----------
Title 1  Amount 1  Category 1
Title 1  Amount 1  Category 2
Title 1  Amount 1  Category 3
Title 2  Amount 2  Category 1
Title 2  Amount 2  Category 2
Title 3  Amount 3  NULL
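The "one row per category" layout above can be tried outside screen-scraper with a small plain-Java sketch. Everything here is an assumption made for illustration: the class `CategoryRows` and its methods `clean` and `expand` are hypothetical names, not part of screen-scraper, and the rows are returned as string arrays instead of being written to a database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a screen-scraper API): expands one scraped record
// into one row per category, or a single row with a null category.
class CategoryRows {

    // Replace tags with spaces, collapse whitespace, and trim the result.
    static String clean(String fragment) {
        return fragment.replaceAll("<[^<>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    static List<String[]> expand(String title, String amount, String categoriesHtml) {
        List<String[]> rows = new ArrayList<>();
        if (categoriesHtml != null) {
            // Split on <br>, <br/> or <br /> so each piece holds at most one category.
            for (String piece : categoriesHtml.split("(?i)<br\\s*/?>")) {
                String category = clean(piece);
                // The piece after the last <br /> is just "</small>", which cleans to "".
                if (!category.isEmpty()) {
                    rows.add(new String[] {title, amount, category});
                }
            }
        }
        // No usable category: keep the record, with null in the category column.
        if (rows.isEmpty()) {
            rows.add(new String[] {title, amount, null});
        }
        return rows;
    }

    public static void main(String[] args) {
        String cell = "<small>Category 1<br />\nCategory 2<br />\nCategory 3<br />\n</small>";
        for (String[] row : expand("Title 1", "Amount 1", cell)) {
            System.out.println(String.join(" | ", row[0], row[1], String.valueOf(row[2])));
        }
    }
}
```

Passing an empty string for the category cell (as in the Title 3 record) yields a single row whose third element is null, which corresponds to the NULL row in the table.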
So I might make an extractor that looks like:
<td>~@TITLE@~</td>
<td>~@AMOUNT@~</td>
<td class='catcell'>~@CATEGORIES@~</td>
</tr>
Then on each pattern match, run a script that is something like this:
// Helper: strips HTML tags and extra whitespace from a scraped value
String fixString(String value)
{
    if (value != null)
    {
        value = sutil.convertHTMLEntities(value);
        value = value.replaceAll("<[^<>]*>", " ");
        value = value.replaceAll("\\s{2,}", " ");
        value = value.trim();
    }
    return value==null ? "" : value;
}
dm = session.getv("_DM"); // Connection to the dataManager
title = fixString(dataRecord.get("TITLE"));
amount = fixString(dataRecord.get("AMOUNT"));
cats = dataRecord.get("CATEGORIES");
rows = 0; // Number of rows written for this record
if (!sutil.isNullOrEmptyString(cats))
{
    cata = cats.split("<br />");
    for (i=0; i<cata.length; i++)
    {
        cat = fixString(cata[i]);
        // The piece after the last <br /> holds only the closing </small> tag, so skip empty pieces
        if (cat.length() == 0)
            continue;
        dm.addData("table", title);
        dm.addData("table", amount);
        dm.addData("table", cat);
        dm.commit("table");
        dm.flush();
        rows++;
    }
}
if (rows == 0)
{
    // No categories found: write the record once, without a category
    dm.addData("table", title);
    dm.addData("table", amount);
    dm.commit("table");
    dm.flush();
}
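To check which rows of the sample HTML the extractor pattern would pick up, the pattern can be roughly approximated with `java.util.regex` in plain Java. This is only a sketch under assumptions: the class `ExtractorSketch` and its `extract` method are hypothetical, and the regular expressions standing in for the `~@TITLE@~`, `~@AMOUNT@~` and `~@CATEGORIES@~` tokens are guesses, not what screen-scraper uses internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the extractor pattern; not a screen-scraper API.
class ExtractorSketch {

    // Each token becomes a named group. TITLE and AMOUNT are assumed to hold
    // no markup; CATEGORIES may contain tags and line breaks (hence DOTALL).
    static final Pattern ROW = Pattern.compile(
        "<td>(?<title>[^<>]*)</td>\\s*"
            + "<td>(?<amount>[^<>]*)</td>\\s*"
            + "<td class='catcell'>(?<categories>.*?)</td>\\s*"
            + "</tr>",
        Pattern.DOTALL);

    // Returns one {title, amount, categories} array per matching table row.
    static List<String[]> extract(String html) {
        List<String[]> records = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            records.add(new String[] {
                m.group("title"), m.group("amount"), m.group("categories")});
        }
        return records;
    }

    public static void main(String[] args) {
        String html = "<tr>\n<td>Title 1</td>\n<td>Amount 1</td>\n"
            + "<td class='catcell'><small>Category 1<br />\nCategory 2<br />\n</small></td>\n</tr>\n"
            + "<tr>\n<td></td>\n</tr>\n"
            + "<tr>\n<td>Title 3</td>\n<td>Amount 3</td>\n<td class='catcell'></td>\n</tr>\n";
        for (String[] r : extract(html)) {
            System.out.println(r[0] + " / " + r[1] + " / [" + r[2] + "]");
        }
    }
}
```

The spacer rows containing a single empty cell do not match, because the pattern requires three cells before the closing row tag. A record with an empty category cell still matches, with an empty string captured for the categories.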
Wow.
That is a hell of a tutorial you gave me here. I am really thankful... I will work through it soon to properly understand each bit and modify it to my liking :-)
Thanks!
Boga