Not sure if sub-extractors are the way to do this

Hi, I need to refresh my screen-scraper knowledge, because I last used it a long time ago. Could you tell me the best way to scrape this kind of HTML:

<tr>
<td>Title 1</td>
<td>Amount 1</td>
<td class='catcell'><small>Category 1<br />
Category 2<br />
Category 3<br />
</small></td>
</tr>

<tr>
<td></td>
</tr>

<tr>
<td>Title 2</td>
<td>Amount 2</td>
<td class='catcell'><small>Category 1<br />
Category 2<br />
</small></td>
</tr>

<tr>
<td></td>
</tr>

<tr>
<td>Title 3</td>
<td>Amount 3</td>
<td class='catcell'></td>
</tr>

In order to have:

Title 1
Amount 1
Category 1, Category 2, Category 3

Title 2
Amount 2
Category 1, Category 2

Title 3
Amount 3

As you can see, the difficulty is that a record sometimes has several categories and sometimes has none.

Thank you,
Boga

What you need is scrapeableFile.extractData. I think the example on the page should get you started.

It might, however, be easier to parse the categories in a script. It depends on how you want the data stored in the end, though.
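To illustrate what parsing the categories in a script could look like, here is a minimal plain-Java sketch. It assumes the whole content of the category cell has already been extracted as one string; the class name CategoryCell and the method parse are made up for this example and are not part of screen-scraper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a screen-scraper class.
class CategoryCell {
    // Splits the raw HTML of the category cell on <br /> and returns
    // the cleaned, non-empty category names in their original order.
    static List<String> parse(String cellHtml) {
        List<String> names = new ArrayList<>();
        if (cellHtml == null) {
            return names;
        }
        for (String piece : cellHtml.split("<br\\s*/?>")) {
            // Drop any remaining tags (such as <small>) and collapse whitespace.
            String name = piece.replaceAll("<[^<>]*>", " ")
                               .replaceAll("\\s+", " ")
                               .trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String cell = "<small>Category 1<br />\nCategory 2<br />\nCategory 3<br />\n</small>";
        // Joined form, as in the comma-separated layout from the question.
        System.out.println(String.join(", ", parse(cell)));
    }
}
```

An empty cell gives an empty list, so the same helper covers records without categories. Whether you then join the list or loop over it depends on how the data should be stored.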

Hey, thanks. I'm not sure what you mean by it maybe being easier to parse the categories in a script.

The end result will be written to a database, and I think it will be best to write one database row per category. For my example, I would want to end up with this in the database table:

Title      Amount     Category
-----      ------     ----------
Title 1    Amount 1   Category 1
Title 1    Amount 1   Category 2
Title 1    Amount 1   Category 3
Title 2    Amount 2   Category 1
Title 2    Amount 2   Category 2
Title 3    Amount 3   NULL

So I might make an extractor that looks like:

<tr>
<td>~@TITLE@~</td>
<td>~@AMOUNT@~</td>
<td class='catcell'>~@CATEGORIES@~</td>
</tr>

Then on each pattern match, run a script that is something like this:
String fixString(String value)
{
        if (value != null)
        {
                value = sutil.convertHTMLEntities(value);
                value = value.replaceAll("<[^<>]*>", " ");
                value = value.replaceAll("\\s{2,}", " ");
                value = value.trim();
        }
        return value==null ? "" : value;
}

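dm = session.getv("_DM"); // Connection to the dataManager

title = fixString(dataRecord.get("TITLE"));
amount = fixString(dataRecord.get("AMOUNT"));
cats = dataRecord.get("CATEGORIES");
written = 0; // Number of rows written for this record

if (!sutil.isNullOrEmptyString(cats))
{
        // The cell separates categories with <br />; write one row per category
        cata = cats.split("<br />");
        for (i=0; i<cata.length; i++)
        {
                cat = fixString(cata[i]);
                // The piece after the last <br /> is only "</small>", which cleans to ""
                if (cat.length() == 0)
                        continue;
                dm.addData("table", title);
                dm.addData("table", amount);
                dm.addData("table", cat);
                dm.commit("table");
                dm.flush();
                written++;
        }
}

if (written == 0)
{
        // No categories for this record: write the title and amount only
        dm.addData("table", title);
        dm.addData("table", amount);
        dm.commit("table");
        dm.flush();
}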
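If you want to check the one-row-per-category logic outside of screen-scraper first, the following standalone Java sketch may help. It only mirrors the splitting and cleaning steps; the class RowExpander and its methods are hypothetical names for this example, and a null in the third position stands in for the NULL category column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for trying the logic out; not part of screen-scraper.
class RowExpander {
    // Strips tags and collapses whitespace, like fixString but without sutil.
    static String clean(String value) {
        if (value == null) {
            return "";
        }
        return value.replaceAll("<[^<>]*>", " ").replaceAll("\\s{2,}", " ").trim();
    }

    // Returns one {title, amount, category} row per category,
    // or a single row with a null category when there are none.
    static List<String[]> toRows(String title, String amount, String categoriesHtml) {
        List<String[]> rows = new ArrayList<>();
        if (categoriesHtml != null) {
            for (String piece : categoriesHtml.split("<br />")) {
                String cat = clean(piece);
                if (!cat.isEmpty()) {
                    rows.add(new String[] {title, amount, cat});
                }
            }
        }
        if (rows.isEmpty()) {
            rows.add(new String[] {title, amount, null});
        }
        return rows;
    }

    public static void main(String[] args) {
        String cell = "<small>Category 1<br />\nCategory 2<br />\n</small>";
        for (String[] row : toRows("Title 2", "Amount 2", cell)) {
            System.out.println(row[0] + " | " + row[1] + " | " + row[2]);
        }
    }
}
```

Skipping pieces that clean to an empty string matters because the text after the last <br /> is just the closing </small> tag, which would otherwise produce a row with a blank category.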

Wow.

That is a hell of a tutorial you gave me here. I am really thankful... I will work through it slowly so that I properly understand each bit and can adapt it to my needs :-)

Thanks!
Boga