Extractor pattern for varying HTML code

Hi,

I'm struggling to get an extractor pattern that can extract data from a table that can vary between pages.

I have 3 fields, in columns 2, 3 & 5 of the table (lines 06, 09 & 15 in the code), that I wish to extract. I have represented these as ~@DATA1@~, ~@DATA2@~ & ~@DATA3@~ in the examples below: -

01<tr>
02<td valign="top" width="36">
03<p dir="RTL" align="center">1</p>
04</td>
05<td width="110">
06<p dir="RTL" align="center">~@DATA1@~</p>
07</td>
08<td colspan="2" width="66">
09<p dir="RTL" align="center">~@DATA2@~</p>
10</td>
11<td width="66">
12<p dir="RTL" align="center">-</p>
13</td>
14<td width="76">
15<p dir="RTL" align="center">~@DATA3@~</p>
16</td>
17<td width="95">
18<p dir="RTL" align="center">0</p>
19</td>
20<td width="94">
21<p dir="RTL" align="center">0</p>
22</td>
23<td width="133">
24<p dir="RTL" align="center">5753555</p>
25</td>
26</tr>

A second example: -

01<tr>
02<td valign="top" width="36">
03<p dir="RTL" align="center">1</p>
04</td>
05<td colspan="2" width="110">
06<p dir="RTL" align="center">~@DATA1@~</p>
07</td>
08<td valign="top" width="66">
09<p dir="RTL" align="center">~@DATA2@~</p>
10</td>
11<td valign="top" width="66">
12<p dir="RTL" align="center">-</p>
13</td>
14<td width="57">
15<p dir="RTL" align="center">~@DATA3@~</p>
16</td>
17<td width="55">
18
19</td>
20<td colspan="2" width="58">
21
22</td>
23<td width="61">
24
25</td>
26<td colspan="2" width="62">
27
28</td>
29<td width="104">
30<p dir="RTL" align="center">5753555</p>
31</td>
32</tr>

Finally a third example: -

01<tr>
02<td valign="top" width="36">
03<p dir="RTL" align="center">1</p>
04</td>
05<td colspan="2" width="110">
06<p dir="RTL" align="center">~@DATA1@~</p>
07</td>
08<td width="66">
09<p dir="RTL" align="center">~@DATA2@~</p>
10</td>
11<td width="66">
12
13</td>
14<td width="57">
15<p dir="RTL" align="center">~@DATA3@~</p>
16</td>
17<td width="55">
18
19</td>
20<td colspan="2" width="58">
21
22</td>
23<td width="61">
24
25</td>
26<td colspan="2" width="62">
27
28</td>
29<td width="104">
30<p dir="RTL" align="center">5753555</p>
31</td>
32</tr>

As you can see, the number of columns can vary between pages. The attributes within the tags can also vary: colspan & valign are present on some corresponding lines in the code and not others, and the width values differ on corresponding lines. Some cells are also empty on one page and filled on another.

Would it be possible that one extractor pattern could pull the data from all three examples above?

Any help / guidance on this would be appreciated.

I could see a way of doing this but it would be via a script...

Sorry to come in late on this thread. I agree with Jason, but I do see a way of doing this programmatically (i.e., in a script rather than in an extractor pattern) via the following:
1. Capture the entire table row in a single extractor variable (i.e., everything between <tr> and </tr>)
2. Go through each cell in the row and add its value (i.e., the part between <td> and </td>) as a new string element to an ArrayList; this will give you an ArrayList with as many elements as there are cells in the row.
3. Now go through each extracted value and strip out the HTML (if there) via String's replaceAll method; this should leave you with an ArrayList containing either data (what you're looking for) or empty strings (which came from the empty cells).
4. Remove the empty strings (i.e., the empty cells) from the ArrayList via its removeAll method; this should leave only the cells that contain data, among them the 3 items you're looking for.
5. Write each data item to your variables.
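
The steps above could be sketched in plain Java roughly as follows. This is only an illustration under some assumptions: the class and method names (RowCellExtractor, cellTexts) are made up for the example, the row HTML is assumed to already be in a String, and how you get it there from screen-scraper (and write the results back to session variables) is left out. It also deviates from step 4 in one respect: instead of dropping the empty strings it keeps them, because in all three posted examples the data sits in cells 2, 3 and 5, and removing an empty cell 4 would shift the position of ~@DATA3@~.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of screen-scraper.
class RowCellExtractor {
    // Matches one <td ...>...</td> cell whatever attributes it carries
    // (colspan, valign, width), so the varying attributes do not matter.
    private static final Pattern CELL = Pattern.compile(
        "<td\\b[^>]*>(.*?)</td>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the text of every cell in the row; empty cells become "". */
    static List<String> cellTexts(String rowHtml) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(rowHtml);
        while (m.find()) {
            // Strip any remaining tags (the <p ...> wrapper) and whitespace.
            cells.add(m.group(1).replaceAll("<[^>]+>", "").trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Shortened version of the third example: cell 4 is empty.
        String row = "<tr>\n"
            + "<td valign=\"top\" width=\"36\">\n<p dir=\"RTL\" align=\"center\">1</p>\n</td>\n"
            + "<td colspan=\"2\" width=\"110\">\n<p dir=\"RTL\" align=\"center\">first</p>\n</td>\n"
            + "<td width=\"66\">\n<p dir=\"RTL\" align=\"center\">second</p>\n</td>\n"
            + "<td width=\"66\">\n\n</td>\n"
            + "<td width=\"57\">\n<p dir=\"RTL\" align=\"center\">third</p>\n</td>\n"
            + "</tr>";

        List<String> cells = cellTexts(row);
        // Pick by position: columns 2, 3 and 5 are list indexes 1, 2 and 4.
        if (cells.size() >= 5) {
            System.out.println("DATA1=" + cells.get(1));
            System.out.println("DATA2=" + cells.get(2));
            System.out.println("DATA3=" + cells.get(4));
        }
    }
}
```

If the pages really can have the data in different columns, dropping the empty strings as in step 4 (for example with cells.removeAll(java.util.Collections.singleton(""))) is the alternative, but then the "1", "-" and trailing number cells remain in the list as well and have to be told apart some other way.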

HTH!
Justin

Fair enough, I'll try to see if I can achieve what I want using one extractor pattern, see how many URLs I'm left with, and then try another one for what's left and so on.

Worst thing is that the number of rows is also random, so I will have to cater for a large number of them too.

Ta.
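
On the varying number of rows: if this is done in a script, a loop copes with any row count. The following is a rough sketch in plain Java, assuming the whole table HTML is available as a String, that the table contains no nested tables, and that the wanted fields stay in cells 2, 3 and 5 as in the three examples. The names TableRowWalker and extractRows are invented for the illustration and are not part of screen-scraper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of screen-scraper.
class TableRowWalker {
    // One <tr>...</tr> block; assumes rows are not nested inside each other.
    private static final Pattern ROW = Pattern.compile(
        "<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // One <td ...>...</td> cell, ignoring whatever attributes it carries.
    private static final Pattern CELL = Pattern.compile(
        "<td\\b[^>]*>(.*?)</td>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns {DATA1, DATA2, DATA3} for every row with at least five cells. */
    static List<String[]> extractRows(String tableHtml) {
        List<String[]> result = new ArrayList<>();
        Matcher rows = ROW.matcher(tableHtml);
        while (rows.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(rows.group(1));
            while (cell.find()) {
                // Strip inner tags such as <p ...> and trim whitespace.
                cells.add(cell.group(1).replaceAll("<[^>]+>", "").trim());
            }
            // Rows that are too short (e.g. headers or spacers) are skipped.
            if (cells.size() >= 5) {
                result.add(new String[] {
                    cells.get(1), cells.get(2), cells.get(4) });
            }
        }
        return result;
    }
}
```

Calling TableRowWalker.extractRows(tableHtml) returns one three-element array per data row, however many rows the page happens to have.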

No hope of a single pattern to get them all. You might be able to do it with 2 patterns that use scrapeableFile.extractData, if there is a way to isolate just the tags you want.