Sub Extractor patterns for identical HTML list-items

Hi, please can anybody advise how I should make sub-extractor patterns from the following code?

I need a pattern for the content of every list item but how do I distinguish between them?

Many Thanks!

  • Aberdeen Monsoon and Accessorize
  • Dual/Monsoon Men
  • Units 5
  • 33 Bon Accord Centre
  • George Street
  • Aberdeen
  • AB25 1HZ
  • Monsoon Tel: 01224 649146
  • Accessorize Tel: 01224 649146
  • Men Tel: 01224 649146

One needed clarification: Does this list appear on the page more than once, but with different text?

I'm assuming that you want each of the above fields placed into different variables.

First, the simpler solution: Can you just make one big subextractor pattern? This would require that the list has the same number of elements every time you come across it, and that they're always in the same order.

If you can't just make one big subextractor out of it, you could always try patterns like the following:

  • ~@TITLE__BLANK_PATTERN_IN_TOKEN@~

  • ~@SUBTITLE__BLANK_PATTERN_IN_TOKEN@~

  • ~@UNITS__BLANK_PATTERN_IN_TOKEN@~
  • ~@ACCORD_CENTER@~ (pattern on this one is "\d+[\w\s]+")
  • ~@GEORGE_STREET__BLANK_PATTERN_IN_TOKEN@~
  • ~@ABERDEEN__BLANK_PATTERN_IN_TOKEN@~

  • ~@POSTAL_CODE@~ (pattern is "[A-Z0-9]+\s[A-Z0-9]+")

    Each individual code block above represents a subextractor. And yes, the third one is supposed to have 4 tokens in it. Knowing that the data for "street" is always just below the main address line is the only way I see to distinguish the fields.
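Before wiring the two token regexes into the sub-extractors, it may be worth checking them against the sample list items. The following is a minimal standalone sketch in plain Java, outside screen-scraper. The class name `TokenPatternCheck` and the helper `fits` are hypothetical, and using a whole-string match is only an approximation of how a token regex is applied inside a pattern.

```java
import java.util.regex.Pattern;

class TokenPatternCheck {
    // The two token regexes quoted in the post above.
    static final Pattern ADDRESS_LINE = Pattern.compile("\\d+[\\w\\s]+");
    static final Pattern POSTAL_CODE = Pattern.compile("[A-Z0-9]+\\s[A-Z0-9]+");

    // True when the entire list-item text satisfies the regex.
    // Assumption: a whole-string match stands in for the token match.
    static boolean fits(Pattern p, String text) {
        return p.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Address items taken from the sample list in the question.
        String[] items = {
            "Units 5", "33 Bon Accord Centre", "George Street", "Aberdeen", "AB25 1HZ"
        };
        for (String item : items) {
            System.out.println(item
                    + " | address line: " + fits(ADDRESS_LINE, item)
                    + " | postal code: " + fits(POSTAL_CODE, item));
        }
    }
}
```

With the sample data, only "33 Bon Accord Centre" fits the address-line regex and only "AB25 1HZ" fits the postal-code regex, which is what lets those two tokens anchor the fields around them.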

    As for the phone numbers, those are a little trickier. You might have to switch to a script, using Java or something similar, that calls "dataRecord.get("DATARECORD");". At that point, you can run further regular expressions on the value of the DATARECORD variable from the extractor pattern. I'd write something like the following:

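    import java.util.regex.*;

    // Java string literals need double backslashes for regex character classes.
    Pattern p = Pattern.compile("<li>[\\w\\s]+:\\s+(\\d+\\s\\d+)</li>");
    // dataRecord.get returns an Object, so convert it to a String first.
    Matcher m = p.matcher(String.valueOf(dataRecord.get("DATARECORD")));

    // Start counting at 1 so the variables are named PHONE_1, PHONE_2, ...
    for (int i = 1; m.find(); i++)
    {
        session.setVariable("PHONE_" + i, m.group(1));
    }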

    This way, if you call this script after each application of your main pattern, you'll end up with N phone number variables, named:

    • PHONE_1
    • PHONE_2
    • ...

    • PHONE_N

    As a disclaimer, it still might be hard to determine which phone numbers are which, given just the numbers and not the leading text (e.g., "Accessorize Tel:").

    If you'd like to preserve that leading text on each phone number, change the "Pattern.compile" line to this (move the opening parenthesis):

    Pattern p = Pattern.compile("<li>([\\w\\s]+:\\s+\\d+\\s\\d+)</li>");
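If it matters which shop each number belongs to, one option is to capture the label and the number in separate groups and key the results by label. This is only a sketch in plain Java, outside screen-scraper: the class name `LabelledPhones`, the method `extract`, and the assumption that each phone entry sits alone inside `<li>...</li>` are mine, not part of the original script.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelledPhones {
    // Group 1 = text before the colon (e.g. "Monsoon Tel"), group 2 = the number.
    // Assumption: each phone entry is the only content of its <li> element.
    private static final Pattern PHONE =
            Pattern.compile("<li>\\s*([\\w\\s]+?):\\s+(\\d+\\s\\d+)\\s*</li>");

    // Returns label -> number, in the order the entries appear in the HTML.
    static Map<String, String> extract(String html) {
        Map<String, String> phones = new LinkedHashMap<>();
        Matcher m = PHONE.matcher(html);
        while (m.find()) {
            phones.put(m.group(1), m.group(2));
        }
        return phones;
    }

    public static void main(String[] args) {
        String html = "<li>AB25 1HZ</li>"
                + "<li>Monsoon Tel: 01224 649146</li>"
                + "<li>Accessorize Tel: 01224 649146</li>"
                + "<li>Men Tel: 01224 649146</li>";
        for (Map.Entry<String, String> e : extract(html).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Inside a screen-scraper script, each entry could then be stored with session.setVariable, using the label as part of the variable name instead of a running index.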

    Does this help? If there's some factor I'm not considering from your scenario, please let me know!

    Tim

    Wow.

    Thanks so much Tim. That's very helpful. Apologies for the late reply.

    I've realised that the address may contain a different number of elements each time.

    This means that if the subextractor contains 4 elements and an address contains 5, the fifth will not be returned. And if an address contains fewer than 4, no elements are returned.

    If I understand correctly, your solution for handling phone numbers accounts for the fact that there may be single or multiple phone numbers, right?

    Is there a way this solution can be applied to the address?

    Thanks again for your guidance.

    Joe

    Redundancy

    I think the best solution to varying numbers of lines in the address will be to use some redundancy:

    For your sub extractor patterns, I listed one that had 4 of the <li> tags in a single sub-pattern. Go ahead and keep that one in the list, but then add another sub-pattern just beneath it (i.e., second-to-last in sequence).

    Copy the text from the 4-line pattern into the new one.

    Make sure you redo the tokens' regex patterns, since the copy-paste will not have preserved the regex patterns.

    Add a 5th line to the new sub-pattern and name the new token variable appropriately.

    Now, when the pattern is applied, the 4-liner may very well match, but the 5-liner will have the opportunity to override the 4-line one whenever a 5-line address is present. If a 5-line address is not present, the new sub-pattern simply fails to match and has no effect on the result.
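If the number of address lines varies too much for redundant sub-patterns to stay manageable, another option might be to reuse the find-loop idea from the phone number script and collect every non-phone list item in order. This is a swapped-in alternative rather than the redundancy approach itself, sketched in plain Java outside screen-scraper. The class name `AddressLines`, the method `extract`, and the rule that phone entries are the only items containing a colon are assumptions based on the sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressLines {
    // Captures the trimmed text of each <li> element.
    // Assumption: list items contain plain text with no nested tags.
    private static final Pattern ITEM = Pattern.compile("<li>\\s*([^<]*?)\\s*</li>");

    // Returns every non-empty list item that does not look like a phone entry.
    static List<String> extract(String html) {
        List<String> lines = new ArrayList<>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            String text = m.group(1);
            // Assumption: only phone entries contain a colon ("Monsoon Tel: ...").
            if (!text.isEmpty() && !text.contains(":")) {
                lines.add(text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<li>Units 5</li><li>33 Bon Accord Centre</li>"
                + "<li>George Street</li><li>Aberdeen</li><li>AB25 1HZ</li>"
                + "<li>Monsoon Tel: 01224 649146</li>";
        List<String> lines = extract(html);
        for (int i = 0; i < lines.size(); i++) {
            // In a screen-scraper script this would be a session.setVariable call.
            System.out.println("ADDRESS_" + (i + 1) + " = " + lines.get(i));
        }
    }
}
```

The trade-off is that the fields are no longer named by meaning (street, city, postal code), only by position, so any field that must be identified reliably would still need its own regex check.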

    But yeah, redundancy. Sometimes when we've scraped real-estate websites for clients, we've had issues exactly like this. Just be sure to run test cases on both scenarios so that you can be certain that it's working.

    I hope that helps! These are things that I've learned in my time working with the guys at e-kiwi for screen-scraper programming :)

    Ah, and yes, the phone number script *would* in fact save multiple phone numbers, as session variables named "PHONE_1", "PHONE_2", and so on, given your initial example. It isn't limited to any fixed count: it saves as many phone numbers as the input contains.

    Tim