How to write an Extractor!
I am trying to work out what I need to look at to write an extractor pattern. I used one that I received here and it worked, but when I looked through it I could not figure out where it came from. I have been looking at more examples similar to what we would be scraping, with different fields and parameters, and I wanted to know what the extractor pattern would be for this site:
http://www.visitflorida.com/listings/taggroup.Aquariums
If you were trying to get:
Names, address, city, state, zip, phone, website
How would you get the extractor for this? And where would it come from?
Anytime that you have multiple entries on a single page, it is best to use the "DATARECORD" approach. This means that your main extractor pattern needs to be general... let me see if I can illustrate this clearly:
For instance, pretend like the website you want to scrape looks like this:
some text
asdfasdfasdfasdf asdf asdf
hi mom!
another paragraph
another paragraph with stuff inside of it.
I've put some irregularities in there on purpose.
Now, pretend like your goal is to get the text from each and every "<p>" tag (i.e., a paragraph). Your extractor pattern would need to look like this:
<p>~@DATARECORD@~</p>
And that's it! If you were to push the "Apply Pattern to Last Scraped Data" button in screen-scraper, you would see that it matched all of those "p" tags. The "~@DATARECORD@~" variable has the text from between the "<p>" and "</p>".
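If it helps to see the same idea outside of screen-scraper, here is a rough Python sketch of matching everything between "<p>" and "</p>". It is only a stand-in: it uses Python's `re` module rather than screen-scraper, the sample HTML (including the `class` attribute and the upper-case tag) is made up for illustration, and the named group `DATARECORD` just plays the role of the "~@DATARECORD@~" token.

```python
import re

# Made-up page for illustration; the irregularities (an attribute, an
# upper-case tag, nested markup) are assumptions, not the real example.
html = """
<p>some text</p>
<p class="intro">hi mom!</p>
<P>another paragraph</P>
<p>another paragraph with <b>stuff</b> inside of it.</p>
"""

# "[^>]*" skips anything extra inside the opening tag, and the lazy
# ".*?" stops at the first closing </p>, so each paragraph is one match.
record_pattern = re.compile(
    r"<p\b[^>]*>(?P<DATARECORD>.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)

records = [m.group("DATARECORD") for m in record_pattern.finditer(html)]

for record in records:
    print(record)
```

One general pattern picks up every paragraph, however many there are, which is the whole point of keeping the main pattern loose.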
Now, expand this example to encompass the site you gave a link for: Pretend that instead of matching between "<p>" and "</p>", you're matching between the top of each entry to the bottom of the entry. If you were to achieve that, you would then have all of the text between the top and bottom of each entry in your DATARECORD variable. Does that make sense? All you ever have to do is find the HTML that represents the top and bottom of each entry on the page, and put "~@DATARECORD@~" between them. You'll automatically be extracting every instance of that entry on the page, no matter how many there are.
Now, the magical thing about "DATARECORD" is that it allows you to use "sub-extractor" patterns. All a sub-extractor does is look inside of the DATARECORD variable for information. So, you can find the website of the entry by having a sub-extractor like this:
This way, if there's a website to be had, you'll get it.
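As a loose illustration of what a sub-extractor does, here is a small Python sketch that searches inside each record for a link. The record contents, the `<h3>` and `<a href>` markup, and the helper name `sub_extract` are all hypothetical; the real site's HTML is not reproduced here.

```python
import re

# Hypothetical DATARECORD contents, one string per entry on the page.
records = [
    '<h3>Example Aquarium</h3> <a href="http://example.com/">Visit website</a>',
    '<h3>Another Aquarium</h3>',  # this entry has no website link
]

# Stand-in for a sub-extractor with a WEBSITE token whose pattern is [^"]*
website_pattern = re.compile(r'<a href="(?P<WEBSITE>[^"]*)"')


def sub_extract(record):
    """Look only inside one record; return the website or None if absent."""
    match = website_pattern.search(record)
    return match.group("WEBSITE") if match else None


websites = [sub_extract(record) for record in records]
print(websites)
```

The entry without a link still produces a record; it just has nothing in the website slot, so one missing field does not cost you the whole entry.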
So, the top and bottom of each entry look like this on the website you gave me:
~@junk@~: DON'T save in session variable. Pattern: [^>]*
~@DATARECORD@~: DON'T save in session variable. Pattern: no pattern needed
And then you can put the following sub-extractors into the mix:
-
~@ADDRESS@~
~@CITY@~, ~@STATE@~ ~@ZIP@~
-
~@PHONE@~
-
~@ENTRY_NAME@~
~@ENTRY_NAME@~: Save in session variable? Pattern: [^<>]*
~@ADDRESS@~: Save in session variable? Pattern: [^<>]*
~@CITY@~: Save in session variable? Pattern: [^,]+
~@STATE@~: Save in session variable? Pattern: [A-Z]{2}
~@ZIP@~: Save in session variable? Pattern: \d{5}
~@PHONE@~: Save in session variable? Pattern: [^<>]*
~@WEBSITE@~: Save in session variable? Pattern: [^"]*
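To show how the token patterns above constrain what each token can match, here is a minimal Python sketch that turns the "~@CITY@~, ~@STATE@~ ~@ZIP@~" line into an ordinary regular expression. This is not how screen-scraper is implemented; `compile_extractor` and `TOKEN_PATTERNS` are names invented for this example, and the sample address is made up.

```python
import re

# Token regexes copied from the settings listed above.
TOKEN_PATTERNS = {
    "CITY": r"[^,]+",
    "STATE": r"[A-Z]{2}",
    "ZIP": r"\d{5}",
}


def compile_extractor(pattern):
    """Turn ~@NAME@~ tokens into named groups; escape the literal text."""
    # re.split with a capturing group alternates: literal, name, literal, ...
    parts = re.split(r"~@(\w+)@~", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # literal text between tokens
        else:
            regex += "(?P<%s>%s)" % (part, TOKEN_PATTERNS[part])
    return re.compile(regex)


extractor = compile_extractor("~@CITY@~, ~@STATE@~ ~@ZIP@~")
match = extractor.search("Tampa, FL 33602")  # made-up sample line
fields = match.groupdict()
print(fields)
```

The literal comma and spaces act as the anchors, and each token's regex keeps it from swallowing its neighbours: CITY stops at the comma, STATE is exactly two capitals, and ZIP is exactly five digits.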
Now, when you push that "Apply Pattern to Last Scraped Data" button, you should have one row for every entry on the page, and then a column for each variable that the sub-extractors were able to find.
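The row-and-column result can be pictured with another short Python sketch: one dictionary per entry, one key per sub-extractor. Everything about the markup here (the `div class="entry"` wrapper, the `h3` and `span` tags, the phone numbers) is invented for illustration and will not match the real page.

```python
import re

# Invented markup standing in for two entries on a listing page.
page = """
<div class="entry"><h3>Example Aquarium</h3><span class="phone">555-0100</span>
<a href="http://example.com/">Website</a></div>
<div class="entry"><h3>Another Aquarium</h3><span class="phone">555-0101</span></div>
"""

# Main pattern: top and bottom of each entry, with the record in between.
record_pattern = re.compile(r'<div class="entry">(.*?)</div>', re.DOTALL)

# Sub-extractors: each one looks only inside a single record.
sub_patterns = {
    "ENTRY_NAME": re.compile(r"<h3>([^<>]*)</h3>"),
    "PHONE": re.compile(r'<span class="phone">([^<>]*)</span>'),
    "WEBSITE": re.compile(r'<a href="([^"]*)"'),
}

rows = []
for record in record_pattern.findall(page):
    row = {}
    for name, pattern in sub_patterns.items():
        found = pattern.search(record)
        row[name] = found.group(1) if found else None  # empty column if absent
    rows.append(row)

for row in rows:
    print(row)
```

Each entry becomes one row, and a sub-extractor that finds nothing simply leaves that column empty for that row.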
I hope this is making sense... the trick is finding HTML around the value you want that is unique to it. The only reason I can write these patterns is that I've looked at the HTML and found some unique bits around the info you want.
You can literally take anything out of each section. You can grab the URL to that little image, you could save the link for the "more" part for the description, etc, etc. Anything you can see, you can have. You just have to find the pattern in the HTML.
Tim
Thanks
Thanks Tim,
I understand now!
you are quite welcome! :)
Tim