Extractor Generated Code Not Working!
I took this site:
http://www.scottsla.com/guides-books-maps-photos.htm
I then scraped it. I was trying to extract the category name, the listings, and the hyperlink for each. I was using this as a test to get familiar with the code.
I used this as a main extractor:
And this as a subextractor:
I am getting output, but it is just junk and not the information that I am looking for. What did I do wrong?
What is the main extractor?
Tim,
I can't get any data from that one. I am trying to get the main extractor to work, but I am having trouble with it. Could you give me some advice on the main extractor for the page we were discussing? The one you had me use is not pulling any data, so I'm just looking for a suggestion as to what I am doing wrong.
Ah, I'm sorry. I realized a
Ah, I'm sorry. I realized I made a mistake in the main extractor pattern that I gave you.
Use this one instead:
The sub extractors should still be fine.
What are the "patterns"
What are the "patterns" you've put into the "Pattern" tab on each variable? If you double-click the variable name (or right-click it), you'll need to set the pattern to something. The wrong patterns will cause you to get really weird information.
I'd suggest making the main extractor's "~@junk@~" variable have this pattern: [^>]*
~@ENTRYNAME@~ could be [^<>]*
~@WEBSITE@~ could be [^"]*
~@NAME@~ could be [^<>]*
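For example (this snippet is made up just to illustrate; it isn't taken from your page), with those patterns a sub extractor like

<a href="~@WEBSITE@~" ~@junk@~>~@ENTRYNAME@~</a>

would stop ~@WEBSITE@~ at the closing quote, let ~@junk@~ soak up any extra attributes on the tag, and stop ~@ENTRYNAME@~ at the next tag, so it would match something like <a href="http://www.example.com" class="entry">Some Listing Name</a>.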
If you still can't figure out what is going wrong, you could make a new extractor pattern and ONLY put this in it:
It should only get 1 row of data, but check to make sure that the columns are correct. They should match the data from the last entry on the page.
If they ARE correct, then that means your main extractor (where "DATARECORD" is found) is a little wrong.
After the tests, you can get rid of that temporary pattern I just had you make.
I'll look at it again soon, and let you know if I see anything tricky.
Tim
Could Not Get It To Work.
I tried to change the pattern and I could not get it to work, so I took the time and found another page. I am testing different pages to get a feel for how the extractor pattern is put together, but I keep drawing blanks because I can't get it to function properly. I tried this page:
http://www.yellowpages.com/nationwide/category_search/Travel-Agencies/state-AK?page=12&search_terms=travel+agents
I was pulling the name, address, city, state, zip, phone number, and web address.
I used this for the main extractor:
And this for the subextractor:
~@ENTRYNAME@~
~@ADDRESS@~
~@CITY@~,~@STATE@~ ~@ZIP@~
It seems to me that the main
It seems to me that the main extractor pattern is too general. Right now, it is going to try to match every div tag on the entire page. Extractor patterns need to be general enough to match every record, but as specific as you can make them.
In the most recent example you've given, your pattern isn't specific enough.
Try:
It's not a super hard process, it's just a matter of getting a feel for where the top and bottom of the areas you want are.
In this case, the page hands the main extractor to you on a silver platter, with a class called "listing". The info you want occurs before the "tools" section of the HTML, where the "directions" and "map it" links are, which makes an easy place to cut off the search.
Given the page you supplied to me, you should see 9 records matched. After that, it's just a matter of making sure that the sub extractors are matching correctly.
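Just to sketch the idea (I'm guessing at the surrounding tags here rather than quoting the page, so copy the real HTML instead of using this literally), a main extractor that uses the "listing" class as the top and the "tools" section as the bottom would be shaped something like

<div class="listing">~@DATARECORD@~<div class="tools">

with everything between those two landmarks falling into ~@DATARECORD@~.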
Tried It!
I tried the main extractor you had but it did not work. I am trying to figure out the right combination. I will keep trying.
I'm running the extractor
I'm running the extractor pattern right now and it's working great to extract all 9 entries on that page.
Just copy and paste what I've written and it should work. No pattern is needed for the "DATARECORD" variable.
You were almost right with most of those extractor patterns. Something to remember is that you can't just leave out parts, like the "href" value between the quotes.
For instance:
is a very different thing from
You have to account for all text in your pattern, even if you're using a variable to represent it. Leaving text out will make the pattern fail to match.
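To illustrate with a made-up tag (not one copied from your page): if the HTML is

<a href="/some/page" onmousedown="track(this)">Acme Travel</a>

then a pattern like

<a href="~@WEBSITE@~">~@ENTRYNAME@~</a>

will never match, because nothing accounts for the onmousedown text. Something like

<a href="~@WEBSITE@~" ~@junk_parameters@~>~@ENTRYNAME@~</a>

will, because every character between the pieces you care about is covered by a variable.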
That being said, here are the sub extractors I used to quickly snag some of your data:
~@ENTRYNAME@~
You almost had this one, but you left the href part without a variable. Additionally, there is more than just the href in there; there's an onmousedown JavaScript attribute. That's why I've added "~@junk_parameters@~" into the pattern. ~@junk_href@~ should have a pattern of [^"]* and ~@junk_parameters@~ should have a pattern of [^<>]*.
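Roughly speaking (this is just the shape of it, with the attribute order as a guess, not the exact pattern), that sub extractor looks something like

<a href="~@junk_href@~" ~@junk_parameters@~>~@ENTRYNAME@~</a>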
~@ADDRESS@~
~@CITY@~, ~@STATE@~ ~@ZIP@~ Map
You were close on this one, too, but you left out all of the parameters on the <a> tag. Additionally, the word "Map" and then a "</a>" appear before the final "</p>". Again, this is just a matter of accounting for everything.
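Roughly speaking (the <br> between the lines and the attribute placement are guesses on my part, so check against the real HTML), that one is shaped something like

~@ADDRESS@~<br>~@CITY@~, ~@STATE@~ ~@ZIP@~ <a ~@junk_map_parameters@~>Map</a></p>

with ~@junk_map_parameters@~ getting a pattern like [^<>]*.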
This one looks fine in your subextractors.
In this one, you only had the "href" part of the <a> tag. The characteristic feature of this tag is that it also has a "class" attribute, which flags it as the "main_web_site". Your pattern also made it look like there was no text that the link spans across; again, <a href="~@variable@~"></a> is a very different thing from <a href="~@variable@~" class="~@anotherVariable@~">some text to click on</a>.
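Shaped out (the attribute order and the junk variable names here are just mine, for illustration), that means a pattern more like

<a href="~@WEBSITE@~" class="~@junk_class@~">~@junk_link_text@~</a>

where ~@junk_class@~ could use [^"]* and ~@junk_link_text@~ could use [^<>]*.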
I would use copy and paste to your advantage more frequently. Literally copy the HTML (from the "Last Response" tab) that you want to match, then paste it into the extractor pattern, and then start replacing things with variables. If you ever delete something without making sure a variable will compensate for it, then the pattern won't match anymore.
The approach I always use is to avoid typing the pattern text directly into the extractor pattern text box; copy-and-paste is the way to go. You'll be sure to match the pattern if you do it that way.
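For instance (made-up HTML again), if the "Last Response" tab showed

<h3 class="name"><a href="http://www.acmetravel.example">Acme Travel</a></h3>

you would paste that whole line into the pattern box and then replace only the pieces that change from listing to listing:

<h3 class="name"><a href="~@WEBSITE@~">~@ENTRYNAME@~</a></h3>

Everything you didn't replace stays exactly as it was copied, so the pattern lines up with the page character for character.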
I hope you haven't become discouraged during this process. When I first started working at screen-scraper I had a rough time trying to get the process down. And even once you've got it, you'll be sure to encounter a website in the future which strains your abilities!! I know that I've found a few impossible ones :)
Best of luck. Keep the questions coming if you've got more!
Tim