Newbie Help with Parent/Child Relationships

Hi,
I've been trying to work out a solution, but haven't been able to solve a parent-child issue. Hope you all can help!

Here is the structure:

Child1->ProductList->Product1_Detail
                     Product2_Detail
Child2->ProductList->Product1_Detail
                     Product2_Detail
Parent1->Child1->ProductList->Product1_Detail
                              Product2_Detail
         Child2->ProductList->Product1_Detail
                              Product2_Detail

At any point there can be a child with a product, as well as a parent with children.

I can individually run a scrape for say:
Child2->ProductList->Product1_Detail
                     Product2_Detail

But I am having difficulty running a scrape at a given level that finds both a child's products and, at the same time, the parent of another child. There are up to 5 levels of parent/child. So far I have been able to run the child scrape and then run another scrape file for the parent, but I get bogged down with the stored variable (product code) used to generate the parent code. Whether it's a parent or a child, the term used for identifying the page is category_code.

If this is not clear, please let me know.

Thanks.

NR

Well, I'm not sure if we're

Well, I'm not sure if we're well-enough equipped here at screen-scraper to handle such problems.. If I may suggest, you could check out this book, though:

http://www.drphilstore.com/famfirst.html

It's normally $26, but it's on sale on the site for $17.25.

... :) Just kidding. I couldn't help but chuckle at the title of the thread. I couldn't decide whether I was looking at a spam posting tailored to look real, or a legit question about an object-oriented site structure.

Now then! Let me see.. I'm going to try to repeat your question so that I can demonstrate whether I've understood you properly. I'll reorder the children a little bit to help generalize the scenario.

I'm going to color-code to make this easier..

You've got:
Main page    (A category page)
    child A    (Product listings for category)
        product details
    child B    (Directory of sub-categories)
        sub-child 1    (Product listings for sub-category)
            product details
        sub-child 2    (Product listings for sub-category)
            product details
    child C    (Product listings for category)
        product details

One question: Does the page for child A follow the same layout as sub-child 1?

If so, you could try and get recursive on it. For instance:

  1. Your scrapeableFile for Main page would have 2 extractor patterns on it to match all categories listed there:
    1. The first will match any URLs for categories whose links go to a product listings page. This should be evidenced by the URL. Make the extractor pattern like the following:



      ~@PARAMETERS@~: Save in session variable    Pattern: [^>]+
      Scripts:
          "Goto listings page"        After each pattern application
    2. The second extractor pattern would match anything that goes to another sub-category page, such as is the case with child B, whose link goes to another category page, this time with sub-categories sub-child 1 and sub-child 2:



      ~@PARAMETERS@~: Save in session variable    Pattern: [^>]+
      Scripts:
          "Redo category page (recursive)"        After each pattern application

      The script that this one is calling should just do a call to rescrape the very same scrapeableFile.

    One thing to note about this recursive approach is that you should only use it if there are 1, maybe 2, levels of sub-category branches. For instance, if your diagram looks more like the following, then you're going to encounter memory issues pretty quickly:
    Main page
        child A
            product details
        child B
            sub-child 1
                sub-sub-child 1
                    sub-sub-sub-child 1
                        product details
                    sub-sub-sub-child 2
                        product details
                sub-sub-child 2
                    sub-sub-sub-child 1
                        product details
                    sub-sub-sub-child 2
                        product details
            sub-child 2
                product details

  2. After your extractor patterns are set up, make sure that your URL field on that scrapeableFile has text like the following:

    http://www.somewebsite.com/category_view.asp?~#PARAMETERS#~

    As a warning, you might want to use a script to process the "PARAMETERS" sessionVariable, to change any "&amp;" into "&". The page will fail to request properly if you've got a bunch of "&amp;" HTML codes in your URL field (a small sketch of such a script follows this list).

    Using this method, you won't be listing your parameters in the "Parameters" tab. All you need is the URL that I've written just above.

  3. Finally, you'll have your "listings" scrapeableFile, which is equivalent to child A, sub-child 1, sub-child 2, and child C (from the first diagram that I supplied). The URL should be similar to the one in step 2:

    http://www.somewebsite.com/listings.asp?~#PARAMETERS#~
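
Here's the sort of thing I mean by that "clean up PARAMETERS" script mentioned in step 2. It's only a rough sketch in Interpreted Java; have it run after the extractor pattern has set the "PARAMETERS" session variable and before the next scrapeableFile is requested:

    // Rough sketch: turn any "&amp;" entities in the extracted parameters
    // back into plain "&" so the URL requests properly.
    String params = (String) session.getVariable("PARAMETERS");
    if (params != null)
    {
        session.setVariable("PARAMETERS", params.replaceAll("&amp;", "&"));
    }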

*big sigh* So. I hope I'm on the right track for what you need.

If you need more help, ask away!

Tim

Parent/Child Continuing

Tim,

Ha. Ha! Well, I for one appreciate your sense of humor! Good one...it took me a second while reading to figure out where you were going.

Thanks for the detailed reply (is there a quick and easy editor that converts the indents and coloring to code?).

You're right on track and very helpful. I've been digesting the information you provided. This of course leads to some more questions:

My example is more like the second, more detailed site map; however, each child, sub-child, or sub-sub-child gets called by the same code:
i.e.
http://www.xxxxxxxScreen=CTGY&Category_Code=Additives

More specifically, the site layout is as follows:

Store_Front->category_code->product_list->product_details
or
Store_Front->category_code->parent_code->category_code->product_list->product_details
or
Store_Front->category_code->parent_code->category_code->parent_code->category_code->product_list->product_details

This can repeat for several levels.

Category_code is used throughout the site as the level above the product list and then the product details, except when there is a parent_code thrown in; then the process needs to continue one level further.

I figured out how to scrape the child and parent. Thanks. I am working on each level.

In order to re-use a scraping file, is there any way to store the child variable ("category_code") and use it to get the product details throughout the scraping? Or do I need to create a new scrape for each level of child, sub-child, or sub-sub-child to get the product details?

Similarly, when I am writing the output to a file, I would like to use variables set earlier (e.g. when I am sending output from a sub-sub-child, there is information at the child level which I would like to include in each row of sub-sub-child data). I looked at session.getVariable, but did not have any luck; I get "null".

Many thanks!

Nadir

My pleasure :)

My pleasure :)

Unfortunately, no, there isn't a handy editor for colors and indentation. I just hacked that up myself with HTML. The indentation can only be done by putting in "non-breaking" spaces. To make a single one, you have to write "&nbsp;". I think I put 4 of those in a row in order to get a single indent, then 8 for two indents, etc, etc. Kind of hard to look at in plain old text while writing the reply :P The colors were a "<span style="color:blue">some text here</span>", where the word "blue" can be replaced with lots of basic color names, or with a 6-digit hex code, like "#000000" for black, or "#00ff00" for green.

Anyway.

Let me get one more thing clear... When viewing an arbitrary page with a list of categories on it (a mix of end-of-the-line child categories and new parent categories), the URLs are distinguished only by the parameters sent to the same page. So then, links on the page look like the following, for example:

Sean Connery, for $200 please.
fish heads, fish heads, rolly polly fish heads
boolean granola
breakfast cereals
Japan
10 little indians
squeeker
blah

You said that you've figured out how to do the initial scraping part, so I'll forgo further examination..

Keeping copies of each category code as you go down the list might get a little tricksy. The best way that I've done this went something like this:

  1. I'd start off with something in an initializing script that does this:

    session.setVariable("DEPTH", 0);

    When to run:     Before scraping session begins    
  2. Make sure you've got your extractor pattern(s) in place where you match for the category links. You'll be calling a couple of scripts from the extractor pattern.

    /view.php?Screen=CTGY&~@DATARECORD@~">

    ~@DATARECORD@~: Don't save in session variable    Pattern: [^"]+
    Scripts:
        Process link        Sequence: 1        After each pattern application
        Redo category page (recursive)        Sequence: 2        After each pattern application
      Sub-extractor patterns:

    • Category_Code=~@CHILD_CODE@~

      ~@CHILD_CODE@~: Don't save in session variable    Pattern: [^&"]+

    • &Parent_Code=~@PARENT_CODE@~

      ~@PARENT_CODE@~: Don't save in session variable    Pattern: [^"]+

    This way, your pattern will match on every single link, whether it's for a child or parent category. The script that I've said to call ("Process link" from the main extractor pattern part) will have access to "CHILD_CODE" and "PARENT_CODE" via the dataRecord.

  3. Now, for that script, "Process link". You'll have to keep track of each category code as a session variable with a unique name, or else you'll be overwriting the category variable each time you find a new category code. So, that means you'll need a variable variable name. As in, make a call to "session.setVariable" and pass it a String for its name, based on that "DEPTH" variable I said to create in step 1:
    Script: Process link in Interpreted Java

    // Category codes extracted by the sub-extractor patterns
    String parentCode = (String) dataRecord.get("PARENT_CODE");
    String childCode = (String) dataRecord.get("CHILD_CODE");
    String depth = "" + session.getVariable("DEPTH");

    // These will be the URL parameters for the next category page you want to browse into
    session.setVariable("PARAMETERS", "Category_Code=" + childCode);
    session.setVariable("CHILD_AT_DEPTH_" + depth, childCode);

    if (parentCode != null)
    {
        session.setVariable("PARENT_AT_DEPTH_" + depth, parentCode);
        session.setVariable("PARAMETERS", session.getVariable("PARAMETERS") + "&Parent_Code=" + parentCode);
    }


    Now you've got a variable that contains only the parameters needed to get into the category. You don't want to include a Parent_Code parameter if there isn't supposed to be one.
    Script: Redo category page (recursive) in Interpreted Java

    session.scrapeFile("The category-view scrapeableFile name here");
  4. And then, your scrapeableFile for the category view should have a URL field kind of like the following:

    http://www.someSite.com/view.php?Screen=CTGY&~#PARAMETERS#~

Now, I've left out something kind of critical... the "DEPTH" variable is never incremented. I'm not sure how you want to do that. If you increment it in the "Process link" script, you're going to find that every category on the site will have its own depth number... so, it'd be more like an ID than a depth.

On the other hand, if you want to keep it as a true "depth" variable, where all child categories from the main page have the same depth number, then you'd have to get even tricksier. It would involve adding 2 extra scripts:

  1. One that goes "Before pattern application" on that main extractor pattern for the category scrapeableFile. Its "sequence" wouldn't matter, but to make it look right, I'd put its sequence at "1", making the "Process link" script number 2.

    All this script would do is take the "DEPTH" variable and add 1 to it:

    session.addToVariable("DEPTH", 1);

  2. Another script would need to be run "After pattern is applied", which subtracts 1 from the DEPTH variable, and wipes out the old variables so that they don't carry over to any other categories by accident:

    session.setVariable("CHILD_AT_DEPTH_" + session.getVariable("DEPTH"), null);
    session.setVariable("PARENT_AT_DEPTH_" + session.getVariable("DEPTH"), null);
    session.addToVariable("DEPTH", -1);

The recursive nature of the scrapeableFile would increment the DEPTH variable by 1 every time it begins a new nested category page. After it finishes, it'll decrement it by 1, putting it back where it needs to be for the category list that called it. However, if a new nested category is encountered before the depth decrement, then the new recursive scrapeableFile will increment it *again* when it starts, and then decrement it back when it's finished. When you return to the previous scrapeableFile, everything will be fine and dandy.

With this method, you'll be overwriting the "PARENT_AT_DEPTH_X" and "CHILD_AT_DEPTH_X" variables (with "X" being the depth number, not a literal "X") as you go down the list. So, the second category on the page will overwrite the first one's variables. But, so long as you're going down the child, sub-child, sub-sub-child tree, the variables will just be written as "..._DEPTH_1", "..._DEPTH_2", "..._DEPTH_3", etc, etc.

In other words, you'll have access to the whole line of category variables from depth 1 to depth X, so long as that's the line of variables that corresponds to the category branch that you're working down. And of course, the sessionVariable "DEPTH" is always available to you, to know how many categories you've nested yourself into.
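
As for getting at those values when you write your output: here's a rough sketch (Interpreted Java) of a "write product row" script that walks the CHILD_AT_DEPTH_X variables from 1 up to the current DEPTH. The file name and the "PRODUCT_NAME" token are made up for the example, so substitute whatever your product details pattern actually extracts. If session.getVariable is handing you "null", double-check that the variable was saved as a session variable (or set with session.setVariable) before this script runs.

    import java.io.*;

    // Hypothetical output script: prefix each product row with the chain of
    // category codes from depth 1 down to the current depth.
    int depth = Integer.parseInt("" + session.getVariable("DEPTH"));
    StringBuffer row = new StringBuffer();

    for (int i = 1; i <= depth; i++)
    {
        String code = (String) session.getVariable("CHILD_AT_DEPTH_" + i);
        if (code != null)
        {
            row.append(code).append("\t");
        }
    }

    // "PRODUCT_NAME" is a made-up token name -- use whatever your details pattern extracts
    row.append(dataRecord.get("PRODUCT_NAME"));

    FileWriter out = new FileWriter("products.txt", true);  // true = append mode
    out.write(row.toString() + "\n");
    out.close();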

Hope this is helping :)

Tim

Re-Follow-up

Tim,

It took a while, but following your guidance, I was able to set up parent-child scripts to fully scrape the site! Thanks!

I am now working on the variable pass-through, as my PDQ solution of just writing the parent_category_code to the output file (to get an indented parent/child output) works, but has flaws. :-(

I have a new issue which is keeping me from completing. I have followed all the tutorial instructions on logging into a site and everything seems to be working...except that I don't see my logged in prices (which are discounted). I have also included all the pre-steps to the login page in the scrape script (from a posted forum suggestion). Is there something that I am not doing or doing incorrectly?

Click sequence for the website is:
Website main
Click on "Enter On-line Store"
goes to Storefront (xxx./Merchant2/merchant.mvc)
Click on "Please Sign In" goes to page (xxx.xxxxMerchant2/merchant.mvc+Session_ID=xxxxx&Screen=LOGN&Order=0)
Logging on takes me back to Storefront (https://xxxxxx/smivavm?xxxx/Merchant2/merchant.mvc+Session_ID=xxxx)

Many thanks for the great help!

Nadir