Newbie Help with Parent/Child Relationships
Hi,
I've been trying to work out a solution, but haven't been able to solve a parent-child issue. Hope you all can help!
Here is the structure:
Child1->ProductList->Product1_Detail
                     Product2_Detail
Child2->ProductList->Product1_Detail
                     Product2_Detail
Parent1->Child1->ProductList->Product1_Detail
                              Product2_Detail
         Child2->ProductList->Product1_Detail
                              Product2_Detail
At any point there can be a child with a product, as well as a parent with children.
I can individually run a scrape for say:
Child2->ProductList->Product1_Detail
                     Product2_Detail
But, I am having difficulty with running a scrape for a given level to find the child->products and also the parent of another child at the same time. There are up to 5 levels of Parent/Child. I have so far been able to run the child and then run another scrape file for the parent, but then get bogged down with the stored variable (product code) to generate the parent code. Whether a parent or a child, the term used for identifying the page is category_code.
If this is not clear, please let me know.
Thanks.
NR
Well, I'm not sure if we're well enough equipped here at screen-scraper to handle such problems. If I may suggest, you could check out this book, though:
http://www.drphilstore.com/famfirst.html
It's normally $26, but it's on sale on the site for $17.25.
... :) Just kidding. I couldn't help but chuckle at the title of the thread. I couldn't decide if I might see a spam posting tailored to look real, or if it'd be a legit object-oriented site question.
Well then! Let me see. I'm going to try to repeat your question so that I can show whether I've understood you properly. I'll reorder the children a little bit to help generalize the scenario.
I'm going to color-code to make this easier.
You've got:
Main page (A category page)
    child A (Product listings for category)
        product details
    child B (Directory of sub-categories)
        sub-child 1 (Product listings for sub-category)
            product details
        sub-child 2 (Product listings for sub-category)
            product details
    child C (Product listings for category)
        product details
One question: Does the page for child A follow the same layout as sub-child 1?
If so, you could try and get recursive on it. For instance:
~@PARAMETERS@~: Save in session variable? Pattern: [^>]+ Scripts: Goto listings page After each pattern application
~@PARAMETERS@~: Save in session variable? Pattern: [^>]+ Scripts: Redo category page (recursive) After each pattern application
The script that this one is calling should just do a call to rescrape the very same scrapeableFile.
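To make the recursive idea concrete, here is a minimal sketch in plain Java rather than in screen-scraper itself. `CategoryPage`, `crawl`, and `sampleSite` are made-up names for illustration; in a real scraping session the recursion would come from the script re-running the same scrapeableFile, not from an in-memory tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a category page; not a screen-scraper class.
class CategoryPage {
    final String code;
    final List<CategoryPage> subCategories = new ArrayList<>();
    final List<String> products = new ArrayList<>();

    CategoryPage(String code) {
        this.code = code;
    }
}

class RecursiveCrawlSketch {
    // Records the products listed on this page, then re-enters the same
    // routine for every sub-category link found on it.
    static void crawl(CategoryPage page, List<String> found) {
        for (String product : page.products) {
            found.add(page.code + "/" + product);
        }
        for (CategoryPage sub : page.subCategories) {
            crawl(sub, found);
        }
    }

    // Builds a tiny site: main -> childA (products), main -> childB -> sub1 (products).
    static CategoryPage sampleSite() {
        CategoryPage main = new CategoryPage("main");
        CategoryPage childA = new CategoryPage("childA");
        childA.products.add("p1");
        CategoryPage childB = new CategoryPage("childB");
        CategoryPage sub1 = new CategoryPage("sub1");
        sub1.products.add("p2");
        childB.subCategories.add(sub1);
        main.subCategories.add(childA);
        main.subCategories.add(childB);
        return main;
    }

    public static void main(String[] args) {
        List<String> found = new ArrayList<>();
        crawl(sampleSite(), found);
        System.out.println(found);
    }
}
```

The same routine handles every level, which is why a single category scrapeableFile can serve child, sub-child, and sub-sub-child pages alike, provided they share a layout.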
One thing to note about this recursive approach is that you should only use it if there is 1, maybe 2, sub-category branches. For instance, if your diagram looks more like the following, then you're going to encounter memory issues pretty quickly:
Main page
    child A
        product details
    child B
        sub-child 1
            sub-sub-child 1
                sub-sub-sub-child 1
                    product details
                sub-sub-sub-child 2
                    product details
            sub-sub-child 2
                sub-sub-sub-child 1
                    product details
                sub-sub-sub-child 2
                    product details
        sub-child 2
            product details
http://www.somewebsite.com/category_view.asp?~#PARAMETERS#~
As a warning, you might want to use a script to process the "PARAMETERS" sessionVariable, to change "&amp;" to "&". It will cause the page to fail to request properly if you've got a bunch of "&amp;" HTML codes in your URL field.
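A rough sketch of that clean-up step, written as plain Java so it can be tried outside screen-scraper. The class and method names are made up for illustration; inside a screen-scraper script you would presumably read the value with session.getVariable("PARAMETERS") and store the result back with session.setVariable.

```java
class ParameterCleaner {
    // Replaces the HTML entity "&amp;" with a bare "&" so the value can be
    // dropped straight into a URL.
    static String unescapeAmpersands(String parameters) {
        return parameters.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String raw = "Screen=CTGY&amp;Category_Code=Additives";
        System.out.println(unescapeAmpersands(raw));
    }
}
```

This only handles the "&amp;" entity, which is the one that breaks the request here; other HTML entities would need their own handling.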
Using this method, you won't be listing your parameters in the "Parameters" tab. All you need is the URL that I've written just above.
http://www.somewebsite.com/listings.asp?~#PARAMETERS#~
*big sigh* So. I hope I'm on the right track for what you need.
If you need more help, ask away!
Tim
Parent/Child Continuing
Tim,
Ha. Ha! Well, I for one appreciate your sense of humor! Good one...it took me a second while reading to figure out where you were going.
Thanks for the detailed reply (is there a quick and easy editor that converts the indents and coloring to code?).
You're right on track and very helpful. I've been digesting the information you provided. This of course leads to some more questions:
My example is more like the second, more detailed, site map, however, each child or sub-child or sub-sub-child gets called by the same code:
i.e.
http://www.xxxxxxxScreen=CTGY&Category_Code=Additives
More specifically, the site layout is as follows:
Store_Front->category_code->product_list->product_details
or
Store_Front->category_code->parent_code->category_code->product_list->product_details
or
Store_Front->category_code->parent_code->category_code->parent_code->category_code->product_list->product_details
This can repeat for several levels.
Category_code is used throughout the site as the level above the product list and then the product details, except when there is a parent_code thrown in. Then the process needs to continue one level further.
I figured out how to scrape the child and parent. Thanks. I am working on each level.
In order to re-use a scraping file, is there any way to store the child variable = "category_code" and use it to get the product details throughout the scraping? Or do I need to create a new scrape for each level of child or sub-child or sub-sub-child to get the product details?
Similarly, when I am writing the output to a file, I would like to use variables that were set earlier (e.g. when I am sending output from a sub-sub-child, there is information at the child level which I would like to include in each row of sub-sub-child data). I looked at session.getVariable, but did not have any luck. I get "null".
Many thanks!
Nadir
My pleasure :)
Unfortunately, no, there isn't a handy editor for colors and indentation. I just cranked that out myself with HTML. The indentation can only be done by putting in "non-breaking" spaces. To make a single one, you have to write "&nbsp;". I think I put 4 of those in a row in order to get a single indent, then 8 for two indents, etc, etc. Kind of hard to look at in plain old text while writing the reply :P The colors were a "<span style="color:blue">some text here</span>", where the word "blue" can be replaced with lots of basic color names, or with a 6-digit hex code, like "#000000" for black, or "#00ff00" for green.
Anyway.
Let me get one more thing clear..... When viewing an arbitrary page with a list of categories on it (a mix of end-of-the-line child categories and of new parent categories), the URLs are distinguished only by the parameters sent to the same page. So then, links on the page look like the following, for example:
Sean Connery, for $200 please.
fish heads, fish heads, rolly polly fish heads
boolean granola
breakfast cereals
Japan
10 little indians
squeeker
blah
You said that you've figured out how to do the initial scraping part, so I'll forgo further examination.
To keep copies of each category code as you go down the list might get a little tricksy. The best way that I've done this went something like this:
session.setVariable("DEPTH", 0);
When to run: Before scraping session begins
/view.php?Screen=CTGY&~@DATARECORD@~">
~@DATARECORD@~: Don't save in session variable Pattern: [^"]+
Scripts:
Process link Sequence: 1 After each pattern application
Redo category page (recursive) Sequence: 2 After each pattern application
Sub-extractor patterns:
Category_Code=~@CHILD_CODE@~
~@CHILD_CODE@~: Don't save in session variable Pattern: [^&"]+
&Parent_Code=~@PARENT_CODE@~
~@PARENT_CODE@~: Don't save in session variable Pattern: [^"]+
This way, your pattern will match on every single link, whether it's for a child or parent category. The script that I've said to call ("Process link" from the main extractor pattern part) will have access to "CHILD_CODE" and "PARENT_CODE" via the dataRecord.
Script: Process link in Interpreted Java
// Values captured by the sub-extractor patterns; PARENT_CODE is null when the link has no Parent_Code
String parentCode = (String) dataRecord.get("PARENT_CODE");
String childCode = (String) dataRecord.get("CHILD_CODE");
// DEPTH was stored as a number, so convert it instead of assigning it straight to a String
String depth = String.valueOf(session.getVariable("DEPTH"));
// These will be the URL parameters for the next category page you want to browse into
String parameters = "Category_Code=" + childCode;
session.setVariable("CHILD_AT_DEPTH_" + depth, childCode);
if (parentCode != null)
{
    session.setVariable("PARENT_AT_DEPTH_" + depth, parentCode);
    parameters = parameters + "&Parent_Code=" + parentCode;
}
session.setVariable("PARAMETERS", parameters);
Now you've got a variable that contains only the parameters needed to get into the category. You don't want to include a parentCategory parameter if there isn't supposed to be one.
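For anyone who wants to try the link-handling logic outside screen-scraper, the steps above (match the two codes, then build the parameter string) can be sketched with java.util.regex. The class name, the `build` method, and the sample query strings are invented for illustration. One small deviation from the post: the parent pattern here also stops at "&", an assumption made so that a trailing parameter would not be swallowed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkParameters {
    // Mirrors the CHILD_CODE sub-extractor pattern: everything up to "&" or a quote.
    private static final Pattern CHILD = Pattern.compile("Category_Code=([^&\"]+)");
    // Mirrors the PARENT_CODE sub-extractor pattern, tightened to stop at "&" too.
    private static final Pattern PARENT = Pattern.compile("&Parent_Code=([^&\"]+)");

    // Returns the parameter string for the next category page,
    // or null when the link carries no Category_Code at all.
    static String build(String linkQuery) {
        Matcher child = CHILD.matcher(linkQuery);
        if (!child.find()) {
            return null;
        }
        String parameters = "Category_Code=" + child.group(1);
        Matcher parent = PARENT.matcher(linkQuery);
        if (parent.find()) {
            // Only append Parent_Code when the link actually had one.
            parameters += "&Parent_Code=" + parent.group(1);
        }
        return parameters;
    }

    public static void main(String[] args) {
        System.out.println(build("Screen=CTGY&Category_Code=Additives"));
        System.out.println(build("Screen=CTGY&Category_Code=Oils&Parent_Code=Additives"));
    }
}
```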
Script: Redo category page (recursive) in Interpreted Java
session.scrapeFile("The category-view scrapeableFile name here");
http://www.someSite.com/view.php?Screen=CTGY&~#PARAMETERS#~
Now, I've left out something kind of critical... the "DEPTH" variable is never incremented. I'm not sure how you want to do that. If you increment it in the "Process link" script, you're going to find that every category on the site will have its own depth number... so, it'd be more like an ID than a depth.
On the other hand, if you want to keep it as a true "depth" variable, where all child categories from the main page have the same depth number then you'd have to get even tricksier. It would involve adding 2 extra scripts:
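The first script runs when the category scrapeableFile begins. All this script would do is take the "DEPTH" variable and add 1 to it:
session.addToVariable("DEPTH", 1);
The second script runs when the scrapeableFile finishes. It clears the variables for the current depth and then takes 1 back off "DEPTH":
session.setVariable("CHILD_AT_DEPTH_" + session.getVariable("DEPTH"), null);
session.setVariable("PARENT_AT_DEPTH_" + session.getVariable("DEPTH"), null);
session.addToVariable("DEPTH", -1);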
The recursive nature of the scrapeableFile would increment the DEPTH variable by 1 every time it begins a new nested category page. After it finishes, it'll decrement it by 1, putting it back where it needs to be for the category list that called it. However, if a new nested category is encountered before the depth decrement, then the new recursive scrapeableFile will increment it *again* when it starts, and then decrement it back when it's finished. When you return to the previous scrapeableFile, it'll be okay and dandy.
With this method, you'll be overwriting the "PARENT_AT_DEPTH_X" and "CHILD_AT_DEPTH_X" (with "X" being the depth, not an "X") as you go down the list. So, the second category on the page will overwrite the first one's variables. But, so long as you're going down the child, sub-child, sub-sub-child tree, the variables will just be written as "..._DEPTH_1", ".._DEPTH_2", "..._DEPTH_3", etc, etc.
In other words, you'll have access to the whole line of category variables from depth 1 to depth X, so long as that's the line of variables that corresponds to the category branch that you're working down. And of course, the sessionVariable "DEPTH" is always available to you, to know how many categories you've nested yourself into.
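Here is a small simulation of that bookkeeping in plain Java, with an ordinary map standing in for the session variables. Everything in it (the class name, the `tree` map of category codes, the " > " separator) is an assumption for illustration, not screen-scraper API. It shows that when you reach a product listing, the whole chain of "CHILD_AT_DEPTH_X" values for the current branch is still available, which is what you would want for writing parent information into each output row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DepthTrackingSketch {
    // Stand-in for screen-scraper session variables.
    static final Map<String, Object> vars = new HashMap<>();
    // One row per end-of-the-line category, holding its full category path.
    static final List<String> rows = new ArrayList<>();

    // tree maps a category code to the sub-category codes listed on its page.
    static void scrapeCategory(String code, Map<String, List<String>> tree) {
        // "Before" script: entering a category page adds 1 to DEPTH.
        int depth = (Integer) vars.get("DEPTH") + 1;
        vars.put("DEPTH", depth);

        List<String> subs = tree.getOrDefault(code, Collections.<String>emptyList());
        if (subs.isEmpty()) {
            // Product listing reached: the branch above is still in the variables.
            rows.add(pathToCurrent());
        }
        for (String sub : subs) {
            // "Process link": remember the code at this depth, then recurse.
            vars.put("CHILD_AT_DEPTH_" + depth, sub);
            scrapeCategory(sub, tree);
        }

        // "After" script: clear this depth's variable, then take 1 back off DEPTH.
        vars.remove("CHILD_AT_DEPTH_" + depth);
        vars.put("DEPTH", depth - 1);
    }

    // Joins CHILD_AT_DEPTH_1 .. CHILD_AT_DEPTH_(DEPTH - 1) into one string.
    static String pathToCurrent() {
        int depth = (Integer) vars.get("DEPTH");
        StringBuilder path = new StringBuilder();
        for (int d = 1; d < depth; d++) {
            if (d > 1) {
                path.append(" > ");
            }
            path.append(vars.get("CHILD_AT_DEPTH_" + d));
        }
        return path.toString();
    }

    static List<String> run(Map<String, List<String>> tree, String root) {
        vars.clear();
        rows.clear();
        vars.put("DEPTH", 0); // same starting value as the "before session" script
        scrapeCategory(root, tree);
        return new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = new HashMap<>();
        tree.put("Store", List.of("Additives", "Filters"));
        tree.put("Additives", List.of("Oils"));
        System.out.println(run(tree, "Store"));
    }
}
```

After the run, "DEPTH" is back at 0, since every increment on the way down is matched by a decrement on the way back up.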
Hope this is helping :)
Tim
Re-Follow-up
Tim,
It took a while, but following your guideline, I was able to set-up parent-child scripts to fully scrape the site! Thanks!
I am now working on the variable pass-through as my PDQ solution of just writing the parent_category_code to the output file to get an indented parent/child output works, but has flaws. :-(
I have a new issue which is keeping me from completing. I have followed all the tutorial instructions on logging into a site and everything seems to be working...except that I don't see my logged in prices (which are discounted). I have also included all the pre-steps to the login page in the scrape script (from a posted forum suggestion). Is there something that I am not doing or doing incorrectly?
Click sequence for the website is:
Website main
Click on "Enter On-line Store"
goes to Storefront (xxx./Merchant2/merchant.mvc)
Click on "Please Sign In" goes to page (xxx.xxxxMerchant2/merchant.mvc+Session_ID=xxxxx&Screen=LOGN&Order=0)
Logging on takes me back to Storefront (https://xxxxxx/smivavm?xxxx/Merchant2/merchant.mvc+Session_ID=xxxx)
Many thanks for the great help!
Nadir