extractData ???

I have a scraping session whose main extractor pattern scrapes topics. I then created another extractor pattern that matches however many entries there are within the first dataSet. That works, but I need to be able to get the topic alongside each set of detail variables.

For example:

The first Main Extractor has the following pattern text, and the second Main Extractor is called manually from a script.

1st MAIN:

2nd MAIN:

~@COMMENTARY_DATE@~

~@COMMENTARY@~

I want to take the Category from the data extracted by the 1st Main pattern and include it in each row when writing out to a .csv file.

Like:
CATEGORY, COMMENTARY_DATE, COMMENTARY

The original HTML is as follows:

Support and Documentation - Positive

8/2009

The support people are very responsive to any issues we come across. The vendor is very proactive in communicating the changes we need to make.

2/2009

The customer service from HDX is great right now. HDX has made tremendous improvements, so their service is better and more responsive. As we make requests, they are doing a much better job in fulfilling them.

I would like the following retrieved from the above example:

CATEGORY, COMMENTARY_DATE, COMMENTARY
Support and Documentation - Positive, 8/2009, The support people are very responsive to any issues we come across.
Support and Documentation - Positive, 2/2009, The customer service from HDX is great right now.

This will be hard without me

This will be hard without my being able to get to the page.

First, you need to expand your first main extractor to include one token that will encompass all of the comments, e.g.:

~@COMMENTS@~

The COMMENTS token will capture one segment of HTML that holds all of the comments, no matter how many there are.

Then you will make a script that runs 2nd Main against the COMMENTS token:

import com.screenscraper.common.*;

// Apply the "2nd Main" extractor pattern to the HTML held in the COMMENTS token.
myDataSet = scrapeableFile.extractData(dataRecord.get("COMMENTS"), "2nd Main");

// Step through each record that "2nd Main" matched.
for (i = 0; i < myDataSet.getNumDataRecords(); i++)
{
    myDataRecord = myDataSet.getDataRecord(i);

    // Here is where you would do something with the results.
}
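
To make the two-stage idea concrete outside screen-scraper, here is a rough sketch in plain Java that uses java.util.regex instead of extractor patterns. The <h2>, <b> and <p> markup, the class name NestedExtractSketch and the extract method are all made up for illustration, since the real page's HTML isn't available here. The outer match stands in for the first main pattern (one category plus one COMMENTS block), the inner match stands in for 2nd Main, and the category is copied onto every inner row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedExtractSketch {
    // Hypothetical markup: a heading per category, then <b>date</b><p>text</p> pairs.
    // Group 1 is the category, group 2 is everything up to the next heading.
    static final Pattern OUTER = Pattern.compile(
        "<h2>(.*?)</h2>(.*?)(?=<h2>|$)", Pattern.DOTALL);

    // Applied only to the block captured by OUTER, like running 2nd Main on COMMENTS.
    static final Pattern INNER = Pattern.compile(
        "<b>(.*?)</b>\\s*<p>(.*?)</p>", Pattern.DOTALL);

    // Returns one {category, date, commentary} row per comment.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher outer = OUTER.matcher(html);
        while (outer.find()) {
            String category = outer.group(1).trim();
            String comments = outer.group(2);

            Matcher inner = INNER.matcher(comments);
            while (inner.find()) {
                // The category comes from the outer match, so it repeats on each row.
                rows.add(new String[] {
                    category, inner.group(1).trim(), inner.group(2).trim()
                });
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html =
            "<h2>Support and Documentation - Positive</h2>"
            + "<b>8/2009</b><p>The support people are very responsive to any issues we come across.</p>"
            + "<b>2/2009</b><p>The customer service from HDX is great right now.</p>";
        for (String[] row : extract(html)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Inside a screen-scraper script the same idea would presumably mean reading the category from the outer dataRecord and pairing it with the fields of each myDataRecord inside the loop.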

The logic you have in the

The logic you have in the First Main does not make sense to me. I have the following:

In the Second Main I have the script that runs 2nd Main against the CATEGORYHEADER token:


import com.screenscraper.common.*;

myDataSet = scrapeableFile.extractData(dataRecord.get("CATEGORYHEADER"), "2nd Main");

for (i = 0; i < myDataSet.getNumDataRecords(); i++)
{
    myDataRecord = myDataSet.getDataRecord(i);

    // Here is where you would do something with the results.
}

While in the pattern text of the Second Main I have the following:

~@COMMENTARY_DATE@~

~@COMMENTARY@~

But I still need to get the Category which is part of the CATEGORYHEADER token.

Still confused.

I made this scrape for you as

I made this scrape for you as an example. It goes to the Amazon reviews page for the PC version of the game Modern Warfare 2. One extractor gets the game name, and then it invokes a second extractor to get the comments and write them to the log. Just take this code, save it in a text file named "MW2 (scraping session).sss" and import it into screen-scraper. You should then see it working.

<?xml version="1.0" encoding="ISO-8859-1"?>

Scraping session: MW2 reviews
Scrapeable file: Reviews
URL: http://www.amazon.com/Call-Duty-Modern-Warfare-2-Pc/product-reviews/B00269QLJ2/ref=cm_cr_dp_all_summary?ie=UTF8&showViewpoints=1&sortBy=bySubmissionDateDescending
Timestamp: November 16, 2009 09:46:40

Extractor pattern "Product information":
<h1 class="sans-serif" style="margin:2px 0 0 0; font-size:140%;"><a href="~@URL@~">~@PRODUCT_NAME@~</a></h1>
~@COMMENTS@~
Recent discussions in the
Tokens: URL ([^"]*), COMMENTS, PRODUCT_NAME ([^<>]*)

Extractor pattern "Comments":
<div style="margin-bottom:0.5em;">~@HELPFUL@~ of ~@TOTAL@~ people found the following review helpful:</div>
~@DATARECORD@~
<table cellspacing="0" cellpadding="0" border="0">
Tokens: DATARECORD, TOTAL (\d*), HELPFUL (\d*)

Sub-extractor pattern:
Fun:</span><img src="~@JUNK@~" width="64" alt="~@FUN_RATING@~ out of 5 stars"
Tokens: FUN_RATING ([^<>]*), JUNK

Sub-extractor pattern:
&nbsp;</div> ~@DESCRIPTION@~ <div style="padding-top: 10px; clear: both; width: 100%;">
Tokens: DESCRIPTION

I can understand and get this

I can understand this and get it to work. My question is how to flatten the output file so that it shows the game alongside each comment, i.e.:


GAME, COMMENTS
Modern Warfare 2, Comment1
Modern Warfare 2, Comment2

P.S. Were you scraping

P.S. Were you scraping comments? I have some things to say about the PC version of MW2.

That is a problem I see often

That is a problem I see often too. The ideal solution would be to save to a relational database and, failing that, XML handles relational data well.

If you have to use a flat file, you can:
- Make a row for each comment, and repeat the game details on every row.
- Make a field with an alternate delimiter that holds all of the comments.
- Make a file with the game info and an ID, and a separate file with just the comments and a column showing which game ID each one refers to (like a faux database).
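
The three flat-file layouts above could be sketched roughly like this in plain Java. The class and method names (FlatFileSketch, csv, rowPerComment, singleRow, gameRow, commentRows) are hypothetical, and the quoting is only the usual convention of wrapping a field in double quotes and doubling any embedded quotes, so treat it as a starting point rather than a finished writer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlatFileSketch {
    // Quote a field so commas and quotes inside a comment don't break the row.
    static String csv(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Layout 1: one row per comment, with the game name repeated on every row.
    static List<String> rowPerComment(String game, List<String> comments) {
        List<String> lines = new ArrayList<>();
        lines.add("GAME,COMMENTS");
        for (String comment : comments) {
            lines.add(csv(game) + "," + csv(comment));
        }
        return lines;
    }

    // Layout 2: one row per game, all comments packed into a single field
    // and separated by an alternate delimiter such as "|".
    static String singleRow(String game, List<String> comments, String delimiter) {
        return csv(game) + "," + csv(String.join(delimiter, comments));
    }

    // Layout 3, first file: the game info keyed by an ID.
    static String gameRow(int gameId, String game) {
        return gameId + "," + csv(game);
    }

    // Layout 3, second file: just the comments, each pointing back at the game ID.
    static List<String> commentRows(int gameId, List<String> comments) {
        List<String> lines = new ArrayList<>();
        for (String comment : comments) {
            lines.add(gameId + "," + csv(comment));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> comments = Arrays.asList("Comment1", "Comment2");
        for (String line : rowPerComment("Modern Warfare 2", comments)) {
            System.out.println(line);
        }
        System.out.println(singleRow("Modern Warfare 2", comments, "|"));
        System.out.println(gameRow(1, "Modern Warfare 2"));
        for (String line : commentRows(1, comments)) {
            System.out.println(line);
        }
    }
}
```

For the GAME, COMMENTS output asked about earlier, the first layout is the one that matches; the other two only pay off when repeating the game details on every row gets unwieldy.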