extractData ???

I have a scraping session that uses a main extractor pattern to scrape topics. I then created a second extractor pattern that matches however many detail entries there are within the first dataSet. That works, but I need to get the topic alongside each set of detail variables.

For example:

The first Main Extractor has the following pattern text, and the second Main Extractor is called manually from a script.

1st MAIN:

2nd MAIN:

~@COMMENTARY_DATE@~

~@COMMENTARY@~

I want to get the Category from the data matched by the 1st Main pattern and include it in each row when writing out to a .csv file.

Like:
CATEGORY, COMMENTARY_DATE, COMMENTARY

The original HTML is as follows:

Support and Documentation - Positive

8/2009

The support people are very responsive to any issues we come across. The vendor is very proactive in communicating the changes we need to make.

2/2009

The customer service from HDX is great right now. HDX has made tremendous improvements, so their service is better and more responsive. As we make requests, they are doing a much better job in fulfilling them.

I would like the following retrieved from the above example.

CATEGORY                               COMMENTARY_DATE   COMMENTARY
Support and Documentation - Positive   8/2009            The support people are very responsive to any issues we come across.
Support and Documentation - Positive   2/2009            The customer service from HDX is great right now.

ms527 on 11/10/2009 at 11:19 am

screen-scraper public support

This will be hard without my being able to get to the page.

First, you need to expand your first main extractor to include one token that will encompass all of the comments, e.g.:

~@COMMENTS@~

The COMMENTS will get one segment of HTML that will hold all comments, no matter how many there are.
Then you will make a script that runs 2nd Main against the COMMENTS token:
`import com.screenscraper.common.*;`

myDataSet = scrapeableFile.extractData(dataRecord.get("COMMENTS"), "2nd Main");

// Loop over every record the "2nd Main" extractor pattern matched.
for (i = 0; i < myDataSet.getNumDataRecords(); i++)
{
    myDataRecord = myDataSet.getDataRecord(i);

    // Here is where you would do something with the results.
}
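To get the category onto every output row, one option is to read it once from the parent record and repeat it while looping over the child records. The following is a minimal sketch in plain Java, not screen-scraper's own API: ordinary lists and maps stand in for the DataSet and DataRecord objects, and the names `CategoryRows`, `flatten`, `record`, and `csvEscape` are hypothetical. Inside a real script, the values would presumably come from the records returned by `extractData` as shown above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CategoryRows {

    // Quote a field when it contains a comma, a double quote, or a line break.
    static String csvEscape(String value) {
        if (value == null) {
            return "";
        }
        if (value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Build one CSV line per child record, repeating the parent category on each line.
    static List<String> flatten(String category, List<Map<String, String>> records) {
        List<String> lines = new ArrayList<>();
        lines.add("CATEGORY,COMMENTARY_DATE,COMMENTARY");
        for (Map<String, String> rec : records) {
            lines.add(csvEscape(category) + ","
                    + csvEscape(rec.get("COMMENTARY_DATE")) + ","
                    + csvEscape(rec.get("COMMENTARY")));
        }
        return lines;
    }

    // Hypothetical stand-in for one record matched by the second extractor pattern.
    static Map<String, String> record(String date, String commentary) {
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("COMMENTARY_DATE", date);
        rec.put("COMMENTARY", commentary);
        return rec;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>();
        records.add(record("8/2009", "The support people are very responsive to any issues we come across."));
        records.add(record("2/2009", "The customer service from HDX is great right now."));
        for (String line : flatten("Support and Documentation - Positive", records)) {
            System.out.println(line);
        }
    }
}
```

The point of the sketch is only the shape of the loop: the category is fetched outside the loop, and each child record contributes one complete row.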

jason on 11/12/2009 at 9:45 am

The logic you have in the First Main does not make sense to me. I have the following:

In the Second Main I have the script that runs 2nd Main against the CATEGORYHEADER token:

`import com.screenscraper.common.*;`

myDataSet = scrapeableFile.extractData(dataRecord.get("CATEGORYHEADER"), "2nd Main");

for (i = 0; i < myDataSet.getNumDataRecords(); i++)
{
    myDataRecord = myDataSet.getDataRecord(i);

    // Here is where you would do something with the results.
}

While in the pattern text of the Second Main I have the following:

~@COMMENTARY_DATE@~

~@COMMENTARY@~

But I still need to get the Category which is part of the CATEGORYHEADER token.

Still confused.

ms527 on 11/13/2009 at 3:15 pm

I made this scrape for you as an example. It goes to the Amazon reviews page for the PC version of the game Modern Warfare 2. One extractor gets the game name, and it then invokes a second extractor to get the comments and write them to the log. Take this code, save it in a text file named "MW2 (scraping session).sss", and import it into screen-scraper. You should then see it working.
Scraping session: MW2 reviews (November 16, 2009 09:46:40)

URL: http://www.amazon.com/Call-Duty-Modern-Warfare-2-Pc/product-reviews/B00269QLJ2/ref=cm_cr_dp_all_summary?ie=UTF8&showViewpoints=1&sortBy=bySubmissionDateDescending

Reviews

Extractor pattern: Product information
<h1 class="sans-serif" style="margin:2px 0 0 0; font-size:140%;"><a href="~@URL@~">~@PRODUCT_NAME@~</a></h1>
~@COMMENTS@~
Recent discussions in the
Tokens: URL = [^"]*, COMMENTS, PRODUCT_NAME = [^<>]*

Extractor pattern: Comments
<div style="margin-bottom:0.5em;">~@HELPFUL@~ of ~@TOTAL@~ people found the following review helpful:</div>
~@DATARECORD@~
<table cellspacing="0" cellpadding="0" border="0">
Tokens: DATARECORD, TOTAL = \d*, HELPFUL = \d*

Fun:</span><img src="~@JUNK@~" width="64" alt="~@FUN_RATING@~ out of 5 stars"
Tokens: FUN_RATING = [^<>]*, JUNK

 </div> ~@DESCRIPTION@~ <div style="padding-top: 10px; clear: both; width: 100%;">
Tokens: DESCRIPTION

jason on 11/16/2009 at 10:49 am

I can understand this and get it to work. My question is how to flatten the output file so that it shows the game alongside each comment, i.e.:

GAME, COMMENTS
Modern Warfare 2, Comment1
Modern Warfare 2, Comment2

ms527 on 12/16/2009 at 10:26 am

P.S. Were you scraping comments? I have some things to say about the PC version of MW2.

jason on 12/16/2009 at 6:00 pm

That is a problem I see often too. The ideal solution would be to save to a relational database; barring that, XML handles relational data well.

If you have to use a flat file, you can:
- Make a row for each comment, and repeat the game details on every row.
- Make a field with an alternate delimiter that holds all of the comments.
- Make one file with the game info and an ID, and a separate file with just the comments plus a column giving the game ID each one refers to (like a faux database).
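The third option above can be sketched roughly as follows. This is plain Java under stated assumptions rather than anything from screen-scraper: the class `FlatFileSplit`, its methods, and the `GAME_ID` column name are all hypothetical, and the lines are collected in memory instead of being written to disk so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

class FlatFileSplit {

    // Lines destined for the parent file (one row per game).
    final List<String> gameLines = new ArrayList<>();
    // Lines destined for the child file (one row per comment, keyed by game ID).
    final List<String> commentLines = new ArrayList<>();
    private int nextGameId = 1;

    FlatFileSplit() {
        gameLines.add("GAME_ID,GAME");
        commentLines.add("GAME_ID,COMMENT");
    }

    // Always quote text fields so embedded commas and quotes survive.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Register a game and return the ID its comments should reference.
    int addGame(String name) {
        int id = nextGameId++;
        gameLines.add(id + "," + quote(name));
        return id;
    }

    // Add one comment row that points back at its game by ID.
    void addComment(int gameId, String comment) {
        commentLines.add(gameId + "," + quote(comment));
    }

    public static void main(String[] args) {
        FlatFileSplit out = new FlatFileSplit();
        int id = out.addGame("Modern Warfare 2");
        out.addComment(id, "Comment1");
        out.addComment(id, "Comment2");
        out.gameLines.forEach(System.out::println);
        out.commentLines.forEach(System.out::println);
    }
}
```

Each list could then be written to its own file. The game details appear only once, at the cost of needing a join on `GAME_ID` when the data is read back.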

jason on 12/16/2009 at 5:59 pm
