Three variables read from three text files to generate URLs?
I'm trying to read three sets of variables from three .txt files, in order to generate the URLs for scraping.
So, for instance, I have three text files:
1) a list of domains
2) a list of dates (yyyy-mm-dd format)
3) a list of search types (a and n are the available types that I currently use)
I want to be able to create a URL that will incorporate each domain for each of the dates and search types.
Here's a sample URL:
http://sitetoscrape.com/detail/?ns={DOMAIN}&date={DATE}&net=9&changes=15&act={SEARCH TYPE}
So, I'd want to create URLs like these:
http://sitetoscrape.com/detail/?ns=EXAMPLE.COM&date=2006-06-10&net=9&changes=15&act=a
http://sitetoscrape.com/detail/?ns=EXAMPLE.COM&date=2006-06-10&net=9&changes=15&act=n
http://sitetoscrape.com/detail/?ns=EXAMPLE.COM&date=2006-06-11&net=9&changes=15&act=a
http://sitetoscrape.com/detail/?ns=EXAMPLE.COM&date=2006-06-11&net=9&changes=15&act=n
http://sitetoscrape.com/detail/?ns=EXAMPLE.COM&date=2006-06-12&net=9&changes=15&act=a
http://sitetoscrape.com/detail/?ns=EXAMPLE.COM&date=2006-06-12&net=9&changes=15&act=n
http://sitetoscrape.com/detail/?ns=EXAMPLE2.COM&date=2006-06-10&net=9&changes=15&act=a
http://sitetoscrape.com/detail/?ns=EXAMPLE2.COM&date=2006-06-10&net=9&changes=15&act=n
http://sitetoscrape.com/detail/?ns=EXAMPLE2.COM&date=2006-06-11&net=9&changes=15&act=a
http://sitetoscrape.com/detail/?ns=EXAMPLE2.COM&date=2006-06-11&net=9&changes=15&act=n
http://sitetoscrape.com/detail/?ns=EXAMPLE2.COM&date=2006-06-12&net=9&changes=15&act=a
http://sitetoscrape.com/detail/?ns=EXAMPLE2.COM&date=2006-06-12&net=9&changes=15&act=n
etc., etc...
Here's the code I've got so far, but it doesn't work right:
File inputFile = new File( "search_domains.txt" );
File inputFile2 = new File( "search_dates.txt" );
File inputFile3 = new File( "search_terms.txt" );

// These objects are needed to read the files.
FileReader in = new FileReader( inputFile );
FileReader in2 = new FileReader( inputFile2 );
FileReader in3 = new FileReader( inputFile3 );
BufferedReader buffRead = new BufferedReader( in );
BufferedReader buffRead2 = new BufferedReader( in2 );
BufferedReader buffRead3 = new BufferedReader( in3 );

// Read the files line by line. Each line in the text files should contain one search term.
while( ( searchDomain = buffRead.readLine() ) != null )
{
    session.pause( 5000 );
    while( ( searchDate = buffRead2.readLine() ) != null )
    {
        while( ( searchTerm = buffRead3.readLine() ) != null )
        {
            session.setVariable( "sDOMAIN", searchDomain ); // Set a session variable corresponding to the domain.
            session.setVariable( "sDATE", searchDate );     // Set a session variable corresponding to the date.
            session.setVariable( "sTERM", searchTerm );     // Set a session variable corresponding to the search type [l|a|n|d].
            session.scrapeFile( "Search results" );         // Get search results for this particular search term.
        }
    }
}

// Close up the objects to indicate we're done reading the files.
buffRead.close();
buffRead2.close();
buffRead3.close();
in.close();
in2.close();
in3.close();
Any ideas on how to get it working? I know it has something to do with the looping, but I'm very weak on Java, so I'm not sure how to fix it.
Great news! Good luck on the rest of your project.
Kind regards,
Todd
Ok, I sat down and drew out exactly what the flow should be, and came up with this code, which (surprisingly) works great...
// Create file objects that will point to the files containing the search values.
File inputFile = new File( "search_domains.txt" );
File inputFile2 = new File( "search_dates.txt" );
File inputFile3 = new File( "search_types.txt" );

// These objects are needed to read the files.
FileReader in = new FileReader( inputFile );
FileReader in2 = new FileReader( inputFile2 );
FileReader in3 = new FileReader( inputFile3 );
BufferedReader buffRead = new BufferedReader( in );
BufferedReader buffRead2 = new BufferedReader( in2 );
BufferedReader buffRead3 = new BufferedReader( in3 );

ArrayList arrsearchDomain = new ArrayList();
ArrayList arrsearchDate = new ArrayList();
ArrayList arrsearchType = new ArrayList();

Loop1Counter = 0;
Loop2Counter = 0;
Loop3Counter = 0;

// Read the files line by line, storing each line in the matching list.
while( ( searchDomain = buffRead.readLine() ) != null )
{
    arrsearchDomain.add( searchDomain );
    LatestDomain = arrsearchDomain.get( Loop1Counter );
    session.log( "Domain added: " + LatestDomain );
    Loop1Counter++;
}
while( ( searchDate = buffRead2.readLine() ) != null )
{
    arrsearchDate.add( searchDate );
    LatestDate = arrsearchDate.get( Loop2Counter );
    session.log( "Date added: " + LatestDate );
    Loop2Counter++;
}
while( ( searchType = buffRead3.readLine() ) != null )
{
    arrsearchType.add( searchType );
    LatestType = arrsearchType.get( Loop3Counter );
    session.log( "Search Type added: " + LatestType );
    Loop3Counter++;
}

arrsearchDomainSize = arrsearchDomain.size();
arrsearchDateSize = arrsearchDate.size();
arrsearchTypeSize = arrsearchType.size();
session.log( "Number of Domains: " + arrsearchDomainSize );
session.log( "Number of Dates: " + arrsearchDateSize );
session.log( "Number of Search Types: " + arrsearchTypeSize );

for( i = 0; i < arrsearchDomainSize; i++ )
{
    LatestDomain = arrsearchDomain.get( i );
    session.setVariable( "sDOMAIN", LatestDomain ); // Set a session variable corresponding to the domain.
    for( ii = 0; ii < arrsearchDateSize; ii++ )
    {
        LatestDate = arrsearchDate.get( ii );
        session.setVariable( "sDATE", LatestDate ); // Set a session variable corresponding to the date.
        for( iii = 0; iii < arrsearchTypeSize; iii++ )
        {
            LatestType = arrsearchType.get( iii );
            session.setVariable( "sTERM", LatestType ); // Set a session variable corresponding to the search type [l|a|n|d].
            session.pause( 5000 );
            session.scrapeFile( "Search results" ); // Get search results for this combination.
            // session.log( Integer.toString( i ) );
            // session.log( session.getVariable( "sDOMAIN" ) );
            // session.log( session.getVariable( "sDATE" ) );
            // session.log( session.getVariable( "sTERM" ) );
        }
    }
}

// Close up the objects to indicate we're done reading the files.
buffRead.close();
buffRead2.close();
buffRead3.close();
in.close();
in2.close();
in3.close();
I still have a bit of cleaning up to do on the code, but it's working, so all is good.
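For the cleanup, I'm thinking of pulling the repeated read-the-file-into-a-list code out into a small helper. Here's a rough, standalone plain-Java sketch of the idea (the helper name readLines is just mine, and this isn't the actual screen-scraper script; it writes its own sample file so it can run on its own):

```java
import java.io.*;
import java.util.*;

public class ReadLinesDemo {
    // Read every line of a text file into a list, skipping blank lines.
    static List<String> readLines(File file) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().length() > 0) {
                    lines.add(line.trim());
                }
            }
        } finally {
            reader.close();
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file so the demo is self-contained.
        File sample = File.createTempFile("search_domains", ".txt");
        PrintWriter out = new PrintWriter(new FileWriter(sample));
        out.println("EXAMPLE.COM");
        out.println("EXAMPLE2.COM");
        out.close();

        List<String> domains = readLines(sample);
        System.out.println(domains.size()); // prints 2
        sample.delete();
    }
}
```

That would collapse the three nearly identical while loops down to three one-line calls.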
Nope, not working quite right yet... it ends after the first domain in the list, i.e., after the first entry in search_domains.txt.
But aside from that, it's working exactly as I want it to. I'm sure it's a simple matter to keep the loop going for all the domains in the list... perhaps some sort of jump in the code to a separate function that calls session.scrapeFile after each new URL is created?
Unfortunately, I'm still not sure how to do that...
Would this do the trick?
// Create a file object that will point to the file containing the search terms.
File inputFile = new File( "search_domains.txt" );
File inputFile2 = new File( "search_dates.txt" );
String[] searchTerms = { "l", "a", "n", "d" };

// These objects are needed to read the files.
FileReader in = new FileReader( inputFile );
FileReader in2 = new FileReader( inputFile2 );
BufferedReader buffRead = new BufferedReader( in );
BufferedReader buffRead2 = new BufferedReader( in2 );

// Read the files line by line. Each line in the text files should contain one search term.
while( ( searchDomain = buffRead.readLine() ) != null )
{
    session.pause( 5000 );
    while( ( searchDate = buffRead2.readLine() ) != null )
    {
        for( i = 0; i < searchTerms.length; i++ )
        {
            searchTerm = searchTerms[i];
            session.setVariable( "sDOMAIN", searchDomain ); // Set a session variable corresponding to the domain.
            session.setVariable( "sDATE", searchDate );     // Set a session variable corresponding to the date.
            session.setVariable( "sTERM", searchTerm );     // Set a session variable corresponding to the search type [l|a|n|d].
            session.scrapeFile( "Search results" );         // Get search results for this particular search term.
        }
    }
}

// Close up the objects to indicate we're done reading the files.
buffRead.close();
buffRead2.close();
in.close();
in2.close();
{bump}
Anyone have any ideas on how to get the looping to work? I've exhausted all of my own ideas, so I'm up against the wall here.
Ah, I see what's happening... after it reads the two search types (a and n) from the file, it thinks its job is done, when in fact I want it to loop back and start over on that file. So it should alternate between the search types (a, n, a, n, and so on).
The problem is, I can't just hard-code it to alternate between those two, because there are actually four search types (l, a, n, d) available, and I might want to use the other search types in the future. The easiest way to handle that is to simply put the search types to be used into a .txt file.
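In plain-Java terms, what I'm after is the full cross product of the three lists: every domain paired with every date and every search type. A quick standalone sketch of that idea (the class name is just for illustration, and the sample values stand in for the contents of the three .txt files):

```java
import java.util.*;

public class UrlCrossProduct {
    public static void main(String[] args) {
        // Sample data standing in for the three .txt files.
        List<String> domains = Arrays.asList("EXAMPLE.COM", "EXAMPLE2.COM");
        List<String> dates = Arrays.asList("2006-06-10", "2006-06-11");
        List<String> types = Arrays.asList("a", "n");

        // Three nested loops produce every combination exactly once.
        List<String> urls = new ArrayList<String>();
        for (String domain : domains) {
            for (String date : dates) {
                for (String type : types) {
                    urls.add("http://sitetoscrape.com/detail/?ns=" + domain
                            + "&date=" + date + "&net=9&changes=15&act=" + type);
                }
            }
        }
        System.out.println(urls.size()); // prints 8, i.e. 2 * 2 * 2
    }
}
```

Because the lists are read fully into memory first, the inner loops can be traversed over and over; that's what the nested BufferedReader loops can't do.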
Well, the code *seems* to work, but it stops after only the first two URLs.
Here's the log file:
Starting scraper.
Running scraping session: Test
Processing scripts before scraping session begins.
Processing script: "Read Search Terms"
Scraping file: "Search results"
Search results: Preliminary URL: http://sitetoscrape.com/detail/?ns=~#sDOMAIN#~&date=~#sDATE#~&net=9&changes=15&act=~#sTERM#~
Search results: Resolved URL: http://sitetoscrape.com/detail/?ns=EXAMPLE.COM&date=2006-06-10&net=9&changes=15&act=a
Search results: Sending request.
Search results: Processing scripts before all pattern applications.
Search results: Extracting data for pattern "FOUNDDOMAIN"
Search results: The pattern did not find any matches.
Search results: Processing scripts after all pattern applications.
Search results: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Scraping file: "Search results"
Search results: Preliminary URL: http://sitetoscrape.com/detail/?ns=~#sDOMAIN#~&date=~#sDATE#~&net=9&changes=15&act=~#sTERM#~
Search results: Resolved URL: http://sitetoscrape.com/detail/?ns=EXAMPLE.COM&date=2006-06-10&net=9&changes=15&act=n
Search results: Sending request.
Search results: Processing scripts before all pattern applications.
Search results: Extracting data for pattern "FOUNDDOMAIN"
Search results: The pattern did not find any matches.
Search results: Processing scripts after all pattern applications.
Search results: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Processing scripts after scraping session has ended.
Scraping session finished.
Notice that it only ran through the first two URLs, meaning it stops after looping through the two search types (a and n) and never loops back up to the outer while loops.
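If I understand BufferedReader correctly, that would explain it: once readLine() reaches the end of the input, it just keeps returning null, so a nested while loop over the same reader never runs again on later passes. A quick standalone test of that idea (using a StringReader in place of the file):

```java
import java.io.*;

public class ReaderExhaustDemo {
    public static void main(String[] args) throws IOException {
        // Two lines of input, like a search_types.txt containing "a" and "n".
        BufferedReader reader = new BufferedReader(new StringReader("a\nn"));

        // First pass: reads both lines.
        int firstPass = 0;
        while (reader.readLine() != null) {
            firstPass++;
        }

        // Second pass over the same reader reads nothing: the stream
        // stays at end-of-file, so readLine() keeps returning null.
        int secondPass = 0;
        while (reader.readLine() != null) {
            secondPass++;
        }
        System.out.println(firstPass + " " + secondPass); // prints 2 0
    }
}
```

So reading every file into a list up front, and looping over the lists instead of the readers, should sidestep the problem.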
Hi,
Your code looks okay to me. Could you give more detail on what isn't working?
Also, it might help to insert some session.log statements to ensure that the values are getting set as you think they should be. For example, you might insert a line in your loop after you set the value for sDOMAIN, like this:
session.log( "sDOMAIN: " + session.getVariable( "sDOMAIN" ) );
Kind regards,
Todd Wilson