multi-threading questions
I'm writing a project to scrape some very large forums.
I have one scraping session which collects all the config data and pretty much sets everything up for the main scraping exercise which is the posts.
I want to scrape the posts in several threads at once. The number of threads will be set by a session var in the init script (along with lots of other parameters) and basically I'm just planning to use an iterative loop to check if each thread is finished then spawn another one if it is...
I'm assuming I will be able to use an array of RunnableScrapingSessions. This will allow me to reference them with a variable name. Can you suggest a better data type for handling this?
Second question... because there is so much config info already saved in session vars, it would be tidy if I could just pass the entire set of session vars to each thread. I assumed that if they were passed on using the session inheritance in the RunnableScrapingSession constructor they would be in their own child scope, but I've experimented and that's not the case. If I change a session var in the child thread it's changed in the parent thread as well. As you can imagine, with multiple asynchronous threads all accessing the same set of session vars it will create havoc.
The only way I can see to get around it is to not pass the session vars to the child thread and use session.saveVariables(), session.loadVariables(). Is there a neater way to do this without having to use files?
Third question... I have to use lazyScrape to get multiple threads running concurrently... The only way I can find to actually monitor the child threads to find out when they are finished (so my loop doesn't spawn thousands of threads at once) is to set a session variable in the child thread as a flag and read it from the parent thread. Is there a better way to do this? I originally thought that the "session" object was actually an instance of RunnableScrapingSession but it doesn't seem to be, so I can't use the isFinished() method. I guess my concern is that if there's a bug in the child thread and it crashes before the 'isFinished' flag is set, then the parent thread will never release the thread and move on to the next one...
Speaking of the loadVariables() method... after a fair bit of hair-pulling I've just worked out it doesn't handle null variables too well.
session.setVariable("FILE_SUFFIX","");
translates into:
FILE_SUFFIX=
which causes:
The error message was: The application script threw an exception: java.lang.ArrayIndexOutOfBoundsException: 1 BSF info: null at line: 0 column: columnNo
Deleting that line from the saved variables file gets rid of the error.
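For what it's worth, "ArrayIndexOutOfBoundsException: 1" is what you would see if a loader split each saved line on "=" and then read the second element. That is only a guess at the cause, since the implementation of loadVariables() isn't shown here; the class and method names below are made up purely for illustration. In Java, String.split drops trailing empty strings, so "FILE_SUFFIX=" yields a one-element array:

```java
import java.util.Arrays;

// Hypothetical illustration only: this is NOT screen-scraper's loadVariables() code.
class SavedVariableLine {

    // Naive parse: assumes there is always something after the '='.
    static String naiveValue(String line) {
        String[] parts = line.split("=");
        return parts[1]; // throws ArrayIndexOutOfBoundsException for "FILE_SUFFIX="
    }

    // Tolerant parse: a limit of 2 keeps the trailing empty string.
    static String tolerantValue(String line) {
        String[] parts = line.split("=", 2);
        return parts.length > 1 ? parts[1] : "";
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("FILE_SUFFIX=".split("=")));    // [FILE_SUFFIX]
        System.out.println(Arrays.toString("FILE_SUFFIX=".split("=", 2))); // [FILE_SUFFIX, ]
        System.out.println("[" + tolerantValue("FILE_SUFFIX=") + "]");     // []
    }
}
```

If that is indeed what is happening, a workaround on the script side would be to avoid saving empty-string variables, for example by storing a placeholder value and translating it back after loading.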
turns out the third point (monitoring thread status) is a bigger deal than I thought... when I run it slowly with breakpoints interspersed it's ok but if I take them out and run at full speed with 5 threads (10 max in the workbench) I get very random results. Each thread appends to a different csv file. I'm iterating through a list of records so in theory I should get:
thread 1:
records 0-4
thread 2:
records 5-9
thread 3:
records 10-14
thread 4:
records 15-19
thread 5:
records 20-24
then back around to
thread 1:
records 25-29
and so on...
When I run it without breakpoints some of the files appear, some don't, some update with gaps.... i.e. thread might write records 0-4 then skip 25-29 then maybe write the next cycle...
I don't think it's a logic problem as I get completely different results if I run it twice in a row without changing anything...
I suspected the filewriter may be taking a while to finish and when the next thread comes around it can't open the file because it's still in use... either that or the threads themselves aren't closing in time and I'm running out of available threads... though neither of those options explains the fact that some of the threads don't write their files in the initial cycle when there shouldn't be any files open or any previous threads running...
Basically I'm setting a session variable on the child thread: childThread[i].setVariable("THREAD_COMPLETE",0);
and then in the very last line of the script I set it to 1 and use that var to test if the thread is complete or not. I've used the session log to confirm the variable is actually being changed...
Well, there's a lot to take in and consider here..
Let's start with the multiple threads firing at once:
The workbench isn't really designed to handle multiple instances of the same scrape. You will in fact find some erratic behavior this way. Your idea about the filewriters getting mixed up may also have something to do with it. I wouldn't worry about collision of 2 writes happening at the same time, but if one scrape isn't closing the file fast enough between writes, I could definitely see some weirdness popping up.
Hmm... this is kind of complicated, and the smoothest way I can think to really do what you want is by a mini database. I'm going to explain something that we do here for our own scrapes, and I'll try to connect that to what you're trying to accomplish.
We scrape lots of car, real estate, and insurance sites for clients, and often we have to iterate their sites by zip code. Well, there are 42,000+ zip codes in the US, so that takes a long time. And what if the scrape or computer or server dies only part way through? We have no choice but to restart the scrape at the first zip code. Starting it part way through a list of zip codes takes a file to track it, and requires the scrape to write out its current index in the list.
But we decided to avoid that approach. We wanted to be able to just start up any variable number of scrapes from the server mode (web interface or not, doesn't matter), and we wanted each scrape to simply take the next zip code in our big list. When it was done with that single zip code, it spawns a new dynamic, lazy scraping session (which in turn will dynamically take the next zip in the list all by itself), and then the parent scrape finishes and goes away. So this way we run 42,000 scrapes, each with only one zip. Aside from simple memory advantages, we can choose exactly how many are running, and they could all be running from multiple computers, anywhere, so long as they have access to the list of zip codes.
Best way to do that? A database. MySQL is really simple, and free, so we chose that. This is what we did:
The PHP just receives a GET/POST parameter "scrape ID" which can simply be the session's name.
This PHP file connects to the MySQL database, which has a few simple tables, and echoes back the zip code that the scrape should do. It's then up to the scrape to contact this PHP file (scrapeableFile is easiest, since you have extractor patterns at your disposal) and read the response text (the echoed zip code).
Table scrapeIDs, columns:
    scrapeID VARCHAR(50) PRIMARY KEY
    offset INT(11) NOT NULL
Table zipcodes, columns:
    zipcode VARCHAR(5) PRIMARY KEY
The PHP above simply runs a query like this:
SELECT offset FROM scrapeIDs WHERE scrapeID = _utf8'$scrapeName'
(where $scrapeName is the GET/POST parameter received)
From there, the PHP can do another lookup:
SELECT zipcode FROM zipcodes ORDER BY zipcode LIMIT 1 OFFSET $offset
(where $offset is the offset retrieved from the first query)
And there you have the zip code.
The PHP should then increment that offset value, and save it back into the database for that scrape only:
UPDATE scrapeIDs SET offset=$newOffset WHERE scrapeID = _utf8'$scrapeName'
(where $newOffset is the old $offset + 1)
So, even if you don't have any experience in PHP or MySQL, this isn't a tough design. The point is that the "iterator" is kept in the database itself, and is free to iterate over any data that you have in the database. (There should be an extra tiny bit of error handling when you hit the end of the list. Make the PHP restart the offset at 0, or just not echo anything back so that your scrapes stop.)
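To make the idea concrete without PHP or MySQL, here is a small in-memory Java sketch of the same "iterator kept in one shared place" pattern. Everything in it (the NextItemDispenser class, its next() method, the sample zip codes) is hypothetical; in the real setup the offset lives in the scrapeIDs table and the database does the bookkeeping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the PHP + MySQL pair: hands out one item per request.
class NextItemDispenser {
    private final List<String> items;
    private final AtomicInteger offset = new AtomicInteger(0);

    NextItemDispenser(List<String> items) {
        this.items = new ArrayList<>(items); // private copy, so callers cannot change the list later
    }

    // Reads the current offset and increments it in one atomic step,
    // so two scrapes asking at the same time never get the same item.
    Optional<String> next() {
        int index = offset.getAndIncrement();
        return index < items.size() ? Optional.of(items.get(index)) : Optional.empty();
    }

    public static void main(String[] args) {
        NextItemDispenser zips = new NextItemDispenser(Arrays.asList("84101", "84102", "84103"));
        Optional<String> zip;
        while ((zip = zips.next()).isPresent()) {
            System.out.println("scrape zip " + zip.get());
        }
        // An empty result is the "end of list" signal that tells the scrapes to stop.
    }
}
```

One design point the sketch highlights: the read and the increment need to happen as a single step. With separate SELECT and UPDATE statements, two scrapes that call the PHP at the same moment could be handed the same offset, so a transaction or a single atomic update is worth considering if many scrapes run at once.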
So for your situation, I think this would be ideal, in reply to your comment: "I'm assuming I will be able to use an array of RunnableScrapingSessions. This will allow me to reference them with a variable name. Can you suggest a better data type for handling this?"
Although it's a completely different approach, it would really simplify this task, especially if you are scraping multiple forums.
You could further simplify our design to suit your own needs: One table called "forumScrapes", which could have one or more entries according to your scrape names. Instead of the "offset" column referring to another database table's offset, you could just use that offset in your scrape itself.
Quick example of some control script used in your scrape:
// This scrapeableFile can easily save the echo'd offset from the PHP into a session variable.
session.scrapeFile("Get next offset");
// Session variables come back as Objects, so convert to a String before parsing.
int start = Integer.parseInt(session.getVariable("OFFSET").toString());
// since that's what you used in your own examples
int step = 5;
int stop = start + step;
for (int i = start; i < stop; i++)
{
    session.scrapeFile("Do the real work");
}

// Then, at the end of the scrape, hand off to the next session:
import com.screenscraper.scraper.RunnableScrapingSession;
// Note that I'm not using the "inheriting" version of this constructor. As you mentioned, if you alter one variable in the child, you alter one in the parent-- Not entirely thread safe if you're not expecting it.
RunnableScrapingSession nextScrape = new RunnableScrapingSession(session.getName());
// Lazy, as described above, so this scrape can finish while the next one runs.
nextScrape.setDoLazyScrape(true);
// Set up any variables you'd like to give to the new scrape.
String[] variables = new String[]{
    "var1", "var2", "var3"
};
for (int i = 0; i < variables.length; i++)
    nextScrape.setVariable(variables[i], session.getVariable(variables[i]));
nextScrape.scrape();
I went skiing today and really whacked my head pretty good .. having trouble thinking much more than this, currently. Tell me what you think, and if you think you could fit this model to your situation. If you need help with any PHP or MySQL, we do plenty of that as well, so you can ask away. This all really comes down to the difficulty with having a parent scrape keep track of its children. Honestly, you could go down that road, but I think this little database approach is good, since it has the ability to resume at any time, and to run as many of them concurrently as you can (limit imposed by license of the product or by computer power or by bandwidth).
Let me know how this is sounding, and we'll keep working from there.
Tim
ok I think I see what you're getting at... actually it's kind of similar to what I was planning to implement in the parent iterator thread. Currently I'm just using a DataSet in place of a DB but I was intending to replace the DataSet with a MySQL table shortly so I could actually resume later on.
I think the fundamental difference between the two approaches is actually the control structure.
The way I'm trying to do it is parent thread spawns x (call it 5) child sessions and passes them the list of data to work on (whether from a DB/file/DataSet).
In your approach the child sessions get the list of data direct from the DB rather than having it passed from the parent.
The other main difference is in my approach the parent thread monitors the child threads then respawns when they are finished.
In your approach the parent spawns the initial 5 threads then the child threads run and spawn a new child just before they finish. The parent isn't really involved once the first generation of children have been started. They've kind of left the nest to have babies of their own... Is that correct?
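The "children spawn their own successors" control structure can be sketched in plain Java as follows. This is a generic java.util.concurrent illustration, not screen-scraper API: ChainedWorkers, runAll and step are hypothetical names, and the string added to the results is a placeholder for real scraping work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: each worker handles one item, then starts its own successor.
class ChainedWorkers {

    static List<String> runAll(List<String> work, int initialWorkers) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(work);
        Queue<String> done = new ConcurrentLinkedQueue<>();
        CountDownLatch remaining = new CountDownLatch(work.size());
        ExecutorService pool = Executors.newCachedThreadPool();

        // The parent only starts the first generation, then steps back.
        for (int i = 0; i < initialWorkers; i++) {
            pool.execute(() -> step(pending, done, remaining, pool));
        }
        try {
            remaining.await(); // wait until every item has been handled
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(done);
    }

    private static void step(Queue<String> pending, Queue<String> done,
                             CountDownLatch remaining, ExecutorService pool) {
        String item = pending.poll(); // take the next item, or null if none are left
        if (item == null) {
            return; // end of the list: this chain simply stops
        }
        try {
            done.add("scraped " + item); // placeholder for the real work
        } finally {
            // Start the successor first, then report this item as handled,
            // so nothing is submitted after the parent shuts the pool down.
            pool.execute(() -> step(pending, done, remaining, pool));
            remaining.countDown();
        }
    }

    public static void main(String[] args) {
        List<String> results = runAll(Arrays.asList("a", "b", "c", "d", "e", "f", "g"), 3);
        System.out.println(results.size() + " items handled");
    }
}
```

The number of chains stays at roughly the size of the first generation, because each worker starts exactly one successor and a chain ends when the list runs dry. The order in which items complete is not deterministic.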
To clarify the point of using multiple threads at once is twofold... 1/ to make better use of the available bandwidth to speed things up... 2/ so I can come from different IPs at the same time.
You're right in your assertions about the differences in the approaches.
Ultimately, either way will accomplish the multi-threaded bit, since with your approach, you can have the parent monitoring the variable number of children, while the one I presented has the multiple generations chaining off of each other. If you've got an anonymizing solution in place, then yes, your child threads will be able to come from different IP addresses. Under the solution I presented, the different IPs are built into the fact that you can run from multiple computers (and you could add in more anonymization if desired).
So if you've got that end under control, then really it's just a matter of this whole variables-by-reference issue you were talking about below.
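On the parent-monitoring side, one plain-Java way to avoid relying on a flag that the child sets on its last line is to let a thread pool track completion. The sketch below is generic java.util.concurrent code with hypothetical names (ParentMonitor, runBatches); how it would map onto RunnableScrapingSession is not something this thread establishes. The useful property is that Future.get() returns, or throws, as soon as the task ends, whether it finished normally or crashed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a parent runs batches on a fixed number of threads.
class ParentMonitor {

    // Returns how many batches failed; the pool size caps the concurrency.
    static int runBatches(List<Runnable> batches, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        List<Future<?>> futures = new ArrayList<>();
        for (Runnable batch : batches) {
            futures.add(pool.submit(batch)); // queued until one of the threads is free
        }
        pool.shutdown(); // accept no new work; queued batches still run

        int failed = 0;
        for (Future<?> future : futures) {
            try {
                future.get(); // returns once the batch has ended
            } catch (ExecutionException e) {
                failed++;     // the batch threw, but its thread was still released
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Runnable> batches = new ArrayList<>();
        batches.add(() -> System.out.println("batch 0 ok"));
        batches.add(() -> { throw new IllegalStateException("batch 1 crashed"); });
        batches.add(() -> System.out.println("batch 2 ok"));
        System.out.println("failed batches: " + runBatches(batches, 2));
    }
}
```

With this shape there is no polling loop and no completion variable to forget: a crashed batch shows up as a failure instead of a thread that never gets released.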
Hi Tim,
Not bad for someone with concussion! :)
I'll have to reread this a couple of times... just for your reference I thought I'd post the code I've got so far... This is obviously run after the userList is generated...
masterUserList = session.getVariable("USERLIST");
tNum = session.getVariable("MAX_CONCURRENT_THREADS").intValue();
com.screenscraper.scraper.RunnableScrapingSession[] thread = new com.screenscraper.scraper.RunnableScrapingSession[tNum];
String[] sessionVarFile = new String[tNum];
//initialise RunnableScrapingSessions
for (i = 0; i < tNum; i++)
{
    sessionVarFile[i] = session.getVariable("OUTPUT_PATH") + "temp/sVars" + i + ".txt";
    thread[i] = new com.screenscraper.scraper.RunnableScrapingSession("fe - a1 topic list");
    thread[i].setVariable("THREAD_COMPLETE",1);
    session.log("init index: " + i);
}
for (i = 0; i < masterUserList.getNumDataRecords();) // iterate through userList
{
    session.log("---------------------------- new cycle -------- records in userList: " + masterUserList.getNumDataRecords());
    session.log("i index: " + i);
    session.log("\t\tj index: " + j);
    DataSet userList = new DataSet();
    noFreeThreadFound = true;
    waitCount = 0;
    while (noFreeThreadFound)
    {
        if (!session.isRunning())
            break;
        for (j = 0; j < tNum && i < masterUserList.getNumDataRecords(); j++) // iterate through threads and check if still busy
        {
            if (thread[j].getVariable("THREAD_COMPLETE")==1) // if a thread is free load the userList dataset and launch new thread.
            {
                session.pause(2000);
                for (k = 0; k < session.getVariable("MAX_RECORDS_PER_THREAD").intValue()
                        && i < masterUserList.getNumDataRecords(); k++) // collect user records for thread
                {
                    session.log("\t\t\t\tk index: " + k + " --- i = " + i + " ---- user_id = " + masterUserList.get(i,"USER_ID"));
                    userList.addDataRecord(masterUserList.getDataRecord(i));
                    i++;
                }
                noFreeThreadFound = false;
                thread[j].setVariable("THREAD_COMPLETE",0);
                session.saveVariables(sessionVarFile[j]);
                thread[j] = new com.screenscraper.scraper.RunnableScrapingSession("fe - a2 user details");
                thread[j].setDoLazyScrape(true);
                thread[j].setVariable("SESSION_VAR_FILE",sessionVarFile[j]);
                thread[j].setVariable("USERLIST",userList);
                thread[j].setVariable("THREAD_ID",j);
                thread[j].setVariable("THREAD_STATUS","BEGIN");
                thread[j].setVariable("THREAD_COMPLETE",0);
                session.log("thread id: [" + j + "] userList records: " + userList.getNumDataRecords()
                        + " first user_ID: " + userList.get(0,"USER_ID"));
                thread[j].scrape();
                session.log("*********************************: " + userList.get(0,"USER_ID"));
                userList.clearDataRecords();
            }
            else
            {
                //session.log("Thread[" + j + "] status: " + thread[j].getVariable("THREAD_STATUS"));
            }
        }
        waitCount++;
    }
    session.log("waitCount: " + waitCount);
}
currently all the "fe - a2 user details" session does is run the following script:
dumpFile = dumpFile + ".csv";
DataSet userList = session.getVariable("USERLIST");
//session.log("thread id: [" + session.getVariable("THREAD_ID") + "] userList records: " + userList.getNumDataRecords()
// + " first user_ID: " + userList.get(0,"USER_ID"));
userList.writeToFile(dumpFile);
session.setVariable("THREAD_STATUS","WAITING");
session.setVariable("THREAD_STATUS","FINISHED");
//if (session.getVariable("THREAD_ID")!=1)
session.setVariable("THREAD_COMPLETE",1);
session.pause(0);
I'm already using the method you described above to write data to a mySQL DB so modifying it to track progress shouldn't be too hard. I was actually planning to do that, just hadn't got that far...
ok now I'll go and reread your post...
p.s. just realising it keeps getting rid of the tabs so all the indentation is gone... that's going to make it hard to read!... any way I can stop it trimming the tabs?
Just worked out what was wrong... I was passing the userList dataSet to the RunnableScrapingSession as a session variable. I thought since I wasn't using the inheritance option it would send a copy but it seems to pass it by reference:
thread[j] = new com.screenscraper.scraper.RunnableScrapingSession("fe - a2 user details");
thread[j].setDoLazyScrape(true);
thread[j].setVariable("SESSION_VAR_FILE",sessionVarFile[j]);
thread[j].setVariable("USERLIST",userList);
the line after I launched the scrape I was clearing all the datarecords in userList. Sometimes the child thread was fast enough to dump it to a file, sometimes it wasn't and it gets cleared before the child has a chance to use it... so I guess I need an array of dataSets so each thread gets their own to play with so it won't get overwritten until the child thread is finished with it...
so the big question is... If I use the same name for the USERLIST sessionVar for all of the different RunnableScrapingSessions... Are they actually separate instances of the session variables or will the common name cause more conflicts?
Primitive values (ints, booleans and so on) will be copies, but for objects Java only ever hands over a reference, so the child ends up pointing at the same DataSet as the parent. You might have to make deep copies of your variables in the script, and then save the new copy to the child session variable.
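A minimal plain-Java sketch of the difference, using an ordinary list in place of a DataSet. The SharedListDemo and Child classes are hypothetical stand-ins, not screen-scraper classes; the point is only what happens when the parent clears a list after handing it over.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch using plain lists in place of DataSet objects.
class SharedListDemo {

    // Stand-in for a child session: it only keeps whatever object it was given.
    static class Child {
        final List<String> userList;

        Child(List<String> userList) {
            this.userList = userList;
        }
    }

    public static void main(String[] args) {
        List<String> batch = new ArrayList<>();
        batch.add("user 0");
        batch.add("user 1");

        Child sharing = new Child(batch);                  // same object as the parent's
        Child copying = new Child(new ArrayList<>(batch)); // the child's own copy

        batch.clear(); // parent reuses its list for the next batch

        System.out.println(sharing.userList.size()); // 0: the child's data vanished too
        System.out.println(copying.userList.size()); // 2: unaffected by the clear
    }
}
```

Note that new ArrayList<>(batch) copies the list but not the elements inside it. That is enough here because strings are immutable; for mutable elements each one would need copying as well.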
What is a 'deep copy'?
I guess this is one of the problems with learning a language as you go. You don't discover these sort of oddities until they bite you.
This was something I was doing as a bit of a test...
tempRecord = masterUserList.getDataRecord(i-1);
tempRecord.put("USERNAME",j + " - AARRRGGGHHH!!!!");
userList[j].addDataRecord(tempRecord);
Turned out that tempRecord.put would actually overwrite the record in masterUserList DataSet. Then I discovered the wonder of clone()
tempRecord = masterUserList.getDataRecord(i-1).clone();
tempRecord.put("USERNAME",j + " - AARRRGGGHHH!!!!");
userList[j].addDataRecord(tempRecord);
which leaves masterUserList untouched... The Java doco is a bit hazy though on whether clone() will always make a completely separate copy or not (for different classes)...
Exactly-- I should have qualified what I meant by "deep copy". Conceptually, it's simply the act of making a "real" copy, and not just making a new variable that still points to the old one.
With object-oriented languages like Java, assigning or passing an object usually just copies the reference to it, since that's faster and takes up less memory.
Clone is a way to get a new copy of the source, as you've discovered. That should hopefully help you to avoid the weirdness I'm sure you were having. One caveat: for many Java classes clone() is only a shallow copy, meaning the top-level object is new but any objects stored inside it are still shared, so check the class in question if its values are themselves mutable objects. Even if the dataRecord you wish to copy is rather large and takes a while to clone, there isn't really a work-around if you need that deep copy.
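To show how far a copy actually reaches, here is a small sketch with java.util.HashMap standing in for a record. HashMap is used only because its clone() behaviour is documented; DataRecord may behave differently, and CloneDepthDemo and deepCopy are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch with HashMap standing in for a record; DataRecord may behave differently.
class CloneDepthDemo {

    // Deep copy for this particular shape: copies the map and each list inside it.
    static Map<String, List<String>> deepCopy(Map<String, List<String>> source) {
        Map<String, List<String>> copy = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : source.entrySet()) {
            copy.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        return copy;
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        HashMap<String, List<String>> original = new HashMap<>();
        original.put("POSTS", new ArrayList<>(Arrays.asList("first post")));

        Map<String, List<String>> shallow = (Map<String, List<String>>) original.clone();
        Map<String, List<String>> deep = deepCopy(original);

        shallow.put("USERNAME", new ArrayList<>());   // new key: original is untouched
        shallow.get("POSTS").add("added via clone");  // nested list is shared with original
        deep.get("POSTS").add("added via deep copy"); // nested list is the copy's own

        System.out.println(original.containsKey("USERNAME")); // false
        System.out.println(original.get("POSTS").size());     // 2
    }
}
```

So put() on a clone is safe, which matches the USERNAME experiment in this thread, while changing an object that is stored inside the clone can still leak back to the original.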
Leave it to the JavaDoc to be vague. Blasted Java.
:)
Tim