Multithreading same scraping session
Hi,
We are having problems running the same scraping session in multiple threads.
We have a script that creates an xmlWriter and stores it in the session. After this initialization script, a scrape runs that gathers data from a site.
After the scrape has run, another script gets the data from the scrapeSession object and writes it to a file using the xmlWriter.
I then ran the same scraping session in multiple threads (10 to 20). The result files often contained text indicating that several threads had written to them.
My conclusion was that the threads shared the same "code space", so they were not insulated from one another.
I tried using a Java ReentrantLock: I instantiated a locker object and called locker.lock(), but this just gave me an error message.
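For reference, the usual way to use a ReentrantLock is to call lock() and then release it in a finally block. The following is only a minimal sketch in plain Java, not a screen-scraper script: the class LockedWriterSketch, the shared StringBuilder, and the record format are made up for illustration, and it would need adapting to the scripting environment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example class; none of these names come from screen-scraper.
final class LockedWriterSketch {
    // One lock shared by every thread that touches the shared buffer.
    private static final ReentrantLock LOCK = new ReentrantLock();
    // Stand-in for a shared output target such as a single XML writer.
    private static final StringBuilder SHARED = new StringBuilder();

    // Appends one record; the lock keeps the multi-step append in one piece.
    static void writeRecord(String threadName, int n) {
        LOCK.lock();
        try {
            SHARED.append("<record thread=\"").append(threadName)
                  .append("\" n=\"").append(n).append("\"/>\n");
        } finally {
            LOCK.unlock(); // always release, even if the write throws
        }
    }

    // Starts the given number of threads and returns everything they wrote.
    static String runThreads(int threads, int recordsPerThread) {
        SHARED.setLength(0);
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final String name = "t" + t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < recordsPerThread; i++) {
                    writeRecord(name, i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return SHARED.toString();
    }

    public static void main(String[] args) {
        String out = runThreads(8, 100);
        System.out.println(out.split("\n").length + " records written");
    }
}
```

A lock like this only helps when every thread really uses the same lock object and the same shared resource; it does nothing if each run creates its own lock.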
If the scrape file's code is shared or accessed simultaneously by threads, what happens to the scraping operation itself? To put it concretely: can a thread searching for the name "Big Name" launch a scrape, and another thread associated with a different name search, 'Little John', get its results?
What can we do about this? How can we run the same scraping session in multiple simultaneous threads without interference?
What about different scraping sessions? Is it possible for different scraping sessions to interfere with each other? If they don't interfere, we could, as a worst-case workaround, define something like 10 sessions that do the same scraping in order to serve 10 users at the same time.
We use .NET 3.5.
Multi-threading scraping session
I think I understand the problem, but I don't entirely see the situation.
How is your project set up?
We scrape a web page and return results in xml.
The scripts I'm talking about are in a session on a Screen Scraper installation on a different machine from the C# code we are using. There are three: one that initializes the xmlWriter and creates the root node, one that writes the info as XML child nodes of the root node, and one that closes the xmlWriter and writes the info to the output file.
The scraping is done after the first script and is followed by the execution of the second and third scripts. Each scrape writes to a single XML file.
What we do is initialize a RemoteScrapingSession object in C#, call Scrape() on it, and afterwards get the results from the XML file that was saved on the server hosting the ScreenScraper instance:
RemoteScrapingSession scrapingSession = new RemoteScrapingSession(
sessionName,
ConfigurationManager.AppSettings["ScreenScraperServer"],
Convert.ToInt32(ConfigurationManager.AppSettings["ScreenScraperPort"]));
// ...various variable settings...
scrapingSession.Scrape();
// ... read the results received as xml files
While testing:
- I get overlapping (apparently several threads writing to the same file) more rarely in this scenario:
generate threads -> each thread instantiates the web service and calls one method -> the web service method instantiates a ScreenScraperSession object and runs scrape().
- I get overlapping often in this scenario:
generate threads -> each thread instantiates the web service and calls 2(!) methods -> the web service methods instantiate 2 differently named ScreenScraperSession sessions and run scrape().
The name of the xmlWriter object stored in the SS sessions is the same, but they are different sessions. (Could it be a problem of scraping sessions sharing the same 'Session Repository'? For example, the same xmlWriter object being used by two scraping sessions running at the same time.)
I did encounter overlapping in the first scenario but can't reproduce it today; the second case appears often.
Among the error types found in error logs while scraping are:
- Redirect requested to location....
- An IOException occurred. The message was: Connection reset
java.net.SocketException: Connection reset
- the error message was: NullPointerException (line 51): xmlWriter .close ( ) -- Null Pointer in Method Invocation. com.screenscraper.scraper.ScriptException: NullPointerException (line 51): xmlWriter .close ( ) -- Null Pointer in Method Invocation. This one occurs when finishing writing.
I ran the same session (the first one) using SOAP. I clicked the run button 10 times and got 10 result files in about 3 minutes. I then tried to run 8 threads simultaneously from the C# code as described above and received timeouts, with an overall execution time of 5 minutes. All this time, the processor of the machine where Screen Scraper is installed had a load above 50%.
So, the 2 problems would be:
- a multithreaded run of the same session seems to lead to instances overlapping over the same code, or accessing the same session objects.
- the time it takes to run 10 threads / 10 scrape calls: 3-5 min. The machine is an Intel Xeon 2.80 dual core with lots of RAM. Is the machine adequate for many simultaneous accesses, or would we need something better? Something much better?
Should we rely on multithreading per scrape session, or avoid multithreading for the same scraping session?
What would the alternatives be for live scraping on web users' requests?
Is it OK to have multiple Screen Scrapers installed? Do many ScreenScraper instances running at the same time eat up lots of resources?
Multi-threading scraping session
Your machine is plenty powerful to run multiple instances of screen-scraper.
For the errors in the log, none of those have to do with multi-threading. The "Null Pointer in Method Invocation" means that the script running tried to call a method on null. In this case, it looks like xmlWriter was null for some reason when the script tried to execute xmlWriter.close().
We run multiple threads of the same scraping session all the time; screen-scraper is very scalable. Your separate instances of the same session aren't sharing their XmlWriters, not if I understand the way you've set them up.
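The null pointer error described above can be guarded against by checking the writer before closing it. This is only a minimal sketch in plain Java, assuming the writer is a java.io.Writer (the real xmlWriter type stored in the session is not shown in this thread); NullSafeCloseSketch and closeIfPresent are hypothetical names.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper; not part of screen-scraper.
final class NullSafeCloseSketch {
    // Closes the writer if there is one. Returns false when it was null,
    // so the caller can log that initialization never happened.
    static boolean closeIfPresent(Writer writer) {
        if (writer == null) {
            return false;
        }
        try {
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(closeIfPresent(null));
        System.out.println(closeIfPresent(new StringWriter()));
    }
}
```

A guard like this only hides the symptom; it is still worth finding out why the writer was never set for that run.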
Are all of your XmlWriters writing to the same file?
The issue was that the generated filenames were not always unique, so several threads ended up writing to the same file. My bad.
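One way to avoid this kind of collision is to make each run's output filename unique by construction. This is a minimal sketch in plain Java, assuming results are written as XML files into one directory; UniqueOutputFileSketch, uniqueOutputPath, writeResult, and the naming scheme are hypothetical and not part of screen-scraper.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

// Hypothetical helper; the naming scheme is an assumption for illustration.
final class UniqueOutputFileSketch {
    // Builds a path such as <dir>/<sessionName>-<random UUID>.xml.
    static Path uniqueOutputPath(Path dir, String sessionName) {
        return dir.resolve(sessionName + "-" + UUID.randomUUID() + ".xml");
    }

    // Writes the XML to a fresh file and returns its path.
    static Path writeResult(Path dir, String sessionName, String xml) throws IOException {
        Files.createDirectories(dir);
        Path target = uniqueOutputPath(dir, sessionName);
        // CREATE_NEW makes the write fail instead of silently sharing a file
        // if the name somehow already exists.
        Files.write(target, xml.getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE_NEW);
        return target;
    }
}
```

The caller would then pass the returned path back to whoever needs to read the results, rather than reconstructing the filename from a timestamp or counter.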