Error/Extractor Logging
I am currently scripting some error logging into my scraping projects, and one thing I am struggling with is logging the misses of specific extractor patterns. I want a log file that lists the names of the extractors and the respective URLs that produced the misses, as well as a statistic for each extractor pattern showing how many hits and misses it produced.
My idea was to run a script on the event "Once if no matches" that writes the information to the log, but I see no way to get the name of the extractor that called the script from within the script itself. So essentially I can only determine that a miss occurred, not which extractor pattern produced it.
Sure, I could hard-code the name into the script, but that would mean writing a separate script for every extractor pattern.
Is there any way to get the name of the object (here the extractor pattern) that called the script? Or is there an easier way to produce the kind of logging I described that I just don't know about? (Maybe I am going about it the wrong way.)
On a side note:
There is one other workaround I could think of: parsing the scraping log that screen-scraper creates itself. But I would very much prefer another way if one exists.
There is a way to get the extractor pattern name. It's pretty new, so you need to be updated to version 6.0.59a or newer.
Rather than trying to explain, I've attached a sample scrape that does it. It's pretty simple, but should get you started.
The guy who implemented this also noted that you can expect a bit of a speed hit on your scrape when using this. It may not be a big deal, but something to watch for.
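For the bookkeeping side of the question (the list of misses plus a hit/miss count per extractor), here is a minimal sketch in plain Java. It is not taken from the attached sample scrape and it does not use any screen-scraper API: the class `ExtractorStats` and its methods are hypothetical names made up for illustration, and the extractor name and URL are simply passed in as strings. How you obtain those two values inside a screen-scraper script is covered by the sample scrape, not by this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; NOT part of the screen-scraper API.
class ExtractorStats {
    // Per extractor name: index 0 = hits, index 1 = misses.
    private final Map<String, int[]> counts = new LinkedHashMap<>();
    // One line per miss, in the order the misses were recorded.
    private final List<String> missLines = new ArrayList<>();

    void recordHit(String extractorName) {
        counts.computeIfAbsent(extractorName, k -> new int[2])[0]++;
    }

    void recordMiss(String extractorName, String url) {
        counts.computeIfAbsent(extractorName, k -> new int[2])[1]++;
        missLines.add("MISS\t" + extractorName + "\t" + url);
    }

    int hits(String extractorName) {
        int[] c = counts.get(extractorName);
        return c == null ? 0 : c[0];
    }

    int misses(String extractorName) {
        int[] c = counts.get(extractorName);
        return c == null ? 0 : c[1];
    }

    // Builds the log text: all miss lines first, then one statistics line per extractor.
    String report() {
        StringBuilder sb = new StringBuilder();
        for (String line : missLines) {
            sb.append(line).append('\n');
        }
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int[] c = e.getValue();
            sb.append("STAT\t").append(e.getKey())
              .append("\thits=").append(c[0])
              .append("\tmisses=").append(c[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ExtractorStats stats = new ExtractorStats();
        stats.recordHit("Product name");                      // example extractor names
        stats.recordMiss("Price", "http://example.com/item/1"); // example URL
        System.out.print(stats.report());
    }
}
```

The idea would be to keep one such object for the whole scrape, call `recordMiss` from the "Once if no matches" script and `recordHit` from a script that runs on a match, and write `report()` to a file at the end. The tab-separated layout is just one possible choice.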
Thanks, it works like a charm, and I don't really see much of a performance drop, at least not during testing. I will keep my eyes open and see whether there is any significant increase in runtime once I have the time to set up one of my larger scrapes.