Newsbank Extraction
Hello ye wise forumites!
I am at a loss. I have fiddled through the screen-scraper tutorials, done the 'hello world' bit, but I cannot figure out how to get the precise information I need from the precise site I need it from, which is frustrating.
To preface - I am working in a Psych lab, and we are studying the Name-Letter effect (and variants thereof) on larger samples, and right now are looking into Hurricane Katrina in relation to the authors.
I am attempting to extract all the text, the title, and the author's name of every story that mentions 'Hurricane Katrina'. The site I am using is NewsBank: Access World News (http://www.newsbank.com/), though unfortunately you need a subscription to use their search engine. The site is similar to PsycInfo: it is a database that searches through millions of news files from around the world.
I mainly am having an issue with the coding (I am, unfortunately, not familiar with any programming languages). I have attached a copy of the main newsbank database HTML file as well as a search for "Hurricane Katrina".
If you have any advice, or want me to post more information about the site, please feel free to ask.
(Also - is there any way to change where your computer sends the extracted information?)
Thank you,
Bellow.
Attachment | Size |
---|---|
Newsbank Source.txt | 163.53 KB |
"Hurricane Katrina" search.txt | 71.91 KB |
There are two ways to do this, and I'm going to use the slightly more complicated way. My reason is that your search will return so many pages that we want to be as efficient as possible.
I'm going to start from scratch as I explain this.
// Interpreted Java
session.setVariable("HAS_NEXT_PAGE", "yes");
session.setVariable("FIRST_RESULT", "1");
while (session.getVariable("HAS_NEXT_PAGE") != null)
{
    // Clear the flag; the "Search Results" page sets it again if another page exists.
    session.setVariable("HAS_NEXT_PAGE", null);
    session.scrapeFile("Search Results");
}
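To see why the loop clears HAS_NEXT_PAGE before every scrape, here is a minimal stand-alone sketch in plain Java. It is only an illustration: the `Map` stands in for screen-scraper's session variables, and `scrapeSearchResults` with its fixed `TOTAL_PAGES` is a hypothetical stand-in for what the "Search Results" scrapeable file and its "next page" extractor pattern would do. None of it is the screen-scraper API itself.

```java
import java.util.HashMap;
import java.util.Map;

public class PagingSketch {
    // Stand-in for screen-scraper's session variables (an assumption, not the real API).
    static final Map<String, Object> vars = new HashMap<>();
    // Hypothetical number of result pages the pretend site returns.
    static final int TOTAL_PAGES = 3;
    static int pagesScraped = 0;

    // Pretend scrape: counts the page and re-sets the flag only if another page exists.
    static void scrapeSearchResults() {
        pagesScraped++;
        if (pagesScraped < TOTAL_PAGES) {
            vars.put("HAS_NEXT_PAGE", "yes");
        }
    }

    // Same shape as the initialization script: loop until no next page is reported.
    static int run() {
        vars.clear();
        pagesScraped = 0;
        vars.put("HAS_NEXT_PAGE", "yes");
        while (vars.get("HAS_NEXT_PAGE") != null) {
            vars.remove("HAS_NEXT_PAGE"); // clear first, so a page without a "next" link ends the loop
            scrapeSearchResults();
        }
        return pagesScraped;
    }

    public static void main(String[] args) {
        System.out.println("Pages scraped: " + run());
    }
}
```

If the flag were not cleared before each scrape, the last page (which has no "next" link to set it again) would leave the old value in place and the loop would never end.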
// Interpreted Java
session.setVariable("URL", session.getVariable("URL").replaceAll("&amp;", "&"));
session.scrapeFile("Details Page");
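Links taken out of HTML source usually have their ampersands escaped as `&amp;`, and the replacement in this script turns them back into plain `&` so the query string works in a request. The following is a small plain-Java sketch of just that step. The class name, method name, and sample link are hypothetical; the real link text depends on what the extractor pattern captures.

```java
public class DecodeUrlSketch {
    // Turn HTML-escaped ampersands back into plain ones.
    // String.replace does a literal replacement; replaceAll also works here
    // because "&amp;" contains no regex metacharacters.
    static String decodeAmpersands(String rawUrl) {
        return rawUrl.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Hypothetical value as it might appear inside an href attribute.
        String raw = "/iw-search/we/InfoWeb?p_action=doc&amp;p_docid=EXAMPLE";
        System.out.println(decodeAmpersands(raw));
    }
}
```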
~#FIRST_RESULT#~
~@TITLE@~
Patterns for the variables are:
URL: [^"]*
TITLE: [^<>]*
SOURCE: no pattern needed
DATE: \w+\s\d+,\s\d{4}
You should make sure that the "Save to session variable?" box is checked on each of these.
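If you want to check what the URL, TITLE, and DATE patterns accept before putting them into the extractor pattern, you can try them in ordinary Java. This is only a sketch: the sample HTML row is made up (I have not seen the real NewsBank markup), and the surrounding tags in `ROW` are an assumption. Only the three sub-patterns come from the list above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSketch {
    // The three sub-patterns from the list above, wrapped in hypothetical surrounding HTML.
    static final Pattern ROW = Pattern.compile(
        "<a href=\"(?<url>[^\"]*)\">(?<title>[^<>]*)</a>\\s*"
        + "<span>(?<date>\\w+\\s\\d+,\\s\\d{4})</span>");

    // Returns {url, title, date} for the first matching row, or null if nothing matches.
    static String[] extract(String html) {
        Matcher m = ROW.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("url"), m.group("title"), m.group("date") };
    }

    public static void main(String[] args) {
        // Made-up result row, for illustration only.
        String html = "<a href=\"/doc?id=1&amp;x=2\">Storm coverage</a> <span>August 29, 2005</span>";
        String[] fields = extract(html);
        System.out.println(fields == null ? "no match" : String.join(" | ", fields));
    }
}
```

A date written as "29 August 2005" would not match the DATE pattern, which is a quick way to find out whether the pattern needs loosening for your results.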
Also add a script to this extractor pattern. The script to execute is "Goto Details Page". Execute it "After each pattern application".
junk_parameters: [^"]*
Be sure to mark these to save as session variables as well.
http://www.someSite.com~#URL#~
I don't know what the actual text will need to be. You should examine what the URL of an actual details page is, compared to the value of the "URL" variable when you test the pattern that matches the URL/SOURCE, etc. Hopefully, you should only need to replace that leading text in my example URL textbox with the actual location of the website.
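The comparison described above can be sketched in plain Java. Given a details-page URL copied from the browser and the value the "URL" variable held for the same story, the helper returns the leading text that belongs in front of `~#URL#~`. The class name, method name, and both sample URLs are hypothetical.

```java
public class UrlPrefixSketch {
    // Returns the part of actualUrl that precedes extractedUrl,
    // or null if the extracted value is not the tail of the actual URL.
    static String leadingText(String actualUrl, String extractedUrl) {
        if (!actualUrl.endsWith(extractedUrl)) {
            return null;
        }
        return actualUrl.substring(0, actualUrl.length() - extractedUrl.length());
    }

    public static void main(String[] args) {
        // Hypothetical values; substitute a real details-page URL and the logged URL variable.
        String actual = "http://www.someSite.com/iw-search/we/InfoWeb?p_docid=EXAMPLE";
        String extracted = "/iw-search/we/InfoWeb?p_docid=EXAMPLE";
        System.out.println(leadingText(actual, extracted));
    }
}
```

A null result means the two values differ by more than a leading prefix, so the URL would need more rework than swapping the site name.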
Lots of steps, I know, but I intentionally broke it down to very small pieces.
I can't test this for real, since I don't have access to the site. However, this concept is a well practiced one here in our office.
Let me know if there are any errors in that code-- I've typed it out just by concept :P
Tim
Error - Login issues?
Hello,
It's been some time since my last post, as I've been busy with a few other things, but finally getting back to the tutorial (which is very in-depth and fantastic, thank you Tim), I've found a little issue.
When I get through all the steps and run the session, I get an error (log posted below). It looks like the request is being redirected through a chain of login pages. Is there any way to enter a username and password into screen-scraper so that it reaches the site I want to scrape, or is there some other issue I am missing?
Thank you for all of your help!
Starting scraper.
Running scraping session: Newsbank
Processing scripts before scraping session begins.
Processing script: "Initialize"
Scraping file: "Search Results"
Search Results: Preliminary URL: http://infoweb.newsbank.com.proxy.lib.umich.edu/iw-search/we/InfoWeb?p_topdoc=51&p_action=list
Search Results: Using strict mode.
Search Results: Resolved URL: http://infoweb.newsbank.com.proxy.lib.umich.edu/iw-search/we/InfoWeb?p_topdoc=51&p_action=list&P_topdoc=value
Search Results: Sending request.
Search Results: Redirecting to: http://proxy.lib.umich.edu/login?url=http://infoweb.newsbank.com/iw-search/we/InfoWeb?P_topdoc=value
Search Results: Redirecting to: http://login.umdl.umich.edu/cgi/proxy-session-init?url=http://infoweb.newsbank.com/iw-search/we/InfoWeb?P_topdoc
Search Results: Redirecting to: https://login.umdl.umich.edu/enter-password?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Redirecting to: https://login.umdl.umich.edu/cgi/cosign/proxy?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Redirecting to: https://weblogin.umich.edu/?cosign-login.umdl=BWOryEA4Z4yqu0W45xO+QvPTfOeBBtmcE8gNp3PPQXTQLi3dwSuOAA8riL7IJfRlESwdd6X-AWLHSUUAaX2oNarBuXUjYRM0YyVycnliIEybzUHzogV7I0PiTIaW;&https://login.umdl.umich.edu/cgi/cosign/proxy?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Processing scripts before all pattern applications.
Search Results: Extracting data for pattern "Untitled Extractor Pattern"
Search Results: The pattern did not find any matches.
Search Results: Processing scripts after all pattern applications.
Search Results: Extracting data for pattern "Untitled Extractor Pattern"
Search Results: The pattern did not find any matches.
Search Results: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Scraping file: "Search Results"
Search Results: Preliminary URL: http://infoweb.newsbank.com.proxy.lib.umich.edu/iw-search/we/InfoWeb?p_topdoc=51&p_action=list
Search Results: Using strict mode.
Search Results: Resolved URL: http://infoweb.newsbank.com.proxy.lib.umich.edu/iw-search/we/InfoWeb?p_topdoc=51&p_action=list&P_topdoc=value
Search Results: Sending request.
Search Results: Redirecting to: http://proxy.lib.umich.edu/login?url=http://infoweb.newsbank.com/iw-search/we/InfoWeb?P_topdoc=value
Search Results: Redirecting to: http://login.umdl.umich.edu/cgi/proxy-session-init?url=http://infoweb.newsbank.com/iw-search/we/InfoWeb?P_topdoc
Search Results: Redirecting to: https://login.umdl.umich.edu/enter-password?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Redirecting to: https://login.umdl.umich.edu/cgi/cosign/proxy?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Redirecting to: https://weblogin.umich.edu/?cosign-login.umdl=O8Q3+GrksgYD4nHKKVCDF+YJp01x8fuWjp19lzOjhUHzzPlpNoz3IimS6o61mbnNAQ9GmdG6BusSmaRujzDb3yOfVxZc1aEaP3kTjKLl5U35Nc0Iy8mql74LywB2;&https://login.umdl.umich.edu/cgi/cosign/proxy?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Processing scripts before all pattern applications.
Search Results: Extracting data for pattern "Untitled Extractor Pattern"
Search Results: The pattern did not find any matches.
Search Results: Processing scripts after all pattern applications.
Search Results: Extracting data for pattern "Untitled Extractor Pattern"
Search Results: The pattern did not find any matches.
Search Results: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Scraping file: "Details Page"
Details Page: Skipping this scrapeable file because the URL field is empty.
Processing scripts after scraping session has ended.
Scraping session "Newsbank" finished.