Newsbank Extraction

Hello ye wise forumites!

I am at a loss. I have fiddled through the screen-scraper tutorials, done the 'hello world' bit, but I cannot figure out how to get the precise information I need from the precise site I need it from, which is frustrating.

To preface - I am working in a Psych lab, and we are studying the Name-Letter effect (and variants thereof) on larger samples, and right now are looking into Hurricane Katrina in relation to the authors.

I am attempting to extract all the text, the title, and the author's name of every story that mentions 'Hurricane Katrina'. The site I am using is NewsBank: Access World News (http://www.newsbank.com/), though unfortunately you need a subscription to use their search engine. The site is similar to PsycInfo: it is a database that searches through millions of news files from around the world.

I mainly am having an issue with the coding (I am, unfortunately, not familiar with any programming languages). I have attached a copy of the main newsbank database HTML file as well as a search for "Hurricane Katrina".

If you have any advice, or want me to post more information about the site, please feel free to ask.

(Also - is there any way to change where your computer sends the extracted information?)

Thank you,

Bellow.

Attachment Size
Newsbank Source.txt 163.53 KB
"Hurricane Katrina" search.txt 71.91 KB

There are two ways to do

There are two ways to do this, and I'm going to use the slightly more complicated way. The reason is that your search is going to return so many pages that we want to be as efficient as possible.

I'm going to start from scratch as I explain this.

  1. Make a new scraping session (the blue gear).
  2. Don't proxy anything yet, but go ahead and get on the website and perform your search.
  3. Change the drop-down box near the bottom of the page from 10 results per page to 50.
  4. Turn your proxy server on, both in screen-scraper and in your web browser.
  5. Click on the "Next" page link near the bottom of the web page.
  6. Back in screen-scraper, make a scrapeableFile out of what you just proxy'd. You can turn off the proxy server now.
  7. Name your new scrapeableFile something like "Search Results".
  8. With some foresight for what I'm going to have you do later, I'm going to have you make a script. Name it something like "Initialize", or something equally intuitive:

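    // Interpreted Java
    // Seed the loop: assume there is at least one page, starting at result 1.
    session.setVariable("HAS_NEXT_PAGE", "yes");
    session.setVariable("FIRST_RESULT", "1");
    // Keep requesting result pages for as long as HAS_NEXT_PAGE is set.
    while (session.getVariable("HAS_NEXT_PAGE") != null)
    {
        // Clear the flag first; an extractor pattern on the page has to set
        // it again, otherwise the loop stops after this request.
        session.setVariable("HAS_NEXT_PAGE", null);
        session.scrapeFile("Search Results");
    }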
  9. Add another script, and name it something like "Goto Details Page":

    // Interpreted Java
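    // Links scraped out of HTML have "&" escaped as "&amp;"; undo that before requesting the page.
    session.setVariable("URL", session.getVariable("URL").replaceAll("&amp;", "&"));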
    session.scrapeFile("Details Page");
  10. On your scraping session (in the tree on the left), go to the "Scripts" tab, and add a script. Make sure that the script to be executed is "Initialize".
  11. On your "Search Results" scrapeableFile, go to the parameters tab. There should only be 2 or maybe 3 items on this list. One will be called "p_topdoc". Change its "value" cell to this:

    ~#FIRST_RESULT#~
  12. Switch to the scrapeableFile's "Extractor Pattern" tab. Add an extractor pattern:

    ~@TITLE@~
    ~@SOURCE@~ - ~@DATE@~

    Patterns for the variables are:
    URL: [^"]*
    TITLE: [^<>]*
    SOURCE: no pattern needed
    DATE: \w+\s\d+,\s\d{4}
    You should make sure that the "Save to session variable?" box is checked on each of these.

    Also add a script to this extractor pattern. The script to execute is "Goto Details Page". Execute it "After each pattern application".

  13. Add another extractor pattern:

    \d+
    junk_parameters: [^"]*

    Be sure to mark these to save as session variables as well.

  14. Now all you have to do is make a scrapeableFile for your details page, name it "Details Page", and extract any info that you would like to grab from it. The scrapeableFile will need to include something like this in its first tab's "URL" textbox:

    http://www.someSite.com~#URL#~

    I don't know what the actual text will need to be. You should examine what the URL of an actual details page is, compared to the value of the "URL" variable when you test the pattern that matches the URL/SOURCE, etc. Hopefully, you should only need to replace that leading text in my example URL textbox with the actual location of the website.
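To make the URL handling concrete, here is a minimal sketch in plain Java (not screen-scraper code) of what the "Goto Details Page" script and the "Details Page" URL field do together. It assumes the intent of the replaceAll call is to turn the HTML-escaped "&amp;" back into "&". The class name, the method names, and the sample link are made up for illustration, and the host is the same placeholder as above.

```java
class DetailsUrlSketch {
    // Placeholder site root from the example above; substitute the real host.
    static final String BASE = "http://www.someSite.com";

    // Assumed intent of "Goto Details Page": links scraped from HTML carry
    // "&amp;" where the actual request needs a bare "&".
    static String unescapeAmpersands(String href) {
        return href.replace("&amp;", "&");
    }

    // Same idea as the URL field "http://www.someSite.com~#URL#~":
    // fixed leading text followed by the scraped, cleaned-up link.
    static String detailsUrl(String href) {
        return BASE + unescapeAmpersands(href);
    }

    public static void main(String[] args) {
        // Hypothetical link as it might appear inside an href attribute.
        System.out.println(detailsUrl("/details?id=1&amp;page=2"));
    }
}
```

If the value of "URL" already starts with "http://", the leading text in the URL textbox would not be needed at all, so it is worth checking the variable's value in the log first.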

Lots of steps, I know, but I intentionally broke it down to very small pieces.
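If it helps to see how the pieces fit together, the paging loop from the "Initialize" script can be simulated in plain Java. This is only a sketch of the control flow: the Map stands in for the session variables, fakeScrape is a made-up stand-in for session.scrapeFile("Search Results"), and it assumes the "Next" link pattern saves both HAS_NEXT_PAGE and the next FIRST_RESULT whenever another page exists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PaginationSketch {
    static final int PAGE_SIZE = 50; // 50 results per page, as chosen in step 3

    // Hypothetical stand-in for scraping one results page. It records which
    // offset was requested and, if more results remain, sets the variables
    // that the "Next" link pattern is assumed to save.
    static void fakeScrape(Map<String, String> session, int totalResults, List<Integer> requested) {
        int first = Integer.parseInt(session.get("FIRST_RESULT"));
        requested.add(first);
        int next = first + PAGE_SIZE;
        if (next <= totalResults) {
            session.put("FIRST_RESULT", String.valueOf(next));
            session.put("HAS_NEXT_PAGE", "yes");
        }
    }

    // Same control flow as the "Initialize" script.
    static List<Integer> run(int totalResults) {
        Map<String, String> session = new HashMap<>();
        List<Integer> requested = new ArrayList<>();
        session.put("HAS_NEXT_PAGE", "yes");
        session.put("FIRST_RESULT", "1");
        while (session.get("HAS_NEXT_PAGE") != null) {
            session.remove("HAS_NEXT_PAGE"); // cleared before every request
            fakeScrape(session, totalResults, requested);
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(run(120)); // prints [1, 51, 101]
    }
}
```

The point of clearing the flag before each request is that the loop ends by itself on the last page, where no "Next" link is found and nothing sets the flag again.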

I can't test this for real, since I don't have access to the site. However, this concept is a well-practiced one here in our office.
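The token regular expressions from step 12 can at least be checked offline. Below is a rough sketch in plain Java that applies them to one made-up result row. The markup of the row, the sample headline and source, and the lazy wildcard used for SOURCE are all assumptions; the real NewsBank HTML will differ, so treat this as a way to test the regexes rather than as the actual extractor pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPatternCheck {
    // Token regexes copied from step 12.
    static final String URL = "[^\"]*";
    static final String TITLE = "[^<>]*";
    static final String DATE = "\\w+\\s\\d+,\\s\\d{4}";
    // SOURCE has no regex in step 12; a lazy wildcard is assumed here.
    static final String SOURCE = ".*?";

    // Hypothetical markup for one search result row.
    static final Pattern ROW = Pattern.compile(
        "<a href=\"(" + URL + ")\">(" + TITLE + ")</a><br>(" + SOURCE + ") - (" + DATE + ")");

    // Returns {url, title, source, date}, or null when the row does not match.
    static String[] extract(String html) {
        Matcher m = ROW.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String sample = "<a href=\"/details?id=1&amp;page=2\">Levees fail</a>"
            + "<br>Daily Sample - August 30, 2005";
        String[] fields = extract(sample);
        System.out.println(String.join(" | ", fields));
    }
}
```

Note that the DATE regex only accepts dates written like "August 30, 2005"; if the site abbreviates months with a period or uses another format, that token would need adjusting.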

Let me know if there are any errors in that code; I've typed it out just from the concept :P

Tim

Error - Login issues?

Hello,

It's been some time since my last post, as I've been busy with a few other things, but finally getting back to the tutorial (which is very in-depth and fantastic, thank you Tim), I've found a little issue.

When I get through all the steps, up to running the session, I get an error (posted below). It looks like the request is being redirected back to the login pages. Is there any way to 'input' a username/password into screen-scraper to bring it back to my desired site for scraping, or is there some other issue I seem to be missing?

Thank you for all of your help!

Starting scraper.
Running scraping session: Newsbank
Processing scripts before scraping session begins.
Processing script: "Initialize"
Scraping file: "Search Results"
Search Results: Preliminary URL: http://infoweb.newsbank.com.proxy.lib.umich.edu/iw-search/we/InfoWeb?p_topdoc=51&p_action=list
Search Results: Using strict mode.
Search Results: Resolved URL: http://infoweb.newsbank.com.proxy.lib.umich.edu/iw-search/we/InfoWeb?p_topdoc=51&p_action=list&P_topdoc=value
Search Results: Sending request.
Search Results: Redirecting to: http://proxy.lib.umich.edu/login?url=http://infoweb.newsbank.com/iw-search/we/InfoWeb?P_topdoc=value
Search Results: Redirecting to: http://login.umdl.umich.edu/cgi/proxy-session-init?url=http://infoweb.newsbank.com/iw-search/we/InfoWeb?P_topdoc
Search Results: Redirecting to: https://login.umdl.umich.edu/enter-password?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Redirecting to: https://login.umdl.umich.edu/cgi/cosign/proxy?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Redirecting to: https://weblogin.umich.edu/?cosign-login.umdl=BWOryEA4Z4yqu0W45xO+QvPTfOeBBtmcE8gNp3PPQXTQLi3dwSuOAA8riL7IJfRlESwdd6X-AWLHSUUAaX2oNarBuXUjYRM0YyVycnliIEybzUHzogV7I0PiTIaW;&https://login.umdl.umich.edu/cgi/cosign/proxy?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Processing scripts before all pattern applications.
Search Results: Extracting data for pattern "Untitled Extractor Pattern"
Search Results: The pattern did not find any matches.
Search Results: Processing scripts after all pattern applications.
Search Results: Extracting data for pattern "Untitled Extractor Pattern"
Search Results: The pattern did not find any matches.
Search Results: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Scraping file: "Search Results"
Search Results: Preliminary URL: http://infoweb.newsbank.com.proxy.lib.umich.edu/iw-search/we/InfoWeb?p_topdoc=51&p_action=list
Search Results: Using strict mode.
Search Results: Resolved URL: http://infoweb.newsbank.com.proxy.lib.umich.edu/iw-search/we/InfoWeb?p_topdoc=51&p_action=list&P_topdoc=value
Search Results: Sending request.
Search Results: Redirecting to: http://proxy.lib.umich.edu/login?url=http://infoweb.newsbank.com/iw-search/we/InfoWeb?P_topdoc=value
Search Results: Redirecting to: http://login.umdl.umich.edu/cgi/proxy-session-init?url=http://infoweb.newsbank.com/iw-search/we/InfoWeb?P_topdoc
Search Results: Redirecting to: https://login.umdl.umich.edu/enter-password?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Redirecting to: https://login.umdl.umich.edu/cgi/cosign/proxy?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Redirecting to: https://weblogin.umich.edu/?cosign-login.umdl=O8Q3+GrksgYD4nHKKVCDF+YJp01x8fuWjp19lzOjhUHzzPlpNoz3IimS6o61mbnNAQ9GmdG6BusSmaRujzDb3yOfVxZc1aEaP3kTjKLl5U35Nc0Iy8mql74LywB2;&https://login.umdl.umich.edu/cgi/cosign/proxy?aHR0cDovL2xvZ2luLnVtZGwudW1pY2guZWR1L2NnaS9wcm94eS1zZXNzaW9uLWluaXQ/dXJsPWh0dHA6Ly9pbmZvd2ViLm5ld3NiYW5rLmNvbS9pdy1zZWFyY2gvd2UvSW5mb1dlYj9QX3RvcGRvYw==
Search Results: Processing scripts before all pattern applications.
Search Results: Extracting data for pattern "Untitled Extractor Pattern"
Search Results: The pattern did not find any matches.
Search Results: Processing scripts after all pattern applications.
Search Results: Extracting data for pattern "Untitled Extractor Pattern"
Search Results: The pattern did not find any matches.
Search Results: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Scraping file: "Details Page"
Details Page: Skipping this scrapeable file because the URL field is empty.
Processing scripts after scraping session has ended.
Scraping session "Newsbank" finished.