screen-scraper public support

Questions and answers regarding the use of screen-scraper. Anyone can post. Monitored occasionally by screen-scraper staff.

Ignoring HTML in scraping session results

Hi,

I am able to scrape web pages using screen-scraper, but because there are variations between the pages I am scraping (from the same site), I have to make the sub-extractor patterns wider than I want, to make sure I don't miss data in a paragraph that may contain lots of p tags, for example.

The result is that when I get my data, I have to do a lot of cleaning up (for example, using find and replace to delete HTML code like h1, div, br, etc. ...
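
One way to avoid the manual find and replace is to strip the tags in code after extraction. The following is a minimal sketch in plain Java, assuming a simple regular expression is good enough for the pages involved; the class name HtmlCleaner and the method stripTags are hypothetical and not part of screen-scraper. It does not decode entities such as &nbsp;, and a regex is only a rough substitute for a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes anything that looks like an HTML tag from extracted text.
class HtmlCleaner {
    // Matches <p>, </div>, <br/>, <h1 class="x"> and similar.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Matches runs of whitespace left behind once the tags are gone.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Replace each tag with a space so adjacent words do not run together.
        String noTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the extra whitespace and trim the ends.
        return SPACES.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "<h1>Title</h1><div><p>First line<br/>second line</p></div>";
        System.out.println(stripTags(raw)); // prints: Title First line second line
    }
}
```

With something like this, the sub-extractor patterns can stay wide and the cleanup happens once per extracted value.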

[Solved] Scraping a dynamic web page

Dear Community, dear Jason,

I would like to scrape data from a specific URL (https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android/backers). Thanks to the fantastic tutorials, I'm now able to scrape exactly what I need (surname and number of funded projects). I need this data to compile statistics on the distribution of male/female donors on this page and the average number of funded projects for each gender.

looping the scrape changing the URL each time

OK, so this is part 2 of my previous post about scraping with the current date. I would like to scrape the same URL over and over, changing part of the URL each time in a loop. This way I only capture 1 proxy server URL and create 1 session and 1 scrapeable file.

Here is a sample to show what I am trying to do. I would like to change the 3 letters in CAPS before the date. The captured URL stays the same, but maybe some type of script, similar to the current-date one shown in the previous post but looping, could scrape once for each change.
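
The looping idea can be sketched in plain Java as follows. This only builds the list of URLs; how each one is handed to the scrapeable file depends on screen-scraper's scripting API and is not shown. The codes ABC, DEF and GHI, the class name UrlLoop, and the URL layout (borrowed from the sample in the earlier date post) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds one URL per 3-letter code.
class UrlLoop {
    // Assumed URL layout: fixed prefix, then the code, then the date, then a fixed suffix.
    static List<String> buildUrls(List<String> codes, String date) {
        List<String> urls = new ArrayList<>();
        for (String code : codes) {
            urls.add("http://www.thesite.com/static/entry/" + code + date + "test.html");
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical codes; replace with the real 3-letter values.
        List<String> codes = Arrays.asList("ABC", "DEF", "GHI");
        for (String url : buildUrls(codes, "052814")) {
            // In a real session, each URL would be requested here instead of printed.
            System.out.println(url);
        }
    }
}
```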

Captured URL

scrape file with today's date

My scrapeable URL has the current date in it. Everything else is static. How can I grab that portion of the URL (maybe make it a GET parameter?) and change it to today's date so that it brings in that HTML? I've seen bits of ideas for this in my search, but I am not sure how to put them together. Thanks.

Here is a sample URL to show what I am trying to do. The date of 052814 is always the current date.

http://www.thesite.com/static/entry/xyz052814test.html
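
A minimal sketch of the date substitution in plain Java, assuming 052814 means month, day and two-digit year (MMddyy). The class name DatedUrl and the method urlFor are hypothetical; in screen-scraper the resulting string would presumably be stored in a session variable and referenced from the scrapeable file's URL, which is not shown here.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: rebuilds the sample URL with a given date in MMddyy form.
class DatedUrl {
    // Assumed format: two-digit month, two-digit day, two-digit year.
    private static final DateTimeFormatter MMDDYY = DateTimeFormatter.ofPattern("MMddyy");

    static String urlFor(LocalDate date) {
        // Everything except the date portion is static, as in the sample URL.
        return "http://www.thesite.com/static/entry/xyz" + date.format(MMDDYY) + "test.html";
    }

    public static void main(String[] args) {
        // Uses the machine's current date; the site's time zone may differ.
        System.out.println(urlFor(LocalDate.now()));
    }
}
```

Passing LocalDate.of(2014, 5, 28) to urlFor reproduces the sample URL above.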

Directly-Observed Page <> Display Response in Browser?

It's finally dawned on me that my extractor is actually returning the right information - as it sees it.

The weirdness is that what it sees differs from what my browser sees when I aim it at the extractor's URL within seconds of the extract.

What my browser sees: http://tinyurl.com/n99oqgh

What ScreenScraper's extractor sees (same URL): http://tinyurl.com/pxxzwrm

Two of the data diffs are in "Avg Wind:" - direction and speed.

Could ScreenScraper's extractor be working out of a cache?

Something else?

Multiple Scrapable Files Followed By Script?

I've got a scraping session that runs A-OK, extracting multiple values from a single web page.

After the session runs, a script runs that does a series of out.writes using session.getVariables.

Now I have added a second scraping session - against a different web page.

The session populates the variable, no problem.

But I am unable to tell the script not to run until both scraping sessions have finished. I stumbled around and managed to get the script to run twice - once after each session - but that doesn't sound like the Good-Right-And-Holy-Path....

screen scraper with captcha breaker

Hi,

I would like to use screen scraper with captcha breaker in order to automate captcha solving.

Captcha Breaker runs as a webserver listening on 192.168.2.120 Port 80.

When I test Captcha Breaker with Firefox in normal mode, it can solve a captcha.

But with screen-scraper (in either proxy or scrapeable mode), Captcha Breaker cannot solve any captchas.

Captcha Breaker lists all processed captcha images in the lower main window of the GUI.

With screen-scraper, not even a proper image of the captcha appears in the list.

Scrape within a scrape: SCRAP-CEPTION!

Hi,

Sorry, I am not much of a programmer so I am referencing from the "Manual Data Extraction" page using apples =).

From 100 apples (level 1 categories), I am trying to pick out just 5 apples (5 x level 1 categories).
But instead of gathering all 5 apples, is it possible to get Screen-Scraper to:

grab 1 of 5 apples,
go to 1 of 5 apples: level 2 sub-category,
grab 2 of 5 apples
go to 2 of 5 apples: level 2 sub-category,
etc.

Using the example scripts provided on the "Manual Data Extraction" page, it is currently doing this:
DATARECORD 1:
grab 1 of 5 apples

pages with infinite scroll

Hi everyone,

Is it possible to scrape pages with infinite/reactive scroll (e.g. where extra content loads as you scroll down the page)?

Specifically is there a general way to instruct screen scraper to "scroll down" or do you have to fake a "load more stuff" event request as per how the site is actually doing it?

Thanks in advance,

Dan

BufferedReader (Read Search terms tutorial)

I've been looking for a long time now at a way to remove a list of words from a string.

I've been using:
.replaceAll("Remove this", "");

The problem is that my list of words to remove is large, and I also have to use that list multiple times.

I've created a script from the Read Search Terms script which works great. That way when I find a word I need to be removed I can post it to my .txt file without any code.

My question/problem is hopefully an easy one. I've been using (?i) to make it case-insensitive, meaning it will remove caps and non-caps
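
For reference, here is a minimal plain-Java sketch of removing a whole list of phrases in one pass, case-insensitively. It assumes the .txt file holds one phrase per line; the class name WordRemover, its methods, and any file path passed to loadPhrases are hypothetical. Pattern.CASE_INSENSITIVE has the same effect as the (?i) prefix, and Pattern.quote keeps characters such as . or ( in a phrase from being read as regex syntax.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: strips every listed phrase from a string, ignoring case.
class WordRemover {
    // Reads one phrase per line from a text file (the path is supplied by the caller).
    static List<String> loadPhrases(Path file) throws IOException {
        return Files.readAllLines(file);
    }

    // Builds a single case-insensitive pattern that matches any listed phrase.
    static Pattern buildPattern(List<String> phrases) {
        List<String> quoted = new ArrayList<>();
        for (String phrase : phrases) {
            String trimmed = phrase.trim();
            if (!trimmed.isEmpty()) {
                quoted.add(Pattern.quote(trimmed)); // match the phrase literally
            }
        }
        return Pattern.compile(String.join("|", quoted), Pattern.CASE_INSENSITIVE);
    }

    // Removes all matches, then collapses the whitespace left behind.
    static String remove(String text, Pattern pattern) {
        return pattern.matcher(text).replaceAll("").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // In-memory list for the demo; loadPhrases could supply it from a file instead.
        Pattern pattern = buildPattern(Arrays.asList("Remove this", "and that"));
        System.out.println(remove("Please REMOVE THIS now And That too", pattern));
    }
}
```

Because the pattern is compiled once, it can be reused for every string that needs cleaning instead of chaining many replaceAll calls.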