screen-scraper support for licensed users

Questions and answers regarding the use of screen-scraper. Only licensed Professional and Enterprise Edition users can post; anyone can read. Licensed users please contact support with your registered email address for access. This forum is monitored closely by screen-scraper staff. Posts are generally responded to in one business day.

Problems with Website using Javascript

Hi,

I have been successfully scraping a website for some time now, but recently I get a message of the "Javascript has to be turned on..." type. Now, I know that SS does not run JS, so I have started to analyse what is going on "behind the curtains". Several JS files are sent as responses and calculate cookies and so on. I am now busy trying to emulate all this using JS SS scripts, so far without success.

504 Error only when scraping session run from cron

Hi.

A scraping session that is ready for production use works seamlessly on my local Mac. When I export it and try to run it on a Linux EC2 instance, this very strange thing happens:

If I remotely access the Linux server over ssh from a terminal window and run it directly with the following command, the scrape runs as it is supposed to, with no errors:

sh /home/myuser/screen-scraper_pro/myscrapeshellscript.sh

myscrapeshellscript.sh just contains:

cd /home/myuser/screen-scraper_pro
jre/bin/java -jar screen-scraper.jar -s "My Scraping Session"

Tor & Polipo processes left running on memory

Hello.

I am using Tor and Polipo from screen-scraper (with the Java library that you guys kindly provided in another post) successfully.

scrapable files with same URL present different patterns in screen-scraper

We have run into an example where we created the extractor patterns for a scrapable file, but the patterns do not match when the same URL is called programmatically while running the scraping engine.

We can even cut and paste the URL (the page URL that is generated programmatically) from the logs into the first scrapable file and see that the original extractor patterns still work. However, they don't work in the file scraped during the scraping session.

forum structure traversing question

We need to scrape a forum that has a forum list where each forum (in the list of all forums) has the following structure: www.siteaddress/forum_identifier.html. However, each page of threads in the individual forum (after the first page) has the structure www.siteaddress/forum_identifier_site_identifer_index_number.html.
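The two URL shapes described above could be generated by a small helper along these lines. This is only a sketch: `ForumPages` and `pageUrl` are hypothetical names, and it assumes that index_number is simply the page number, which the post does not confirm.

```java
final class ForumPages {
    // Builds the URL for a given page of a forum.
    // Page 1 uses the short form; later pages append the site identifier
    // and the page index (assumed here to equal the page number).
    static String pageUrl(String siteAddress, String forumId, String siteId, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        if (page == 1) {
            return siteAddress + "/" + forumId + ".html";
        }
        return siteAddress + "/" + forumId + "_" + siteId + "_" + page + ".html";
    }

    public static void main(String[] args) {
        // Placeholder values mirroring the pattern in the post
        for (int page = 1; page <= 3; page++) {
            System.out.println(pageUrl("www.siteaddress", "forum_identifier", "site_identifier", page));
        }
    }
}
```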

check for string value of session variable

Hi.

Is there a problem with the if syntax below?

(having previously set the value of the STATUS variable to either "ON" or "OFF")

if (session.getv("STATUS") == "ON"){

     .. do this;
     .. and do that;   
                       
}
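
One likely culprit, assuming the script is written in Interpreted Java: `==` on two strings compares object references rather than their contents, so the test can come out false even when the variable holds "ON". A minimal sketch of the difference follows. The `Map` is only a stand-in for the session's variable store, and `isOn` is a hypothetical helper, not part of the screen-scraper API.

```java
import java.util.HashMap;
import java.util.Map;

final class StatusCheck {
    // Hypothetical helper: true only when the stored value is the string "ON".
    // Calling equals on the literal also copes with a missing (null) variable.
    static boolean isOn(Object value) {
        return "ON".equals(value);
    }

    public static void main(String[] args) {
        Map<String, Object> vars = new HashMap<>();
        // Build the value at run time so it is a distinct String object
        vars.put("STATUS", new String("ON"));

        Object status = vars.get("STATUS");
        System.out.println(status == "ON"); // reference comparison
        System.out.println(isOn(status));   // content comparison
    }
}
```

In the script above, the condition would then read `if ("ON".equals(session.getv("STATUS")))`.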

problem with unwanted tags inside text extracted

Hi. I am scraping the posts in a discussion forum: not the content, but the post titles, dates, users, etc. My extractor pattern looks roughly like this:

                   <span id="~@DUMMY@~">#~@POSTID@~</span>
                    &nbsp;
                </td><td>
                    <a id="~@DUMMY@~" href="~@DUMMY@~">~@POSTTITLE@~</a>
                </td><td style="white-space:nowrap;">
                    <a id="~@DUMMY@~" href="/boards/profilea.aspx?user=~@USERID@~">~@DUMMY@~</a>
                </td><td style="white-space:nowrap;">~@DATEOFPOST@~</td>
                </tr>
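
If the unwanted tags end up inside a captured value such as POSTTITLE, one option is to strip them after extraction. The following is a minimal sketch, assuming a simple regular expression is good enough for this markup. `TagStripper` and `stripTags` are hypothetical names, and this is a plain-Java illustration rather than a built-in screen-scraper feature.

```java
import java.util.regex.Pattern;

final class TagStripper {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Removes anything that looks like an HTML tag, then collapses
    // runs of whitespace into single spaces and trims the ends.
    static String stripTags(String extracted) {
        if (extracted == null) {
            return "";
        }
        String noTags = TAG.matcher(extracted).replaceAll("");
        return SPACES.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample value, not taken from the real forum
        System.out.println(stripTags("<b>Re:</b> my   <i>post</i>"));
    }
}
```

A regex like this does not handle every HTML edge case (for example, a `>` inside an attribute value), so it is best treated as a cleanup step for simple inline tags.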

Read from file (large amount of data)

Hello,
Please help me understand how I should set up the scraper.
I have a .txt file with around 500,000 URLs that I need to scrape.

I am thinking of the following: I make one scrape whose job is to read in the .txt input and loop, launching a separate scrape that gets the data for each line, all using RunnableScrapingSession.

Is this a good solution for such a large amount of data, or can you suggest something better?

I really appreciate your help.
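
Whatever ends up launching the per-URL scrape, the input file itself is probably best streamed one line at a time rather than loaded whole. A rough sketch follows, assuming one URL per line. `UrlFeeder`, `forEachUrl`, and the handler are hypothetical names; the handler is the place where a call into RunnableScrapingSession would go.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Consumer;

final class UrlFeeder {
    // Reads line by line and hands each non-blank, trimmed line to the handler.
    // Returns how many URLs were handed over.
    static long forEachUrl(BufferedReader reader, Consumer<String> handler) {
        long count = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String url = line.trim();
                if (url.isEmpty()) {
                    continue; // skip blank lines
                }
                handler.accept(url);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Convenience overload that opens the file and closes it when done.
    static long forEachUrl(Path file, Consumer<String> handler) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return forEachUrl(reader, handler);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java UrlFeeder.java <file-with-one-url-per-line>");
            return;
        }
        // The handler only prints here; a real one would start a scrape per URL
        long n = forEachUrl(Paths.get(args[0]), url -> System.out.println("would scrape " + url));
        System.out.println(n + " URLs processed");
    }
}
```

Only one line is held in memory at a time, so the size of the file matters much less than how long each individual scrape takes.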

navigating a multipage forum

I am attempting to scrape a forum (healingwell.com) that has multiple pages in some subforums - but no obvious cues that there are additional pages. In other words no "next" or "previous" patterns to use as cues to augment the page counter in the URL.

The best way to augment the page counter that I have come up with is the following:

Since each page in the subforum contains links to the next few pages and links to the last pages in the form of page numbers (i.e.

screen scraper permissions issue in Linux

Hello,

I recently installed screen scraper pro on an Ubuntu Amazon EC2 instance. I had managed to run it successfully a few times, but now when I connect to the desktop with Microsoft Remote Desktop and try to run the workbench I get a message box saying:

Write access error:

In order for screen-scraper to function properly , please ensure  you have write access to the folder in which screen-scraper is installed, as well as all of its sub-folders.

For example, I found that you don't have write access to the following file(s):