can I programmatically click multiple links on same page
I'm able to get to a secure page through screen-scraper, but once I'm there, the page has many links that dynamically expand its content. For example, if there are 20 links, I need to click all 20 in order to reveal the final content to be scraped from that page.
Is this possible? tia
can I programmatically click multiple links on same page
Pars,
We're not sure what is causing this error. We've had recurring problems using VBScript with screen-scraper, but it's never been fully clear what the causes are.
One thing we suspect is that calling a script from within a script multiple times can cause instability. We recommend that you not use VBScript if you have a choice.
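In Interpreted Java, for instance, writing a large chunk of HTML out to a file is straightforward. Here's a minimal sketch, assuming the table was stored in a session variable named EXPANDED_HTML (that name and the output path are placeholders; substitute your own):

// Minimal Interpreted Java sketch: write a large session variable to a file.
// EXPANDED_HTML and the output path are placeholders -- use your own values.
import java.io.FileWriter;

String html = (String)session.getVariable("EXPANDED_HTML");
FileWriter out = new FileWriter("C:/scrapes/expanded.html");
out.write(html);
out.close();
session.log("Wrote " + html.length() + " characters to file.");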
If you must stick with VBScript, the best we can do is wish you luck,
Scott
can I programmatically click multiple links on same page
Thanks for the info
It's good to know I can build the entire URL in the Properties tab.
Next question...is there a limit to the size of a session variable?
Currently I'm storing the HTML table (after ALL the clicks are expanded) into a session variable. Each time I click, the table on the page gets bigger. When I try to print it out at the end, I get the following in my log...
Processing script: "write expanded HTML to file"
An error occurred while processing the script: write expanded HTML to file
The error message was: ActiveScriptEngine.cpp.,1231:Unhandled C++ exceptiong
I'm using VBScript to do this. Trying a proof of concept before attempting the sub-extractors. How can I write large chunks of HTML to a file? If I can, I may even stop there and just convert the HTML table that gets written out to a delimited file to import into a db.
continued thanks
Pars
can I programmatically click multiple links on same page
Pars,
Ok, I think I'm understanding the situation better. We ran into this once before. screen-scraper is always going to URL-encode a variable that is used as a POST or GET parameter, since that's what [url=http://www.ietf.org/rfc/rfc2396.txt]Tim Berners-Lee[/url] said to do back in 1998.
Within screen-scraper it is unnecessary to include a reference to a bookmark (an "a name" anchor) in your URL, because it won't affect the server's response, so we don't account for the possibility that one would be included as part of a POST or GET parameter. We assume the entire variable is meant as a POST or GET parameter (not just a portion of it); that's why the bookmark portion gets URL-encoded along with the rest of the variable.
So, my advice is to extract only the parameter portion of what follows positionID= and not worry about the bookmark.
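If it's easier to do the trimming in a script than in the extractor pattern itself, a sketch like this would work (it assumes the extracted value was saved to a session variable named POSITION_ID, as elsewhere in this thread):

// Drop the bookmark portion if the extracted value still contains it.
String positionId = (String)session.getVariable("POSITION_ID");
int hash = positionId.indexOf("#");
if (hash != -1) {
    session.setVariable("POSITION_ID", positionId.substring(0, hash));
}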
If you really, really need/love/want to include the bookmark in the URL you can always bypass the parameters tab and simply construct the URL under the properties tab.
Leave the parameter's tab empty and have this be your URL:
https://ssl.myDomain.com/data/control?action=expand&positionId=~#POSITION_ID#~
screen-scraper will pass this on to the server with the variable untouched.
As for why IE will alter the case and Firefox won't, well, um... not sure.
-Scott
can I programmatically click multiple links on same page
I'm aware of the bookmark/positioning that # provides. At this point I was assuming my target site may be doing things to make it difficult to scrape. For example, if I replace # with %23 directly in the browser, I get the error page.
So back to one of my earlier points: I was extracting the URL param (with the # sign) just fine, so my session variable holding it already has the # sign. Your suggestion to do the replaceAll("%23", "#") doesn't apply here, because that escape sequence isn't there to be replaced.
But when the page is requested through screen-scraper, the # (seemingly) gets encoded automatically, so it seems like a catch-22 to me.
However, this morning, I realized this wasn't my actual problem. Amazing how often this happens when building something. I've been testing so many variations, with # sign, without # sign, etc.
It turns out that the site I'm scraping uses the param "positionId", where I was using "positionID" in screen-scraper. Note the case difference in the last character. Are you kidding me?
If you change the case directly in IE, the site works fine. I ran several tests, such as making all the URL params upper case, and the site corrects them back to the case shown in the headers and resolves.
Then I ran the same tests in Firefox, and the site generates an error message.
If the case is different in screen-scraper's parameter names, it does NOT work. I've tried setting the user-agent to both browsers. How is that even possible? Is this related to Windows vs. Unix case sensitivity?
can I programmatically click multiple links on same page
Pars,
It's very likely that you will not need to include the # sign in your URL. The only purpose of the # sign in a URL is to set the position of the browser window when the page loads. It's sometimes called a "bookmark" (not to be confused with Favorites/Bookmarks) and makes use of the "name" attribute of an anchor (a) tag.
http://www.w3schools.com/HTML/html_links.asp
If you omit it from a URL it should not affect the data that is returned to you. So, if you modify your extractor pattern for positionID so that it no longer grabs the # and the text following it, your current problem should be solved.
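For example, if your pattern currently looks something like positionID=~@POSITION_ID@~ (just an illustration, not your actual pattern), you can set a regular expression such as [^#&"]+ on the POSITION_ID token so the match stops before the # sign.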
However, if you needed to retain the # sign for some reason, you would place my sample code in the script where you're calling the scrapeable file.
Right now you have something like...
session.scrapeFile("My Scrapeable File");
And now you'll have:
// Pull the value from the session, un-encode the #, and store it back.
positionID = session.getVariable("POSITION_ID");
positionID = positionID.replaceAll("%23", "#");
session.setVariable("POSITION_ID", positionID);
session.scrapeFile("My Scrapeable File");
Earlier I was thinking of a different feature in the other editions of screen-scraper that might have helped with this. You have the option of converting HTML entities at the time an extractor pattern token is applied, but here we're dealing with URL encoding, not HTML entities. My bad.
-Scott
can I programmatically click multiple links on same page
Scott
Thanks for the reply, but I don't think this will work for me.
When I'm at my page that contains all the links I want to click to expand, for example there could be 5 links like these:
...
My extractor is currently finding them and loading "action" and "positionID" into session vars. Note that positionID will already contain the # sign.
As suggested in this thread, I have a script that calls each of these links "after each pattern application". That script just calls another scrapeable file that is built as such:
Properties tab:
  URL: https://ssl.myDomain.com/data/control
  (invoked manually from the script)
Parameters tab:
  action      ~#ACTION#~        sequence 1   GET
  positionID  ~#POSITION_ID#~   sequence 2   GET
The positionID session var should be the way I want it at this point, no?
So there are no "%23" patterns for me to replace. Then when I run it, screen-scraper automatically encodes the # sign; that is what I want to prevent.
Would I have to build this differently to be able to even do a replace?
And could you elaborate on your initial statement in previous post..."There are some controls in the professional and enterprise editions that may allow you to fix this from within the screen-scraper interface"?
thanks for your continued help
Pars
can I programmatically click multiple links on same page
Pars Ethis,
There are some controls in the professional and enterprise editions that may allow you to fix this from within the screen-scraper interface. But in all editions you can use the replaceAll() method on the URL string to swap the %23 for a # before setting the value to a session variable to be used as all or part of the URL.
Something like....
urlString = urlString.replaceAll("%23", "#");
session.setVariable("myURL", urlString);
Hope that helps,
Scott
can I programmatically click multiple links on same page
After going through everything line by line, I think I have found the actual problem. My extractor pattern is correctly identifying and grabbing the links that I need to click to expand the page. It turns out that the links all have a pound sign in them...
eg https://ssl.myDomain.com/data/control?action=expand&positionID=1234-0001-L#Row1234-0001-L
When my script goes through each link, screen-scraper converts the # sign to %23 in the URL. Sending the request with the escaped character is what generates the error.
How can I prevent the # from being escaped?
thanks
Pars
can I programmatically click multiple links on same page
Pars Ethis,
The best and simplest approach is to take each request you've recorded using Live HTTP Headers ([url=http://www.xk72.com/charles/]Charles Proxy[/url] is great, too) one by one and compare it to the request you're making in screen-scraper. Make sure you are not missing anything.
Do this for each request in the process, and resist moving on to the next request until you have the first one working.
-Scott
can I programmatically click multiple links on same page
I've created the extractor pattern for the expand links and the corresponding script to click them. Based on the log, it seemed as if it was working... the problem I'm seeing is that when all the clicks are completed (very quickly), I end up on an error page. So I downloaded/installed the professional version to use the session.pause function (roughly as sketched below). That did not help. Neither did changing the extractor to match only one of the many links, to see if it would work on a single click.
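For reference, the pause looked something like this (the 2000 ms delay and the "Expand Link" scrapeable file name are placeholders for my actual values):

// Pause before each simulated click, then make the request for the link.
session.pause(2000);
session.scrapeFile("Expand Link");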
So now I'm in research mode, trying to figure out a way to determine what is actually happening differently from a real browser session. I cannot record with the proxy (corporate firewall), so I've been using Firefox's Live HTTP Headers extension (among other things) to capture what is happening.
can I programmatically click multiple links on same page
tia,
Let me suggest a few ideas for you to explore. The first thing you'll probably need to do is create an extractor pattern for all the links on the page that, when clicked, expand for more content. Once you have this extractor pattern matching each of the links, you'll then create a corresponding script that makes the request to the server that clicking the link would otherwise make. This usually means passing POST/GET parameters to another scrapeable file with, probably in your case, a URL similar to the one the links were extracted from. Set your extractor pattern to call this script "After each pattern application".
What this does is create a recursive loop that calls the script and visits the link for every link you've designated on the page. Once it has visited all of the links in the matching extractor pattern's dataSet, it ends up where it began--at the extractor pattern that started the loop.
Now, while it is visiting each of the links on the page you can have it do additional things like extract data from the expanded content and write that data out to a file. It will do these additional tasks for each of the links and will always return to where it began when the loop has completed.
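To make that concrete, here's a minimal sketch of what the "After each pattern application" script might look like in Interpreted Java. The token names ACTION and POSITION_ID and the scrapeable file name "Expand Link" are placeholders; in a script invoked after each pattern application, the current match is available as dataRecord.

// Runs once per matched link. Copy the current match's tokens into
// session variables, then request the scrapeable file that performs the "click".
session.setVariable("ACTION", dataRecord.get("ACTION"));
session.setVariable("POSITION_ID", dataRecord.get("POSITION_ID"));
session.scrapeFile("Expand Link");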
Give this a try and post back with any questions you have.
-Scott
can I programmatically click multiple links on same page
The content is not there until the "expansion" links are clicked, as each click triggers a server request. There will be a variable number of links per user account. I have recorded a few clicks through the proxy and see that each link contains the string "&ugl_expand_position&" when collapsed and "&ugl_collapse_position&" when expanded.
Here is a basic example of how the chart looks...
at initial page load...
+ AEG - 185.000 $2,699.15
+ CFC - 185.000 $1,141.45
+ DELL - 155.000 $3,168.20
+ EDS - 70.000 $91.80
after clicking the + icon for stock symbol AEG...
- AEG - 185.000 $2,699.15
08/02/06 182.000 $2,655.38
08/10/07 3.000 $43.77
+ CFC - 185.000 $1,141.45
+ DELL - 155.000 $3,168.20
+ EDS - 70.000 $91.80
Seems like I'll need to loop over the chart and click x number of links per account. Is this scriptable? If so, can you provide a sample or tutorial example?
thanks
Pars
(first time user of product)
can I programmatically click multiple links on same page
tia,
It depends on how the additional content is made available on the page. If the content is always there but just hidden via CSS, then you would need to understand the JavaScript logic that determines which data to show when the user clicks. This might be challenging.
If, for each click, the data is loaded from the server (even via AJAX), you will need to record the clicks via the screen-scraper proxy and possibly create a scrapeable file for each click. This is less challenging because the logic is handled by the server.
I hope this helps,
Scott