Saving the URL Redirect in a CSV File
How would I save a redirect URL to my CSV file? For example, the link to the Agent's website via Trulia is
http://www.trulia.com/transfer.php?s_id=10390477&p_id=1077568477 which is redirected to http://listings.listhub.net/pages/DMAARIA/344485/?channel=trulia but can be reduced to http://listings.listhub.net/pages/DMAARIA/344485
I would like to save http://listings.listhub.net/pages/DMAARIA/344485 to my CSV file. Thanks in advance for your help.
So, do you have variables
So, do you have variables that you put into that first URL's parameters (which then resolves and redirects)?
You'll have to use a script to figure out the string value of that first URL.
This is kind of non-public API stuff, but you could maybe do something like this (haven't tested):
String rawURL = scrapeableFile.getURL();
String resolvedURL = session.getScrapingSessionState().resolveVariables(rawURL);
Notice that I'm using 'getURL()' instead of 'getCurrentURL()'. Then, I pass that into screen-scraper's variable parser (which fixes up those '~#var#~' notations). Assuming you haven't cleared out any variables, I would imagine that this would return to you the scrapeableFile's actual address, previous to any redirects (at least, it returns what your scrapeableFile's hard-coded 'url' field is).
Let me know if that helps at all,
Tim
Yes, I have the variables...
Hi Tim,
I’m able to extract and save in session the variables necessary to go to the Agent’s website via Trulia (http://www.trulia.com/transfer.php?s_id=10390477&p_id=1077568477) but do not know what to do next in order to save the redirect URL to my CSV file. Thanks for your help.
What does your URL/parameters
What does your URL/parameters tabs look like? I'm assuming your parameters tab does something like this:
(the names of those variables, though, would be whatever you've named them.)
If so, then make a script which will write out your URL to a CSV. If you already have one, then just add the following to it.
// Interpreted Java
String beforeRedirect = scrapeableFile.getURL() + "?s_id=" + session.getVariable("S_ID") + "&p_id=" + session.getVariable("P_ID");
That string is the URL you're after.
To write (anything) to a file, just use the basic example found here: http://community.screen-scraper.com/script_repository/Write_to_CSV , which is found in the "Tips, Tricks & Samples" section, under the "Script Repository". The only thing special about a CSV is that each line should end with a "\n", and items should be separated by a "," and should always be wrapped in a pair of quotes. (To do quotes in Java, you can't type
""mydata","moredata""
because it's ambiguous which quotes pair with which. Instead, put a backslash in front of any quote that you want to appear in your output:
"\"mydata\",\"moredata\""
)
Tim
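The quoting rule described above can be sketched as a small helper in plain Java. This is only an illustration: the class and method names (CsvLineSketch, quote, toLine) are made up for this example and are not part of screen-scraper, and doubling any quote that appears inside a field is the usual CSV convention rather than something stated in this thread.

```java
import java.util.List;

// Hypothetical helper, not a screen-scraper class.
final class CsvLineSketch {

    // Wrap one field in double quotes; a quote inside the field is doubled
    // (common CSV convention, assumed here). A null field becomes "".
    static String quote(String field) {
        String safe = (field == null) ? "" : field;
        return "\"" + safe.replace("\"", "\"\"") + "\"";
    }

    // Join the quoted fields with commas and end the line with "\n".
    static String toLine(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(quote(fields.get(i)));
        }
        return line.append('\n').toString();
    }
}
```

Quoting every field also keeps a value that contains a comma (a price such as 329,900, for instance) in a single column. In a script, the returned string would be what gets passed to out.write(...).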
I’m very grateful for your
Hi Tim,
I’m very grateful for your help, as you might suspect I have limited programming knowledge. I have read your reply about 20 times and every time I try to implement your instructions I get an error message in my log. Since purchasing Screen-Scraper (SS) I feel I have learned so much, but still have so much to learn. Please be patient with me, I really want to be proficient in SS.
I did not create a URL/parameters tab. I extracted the "BROKER_ID" and the "PROPERTY_ID" and saved them in a CSV file, and then manually clicked on the URL in the CSV file, which redirected me to the URL I’m trying to save. I saved the Agent’s website via Trulia in the CSV file with the following:
out.write( "http://www.trulia.com/transfer.php?s_id="+ session.getVariable( "BROKER_ID" )+ "&p_id="+ session.getVariable( "PROPERTY_ID" )+ "," );
Should I create a new scrapeable file for the Broker’s website via Trulia?
If so, should the URL be the following?
http://www.trulia.com/transfer.php?s_id=~#BROKER_ID#~&p_id=~#PROPERTY_ID#~
If so, should I have this scrapeable file invoked manually from a script?
If so, should the script be the following?
session.scrapeFile( "Broker Link" );
If so, should the Broker Link Parameters be the following?
Value: ~#BROKER_ID#~ Sequence: 1 Type: GET
Value: ~#PROPERTY_ID#~ Sequence: 2 Type: GET
In order to save it in the CSV file would I create the following script?
outwrite( String beforeRedirect = scrapeableFile.getURL("Broker Link") + "?s_id=" + session.getVariable("BROKER_ID") + "&p_id" + session.getVariable("PROPERTY_ID")+ "," );
out.write( "\n" );
Any suggestions will be helpful and thanks again for your help!!!
You're very close
Adrian,
You're very close with this one. I believe you should only need to change the last part of your sequence. When called /after/ a scrapeable file is scraped, the scrapeableFile.getCurrentURL() method (that is, getCurrentURL, not getURL) will return the very last URL that gets resolved. So, in your case it will return...
http://listings.listhub.net/pages/DMAARIA/344485/?channel=trulia
To see it work, create a script with just this one line.
session.log("**" + scrapeableFile.getCurrentURL() + "**");
Create a scrapeable file just like you suggest using your broker url (http://www.trulia.com/transfer.php) and two get parameters (s_id and p_id). Then, call your new script from that scrapeable file and run it /after/ the file is scraped.
It should write the URL you're after out to the log.
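Since the original question asked for the shorter form of the address (without "?channel=trulia"), here is a minimal plain-Java sketch of trimming the logged URL. The class and method names (RedirectUrlSketch, stripQuery) are hypothetical; in a script, the argument would be whatever scrapeableFile.getCurrentURL() returns.

```java
// Hypothetical helper, not a screen-scraper class.
final class RedirectUrlSketch {

    // Return the URL without its query string or fragment, and without a
    // single trailing slash (the slashes in "http://" are left alone).
    static String stripQuery(String url) {
        int cut = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            cut = query;
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0 && fragment < cut) {
            cut = fragment;
        }
        String base = url.substring(0, cut);
        if (base.endsWith("/") && !base.endsWith("://")) {
            base = base.substring(0, base.length() - 1);
        }
        return base;
    }
}
```

For the example in this thread, stripQuery("http://listings.listhub.net/pages/DMAARIA/344485/?channel=trulia") gives "http://listings.listhub.net/pages/DMAARIA/344485".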
It might be tricky to append the URL to the current line of your CSV, but so long as everything's running in sequence you could close out your script and instead of writing your "beforeRedirect" string, call...
session.executeScript("Broker Link");
...where your Broker Link script contains...
session.scrapeFile("Broker Link");
...then follow it all up with a script that terminates the current CSV line.
out.write( "\n" );
It's too bad you can't accomplish all of this in a script, but you need that scrapeable file to follow the redirects for you.
Also, I noticed that you mentioned previously that when you tried to implement Tim's suggestion that screen-scraper produced an error. Next time, if you could, it's very helpful if you include what the error message was along with the steps you took to generate the error.
Hope this all makes sense.
-Scott
*If you ever need to export your scraping session, be sure to manually export any scripts being called using the executeScript method (unless they're also being called via a scrapeable file or an extractor pattern).
As you suggested I'm trying to resolve
As you suggested, I'm trying to resolve the redirect before saving it in the CSV file. I'm getting an error while connecting to the redirect. I noticed that the Resolved URL has a double entry for the BROKER_ID and PROPERTY_ID (&key=36614&key=1079187204), but if I manually copy and paste it into Internet Explorer's address bar I can get to the redirected website. Below is the portion of the log for the Broker Link.
Broker Link: Processing scripts before a file is scraped.
Broker Link: Preliminary URL: http://www.trulia.com/transfer.php?s_id=~#BROKER_ID#~&p_id=~#PROPERTY_ID#~
Broker Link: Using strict mode.
Broker Link: Resolved URL: http://www.trulia.com/transfer.php?s_id=36614&p_id=1079187204&key=36614&key=1079187204
Broker Link: Sending request.
Broker Link: Redirecting to: http:
Broker Link: An error occurred while connecting to 'http:'. The message was Connection refused: connect.
Broker Link: Processing scripts after a file is scraped.
Processing script: "Broker Link Log"
**http:**
Thanks in advance for your help!!!
Doubled-up GET parameters
Adrian,
Sorry for the delay. I have good news and good news. The first good news is you may have discovered a bug in screen-scraper (we like when our users reveal bugs so we can fix them ;). The second piece of good news is a slight correction in your approach should make this work just fine.
The bug I'm referring to is the error message you're seeing. "An error occurred while connecting to 'http:'." That just ain't right and we'll look into it.
To correct your scrape take a look under the parameters tab for your "Broker Link" scrapeable file. My guess is you'll see two parameters, both named "key", one set to use the BROKER_ID session variable and the other the PROPERTY_ID session variable. Now, go back to the Properties tab of your Broker Link scrapeable file and my guess is your URL looks like this:
http://www.trulia.com/transfer.php?s_id=~#BROKER_ID#~&p_id=~#PROPERTY_ID#~
What you have here is a case of two GET parameters being set twice (sort of). You're able to set GET parameters either from under the Parameters tab OR from the querystring of your URL. If you do it in both places screen-scraper apparently burps and sends only "http:" as your request URL.
So, I suggest you choose between passing your parameters in the URL querystring OR under the parameters tab. If you choose the parameters tab, be sure to rename your two "key" entries to "s_id" and "p_id" respectively.
Remember, parameters always come in the form of key/value. The key is the variable name and the value is whatever information is assigned to the key. I prefer to name the session variables that are going to be used as GET or POST parameters the same name as their respective keys.
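To make the key/value idea concrete, here is a rough plain-Java sketch of how a querystring is assembled from named parameters. It is not how screen-scraper builds its requests internally; the class and method names (QueryStringSketch, build) are invented for this example. The point is that each key ("s_id", "p_id") appears exactly once, paired with its value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper, not a screen-scraper class.
final class QueryStringSketch {

    // Append each key=value pair to the base URL, in the map's iteration
    // order, using '?' before the first pair and '&' before the rest.
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // URL-encode one key or value as UTF-8.
    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

With a LinkedHashMap holding s_id=36614 and p_id=1079187204 and the base URL http://www.trulia.com/transfer.php, this yields http://www.trulia.com/transfer.php?s_id=36614&p_id=1079187204, with no stray "key" entries.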
Hope this helps,
Scott
Yes, you are right about the doubled-up Get parameters
Hi Scott,
Yes, you are right about the doubled-up Get parameters. The Broker Link scrapeable file URL is as follows:
http://www.trulia.com/transfer.php?s_id=~#BROKER_ID#~&p_id=~#PROPERTY_ID#~&t_id=odpl1
and I changed the Get parameters to the following:
~#BROKER_ID#~ ~#BROKER_ID#~ 1 GET
~#PROPERTY_ID#~ ~#PROPERTY_ID#~ 2 GET
but received the following:
Processing script: "Scrape Broker Link"
Scraping file: "Broker Link"
Broker Link: Processing scripts before a file is scraped.
Broker Link: Preliminary URL: http://www.trulia.com/transfer.php?s_id=~#BROKER_ID#~&p_id=~#PROPERTY_ID#~&t_id=odpl1
Broker Link: Using strict mode.
Broker Link: Resolved URL: http://www.trulia.com/transfer.php?s_id=10051115&p_id=1003845699&t_id=odpl1&10051115=10051115&1003845699=1003845699
Broker Link: Sending request.
Broker Link: Redirecting to: http:
Broker Link: An error occurred while connecting to 'http:'. The message was Connection refused: connect.
Broker Link: Processing scripts after a file is scraped.
Processing script: "Broker Link Log"
**http:**
Processing script: "Broker Link Log"
Thanks again for your help!
Redundant GET parameters
Adrian,
Actually, what I meant was, you should either have GET parameters specified in the querystring OR set under the Parameters tab. But, not both.
So, go ahead and change your URL under the Properties tab to:
http://www.trulia.com/transfer.php
Then, go to the Parameters tab. You may recall from the first tutorial how extractor patterns and session variables were compared to a stencil. In our situation, the session variable left over after applying the extractor pattern stores the value of what was extracted.
So, any place you refer to a session variable using, ~#my_variable#~, the value of that variable will be used.
With this in mind, change your parameters to be...
s_id,~#BROKER_ID#~,1,GET
p_id,~#PROPERTY_ID#~,2,GET
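As a rough analogy only, the ~#NAME#~ substitution can be pictured with the plain-Java sketch below. This is not screen-scraper's actual implementation, and the names (TokenSketch, resolve) are made up. Treating an unset variable as an empty string is an assumption, chosen because it matches a resolved URL that ends in "s_id=&p_id=" when the variables are empty.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not a screen-scraper class.
final class TokenSketch {

    // Matches tokens of the form ~#NAME#~ and captures NAME.
    private static final Pattern TOKEN = Pattern.compile("~#(\\w+)#~");

    // Replace every ~#NAME#~ token with the value stored under NAME;
    // a missing variable is replaced with an empty string (assumption).
    static String resolve(String template, Map<String, String> variables) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer resolved = new StringBuffer();
        while (matcher.find()) {
            String value = variables.get(matcher.group(1));
            matcher.appendReplacement(resolved,
                    Matcher.quoteReplacement(value == null ? "" : value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }
}
```

So a parameter value of ~#BROKER_ID#~ simply becomes whatever BROKER_ID currently holds at the moment the request is built.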
I noticed that you're now passing in a third parameter, "t_id" with a value of "odpl1". One key thing to troubleshooting problems is to strip the problem down to its simplest form. Which would mean, not introducing new elements until the current issue is resolved.
-Scott
Works but having trouble...
Hi Scott,
It works but I'm having trouble saving the redirect in the CSV file. The log is as follows:
Details Page: Processing scripts after a pattern application.
Processing script: "Write data to a file"
Writing data to a file.
UNFIXED: PRICE = 329,900
FIXED PRICE = 329900
UNFIXED: GLA = n/a
FIXED GLA = n/a
UNFIXED: BROKER = null
Executing script: "Scrape Broker Link".
Scraping file: "Broker Link"
Broker Link: Preliminary URL: http://www.trulia.com/transfer.php
Broker Link: Using strict mode.
Broker Link: Resolved URL: http://www.trulia.com/transfer.php?s_id=69266&p_id=1059034986
Broker Link: Sending request.
Broker Link: Redirecting to: http://listings.listhub.net/pages/MLSLINY/2162353/?channel=trulia
Broker Link: Redirecting to: http://www.mlsli.com/cyberhomesredirect.cfm?orgid=nylibor-c&listingid=2162353
Broker Link: Redirecting to: http://www.mlsli.com/unideterminepropertytype.cfm?mlnum=2162353&CFID=7392757&CFTOKEN=84229116
Broker Link: Redirecting to: http://www.mlsli.com/unidetails.cfm?mlnum=2162353&typeprop=1&bn=1&CFID=7392757&CFTOKEN=84229116
Broker Link: Sorry, tidying HTML failed. Returning the original HTML.
and the "Write data to a file" is as follows:
//Write columns.
out.write( session.getVariable( "PROPERTY_ID" )+ "," );
out.write( "http://www.trulia.com/property/"+ session.getVariable( "PROPERTY_ID" )+ "-"+ session.getVariable( "FULL_ADDRESS" )+ "," );
out.write( session.getVariable( "ADDRESS" )+ "," );
out.write( session.getVariable( "CITY" ) + "," );
out.write( session.getVariable( "NEIGHBORHOOD" ) + "," );
out.write( session.getVariable( "COUNTY" ) + "," );
out.write( session.getVariable( "STATE" ) + "," );
out.write( session.getVariable( "ZIP" ) + "," );
out.write( session.getVariable( "PRICE" ) + "," );
out.write( session.getVariable( "TYPE" ) + "," );
out.write( session.getVariable( "BUILT" ) + "," );
out.write( session.getVariable( "BED" ) + "," );
out.write( session.getVariable( "BATH" ) + "," );
out.write( session.getVariable( "GLA" ) + "," );
out.write( session.getVariable( "LOT" ) + "," );
out.write( session.getVariable( "PHOTO_ID" ) + "," );
out.write( session.getVariable( "PHOTO_LINK" ) + "," );
out.write( session.getVariable( "BROKER" ) + "," );
out.write( session.getVariable( "BROKER_ID" ) + "," );
out.write( "http://www.trulia.com/transfer.php?s_id="+ session.getVariable( "BROKER_ID" )+ "&p_id="+ session.getVariable( "PROPERTY_ID" )+ "," );
session.executeScript("Scrape Broker Link");
out.write(("**" + scrapeableFile.getCurrentURL() + "**") + ",");
out.write( session.getVariable( "AGENT" ) + "," );
out.write( "\n" );
//Close up the file.
out.close();
One more weird thing, it appears to be in a circular pattern because the script repeats a few times. The log is as follows:
Details Page: Processing scripts after a pattern application.
Processing script: "Write data to a file"
Writing data to a file.
UNFIXED: PRICE =
FIXED PRICE =
UNFIXED: GLA =
FIXED GLA =
UNFIXED: BROKER =
FIXED BROKER =
Attempting a file download with the following maximum number of attempts: 5
Executing script: "Scrape Broker Link".
Scraping file: "Broker Link"
Broker Link: Preliminary URL: http://www.trulia.com/transfer.php
Broker Link: Using strict mode.
Broker Link: Resolved URL: http://www.trulia.com/transfer.php?s_id=&p_id=
Broker Link: Sending request.
Wrote file from: http://thumbs.trulia.com/pictures/thumbs_3/ps.3/3/2/d/5/picture-uh=b8c784e0e7a27fb0adb910de1f6d1d94-ps=32d54b18c9bdad88d745edb559704d95.jpg to file: C:/Users/Jaysan/Desktop/Screen Scraper/Trulia/Images/b8c784e0e7a27fb0adb910de1f6d1d94-ps=32d54b18c9bdad88d745edb559704d95.jpg
The file download succeeded.
Broker Link: Redirecting to: http:
Broker Link: An error occurred while connecting to 'http:'. The message was Connection refused: connect.
Trulia Site Detail Page: Downloading http://thumbs.trulia.com/pictures/thumbs_.jpg in its own thread.
Attempting a file download with the following maximum number of attempts: 5
AGENT--DataRecord 2:
AGENT=SHARON WACHTER & DIANE LARSEN
Storing this value in a session variable.
Details Page: Processing scripts after a pattern application.
Processing script: "Write data to a file"
Thanks for your help.
Previous suggestions
Adrian,
I'm sorry again for the delay. I only have enough time right now to recommend that you re-read my earlier message titled, "You're very close". Specifically, try out my "to see it work" idea and consider implementing the additional steps I suggest at the end to output your data.
It would make me feel all warm inside if you were to try the things I suggest. Because it's not practical to be exchanging your scraping sessions back and forth, and because we're not able to sit over your shoulder, troubleshooting is much easier when you try our suggestions and ask questions related to them.
Thanks,
Scott
Broker Link Log Works
Hi Scott,
I'm sorry that I didn't follow your suggestions step by step. I guess I was trying to take a short-cut that ended up being a long-cut.
The redirect worked, so now I'm trying to write the redirect out to the CSV file. The order of the scripts is as follows:
After the "Detail Page" is scraped, the "Scrape Broker Link" script is executed after each pattern application and then the "Write data to a file" script is executed after each pattern application. The "Scrape Broker Link" script contains session.scrapeFile("Broker Link");. After the "Broker Link" is scraped, the script "Broker Link Log" is executed. The "Broker Link Log" script contains session.log("**" + scrapeableFile.getCurrentURL() + "**");
I would like to append the redirect URL to the current line of the CSV. In order to do that, would I have the following script in the "Write data to a file" script?
//Write columns.
out.write( session.getVariable( "Name of Get Variable 1" ) + "," );
out.write( session.getVariable( "Name of Get Variable 2" ) + "," );
out.write(("**" + scrapeableFile.getCurrentURL() + "**") + ",");
out.write( "\n" );
//Close up the file.
out.close();
Sorry again for being a pain in the butt and thanks again for your help.
Jedi training
Ah, Adrian. In order to become a Jedi Padawan, one must learn patience and take each step one at a time.
I recommend that you break down your last message and try each of the steps one at a time. If the result of one step is not desirable, reconsider your approach. If you are confident in your approach but continue to have an unintended outcome, break down your approach even further and try again.
The key to solving flaws in either your approach or your implementation is to simplify and isolate the problem as much as possible. This might mean...
- Make a script and call it "00--breakpoint"
- Hold it in your arms for a moment because it will become very precious to you (I keep mine in a little mini recliner on my desk)
- Include this one line in it
session.breakpoint();
- Call this script from the scripts tab of your "Broker Link" scrapeable file and have it run "before file is scraped"
- Call your "Scrape Broker Link" script for each link extracted from the "Detail Page" scrapeable file
- For each link the breakpoint should fire before the broker link page is scraped
- Click play on the first breakpoint and let it stop again
- Observe the log to see if each of the redirects are happening like they should
- Now, refer to my previous examples AND the getCurrentURL documentation to work out where you should be calling the script that contains the getCurrentURL method from
- You will now be able to access the value of the getCurrentURL method when calling it from your "Broker link" scrapeable file, "after the file is scraped"
The next challenge I foresee will be in combining the data that may have previously been written out to a file with this data you're working on now. Hint: Resist the urge to write out the earlier data and instead store it all as a session variable in a multi-dimensional object such as an array or hashtable (screen-scraper will retain the data-type of whatever variable you give it), then write that data and this latest data to the output file in the same script at the same time.
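One way to picture that hint is the plain-Java sketch below: collect the columns for the current row in an ordered map as they become available, then turn the whole row into a single CSV line once the redirect URL is known. The class and method names (RowBufferSketch, put, toCsvLine) and the column names in the usage note are hypothetical; in a scraping session, an object like this would be the thing kept in a session variable between scripts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a screen-scraper class.
final class RowBufferSketch {

    // LinkedHashMap keeps the columns in the order they were added.
    private final Map<String, String> row = new LinkedHashMap<String, String>();

    // Remember one column of the current row; null is stored as "".
    void put(String column, String value) {
        row.put(column, value == null ? "" : value);
    }

    // Build one quoted, comma-separated line ending in "\n",
    // then clear the buffer so it is ready for the next row.
    String toCsvLine() {
        StringBuilder line = new StringBuilder();
        for (String value : row.values()) {
            if (line.length() > 0) {
                line.append(',');
            }
            line.append('"').append(value.replace("\"", "\"\"")).append('"');
        }
        row.clear();
        return line.append('\n').toString();
    }
}
```

The detail-page script would call put(...) for columns such as PROPERTY_ID and PRICE, the script that runs after "Broker Link" is scraped would add the redirect URL, and only then would the line from toCsvLine() be written with out.write(...). That way every row is written once, complete.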
Now, go scrape on the day!
Scott