Commas in the URL replaced by plus
This is the URL of the search results I want to scrape:
http://bellmarc.com/search/searchresults.asp?fireplace=0&outdoor=0&modern=0&doorman=0&loft=0&prewar=0&bedrooms=&neighborhood=ES%2C,WS%2C,LC%2C,MH%2C,GV%2C,CG%2C,BK%2C,BX%2C,QS%2C,O&buildingtype=&maxlimit=&minlimit=&search=s&sort=&page=1&pagesize=20
I created the scrapeable file using the URL from the proxy.
When I run the scraping session, the URL resolved by screen-scraper looks like this:
http://bellmarc.com/search/searchresults.asp?fireplace=0&outdoor=0&modern=0&doorman=0&loft=0&prewar=0&bedrooms=&neighborhood=ES%2C+WS%2C+LC%2C+MH%2C+GV%2C+CG%2C+BK%2C+BX%2C+QS%2C+O&buildingtype=&maxlimit=&minlimit=&search=s&sort=&page=1&pagesize=20
The commas are replaced by plus signs, so the scraper is redirected to the home page.
Why is this happening? How can I scrape this search?
Starting scraper.
Running scraping session: Bellmarc Property Search
Processing scripts before scraping session begins.
Scraping file: "New Scrapeable File"
New Scrapeable File: Preliminary URL: http://bellmarc.com/search/searchresults.asp?fireplace=0&outdoor=0&modern=0&doorman=0&loft=0&prewar=0&bedrooms=&neighborhood=ES,%20WS,%20LC,%20MH,%20GV,%20CG,%20BK,%20BX,%20QS,%20O&buildingtype=&maxlimit=&minlimit=&search=s&sort=&page=1&pagesize=20
New Scrapeable File: Using strict mode.
New Scrapeable File: Resolved URL: http://bellmarc.com/search/searchresults.asp?fireplace=0&outdoor=0&modern=0&doorman=0&loft=0&prewar=0&bedrooms=&neighborhood=ES,%20WS,%20LC,%20MH,%20GV,%20CG,%20BK,%20BX,%20QS,%20O&buildingtype=&maxlimit=&minlimit=&search=s&sort=&page=1&pagesize=20
New Scrapeable File: Sending request.
New Scrapeable File: Redirecting to: http://bellmarc.com/home.asp
Processing scripts after scraping session has ended.
Scraping session "Bellmarc Property Search" finished.
This is what I got when I tried what you suggested.
I need your help to sort out this problem.
Maybe you're missing a cookie? It looks like it resolves properly, but then the site is redirecting you.
Since the website is built on ASP, I'd be willing to bet the nickel in my pocket that there is a VIEWSTATE variable (with optional EVENTTARGET, EVENTVALIDATION, and EVENTARGUMENT variables) which needs to be scraped from the previous page in order to navigate properly. ASP sucks that way. Blame Micro$oft :)
If you do have VIEWSTATE variables, then you'll have to follow the entire path through the site, scraping VIEWSTATEs along the way. The VIEWSTATEs are unique each time you visit a page. It's a magical, often big and clumsy and way too cumbersome string of nonsense that the site understands, but that no human ever will.
Is there a VIEWSTATE variable in your file's parameter tab?
If not, then maybe it's just a cookie issue. Screen-scraper doesn't have the cookies that your web browser has, unless screen-scraper follows the same navigation. You may have to make a scrapeableFile for each page on the site as you navigate, beginning with the home page. A way to test this would be to clear your browser's cache and cookies, and then try to paste the URL you gave me into your browser. If it redirects you, then you've got your answer: you're missing cookies.
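To make the cookie point concrete, here is a minimal sketch using the standard java.net.CookieManager, with no network access involved. It only simulates a response from the home page; the class name and the cookie name/value are made up for illustration and are not taken from the actual site.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustration only: a request made "cold" carries no session cookie. */
final class CookieJarSketch {
    private final CookieManager jar = new CookieManager();

    /** Records the Set-Cookie header of a (simulated) response from the given page. */
    void receive(String pageUrl, String setCookieHeader) {
        try {
            jar.put(URI.create(pageUrl),
                    Collections.singletonMap("Set-Cookie",
                            Collections.singletonList(setCookieHeader)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the Cookie header value that would accompany a request to the given page. */
    String cookiesFor(String pageUrl) {
        try {
            Map<String, List<String>> headers = jar.get(
                    URI.create(pageUrl),
                    Collections.<String, List<String>>emptyMap());
            List<String> cookies = headers.get("Cookie");
            return cookies == null ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Before `receive(...)` is called for the home page, `cookiesFor(...)` on the search results URL returns an empty string; afterwards it includes the session cookie. That is roughly the difference between pasting the URL into a freshly cleared browser and arriving there by navigating the site.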
Re:
I tried this:
1. Copied the URL before deleting the cookies.
2. Deleted the cookies.
3. Went to the home page.
4. Went to the search page.
5. Pasted the search result URL into the address bar.
6. The browser was redirected to the search result page.
The browser is only directed correctly if I paste the URL after doing a search (clicking the search button). Can we overcome this situation? Simply making a scrapeable file for each page won't do the job, right? Can we scrape this site?
Yes! It's the cookies.
There wasn't any VIEWSTATE variable in the parameter list. Thank God, I haven't seen one yet!
When I pasted the URL in the browser after deleting all the cache and cookies, it was redirected to the home page.
I created a scrapeable file for the home page, the search page and the search results page.
Now this is the log when I ran the session.
Starting scraper.
Running scraping session: Bellmarc Property Search
Processing scripts before scraping session begins.
Scraping file: "Home"
Home: Preliminary URL: http://bellmarc.com/home.asp
Home: Using strict mode.
Home: Resolved URL: http://bellmarc.com/home.asp
Home: Sending request.
Scraping file: "Search"
Search: Preliminary URL: http://bellmarc.com/search/search.asp
Search: Using strict mode.
Search: Resolved URL: http://bellmarc.com/search/search.asp?search=sales
Search: Sending request.
Scraping file: "Search Results"
Search Results: Preliminary URL: http://bellmarc.com/search/searchresults.asp?fireplace=0&outdoor=0&modern=0&doorman=0&loft=0&prewar=0&bedrooms=&neighborhood=ES,%20WS,%20LC,%20MH,%20GV,%20CG,%20BK,%20BX,%20QS,%20O&buildingtype=&maxlimit=&minlimit=&search=sales&sort=&page=1&pagesize=20
Search Results: Using strict mode.
Search Results: Resolved URL: http://bellmarc.com/search/searchresults.asp?fireplace=0&outdoor=0&modern=0&doorman=0&loft=0&prewar=0&bedrooms=&neighborhood=ES,%20WS,%20LC,%20MH,%20GV,%20CG,%20BK,%20BX,%20QS,%20O&buildingtype=&maxlimit=&minlimit=&search=sales&sort=&page=1&pagesize=20
Search Results: Sending request.
Search Results: Redirecting to: http://bellmarc.com/search/search.asp
Processing scripts after scraping session has ended.
Scraping session "Bellmarc Property Search" finished.
Do I have to save any variables from the previous pages?
The home page has no parameters. The search page has one GET parameter, sales, with value 1.
The search results page has many, but I didn't enter any parameters, since it behaves the same way (the commas are replaced). I gave the entire URL in the properties page itself.
Now what shall I do?
It looks like the problem was resolved by catching the cookies correctly. I was a little confused by the last part of what you just said, though.
Is it just redirecting you to the general search page, but with no results?
I'll think about it a bit more and reply soon.
Huh..
Well, I wanted to confirm what character the hex code "%2C" refers to. I checked on http://www.obkb.com/dcljr/charstxt.html and it says that it's a comma.
If %2C is a comma, then that means that "ES%2C,WS" (etc) would be equal to "ES,,WS". That seems a bit odd to me..
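This can be checked outside screen-scraper with the standard java.net.URLDecoder and java.net.URLEncoder classes. The following is only a minimal sketch (the class and method names are made up for illustration), but it shows how the three forms of the neighborhood value seen in this thread decode, and what encoding "ES, WS" produces:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class QueryEncodingSketch {
    /** Decodes a form-encoded query value ("%2C" becomes ",", "+" and "%20" become a space). */
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Form-encodes a raw value ("," becomes "%2C", a space becomes "+"). */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("ES%2C,WS")); // ES,,WS   (form from the proxy URL)
        System.out.println(decode("ES%2C+WS")); // ES, WS   (form in the resolved URL)
        System.out.println(decode("ES,%20WS")); // ES, WS   (form in the log)
        System.out.println(encode("ES, WS"));   // ES%2C+WS
    }
}
```

If that reading is right, the "+" in the resolved URL would be an encoded space following each comma rather than a replacement for the comma itself, which suggests the encoding may not be what triggers the redirect.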
Are you sure that the scrapeableFile URL is unaltered from when you made the scrapeableFile from the proxy? When screen-scraper makes a scrapeableFile from the proxy, it should dump the URL as "http://bellmarc.com/search/searchresults.asp", and then have a list of parameters in the parameters tab. Those parameters shouldn't appear with the URL. The only way you'd get two commas next to each other would be if there were blank parameters in the parameters tab. In addition, any parameters listed in the parameters tab should not be hex-encoded, as in the "%2C".
I don't think this has much to do with anything, but I decided I should ask: Are you using the most recent version of screen-scraper, version 4.0? I'm not very familiar with the way that 3.0 worked. I've only been around since May of this year :)
Can you try making another scrapeableFile out of the page you want to scrape? It should be as I've described in the 3rd paragraph.
What makes me wonder about the integrity of the scrapeableFile is the fact that there are mixed encodings in the URL you provided... all commas should be "%2C" or they should all be ","... not mixed.
And are you sure that you're not accidentally scraping variables that already contain a comma, and then putting them into your scrapeableFile, which already has the "%2C"s for commas?
Hope this helps to get you going in the right direction! Of course, if you have any new developments, let me know.
Tim
Re:
Hi Tim, thank you for your reply.
I followed these steps to scrape the site:
1. Recorded the pages using the proxy server.
2. Selected the URL from the proxy and created a scrapeable file in the session named "bellmarc".
3. Changed the page parameter in the parameters tab into a variable.
4. Defined the patterns.
5. Ran the session.
6. The log showed "Request redirected to home.asp".
7. The session ended.
When I created the scrapeable file from the proxy, the URL was reduced to http://bellmarc.com/search/searchresults.asp, with a list of GET parameters in the parameters tab. The URL I gave was the one in the proxy, from which I created the scrapeable file. When I ran the session, it resolved the URL with the plus signs, as I showed, and the session ended without matching any patterns. The log showed that the request was redirected to bellmarc.com/home.asp. Can I scrape this site? I use version 4.0. It's been just a couple of months since I started using screen-scraper. I have scraped 3 sites before. This one, bellmarc.com, is the 4th one I am trying to scrape.
There's got to be some bad information inside of (at least one of) your session variables.
If you just proxy the search results page, or the details page, it works pretty well. I didn't insert any variables.
I would try just a plain old scrapeableFile without any variables in the parameters tab. Make it run instead of your *actual* scrapeableFile which has variables in it.
If it works, take one of the variables from your old one, and put it into the right spot on the new one.
If it still works, put another variable into the new scrapeableFile.
Keep it going.
As soon as it stops working, you'll know which variable is breaking it. Examine the contents of that variable and figure out what is strange about it. Usually it'll be because there's something like an "&" or " " in the variable, which is not okay. You may have to make a script to process the parameter variables before you scrape the file, which does something like the following:
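// Interpreted Java
// Strip "&" and space characters from the variable before it is used as a request parameter.
String someVar = (String) session.getVariable("someVar");
session.setVariable("someVar", someVar.replaceAll("[& ]", ""));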
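The add-one-variable-at-a-time procedure described above can also be written down as a small loop. This is only a sketch in plain Java, not screen-scraper API: `requestWorks` is a hypothetical check supplied by the caller (for example, "run the scrapeableFile with only these variables substituted and see whether it avoids the redirect"), and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class VariableIsolationSketch {
    /**
     * Enables variables one at a time, in order, and returns the first one
     * whose addition makes the request fail. Returns null if the request
     * still works with every variable enabled.
     */
    static String firstBreakingVariable(List<String> variableNames,
                                        Predicate<List<String>> requestWorks) {
        List<String> enabled = new ArrayList<>();
        for (String name : variableNames) {
            enabled.add(name);
            // Pass a copy so the check cannot modify our working list.
            if (!requestWorks.test(new ArrayList<>(enabled))) {
                return name;
            }
        }
        return null;
    }
}
```

Whichever name comes back is the variable whose contents are worth printing to the log and inspecting for characters like "&" or spaces.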