Scraping this site with directories, then search results using Ajax.
We are testing several sites for scraping and came across one that works differently from the others. We can't find any guides online for our interns to follow, so we are wondering if you could give us some pointers.
We are trying to scrape this: http://www.hotelscombined.com/CountryAll/Argentina.htm
As you can see, each city on the page above leads to a search result page, then a details page, so it's a 3-level scrape instead of the 2 levels shown in the e-commerce tutorial. We could do this as a 2-level scrape, but we would rather have something more automated. How can we get screen-scraper to work at this depth?
Next, the search results are fetched via Ajax. How can we get to the 3rd level when the results come through Ajax? What is the scraping method?
Many thanks.
Looking at this, you could even go 4 levels deep if you wanted to automate through the countries. I whipped up a quick demo. Save the block below to a text file, name it "hotel compete.sss", and import it into screen-scraper:
<scraping-session use-strict-mode="true"><script-instances><owner-type>ScrapingSession</owner-type><owner-name>Hotels combined</owner-name></script-instances><name>Hotels combined</name><notes></notes><cookiePolicy>0</cookiePolicy><maxHTTPRequests>1</maxHTTPRequests><external_proxy_username></external_proxy_username><external_proxy_password></external_proxy_password><external_proxy_host></external_proxy_host><external_proxy_port></external_proxy_port><external_nt_proxy_username></external_nt_proxy_username><external_nt_proxy_password></external_nt_proxy_password><external_nt_proxy_domain></external_nt_proxy_domain><external_nt_proxy_host></external_nt_proxy_host><anonymize>false</anonymize><terminate_proxies_on_completion>false</terminate_proxies_on_completion><number_of_required_proxies>5</number_of_required_proxies><originator_edition>2</originator_edition><logging_level>1</logging_level><date_exported>May 04, 2011 15:37:44</date_exported><character_set>ISO-8859-1</character_set><scrapeable-files sequence="-1" will-be-invoked-manually="true" tidy-html="jtidy"><last-scraped-data></last-scraped-data><URL>http://www.hotelscombined.com/~#URL#~</URL><last-request></last-request><name>Hotel</name><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><h1 class="hc_htl_intro_name">~@NAME@~</h1>
</pattern-text><identifier>Hotel info</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="true" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[^<>]*</regular-expression><identifier>NAME</identifier></extractor-pattern-tokens><script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Hotel info</owner-name></script-instances></extractor-patterns><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Hotel</owner-name></script-instances></scrapeable-files><scrapeable-files sequence="-1" will-be-invoked-manually="true" tidy-html="dont"><last-scraped-data></last-scraped-data><URL>http://www.hotelscombined.com/City/~#URL#~</URL><BASICAuthenticationUsername></BASICAuthenticationUsername><last-request></last-request><name>City</name><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text><h3><a p=~@DATARECORD@~/a></h3>
</pattern-text><identifier>Hotels</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><identifier>DATARECORD</identifier></extractor-pattern-tokens><extractor-patterns sequence="2" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>>~@NAME@~<</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="true" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[^<>]*</regular-expression><identifier>NAME</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>href="~@URL@~"</pattern-text><extractor-pattern-tokens optional="false" save-in-session-variable="true" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[^"]*</regular-expression><identifier>URL</identifier></extractor-pattern-tokens><script-instances/></extractor-patterns><script-instances><script-instances when-to-run="80" sequence="1" enabled="true"><script><script-text>session.scrapeFile("Hotel");</script-text><name>HC--scrape hotel</name><language>Interpreted 
Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Hotels</owner-name></script-instances></extractor-patterns><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>City</owner-name></script-instances></scrapeable-files><scrapeable-files sequence="1" will-be-invoked-manually="false" tidy-html="jtidy"><last-scraped-data></last-scraped-data><URL>http://www.hotelscombined.com/CountryAll/Argentina.htm</URL><last-request></last-request><name>Country</name><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>"/City/~@URL@~">~@tag@~~@CITY@~<</pattern-text><identifier>Cities</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="3"><identifier>CITY</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="2"><regular-expression>(<b>)*</regular-expression><identifier>tag</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="true" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[^"]*</regular-expression><identifier>URL</identifier></extractor-pattern-tokens><script-instances><script-instances when-to-run="80" sequence="1" 
enabled="true"><script><script-text>session.scrapeFile("City");</script-text><name>HC--scrape city</name><language>Interpreted Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Cities</owner-name></script-instances></extractor-patterns><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>Country</owner-name></script-instances></scrapeable-files></scraping-session>
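For anyone reading the export rather than importing it, the chaining it relies on can be sketched outside screen-scraper. In the session, the Country file's "Cities" pattern saves each URL token to a session variable and a script calls session.scrapeFile("City"); the City file's "Hotels" pattern does the same with its own URL token and calls session.scrapeFile("Hotel"). The plain-Java sketch below imitates that flow with regular expressions over made-up HTML snippets. The class name, helper names, sample markup, and the [^<]* expression used for the city name are assumptions for illustration only; they are not part of the session or of screen-scraper's API, and nothing is fetched over the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChainSketch {
    // Rough equivalent of the "Cities" pattern: "/City/~@URL@~">~@tag@~~@CITY@~<
    // Group 1 is the URL token ([^"]*), (?:<b>)* plays the role of the "tag" token,
    // and group 2 approximates the CITY token (assumed here to be [^<]*).
    public static final Pattern CITY =
            Pattern.compile("\"/City/([^\"]*)\">(?:<b>)*([^<]*)<");

    // Rough equivalent of the "Hotels" sub-pattern: href="~@URL@~"
    public static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Collects group 1 of every match, like a URL token saved once per match.
    public static List<String> firstGroups(Pattern pattern, String html) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            tokens.add(matcher.group(1));
        }
        return tokens;
    }

    // Same substitution the "City" scrapeable file does with ~#URL#~.
    public static String cityUrl(String token) {
        return "http://www.hotelscombined.com/City/" + token;
    }

    // Same substitution the "Hotel" scrapeable file does with ~#URL#~.
    public static String hotelUrl(String token) {
        return "http://www.hotelscombined.com/" + token;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the country and city pages.
        String countryPage =
                "<a href=\"/City/Buenos_Aires_1.htm\"><b>Buenos Aires</b></a>"
              + "<a href=\"/City/Mendoza_1.htm\">Mendoza</a>";
        String cityPage =
                "<h3><a p=\"1\" href=\"Hotel/Example_Hotel.htm\">Example Hotel</a></h3>";

        // Level 1 -> 2: one city request per match on the country page.
        for (String cityToken : firstGroups(CITY, countryPage)) {
            System.out.println("City page:  " + cityUrl(cityToken));
            // Level 2 -> 3: one hotel request per match on the city page.
            // A real run would request the city URL; a canned snippet is used here.
            for (String hotelToken : firstGroups(HREF, cityPage)) {
                System.out.println("  Hotel page: " + hotelUrl(hotelToken));
            }
        }
    }
}
```

The point of the nesting is that each level only needs the URL token captured by the level above it, which is why the session marks the URL tokens as saved in a session variable before calling the next scrapeable file.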