AjaxControlToolkit -- NoBot : Have you ever encountered this?

I am trying to scrape a site that uses the NoBot anti-scraping method. Here is the documentation for NoBot:

http://www.asp.net/ajaxLibrary/AjaxControlToolkitSampleSite/NoBot/NoBot.aspx

It's a .NET control that plays a few tricks to defeat anything that isn't a real browser. At the very least you need to execute a JavaScript snippet and send back the response it expects. There are other elements to NoBot, but the snippet is the only one I'm currently wrestling with.

Thoughts?

Robert,

At least in their example, this wasn't too hard to fool. If you look at the POST parameters sent when you submit the form, you'll notice the usual __VIEWSTATE and __EVENTVALIDATION. They're also sending ctl00_SampleContent_ScriptManager1_HiddenField, which was easy to grab from the originating page.
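
For anyone doing this outside of screen-scraper, here is a rough sketch of grabbing those hidden fields with a plain regex. The class and method names are made up for illustration, and the sample markup is a placeholder (the real values are long encoded strings), so treat it as a starting point rather than a finished extractor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of screen-scraper or the toolkit.
class HiddenFieldSketch {
    // Returns whatever sits between the quotes after `name" value="`,
    // or null when the page does not contain such a field.
    static String extractHiddenField(String html, String name) {
        Pattern p = Pattern.compile(Pattern.quote(name) + "\" value=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Placeholder markup; the attribute layout is an assumption.
        String html = "<input type=\"hidden\" name=\"__VIEWSTATE\" "
                    + "id=\"__VIEWSTATE\" value=\"abc123\" />";
        System.out.println(extractHiddenField(html, "__VIEWSTATE"));
    }
}
```

The same call with "__EVENTVALIDATION" should pick up the other field, provided the page writes the value attribute directly after the id.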

The only tricky one was the value of ctl00$SampleContent$NoBot1$NoBot1_NoBotExtender_ClientState. If you dig a little you'll find a function being called which parses the current width and height of the div tag labeled ctl00_SampleContent_NoBot1_NoBotSamplePanel. It then multiplies these values together to create the value for the post parameter ctl00$SampleContent$NoBot1$NoBot1_NoBotExtender_ClientState.
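
In plain Java the calculation might look something like the sketch below. It assumes the div carries an inline style with the height first and the width second, both in px, which is how the sample page rendered it when I looked. The class name, method name and the dimensions in main are placeholders.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the width * height trick.
class NoBotClientStateSketch {
    // Assumes: ...NoBotSamplePanel" style="height:<n>px;width:<m>px
    private static final Pattern PANEL = Pattern.compile(
        "NoBotSamplePanel\" style=\"height:(\\d+)px;width:(\\d+)px");

    static int computeClientState(String html) {
        Matcher m = PANEL.matcher(html);
        if (!m.find()) {
            throw new IllegalArgumentException("NoBotSamplePanel div not found");
        }
        int height = Integer.parseInt(m.group(1));
        int width = Integer.parseInt(m.group(2));
        // The page's script multiplies the two dimensions together.
        return width * height;
    }

    public static void main(String[] args) {
        // Made-up dimensions, purely for illustration.
        String html = "<div id=\"ctl00_SampleContent_NoBot1_NoBotSamplePanel\" "
                    + "style=\"height:100px;width:150px;\"></div>";
        System.out.println(computeClientState(html));
    }
}
```

If the real site computes its challenge differently, only computeClientState would need to change.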

Mimic this behavior and pass the result as the value of ctl00$SampleContent$NoBot1$NoBot1_NoBotExtender_ClientState, then add a short 5-second pause before you submit the form, and it should work.
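
If you are rolling your own HTTP client, the submit step could be sketched roughly as follows: collect the parameters, wait, then send the encoded body as the POST. The class and method names are hypothetical and every value in main is a placeholder, so swap in whatever you actually scraped from the page.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only builds the body, does not send the request.
class NoBotSubmitSketch {
    // Joins the parameters into an application/x-www-form-urlencoded body.
    static String buildFormBody(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always supported
        }
        return body.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("__VIEWSTATE", "abc123");        // placeholder value
        params.put("__EVENTVALIDATION", "xyz789");  // placeholder value
        params.put("ctl00$SampleContent$NoBot1$NoBot1_NoBotExtender_ClientState",
                   "15000");                        // placeholder value
        Thread.sleep(5000); // the short pause before submitting
        System.out.println(buildFormBody(params));
    }
}
```

Note that the $ in the parameter names gets percent-encoded as %24 in the body, which is expected for form posts.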

Copy and paste the following into a file named NoBot.sss then import it into screen-scraper to see how it works.

<?xml version="1.0" encoding="UTF-8"?>
<scraping-session use-strict-mode="true"><script-instances><owner-type>ScrapingSession</owner-type><owner-name>asp.net</owner-name></script-instances><name>asp.net</name><notes></notes><cookiePolicy>0</cookiePolicy><maxHTTPRequests>1</maxHTTPRequests><external_proxy_username></external_proxy_username><external_proxy_password></external_proxy_password><external_proxy_host></external_proxy_host><external_proxy_port></external_proxy_port><external_nt_proxy_username></external_nt_proxy_username><external_nt_proxy_password></external_nt_proxy_password><external_nt_proxy_domain></external_nt_proxy_domain><external_nt_proxy_host></external_nt_proxy_host><anonymize>false</anonymize><terminate_proxies_on_completion>false</terminate_proxies_on_completion><number_of_required_proxies>5</number_of_required_proxies><originator_edition>2</originator_edition><logging_level>1</logging_level><date_exported>April 25, 2012 18:06:22</date_exported><character_set>UTF-8</character_set><created_by_version>6.0</created_by_version><scrapeable-files sequence="1" will-be-invoked-manually="false" tidy-html="jtidy"><last-scraped-data></last-scraped-data><URL>http://www.asp.net/ajaxLibrary/AjaxControlToolkitSampleSite/NoBot/NoBot.aspx</URL><last-request></last-request><name>start</name><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>TSM_CombinedScripts_=~@HiddenField@~"</pattern-text><identifier>HiddenField</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="true" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" 
sequence="1"><regular-expression>[^"]*</regular-expression><identifier>HiddenField</identifier></extractor-pattern-tokens><script-instances><owner-type>ExtractorPattern</owner-type><owner-name>HiddenField</owner-name></script-instances></extractor-patterns><extractor-patterns sequence="5" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>NoBotSamplePanel" style="height:~@height@~px;width:~@width@~px</pattern-text><identifier>height &amp; width</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[\d,]+</regular-expression><identifier>height</identifier></extractor-pattern-tokens><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="2"><regular-expression>[\d,]+</regular-expression><identifier>width</identifier></extractor-pattern-tokens><script-instances><script-instances when-to-run="85" sequence="1" enabled="true"><script><script-text>width = Integer.valueOf(dataRecord.get("width"));
height = Integer.valueOf(dataRecord.get("height"));
noBotExtenderClientState = (width * height);
session.setVariable("NoBotExtender_ClientState", noBotExtenderClientState);
session.log("NoBotExtender_ClientState: " + session.getVariable("NoBotExtender_ClientState"));
session.pause(5000);
session.scrapeFile("submit");</script-text><name>me no bot</name><language>Interpreted Java</language></script></script-instances><owner-type>ExtractorPattern</owner-type><owner-name>height &amp; width</owner-name></script-instances></extractor-patterns><extractor-patterns sequence="4" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>NoBotExtender_ClientState" /&gt;&#xd;
&lt;div id="~@Identifier@~"</pattern-text><identifier>Untitled Extractor Pattern</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[^"]*</regular-expression><identifier>Identifier</identifier></extractor-pattern-tokens><script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Untitled Extractor Pattern</owner-name></script-instances></extractor-patterns><extractor-patterns sequence="2" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>VIEWSTATE" value="~@VIEWSTATE@~"</pattern-text><identifier>Viewstate</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="true" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[^"]*</regular-expression><identifier>VIEWSTATE</identifier></extractor-pattern-tokens><script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Viewstate</owner-name></script-instances></extractor-patterns><extractor-patterns sequence="3" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>EVENTVALIDATION" value="~@EVENTVALIDATION@~"</pattern-text><identifier>Eventvalidation</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="true" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" 
sequence="1"><regular-expression>[^"]*</regular-expression><identifier>EVENTVALIDATION</identifier></extractor-pattern-tokens><script-instances><owner-type>ExtractorPattern</owner-type><owner-name>Eventvalidation</owner-name></script-instances></extractor-patterns><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>start</owner-name></script-instances></scrapeable-files><scrapeable-files sequence="-1" will-be-invoked-manually="true" tidy-html="jtidy"><last-scraped-data></last-scraped-data><URL>http://www.asp.net/ajaxLibrary/AjaxControlToolkitSampleSite/NoBot/NoBot.aspx</URL><last-request></last-request><name>submit</name><extractor-patterns sequence="1" automatically-save-in-session-variable="false" if-saved-in-session-variable="0" filter-duplicates="false" cache-data-set="false" will-be-invoked-manually="false"><pattern-text>&lt;span id="ctl00_SampleContent_Label1" style="font-weight:bold;"&gt;~@message@~&lt;</pattern-text><identifier>message</identifier><extractor-pattern-tokens optional="false" save-in-session-variable="false" compound-key="true" strip-html="false" resolve-relative-url="false" replace-html-entities="false" trim-white-space="false" exclude-from-data="false" null-session-variable="false" sequence="1"><regular-expression>[^&lt;&gt;]*</regular-expression><identifier>message</identifier></extractor-pattern-tokens><script-instances><owner-type>ExtractorPattern</owner-type><owner-name>message</owner-name></script-instances></extractor-patterns><HTTPParameters sequence="9"><key>ctl00$SampleContent$NoBot1$NoBot1_NoBotExtender_ClientState</key><type>POST</type><value>~#NoBotExtender_ClientState#~</value></HTTPParameters><HTTPParameters sequence="8"><key>ctl00$SampleContent$Button1</key><type>POST</type><value>Submit</value></HTTPParameters><HTTPParameters sequence="6"><key>ctl00$SampleContent$TextBox1</key><type>POST</type><value>Anonymous</value></HTTPParameters><HTTPParameters 
sequence="1"><key>ctl00_SampleContent_ScriptManager1_HiddenField</key><type>POST</type><value>~#HiddenField#~</value></HTTPParameters><HTTPParameters sequence="4"><key>__VIEWSTATE</key><type>POST</type><value>~#VIEWSTATE#~</value></HTTPParameters><HTTPParameters sequence="2"><key>__EVENTTARGET</key><type>POST</type><value></value></HTTPParameters><HTTPParameters sequence="11"><key>ctl00$SampleContent$cpeProperties_ClientState</key><type>POST</type><value>true</value></HTTPParameters><HTTPParameters sequence="10"><key>ctl00$SampleContent$cpeDescription_ClientState</key><type>POST</type><value>false</value></HTTPParameters><HTTPParameters sequence="7"><key>ctl00$SampleContent$TextBox2</key><type>POST</type><value>User</value></HTTPParameters><HTTPParameters sequence="5"><key>__EVENTVALIDATION</key><type>POST</type><value>~#EVENTVALIDATION#~</value></HTTPParameters><HTTPParameters sequence="3"><key>__EVENTARGUMENT</key><type>POST</type><value></value></HTTPParameters><script-instances><owner-type>ScrapeableFile</owner-type><owner-name>submit</owner-name></script-instances></scrapeable-files></scraping-session>

Hopefully, the actual site you're working with doesn't get much trickier than this.

-Scott