Redirection
I'm currently trying to get a number that appears in the title of a page on a certain website (compete.com) and write it out to a CSV file. When I test the extractor pattern (peppercom.com ~@UVs@~ UVs for September 2011 | Compete) it works: I see 35 where it says ~@UVs@~ - sequence 0. But when I start the scraping session (to write the file out), instead of requesting the page I want (http://siteanalytics.compete.com/peppercom.com/) it is automatically redirected to http://www.compete.com/ie6/. I found that the redirect URL is listed right below the one I'm interested in. My scraping session applies the extractor to the redirected page and not to the one I'm interested in. Any suggestions would be more than helpful. I've included the scraping log and the last response from the page I'm interested in.
Thanks Thank you Gracias
Dan
Scraping log:
Starting scraper.
Running scraping session: Peppercom
Processing scripts before scraping session begins.
Scraping file: "File from New Proxy Session"
File from New Proxy Session: Resolved URL: http://siteanalytics.compete.com/peppercom.com/
File from New Proxy Session: Sending request.
File from New Proxy Session: Redirecting to: http://www.compete.com/ie6/
File from New Proxy Session: Applying extractor pattern: Untitled Extractor Pattern
File from New Proxy Session: Extracting data for pattern "Untitled Extractor Pattern"
File from New Proxy Session: The pattern did not find any matches.
File from New Proxy Session: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
Processing scripts after scraping session has ended.
Scraping session "Peppercom" finished.
HTML from Compete.com:
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- HOSTNAME: prodweb44
DEBUG: False
VERSION: 41a36c315d5e605bcd06dd272df95a8e097482ab
COOKIE_DOMAIN: .compete.com -->
<!--[if lt IE 7 ]> <html lang="en" class="no-js ie6"> <![endif]-->
<!--[if IE 7 ]> <html lang="en" class="no-js ie7"> <![endif]-->
<!--[if IE 8 ]> <html lang="en" class="no-js ie8"> <![endif]-->
<!--[if IE 9 ]> <html lang="en" class="no-js ie9"> <![endif]-->
<!--[if (gte IE 10)|!(IE)]><!-->
<html lang="en" class="no-js" xmlns="http://www.w3.org/1999/xhtml">
<!--<![endif]-->
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="ICBM" content="42.348043, -71.077617" />
<meta name="DC.title" content="Compete" />
<meta name="description" content=" See monthly traffic, unique visitors, rank and more for peppercom.com with Compete's free Site Analytics. " />
<meta name="keywords" content=" competitive intelligence, market intelligence, competitive strategy, media planning, website traffic, site traffic, search marketing, audience measurement " />
<script type="text/javascript" src="http://media.compete.com/site_media/thirdparty/modernizr-1.6.min.ver-41a36c315d5e605bcd06dd272df95a8e097482ab.js">
</script>
<title>peppercom.com 35 UVs for September 2011 | Compete</title>
<!-- block head -->
<!-- endblock head -->
</head>
<body>
<div class="header">
<p class="subscribe"><a href="http://www.compete.com/plans/">Subscribe to Compete PRO</a></p>
<ul class="nav main">
<li class="logo"><a href="http://www.compete.com"><img src="http://media.compete.com/site_media/images/free/siteanalytics_logo.png" width="228" height="26" alt="Site Analytics" /></a></li>
<li><a href="http://www.compete.com">Compete.com</a></li>
<li><a href="http://www.compete.com/pro/">Compete PRO</a></li>
<li><a href="http://www.compete.com/products/">Products</a></li>
<li><a href="http://www.compete.com/expertise/">Expertise</a></li>
<li><a href="http://www.compete.com/resources/methodology/">Our Data</a></li>
<li class="last"><a href="http://blog.compete.com">Pulse Blog</a></li>
</ul>
</div>
<span class="competeXL"><!-- The Compete XL Code -->
<script type="text/javascript">
var __compete_code_control = {
measure_traffic_asynchronously: false
};
</script>
<!-- Compete XL Code for compete.com -->
<script type="text/javascript">
__compete_code = '4b6705ef8ded7e9cb0067318dde11c3e';
/* Set control variables below this line. */
</script>
<script type="text/javascript" src="//c.compete.com/bootstrap/s/4b6705ef8ded7e9cb0067318dde11c3e/compete-com/bootstrap.js">
</script>
<noscript><img width="1" height="1" src="https://ssl-compete-com-4b6705.c-col.com" /></noscript> <!-- End of the Compete Code -->
</span>
<div class="wrapper">
<div class="login-block">
<p class="membership"><a href="javascript:void(0)">Login</a> or <a href="javascript:void(0)">Sign Up</a> for Site Analytics to follow sites</p>
<p class="get-pro">Get the whole story with a Compete PRO subscription.<a href="http://www.compete.com/pro/features/">Learn More</a></p>
</div>
<div class="page clearfix">
<div class="head">
<div class="interact-wrap">
<ul class="nav interact">
<li id="follow" class="follow"><a href="javascript:void(0)">Follow This Site</a></li>
<li id="manage" class="manage" style="display:none"><a href="javascript:void(0)">Manage List</a></li>
</ul>
</div>
<div class="stretcher"><span class='st_twitter_custom' st_title="peppercom.com 35 UVs for September 2011 from @compete"><img src="http://media.compete.com/site_media/images/free/fsp_twitter.png" width="22" height="22" alt="Twitter" /></span> <span class='st_facebook_custom'><img src="http://media.compete.com/site_media/images/free/fsp_facebook.png" width="22" height="22" alt="Facebook" /></span> <span class='st_linkedin_custom'><img src="http://media.compete.com/site_media/images/free/fsp_linkedin.png" width="22" height="22" alt="LinkedIn" /></span> <span class='st_email_custom'><img src="http://media.compete.com/site_media/images/free/fsp_mail.png" width="22" height="22" alt="Email" /></span></div>
<div class="search">
<form id="sa-search-form"><input name="t" type="hidden" /> <label>http://</label>
<div id="sa-search-input-wrapper" class="search-wrapper"><input name="q" type="text" id="sa-search-input" class="at-sa-search-input" /></div>
<div class="submit-wrapper"><input type="submit" value="GO" id="sa-search-submit" class="at-sa-search-submit" /></div>
</form>
</div>
<div class="message low-sample">
<p><strong>Rough Estimate:</strong>The sample size for this site is small, for more info refer to our <a href="http://www.compete.com/resources/methodology/">Data Methodology</a>.</p>
</div>
<p class="data-info">September 2011 / U.S. Data Only</p>
<ul class="nav tools">
<li class="save"><a href="javascript:void(0)" id="graph-image" class="at_save">Save Graph Image</a></li>
<li class="export"><a id="csv-export" href="javascript:void(0)" target="" class="at_export">Export CSV</a></li>
<li class="embed"><a href="javascript:void(0)" id="embed-graph" class="at_embed">Embed Graph</a></li>
</ul>
</div>
<div class="sidebar">
<div class="section score">
<h3><span class="help"><img src="http://media.compete.com/site_media/images/free/icon-help-grey.png" width="21" height="21" alt="?" /></span>Unique Visitors</h3>
<h4>35</h4>
<ul>
<li class="m2m"><span class="delta-negative number">-432</span> | <span class="delta-negative number">-92.51%</span></li>
<li class="y2y"><span class="delta-negative number">-3,420</span> | <span class="delta-negative number">-98.99%</span></li>
</ul>
<h3>Rank <span class="note">(by UVs)</span></h3>
<h4>5,534,624</h4>
<ul>
<li class="m2m"><span class="rank">2,126,327</span> | <span class="delta-negative move">-3,408,297</span></li>
<li class="y2y"><span class="rank">418,162</span> | <span class="delta-negative move">-5,116,462</span></li>
</ul>
</div>
<div class="section trends">
<h3>Competitive Rank <span class="note">(UVs)</span></h3>
<ol id='similar-sites'>
<li class="partner-link"><a href="http://www.similarsites.com/site/peppercom.com" target="_blank">Looking for sites similar to<br />
<span class="site">peppercom.com</span><br />
on SimilarSite.com ...</a></li>
</ol>
</div>
</div>
<div class="content">
<div class="section">
<h2><img width="16" height="16" alt="Logo" src="http://g.etfv.co/http://peppercom.com" /> peppercom.com</h2>
<div id="graph"></div>
</div>
<div class="section"><span id="zoominfo" class="zoominfo"></span></div>
</div>
</div>
</div>
<script id="template-help-tooltip" type="text/template">
<div class="pointer">
<div class="message">
<p>Unique Visitors counts how many unique individual people visited this site per month. Visitors are counted once, no matter how many times they visit a site in a month. Counts represent traffic from the United States only.</p>
<p>Rank measures the popularity of this site based on how many Unique Visitors came to the site in a month. With Rank, lower is better.</p>
<p>Competitive Rank shows where a site ranks in its competitive set measured by Unique Visitors.</p>
</div>
</div>
</script>
<!-- Google Code for NEW Site Analytics Home Page Remarketing List -->
<script type="text/javascript">
/* <![CDATA[ */
var google_conversion_id = 1069995145;
var google_conversion_language = "en";
var google_conversion_format = "3";
var google_conversion_color = "666666";
var google_conversion_label = "g5IxCJ-C_gIQiamb_gM";
var google_conversion_value = 0;
/* ]]> */
</script>
<script type="text/javascript" src="http://www.googleadservices.com/pagead/conversion.js">
</script>
<noscript>
<div style="display:inline;"><img height="1" width="1" style="border-style:none;" alt="" src="http://www.googleadservices.com/pagead/conversion/1069995145/?label=g5IxCJ-C_gIQiamb_gM&guid=ON&script=0" /></div>
</noscript>
<div class="footer">
<p class="copyright">© Copyright to Compete.com - A Kantar Media Company</p>
<ul class="nav utility">
<li><a href="http://www.compete.com">Visit Compete.com</a></li>
<li class="last"><a href="http://www.compete.com/plans/">Subscribe to Compete PRO</a></li>
</ul>
</div>
<!-- block tags -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-6320717-1']);
_gaq.push(['_setDomainName', '.compete.com']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<script type="text/javascript" charset="utf-8">
$j(document).ready(function() {
var s_code=s.t();
if(s_code) {
document.write(s_code);
}
});
</script>
<script language="JavaScript1.1" type="text/javascript">
var CQK = "B288EECE",
CQP = (("https:" == document.location.protocol) ? "https://" : "http://");
document.write(unescape("%3Cscript language=\"JavaScript1.1\" type=\"text/javascript\" src=\""+CQP+"js.clickequations.net/CLEQ_"+CQK+".js\" %3E%3C/script%3E" ));
</script>
<script type="text/javascript">
document.write(unescape("%3Cscript src='" + ((document.location.protocol=="https:")?"https:":"http:") + "//snapabug.appspot.com/snapabug.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
//SnapABug.addButton('4816ca29-9130-4045-b08e-5a4de935ff4b',"0","55%");
SnapABug.setDomain('.compete.com');
SnapABug.init('4816ca29-9130-4045-b08e-5a4de935ff4b');
</script>
<script type="text/javascript">
// NOTE: If 4q ever supports cross subdomain cookies we can get rid of all this!
$j(document).ready(function() {
$j('#ipeL, #invL').live('click', function() {
$j.cookie('suppress_4q', 1, {expires: 30, path: '/', domain: '.compete.com'});
});
if(!$j.cookie('suppress_4q')) {
var protocol = ("https:" == document.location.protocol ? "https://" : "http://");
var fourq = document.createElement('script');
fourq.setAttribute('type', 'text/javascript');
fourq.setAttribute('src', protocol + '4qinvite.4q.iperceptions.com/1.aspx?sdfc=1b3a8f93-36734-fb3cb395-5ed4-429c-aab9-babc0a0a2015&lID=1&loc=4Q-WEB2');
fourq.setAttribute('defer', 'defer');
document.getElementsByTagName('head')[0].appendChild(fourq);
}
});
</script>
<!-- endblock tags -->
<!-- block ie6_warning -->
<div id="ie6-warning" class="hidden">
<div class="left">
<p>Our site may not run like it should in Internet Explorer 6. For a better experience, please upgrade your browser:</p>
<ul>
<li class="firefox"><a href="http://www.firefox.com"><span>firefox</span></a></li>
<li class="ie"><a href="http://www.microsoft.com/windows/internet-explorer/default.aspx"><span>ie 8</span></a></li>
<li class="chrome"><a href="http://www.google.com/chrome"><span>chrome</span></a></li>
</ul>
</div>
<a href="#" onclick="hide_ie6_warning();" id="close-button">Don't show me this message again</a></div>
<!-- endblock ie6_warning -->
</body>
</html>
dsalazar,
It looks like they're sniffing your user agent on the server before delivering the content. They apparently don't like screen-scraper's default user agent:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
You'll need to manually set your user agent to something more acceptable. Try the following as an example (this requires that you use either the Professional or Enterprise Edition of screen-scraper).
scrapeableFile.setUserAgent("Opera/9.80 (Windows NT 6.1; U; en) Presto/2.9.168 Version/11.51");
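To see what changing the user agent does outside of screen-scraper, here is a minimal standalone Python sketch (it is not screen-scraper code). It assumes the server decides whether to redirect based only on the User-Agent header, which is a guess from the log above. The helper names build_request, extract_uvs and fetch_uvs are made up for this example, and the title regex is only an approximation of the extractor pattern from the original post.

```python
import re
import urllib.request

URL = "http://siteanalytics.compete.com/peppercom.com/"
# User agent string suggested in the reply above.
USER_AGENT = "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.9.168 Version/11.51"

# Rough regex equivalent of the extractor pattern
# "peppercom.com ~@UVs@~ UVs for September 2011 | Compete",
# loosened to accept any site name and month (an assumption).
TITLE_RE = re.compile(
    r"<title>\s*\S+\s+([\d,]+) UVs for .+? \| Compete\s*</title>"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def extract_uvs(html):
    """Return the unique-visitor count from the page title, or None."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Strip thousands separators such as "3,420" before converting.
    return int(match.group(1).replace(",", ""))


def fetch_uvs(url=URL):
    """Fetch the page with the custom user agent and extract the UV count."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        # urllib follows redirects; geturl() reports where we ended up.
        if "/ie6/" in response.geturl():
            raise RuntimeError("Still redirected to the IE6 page")
        html = response.read().decode("utf-8", errors="replace")
    return extract_uvs(html)
```

Calling fetch_uvs() performs the actual request; build_request and extract_uvs can be tried offline, for example by passing the title line from the HTML posted above to extract_uvs.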
-Scott
Thanks
Thanks for your help, Scott. I just upgraded to Pro but don't know how to set my user agent manually.
scrapeableFile.setUserAgent("
See the docs here.