Can't find the content from the scrape
Hi - I have found a type of site I haven't come across before, and I can't make SS display the same content as the browser. I tried a proxy, but that seems to fail. I also used Chrome to inspect the site, but still couldn't see the path I need to take to see the product listings in SS.
https://www.truckstore.co.za/search.html#/currency=ZAR
I have spent a long time and I'm very confused!
Thanks for any suggestions in advance
Jason (the one who knows very little about Screen Scraper, not the other one who does)
That site is populating the results with JavaScript and JSON. You can read more about that here: http://www.screen-scraper.com/blog/2015/10/28/dynamic-content/
For this I used Firefox private mode, and used the Web Developer > Network tool to see the requests. The results are in one like this: https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA
You will need to make a scrapeableFile that makes that request with the request entity. Also pay close attention to the HTTP Headers.
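Outside of screen-scraper, the shape of that request can be sketched in plain Java. This is only an illustration, assuming the endpoint accepts a POST with a JSON entity and a Content-Type of application/json; the class and method names (SearchRequestSketch, build) are made up for the example, the entity is the one quoted later in this thread, and the request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds (but does not send) a POST request carrying a JSON entity.
class SearchRequestSketch {

    static HttpRequest build(String url, String jsonEntity) {
        return HttpRequest.newBuilder(URI.create(url))
                // The entity is JSON, so the Content-Type header has to say so.
                .header("Content-Type", "application/json")
                .header("Accept", "application/json")
                // The JSON string is the whole request body, not a name=value parameter.
                .POST(HttpRequest.BodyPublishers.ofString(jsonEntity))
                .build();
    }

    public static void main(String[] args) {
        String url = "https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA";
        String entity = "{\"currency\":\"ZAR\",\"uuid\":\"699936cb-9a56-4097-8fbf-3c966a826aa9\",\"country\":\"ZA\"}";

        HttpRequest request = build(url, entity);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

In screen-scraper itself the same pieces (URL, headers, request entity) go into the scrapeableFile rather than into code like this.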
I'm a little slow...
I'm afraid I don't really understand - it doesn't help that I can't get a proxy to work through Chrome (the page hangs).
Could you please let me have a more detailed description to give me a chance of getting this to work... I'd be really grateful.
I cannot get the proxy to work at all - the page will not display and so I cannot see the transaction and the header request?
Thanks
J
A lot of HTTPS sites (rightfully) think that screen-scraper is a man in the middle, so I sometimes cannot use the screen-scraper proxy. In those cases I open a Chrome Incognito window, go to Developer Tools (Ctrl + Shift + i), and then open the "Network" tab. When you request the page, you can see that request and the subsequent requests. In this case, the data you want is in one of the subsequent ones. In the response you can see the data, but it's not formatted the way you see it on the page. It's a data object, and you can request that object with a scrapeable file.
On this one, the request that has the data is at
https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA
You will need to find it in the browser to see which headers to set. In the scrapeableFile, under advanced, setting Tidy to JSON will make the response easier to read.
I attached a quick example.
Superb!
It was the headers I was struggling with - works perfectly now. The only issue is that it returns the first 25 records and I can't see how to get any further. It looks like I have to post a pageNumber in the Results Page, starting at zero and then incrementing to a total of 9 pages (an alternative might be to change the number delivered from 25 to 250, although I don't know if this is feasible/realistic?)
Your help is invaluable!
Thanks
Jason
Yeah, it looks like the request entity indicates the page. To get the total pages, I would parse out the value "total" and use something like:
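{
    // The extracted TOTAL is assumed to arrive as text, so parse it rather than assigning it directly.
    int total = Integer.parseInt(String.valueOf(dataRecord.get("TOTAL")).trim());
    int perPage = 25;

    // Round up so a partial last page is still counted.
    int pages = total / perPage;
    if (total % perPage > 0)
        pages++;

    // The first page has already been scraped, so continue from the second.
    for (int i = 2; i <= pages && !session.shouldStopScraping(); i++)
    {
        log.log(">Scraping page " + i + " of " + pages);
        // The page number for the request entity still has to be set before this call.
        session.scrapeFile("Your file");
    }
}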
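The page arithmetic in that script can be checked on its own. Below is a minimal standalone sketch of the same round-up calculation; the class and method names (PageMath, pageCount) and the totals used in main are made up for illustration and are not taken from the site.

```java
// Hypothetical helper showing the round-up page calculation in isolation.
class PageMath {

    // Number of pages needed to show `total` records at `perPage` records per page.
    static int pageCount(int total, int perPage) {
        if (total < 0 || perPage <= 0) {
            throw new IllegalArgumentException("total must be >= 0 and perPage must be > 0");
        }
        // Integer ceiling division: same result as total / perPage, plus one if there is a remainder.
        return (total + perPage - 1) / perPage;
    }

    public static void main(String[] args) {
        int perPage = 25;
        // Illustrative totals only.
        int[] totals = {0, 25, 26, 250};
        for (int total : totals) {
            System.out.println(total + " records -> " + pageCount(total, perPage) + " page(s)");
        }
    }
}
```

If the site numbers its pages from zero, the loop counter and the page number sent in the request will differ by one, so it is worth confirming the numbering against the requests seen in the browser.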
Seems logical
That assumes, however, that I know how to set a request entity - I have been through the forum and I am struggling to work it out. I have set the scrape up so it can scrape all the data on the details page, and I can scrape the fields you outlined (TOTAL and PAGE), but I can't work out how to put them at the end of the request here:
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Access-Control-Request-Method: POST
Content-Type: application/json
Content-Length: 79
Access-Control-Request-Headers: content-type
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
DNT: 1
Host: proxyprodza.tso-aws.com
Connection: keep-alive
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip
Origin: https://www.truckstore.co.za
{"currency":"ZAR","uuid":"699936cb-9a56-4097-8fbf-3c966a826aa9","country":"ZA"}
Many thanks for your help - I will try to upload my scraping session (please don't laugh)
I have got a little further...
I can now get to here:
Results: Requesting URL: https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA
Results: POST data: pageNumber=1{"currency":"ZAR","uuid":"699936cb-9a56-4097-8fbf-3c966a826aa9","country":"ZA"}
but the pageNumber is not inside the {} and does not have the "" format.
I have used (running before the file is scraped):
// Sets the content type of the POST entity to JSON.
scrapeableFile.setContentType( "application/json" );
scrapeableFile.addPOSTHTTPParameter( "pageNumber", session.getv( "PAGE" ), 2 );
I'm trying, but it is slow...
Thanks
J
The request entity for page 2 onward is a lot bigger, but it's the same basic idea. Are you able to see the request entity in the browser's network tools?
I can see the request - just can't seem to replicate it?
In the example below, I can see that I need to have "pageNumber":"2" inside the { } brackets - I just can't seem to manage to put it there - it always ends up outside the brackets with an = sign?
Results: Requesting URL: https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA
Results: POST data: pageNumber=0{"currency":"ZAR","uuid":"699936cb-9a56-4097-8fbf-3c966a826aa9","country":"ZA"}
Generally I wouldn't go so far in a support request, but I updated the attached scrape so the parsing script does pagination. When I watched the website making requests for the next page, the request entity was pretty big, but some experimentation found that I didn't need all of it.
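For anyone reading without the attachment, the general idea is that the page number has to be part of the JSON text itself, because a POST parameter is sent as a separate name=value pair. The plain Java sketch below shows one way to build such a string. It assumes the server accepts "pageNumber" as a quoted value alongside the existing fields, as described earlier in the thread; the class and method names (EntityBuilder, withPageNumber) are made up, and this is not the code from the attached scrape.

```java
// Hypothetical helper: places a pageNumber field inside an existing JSON object string.
class EntityBuilder {

    static String withPageNumber(String baseEntity, int page) {
        String trimmed = baseEntity.trim();
        if (!trimmed.startsWith("{") || !trimmed.endsWith("}")) {
            throw new IllegalArgumentException("expected a JSON object such as {...}");
        }
        // Drop the closing brace so the new field can be appended inside the object.
        String head = trimmed.substring(0, trimmed.length() - 1).trim();
        // Only add a comma when the object already has at least one field.
        String separator = head.equals("{") ? "" : ",";
        return head + separator + "\"pageNumber\":\"" + page + "\"}";
    }

    public static void main(String[] args) {
        String base = "{\"currency\":\"ZAR\",\"uuid\":\"699936cb-9a56-4097-8fbf-3c966a826aa9\",\"country\":\"ZA\"}";
        // Prints the entity with the page number inside the braces.
        System.out.println(withPageNumber(base, 2));
    }
}
```

The resulting string would then be sent as the whole request entity in place of the original one.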