Can't find the content from the scrape

Hi - I have found a type of site I haven't come across before, and I can't make SS display the same content as the browser. I have tried a proxy, but this seems to fail. I have also used Chrome to inspect the site, but still could not see the path I need to take to see the product listings in SS.

https://www.truckstore.co.za/search.html#/currency=ZAR

I have spent a long time and I'm very confused!

Thanks for any suggestions in advance

Jason (the one who knows very little about Screen Scraper, not the other one who does)

jas777 on 09/06/2019 at 9:13 am

screen-scraper support for licensed users

That site is populating the results with JavaScript and JSON. You can read more about that here: http://www.screen-scraper.com/blog/2015/10/28/dynamic-content/

For this I used Firefox private mode, and used the Web Developer > Network tool to see the requests. The results are in a request like: https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA

You will need to make a scrapeableFile that makes that request with the request entity. Also pay close attention to the HTTP Headers.

jason on 09/06/2019 at 2:37 pm

I'm a little slow...

I'm afraid I don't really understand - it doesn't help that I can't get a proxy to work through Chrome (the page hangs).

Could you please let me have a more detailed description to give me a chance of getting this to work... I'd be really grateful.

I cannot get the proxy to work at all - the page will not display and so I cannot see the transaction and the header request?

Thanks

jas777 on 09/12/2019 at 9:27 am

A lot of HTTPS sites (rightfully) think that screen-scraper is a man in the middle, so I sometimes cannot use the screen-scraper proxy. In those cases I open a Chrome Incognito window, go to Developer Tools (Ctrl + Shift + I), and open the Network tab. Then when you request the page you can see the initial request and the subsequent requests. In this case, the data you want is in one of the subsequent ones. In its response you can see the data, but it's not formatted the way you see it on the page. It's a data object, and you can request that object with a scrapeable file.

jason on 09/12/2019 at 9:56 am

On this one, the request that has the data is at

https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA

You will need to find that request in the network tools so you can copy its headers. In the scrapeableFile's Advanced tab, setting Tidy to JSON will make the response easier to read.

I attached a quick example.

jason on 09/12/2019 at 10:04 am

Superb!

It was the headers I was struggling with - works perfectly now - the only issue is that it returns the first 25 records and I can't see how to get any further. It looks like I have to post a pageNumber in the Results Page, starting with zero and then incrementing to a total of 9 pages (an alternative might be to change the number delivered from 25 to 250 - although I don't know if this is feasible/realistic?)

Your help is invaluable!

Thanks

Jason

jas777 on 09/18/2019 at 6:39 am

Yeah, it looks like the request entity indicates the page. To get the total pages, I would parse out the value "total" and use something like:

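// Run after the first results page is extracted; PAGE may be stored as a String.
if ("1".equals("" + session.getv("PAGE")))
{
    // Extracted values are Strings, so convert before doing arithmetic.
    int total = Integer.parseInt(dataRecord.get("TOTAL").toString().trim());
    int perPage = 25;
    // Round up so a partial last page is still requested.
    int pages = total / perPage;
    if (total % perPage > 0)
        pages++;

    for (int i = 2; i <= pages && !session.shouldStopScraping(); i++)
    {
        log.log(">Scraping page " + i + " of " + pages);
        // Update PAGE so the next request asks for page i
        // (assumes the request entity reads the PAGE session variable).
        session.setv("PAGE", i);
        session.scrapeFile("Your file");
    }
}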
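
As a standalone check of the page-count arithmetic above, here is a minimal plain-Java sketch, independent of screen-scraper. The class name `PageCount`, the method `pagesFor`, and the sample totals are hypothetical; only the page size of 25 comes from this thread.

```java
public class PageCount {
    // Number of pages needed to show `total` records at `perPage` records per page.
    static int pagesFor(int total, int perPage) {
        if (perPage <= 0) {
            throw new IllegalArgumentException("perPage must be positive");
        }
        if (total <= 0) {
            return 0; // nothing to page through
        }
        // Integer ceiling division: same result as total/perPage plus one
        // extra page when there is a remainder.
        return (total + perPage - 1) / perPage;
    }

    public static void main(String[] args) {
        int[] sampleTotals = {25, 26, 225}; // hypothetical totals, not taken from the site
        for (int total : sampleTotals) {
            System.out.println(total + " records -> " + pagesFor(total, 25) + " page(s)");
        }
    }
}
```

With 25 records per page, any total from 201 to 225 gives 9 pages, which matches the page count mentioned earlier in the thread.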

jason on 09/19/2019 at 10:15 am

Seems logical

That assumes, however, that I know how to set a request entity - I have been through the forum and I am struggling to work it out. I have set the scrape up so it can scrape all the data on the details page, and I can scrape the fields you outlined (TOTAL and PAGE), but I can't work out how to put them at the end of the request here:

POST /tsoApp/widget/truck/search/en_ZA HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Access-Control-Request-Method: POST
Content-Type: application/json
Content-Length: 79
Access-Control-Request-Headers: content-type
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
DNT: 1
Host: proxyprodza.tso-aws.com
Connection: keep-alive
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip
Origin: https://www.truckstore.co.za

{"currency":"ZAR","uuid":"699936cb-9a56-4097-8fbf-3c966a826aa9","country":"ZA"}

Many thanks for your help - I will try to upload my scraping session (please don't laugh)

jas777 on 09/20/2019 at 4:56 am

I have got a little further...

I can now get to here:
Results: Requesting URL: https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA
Results: POST data: pageNumber=1{"currency":"ZAR","uuid":"699936cb-9a56-4097-8fbf-3c966a826aa9","country":"ZA"}

but the pageNumber is not inside the {} and does not have the "" etc format..

I have used (running before the file is scraped):

// Sets the content type of the POST entity to JSON.
scrapeableFile.setContentType( "application/json" );
scrapeableFile.addPOSTHTTPParameter( "pageNumber", session.getv( "PAGE" ), 2 );

I'm trying, but it is slow...

Thanks

jas777 on 09/20/2019 at 6:20 am

The request entity for page 2 forward is a lot bigger, but the same basic idea. Are you able to see the request entity in the browser's network tools?

jason on 09/25/2019 at 9:44 am

I can see the request - just can't seem to replicate it?

In the example below, I can see that I need to have "pageNumber":"2" inside the { } brackets - I just can't seem to manage to put it there - it always ends up outside the brackets with an = sign?

Results: Requesting URL: https://proxyprodza.tso-aws.com/tsoApp/widget/truck/search/en_ZA
Results: POST data: pageNumber=0{"currency":"ZAR","uuid":"699936cb-9a56-4097-8fbf-3c966a826aa9","country":"ZA"}

jas777 on 09/26/2019 at 5:01 am

Generally I wouldn't go so far in a support request, but I updated the attached scrape so the parsing script does pagination. When I watched the website making requests for the next page, the request entity was pretty big, but some experimentation found that I didn't need all of it.
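
The attached scrape is not reproduced here. As a rough illustration of the underlying idea only, the page number has to be written into the JSON body itself instead of being added as a separate POST parameter. The plain-Java sketch below assumes the body needs only the three fields shown earlier in the thread plus `pageNumber`, written as a quoted string as in the post above. The real field set the server expects is not shown in this thread, and the class and method names (`SearchBody`, `build`, `quote`) are hypothetical.

```java
public class SearchBody {
    // Wrap a value in double quotes, escaping backslashes and quotes for JSON.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Build the request entity with pageNumber inside the braces.
    // Assumption: pageNumber is sent as a quoted string; drop the quote()
    // call around it if the browser's network tools show a bare number.
    static String build(String currency, String uuid, String country, int pageNumber) {
        return "{"
            + "\"currency\":" + quote(currency) + ","
            + "\"uuid\":" + quote(uuid) + ","
            + "\"country\":" + quote(country) + ","
            + "\"pageNumber\":" + quote(String.valueOf(pageNumber))
            + "}";
    }

    public static void main(String[] args) {
        // Values copied from the request shown earlier in the thread.
        System.out.println(build("ZAR", "699936cb-9a56-4097-8fbf-3c966a826aa9", "ZA", 2));
    }
}
```

A string built this way would be sent as the whole request entity with Content-Type application/json, so nothing ends up outside the braces with an = sign.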

jason on 09/27/2019 at 9:56 am
