Importing a URL list to create scrapeable files
I have a list of URLs in an Excel spreadsheet column that I need to scrape.
Is there a way (perhaps a script) to import this list into screen-scraper and create the scrapeable files? I have not found a solution in any of the previous posts on this forum.
I have tried copying and pasting this list into Somename (Scraping Session).xml (i.e., into a "template" of an exported Somename (Scraping Session).xml file), but screen-scraper does not allow me to import it back.
Although this would seem a basic function for screen-scraper to offer (i.e., scraping on the basis of an already existing list of URLs), and simple to provide within the basic software or as a script, it is not.
Any assistance would be much appreciated.
thanks,
heisje
Importing a URL list to create scrapeable files
It is a happy situation that screen-scraper can actually do this particular job in its present form, albeit clumsily. It will be a happier one when the proposed solution becomes available, hopefully soon.
In the meantime, I feel this discussion will help many users understand how they may scrape web pages from a list of URLs already available in a spreadsheet.
Many thanks, Todd, for your contribution.
heisje
Importing a URL list to create scrapeable files
Hi,
We actually do quite a bit of what you're describing, only using more manual methods, like those I referred to previously. The difficulty is that sites vary so dramatically in how they present data. If what you're proposing is simply a way to interface with Excel, that could certainly be done, but that's the easy part. The harder part is extracting the desired values from the page, saving them out, and potentially performing logic on the data. I'm unaware of any solution that automates this process much more than we do. That certainly doesn't mean one doesn't exist, of course. For the time being, what you see is what you get, as far as our offering goes. I'll certainly give your suggestions a bit more thought, though. It's very possible we're missing out on a big market simply because our tool doesn't automate enough of the process.
Kind regards,
Todd
Importing a URL list to create scrapeable files
Todd, you said:
"Were you hoping that screen-scraper would somehow be able to interface directly with a Microsoft Excel spreadsheet, or something like that?"
and I say:
"E - X - A - C - T - L - Y "
That's the spirit.
What is a practical application for this? Affiliate merchant datafeeds.
Most of them leave a lot to be desired: no product image URLs, no long product descriptions, etc., while all this data is available on their websites. And they find it (it seems) very difficult to provide a proper feed with all the readily available information.
However, most merchants do provide a product URL in their spreadsheet.
A list of such URLs provides the basis for scraping.
Any chance of getting something along the lines you suggested?
Best,
heisje
Importing a URL list to create scrapeable files
Just one other thought: if you're grabbing those URLs from a separate web page, another option would be to scrape each URL individually and set each as a session variable. This would be a similar approach to using the text file, but more dynamic, given that you'd draw the URLs directly from the source rather than having to copy and paste them into the text file.
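The link-grabbing step described here would, in screen-scraper itself, be an extractor pattern (a token such as ~@URL@~) applied to the page that lists the links. As a rough standalone illustration of the same idea, here is a minimal Python sketch that pulls every href value out of a page's HTML; the regex and function name are illustrative, not screen-scraper's API:

```python
import re

# Rough stand-in for an extractor pattern like href="~@URL@~":
# capture every href attribute value from the page's HTML.
HREF_PATTERN = re.compile(r'href="([^"]+)"')

def extract_urls(html_text):
    """Return all URLs found in href attributes, in page order."""
    return HREF_PATTERN.findall(html_text)
```

Each extracted URL would then be stored in a session variable and handed to the detail-page scrapeable file, one at a time.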
Best,
Todd
Importing a URL list to create scrapeable files
Hi,
Thanks, this helps. I believe the method I've described is probably your best bet. I agree that it would be great to have a simpler method, but I may need to understand better what you're proposing. Were you hoping that screen-scraper would somehow be able to interface directly with a Microsoft Excel spreadsheet, or something like that? Perhaps that it would read the columns and prompt you as to which column contains the URLs you want to scrape?
Thanks,
Todd
Importing a URL list to create scrapeable files
Thanks, Todd.
I did not want to bother you with specifics, but perhaps greater detail will be more useful as an example for people with a similar request in the future.
So, here is a URL list, a real-case example:
http://hamptoninn.hilton.com/en/hp/hotels/index.jhtml?ctyhocn=YWGCAHX
http://www.hilton.com/en/hi/hotels/index.jhtml?ctyhocn=YWGWIHF
http://www.hilton.com/en/hi/hotels/index.jhtml?ctyhocn=STJHITW
http://hiltongardeninn.hilton.com/en/gi/hotels/index.jhtml?ctyhocn=YHZHGGI
http://homewoodsuites.hilton.com/en/hw/hotels/index.jhtml?ctyhocn=ONTBUHW
http://hiltongardeninn.hilton.com/en/gi/hotels/index.jhtml?ctyhocn=YYZBUGI
http://hiltongardeninn.hilton.com/en/gi/hotels/index.jhtml?ctyhocn=YYZCMGI
http://www.hilton.com/en/hi/hotels/index.jhtml?ctyhocn=YXULOHF
I realize we meant exactly the same thing when referring to tutorial 7, i.e. creating a page with all the links and scraping from it, following the procedure in that tutorial. I am glad there is a solution, even though I would hope for a simpler and more elegant one, which, as I mentioned before, would be to read URLs directly from a spreadsheet column, or to read URLs imported manually into the script data from a spreadsheet column. Doing that with, e.g., 200,000 URLs (e.g. HomeDepot) is messy, I believe.
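Reading the URLs "directly from a spreadsheet column" can at least be approximated outside screen-scraper: save the Excel sheet as CSV, then copy the URL column into the one-URL-per-line text file that the tutorial 7 method reads. A minimal Python sketch, assuming a CSV export; the file names and the ProductURL column header are placeholders:

```python
import csv

def extract_url_column(csv_path, txt_path, column="ProductURL"):
    """Copy one column of a CSV export of the spreadsheet into a
    plain-text file with one URL per line, skipping blank cells."""
    with open(csv_path, newline="") as src, open(txt_path, "w") as dst:
        for row in csv.DictReader(src):
            url = (row.get(column) or "").strip()
            if url:
                dst.write(url + "\n")
```

Even a 200,000-row column collapses to a plain text file this way, with no hand-built page of links.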
But maybe there *is* some better solution available as the software stands now?
thanks,
heisje
Importing a URL list to create scrapeable files
Hi,
I might be able to help a bit more if I knew in more detail what you'd like to do. Are all of these URLs from the same site? That is, do they look something like this?
http://www.mysite.com/page1.htm
http://www.mysite.com/page2.htm
http://www.mysite.com/page3.htm
Or are you somehow wanting to scrape from different sites? Even better, could you provide a few of the URLs so that I could take a look at them?
In answer to your question, tutorial seven gives an example of reading in multiple search terms, but they could just as easily be URLs. It would be similar to your suggestion of reading the values from a spreadsheet column, only you would instead read them in from a text file. In such a case, the value for the URL of your scrapeable file might simply be ~#URL#~, where each URL read in from the text file is stored in a session variable called URL.
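The control flow described above (read a line, store it in the URL session variable, invoke the scrapeable file whose URL is ~#URL#~, repeat) can be sketched in standalone Python. In screen-scraper itself this would be an Interpreted Java script; scrape_page below is a placeholder for screen-scraper invoking the scrapeable file, not a real API call:

```python
def scrape_from_list(txt_path, scrape_page):
    """Read one URL per line from a plain-text file and hand each
    to scrape_page -- the analogue of setting the URL session
    variable and invoking the scrapeable file for every line."""
    with open(txt_path) as f:
        for line in f:
            url = line.strip()
            if url:  # ignore blank lines
                scrape_page(url)
```

With the text file in place, a call such as scrape_from_list("urls.txt", scrape_page) walks the whole list, however long it is.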
Kind regards,
Todd
Importing a URL list to create scrapeable files
Many thanks, Todd, for replying to this post; much appreciated.
I hate to give the impression that I have not tried to find the solution myself; please be assured that this is not the case. I decided to ask for assistance in this forum only after many hours of trying to find a solution on my own, or in previous forum posts.
I had gone through all the tutorials many times, including tutorial number 7.
The only hint surfacing (indirectly) from tutorial 7 is to create a page with all the required links and scrape accordingly.
If this is what you imply, so far so good.
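If the page-of-links route is taken, the page need not be built by hand, even at 100,000 links. A hedged Python sketch that generates a minimal local HTML page from a URL list, which could then serve as the starting page for a link-following scrape (the markup layout is illustrative):

```python
import html

def build_links_page(urls):
    """Return a minimal HTML page containing one <a> tag per URL,
    suitable as a local starting page for a link-following scrape."""
    links = "\n".join(
        '<a href="{0}">{0}</a><br>'.format(html.escape(u)) for u in urls
    )
    return "<html><body>\n" + links + "\n</body></html>"
```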
But I am talking about 40,000 links, 100,000 links, and so on.
If creating a page with 100,000 links is the solution, I would rather call for a more elegant one, like reading and importing from a spreadsheet column.
Kindly confirm whether the solution you implied from tutorial 7 is to create a page with many thousands of links, or whether you had something more elegant in mind that I could not discover.
Many thanks in advance,
heisje
Importing a URL list to create scrapeable files
Hi,
I think your best bet would be to use the method we exemplify in our seventh tutorial: here. Would you mind looking through that to see if it will work for you?
Kind regards,
Todd Wilson