Mapping product numbers

Hi

I have just been looking at screen-scraper for a few days and am planning to use it for scraping e-commerce sites. I have a bit of experience with Java, and after trying the tutorials I think programming the scraper won't be a problem for me.

As one of the posts on the blog says, this is just one part of it. Another is finding the right pages and getting the data mapped together. I think these two problems will take up most of my time. I have thought of one way to do it.

I thought it might be possible to start from a site that has already done a lot of this work: www.pricerunner.co.uk. For instance, I could write a scraper that goes to a page like this:

http://www.pricerunner.co.uk/pl/2-1082639/TVs/Samsung-LE-32A457-Compare-...

It would extract all the URLs, categorize them, and save them to a file. This way I would have a file for every site I need to scrape, and the mapping would be done.

Would this be an OK way to do it, or is there an easier way?

Hans

We had a guy write a comment

We had a guy write a comment on here on Friday, but I guess it didn't get saved :P Sorry for the delay.

I think you're on the right track. Really, the question becomes "how do I want to store my data?"

Files work, particularly if you would just like a simple spreadsheet (or a CSV file, for simpler applications).

A (little) more complicated solution would be a small database, but only do that if you actually need or want the data in a database.

The spreadsheet is a pretty flexible, portable solution to most tasks like this. If you can navigate to the "details page" (as we often call it) on pricerunner.co.uk, then it's really only a matter of writing to a file whose name is made unique by the product's details... model name/number, etc.
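As a minimal sketch of that idea in plain Java (the field values here are hypothetical stand-ins for what your extractor patterns would actually capture; in a screen-scraper script you'd pull them from session variables instead):

import java.io.FileWriter;
import java.io.IOException;

public class SaveProduct {
    public static void main(String[] args) throws IOException {
        // Hypothetical values -- in practice these come from your
        // extractor pattern tokens / session variables.
        String model = "LE-32A457";
        String name = "Samsung LE-32A457";
        String price = "499.99";

        // The model number makes the file name unique per product.
        FileWriter out = new FileWriter(model + ".csv");
        out.write("model,name,price\n");
        out.write(model + "," + name + "," + price + "\n");
        out.close();
    }
}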

You could store your data in lots of individual files, or you could try to map out some folders as you go, to imitate the structure of the site categories. I would say that the only disadvantage to using files is that they're not as 'searchable' for a computer. If you want to simply *have* the data, browse-able by someone at a computer, then files are a good way to go.

If you want the data for some sort of price-comparison website, it might be better to keep the data in a database, because a database specializes in high-speed reads/writes and makes it easy to generate web pages dynamically from the stored data.

Do you have any particular end-point in mind? Your own website? Reference data for a company? Spreadsheet reports?

Tim

Mapping of data

The data will be used internally within a company.

I think we will have a look at building a database.

I do not want to scrape the data from Pricerunner itself, but from the detail pages Pricerunner links to, since I need the complete, up-to-date prices. The Pricerunner detail pages will only serve as a source for matching the products across the companies' detail pages.

This gives me two problems.

1. I need to figure out which companies carry a certain product.

2. I need to match all the products.

If I scrape the URLs from Pricerunner and save them to a database, storing the URL, company, and product, then it should be possible to have the scraper visit each page every day to check the price and delivery time and save that data to the database, right?

Then my main concern will be how to get the URLs into the database in the first place.

Hans

Sorry for the delayed

Sorry for the delayed response. Work has been crazy recently.

Yes, you can use screen-scraper to regularly (each week/month/etc) iterate over all the products pricerunner lists, and save each product's list of URLs to a simple database table. Then, you can have separate projects that actually go out and visit each of those URLs to get the real data you're after.

We don't build database communications into screen-scraper (yet; I'm working on something that could be promising, and dirt easy to work with), so you'll have to pick a Java connector for the database type that you would like to use. The popular choice is MySQL, with its accompanying JDBC driver. You can then import the needed classes in your screen-scraper scripts, open a connection to a local or remote database, and send the data to it with normal queries.
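As a rough sketch (the database name, credentials, and the 'urls' table from the list below are all just placeholders), a script could do something like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SaveUrl {
    public static void main(String[] args) throws Exception {
        // Load the MySQL JDBC driver (the connector jar must be on
        // screen-scraper's classpath).
        Class.forName("com.mysql.jdbc.Driver");

        // Placeholder database name and credentials.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/pricerunner", "user", "password");

        // Insert one scraped URL; in a real screen-scraper script the
        // values would come from session variables, not literals.
        PreparedStatement stmt = conn.prepareStatement(
                "INSERT INTO urls (id, url) VALUES (?, ?)");
        stmt.setInt(1, 42); // the pricerunner item's ID
        stmt.setString(2, "http://www.example.com/some-product");
        stmt.executeUpdate();

        stmt.close();
        conn.close();
    }
}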

--So you could have a simple database ('pricerunner'), with one table per pricerunner category ('electronics', 'bedroom', 'gadgets', etc; how you divide that up is up to you).
--Then each table could have an entry per item in that category's list on pricerunner. Each entry would have an ID number (ideally an auto-incrementing integer primary key).
--Finally, a single table ('urls') probably needs only 2 columns: ID and URL. The ID would be the ID from the previous point, and URL would be one of the URLs listed on pricerunner. There would be multiple entries in this 'urls' table with the same ID (yet each with a different URL), since each item on pricerunner has multiple links to other sites. A sketch of this schema follows the list.
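Here's a rough sketch of that schema in MySQL terms, created through JDBC; the table and column names ('tvs', 'urls', etc) are just placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSchema {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/pricerunner", "user", "password");
        Statement stmt = conn.createStatement();

        // One table per pricerunner category; 'tvs' is just an example.
        stmt.executeUpdate(
                "CREATE TABLE tvs ("
                + " id INT AUTO_INCREMENT PRIMARY KEY," // the item's ID number
                + " name VARCHAR(255)"                  // product name/model
                + ")");

        // One row per (item, URL) pair; the same id repeats
        // once for every site that lists the item.
        stmt.executeUpdate(
                "CREATE TABLE urls ("
                + " id INT,"            // references the item's id above
                + " url VARCHAR(500)"
                + ")");

        stmt.close();
        conn.close();
    }
}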

The scrape responsible for the above could make its routine rounds, clearing each table as it updates it, so that your URLs are all current each time you run the update.
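The clearing step is just a quick statement before the re-insert begins; something like this, assuming the placeholder 'urls' table and a connection like the one in the earlier sketch:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class RefreshUrls {
    // Clear the stale URL rows so the update scrape can
    // re-insert a fresh, current set.
    static void clearUrls(Connection conn) throws SQLException {
        Statement stmt = conn.createStatement();
        stmt.executeUpdate("TRUNCATE TABLE urls");
        stmt.close();
    }
}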

And again, from there you could have other projects that tap into this URL database and go out and visit each URL to scrape details.

One last thing to consider is how to know which sites have which products, as you've said. You might want to add a column to that last 'urls' table (the one in the last point of the list above), called "site" or "domain" or something. That way, you have a table which links back to the pricerunner ID that you're generating, along with the URLs to which it points, and then the base site's name. From there it would be possible to infer and generate a list of products for each site, as reported by pricerunner.
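With that extra column in place, a query sketch like this would pull the product list for one site (again, the names here are only placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ProductsPerSite {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/pricerunner", "user", "password");
        Statement stmt = conn.createStatement();

        // Every item a given site carries, as reported by pricerunner.
        ResultSet rs = stmt.executeQuery(
                "SELECT id, url FROM urls WHERE site = 'example.com'");
        while (rs.next()) {
            System.out.println(rs.getInt("id") + " -> " + rs.getString("url"));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}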