Downloading gzip XML file before scraping
Greetings Scrapers,
I am about to scrape an XML file that contains around 10,000 records. The file is gzipped and resides at a URL like http://ww.domain.com/xmlfile.gz
Is there a way to configure screen-scraper to download and unzip that file locally before initiating the scraping session?
Best,
In this case, I would make one scraping session that downloads and unzips the file, and at the end of that scrape use RunnableScrapingSession to spawn the session that does the actual scraping.
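The download-and-unzip step could look roughly like the following. This is a minimal sketch in plain Java, not screen-scraper's own API: the class name `GzipDownload`, both method names, and the target path are hypothetical, and it assumes the server returns the raw gzip bytes at the given URL.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: streams a gzipped resource to disk without
// holding the whole file in memory.
final class GzipDownload {

    // Decompresses a gzip stream into `target` and returns the number
    // of decompressed bytes written. An existing file is overwritten.
    static long gunzipTo(InputStream gzipped, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(gzipped)) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Opens the URL and decompresses its body straight into `target`.
    static long downloadAndGunzip(String url, Path target) throws IOException {
        try (InputStream raw = new URL(url).openStream()) {
            return gunzipTo(raw, target);
        }
    }
}
```

A call such as `GzipDownload.downloadAndGunzip("http://ww.domain.com/xmlfile.gz", Paths.get("xmlfile.xml"))` would leave the unzipped XML on disk for the second session to read. Inside a screen-scraper Interpreted Java script the same calls would probably need adapting, since that interpreter may not accept try-with-resources.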
Thanks for the answer, Jason.
I managed to get the customer to upload the XML file unzipped so the "unzipping problem" solved itself.
But I ran into a new problem when I tried to run the scrape: because the file is pretty big, screen-scraper chokes by consuming all available memory before the scrape even starts.
Any ideas on how to solve this?
Continuing from above:
I managed to cure the choking by increasing the memory allocation and turning off Tidy HTML. Then a new error showed up: the extractor patterns now time out, no matter how I adjust the time-out settings in screen-scraper. The file I am trying to scrape is an XML file that contains around 3000 properties, each carrying around 15 data points. I try to extract each property with a ~@DATARECORD@~ extractor pattern, plus sub-extractor patterns for the respective data points. But I keep getting "Warning! The operation timed out while applying the DATARECORD extractor pattern, so it is being skipped.".
In my area it's pretty common to have files like this, and sometimes they are more than ten times larger than this example.
Is there any alternative way to set up a scrape in order to get this fellow to play ball?
Are you using a scrapeable file to read the XML? There is some overhead in that. If you instead used a script with a buffered reader, it would reduce that overhead somewhat.
Yes, I am using a scrapeable file, which I realize will not get me there.
I am now trying to write a script for the complete session instead. I looked for leads in the scripts repository but did not find any that seemed applicable. Any nice hints residing in your back pocket, Jason?
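One way the buffered-reader idea from earlier in the thread might be sketched is shown here, in plain Java. It is an illustration under stated assumptions, not a script from the screen-scraper repository: the class `RecordStreamer`, the method `forEachRecord`, and the element name `property` are all hypothetical, and the sketch assumes each record is wrapped in a simple, non-nested `<tag>...</tag>` pair whose opening tag has no attributes. Only one record at a time is held in memory, so the full file never has to be loaded or matched against a single large pattern.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical helper: walks a large XML file record by record.
final class RecordStreamer {

    // Reads `source` in chunks and passes every complete
    // <tag>...</tag> block to `handler`. Returns the number of
    // records handed over.
    static int forEachRecord(Reader source, String tag,
                             Consumer<String> handler) throws IOException {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        StringBuilder pending = new StringBuilder(); // text not yet consumed
        char[] chunk = new char[8192];
        int count = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            int read;
            while ((read = in.read(chunk)) != -1) {
                pending.append(chunk, 0, read);
                int end;
                // Emit every record that is now complete in the buffer.
                while ((end = pending.indexOf(close)) != -1) {
                    int start = pending.indexOf(open);
                    int stop = end + close.length();
                    if (start != -1 && start < end) {
                        handler.accept(pending.substring(start, stop));
                        count++;
                    }
                    // Drop everything up to and including this record.
                    pending.delete(0, stop);
                }
            }
        }
        return count;
    }
}
```

Called with something like `RecordStreamer.forEachRecord(new FileReader("xmlfile.xml"), "property", record -> { ... })`, the handler receives one small record string at a time, and the 15 or so data points can then be pulled out of that string individually. Inside a screen-scraper Interpreted Java script the lambda and generics would likely have to be rewritten in older Java syntax.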