Matching newsgroup message headers and text
I would like to extract the Subject line and the message body text from the following segment (a newsgroup message, with its headers, embedded in an HTML page).
[quote]
Received: by 14.11.39.53 with SMTP id m23mr385720cwr; Wed, 27 Jul 2006 18:28:02 -0700 (PDT)
Received: from 63.8.105.229 by o13h2000cwp.hgroups.com with HTTP; Thu, 28 Jul 2006 01:28:02 +0000 (UTC)
From: "John Doe"
To: [email protected]
Subject: Re: Hello World
Date: Thu, 28 Jul 2006 01:28:02 -0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <[email protected]>
User-Agent: G2/0.2
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"

Here goes the message body text, e.g... Hello World. blah blah. This part will obviously be quite random.
[/quote]
I have two problems:
1. The Subject line is not always followed by "Date:". It can be followed by "Message-ID:" in one message, "References:" in another, and so on. As it turns out, there is no whitespace between the end of the Subject line and the next line (whatever that line might be). How do I pick up only the Subject line and nothing further?
2. The only thing that separates the headers from the message body is a single blank line. The headers do not end with a distinct character or pattern, since the last header line could be anything, and the information in that last header could be anything as well (in this excerpt the last header line is "Content-Type:", but again that is not always the case). I am not sure how to pick up an entirely variable block of text that follows a blank line, when the blank line itself cannot be reliably anchored by some pattern preceding it. (I have put a rough sketch of the kind of matching I am after at the end of this post.)
Any and all help will be greatly appreciated!!
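For what it's worth, here is roughly the logic I have been attempting, written as a standalone Java sketch outside of screen-scraper. The class name, the shortened sample string, and the particular lookaheads are just my own illustrations of the idea, not anything screen-scraper-specific:
[code]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderSketch {
    public static void main(String[] args) {
        // Shortened sample text, roughly like the scraped segment quoted above.
        String msg = "Subject: Re: Hello World\r\n"
                   + "Date: Thu, 28 Jul 2006 01:28:02 -0000\r\n"
                   + "MIME-Version: 1.0\r\n"
                   + "Content-Type: text/plain; charset=\"iso-8859-1\"\r\n"
                   + "\r\n"
                   + "Here goes the message body text...";

        // Problem 1: stop the Subject match at the next header line, whatever it is.
        // A lazy match plus a lookahead for "newline, then something like a header
        // name followed by a colon" avoids listing every possible header.
        Pattern subject = Pattern.compile("Subject: (.*?)(?=\\r?\\n[\\w-]+: )", Pattern.DOTALL);
        Matcher m = subject.matcher(msg);
        if (m.find()) {
            System.out.println("Subject: " + m.group(1));
        }

        // Problem 2: the body is everything after the first blank line.
        Pattern body = Pattern.compile("\\r?\\n\\r?\\n(.*)", Pattern.DOTALL);
        Matcher b = body.matcher(msg);
        if (b.find()) {
            System.out.println("Body: " + b.group(1));
        }
    }
}
[/code]
Something equivalent written as extractor patterns is what I am after, but so far I cannot get it to behave inside screen-scraper.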
Matching newsgroup message headers and text
Hi,
We actually haven't done a lot with this kind of thing, unfortunately. In nearly all cases the web sites we scrape have plenty of HTML to facilitate extraction. You can do as much replacing as you'd like before the actual extraction occurs, but you're correct that it could get a bit tricky. I'm surprised that the execution time increased so dramatically, though; these must be large sites you're scraping. I wish I had more to offer on this one, but unfortunately I don't. To speed things up, it may help to simply scrape multiple sites in parallel. You could also distribute the scraping across multiple instances of screen-scraper, or even across multiple machines. Bear in mind, too, that the amount of memory you allocate to screen-scraper and the underlying hardware will also affect execution time.
Kind regards,
Todd
Matching newsgroup message headers and text
Thanks Todd. It deals with the problem, but it has added a huge overhead to the execution time, since it unnecessarily replaces all of the hard returns in the scraped file. Maybe it doubled or tripled the execution time. The particular section I am trying to extract data from (and where I need the hard returns for pattern matching) is only a small part of the scraped page. Is there a way to replace hard returns with #'s only in selected portions of the scraped file before applying the extractor? I guess that amounts to another extraction anyway, and once you get into extraction it will strip the hard returns before it ever gets to replacing them. Dog wagging its own tail!.. But maybe I could do some iterated scripts and extracts (rough sketch of what I mean below)? Do you have any experience with your clients on this issue?
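To make the "iterated scripts and extracts" idea a bit more concrete, here is the sort of thing I am imagining, written as plain Java rather than screen-scraper's actual scripting API. The <pre> markers and the sample page are made up; whatever really delimits the message section would go in their place:
[code]
public class SelectiveReplaceSketch {
    public static void main(String[] args) {
        // Pretend this is the full scraped page; only the part between the
        // made-up <pre>...</pre> markers is the message I care about.
        String page = "<html>...lots of other HTML...<pre>Subject: Re: Hello World\r\n"
                    + "Date: Thu, 28 Jul 2006 01:28:02 -0000\r\n\r\n"
                    + "Hello World. blah blah.</pre>...more HTML...</html>";

        String startMarker = "<pre>";
        String endMarker = "</pre>";

        int start = page.indexOf(startMarker);
        int end = page.indexOf(endMarker, start);
        if (start >= 0 && end > start) {
            // Replace hard returns with #'s only inside the message section,
            // leaving the rest of the page untouched.
            String section = page.substring(start, end).replaceAll("\\r?\\n", "#");
            page = page.substring(0, start) + section + page.substring(end);
        }

        System.out.println(page);
        // Extractor patterns could then anchor on the #'s, e.g. "Subject: ~@SUBJECT@~#".
    }
}
[/code]
The point being that only the message section would get its hard returns turned into #'s, so the rest of the page would not add to the execution time.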
Best regards,
KO
Matching newsgroup message headers and text
Hi,
You're correct that screen-scraper strips out some white space in the extraction process. Please see this FAQ for a bit more on that: [url]http://www.screen-scraper.com/support/faq/faq.php#WhiteSpace[/url].
Kind regards,
Todd Wilson
"Last Response" is not exactly what SS is trying t
I have been trying to match with regular expressions etc., to no avail. During this process, what I have found, to my great surprise, is that the "Last Response" output which users are urged to utilize (for constructing patterns) is not exactly what screen-scraper internally sees. For example, for the segment (see my previous message in this thread), which I copied directly from the "Last Response", if I set up an extractor pattern like:
Subject: ~@SUBJECT_LINE@~
with the regular expression for SUBJECT_LINE as:
.*$
I get this:
Re: Hello WorldDate: Thu, 28 Jul 2006 01:28:02 -0000Message-ID: <[email protected]>In-Reply-To: <[email protected]>References: <[email protected]>User-Agent: G2/0.2MIME-Version: 1.0Content-Type: text/plain; charset="iso-8859-1"Here goes the message body text, e.g... Hello World. blah blah. This part will obviously be quite random.
In other words, when reading the source of the scraped file, SS does not see any of the blank lines and carriage returns/line feeds (CRLFs) that do show up in the "Last Response" output. What exactly is going on? I tried this both with and without the tidy HTML option, so I know it has nothing to do with tidying. Does SS swallow all blank lines and CRLFs when matching patterns?
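To make sure I was not misreading my own regular expression, I also ran a little standalone test in plain Java (nothing to do with screen-scraper). The flattened sample string and the header names in the lookahead are just taken from the excerpt above:
[code]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedySubjectTest {
    public static void main(String[] args) {
        // The same text with the CRLFs stripped out, the way SS apparently sees it.
        String flattened = "Subject: Re: Hello WorldDate: Thu, 28 Jul 2006 01:28:02 -0000"
                         + "MIME-Version: 1.0Here goes the message body text...";

        // "Subject: (.*$)" -- with no line breaks left, this grabs everything
        // up to the end of the text, which is exactly what I am seeing.
        Matcher greedy = Pattern.compile("Subject: (.*$)").matcher(flattened);
        if (greedy.find()) {
            System.out.println("Greedy match:  " + greedy.group(1));
        }

        // A lazy match bounded by a lookahead for one of the header names that
        // actually occur in these messages stops at "Date:" instead.
        Matcher bounded = Pattern.compile(
                "Subject: (.*?)(?=(?:Date|Message-ID|In-Reply-To|References|User-Agent|MIME-Version|Content-Type): )")
                .matcher(flattened);
        if (bounded.find()) {
            System.out.println("Bounded match: " + bounded.group(1));
        }
    }
}
[/code]
With the CRLFs gone, the greedy pattern runs to the end of the text, while the lookahead-bounded version stops at the next header. Even so, that still leaves me with no blank line to anchor the message body on, which is why the stripping matters so much here.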