How to extract data from PDF and Excel files?

I would like to extract data that a web site makes available in Excel spreadsheet form. Is there a way I can get that data using screen-scraper? If it's not built into the product, is there an indirect way to get it, such as a program that converts the spreadsheet into an XML or HTML file that SS can assign to a variable that it makes available to a .NET program?

How about this -- can you get SS to copy an Excel spreadsheet from a website into a SS variable that gets passed to a .NET program? Then the .NET program would somehow have to invoke a utility program to convert the Excel spreadsheet into another format that it can process.

A related question: I saw a post about getting PDF data using a conversion program. Is that still the best way to go for PDF, or has something been built into the product?

Do you have any samples for extracting data from either Excel or a PDF? Have you created a service to do this? Can you share the code?

Thank you for the responses!

Parsing Excel spreadsheets & PDFs

Gary,

Unfortunately, we do not have features built into screen-scraper to parse either type of file.

For Excel spreadsheets, it's possible that the file served with an .xls(x) extension is actually XML underneath (many sites export SpreadsheetML or even HTML under an .xls extension). To find out, just drop the file into your favorite text editor.

If the files are not XML underneath, then you would need to use a third-party utility to convert them to XML or HTML. It looks like there are a few utilities out there. To avoid needing human intervention, it would be best if the utility offered a command-line interface so screen-scraper or some other app could invoke it.
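For example, a script could shell out to the converter after the file has been downloaded. Here's a rough sketch in Interpreted Java; the converter name ("xls2xml"), its arguments, and the file paths are placeholders for whatever utility you end up choosing:

// Interpreted Java (screen-scraper script) -- a rough sketch only. "xls2xml",
// its arguments, and the paths below are placeholders, not a recommendation.
String inputFile  = "C:/screen-scraper/report.xls";
String outputFile = "C:/screen-scraper/report.xml";

// Invoke the converter and wait for it to finish.
Process p = Runtime.getRuntime().exec(new String[] { "xls2xml", inputFile, outputFile });
int exitCode = p.waitFor();
session.log("Converter exited with code " + exitCode);

if (exitCode == 0) {
    // Make the converted file's location available to later scripts.
    session.setVariable("CONVERTED_FILE", outputFile);
}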

Once the data is in XML format, you have a few options. You can find examples of parsing CSV files in our Scripts Repository. However, because XML is not necessarily flat data, another option is to use a Java utility such as SAXReader (PM me for a sample script) or the Microsoft equivalent, XmlReader.
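If you go the SAXReader route, a script along these lines is roughly what it looks like. This is only a sketch; it assumes the dom4j library (which provides SAXReader) is on screen-scraper's classpath, and the file path and XML structure are made up:

// Interpreted Java (screen-scraper script) -- a minimal sketch, assuming dom4j
// is available and the converted file lives at the path shown.
import java.io.File;
import java.util.Iterator;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

SAXReader reader = new SAXReader();
Document doc = reader.read(new File("C:/screen-scraper/report.xml"));

// Walk the children of the root element; the element names depend entirely
// on how your converter structured the XML.
Element root = doc.getRootElement();
for (Iterator it = root.elementIterator(); it.hasNext(); ) {
    Element child = (Element) it.next();
    session.log(child.getName() + " = " + child.getTextTrim());
}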

If your PDF document happens to be just an image generated from scanning a document and does not contain any selectable text, you would first need to attempt to read the document using OCR. Fortunately/unfortunately, we have not yet had to attempt this internally. From what I hear, you're doing quite well if you can get 80% accuracy from consumer-grade OCR software.

However, provided the PDF does contain selectable text, you will need to convert it. There are a number of utilities to choose from. One that we've used internally is called pdftohtml. An alternative would be to make a scrape that passes your PDF off to Adobe's online PDF-to-HTML converter.
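Invoking pdftohtml from a script looks much the same as invoking any other command-line tool. A minimal sketch, assuming pdftohtml is installed and on the PATH (the -xml switch asks for XML output rather than HTML; check the options of whichever build you install):

// Interpreted Java (screen-scraper script) -- a minimal sketch; paths are
// placeholders and pdftohtml is assumed to be on the PATH.
String pdfPath = "C:/screen-scraper/ratesheet.pdf";
String xmlPath = "C:/screen-scraper/ratesheet.xml";

Process p = Runtime.getRuntime().exec(new String[] { "pdftohtml", "-xml", pdfPath, xmlPath });
int exitCode = p.waitFor();
session.log("pdftohtml exited with code " + exitCode);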

Depending on how complex your PDF is, parsing the resulting HTML/XML can be a challenge--actually two challenges. The first challenge is to accurately parse the file to isolate the data you're after. The next challenge is to have your parsing be generic enough to parse many slightly-varied documents reliably.

Under some circumstances it might make sense to have screen-scraper read the local file and make use of extractor patterns to do the extraction. Other times you're better off writing a very complex script to do the parsing. In either case, plan on investing a fair amount of time.
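For what it's worth, the "complex script" route usually boils down to reading the converted file into a string and picking at it with regular expressions. A rough sketch, with a made-up file path and a deliberately simple pattern (pdftohtml's -xml output wraps text runs in <text> tags, but a real document will need a much more careful expression, or several passes):

// Interpreted Java (screen-scraper script) -- illustrative only.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Read the converted file into one string.
StringBuffer sb = new StringBuffer();
BufferedReader in = new BufferedReader(new FileReader("C:/screen-scraper/ratesheet.xml"));
String line;
while ((line = in.readLine()) != null) {
    sb.append(line).append("\n");
}
in.close();

// Pull out every text run and log it.
Pattern pattern = Pattern.compile("<text[^>]*>(.*?)</text>");
Matcher matcher = pattern.matcher(sb.toString());
while (matcher.find()) {
    session.log("Found: " + matcher.group(1));
}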

If you come across any utilities that will handle the parsing, please share.

-Scott

Can SS copy a PDF or Excel spreadsheet into a variable?

OK, so SS doesn't have a native ability to decode PDFs or Excel spreadsheets. I think I should still be able to get that data if SS can at least pass a PDF or Excel spreadsheet to my .NET program inside a variable. For example, let's say we have a website that is basically a PDF file -- its URL ends with .PDF. So the web page IS a PDF. Can I set up screen-scraper to just send me the whole web page, which is essentially either a PDF or an Excel spreadsheet, inside a variable? Then maybe the program could invoke a third-party .NET class library that will read the data and allow my program to write that data to a database.

The reason I am pursuing this is that, for my company, most of the web sites containing data we need to extract expose that data as PDFs or Excel spreadsheets.

Perhaps this could do what you're asking?

Hi Gary,

I'm not sure I'm familiar with everything you're trying to do, but you could look at the following method.


scrapeableFile.getContentAsString()


Otherwise, you could download the .pdf to a specific place using one of the following methods and then tell .NET that it is there.

session.downloadFile( String url, String fileName )


or

scrapeableFile.saveFileOnRequest( String pathToSaveTo )

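To tie those together, a script could download the PDF and then record where it was saved so your .NET program knows where to look. A minimal sketch; the URL, local path, and variable name are placeholders:

// Interpreted Java (screen-scraper script) -- placeholders throughout.
String pdfUrl   = "http://www.example.com/reports/latest.pdf";
String savePath = "C:/screen-scraper/latest.pdf";

// Pull the PDF straight down to the local file system...
session.downloadFile(pdfUrl, savePath);

// ...and record where it landed so a later script or external program can find it.
session.setVariable("PDF_PATH", savePath);
session.log("Saved " + pdfUrl + " to " + savePath);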

How to use the session and scrapeableFile classes?

A very basic question about the classes you referred to: I'm trying to figure out where the session and scrapeableFile classes exist, how you can make use of them, and how to link them to an external .NET 3.5 program.

Do the scrapeableFile and session classes exist within screen-scraper? How do you access these methods? I can't get to them from .NET even though I imported the Screenscraper namespace. Are these classes exposed in the Interpreted Java and JavaScript scripts that you write within the screen-scraper application? Are they exposed by virtue of the fact that you select which type of script you have? Do you make this a workable application by running these scripts within the screen-scraper program, assigning values to variables in the scripts, and then having a .NET program access those variables through the screen-scraper class library it references?

Gary, You're right. The

Gary,

You're right. The parsing of PDFs & Excel spreadsheets is not integrated into screen-scraper, unless the Excel document is actually XML, as I mentioned before. In that case you could request the document as a scrapeable file and parse it using extractor patterns, or process it using the SAXReader inside a script.

Otherwise, yes, you would need to download the file using any of the three methods scraper previously mentioned and then have your 3rd party app access the file once downloaded.
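As a minimal sketch of that flow, a script set to run before the scrapeable file is requested could, if I understand the API right, tell screen-scraper to save the raw response (the PDF bytes) to disk and note the location in a session variable. The path and variable name here are placeholders:

// Interpreted Java (screen-scraper script), run before the scrapeable file whose
// URL points directly at the PDF is requested. Path and variable name are made up.
String savePath = "C:/screen-scraper/downloaded.pdf";

// Ask screen-scraper to write the raw response to disk rather than treating it as HTML.
scrapeableFile.saveFileOnRequest(savePath);

// Record the location so a later script or an external app can find it.
session.setVariable("PDF_PATH", savePath);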

-Scott

Extract a PDF or Excel file to a .NET program

I'm ready to try this out. I'm using this web page as a test: http://www.spmloans.com/ratesheets/012110/53414352.pdf

I'm a little fuzzy about how this works. It is a web page, but I see the .pdf ending, so it appears to also be a PDF file. So I'm wondering what the source looks like. I used IE8 to do a "Page / View source" and it did not display anything. When I put the URL into the browser it needed Adobe Reader to display it. So is this web page a pure PDF or some sort of mix of HTML and PDF? Is a browser smart enough to detect that it has been given a pure PDF or Excel file without any HTML and then call up the proper program, such as Adobe Reader, to display it?

Here is what I want to do: I want to extract the pdf from the web page into a screen-scraper session variable. The .NET program will then accept the variable (as a DataSet, I think). The duty of this particular .NET program is merely to write the pdf file into a SQL Server 2008 database in varbinary format. Another program will deal with it later, and is not part of my current task. So I only need to get the pdf and write it into the database. Do I even need an extractor pattern to get the PDF or Excel file? Or would I just tell the scrapeable file to extract the ENTIRE web page? Would the returned data be a valid PDF or Excel file?

How would you extract this pdf and pass it to the .NET program?

Going a bit further... I assume that some of the web sites that I'll be scraping will have PDF files and Excel spreadsheets embedded into a web page, where you click on the page to bring it up. Would I set up screen-scraper to go to the web page that displays the PDF file or Excel spreadsheet and get it from that page? (I assume I would.)

Gary, Yes, that is a PDF

Gary,

Yes, that is a PDF document (no HTML).
Yes, when asked to handle a PDF or XML (depending on your browser settings) your browser may load the PDF reader and render the PDF inside the browser. Something similar may happen for an Excel document when using IE. Otherwise, your browser will prompt you to download the file.

It sounds like your end goal is to store either the PDF or Excel document as binary data in your database. To do this you have two choices.

1. Have screen-scraper download the document to the local file system and pass to your .NET app the full directory path to the document now stored on the local file system. You would then have your .NET program insert the document into the database and optionally delete the local file.

2. Have screen-scraper pass the URL to the document to your .NET program. Your .NET program would then download the file and insert it into the database. (There is a sketch of this approach below.)

In either case all that you would be passing to your .NET app would be a simple string (no binary data).
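To illustrate choice 2, a script run after a pattern match could store the document's URL in a session variable for your .NET program to pick up. This is only a sketch; it assumes an extractor pattern with a token named PDF_URL, which is made up, as is the site root used to resolve a relative link:

// Interpreted Java (screen-scraper script), run after a pattern match.
// PDF_URL is an assumed token name capturing the href of the PDF link.
String href = dataRecord.get("PDF_URL").toString();

// Resolve a relative href against the site root if necessary.
String fullUrl = href.startsWith("http") ? href : "http://www.spmloans.com" + href;

// Your .NET program reads this session variable and feeds it to something
// like WebClient.DownloadData.
session.setVariable("PDF_URL", fullUrl);
session.log("Passing PDF URL along: " + fullUrl);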

If we were to handle this situation internally, we would not store binary data in the database. Instead we would leave the downloaded file on the file system and simply reference its location in the database. If you're going to store the PDF as binary data in the database, then this thread might be helpful.

-Scott

.NET 3.5 code to download PDF and Excel files

I was able to write .NET 3.5 code to download PDF and Excel files from the Internet. .NET 3.5 provides a few classes (such as WebClient) for downloading resources over the network. I like the idea of saving the URLs of the pages that contain PDF and Excel files, but the customer also wants the files copied to a SQL Server 2008 database table.

At this point, my strategy is to use .NET code to check if a URL exists and also to download PDF and Excel files. I'll use screen-scraper to scan HTML files for date and time data that this project requires. I'll use it to extract the names of URLs from web pages in cases where those URLs are subject to change. And I'll use it to provide login and password information for secured sites. (I'm exploring that now.)

NOTE: *** Would you please also read and answer another question I posted above about how to use the session and scrapeableFile classes?

Here is the VB.NET 3.5 code to download PDF and Excel files from the Internet:

' Requires "Imports System.Net" (for WebClient and WebException) and
' "Imports System.IO" (for FileStream) at the top of the source file.
Private Sub Scrape_Sierra(ByVal URL_ToUse As String)

    ' Create a new WebClient instance.
    Dim myWebClient As New WebClient()

    ' Download the target Web resource into a byte array.
    Dim myDatabuffer() As Byte
    Try
        myDatabuffer = myWebClient.DownloadData(URL_ToUse)
    Catch ex As WebException
        Console.WriteLine((Date.Now.ToShortDateString + " " + Date.Now.ToShortTimeString + _
            " Download of " + URL_ToUse + " failed: " + ex.Message))
        Exit Sub
    Catch ex As Exception
        ' Any other failure: give up on this URL.
        Exit Sub
    End Try

    ' Create a file and write the byte data to it.
    Dim oFileStream As System.IO.FileStream
    oFileStream = New System.IO.FileStream("C:\screen-scraper\SierraAtlantic.pdf", System.IO.FileMode.Create)
    oFileStream.Write(myDatabuffer, 0, myDatabuffer.Length)
    oFileStream.Close()

    ' Display the downloaded data.
    'Dim download As String = Encoding.ASCII.GetString(myDatabuffer)
    'Console.WriteLine(download)
    Console.WriteLine((Date.Now.ToShortDateString + " " + Date.Now.ToShortTimeString + _
        " Download of " + URL_ToUse + " was successful."))
End Sub

Gary, Is this the question

Gary,

Is this the question you're referring to?

"Can I set up screen-scraper to just send me the whole web page, which is essentially either a PDF or an Excel spreadsheet, inside a variable?"

If so, the answer is yes and no. What would be inside the variable would not be the "whole web page". Rather, it would be just a reference to the page: a URL to the PDF or Excel file. You would retrieve the value of the variable in your .NET program and use it here:

myDatabuffer = myWebClient.DownloadData(URL_ToUse)

Did I address the right question?

-Scott