Thursday, 4 September 2014

Data Scraping from PDF and Excel

I am doing a little data scraping, There are 3 types of file from which i am scraping data.

1- HTML
2- PDF
3- Excel(xls)

For HTML i am comfortable, i am using HTML Agility for that.

For PDF and excel i need suggestions from anyone.



Concerning Excel. If you are in a MS environment you can either do Office Automation or use OLEDB. In a Java environment look at Apache POI.

EDIT: Concerning PDF in Java try Apache PDFBox . Can also work in .NET using IKVM

I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF files into Excel. We have used it with great success.

HTML Agility is a library. Its good to use. But then, why do you need separate tools for different data extraction purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all the three sources you have specified. Google it.

Source: http://stackoverflow.com/questions/3147803/data-scraping-from-pdf-and-excel

No comments:

Post a Comment