Monday 11 July 2016

Extract Data from Multiple Web Pages into Excel using import.io

In this tutorial, I will show you how to extract data from multiple pages of a website or blog and save it to an Excel spreadsheet for further processing. There are various methods and tools for this, but I found them complicated and prefer to use import.io. Import.io doesn’t require any programming skills. The platform is powerful and user-friendly, has plenty of support online and, above all, is FREE to use.

You can use either the online version of their data extraction software or the desktop application. This tutorial covers the online version.

Let us get started.

Step 1: Find a web page you want to extract data from.
You can extract data such as prices, images, authors’ names, addresses, dates, etc.

Step 2: Enter the URL for that web page into the text box here and click “Extract data”.

Import.io will transform the web page into data in seconds. Data such as authors, images, publication dates and post titles will be pulled from the web page, as shown in the image below.

Import.io extracted only the 40 posts on the first page of the blog.
If you visit bongo5.com, you will notice that the site has a total of 600+ pages at the time of writing, and each page has 40 posts or articles on it, as shown in the image below.
The next step shows you how to extract data from multiple pages of the site into Excel.

Step 3: Extract Data from Multiple Web Pages into Excel

Using the import.io online tool, you can extract data from a maximum of 20 web pages. Go to the bottom right corner of the import.io online tool page and click “Download CSV” to save the extracted data from those 20 pages into Excel.
Note: Using the import.io desktop application, you can extract an unlimited number of pages and pinpoint only the data you want to extract. Check out this tutorial on how to use the desktop application.
Once you click “Download CSV”, the following pop-up window will appear. You can specify the number of pages you want to get data from, up to a maximum of 20, then click “Go!”.
You will need to sign up for a free account to download that data as a CSV or save it as an API. If you save it as an API, you can go back to it later to extract new data when the web page is updated, without repeating the steps we have done so far. You can also use the API for integration into other platforms.
The image below shows 20 of the 800 rows of data extracted from the 20 pages of the site (20 pages × 40 posts per page).
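If you would rather check the downloaded CSV in a script before opening it in Excel, the following is a minimal Python sketch of that "further processing" step. The column names (title, author, published) and the sample rows are assumptions for illustration only; the headers in a real import.io export will depend on the page you extracted.

```python
import csv
import io

# Stand-in for the downloaded file. The column names and values are
# assumptions for illustration, not import.io's actual headers.
SAMPLE_CSV = """title,author,published
First post,Alice,2016-07-01
Second post,Bob,2016-07-02
Third post,Alice,2016-07-03
"""


def load_rows(handle):
    """Return the CSV rows as a list of dicts keyed by header name."""
    return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count how many rows share each value in the given column."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


rows = load_rows(io.StringIO(SAMPLE_CSV))
posts_per_author = count_by(rows, "author")
print(len(rows), posts_per_author)
```

With a real download you would pass an opened file instead of the in-memory sample, for example open("bongo5.csv", newline="", encoding="utf-8"), where the file name is hypothetical. For the 20-page extraction above you would expect len(rows) to be 800.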

Conclusion

The online tool doesn’t offer as much flexibility as the desktop application. For example, you cannot extract more than 20 pages and you cannot pinpoint the type of data you want to extract. For a more advanced tutorial on how to use the desktop application, you can check out this tutorial I created earlier.

Source URL : http://nocodewebscraping.com/extract-multiple-web-pages-data-into-excel/

Sunday 10 July 2016

4 Web Scraping Tools To Save You Time On Data Extraction

Whether you are working on a product website, struggling to add a live data feed to your app or merely need to pull a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time writing your own scripts.

Uipath specializes in developing various process automation software, including web scraping and screen scraping software for desktop and web. The Uipath web scraper is perfect for non-coders and easily handles most common data extraction challenges, including page navigation, digging through Flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern on all the pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted into their respective columns. As a result, you receive a neat Excel or CSV document with duplicates removed.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data but also to manipulate the interface of another app, thus establishing data transfers between the two. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

Import.io offers you a free desktop app to help you scrape all the data you need from an unlimited number of web pages. The service treats each page as a potential data source to generate an API from. If the page you’ve submitted has been previously processed, you can access its API and get some of the data. Otherwise, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction and it’s typically processed within 24 hours. All the data is private and you can schedule automatic refreshes at any interval you choose.

Pros: The service is easy to use with no tech skills needed. It can handle pages with data that need a login/password, plus it’s free. The design is minimalistic and effective, and navigation is simple.

Cons: Import.io has a hard time navigating through combinations of JavaScript/POST and cannot navigate from one page to another (e.g. click next, second page etc.). Sometimes it takes over 24 hours to receive the report. Besides, it’s a browser-only app, not compatible with other applications.

Kimono is a popular web scraper among app developers who prefer to power up their products with live data and no additional code. It saves you tons of time when you need to fill up your app with mashup data. Install the Kimono browser bookmarklet, highlight the page elements you need and provide some positive/negative examples to train the tool. After labeling all the data, you can download it as CSV or JSON, or expose it as a web endpoint. The APIs created for your pages are stored in the cloud and you can run them on a schedule. So far, Kimono is free to use, with pro and enterprise solutions to be launched soon.

Pros: The tool is pretty fast and works great for scraping newsfeeds and prices. The data is fairly accurate.

Cons: No page navigation is available, and you need to spend quite a lot of time training Kimono before it starts to pull out multi-item data accurately enough. In general, I’d say Kimono is more of an app mash-up creator than a full-scale web scraper.

Screen Scraper is pretty neat and tackles a lot of difficult tasks, including navigation and precise data extraction. However, it requires a bit of programming/tokenization skill if you’d like it to run smoothly. Launch the software, add a proxy, start recording the list of your actions and create extraction patterns (some coding required). It works great with HTML and JavaScript; however, you should test it with Citrix and other platforms. Basically, Screen Scraper helps you write simple web scraping scripts and lets you download the extracted data in txt/csv/excel format.

Pros: When set up correctly, there are no data extraction tasks Screen Scraper fails to handle.
Cons: The tool is pricey and you’ll have to go through documentation and have basic coding skills to use it.

Source URL :  http://tech.co/4-web-scraping-tools-save-time-data-extraction-2015-03

Thursday 7 July 2016

Scraping the Royal Society membership list

To a data scientist any data is fair game. Through my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007, which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file which I had made from the original PDF using an online converting service. Looking back at the code, it is fiendishly complicated and cluttered by the boilerplate code required to build a GUI. ScraperWiki includes a pdftoxml function, so I thought I’d see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below. Each member (or Fellow) record spans two lines, with the member name in the leftmost column on the first line and, on the second line, their dates of birth and death, the class of their Fellowship and their election date.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an XML file in which each piece of text is located on the page using spatial coordinates. An individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward: you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list A-J with that of K-Z, but this can easily be resolved by accepting a small range of positions for each column.
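As an illustration of selecting columns by a range of “left” values, here is a minimal Python sketch using the standard library's ElementTree. It is not the scraper's actual code: the page wrapper element, the second and third sample lines and the column ranges are all assumptions made up for the example; only the first text line comes from the document.

```python
import xml.etree.ElementTree as ET

# The first <text> line is the one quoted above; the <page> wrapper and the
# other two lines are hypothetical, added only to exercise the code.
SAMPLE_XML = """<page>
<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>
<text top="260" left="137" width="180" height="14" font="2">Example, Name </text>
<text top="260" left="420" width="90" height="14" font="2">Fellow </text>
</page>"""

# Accept a small range of "left" positions per column (ranges are assumed).
COLUMNS = {"name": (130, 140), "class": (415, 425)}


def column_of(left):
    """Return the column whose range contains this left position, or None."""
    for name, (low, high) in COLUMNS.items():
        if low <= left <= high:
            return name
    return None


def group_by_column(xml_text):
    """Collect the text of every <text> element under its column name."""
    grouped = {name: [] for name in COLUMNS}
    for element in ET.fromstring(xml_text).iter("text"):
        column = column_of(int(element.get("left")))
        if column is not None:
            grouped[column].append("".join(element.itertext()).strip())
    return grouped


grouped = group_by_column(SAMPLE_XML)
print(grouped)
```

Using a range rather than an exact value is what lets the same code cope with the slight shift in column positions between the two halves of the document.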

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197; a bit of Googling reveals that the real date is 16th March 1978. Another Fellow is classed as a “Felllow”, and whilst most of the dates of birth and death are separated by a dash, some are separated by an en dash, which as far as the code is concerned is something completely different, and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document I was obliged to code around these quirks.
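Coding around quirks like these might look something like the following Python sketch. It is an assumed approach rather than the original scraper's code: the function names and the sample life-span strings are hypothetical, and the only misspelling it knows about is the “Felllow” mentioned above.

```python
import re

# Treat a hyphen and an en dash (U+2013) as the same separator,
# with optional whitespace on either side.
SEPARATOR = re.compile(r"\s*[-\u2013]\s*")

# Known misspellings mapped to the intended value.
CLASS_FIXES = {"Felllow": "Fellow"}


def split_life_span(text):
    """Split 'birth - death' into two strings, whichever dash is used."""
    parts = SEPARATOR.split(text.strip(), maxsplit=1)
    birth = parts[0]
    death = parts[1] if len(parts) > 1 else ""
    return birth, death


def fix_class(value):
    """Return the corrected class name, or the stripped value if it is fine."""
    cleaned = value.strip()
    return CLASS_FIXES.get(cleaned, cleaned)


print(split_life_span("1757 \u2013 1829"))
print(fix_class("Felllow"))
```

Keeping the fixes in a small lookup table means each newly discovered quirk is recorded in the code rather than patched by hand in a converted text file.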

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered as 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately, so someone may have a birth date described as “circa 1782” or “c 1782”; even more vaguely, they may be described as having “flourished 1663-1778” or “fl. 1663-1778”. Python’s default datetime module does not capture this subtlety, and if it did, the database used to store dates would need to support it too to be useful. I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than as text strings, means we can query the database using date-based queries.
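One way to keep the original text while still pulling out whatever can be parsed is sketched below in Python. This is an assumption about how it could be done, not the scraper's actual code: the function name, the returned field names and the regular expression are hypothetical, and it only recognises the four spellings quoted above.

```python
import re

# An optional qualifier followed by a four-digit year or a year range.
PATTERN = re.compile(
    r"^(?P<qualifier>circa|c\.?|flourished|fl\.?)?\s*"
    r"(?P<start>\d{4})(?:\s*[-\u2013]\s*(?P<end>\d{4}))?$"
)

# Map each spelling to one canonical label.
QUALIFIERS = {
    "circa": "circa",
    "c": "circa",
    "flourished": "flourished",
    "fl": "flourished",
}


def parse_life_span(text):
    """Return the original text plus any qualifier and years found in it."""
    match = PATTERN.match(text.strip())
    if match is None:
        # Nothing recognisable: keep the text so it can be analysed later.
        return {"original": text, "qualifier": None, "start": None, "end": None}
    raw = (match.group("qualifier") or "").rstrip(".")
    return {
        "original": text,
        "qualifier": QUALIFIERS.get(raw),
        "start": int(match.group("start")),
        "end": int(match.group("end")) if match.group("end") else None,
    }


print(parse_life_span("fl. 1663-1778"))
```

The "original" field mirrors the choice described above: the untouched text is always stored, so nothing is lost when a value is too vague to become a proper date.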

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you that the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 “flourished” and 450 with birth dates “circa” some year.

I can count the number of Fellows in each class:

select class, count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society:

select * from `RoyalSocietyFellows` where StartPresident is not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.
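To try queries of this kind locally, the following Python sketch builds a tiny in-memory SQLite table and runs them with the standard sqlite3 module. The table name and the class and StartPresident columns come from the queries above; the other column names (Name, BirthDate, LifeSpan) and all three rows are made up for illustration and are not taken from the real dataset.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "create table RoyalSocietyFellows "
    "(Name text, Class text, BirthDate text, LifeSpan text, StartPresident text)"
)

# Hypothetical rows: dates are stored as ISO strings so they sort and compare
# correctly, and the original life span is kept alongside as plain text.
sample_rows = [
    ("Example, Anne", "Fellow", "1700-05-01", "01 May 1700 - 1760", None),
    ("Example, Bruno", "Fellow", "1650-02-03", "03 Feb 1650 - 1720", "1703-01-01"),
    ("Example, Clara", "Royal", None, "fl. 1663-1678", None),
]
connection.executemany(
    "insert into RoyalSocietyFellows values (?, ?, ?, ?, ?)", sample_rows
)

# Number of Fellows in each class.
class_counts = dict(
    connection.execute(
        "select Class, count(*) from RoyalSocietyFellows group by Class"
    ).fetchall()
)

# Presidents, most recent first.
presidents = [
    name
    for (name,) in connection.execute(
        "select Name from RoyalSocietyFellows "
        "where StartPresident is not null order by StartPresident desc"
    )
]

# A date-based query: rows with no birth date are not counted.
born_before_1700 = connection.execute(
    "select count(*) from RoyalSocietyFellows where BirthDate < '1700-01-01'"
).fetchone()[0]

print(class_counts, presidents, born_before_1700)
```

The last query shows why storing dates in a sortable form matters: the comparison works on the date column, while the vague “fl.” entry survives untouched in the text column.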

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source URL : http://yellowpagesdatascraping.blogspot.in/2015/06/scraping-royal-society-membership-list.html