Friday, 22 May 2015

Web scraping using Python without using large frameworks like Scrapy

If you need publicly available data from the Internet, before creating a web scraper it is best to check whether this data is already available from public data sources or APIs. Check the site’s FAQ section, or search for its API endpoints and public data.

Even if API endpoints are available, you still have to write some code to fetch and structure the data according to your needs.

Scrapy is a well-established scraping framework, but it is also a very heavy one. For smaller jobs it may be overkill, and for extremely large jobs it can be very slow.

So if you would like to roll up your sleeves and build your own scraper, continue reading.

Here are the basic steps performed by most web spiders (a minimal loop sketch follows this list):

1) Start with a URL and use an HTTP GET (or POST) request to access it
2) Fetch all the contents of the page and parse the data
3) Store the data in a database or pass it on to a data warehouse
4) Enqueue all the URLs found on the page
5) Take the next URL from the queue and repeat from step 1
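
Put together, a very small sketch of this loop might look like the following. It uses the requests library (introduced below) and a throwaway regex for link extraction; the starting URL, the 50-page cap and the store() stub are purely illustrative assumptions, not a production crawler.

import re
import requests

start_url = "http://example.com"   # illustrative starting point
queue = [start_url]
seen = set(queue)

def store(url, html):
    # Placeholder for the data pipeline: write to a database, CSV file, etc.
    print("stored %d bytes from %s" % (len(html), url))

while queue and len(seen) < 50:                 # cap the crawl for this sketch
    url = queue.pop(0)                          # 1) take a URL from the queue
    response = requests.get(url, timeout=10)    #    and issue an HTTP GET request
    html = response.text                        # 2) fetch the page contents
    store(url, html)                            # 3) store / hand off the data
    for link in re.findall(r'href="(http[^"]+)"', html):
        if link not in seen:                    # 4) enqueue new URLs found on the page
            seen.add(link)
            queue.append(link)                  # 5) the loop repeats from step 1
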
Here are the 3 major modules in every web crawler:
1) Request/Response handler.
2) Data parsing/data cleansing/data munging process.
3) Data serialization/data pipelines.

Let’s look at each of these modules and see what they do and how to use them.

Request/Response handler

The request/response handler is the component that makes HTTP requests to a URL or a group of URLs, fetches the response objects as HTML content, and passes this data on to the next module. If you use Python, the following libraries are most commonly used for this request/response and URL-opening work:

1) urllib (20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) – a basic Python library that still offers a high-level interface for fetching data across the World Wide Web.

2) urllib2 (20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – an extensible companion to urllib that handles basic HTTP requests, digest authentication, redirections, cookies and more.

3) requests (Requests: HTTP for Humans) – a much more advanced request library, built on top of the basic request-handling libraries.
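
As a minimal sketch, fetching a single page with requests looks like the snippet below; the URL is just a placeholder, and the commented-out lines show the roughly equivalent urllib2 call from the Python 2 standard library.

import requests

response = requests.get("http://example.com", timeout=10)
print(response.status_code)   # HTTP status code, e.g. 200
html = response.text          # raw HTML content handed to the parsing module

# Roughly the same fetch with urllib2 (Python 2 standard library):
# import urllib2
# html = urllib2.urlopen("http://example.com", timeout=10).read()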

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned, and unstructured data is transformed into structured data. Usually a set of regular expressions (regexes) that perform pattern matching and text-processing tasks on the HTML data is used for this processing.

In addition to regexes, basic string manipulation and search methods are also used for this cleaning and transformation. A thorough knowledge of regular expressions helps you design the regex patterns.
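
For example, here is a small sketch that pulls links and their text out of raw HTML with one regex and basic string methods; the pattern and the html string are purely illustrative, and real-world HTML is usually messier than this.

import re

html = '<a href="/page1"> First page </a> <a href="/page2"> Second page </a>'

# Illustrative pattern: capture the href value and the link text of simple anchor tags.
link_pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

for url, text in link_pattern.findall(html):
    print(url + " -> " + text.strip())   # basic string cleanup on the extracted text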

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module is used to serialize the data according to the data models you require. This is the final module; it outputs data in a standard format that can be stored in databases or JSON/CSV files, or passed on to a data warehouse for storage. These tasks are usually performed by the libraries listed below:

1) pickle (pickle – Python object serialization) – this module implements a fundamental but powerful algorithm for serializing and de-serializing a Python object structure.

2) JSON (JSON encoder and decoder)

3) CSV (https://docs.python.org/2/library/csv.html)

4) Basic database interface libraries like pymongo (Tutorial – PyMongo), MySQLdb and sqlite3 (sqlite3 – DB-API interface for SQLite databases)

And many more such libraries based on the format and database/data storage.
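
As a small sketch, the snippet below serializes a list of scraped records to JSON, CSV and pickle files; the field names and output paths are illustrative assumptions.

import csv
import json
import pickle

# Illustrative records, as they might come out of the parsing module.
records = [
    {"url": "http://example.com/page1", "title": "First page"},
    {"url": "http://example.com/page2", "title": "Second page"},
]

# JSON output.
with open("output.json", "w") as f:
    json.dump(records, f, indent=2)

# CSV output with the same fields (on Python 2, open the file with mode "wb").
with open("output.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(records)

# Binary pickle of the Python objects themselves.
with open("output.pkl", "wb") as f:
    pickle.dump(records, f)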

Basic spider rules

The main rules to follow while building a spider are to be nice to the sites you are scraping and to respect the spider policies outlined in each site’s robots.txt.

Limit the number of requests per second and build enough delay into your spiders so that you don’t adversely affect the site.
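
Here is a hedged sketch of both ideas, using the standard library’s robot-parser module to honour robots.txt and a fixed delay between requests; the one-second delay and the URLs are illustrative choices.

import time
import requests

try:
    from urllib.robotparser import RobotFileParser   # Python 3
except ImportError:
    from robotparser import RobotFileParser          # Python 2

robots = RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()

urls = ["http://example.com/page1", "http://example.com/page2"]

for url in urls:
    if not robots.can_fetch("*", url):   # respect the site's spider policy
        continue
    requests.get(url, timeout=10)
    time.sleep(1)                        # crude rate limit: at most one request per second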

It just makes sense to be nice.

We will cover more techniques in future articles.

Source: http://learn.scrapehero.com/webscraping-using-python-without-using-large-frameworks-like-scrapy/
