Friday 3 July 2015

ECJ clarifies Database Directive scope in screen scraping case

EC on the legal protection of databases (Database Directive) in a case concerning the extraction of data from a third party’s website by means of automated systems or software for commercial purposes (so called 'screen scraping').

Flight data extracted

The case, Ryanair Ltd vs. PR Aviation BV, C-30/14, is of interest to a range of companies such as price comparison websites. It stemmed from  Dutch company PR Aviation operation of a website where consumers can search through flight data of low-cost airlines  (including Ryanair), compare prices and, on payment of a commission, book a flight. The relevant flight data is extracted from third-parties’ websites by means of ‘screen scraping’ practices.

Ryanair claimed that PR Aviation’s activity:

• amounted to infringement of copyright (relating to the structure and architecture of the database) and of the so-called sui generis database right (i.e. the right granted to the ‘maker’ of the database where certain investments have been made to obtain, verify, or present the contents of a database) under the Netherlands law implementing the Database Directive;

• constituted breach of contract. In this respect, Ryanair claimed that a contract existed with PR Aviation for the use of its website. Access to the latter requires acceptance, by clicking a box, of the airline’s general terms and conditions which, amongst others, prohibit unauthorized ‘screen scraping’ practices for commercial purposes.

Ryanair asked Dutch courts to prohibit the infringement and order damages. In recent years the company has been engaged in several legal cases against web scrapers across Europe.

The Local Court, Utrecht, and the Court of Appeals of Amsterdam dismissed Ryanair’s claims on different grounds. The Court of Appeals, in particular, cited PR Aviation’s screen scraping of Ryanair’s website as amounting to a “normal use” of said website within the meaning of the lawful user exceptions under Sections 6 and 8 of the Database Directive, which cannot be derogated by contract (Section 15).

Ryanair appealed

Ryanair appealed the decision before the Netherlands Supreme Court (Hoge Raad der Nederlanden), which decided to refer the following question to the ECJ for a preliminary ruling: “Does the application of [Directive 96/9] also extend to online databases which are not protected by copyright on the basis of Chapter II of said directive or by a sui generis right on the basis of Chapter III, in the sense that the freedom to use such databases through the (whether or not analogous) application of Article[s] 6(1) and 8, in conjunction with Article 15 [of Directive 96/9] may not be limited contractually?.”

The ECJ’s ruling

The ECJ (without the need of the opinion of the advocate general) ruled that the Database Directive is not applicable to databases which are not protected either by copyright or by the sui generis database right. Therefore, exceptions to restricted acts set forth by Sections 6 and 8 of the Directive do not prevent the database owner from establishing contractual limitations on its use by third parties. In other words, restrictions to the freedom to contract set forth by the Database Directive do not apply in cases of unprotected databases. Whether Ryanair’s website may be entitled to copyright or sui generis database right protection needs to be determined by the competent national court.

The ECJ’s decision is not particularly striking from a legal standpoint. Yet, it could have a significant impact on the business model of price comparison websites, aggregators, and similar businesses. Owners of databases that could not rely on intellectual property protection may contractually prevent extraction and use (“scraping”) of content from their online databases. Thus, unprotected databases could receive greater protection than the one granted by IP law.

Antitrust implications

However, the lawfulness of contractual restrictions prohibiting access and reuse of data through screen scraping practices should be assessed under an antitrust perspective. In this respect, in 2013 the Court of Milan ruled that Ryanair’s refusal to grant access to its database to the online travel agency Viaggiare S.r.l. amounted to an abuse of dominant position in the downstream market of information and intermediation on flights (decision of June 4, 2013 Viaggiare S.r.l. vs Ryanair Ltd). Indeed, a balance should be struck between the need to compensate the efforts and investments made by the creator of the database with the interest of third parties to be granted with access to information (especially in those cases where the latter are not entitled to copyright protection).

Additionally, web scraping triggers other issues which have not been considered by the ECJ’s ruling. These include, but are not limited to trademark law (i.e., whether the use of a company’s names/logos by the web scraper without consent may amount to trademark infringement), data protection (e.g., in case the scraping involves personal data), or unfair competition.

Source: http://www.globallegalpost.com/blogs/global-view/ecj-clarifies-database-directive-scope-in-screen-scraping-case-128701/

Friday 26 June 2015

Data Scraping - Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping which is a method that is being used by computer programs in extracting data from an output that comes from another program. To put it simply, this is a process which involves the automatic sorting of information that can be found on different resources including the internet which is inside an html file, PDF or any other documents. In addition to that, there is the collection of pertinent information. These pieces of information will be contained into the databases or spreadsheets so that the users can retrieve them later.

Most of the websites today have text that can be accessed and written easily in the source code. However, there are now other businesses nowadays that choose to make use of Adobe PDF files or Portable Document Format. This is a type of file that can be viewed by simply using the free software known as the Adobe Acrobat. Almost any operating system supports the said software. There are many advantages when you choose to utilize PDF files. Among them is that the document that you have looks exactly the same even if you put it in another computer so that you can view it. Therefore, this makes it ideal for business documents or even specification sheets. Of course there are disadvantages as well. One of which is that the text that is contained in the file is converted into an image. In this case, it is often that you may have problems with this when it comes to the copying and pasting.

This is why there are some that start scraping information from PDF. This is often called PDF scraping in which this is the process that is just like data scraping only that you will be getting information that is contained in your PDF files. In order for you to begin scraping information from PDF, you must choose and exploit a tool that is specifically designed for this process. However, you will find that it is not easy to locate the right tool that will enable you to perform PDF scraping effectively. This is because most of the tools today have problems in obtaining exactly the same data that you want without personalizing them.

Nevertheless, if you search well enough, you will be able to encounter the program that you are looking for. There is no need for you to have programming language knowledge in order for you to use them. You can easily specify your own preferences and the software will do the rest of the work for you. There are also companies out there that you can contact and they will perform the task since they have the right tools that they can use. If you choose to do things manually, you will find that this is indeed tedious and complicated whereas if you compare this to having professionals do the job for you, they will be able to finish it in no time at all. Scraping information from PDF is a process where you collect the information that can be found on the internet and this does not infringe copyright laws.

Source: http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Saturday 20 June 2015

Web Scraping: working with APIs

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.

Download the .Rpres file to use in Rstudio here

A regular R script with code-snippets only can be accessed here

Slides from the first session here

Slides from the second session here

Slides from the third session here

Source: http://www.r-bloggers.com/web-scraping-working-with-apis/

Tuesday 9 June 2015

Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages this is to cook up some regular expressions that match the pieces you want (e.g., URL's and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Wednesday 3 June 2015

Scraping the Royal Society membership list

To a data scientist any data is fair game, from my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007 which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file which I had made from the original PDF using an online converting service, looking back at the code it is fiendishly complicated and cluttered by boilerplate code required to build a GUI. ScraperWiki includes a pdftoxml function so I thought I’d see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below, each member (or Fellow) record spans two lines with the member name in the left most column on the first line and information on their birth date and the day they died, the class of their Fellowship and their election date on the second line.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an xml file, wherein each piece of text is located on the page using spatial coordinates, an individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list A-J with that of K-Z, but this can easily be resolved by accepting a small range of positions for each column.

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another fellow is classed as a “Felllow”, and whilst most of the dates of birth and death are separated by a dash some are separated by an en dash which as far as the code is concerned is something completely different and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document I was obliged to code around these quirks.

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately so someone may have a birth date described as “circa 1782? or “c 1782?, even more vaguely they may be described as having “flourished 1663-1778? or “fl. 1663-1778?. Python’s default datetime module does not capture this subtlety and if it did the database used to store dates would need to support it too to be useful – I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than text strings means we can query the database using date based queries.

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you the the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 “flourished”, 450 have birth dates “circa” some year.

I can count the number of classes of fellows:

select distinct class,count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source: https://scraperwiki.wordpress.com/2012/12/28/scraping-the-royal-society-membership-list/

Friday 29 May 2015

Data Scraping Services - Login to Website Programmatically using C# for Web Scraping

In many scenario the data is available after login that you want to scrape. So to reach at the page where data is located you need to implement code in web scraper  that automatically takes usename/email and password to login into website, once login is done you can do crawling and parsing as required.

Many third party web scraping application provides functionality where you can locate login url and set login parameters and that login task will be called when scraper start and do web scraping.

Below is C# example of programmatically  login to demo login page

http://demo.webdata-scraping.com/login.php

Below is HTML code of Login form:

<form class="form-signin" id="login" method="post" role="form"> <h3 class="form-signin-heading">Please sign in</h3> <a href="#" id="flipToRecover" class="flipLink"> <div id="triangle-topright"></div> </a> <input type="email" class="form-control" name="loginEmail" id="loginEmail" placeholder="Email address" required autofocus> <input type="password" class="form-control" name="loginPass" id="loginPass" placeholder="Password" required> <button class="btn btn-lg btn-primary btn-block" name="login_submit" id="login_submit" type="submit">Sign in</button> </form>

<form class="form-signin" id="login" method="post" role="form">

            <h3 class="form-signin-heading">Please sign in</h3>

            <a href="#" id="flipToRecover" class="flipLink">

              <div id="triangle-topright"></div>

            </a>

            <input type="email" class="form-control" name="loginEmail" id="loginEmail" placeholder="Email address" required autofocus>

            <input type="password" class="form-control" name="loginPass" id="loginPass" placeholder="Password" required>

            <button class="btn btn-lg btn-primary btn-block" name="login_submit" id="login_submit" type="submit">Sign in</button>

</form>

In this code you can notice there is ID for email input box that is id=”loginEmail”  and password input box that is id=”loginPass”

so by taking this ID we will use below two method of webBrowser control and fill the value of each input box using following code

webBrowser1.Document.GetElementById("loginEmail").InnerText =textBox1.Text.ToString(); webBrowser1.Document.GetElementById("loginPass").InnerText = textBox2.Text.ToString();

webBrowser1.Document.GetElementById("loginEmail").InnerText =textBox1.Text.ToString();

webBrowser1.Document.GetElementById("loginPass").InnerText = textBox2.Text.ToString();

After the value filled to Email and Password input box we will just call click event of submit button which is named as Sign In

webBrowser1.Document.GetElementById("login_submit").InvokeMember("click");

webBrowser1.Document.GetElementById("login_submit").InvokeMember("click");

So this is very basic example how you can login to website programatically when you need to access data that is available after login to website.  This is very simple way in which you can work with Web Browser control but there are some other way as well using which you can do same thing.

Source: http://webdata-scraping.com/login-website-programmatically-using-c-web-scraping/

Tuesday 26 May 2015

Data Extraction Services

Are you finding it tedious to perform your routine tasks as well as finding time to research for some information? Don't worry; all you have to do is outsource data extraction requirements to reliable service providers such as Hi-Tech BPO Services.

We can assist you in finding, extracting, gathering, processing and validating all the required data through our effective data extraction services. We can extract data from any given source such as websites, databases, printed documents, directories, etc.

With a whole plethora of data extraction services solutions; we are definitely a one stop solution to all your data extraction services requirements.

For utilizing our data extraction services, all you have to do is outsource data extraction requirements to us, and we will create effective strategies and extract the required data from all preferred sources. Then we will arrange all the extracted data in a systematic order.

Types of data extraction services provided by our data extraction India unit:

The data extraction India unit of Hi-Tech BPO Services can attend to all types of outsource data extraction requirements. Following are just some of the data extraction services we have delivered:

•    Data extraction from websites
•    Data extraction from databases
•    Extraction of data from directories
•    Extracting data from books
•    Data extraction from forms
•    Extracting data from printed materials

Features of Our Data Extraction Services:

•    Reliable collection of resources for data extraction
•    Extensive range of data extraction services
•    Data can be extracted from any available source be it a digital source or a hard copy source
•    Proper researching, extraction, gathering, processing and validation of data
•    Reasonably priced data extraction services
•    Quality and confidentiality ensured through various strict measures

Our data extraction India unit has the competency to handle any of your data extraction services requirements. Just provide us with your specific requirements and we will extract data accordingly from your preferred resources, if particularly specified. Otherwise we will completely rely on our collection of resources for extracting data for you.

Source: http://www.hitechbposervices.com/data-extraction.php

Monday 25 May 2015

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (xml, html, csv). Over the last three years Ive built 20 or so c# console applications that go out, download the data and re-format it into a database. But Im curious what other people are doing for this type of task. Are people building one tool that has a lot of variables and inputs or are people designing 20+ programs to scrape and parse this data. Everything is hard-coded into each console and run through Windows Task Manager.

Added a couple additional thoughts/details:

    Of the 30 sources, they all have unique properties, all are uploaded into individual MySQL tables and all have varying frequencies. For example, one data source is hit once a minute, another on 5 minute intervals. Majority are once an hour and once a day.

At current I download the formats (xml, csv, html), parse them into a formatted csv and put them into staging folders. Within that folder, I run an application that reads a config file specific to the folder. When a new csv is added to the folder, the application then uploads the data into the specific MySQL tables designated in the config file.

Im wondering if it is worth re-building all this into a larger complex program that is more capable of dynamically adding content+scrapes and adjusting to format changes.

Looking for outside thoughts.

5 Answers

What you are working on is basically ETL. So at a high level you need an export component (get stuff) a transform component (map to known format) and a load (take known format and put stuff somewhere). If you are comfortable being tied to a RDBMS you could use something like SQL Server SSIS packages. What I would do is create a host application that managed common aspects of the overall process (errors, and pipeline processing). Then make the specifics of the E, T, and L pluggable. A low ceremony way to get this would be to host the powershell runtime and create each seesion with common context objects that the scripts will use to communicate. You get a built in pipe and filter model for scripts and easy, safe extensibility. This design has worked extremely for my team with a similar situation.

Resist the temptation to rewrite.

However, for new code, you could plan for what you know has already happened. Write a retrieval mechanism that you can reuse through configuration. Write a translation mechanism that you can reuse (maybe in a library that you can call with very little code). Write a saving mechanism that can be called or configured.

At this point, you've written #21(+). Now, the following ones can be handled with a tiny bit of code and configuration. Yay!

(You may want to implement this in a service that handles multiple conversions, but weight the benefits of it versus the ability to separate errors in one module from the rest.)

1

It depends - if you need the scrapers to feed into a single application/database and have a uniform data format, it makes sense to have them all in a single program (possibly inheriting from a common base scraper).

If not and they are completely unrelated to each other, might as well keep them separate so changes in one have no effect on another.

Update, following edits to question:

Don't change things just for the sake of change. You have something that works, don't mess with it too much.

Since your data sources and data sinks are all separate from each other, combining them into one application will simply create a very complicated application that will be very difficult to change when needed.

Since the scrapers are separate, keep the separation as you have it now.

As sbrenton said, this most falls in with ETL. You should check out Talend Open Studio. It specializes in handling data flows like I imagine yours are as well as other things like duplicate removal, normalization of fields; tens/hundreds of drag and drop ETL components, you can also write custom code as Talend is a code generator as well, either Java or Perl are options. You can also use Talend to execute system commands. I use it for my ETL work, although not in production, in production we will use SSIS, mostly due to lots of other Microsoft products in house.

You may want to use some good scheduling library, like Quartz.NET.

In a few words, here's what you can expect:

    Your tasks are represented by classes and not processes

    You can set and forget tasks and scale across multiple servers

    You have an out-of-the-box system to actually take care of what is needed to be run when, what failed and needs to be re-run, etc. etc.

Source: http://programmers.stackexchange.com/questions/118077/data-scraping-one-application-or-multiple/118098#118098


Friday 22 May 2015

Web scraping using Python without using large frameworks like Scrapy

scrapy-big-logoIf you need publicly available data from scraping the Internet, before creating a webscraper, it is best to check if this data is already available from public data sources or APIs. Check the site’s FAQ section or Google for their API endpoints and public data.

Even if their API endpoints are available you have to create some parser for fetching and structuring the data according to your needs.

Scrapy is a well established framework for scraping, but it is also a very heavy framework. For smaller jobs, it may be overkill and for extremely large jobs it is very slow.

So if you would like to roll up your sleeves and build your own scraper, continue reading.

Here are some basic steps performed by most webspiders:

1) Start with a URL and use a HTTP GET or PUT request to access the URL
2) Fetch all the contents in it and parse the data
3) Store the data in any database or put it into any data warehouse
4) Enqueue all the URLs in a page
5) Use the URLs in queue and repeat from process 1
Here are the 3 major modules in every web crawler:
1) Request/Response handler.
2) Data parsing/data cleansing/data munging process.
3) Data serialization/data pipelines.

Lets look at each of these modules and see what they do and how to use them.

Request/Response handler

Request/response handlers are managers who make http requests to a url or a group of urls, and fetch the response objects as html contents and pass this data to the next module. If you use Python for performing request/response url-opening process libraries such as the following are most commonly used

1) urllib(20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) -Basic python library yet high-level interface for fetching data across the World Wide Web.

2) urllib2(20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – extensible library of urllib, which would handle basic http requests, digest authentication, redirections, cookies and more.

3) requests(Requests: HTTP for Humans) – Much advanced request library

which is built on top of basic request handling libraries.

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned. Unstructured data is transformed into structured during this processing. Usually  a set of Regular Expressions (regexes) which perform pattern matching and text processing tasks on the html data are used for this processing.

In addition to regexes, basic string manipulation and search methods are also used to perform this cleaning and transformation. You must have a thorough knowledge of regular expressions and so that you could design the regex patterns.

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module will be used to serialize the data according to the data models that you require. This is the final module that will output data in a standard format that can be stored in databases, JSON/CSV files or passed to any data warehouses for storage. These tasks are usually performed by libraries listed below

1) pickle (pickle – Python object serialization) –  This module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure

2) JSON (JSON encoder and decoder)

3) CSV (https://docs.python.org/2/library/csv.html)

4) Basic database interface libraries like pymongo (Tutorial – PyMongo),mysqldb ( on python.org), sqlite3(sqlite3 – DB-API interface for SQLite databases)

And many more such libraries based on the format and database/data storage.

Basic spider rules

The rules to follow while building a spider are to be nice to the sites you are scraping and follow the rules in the site’s spider policies outlined in the site’s robots.txt.

Limit the  number of requests in a second and build enough delays in the spiders so that  you don’t adversely affect the site.

It just makes sense to be nice.

We will cover more techniques in future articles

Source: http://learn.scrapehero.com/webscraping-using-python-without-using-large-frameworks-like-scrapy/

Wednesday 20 May 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Source: https://www.promptcloud.com/social-media-networking-sites-crawling-service/

Sunday 17 May 2015

Scraping Twitter Lists To Boost Social Outreach (+ Free Tool!)

I published a post a few weeks ago describing how to build your own twitter custom audience list, outlining a variety of techniques to build up your list.

This post outlines another method (hat tip to Ade Lewis for the idea) which requires you to scrape Twitter directly.

If you want to skip all the explanations and just want to download the Twitter List Scraper tool, here you go…

Download the Twitter Scraper Tool for Windows or Mac (completely free)

Disclaimer: Scraping Twitter is against their Terms of Service, so if you decide to do this you do it at your own risk.

Some Benchmarks

Building custom audiences on Twitter requires you to identify Twitter usernames that might be interested in your service or product.

In my previous posts, one of the methods I employed was to pull a competitor’s link profile and scrape social accounts from the linking domains.

Once you upload a custom list, Twitter goes through a process of ‘matching’ against profiles in their system, to make sure the user exists and hasn’t opted out of tailored ads.

As our data was scraped from a list of unqualified websites, the data matching wasn’t likely to be perfect.

Experiments

Since I published that post, I have been experimenting a fair bit with list building, and have built up around 10 custom audience lists. I‘ve uploaded a total of 48,857 Twitter usernames using this method, but only 29,260 were matched by Twitter (just less than 60% match rate).

From some other experiments where I have had better control over the input data, this match rate was between 70-80%.

Since we’ll be scraping Twitter directly, I expect our match rate to be much higher – 90%+

Finding Relevant Twitter Lists

So, we’re going to scrape Twitter, and the first step is to find Twitter lists that will contain users potentially interested in what we have to offer.

As an example, we’ll pretend we’re marketing a music website, and we’ve produced a survey we want to collect responses for.

An advanced Google query can give us lists of music bloggers: site:twitter.com inurl:lists inurl:members inurl:music “music blogger”

Source: http://urlprofiler.com/blog/scraping-twitter/

Friday 8 May 2015

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it?  Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access.  Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format.  Once the data is in a structured format, the user can extract or manipulate the data with ease.  Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different.  Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today.  A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer.  In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics.  Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them?  Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads?  Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them.  The list goes on and on.

I should mention that web scraping is not always a bad thing.  Some websites allow web scraping, but many do not.  It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information.  Most websites have a copyright disclosure statement that legally protects their website information.  It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically.  In fact, the F5.com website presents the following copyright disclosure:  "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws."  It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware!  There have been many court cases where web scraping turned into felony offenses.  One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles.  This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted.  Another case involves a real estate company who illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market.  Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge.  The software company had to pay a $20 million fine and the guilty scraper is serving three years probation.  Finally, there's the case of a medical website that hosted sensitive patient information.  In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website.  The website was scraped by a media-rese
arch firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world.  As you can see, it's increasingly important to guard against this activity.  After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today.  The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family.  As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript.  It counts the client's page consumption speed and declares a client as a bot if a certain number of page changes happen within a given time interval.  The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript.  It counts the number of sessions opened during a given time interval and declares the client as a scraper if the maximum threshold is exceeded.  The session transaction anomaly detects valid sessions that visit the site much more than other clients.  This defense is looking at a bigger picture and it blocks sessions that exceed a calculated baseline number that is derived from a current session table.  The IP address whitelist allows known friendly bots and crawlers (i.e. Google, Bing, Yahoo, Ask, etc), and this list can be populated as needed to fit the needs of your organization.

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities.  But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc.  You might be wondering how it does all that stuff as well.  Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Source: https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity

Thursday 30 April 2015

Lawyers & Attorneys Website Data Scraping Services

There are so many instances where one end’s up needing information from lawyers or bar associations. However, if you approach them directly or look for other ways to get information it might either be difficult or you might not get the information you are looking for. Thus, the best way to go about the scraping lawyer data.

Scraping lawyer data allow you to get information from various attorney websites, bar association websites, or other related websites. Using web scraping tools for getting such information makes it much easier to get all the relevant and important information without actually having to worry about the same.

If you wish to scrape data from lawyer, you are entitled to information such as lawyer name, firm names, address, contact details, history about the lawyers, educational qualifications, the bar association they are part of and much more.

Scraping lawyer data ensure that you also have images of the lawyer you are concentrating on. The result of scrape data form lawyer can be obtained in any format the user wants such as csv, excel, MySql etc. Scraping lawyer data also ensures that none of the information provided are repetitive or redundant.

If you are in need of information regarding any lawyer such as their contact details, address etc. it could end up being a huge and difficult task to get it manually or physically. Thus, taking off the help of scraping tools would ensure that you get all the needed information without actually having to bother about anything at all. The presence of lots of attorney websites and the fact that more and more lawyers are moving to the internet makes getting information easy with the help of some great tools. Scraping data is a very useful and handy method in which one can get all the required and relevant information and that too in a very easy to read format, which makes the method even worthier.

There are quite a few tools or services that you can take help of to get lawyers data scraped. Most of these services also provide with a sample demo and that free of cost. From the sample one can decide if they wish to continue with the services or try some other services. Thus, if you want any information from attorney websites or information about any lawyers, data scraping is a great way to get the same.

Source: https://3idatascraping.wordpress.com/2014/03/18/lawyers-attorneys-website-data-scraping-services/

Tuesday 28 April 2015

Benefits of Scraping Data from Real Estate Website

With so much of growth in the recent times in real estate industry, it is likely that companies would want to create something different or use another method, so as to get desired benefits. Thus, it is best to go with the technological advancements and create real estate websites to get an edge over others in the industry. And to get all the information regarding website content, one can opt for real estate data scraping methods.

About real estate website scraping

Internet has become an important part of our daily lives and in industry marketing procedures too. With the use of website scraping one can easily scrape real estate listing from various websites. One just needs the help of experts and with proper software and tools; they can easily collect all the relevant real estate data from the required real estate websites and make a structured file containing the information. With internet becoming a valid platform for information and data submitted by numerous sources from around the globe, it is necessary to gather them all in one place for companies. In this way, the company can know what it lacks and work upon their strategies so as to gain profit and get to the top of the business world by taking one step at a time.

Uses of real estate website scraping

With proper use of website scraping one can collect and scrape the real estate listings which can help the company in the real estate market area. One can draw the attention of potential customers by designing the company strategies in such a way as contemplating the changing trends in the real estate global arena. All this is done with the help of the data collected from various real estate websites. With the help of proper website, one can collect the data and these get updated whenever new information gets into the web portal. In this way the company is kept updated about the various changes happening around the global market and thus, ensure in making plans regarding the company. This way one can plan ahead and take steps that can lead to the company gaining profits in future.

Thus, with the help of proper real estate website scraping one can be sure of getting all the information regarding real estate market. This way one can work upon making the company move as per the market trends and get a stronghold in real estate business.

Source: https://3idatascraping.wordpress.com/2013/09/25/benefit-of-scraping-data-from-real-estate-website/

Sunday 26 April 2015

Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages this is to cook up some regular expressions that match the pieces you want (e.g., URL's and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it
into a database.

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Wednesday 22 April 2015

How to Properly Scrape Windows During The Cleaning Process

Removing ordinary dirt such as dust, fingerprints, and oil from windows seem simple enough. However, sometimes, you may find stubborn caked-on dirt or debris on your windows that cannot be removed by standard window cleaning techniques such as scrubbing or using a squeegee. The best way to remove caked-on dirt on your windows is to scrape it off. Nonetheless, you have to be extra careful when you are scraping your windows, because they can be easily scratched and damaged. Here are a number of rules that you need to follow when you are scraping windows.

Rule No. 1: It is recommended that you use a professional window scraper to remove caked-on dirt and debris from your windows. This type of scraper is specially made for use on glass, and it comes with certain features that can prevent scratching and other kinds of damage.

Rule No. 2: It is important to inspect your window scraper before using it. Take a look at the blade of the scraper and make sure that it is not rusted. Also, it must not be bent or chipped off at the corners. If you are not certain whether the blade is in a good enough condition, you should just play it safe by using a new blade.

Rule No. 3: When you are working with a window scraper, always use forward plow-like scraping motions. Scrape forward and lift the scraper off the glass, and then scrape forward again. Try not to slide the scraper backwards, because you may trap debris under the blade when you do so. Consequently, the scraper may scratch the glass.

Rule No. 4: Be extra cautious when you are using a window scraper on tempered glass. Tempered glass may have raised imperfections, which make it more vulnerable to scratches. To find out if the window that you are scraping is made of tempered glass, you have to look for a label in one of its corners.

Window Scraping Procedures

Before you start scraping, you have to wet your window with soapy water first. Then, find out how the window scraper works by testing it in a corner. Scrape on the same spot three or four times in forward motion. If you find that the scraper is moving smoothly and not scratching the glass, you can continue to work on the rest of the window. On the other hand, if you feel as if the scraper is sliding on sandpaper, you have to stop scraping. This indicates that the glass may be flawed and have raised imperfections, and scraping will result in scratches.

After you have ascertained that it is safe to scrape your window, start working along the edges. It is best that you start scraping from the middle of an edge, moving towards the corners. Work in a one or two inch pattern, until all the edges of the glass are properly scraped. After that, scrape the rest of the window in a straight pattern of four or five inches, working from top to bottom. If you find that the window is beginning to dry while you are working, wet it with soapy water again.

Source: http://ezinearticles.com/?How-to-Properly-Scrape-Windows-During-The-Cleaning-Process&id=6592930

Saturday 18 April 2015

What is HTML Scraping and how it works

There are many reasons why there may be a requirement to pull data or information from other sites, and usually the process begins after checking whether the site has an official API. There are very few people who are aware about the presence of structured data that is supported by every website automatically. We are basically talking about pulling data right from the HTML, also referred to as HTML scraping. This is an awesome way of gleaning data and information from third party websites.

Any webpage content that can be viewed can be scraped without any trouble. If there is any way provided by the website to the browser of the visitor to download content and use the same in a highly structured manner, in that case, accessing of the content programmatically is possible. HTML scraping works in an amazing manner.

Before indulging in HTML scraping, one can inspect the browser for network traffic. Site owners have a couple of tricks up their sleeve to thwart this access, but majority of them can be worked around.

Before moving on to how HTML scraping works, we must understand the reasons behind the same. Why is scraping needed? Once you get a satisfactory answer to this question, you can start looking for RSS or API feeds or various other traditional structured data forms. It is significant to understand that when compared with APIs, websites are more significant.

The most important advantage of the same is the maintenance of their websites where a lot of visitors visit rather than safeguarding structured data feeds. With Tweeter, the same has been publicly seen when it clamps down on the developer ecosystem. Many times, API feeds change or move without any prior warning. Many times, it can also be a deliberate attempt, but mostly, such issues or problems erupt as there is no authority or an organization that maintains or takes care of the structured data. It is rarely noticed, if the same gets severely mangled or goes offline. In case the website has certain issues or the website no longer works, the problem is more in the form of a ball in your court requiring dealing with the same without losing any time. api-comic-image

Rate limiting is another factor that needs a lot of thinking and in case of public websites, it virtually doesn’t exist. Besides some occasional sign up pages or captchas, many business websites fail to create and built defenses against any unwarranted automated access. Many times, a single website can be scraped for four hours straight without anyone noticing. There are chances that you would not be viewed under DDOS attack unless concurrent requests are being made by you. You will be seen just as an avid visitor or an enthusiast in the logs, that too, in case anyone is looking.

Another factor in HTML scraping is that one can easily access any website anonymously. Behavior tracking can be done with a few ways by the administrator of the website and this turns out to be beneficial if you want to privately gather the data. Many times, registration is imperative with APIs in order to get key and with any request being sent, this key also needs to be sent. But, in case of simple and straightforward HTTP requests, the visitor can stay anonymous besides cookies and IP address, which can again be spoofed.

The availability of HTML scraping is universal and there is no need to wait for the opening of the site for an API or for contacting anyone in the organization. One simply needs to spend some time and browse websites at a leisurely pace until the data you want is available and then find out the basic patterns to access the same.

Now you need to don a hat of a professional scraper and simply dive in. Initially, it may take some time to work up figuring out the way the data have been structured and the way it can be accessed just as we read APIs. If there is no documentation unlike APIs, you need to be a little more smart about it and use clever tricks.

Some of the most used tricks are

Data Fetching


The first thing that is required is data fetching. Find endpoints to begin with, that is the URLs that can help in returning the data that is required. If you are pretty sure about the data and the way it should be structured so as to match your requirements, you will require a particular subset for the same and later you can indulge in site browsing using the navigation tools.

GET Parameter

The URLs must be paid attention to and see the way it changes as you indulge in clicking between the sections and the way they divide into various subsections. Before starting, the other option that can be used is to straight away go to the search functionality of the site. Certain terms can be typed and the URL needs to be focused again for watching the changes on the basis of what is being searched. A GET parameter will be probably seen like q which changes on the basis of the search term used by you. Other GET parameters that are not being used can be removed from the URL until only the ones that are needed are left for data loading. Before a query string, there must always be a “?” beginning.

Now the time has come when you would have started to come across the data that you would like to see and want to access, but sometimes, there may be certain pagination issues that require to be dealt with. Due to these issues, you may not be able to see the data in its entirety. Single requests are kept away by many APIs as well from database slamming. Many times, clicking the next page can add some offset parameter that helps in data visibility on the page. All these steps will help you succeed in HTML scraping.

Source: https://www.promptcloud.com/blog/what-is-html-scraping-and-how-it-works/

Tuesday 7 April 2015

Thoughts on scraping SERPs and APIs

Google says that scraping keyword rankings is against their policy from what I've read. Bummer. We comprise a lot of reports and manual finding and entry was a pain. Enter Moz! We still manually check and compare, but it's nice having that tool. I'm confused now though about practices and getting SERPs in an automated way. Here are my questions

  1.     Is it against policy to get SERPs from an automated method? If that is the case, isn't Moz breaking this policy with it's awesome keyword tracker?
  2.     If it's not, and we wanted to grab that kind of data, how would we do it? Right now, Moz's API doesn't offer this data. I thought Raven Tools at one point offered this, but they don't now from what I've read. Are there any APIs out there that we can grab this data and do what we want with it? (let's day build our own dashboard)?

Thanks for any clarification and input!

Source: http://moz.com/community/q/thoughts-on-scraping-serps-and-apis

Saturday 4 April 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge the fact that lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, lead generation and gaining insight about customer behavior and preferences is an essential business requirement. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but their efficacy has dwindled over time. Web scraping is the fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference


With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertain the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identifying the leads that would make a difference for your business.

web-scraping-service

Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

  •     Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links preserve. The stored pages can then be viewed and scoured for information at any later time.     Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.
  •     Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.
  •     Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.
  • With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –
  •     Contact details of the prospect
  •     Purchasing power and purchasing history of the prospect
  •     Past purchasing trends, willingness to purchase and history of buying preferences of the prospect
  •     Social markers that are indicative of behavioral patterns
  •     Commercial and business markers that are indicative of behavioral patterns
  •     Transactional details
  •     Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you have to ensure whether a lead is viable and actionable, or not. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping website’s


During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding- linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects can be constructively used while designing sales campaigns.

Check out How to use Twitter data for your business

3. Transaction scraping


Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

get-results-from-your-lead-generation-campaign


Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/

Monday 30 March 2015

How does Web Scraping Identify the Data you Want

The Web is one of the biggest sources of data that should be leveraged for your business. Be it an email, an URL or even a hyperlink text you are looking at, it comprises data that could be translated into useful information for your business. The challenge however lies in identifying the data that is relevant for your needs and enabling access to the required data. Web Scraping tools, however, are geared to help you address this need and leverage the benefit of this huge information repository.

Web Scraping and how it Works?

 Web Scraping is the practice followed to extract data from relevant sources on the Web and transforming them into crucial information packages for use in your business. This is an automated process which is executed with the help of a host of intuitive Web Extraction tools, thus facilitating ease, accuracy and convenience in extracting vital data.

Scrapers also work by writing intelligent pieces of code that scour the web and extract data that you need for the benefit of your business. The languages used for coding these scrapers are Python, Ruby and PHP. The language you use will be determined by the community you have access to.

As mentioned earlier, the biggest challenge that web scraping is subjected to include the identification of the right URL, page and element in order to scrape out the required information. No matter how good you may be at coding scripts, no amount of that will help you achieve your objective if you fail to develop an understanding of the way the web is structured. It is this which will enable you to structure your code in a manner that will be the most effective in scraping the desired information.

Understanding a Web Site

 A Web Site appears on your browser owing to two technologies. These include:

  •     HTTP – The language used to communicate with the server for requesting the retrieval of resources, namely, images, videos, and documents and so on.
  •     HTML – The language that helps to display the retrieved information on the browser.

The display format of your website is therefore defined using the HTML. It is within the folds of its syntax, that you will find the data which you need to extract. It is, therefore, important that you understand the anatomy of a web site by studying the structure of an HTML Page.

The HTML Page Structure

 An HTML page comprises a stack of elements known as tags, each bearing a specific significance. The first among these being the header tags that comprises mostly all the elements within it. The table element, the most important so far as data containers are concerned, is a crucial element that you need to study. It comprises several table rows (TR) and table data (TD) elements that hold the vital data nuggets that you might need to train your scrapers to extract.

In addition to these, HTML pages comprise a series of other tags that act as vital data holders, namely, image tags (img src), hyperlinks (a href) and the div tags which essentially refer to a block of text.

The scraper code needs to be built around your understanding of the HTML elements. Knowing the elements will help you to understand the specific location where relevant data are stacked. This helps you to correctly define the code so as to enable the scraper to search and extract the right element in order to provide you with the most appropriate information.

We are leading Webdatascraping.us company and enough capable to extract website information, review scraping, contact information scraping, business directory scraping, email list scraping etc.

Friday 27 March 2015

Scraping expert's Amazon Scraper provides huge access to find your desired product on Amazon

Today, with latest advancement of technology we find plenty of ecommerce websites offering huge benefits to people by giving out various products from different categories at an affordable cost. One of the most renowned ecommerce website Amazon has come up with its all new launch of Amazon Scraper for the comfort of their customers. This product Amazon Scraper is also called web harvesting which is a computer software technique for getting out data from websites.

Today anyone can find such web scraping tools that are specifically designed for particular websites. Like for example, Amazon Scraper is also a web scraper tool or technique utilised to crawl, or scrap or even extract the data from the largest e commerce website called Amazon.com. Scrapingexpert.com offers best Amazon scraper for extracting plenty of products from websites easily.

Amazon scraper

Let us see how the Amazon Scraper works:


How to use: Step 1) Select the Category; Enter the (Keyword, UPC, and ASIN) Step 2) Set the delay in seconds Step 3) Click Start

Also you can Scrape the below given details from Amazon.com:

  •     Product Title & Description
  •     Category & Cost Manufacture,
  •     QTY Seller Name,
  •     Total Sellers Shipping Cost,
  •     Shipping / Product Weight ImageURL, IsBuyBoxFBA, Source Link
  •     Stars, Customer Reviews
  •     ASIN, UPC, Model Number Sales Rank,
  •     Sales Rank In Category

Here are some interesting Product Features:
  •     Single Screen Dashboard that shows total extracted records, extracted keywords, and elapse.
  •     Filter Search - Skip data that do not match phrases or keywords
  •     Compatible for Microsoft XP/Vista/Windows 7
  •     Option to set delay between requests to simulate a human surfing in a browser
  •     Extracted data is stored in CSV format, which you can easily open in excel
  • Benefits:
  •     Less Expensive - With our valuable services, we allow you to save both your efforts and money. We have some competitors who outsource their scraping projects to us.
  •     Guaranteed Accurate Results - We assure you get most reliable solutions with accurate results that cannot be collected by any ordinary human being or anyone else.
  •     Delivers Fast Results - We promise to get your work done in just few hours, which can take plenty of time if done by someone else. We save your time, workforce and money and give you an opportunity to stand at a distinction over your multiple competitors.
  •     System Requirement: Operating System - Windows XP, Windows Vista, Windows 7 Net Framework 2.0

Are you searching for some cost effective programs to extract data of other users? If your answer is yes, then we offer Amazon Screen Scraping which is the best Amazon Screen Scraping method of data extraction. Today, in this competitive world of advanced technology there are multiple companies who claim to offer best Amazon Screen Scraping services, so hiring their services for Amazon Screen Scraping can allow you to scrap almost any data in any format you wish to obtain. Well, we at Scrapingexpert.com study each and every single bit of little details of the scraping project and then provide you with a free quote and the date of completing the work

In order to get accurate data pertaining to a specific product, you can use our Awesome Amazon Scraper Tool. This Awesome Amazon Scraping Tool is very effective tool that will help you to extract information about any product from Amazon.

Websitedatascraping.com is enough capable to web data scraping, website data scraping, web scraping services, website scraping services, data scraping services, product information scraping and yellowpages data scraping.

Wednesday 25 March 2015

Data Mining Process - Why Outsource Data Mining Service?

Overview of Data Mining and Process:

Data mining is one of the unique techniques for investigating information to extract certain data patterns and decide to outcome of existing requirements. Data mining is widely use in client research, services analysis, market research and so on. It is totally based on mathematical algorithm and analytical skills to drive the desired results from the huge database collection.

Information mining is mostly used by financial analyzer, business and professional organization and also there are many growing area of business that are get maximum advantages of data extract with use of data warehouses in their small to large level of businesses.

Most of functionalities which are used in information collecting process define as under:

* Retrieving Data
* Analyzing Data
* Extracting Data
* Transforming Data
* Loading Data
* Managing Databases

Most of small, medium and large levels of businesses are collect huge amount of data or information for analysis and research to develop business. Such kind of large amount will help and makes it much important whenever information or data required.

Why Outsource Data Online Mining Service?

Outsourcing advantages of data mining services:


o Almost save 60% operating cost

o High quality analysis processes ensuring accuracy levels of almost 99.98%

o Guaranteed risk free outsourcing experience ensured by inflexible information security policies and practices

o Get your project done within a quick turnaround time

o You can measure highly skilled and expertise by taking benefits of Free Trial Program.

o Get the gathered information presented in a simple and easy to access format

Thus, data or information mining is very important part of the web research services and it is most useful process. By outsource data extraction and mining service; you can concentrate on your co relative business and growing fast as you desire.

Outsourcing web research is trusted and well known Internet Market research organization having years of experience in BPO (business process outsourcing) field.

If you want to more information about data mining services and related web research services, then contact us.

Outsourcing Web Research has best infrastructure includes 200+ workstations supported by advanced technologies for operational efficiency and optimum security of your data and information.

Source: http://ezinearticles.com/?Data-Mining-Process---Why-Outsource-Data-Mining-Service?&id=3789102