Since you seem like the try-first, ask-questions-later type (that's a
very good thing), I won't give you an answer, but a (very detailed)
guide on how to find the answer.
The thing is, unless you are a
Yahoo developer, you probably don't have access to the source code of
the site you're trying to scrape. That is to say, you don't know exactly how the
site is built or how your requests to it as a user are processed
on the server side. You can, however, investigate the client side and
try to emulate it. I like using the Chrome Developer Tools for this, but you
can use others, such as Firefox's Firebug.
So first off, we need to
figure out what's going on. The way it works is: you click
'Show Comments', it loads the first ten, and then you need to keep clicking
to load the next ten comments each time. Notice, however, that all this
clicking isn't taking you to a different link; it fetches the
comments live, which makes for a very neat UI but, for our purposes, requires a bit more
work. I can tell two things right away:
They're using JavaScript to load the comments (because I'm staying on the same page).
They load them dynamically with AJAX calls each time you click (meaning that
instead of loading the comments with the page and just showing them to
you, each click makes another request to the server).
Now let's right-click that button and inspect the element. It's actually just a simple span with text:
<span>View Comments (2077)</span>
Looking
at that, we still don't know how it's generated or what it
does when clicked. Fine. Now, keeping the devtools window open, let's
click on it. That opened up the first ten. But behind the scenes, a request was
made to fetch them, a request that Chrome devtools
recorded. We look in the Network tab of the devtools and see a lot of
confusing data. Wait, here's one that makes sense:
http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1
See?
_xhr and then get_comments. That makes a lot of sense. Going to that
link in the browser gave me a JSON object (it looks like a Python
dictionary) containing the ten comments that the request fetched.
Now that's the request you need to emulate, because that's the one that
gives you what you want. First, let's translate it into a normal
request that a human can read:
go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
'_media.modules.content_comments.switches._enable_mutecommenter': '1',
'_media.modules.content_comments.switches._enable_view_others': '1',
'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
'count': '10',
'enable_collapsed_comment': '1',
'isNext': 'true',
'offset': '20',
'pageNumber': '2',
'sortBy': 'highestRated'}
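As a minimal sketch, the URL and parameters above can be put back together in Python with the standard library. The endpoint and parameter values are copied from the devtools capture; Yahoo may have changed or removed this API since, so treat the actual fetch as illustrative.

```python
# Rebuild the captured get_comments request from its base URL and parameters.
from urllib.parse import urlencode

BASE_URL = "http://news.yahoo.com/_xhr/contentcomments/get_comments/"

params = {
    '_device': 'full',
    '_media.modules.content_comments.switches._enable_mutecommenter': '1',
    '_media.modules.content_comments.switches._enable_view_others': '1',
    'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
    'count': '10',
    'enable_collapsed_comment': '1',
    'isNext': 'true',
    'offset': '20',
    'pageNumber': '2',
    'sortBy': 'highestRated',
}

# urlencode turns the dict into a query string identical in content
# to the one devtools recorded (parameter order aside).
url = BASE_URL + '?' + urlencode(params)
print(url)

# To actually fetch it (requires network access, and the endpoint may be gone):
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url))
```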
Now it's just a matter of trial and error. However, a few things to note here:
Obviously, count is what decides how many comments you're getting. I
tried changing it to 100 to see what would happen and got a bad request, and
it was nice enough to tell me why: "Offset should be multiple of total
rows". So now we understand how to use offset.
The content_id
is probably what identifies the article you are reading,
meaning you need to fetch it from the original page somehow. Try
digging around a little; you'll find it.
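Once you've found where the content_id sits in the article's HTML, a regex is one simple way to pull it out. Exactly where it appears in Yahoo's markup is something you'll have to discover with devtools; the sample HTML and attribute name below are made up for illustration, and only the UUID format matches the real value.

```python
# Sketch: extract the content_id from the article page's source with a regex.
# The surrounding markup here is hypothetical; adjust the pattern to
# whatever you actually find in the page.
import re

html = '... <div data-content_id="42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc"> ...'

# Look for the name "content_id" followed by a 36-character UUID-like value.
match = re.search(r'content_id["\s:=]+([0-9a-f-]{36})', html)
content_id = match.group(1) if match else None
print(content_id)  # 42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc
```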
Also, you obviously
don't want to fetch 10 comments at a time, so it's probably a good idea
to find a way to get the total number of comments somehow (either find
out how the page gets it, or just scrape it from the article
itself).
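Given that total, and the error message above hinting that offset must step in multiples of count, walking through all the pages is simple arithmetic. A sketch, assuming the 2077 total from the "View Comments (2077)" span:

```python
# Sketch: compute the offsets needed to page through every comment.
total_comments = 2077   # e.g. scraped from "View Comments (2077)"
count = 10              # comments fetched per request

# "Offset should be multiple of total rows" suggests stepping by `count`.
offsets = list(range(0, total_comments, count))
print(offsets[:3], "...", offsets[-1])  # [0, 10, 20] ... 2070

# Each offset would then go into the request parameters:
# params['offset'] = str(offset)
```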
Using the devtools, you have access to all the client-side
scripts. By digging, you can find that the link to /get_comments/ is
kept within a JavaScript object named YUI. You can then try to
understand how it makes the request, and emulate that (though
you can probably figure it out yourself).
You might need to
overcome some security measures. For example, you might need a
session key from the original article before you can access the
comments; this is used to prevent direct access to some parts of a
site. I won't trouble you with the details, because it doesn't seem
to be a problem in this case, but you do need to be aware of it in case
it shows up.
Finally, you'll have to parse the JSON object
(Python has excellent built-in tools for that) and then parse the HTML
comments you get back (for which you might want to check out
BeautifulSoup).
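Here's a minimal sketch of that last step using only the standard library (json plus html.parser; BeautifulSoup would make the HTML part nicer). The field names "comments" and "content" are placeholders; inspect the JSON you actually get back from Yahoo to find the real keys.

```python
# Sketch: parse a JSON response and strip the HTML out of each comment.
import json
from html.parser import HTMLParser

# A made-up response body standing in for what the endpoint returns.
raw = '{"comments": [{"content": "<p>First comment</p>"}, {"content": "<p>Nice <b>article</b></p>"}]}'

class TextExtractor(HTMLParser):
    """Collects the plain-text content of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(fragment):
    extractor = TextExtractor()
    extractor.feed(fragment)
    return ''.join(extractor.parts)

data = json.loads(raw)
texts = [strip_html(c['content']) for c in data['comments']]
print(texts)  # ['First comment', 'Nice article']
```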
As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task either.
So don't panic.
It's
just a matter of digging and digging until you find gold (also, having
some basic web knowledge doesn't hurt). Then, if you hit a roadblock
and really can't go any further, come back here to SO and ask again.
Someone will help you.
Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python