How do I scrape web page content with a Python crawler?

For example Sina, QQ, and so on.

December 5, 2024, 00:25

2 users answered

User (1):

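Crawler workflow
Viewed abstractly, a web crawler comes down to the following steps:
1. Request the page. Act like a browser and open the target site.
2. Extract the data. Once the page is open, the data we need can be pulled out of it automatically.
3. Save the data. Once we have the data, it needs to be persisted to a local file, a database, or some other storage.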
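The three steps above can be sketched as three small functions. This is only a minimal illustration, not the answer's own code: it uses the standard library (urllib.request, html.parser, json) so it runs without installing anything, and the names fetch, extract, save, TitleParser, and the output file name are all made up for the example. The demonstration feeds in a hard-coded page so that nothing is downloaded; fetch(url) would supply the HTML for a real site.

```python
import json
import tempfile
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Step 1: request the page and return its HTML as text."""
    # The User-Agent value is an assumption; it only makes the request look browser-like.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html):
    """Step 2: pull the wanted data (here just the page title) out of the HTML."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip()}


def save(record, path):
    """Step 3: persist the data, here as a JSON file."""
    Path(path).write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")


# Offline demonstration: a hard-coded page stands in for fetch(url).
sample_html = "<html><head><title> Example page </title></head><body></body></html>"
record = extract(sample_html)
output_path = Path(tempfile.gettempdir()) / "crawler_sample.json"
save(record, output_path)
print(record)
```

For a real page, replace sample_html with fetch(url). Keeping the steps separate means the extraction logic can be checked against a saved copy of the page before any live requests are made.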
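So how do we write our own crawler in Python? Here I want to focus on one Python library: Requests.
Using Requests
Requests is a Python library for making HTTP requests, and it is simple and convenient to use.
Simulating an HTTP request
Sending a GET request
When we open the Douban home page in a browser, the very first request that goes out is a GET request: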
import requests
res = requests.get('http://www.douban.com')
print(res)        # the Response object; its repr shows the HTTP status code
print(type(res))  # <class 'requests.models.Response'>
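To see what such a GET request actually contains, Requests can also prepare a request without sending it. The sketch below is an illustration under assumptions: the helper names build_get and fetch, the User-Agent value, and the q=python query parameter are invented for the example and do not come from the answer above. Only build_get runs here, so no network traffic is generated.

```python
import requests

# Assumed header value; any realistic browser User-Agent string would do.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_get(url, params=None):
    """Prepare a GET request without sending it, so it can be inspected."""
    return requests.Request("GET", url, headers=HEADERS, params=params).prepare()


def fetch(url, params=None, timeout=10):
    """Send the prepared GET request and return the body text."""
    with requests.Session() as session:
        response = session.send(build_get(url, params), timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.text


# Inspect the request only; the query parameter is hypothetical.
prepared = build_get("http://www.douban.com", params={"q": "python"})
print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Calling fetch('http://www.douban.com') would send the request for real. The timeout and raise_for_status() are there so a stalled connection or an error page does not get silently treated as valid content.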

User (2):

First, install requests and BeautifulSoup4, then run the following code.

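import requests
from bs4 import BeautifulSoup

iurl = 'http://news.sina.com.cn/c/nd/2017-08-03/doc-ifyitapp0128744.shtml'

res = requests.get(iurl)
res.encoding = 'utf-8'  # decode the response body as UTF-8

# print(len(res.text))  # uncomment to check how much HTML came back

soup = BeautifulSoup(res.text, 'html.parser')

# Headline
H1 = soup.select('#artibodyTitle')[0].text

# Publication time and source
time_source = soup.select('.time-source')[0].text

# Origin line (first paragraph of the article body)
origin = soup.select('#artibody p')[0].text.strip()

# Original title (second paragraph of the article body)
oriTitle = soup.select('#artibody p')[1].text.strip()

# Body text (the slice 2:19 is specific to this one article)
raw_content = soup.select('#artibody p')[2:19]
content = []
for paragraph in raw_content:
    content.append(paragraph.text.strip())
content_text = '@'.join(content)  # keep the joined text instead of discarding it

# Responsible editor
ae = soup.select('.article-editor')[0].text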

That's all there is to it.
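One caveat: indexing with [0] raises IndexError as soon as the page layout changes and a selector stops matching. A more forgiving variant might look like the sketch below. It reuses the selectors from the code above, but the helper names text_of and parse_article are made up for the example, and the sample markup is invented purely to mimic those ids and classes; it is not the real Sina page.

```python
from bs4 import BeautifulSoup


def text_of(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_article(html):
    """Extract the article fields, tolerating missing elements."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.select("#artibody p")]
    return {
        "title": text_of(soup, "#artibodyTitle"),
        "time_source": text_of(soup, ".time-source"),
        "paragraphs": [p for p in paragraphs if p],  # drop empty paragraphs
        "editor": text_of(soup, ".article-editor"),
    }


# Made-up markup that only mimics the ids and classes used above.
sample_html = """
<h1 id="artibodyTitle">Sample headline</h1>
<span class="time-source">2017-08-03 10:00 Sample source</span>
<div id="artibody"><p>First paragraph.</p><p> </p><p>Second paragraph.</p></div>
"""
article = parse_article(sample_html)
print(article)
```

With a live page, pass res.text to parse_article. A field that comes back as None then points to a selector that no longer matches, rather than crashing the whole script.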