[Python Crawler Example Study] — 3. Visiting CSDN and Compiling to an exe File

==First, a disclaimer: this post is for technical reference only. If you really want more views, writing good posts is the way to go!!!==

Tools used:

  1. Python 3.6
  2. the requests library
  3. urllib's request module (why two HTTP libraries? because I just wanted some extra practice, haha)
  4. the random library
  5. the time library
  6. the pyinstaller library (used to build the exe file)

Contents:

  1. Building the proxy pool (GetFreeProxies.py)
  2. Getting the links to all articles
  3. Visiting the articles
  4. CSDNVisit.py
  5. Compiling
  6. Results

Building the proxy pool

For details, see [Python Crawler Example Study] — 2. Getting Free IP Proxies.
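CSDNVisit.py (shown later in this post) calls four functions from that module. For readers who skipped part 2, here is a minimal stub of the interface this script assumes: the function names come straight from the calls in CSDNVisit.py, while the bodies and the return format (a list of "ip:port" strings) are inferred from how Visit() uses them, so treat them as assumptions rather than the real implementation.

```python
# Minimal stub of the GetFreeProxies interface assumed by CSDNVisit.py.
# The real implementation is described in part 2 of this series; the
# return format (a list of "ip:port" strings) is an assumption here.

def GetFreeProxy():
    """Scrape a free-proxy site and return a single proxy, e.g. ['1.2.3.4:8080']."""
    raise NotImplementedError  # see part 2 of this series

def GetFreeProxyList():
    """Scrape a free-proxy site and return a batch of proxies (about 15)."""
    raise NotImplementedError

def GetFreeProxyAPI():
    """Query a free-proxy API and return a single proxy."""
    raise NotImplementedError

def GetFreeProxyListAPI():
    """Query a free-proxy API and return a batch of proxies (about 15)."""
    raise NotImplementedError
```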

Getting the links to all articles

Approach: open the blog homepage, then use XPath to extract the links to every article.

```python
from urllib import request
from lxml import etree


def GetUrlList(homeurl):
    # Fetch the blog homepage and extract the links to all articles.
    res = request.urlopen(url=homeurl)
    html = etree.HTML(res.read().decode('utf-8'))
    UrlList = html.xpath('//h4//a/@href')
    return UrlList
```
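A quick way to sanity-check the XPath is to run the function against a homepage and print what comes back; the URL below is just the example homepage used later in this post.

```python
# Quick check: print the article links found on a blog homepage.
links = GetUrlList('https://blog.csdn.net/qq_40528553')
print('Found %d article links' % len(links))
for link in links[:5]:
    print(link)
```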

Visiting the articles

CSDN checks the visitor's IP address when it records a read, so each request is routed through a free proxy; and because free proxies are unstable and prone to timeouts, try/except is used so that a failed connection does not abort the whole run.

```python
from urllib import request
from random import choice, random
from time import sleep
from lxml import etree
import GetFreeProxies


def Visit(homeurl, method, UrlList, num, IP_list):
    # Visit the articles until the target number of successful reads is reached.
    user_agent_list = [
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
    ]
    success = 0
    fail = 0
    i = 0
    # Outer while loop: keep going until the target count is reached.
    while num > success:
        i = i + 1
        headers = {
            'User-Agent': choice(user_agent_list),
            'Referer': homeurl
        }
        # Pick a proxy IP; refill the pool first if it has run out.
        if not IP_list:
            if method == 1:
                IP_list = GetFreeProxies.GetFreeProxy()
            elif method == 2:
                IP_list = GetFreeProxies.GetFreeProxyList()
            elif method == 3:
                IP_list = GetFreeProxies.GetFreeProxyAPI()
            elif method == 4:
                IP_list = GetFreeProxies.GetFreeProxyListAPI()
            else:
                print("No more methods available yet!")
                return 0
        ip = IP_list.pop(0).strip('\n')
        # Inner for loop: walk through all the articles with the current proxy.
        print('-' * 45)
        print("| Round: " + str(i) + ' |')
        for j in range(len(UrlList)):
            print('Now visiting article #' + str(j) + ', link: ' + UrlList[j])
            url = str(UrlList[j])
            proxy_dict = {'https': ip}
            handler = request.ProxyHandler(proxies=proxy_dict)
            opener = request.build_opener(handler)
            req = request.Request(url=url, headers=headers)
            # try/except guards against errors caused by unstable free proxies.
            try:
                res = opener.open(req, timeout=10).read().decode('utf-8')
                html = etree.HTML(res)
                VisitCount = html.xpath("//div/span[@class='read-count']")
                sleepnum = random() * 4 + 1
                print(str(VisitCount[0].text) + " || random delay: " + "%.2f" % sleepnum + " || proxy IP: " + ip)
                success = success + 1
                sleep(sleepnum)
            except Exception:
                sleepnum = random() * 2
                print('Proxy IP ' + ip + ' failed to connect! || random delay: ' + "%.2f" % sleepnum)
                fail = fail + 1
                sleep(sleepnum)
                break
        print('Succeeded: %i' % success + ' Failed: %i' % fail + ' || Connections: %i' % (success + fail) + ' || Remaining: %i' % (num - success))
```
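Stripped of the retry bookkeeping, the per-article step inside Visit boils down to: route the request through the current proxy with ProxyHandler/build_opener, send it with the spoofed headers, and read the read-count node via XPath. Here is a minimal sketch of just that step; the helper name fetch_read_count is mine, not part of the original script.

```python
from urllib import request
from lxml import etree


def fetch_read_count(url, ip, headers, timeout=10):
    # Route this single request through the given proxy (format "ip:port").
    handler = request.ProxyHandler(proxies={'https': ip})
    opener = request.build_opener(handler)
    req = request.Request(url=url, headers=headers)
    html = etree.HTML(opener.open(req, timeout=timeout).read().decode('utf-8'))
    # Same XPath as in Visit(): the read-count element on the article page.
    count = html.xpath("//div/span[@class='read-count']")
    return count[0].text if count else None
```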

CSDNVisit.py

Notes on the three inputs:
Blog homepage: e.g. https://blog.csdn.net/qq_40528553
Number of reads: e.g. 5000
Proxy method: e.g. 2

  • Method 1: scrape a web page directly and get 1 proxy IP
  • Method 2: scrape a web page directly and get 15 proxy IPs at once; recommended, success rate above 50%
  • Method 3: use an API to get 1 proxy IP at a time
  • Method 4: use an API to get 15 proxy IPs at a time
```python
from urllib import request
from random import choice, random
from time import sleep
from lxml import etree
import GetFreeProxies


def VisitCsdn(homeurl, num=1, method=4):
    # method=1: scrape a single free proxy from a web page
    # method=2: scrape a list of free proxies from a web page
    # method=3: get a single free proxy from an API
    # method=4: get a list of free proxies from an API (recommended)
    if method == 1:
        IP_list = GetFreeProxies.GetFreeProxy()
    elif method == 2:
        IP_list = GetFreeProxies.GetFreeProxyList()
    elif method == 3:
        IP_list = GetFreeProxies.GetFreeProxyAPI()
    elif method == 4:
        IP_list = GetFreeProxies.GetFreeProxyListAPI()
    else:
        print("No more methods available yet!")
        return 0
    UrlList = GetUrlList(homeurl)
    Visit(homeurl, method, UrlList, num, IP_list)


def GetUrlList(homeurl):
    # Fetch the blog homepage and extract the links to all articles.
    res = request.urlopen(url=homeurl)
    html = etree.HTML(res.read().decode('utf-8'))
    UrlList = html.xpath('//h4//a/@href')
    return UrlList


def Visit(homeurl, method, UrlList, num, IP_list):
    # Visit the articles until the target number of successful reads is reached.
    user_agent_list = [
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
    ]
    success = 0
    fail = 0
    i = 0
    # Outer while loop: keep going until the target count is reached.
    while num > success:
        i = i + 1
        headers = {
            'User-Agent': choice(user_agent_list),
            'Referer': homeurl
        }
        # Pick a proxy IP; refill the pool first if it has run out.
        if not IP_list:
            if method == 1:
                IP_list = GetFreeProxies.GetFreeProxy()
            elif method == 2:
                IP_list = GetFreeProxies.GetFreeProxyList()
            elif method == 3:
                IP_list = GetFreeProxies.GetFreeProxyAPI()
            elif method == 4:
                IP_list = GetFreeProxies.GetFreeProxyListAPI()
            else:
                print("No more methods available yet!")
                return 0
        ip = IP_list.pop(0).strip('\n')
        # Inner for loop: walk through all the articles with the current proxy.
        print('-' * 45)
        print("| Round: " + str(i) + ' |')
        for j in range(len(UrlList)):
            print('Now visiting article #' + str(j) + ', link: ' + UrlList[j])
            url = str(UrlList[j])
            proxy_dict = {'https': ip}
            handler = request.ProxyHandler(proxies=proxy_dict)
            opener = request.build_opener(handler)
            req = request.Request(url=url, headers=headers)
            # try/except guards against errors caused by unstable free proxies.
            try:
                res = opener.open(req, timeout=10).read().decode('utf-8')
                html = etree.HTML(res)
                VisitCount = html.xpath("//div/span[@class='read-count']")
                sleepnum = random() * 4 + 1
                print(str(VisitCount[0].text) + " || random delay: " + "%.2f" % sleepnum + " || proxy IP: " + ip)
                success = success + 1
                sleep(sleepnum)
            except Exception:
                sleepnum = random() * 2
                print('Proxy IP ' + ip + ' failed to connect! || random delay: ' + "%.2f" % sleepnum)
                fail = fail + 1
                sleep(sleepnum)
                break
        print('Succeeded: %i' % success + ' Failed: %i' % fail + ' || Connections: %i' % (success + fail) + ' || Remaining: %i' % (num - success))


if __name__ == '__main__':
    print('*' * 45)
    print('* CSDN Read-Count Booster V1.0\t *')
    print('* -by asyu17\t *')
    print('* 2020/1/14\t *')
    print('*' * 45)

    homeurl = input('Enter your blog homepage: ')
    num = input('Enter the number of reads: ')
    method = input('Choose a proxy method: ')

    VisitCsdn(homeurl, int(num), method=int(method))
```
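The interactive prompts are just a thin wrapper; the same entry point can also be called directly from Python, for example (the values here are only an illustration):

```python
# Visit the example blog 10 times using method 2
# (scrape a batch of ~15 free proxies from a web page).
VisitCsdn('https://blog.csdn.net/qq_40528553', num=10, method=2)
```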

Compiling

  • Install pyinstaller with pip install pyinstaller
  • In cmd, change to the directory that contains the Python scripts, e.g. cd D:\code\python\2020_spider_1
  • Then run pyinstaller -F -i asyu17.ico CSDNVisit.py
  • The generated executable is in the dist directory

Results

(Screenshot of the results.)

==WeChat Official Account:==

小术快跑