VII Python (7): Web Crawlers

Posted by cundeng on 2018-8-7 08:41:43

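　　Web crawlers (also called web spiders):
　　Accessing the Internet from Python:
　　The urllib and urllib2 modules (Python 2.* has two separate modules, urllib and urllib2; Python 3, e.g. 3.4.1, merges them into a single package named urllib. Note that in version 3 it is a package, not a module);
　　The json module (JSON is a lightweight data-interchange format; here it is used to wrap Python data structures in string form);
　　General format of a URL: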
　　protocol://hostname[:port]/path/to/file
　　Common protocols: http, https, ftp, file, ed2k
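　　The general format above can be checked by splitting a URL into its parts. This is only a minimal sketch: the example.com address is a made-up placeholder, and the import fallback assumes the code may run on either Python 2 or Python 3.

```python
# Split a URL into the parts named in the general format above.
try:
    from urllib.parse import urlsplit   # Python 3
except ImportError:
    from urlparse import urlsplit       # Python 2

# Hypothetical URL, used only for illustration.
parts = urlsplit('http://www.example.com:8080/path/to/file')

print(parts.scheme)    # protocol
print(parts.hostname)  # hostname
print(parts.port)      # port (None when the URL does not specify one)
print(parts.path)      # /path/to/file
```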
　　In : import urllib
　　In : dir(urllib)
　　……
　　'urlopen',
　　'urlretrieve']
　　In : help(urllib.urlopen)
　　urlopen(url, data=None, proxies=None)
　　Create a file-like object for the specified URL to read from.
　　In : help(urllib.urlretrieve)
　　urlretrieve(url, filename=None, reporthook=None, data=None)
　　In : help(urllib.urlencode)
　　urlencode(query, doseq=0)
　　Encode a sequence of two-element tuples or dictionary into a URL querystring.
　　In : import urllib2
　　In : help(urllib2.urlopen)
　　urlopen(url, data=None, timeout=<object object>)
　　In : help(urllib2.Request)
　　__init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)
　　add_header(self, key, val)
　　In : help(urllib2.ProxyHandler)
　　__init__(self, proxies=None)
　　proxy_open(self, req, proxy, type)
　　In : import json
　　In : json.<TAB>
　　json.JSONDecoder  json.decoder      json.dumps        json.load         json.scanner
　　json.JSONEncoder  json.dump         json.encoder      json.loads
　　In : help(json.loads)
　　loads(s, encoding=None, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
　　Deserialize ``s`` (a ``str`` or ``unicode`` instance containing a JSON
　　document) to a Python object.
　　In : import time
　　In : time.<TAB>
　　time.accept2dyear time.clock       time.gmtime    time.sleep       time.struct_time time.tzname
　　time.altzone    time.ctime       time.localtime time.strftime    time.time       time.tzset
　　time.asctime    time.daylight    time.mktime    time.strptime    time.timezone
　　In : help(time.sleep)
　　sleep(...)
　　sleep(seconds)
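　　The help text above for urlencode and json.loads can be tried without any network access. The following is a small sketch under assumed sample values (the field names and the dictionary are invented for illustration); the import fallback covers the Python 2 / Python 3 split mentioned earlier.

```python
import json
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# urlencode accepts a dict or a sequence of two-element tuples.
# A list of tuples keeps the output order predictable.
query = urlencode([('doctype', 'json'), ('i', 'hello world')])
print(query)

# json.dumps wraps a Python structure in a string; json.loads reverses it.
text = json.dumps({'name': 'cat', 'sizes': [500, 600]})
data = json.loads(text)
print(data['sizes'][0])
```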
　　Example 1:
　　In : response=urllib.urlopen('http://www.FishC.com')
　　In : html=response.read()
　　In : print html # (if the printed content, which is the same markup the browser shows under Inspect Element, looks garbled, decode it according to the site's encoding, e.g. html=html.decode('utf-8'))
　　<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
　　"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
　　
　　<html xmlns="http://www.w3.org/1999/xhtml">
　　<head>
　　<meta http-equiv="content-type" content="text/html; " />
　　……
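　　In : response.<TAB> # (methods and attributes available on the opened page: geturl() returns the address that was fetched, info() returns a message object holding the response headers, getcode() returns the HTTP status code)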
　　response.close    response.fp       response.headers response.read    response.url
　　response.code    response.getcode response.info    response.readline
　　response.fileno response.geturl response.next    response.readlines
　　In : response.geturl()
　　Out: 'http://www.FishC.com'
　　In : response.info()
　　Out: <httplib.HTTPMessage instance at 0x16a7b48>
　　In : print response.info
　　<bound method addinfourl.info of <addinfourl at 23755304 whose fp = <socket._fileobject object at 0x15abbd0>>>
　　In : response.getcode()
　　Out: 200
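　　Example 1 notes that the page may need decoding according to the site's encoding. One possible way to pick that encoding is to read the charset from the Content-Type header, sketched below. The helper name decode_html and the sample bytes are assumptions made for illustration, not part of the original session.

```python
import re

def decode_html(raw, content_type, default='utf-8'):
    """Decode raw page bytes using the charset named in a Content-Type value."""
    match = re.search(r'charset=([\w-]+)', content_type or '', re.I)
    charset = match.group(1) if match else default
    # errors='replace' keeps going if a few bytes do not fit the charset.
    return raw.decode(charset, 'replace')

# Hypothetical sample data standing in for response.read() and
# response.info().getheader('Content-Type') in Python 2.
raw = u'<title>\u732b</title>'.encode('gbk')
html = decode_html(raw, 'text/html; charset=gbk')
print(html == u'<title>\u732b</title>')
```

　　When the header carries no charset, the sketch falls back to utf-8, which is a guess rather than a guarantee.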
　　Example 2 (saving an image from the site placekitten.com):
　　# vim download_cat.py
　　-----------------------script start-----------------------
　　#!/usr/bin/python2.7
　　#filename:download_cat.py
　　import urllib
　　response=urllib.urlopen('http://placekitten.com/g/500/600')
　　cat_img=response.read()
　　with open('cat_500_600.jpg','wb') as f:
　　    f.write(cat_img)
　　----------------------script end--------------------------
　　# chmod 755 download_cat.py
　　# python2.7 download_cat.py
　　# ll cat_500_600.jpg
　　-rw-r--r--. 1 root root 26590 Jun 19 22:10 cat_500_600.jpg
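　　Example 3 (imitating the browser's online translation):
　　In the web page, right-click and choose Inspect Element --> Network, then locate the translation request; what we need is under its Headers tab

　　Under Headers: Request URL in the General section (translation only works against this address), User-Agent in the Request Headers section (the server uses it to judge whether the visitor is a program rather than a person, but the value can be set freely), and Form Data (the main content submitted by POST)
　　Note: GET requests data from the server; POST submits data to the specified server for processing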
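　　The GET/POST distinction can be observed on a request object before anything is sent. A rough sketch follows; the example.com address and the two form fields are placeholders, and nothing here contacts a server.

```python
try:
    from urllib.request import Request   # Python 3
    from urllib.parse import urlencode
except ImportError:
    from urllib2 import Request          # Python 2
    from urllib import urlencode

url = 'http://www.example.com/translate'  # hypothetical address

# No data: the request is a GET.
get_req = Request(url)

# With data: the same class issues a POST. Python 3 needs bytes here.
form = urlencode({'i': 'girl', 'doctype': 'json'}).encode('ascii')
post_req = Request(url, form)

print(get_req.get_method())
print(post_req.get_method())
```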
　　# vim translation.py
　　---------------------------script start------------------------
　　#!/usr/bin/python2.7
　　#filename:translation.py
　　import urllib
　　import json
　　content=raw_input('please input translate content: ')
　　url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'
　　data={}
　　data['type']='AUTO'
　　data['i']=content
　　data['doctype']='json'
　　data['xmlVersion']='1.8'
　　data['keyfrom']='fanyi.web'
　　data['ue']='UTF-8'
　　data['action']='FY_BY_CLICKBUTTON'
　　data['typoResult']='true'
　　data=urllib.urlencode(data)
　　response=urllib.urlopen(url,data)
　　html=response.read()
　　target=json.loads(html)
　　print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])
　　-----------------------------script end---------------------------
　　# python2.7 translation.py
　　please input translate content: girl
　　Translate the result: 女孩
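　　Notes:
　　Possible improvements to this script:
　　The code can be placed in a while loop that exits when the user types quit or q;
　　This script is not suitable for production use: the server inspects User-Agent to decide whether a request comes from a person or from code, and blocks the client once too many requests come from code. The fix is to replace the User-Agent, in one of two ways: (1) define head={'User-Agent':'……'} in advance and pass it to urllib2.Request(url,data,head); (2) after creating req=urllib2.Request(url,data), add the header with req.add_header();
　　Changing User-Agent works, but the server also counts requests per IP; above a threshold it treats the client as a crawler and demands a CAPTCHA or similar. A human could solve the CAPTCHA, but the script above cannot, so it still gets blocked. Two fixes: (1) delay each submission with the time module, e.g. time.sleep(3), so the crawler looks like a person browsing normally; (2) use proxy IPs (the recommended method)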
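　　The two ways of setting User-Agent described in the notes can be compared offline. The sketch below assumes a placeholder address and a shortened agent string; it also shows why the header name must be spelled exactly, since a misspelled name is stored as a separate, unrelated header.

```python
try:
    from urllib.request import Request   # Python 3
except ImportError:
    from urllib2 import Request          # Python 2

url = 'http://www.example.com/'           # hypothetical address
agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'  # shortened sample value

# Method 1: pass the headers dict when building the request.
req1 = Request(url, None, {'User-Agent': agent})

# Method 2: build the request first, then add the header.
req2 = Request(url)
req2.add_header('User-Agent', agent)

# The library stores header names capitalized, hence 'User-agent'.
print(req1.get_header('User-agent') == req2.get_header('User-agent'))

# A misspelled name such as 'User-Agend' becomes a different header,
# so no User-Agent is set on the request at all.
req3 = Request(url, None, {'User-Agend': agent})
print(req3.get_header('User-agent'))
```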
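　　Note:
　　Three steps for using a proxy IP:
　　1) proxy_support=urllib2.ProxyHandler({'http':'112.111.53.173:8888'}); the argument in the parentheses must be a dictionary, in the form urllib2.ProxyHandler({'type':'proxy_ip:port'});
　　2) Build a customized opener: opener=urllib2.build_opener(proxy_support);
　　3) Install the opener with urllib2.install_opener(opener) so that every later urllib2.urlopen() call uses it, or call opener.open(url) directly for a single request;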
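　　The three steps can be walked through without sending a request, as in this sketch. The proxy address is a placeholder taken from a documentation-only IP range, so it is an assumption and will not work as a real proxy.

```python
try:
    from urllib.request import ProxyHandler, build_opener   # Python 3
except ImportError:
    from urllib2 import ProxyHandler, build_opener          # Python 2

# Placeholder proxy address from a documentation-only range; replace it
# with a working proxy before opening any URL.
proxy_support = ProxyHandler({'http': '192.0.2.10:8888'})   # step 1
opener = build_opener(proxy_support)                         # step 2

# Step 3, choose one:
#   install_opener(opener)  -> later urlopen() calls use the proxy
#   opener.open(url)        -> only this call uses the proxy
has_proxy = any(isinstance(h, ProxyHandler) for h in opener.handlers)
print(has_proxy)
print(proxy_support.proxies)
```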
　　Example 4 (improving Example 3 by changing User-Agent, method 1):
　　# vim translation.py
　　----------------------script start--------------------
　　#!/usr/bin/python2.7
　　#filename:translation.py
　　import urllib
　　import urllib2
　　import json
　　while True:
　　    content=raw_input('please input translate content: ')
　　    if content=='q':
　　        break
　　    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'
　　    # method 1: prepare the headers dict and hand it to Request()
　　    head={}
　　    head['User-Agent']='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
　　    data={}
　　    data['type']='AUTO'
　　    data['i']=content
　　    data['doctype']='json'
　　    data['xmlVersion']='1.8'
　　    data['keyfrom']='fanyi.web'
　　    data['ue']='UTF-8'
　　    data['action']='FY_BY_CLICKBUTTON'
　　    data['typoResult']='true'
　　    data=urllib.urlencode(data)
　　    req=urllib2.Request(url,data,head)
　　    response=urllib2.urlopen(req)
　　    html=response.read()
　　    target=json.loads(html)
　　    print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])
　　------------------------------script end----------------------
　　# python2.7 translation.py
　　please input translate content: ladies
　　Translate the result: 女士们
　　please input translate content: gentleman
　　Translate the result: 绅士
　　please input translate content: q
　　Example 5 (improving Example 3 by changing User-Agent, method 2):
　　# vim translation.py
　　------------------------script start---------------------
　　#!/usr/bin/python2.7
　　#filename:translation.py
　　import urllib
　　import urllib2
　　import json
　　while True:
　　    content=raw_input('please input translate content: ')
　　    if content=='q':
　　        break
　　    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'
　　    #head={}
　　    #head['User-Agent']='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
　　    data={}
　　    data['type']='AUTO'
　　    data['i']=content
　　    data['doctype']='json'
　　    data['xmlVersion']='1.8'
　　    data['keyfrom']='fanyi.web'
　　    data['ue']='UTF-8'
　　    data['action']='FY_BY_CLICKBUTTON'
　　    data['typoResult']='true'
　　    data=urllib.urlencode(data)
　　    # method 2: build the request first, then add the header to it
　　    req=urllib2.Request(url,data)
　　    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')
　　    response=urllib2.urlopen(req)
　　    html=response.read()
　　    target=json.loads(html)
　　    print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])
　　----------------------------script end---------------------------
　　# python2.7 translation.py
　　please input translate content: cat
　　Translate the result: 猫
　　please input translate content: dog
　　Translate the result: 狗
　　please input translate content: q
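　　Example 6 (improving Example 3: when code hits the translation server frequently, keep our IP from being blocked; method 1, delay the submission, so that after each translated entry there is a 3 s pause before the next entry can be translated):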
　　# vim translation.py
　　----------------script start----------------
　　#filename:translation.py
　　import urllib
　　import urllib2
　　import json
　　import time
　　while True:
　　    ……
　　    time.sleep(3)
　　-----------------script end----------------
　　# python2.7 translation.py
　　please input translate content: chinese
　　Translate the result: 中国
　　please input translate content: japanese
　　Translate the result: 日本
　　please input translate content: q
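　　Example 6 sleeps a fixed 3 seconds after every translation. A slightly different variant, sketched here as an assumption rather than as the method used above, sleeps only for whatever part of the interval has not yet elapsed. The class name Throttle and its clock and sleep parameters are invented for this illustration.

```python
import time

class Throttle(object):
    """Make sure at least `delay` seconds pass between two calls to wait()."""

    def __init__(self, delay, clock=time.time, sleep=time.sleep):
        self.delay = delay
        self.clock = clock      # injectable so the class can be tested
        self.sleep = sleep
        self.last = None        # time of the previous request, if any

    def wait(self):
        if self.last is not None:
            remaining = self.delay - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()

throttle = Throttle(3)
# Inside the translation loop, call throttle.wait() just before each request.
```

　　Because the time the user spends typing already counts toward the interval, this variant usually adds no visible delay for a person but still slows down a tight loop.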
　　Example 7 (accessing a web page through a proxy):
　　Preparation (http://www.whatismyip.com.tw/ shows the IP currently in use; http://www.xicidaili.com/ lists proxy IPs)
　　# vim proxy_egg.py
　　---------------------script start--------------------
　　#!/usr/bin/python2.7
　　#filename:proxy_egg.py
　　import urllib2
　　import random
　　url='http://www.whatismyip.com.tw'
　　ip_list=['110.6.35.181:8888','122.193.55.64:81']
　　proxy_support=urllib2.ProxyHandler({'http':random.choice(ip_list)})
　　opener=urllib2.build_opener(proxy_support)
　　#opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')]
　　urllib2.install_opener(opener)
　　response=urllib2.urlopen(url)
　　html=response.read()
　　print html
　　-------------------------script end------------------------
　　# python2.7 proxy_egg.py
　　<html>
　　<head>
　　<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
　　<meta name="description" content="我的IP查詢"/>
　　<meta name="keywords" content="查ip,ip查詢,查我的ip,我的ip位址,我的ip位置,偵測我的ip,查詢我的ip,查看我的ip,顯示我的ip,whatis my IP,whatismyip,my IP address,my IP proxy"/>
　　<title>我的IP位址查詢</title>
　　</head>
　　<body>
　　<h1>IP位址</h1> <h2>122.193.55.64</h2>
　　<script type="text/javascript">
　　var sc_project=6392240;
　　var sc_invisible=1;
　　var sc_security="65d86b9d";
　　var scJsHost = (("https:" == document.location.protocol) ? "https://secure." : "http://www.");
　　document.write("<sc"+"ript type='text/javascript' src='" + scJsHost + "statcounter.com/counter/counter.js'></"+"script>");
　　</script>

　　<noscript><div class="statcounter"><a>
　　</body>
　　</html>
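　　Example 8 (improving Example 3: when the script hits the translation server frequently, keep the server from blocking our IP; method 2, use proxy IPs):
　　Note: free proxy IPs are extremely unstable, so put as many proxy IPs into ip_list as possible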
　　# vim translation.py
　　-----------------------script start-------------------
　　#!/usr/bin/python2.7
　　#filename:translation.py
　　import urllib
　　import urllib2
　　import json
　　import random
　　while True:
　　    content=raw_input('please input translate content: ')
　　    if content=='q':
　　        break
　　    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'
　　    # pick one proxy at random for this request
　　    ip_list=['123.185.109.86:8888','124.235.47.141:8888']
　　    proxy_support=urllib2.ProxyHandler({'http':random.choice(ip_list)})
　　    opener=urllib2.build_opener(proxy_support)
　　    opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')]
　　    urllib2.install_opener(opener)
　　    data={}
　　    data['type']='AUTO'
　　    data['i']=content
　　    data['doctype']='json'
　　    data['xmlVersion']='1.8'
　　    data['keyfrom']='fanyi.web'
　　    data['ue']='UTF-8'
　　    data['action']='FY_BY_CLICKBUTTON'
　　    data['typoResult']='true'
　　    data=urllib.urlencode(data)
　　    req=urllib2.Request(url,data)
　　    response=urllib2.urlopen(req)
　　    html=response.read()
　　    target=json.loads(html)
　　    print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])
　　----------------script end----------------
　　# python2.7 translation.py
　　please input translate content: boy
　　Translate the result: 男孩
　　please input translate content: girl
　　Translate the result: 女孩
　　please input translate content: man
　　Traceback (most recent call last):
　　  File "translation.py", line 32, in <module>
　　    response=urllib2.urlopen(req)
　　  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 127, in urlopen
　　    return _opener.open(url, data, timeout)
　　  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 404, in open
　　    response = self._open(req, data)
　　  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 422, in _open
　　    '_open', req)
　　  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
　　    result = func(*args)
　　  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 1214, in http_open
　　    return self.do_open(httplib.HTTPConnection, req)
　　  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 1184, in do_open
　　    raise URLError(err)
　　urllib2.URLError: <urlopen error Connection refused>
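　　The traceback above is what an unreachable proxy looks like. One way to cope, sketched below under assumptions, is to catch URLError and move on to the next proxy in the list. The function names open_via_proxy and fetch_with_fallback and the 10-second timeout are invented for illustration; only URLError is handled, so other failures would still propagate.

```python
import random
try:
    from urllib.request import ProxyHandler, build_opener   # Python 3
    from urllib.error import URLError
except ImportError:
    from urllib2 import ProxyHandler, build_opener, URLError  # Python 2

def open_via_proxy(url, data, proxy, timeout=10):
    """Fetch url through one proxy and return the response body."""
    opener = build_opener(ProxyHandler({'http': proxy}))
    return opener.open(url, data, timeout).read()

def fetch_with_fallback(url, data, ip_list, fetch=open_via_proxy):
    """Try the proxies in random order until one of them answers."""
    candidates = list(ip_list)
    random.shuffle(candidates)
    last_error = None
    for proxy in candidates:
        try:
            return fetch(url, data, proxy)
        except URLError as err:
            last_error = err    # dead proxy, move on to the next one
    raise URLError('all proxies failed, last error: %s' % last_error)
```

　　In the translation script, html=fetch_with_fallback(url,data,ip_list) would take the place of the install_opener and urlopen calls.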
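　　Example 9 (downloading the images on a given web page; files are saved to the current directory by default, using urllib.urlretrieve() to store each file locally):
　　Limitation of this script: it only downloads images from the specified page and does not follow the site to download its newest images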
　　# vim download_pic.py
　　------------------script start-------------------
　　#!/usr/bin/python2.7
　　#filename:download_pic.py
　　import urllib
　　import urllib2
　　import re
　　url='http://jandan.net/ooxx'
　　def getHtml(url):
　　    req=urllib2.Request(url)
　　    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
　　    response=urllib2.urlopen(req)
　　    html=response.read()
　　    return html
　　def getImg(html):
　　    imglist=re.findall(r'src="(.*?\.jpg)"',html)
　　    #print imglist
　　    x=1
　　    for imgurl in imglist:
　　        urllib.urlretrieve(imgurl,'%s.jpg' % x)
　　        x+=1
　　html=getHtml(url)
　　#print html
　　getImg(html)
　　--------------------script end------------------
　　# python2.7 download_pic.py
　　# ll
　　total 31664
　　-rw-r--r--. 1 root root 174584 Jun 21 23:18 10.jpg
　　-rw-r--r--. 1 root root 153359 Jun 21 23:18 11.jpg
　　-rw-r--r--. 1 root root 125877 Jun 21 23:18 12.jpg
　　-rw-r--r--. 1 root root 152194 Jun 21 23:18 13.jpg
　　-rw-r--r--. 1 root root    91847 Jun 21 23:18 14.jpg
　　-rw-r--r--. 1 root root    78389 Jun 21 23:18 15.jpg
　　-rw-r--r--. 1 root root    68577 Jun 21 23:18 16.jpg
　　-rw-r--r--. 1 root root    99573 Jun 21 23:18 17.jpg
　　-rw-r--r--. 1 root root    32444 Jun 21 23:18 18.jpg
　　-rw-r--r--. 1 root root    79730 Jun 21 23:18 19.jpg
　　-rw-r--r--. 1 root root 144334 Jun 21 23:18 1.jpg
　　……
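　　The image-extraction step can be tried on a string without downloading anything. The sketch below is an assumption-laden variant: the page fragment and example.com hosts are made up, the pattern uses [^"] so a match cannot run across attributes, and urljoin is added to turn relative or protocol-relative src values into full addresses, which the original script does not do.

```python
import re
try:
    from urllib.parse import urljoin    # Python 3
except ImportError:
    from urlparse import urljoin        # Python 2

def find_jpg_urls(html, base_url):
    """Return absolute URLs of the .jpg images referenced by src="..."."""
    found = re.findall(r'src="([^"]+?\.jpg)"', html)
    # urljoin resolves '//host/a.jpg' and '/a.jpg' against the page address.
    return [urljoin(base_url, src) for src in found]

# Hypothetical page fragment, used only for illustration.
sample = ('<img src="//img.example.com/a.jpg" />'
          '<img src="/static/b.jpg" />'
          '<img src="http://img.example.com/c.jpg" />')
urls = find_jpg_urls(sample, 'http://www.example.com/ooxx')
for number, img_url in enumerate(urls, 1):
    print('%d.jpg <- %s' % (number, img_url))
```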

