A Detailed Guide to nginx Anti-Crawler Configuration

Posted by nidr on 2018-11-13 12:10:40

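There are many crawlers on the web. Some help a site get indexed, such as Baidu's spider (Baiduspider). Others ignore robots.txt rules, put load on the server, and bring the site no traffic in return, such as the Yisou spider (YisouSpider).
The following describes how to block these useless user agents from accessing a site.
The key variable is $http_user_agent, which holds the User-Agent header of the request. The rules below match against it to reject unwanted crawlers.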
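Go to the conf/vhost directory under the nginx installation directory and save the code below as deny_agented.conf.
Note: in my setup, all site configuration files are kept in the vhost directory.
    vim deny_agented.conf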
    # Block scraping by tools such as Scrapy and curl (case-insensitive match)
    if ($http_user_agent ~* (Scrapy|Curl|HttpClient))
    {
        return 403;
    }
    # Block the listed UAs and requests with an empty UA (case-sensitive match)
    if ($http_user_agent ~ "FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" )
    {
        return 403;
    }
    # Block request methods other than GET, HEAD, and POST
    if ($request_method !~ ^(GET|HEAD|POST)$)
    {
        return 403;
    }
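To see how the three rules behave before touching a live server, the matching logic can be approximated outside nginx. The following is a minimal Python sketch, not nginx itself: the function name is_blocked is hypothetical, Python's re module stands in for the PCRE engine that nginx uses, and a missing User-Agent header is modeled as an empty string (an assumption about how nginx evaluates an absent $http_user_agent). The ~* operator is case-insensitive and ~ is case-sensitive, which the sketch mirrors with re.IGNORECASE.

```python
import re

# Rule 1: "~*" in nginx is a case-insensitive match.
TOOLS = re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE)

# Rule 2: "~" in nginx is case-sensitive; "^$" matches an empty User-Agent.
BAD_UA = re.compile(
    r"FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot"
    r"|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench"
    r"|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib"
    r"|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot"
    r"|heritrix|EasouSpider|Ezooms|^$"
)

# Rule 3: only these request methods are allowed.
ALLOWED_METHOD = re.compile(r"^(GET|HEAD|POST)$")


def is_blocked(user_agent: str, method: str = "GET") -> bool:
    """Return True if the request would get a 403 under the three rules."""
    if TOOLS.search(user_agent):
        return True
    if BAD_UA.search(user_agent):
        return True
    if not ALLOWED_METHOD.search(method):
        return True
    return False
```

For example, is_blocked("curl/7.58.0") is true because the first rule ignores case, while a lowercase "mj12bot/1.0" slips past the second rule because ~ is case-sensitive. Switching that rule to ~* would close the gap, at the cost of broader matches.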
Then, in the site's configuration, insert the following line right after "location / {":
    include deny_agented.conf;
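After saving, run the following commands to check the configuration and gracefully reload nginx:
    # nginx -t
    nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
    nginx: configuration file /etc/nginx/nginx.conf test is successful

    # nginx -s reload
After the reload, simulate some requests to verify the rules.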
A: Before adding the rule that blocks Scrapy, curl, and similar tools:
http://blog.chinaunix.net/attachment/201702/16/30234663_1487229841PfR5.png
After adding the rule that blocks Scrapy, curl, and similar tools:
http://blog.chinaunix.net/attachment/201702/16/30234663_14872298691kwq.png
B: After adding the rule that blocks the listed UAs and empty UAs:
http://blog.chinaunix.net/attachment/201702/16/30234663_1487229967UuaY.png
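The checks shown in the screenshots can also be scripted. Below is a rough Python sketch using only the standard library. The helper names build_request and probe and the URL http://example.com/ are placeholders, not anything from the original setup. It sends a request with a chosen User-Agent and reports the HTTP status, so a blocked agent should come back as 403.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str, method: str = "GET") -> urllib.request.Request:
    """Build a request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent}, method=method)


def probe(url: str, user_agent: str, method: str = "GET") -> int:
    """Send the request and return the HTTP status code (for example 200 or 403)."""
    req = build_request(url, user_agent, method)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return err.code
```

Against a server that includes deny_agented.conf, probe("http://example.com/", "Scrapy/1.0") would be expected to return 403, while a typical browser User-Agent should still return 200. Passing an empty string as the User-Agent exercises the "^$" part of the second rule.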
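Below is a list of junk UAs commonly seen on the web:
    FeedDemon                 content scraping
    BOT/0.1 (BOT for JCE)     SQL injection
    CrawlDaddy                SQL injection
    Java                      content scraping
    Jullo                     content scraping
    Feedly                    content scraping
    UniversalFeedParser       content scraping
    ApacheBench               CC attack tool
    Swiftbot                  useless crawler
    YandexBot                 useless crawler
    AhrefsBot                 useless crawler
    YisouSpider               useless crawler
    JikeSpider                useless crawler
    MJ12bot                   useless crawler
    ZmEu                      phpMyAdmin vulnerability scanning
    WinHttp                   scraping, CC attacks
    EasouSpider               useless crawler
    HttpClient                TCP attacks
    Microsoft URL Control     scanning
    YYSpider                  useless crawler
    jaunty                    WordPress brute-force scanner
    oBot                      useless crawler
    Python-urllib             content scraping
    Indy Library              scanning
    FlightDeckReports Bot     useless crawler
    Linguee Bot               useless crawler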
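Maintaining a long alternation by hand is error-prone, so the rule can be generated from a plain list of names like the one above. This is a small illustrative Python sketch; the helpers escape_ua and build_rule are hypothetical names, and the sketch assumes no UA name contains a double quote. Names with regex metacharacters, such as "BOT/0.1 (BOT for JCE)", are backslash-escaped so they match literally. The generated text should still be checked with nginx -t before use.

```python
from typing import Iterable

# Characters that have special meaning in a regular expression.
REGEX_SPECIAL = set(".()[]{}?*+|^$\\")


def escape_ua(name: str) -> str:
    """Backslash-escape regex metacharacters so the name matches literally."""
    return "".join("\\" + ch if ch in REGEX_SPECIAL else ch for ch in name)


def build_rule(names: Iterable[str], block_empty: bool = True) -> str:
    """Render an nginx 'if' block that returns 403 for the given UA names."""
    parts = [escape_ua(n) for n in names]
    if block_empty:
        parts.append("^$")  # also reject requests with an empty User-Agent
    pattern = "|".join(parts)
    return 'if ($http_user_agent ~ "%s") {\n    return 403;\n}\n' % pattern
```

Calling build_rule(["YisouSpider", "Indy Library"]) yields an if block whose pattern is "YisouSpider|Indy Library|^$", which can be pasted into deny_agented.conf in place of the hand-written rule.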

