nginx log analysis-WorkNote

sunfull 发表于 2018-11-9 11:57:04

　　1、日志简介
　　nginx日志主要有两种：访问日志和错误日志。访问日志主要记录客户端访问nginx的每一个请求，格式可以自定义；错误日志主要记录客户端访问nginx出错时的日志，格式不支持自定义。两种日志都可以选择性关闭。
　　通过访问日志，你可以得到用户地域来源、跳转来源、使用终端、某个URL访问量等相关信息；通过错误日志，你可以得到系统某个服务或server的性能瓶颈等。因此，将日志好好利用，你可以得到很多有价值的信息。
　　2、访问日志
　　
　　log_formatmain'$remote_addr $remote_user [$time_local] "$request" $http_host '
　　'$status $upstream_status $body_bytes_sent "$http_referer" '
　　'"$http_user_agent" $ssl_protocol $ssl_cipher $upstream_addr '
　　'$request_time $upstream_response_time';
　　变量名称变量描述举例说明
　　$remote_addr 客户端地址例：10.140.15.91
　　$remote_user 客户端用户名称 -
　　$time_local 访问时间和时区例：18/Jul/2016:17:00:01 +0800
　　$request 请求的URI和HTTP协议" 例：GET /pa/img/home/logo-alipay-t.png HTTP/1.1"
　　$http_host 请求地址，即浏览器中你输入的地址(IP或域名) 例：img.xxxxx.com 10.253.70.113
　　$status HTTP请求状态例：200
　　$upstream_status upstream状态例：200
　　$body_bytes_sent 发送给客户端文件内容大小例：547
　　$http_referer 跳转来源例："https://cashier.xxxxx.com.../"
　　$http_user_agent 用户终端代理例："Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; GTB7.0; .NET4.0C;
　　$ssl_protocol SSL协议版本例：TLSv1
　　$ssl_cipher 交换数据中的算法例：RC4-SHA
　　$upstream_addr 后台upstream的地址，即真正提供服务的主机地址例：10.238.35.247:80
　　$request_time 整个请求的总时间例：0.205
　　$upstream_response_time 请求过程中，upstream响应时间例：0.002
　　实例：
　　10.9.137.91 - "GET /images/XX/20160324752729.png HTTP/1.1"img.xxxxx.com 200 200 2038 https://cashier.xxxxx.com/XX/PaymentResult.htm?payNo=XX&outBizNo=2012XX "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0; 360SE)" TLSv1 AES128-SHA 10.228.21.237:80 0.198 0.001
　　线下测试（$http_referer）：
　　10.14.21.107 - - "GET /spanner/watch/v1?--db=ztg-1&--mode=compare&--index=status&--option=&--cluster=whole&-F=2016/8/12-00:00:00&-T=+2880&-i=1&-n=0&_=1344936501292 HTTP/1.1" 200 94193 "http://spanner.xxxxx.net/optionFrame/history.html" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.60 Safari/537.1"
　　备注：$http_referer和重定向有关。
　　3、错误日志
　　错误信息                                        错误说明
　　"upstream prematurely closed connection" 请求uri的时候出现的异常，是由于upstream还未返回应答给用户时用户断掉连接造成的，对系统没有影响，可以忽略
　　"recv() failed (104: Connection reset by peer)"
　　（1）服务器的并发连接数超过了其承载量，服务器会将其中一些连接Down掉；
　　（2）客户关掉了浏览器，而服务器还在给客户端发送数据；
　　（3）浏览器端按了Stop
　　"(111: Connection refused) while connecting to upstream" 用户在连接时，若遇到后端upstream挂掉或者不通，会收到该错误
　　"(111: Connection refused) while reading response header from upstream" 用户在连接成功后读取数据时，若遇到后端upstream挂掉或者不通，会收到该错误
　　"(111: Connection refused) while sending request to upstream" Nginx和upstream连接成功后发送数据时，若遇到后端upstream挂掉或者不通，会收到该错误
　　"(110: Connection timed out) while connecting to upstream" nginx连接后面的upstream时超时
　　"(110: Connection timed out) while reading upstream" nginx读取来自upstream的响应时超时
　　"(110: Connection timed out) while reading response header from upstream" nginx读取来自upstream的响应头时超时
　　"(110: Connection timed out) while reading upstream" nginx读取来自upstream的响应时超时
　　"(104: Connection reset by peer) while connecting to upstream" upstream发送了RST，将连接重置
　　"upstream sent invalid header while reading response header from upstream" upstream发送的响应头无效
　　"upstream sent no valid HTTP/1.0 header while reading response header from upstream" upstream发送的响应头无效
　　"client intended to send too large body" 用于设置允许接受的客户端请求内容的最大值，默认值是1M，client发送的body超过了设置值
　　"reopening logs" 用户发送kill-USR1命令
　　"gracefully shutting down", 用户发送kill-WINCH命令
　　"no servers are inside upstream" upstream下未配置server
　　"no live upstreams while connecting to upstream" upstream下的server全都挂了
　　"SSL_do_handshake() failed" SSL握手失败
　　"SSL_write() failed (SSL:) while sending to client"
　　"(13: Permission denied) while reading upstream"
　　"(98: Address already in use) while connecting to upstream"
　　"(99: Cannot assign requested address) while connecting to upstream"
　　"ngx_slab_alloc() failed: no memory in SSL session shared cache"             ssl_session_cache大小不够等原因造成
　　"could not add new SSL session to the session cache while SSL handshaking"    ssl_session_cache大小不够等原因造成
　　"send() failed (111: Connection refused)"
　　例： connect() failed (111: Connection refused) while connecting to upstream. 后端的upstream中指定的是fast-cgi的php服务器，所以原因有可能是php-fpm.conf的进程数量过少导致。解决方法：php-fpm.conf配置文件pm.max_children修改大一点,重启php-fpm
　　除了上述方法，还需要注意os层面的资源限制，总体我的如下：
　　#########
　　ulimit -a

　　core file>
　　data seg>　　scheduling priority             (-e) 0

　　file>　　pending signals             (-i) 32063
　　max locked memory    (kbytes, -l) 64

　　max memory>　　open files                   (-n) 65536

　　pipe>　　POSIX message queues (bytes, -q) 819200
　　real-time priority          (-r) 0

　　stack>　　cpu time             (seconds, -t) unlimited
　　max user processes          (-u) 32063
　　virtual memory       (kbytes, -v) unlimited
　　file locks                   (-x) unlimited
　　#########
　　例： connect() failed (110: Connection refused) while connecting to upstream. 后端的upstream中指定的是tomcat server ，所以原因有可能是tomcat 的连接数过少，tomcat无法处理高并发导致。解决办法：建议修改tomcat 配置文件中子进程的最大个数、并发连接数、内存大小等。另外，还建议调整nginx 到后端upstream的timeout 时长，如下所示：
　　#############
　　worker_processes 4;
　　pid /run/nginx.pid;
　　worker_rlimit_nofile 16384;
　　events {
　　worker_connections 10000;
　　# multi_accept on;
　　}
　　http {
　　proxy_connect_timeout 300s;
　　proxy_send_timeout 600s;
　　proxy_read_timeout 600s;
　　send_timeout 600s;
　　#############
　　例：一台nginx是用来做转发数据的,也就是反向代理。当连接数达到8000以上时，就会出现NGINX 502 Bad Gateway错误。nginx日志全部都是一些110错误代码：*7300911 connect() failed (110: Connection timed out) while connecting to upstream。(nginx的并发连接数五六万都没问题的吧，怎么会出现这种现象呢)
　　分析：Nginx 502错误的原因比较多，大部分是因为在反向代理模式下后端服务器出现问题引起的。这些错误一般都不是nginx本身的问题，所以一定要从后端找原因！但nginx抛出的这些出错信息非常简单，着实让nginx的用户备受置疑，毕竟从字眼上理解，bad gateway？不就是bad nginx吗？让不了解的人看到，会直接把责任推在nginx身上。502错误最通常的出现情况就是后端主机当机，当然，后端主机并发数太高也可能引起偶尔的502，后端主机优化配置不得当，也可能造成502。如果一直502，得检查后端主机地址或端口是否配置错误。
　　例：nginx 配置如下:
　　upstream tomcat{
　　server tom1:8901 max_fails=1 fail_timeout=60s;
　　server tom2:8901 max_fails=1 fail_timeout=60s;
　　}
　　日志里面每隔一两分钟就会记录一条类似 *379803415 no live upstreams while connecting to upstream的日志，此外，还有大量的“upstream prematurely closed connection while reading response header from upstream”的日志。
　　分析，nginx在连接upstream的其中一个server时，maxfail为1，表示有任何一次失败，nginx就认为该机器失败，然后就去尝试连下一个；如果所有的都失败了，在底层就会进行quick recovery把每个peer的失败次数都重置为0，然后再返回一个NGX_BUSY，然后nginx就会打印一条no live upstreams ,最后又回到原始状态，接着进行转发了。这就解释了no live upstreams之后还能正常访问。
　　重新看配置文件，如果其中一台有一次失败，nginx就会认为它已经死掉，然后就会把以后的流量全都打到另一台上面，当另外一台也有一次失败的时候，就认为两个都死掉了，然后quick recovery，然后打印一条日志。这样带来的另一个问题是，如果几台nginx同时认定一台后端已经死掉的时候，会造成流量的不均衡。
　　初步的解决方法：把max_fails从1改成5，效果很明显，“no live upstreams”出现的概率变少了很多，但却没有完全消失。
　　至于日志中的“upstream prematurely closed connection while reading response header from upstream” 抓包分析后发现是因为nginx 给后端服务器发送一个请求后，后端服务器直接回复，然后nginx将502的结果转发给请求者。这种情况很可能是由于后端应用引起

页: [1]

运维网's Archiver

nginx log analysis-WorkNote