Web Server Log Feature Extraction

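On a typical web server, Apache logs live in /var/log/apache2/, usually as access.log and ssl_access.log (for HTTPS), or as gzip-compressed rotated files such as access-20200101.gz or ssl_access-20200101.gz.
What an HTTP server records in its logs is configurable. Reference (Jianshu): web访问日志分析 (Web Access Log Analysis).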
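
Since the live log and its rotated copies differ only in compression, they can be read through one code path. The following is a minimal sketch using only the standard library; the helper name iter_log_lines is hypothetical, and it assumes rotated files can be recognized by a .gz suffix.

```python
import gzip
from pathlib import Path


def iter_log_lines(paths):
    """Yield log lines from plain-text and gzip-compressed files alike."""
    for path in map(Path, paths):
        # Assumption: rotated logs end in .gz; anything else is plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")


# Example call (paths depend on your server layout):
# for line in iter_log_lines(sorted(Path("/var/log/apache2").glob("access*"))):
#     print(line)
```

Using errors="replace" keeps a single malformed byte in a request line from aborting the whole pass over the logs.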

1) How Apache is configured to generate logs

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common

2) How nginx is configured to generate logs

log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for"';

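Server logs commonly use the Common Log Format (CLF) or the Extended Log Format (ELF).
1) Common log format

127.0.0.1 - - [14/May/2017:12:45:29 +0800] "GET /index.html HTTP/1.1" 200 4286
Fields: remote host IP, request time, time zone, method, resource, protocol, status code, bytes sent. The two hyphens after the IP are the ident (%l) and authenticated user (%u) fields, both empty here.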
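
To make the field layout concrete, here is a minimal sketch that splits the sample line above into named fields with a regular expression. The pattern and the name parse_clf are illustrative assumptions, not part of Apache or of any library.

```python
import re

# One named group per field of the common log format.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


sample = '127.0.0.1 - - [14/May/2017:12:45:29 +0800] "GET /index.html HTTP/1.1" 200 4286'
fields = parse_clf(sample)
print(fields["host"], fields["method"], fields["resource"], fields["status"])
```

All values come back as strings; lines with a malformed request (for example an empty "-" request) return None and need separate handling.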

2) Combined log format

127.0.0.1 - - [14/May/2017:12:51:13 +0800] "GET /index.html HTTP/1.1" 200 4286 "http://127.0.0.1/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
Fields: remote host IP, request time, time zone, method, resource, protocol, status code, bytes sent, referer string, browser information (User-Agent)
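
For analysis it helps to turn the time, status and size fields into typed values rather than strings. The sketch below does this for a combined-format line using only the standard library; parse_combined is a hypothetical helper, it assumes an English locale for the month abbreviation, and it does not handle escaped quotes inside the referer or User-Agent.

```python
import re
from datetime import datetime

COMBINED_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_combined(line):
    """Parse one combined-format line into a dict with typed values."""
    match = COMBINED_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # "14/May/2017:12:51:13 +0800" -> timezone-aware datetime
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Apache writes "-" when no body bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = ('127.0.0.1 - - [14/May/2017:12:51:13 +0800] "GET /index.html HTTP/1.1" '
          '200 4286 "http://127.0.0.1/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
          'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"')
entry = parse_combined(sample)
print(entry["time"].isoformat(), entry["status"], entry["referer"])
```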

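One key piece of information in these entries is the HTTP status code.
2XX

200: OK (request succeeded)
201: Created
202: Accepted
204: No Content

3XX

301: Moved Permanently
302: Found (temporary redirect)
303: See Other (HTTP/1.1; temporary redirect, followed with GET)
307: Temporary Redirect (HTTP/1.1; the method is preserved, so a POST stays a POST)

4XX

400: Bad Request
401: Unauthorized (authentication required)
403: Forbidden
404: Not Found
405: Method Not Allowed

5XX

500: Internal Server Error
503: Service Unavailable
504: Gateway Timeout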
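
Because the first digit of a status code carries its class, a quick health summary only needs integer division. A minimal sketch, with hypothetical helper names and made-up sample codes:

```python
from collections import Counter


def status_class(code):
    """Map a status code such as 404 to its class label, e.g. '4XX'."""
    return f"{int(code) // 100}XX"


def summarize_statuses(codes):
    """Count how many responses fall into each status class."""
    return Counter(status_class(code) for code in codes)


# Sample codes for illustration only; strings are accepted as well as ints.
summary = summarize_statuses([200, 200, 301, 404, 403, 500, "503"])
print(summary)
```

A sudden rise in the 4XX or 5XX share is often the first sign worth investigating in an access log.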

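During incident response, a few small Python tools can make the work far more efficient. Good web log analysis tools are very useful for troubleshooting and performance analysis of web systems: they can locate anomalies and analyze performance within a chosen time window at fine granularity (down to one minute, meaning the logs within each minute are abstracted and summarized). One such tool is Lars, a web server log toolkit written in Python. With it you can parse logs with a few lines of Python, either retrospectively or in real time, and do whatever you want with the data: store it in a database, save it as a CSV file, or analyze it further in Python right away.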
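
The minute-level summarization mentioned above can be illustrated without any particular tool. This is a minimal standard-library sketch, not how Lars or any specific analyzer does it; aggregate_by_minute and the sample entries are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_by_minute(entries):
    """Group (timestamp, status) pairs into per-minute request and error counts."""
    buckets = defaultdict(lambda: {"requests": 0, "errors": 0})
    for timestamp, status in entries:
        # Truncate to the start of the minute to get the bucket key.
        minute = timestamp.replace(second=0, microsecond=0)
        buckets[minute]["requests"] += 1
        if int(status) >= 500:  # count server-side errors only
            buckets[minute]["errors"] += 1
    return dict(buckets)


# Made-up entries for illustration.
entries = [
    (datetime(2017, 5, 14, 12, 45, 29), 200),
    (datetime(2017, 5, 14, 12, 45, 50), 500),
    (datetime(2017, 5, 14, 12, 46, 1), 404),
]
per_minute = aggregate_by_minute(entries)
for minute, counts in sorted(per_minute.items()):
    print(minute.strftime("%H:%M"), counts)
```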

Below is a first extraction example (not very polished; the author is still a beginner).

from lars import apache
import csv


class web_log(object):
    def __init__(self, fp):
        # Path of the access log to parse (previously hard-coded to 'net1.log')
        self.path = fp

    def get_log_feature(self, num=None):
        """Yield one feature list per log entry; stop after num entries if num is set."""
        with open(self.path) as log_file, \
                apache.ApacheSource(log_file, log_format=apache.COMBINED) as source:
            for index, row in enumerate(source, start=1):
                client_IP = str(row.remote_host)
                bytes_size = str(row.size)
                referrer = str(row.req_Referer)
                client_UA = str(row.req_User_agent)
                request_Type = row.request.method
                # row.time is a lars DateTime, a datetime subclass
                Timestamp = str(row.time.timestamp())
                HTTP_protocol_version = str(row.request.protocol)
                url = str(row.request.url)
                http_status_code = str(row.status)

                yield [client_IP, bytes_size, referrer, request_Type, Timestamp,
                       HTTP_protocol_version, url, http_status_code, client_UA]

                # With num=100, only the first 100 entries are produced
                if num and index >= num:
                    return


if __name__ == '__main__':
    log_file = "net1.log"
    web_logs = web_log(log_file)
    log_features = web_logs.get_log_feature(num=100)
    # Save the extracted features; newline='' lets the csv module control line endings
    with open("net_log_features.csv", 'w', newline='') as f:
        writer = csv.writer(f)
        # Header row, in the same order as the yielded feature lists
        writer.writerow(["client_IP", "bytes_size", "referrer", "request_Type", "timestamp",
                         "HTTP_protocol_version", "url", "http_status_code", "client_UA"])
        for log_feature in log_features:
            print(log_feature)
            writer.writerow(log_feature)

The extracted CSV file looks roughly like this:

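Problems encountered along the way:
During extraction I was not familiar with the data types defined by the lars package. See
https://lars.readthedocs.io/en/latest/lars.datatypes.html
which also covers other ways of using the types that lars defines.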
