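CVE Crawler

Published: January 6, 2019

Crawling the National Vulnerability Database with Python 3

Approach

First pick the site to crawl and look for a pattern in its URLs.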

When the URL is:

https://nvd.nist.gov/vuln/full-listing/

the page lists links for every time period. Clicking the link for August 2018 changes the URL to:

https://nvd.nist.gov/vuln/full-listing/2018/8

The pattern is easy to spot:

https://nvd.nist.gov/vuln/full-listing/<year>/<month>
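The URL pattern above can be captured in a small helper. This is only a sketch: the names `BASE_LISTING_URL` and `build_listing_url` are made up for illustration and are not taken from the original program.

```python
# Base address of the NVD full listing, as observed in the browser.
BASE_LISTING_URL = "https://nvd.nist.gov/vuln/full-listing/"


def build_listing_url(year: int, month: int) -> str:
    """Return the listing URL for one month, following the year/month pattern."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # The site uses the month without zero padding, e.g. .../2018/8
    return f"{BASE_LISTING_URL}{year}/{month}"
```

For example, `build_listing_url(2018, 8)` gives the August 2018 address shown above.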

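With the listing URL constructed, the next step is extracting data from the page. XPath can pick out every vulnerability name and hyperlink shown on the page precisely, but the hyperlinks it returns are incomplete:

/vuln/detail/CVE-2018-0413

Clicking one shows that the real address is the site's domain followed by the path XPath extracted:

https://nvd.nist.gov/vuln/detail/CVE-2018-0413

This suggests the next step: turn every hyperlink in a given month's vulnerability list into a usable full URL and store them in a list to iterate over later.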
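One way to do the joining step is with `urllib.parse.urljoin` from the standard library. The sketch below assumes the relative paths have already been extracted (for instance by an XPath query on the listing page); `NVD_ROOT` and `to_absolute_urls` are illustrative names, not the original code.

```python
from urllib.parse import urljoin

# Scheme and domain of the site; relative links are resolved against this.
NVD_ROOT = "https://nvd.nist.gov"


def to_absolute_urls(hrefs):
    """Turn relative links such as /vuln/detail/CVE-... into full URLs."""
    return [urljoin(NVD_ROOT, href.strip()) for href in hrefs]
```

The returned list can then be walked one URL at a time when fetching the detail pages.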

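Crawling and Saving the Data

Open each vulnerability page, extract its information with XPath, and store it as ordered key-value pairs in a dictionary for that vulnerability. The Impact section is more complex, so two additional dictionaries hold the CVSS 3.0 and CVSS 2.0 impact details separately.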
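A possible layout for one vulnerability record is sketched below. The key names (`id`, `url`, `description`, `impact_v3`, `impact_v2`) are assumptions chosen for illustration; the post does not list the fields the original program stores.

```python
def new_cve_record(cve_id, url):
    """Create an empty record for one vulnerability (hypothetical field names)."""
    return {
        "id": cve_id,
        "url": url,
        "description": "",
        "impact_v3": {},  # details from the 3.0 part of the Impact section
        "impact_v2": {},  # details from the 2.0 part of the Impact section
    }
```

Since Python 3.7 a plain `dict` keeps insertion order, so the fields come back in the order they were stored.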

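Pages differ in how many fields they contain, and some fields have a heading but no content, so checks are needed in certain places to keep the program robust.

In addition, much of the content carries formatting, so strip and join are used to remove whitespace and piece the fragments together, along with string slicing.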
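The cleanup described above might look roughly like this. It is a sketch that assumes the text of a field arrives as a list of string fragments (which is what an XPath `text()` query typically yields); `clean_text` is an illustrative name.

```python
def clean_text(fragments, default=""):
    """Strip each fragment, drop empty ones, and join the rest with spaces.

    Returns `default` when the field has a heading but no content.
    """
    parts = [fragment.strip() for fragment in fragments if fragment and fragment.strip()]
    return " ".join(parts) if parts else default
```

Returning a default for empty input is one way to handle the pages where a field is present but blank.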

To make the crawled data easier to read, a translation feature was added.

Added Features

Using random request headers to reduce refused requests

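Crawling kept running into errors such as "the server did not respond for a long time" or "the server forcibly closed the connection". I suspected an anti-crawler mechanism on the server, so I used random request headers as a simple countermeasure:

'User-Agent': choice(UAlist)

Picking a User-Agent at random from the list makes the requests look as if they come from different clients, which lowers the chance that the request count reaches the server's threshold.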
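A minimal sketch of the random header selection follows. The `UAlist` name comes from the snippet above, but its contents here are placeholder browser strings chosen for illustration, and `random_headers` is a hypothetical helper.

```python
from random import choice

# Example User-Agent strings; the original list is not shown in the post.
UAlist = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.0.2 Safari/605.1.15",
]


def random_headers():
    """Build request headers with a User-Agent picked at random for each call."""
    return {"User-Agent": choice(UAlist)}
```

The returned dictionary can be passed as the headers of each request, so consecutive requests do not all carry the same User-Agent.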

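Choosing a translation service

I put one passage from the vulnerability site into Baidu Translate, Youdao Translate, and Google Translate to compare their accuracy. The results were: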

Original text:

A vulnerability in the web-based management interface of Cisco Identity Services Engine (ISE) could allow an unauthenticated, remote attacker to conduct a cross-site request forgery (CSRF) attack and perform arbitrary actions on an affected device. The vulnerability is due to insufficient CSRF protections for the web-based management interface of an affected device. An attacker could exploit this vulnerability by persuading a user of the interface to follow a crafted link. A successful exploit could allow the attacker to perform arbitrary actions on a targeted device via a web browser and with the privileges of the user. Cisco Bug IDs: CSCvi85159.

Baidu Translate:

Cisco Identity Services Engine(ISE)的基于网络的管理接口中的漏洞可能允许未经身份验证的远程攻击者进行跨站点请求伪造(CSRF)攻击并在受影响的设备上执行任意操作。该漏洞是由于受影响设备的基于网络的管理接口没有足够的CSRF保护造成的。攻击者可以通过说服接口的用户遵循精心编制的链接来利用此漏洞。成功利用此漏洞可使攻击者通过Web浏览器并以用户的权限在目标设备上执行任意操作。CiscoBugID：CSCvi85159。

Youdao Translate:

Cisco Identity Services Engine (ISE)基于web的管理界面中的一个漏洞可能允许未经身份验证的远程攻击者进行跨站点请求伪造(cross-site request forgery, CSRF)攻击，并在受影响的设备上执行任意操作。该漏洞是由于受影响设备的基于web的管理界面的CSRF保护不足造成的。攻击者可以通过说服界面用户遵循精心设计的链接来利用这个漏洞。一个成功的攻击可以允许攻击者通过web浏览器和用户的特权在目标设备上执行任意操作。Cisco Bug id: CSCvi85159。

Google Translate:

思科身份服务引擎（ISE）基于Web的管理界面中的漏洞可能允许未经身份验证的远程攻击者进行跨站点请求伪造（CSRF）攻击并在受影响的设备上执行任意操作。 该漏洞是由于受影响设备的基于Web的管理界面的CSRF保护不足。 攻击者可以通过说服界面的用户遵循精心设计的链接来利用此漏洞。 成功利用可以允许攻击者通过Web浏览器并使用用户的权限在目标设备上执行任意操作。 思科Bug ID：CSCvi85159。

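Of the three results, Google Translate stayed closest to the original meaning, so I chose Google Translate.

Finding and using the Google Translate endpoint

The page URL turned out to be: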

https://translate.google.cn/#view=home&op=translate&sl=en&tl=zh-CN&text=A%20vulnerability%20in%20the%20web-based%20management%20interface%20of%20Cisco%20Identity%20Services%20Engine%20(ISE)%20could%20allow%20an%20unauthenticated%2C%20remote%20attacker%20to%20conduct%20a%20cross-site%20request%20forgery%20(CSRF)%20attack%20and%20perform%20arbitrary%20actions%20on%20an%20affected%20device.%20The%20vulnerability%20is%20due%20to%20insufficient%20CSRF%20protections%20for%20the%20web-based%20management%20interface%20of%20an%20affected%20device.%20An%20attacker%20could%20exploit%20this%20vulnerability%20by%20persuading%20a%20user%20of%20the%20interface%20to%20follow%20a%20crafted%20link.%20A%20successful%20exploit%20could%20allow%20the%20attacker%20to%20perform%20arbitrary%20actions%20on%20a%20targeted%20device%20via%20a%20web%20browser%20and%20with%20the%20privileges%20of%20the%20user.%20Cisco%20Bug%20IDs%3A%20CSCvi85159.

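The URL clearly contains the sentence to be translated, and I had an XPath rule ready to select the translated text, but the program always returned an empty result. Checking the page source confirmed that the translation really is not in the HTML, so the page must fill it in some other way after loading.

Watching the traffic between the browser and the server showed that every translation sends a request to the address below, and that the response carries the translation, its status, and other information as nested lists: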

https://translate.google.cn/translate_a/single?client=webapp&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&otf=1&ssel=0&tsel=0&kc=2&tk=831447.664573&q=A%20vulnerability%20in%20the%20web-based%20management%20interface%20of%20Cisco%20Identity%20Services%20Engine%20(ISE)%20could%20allow%20an%20unauthenticated%2C%20remote%20attacker%20to%20conduct%20a%20cross-site%20request%20forgery%20(CSRF)%20attack%20and%20perform%20arbitrary%20actions%20on%20an%20affected%20device.%20The%20vulnerability%20is%20due%20to%20insufficient%20CSRF%20protections%20for%20the%20web-based%20management%20interface%20of%20an%20affected%20device.%20An%20attacker%20could%20exploit%20this%20vulnerability%20by%20persuading%20a%20user%20of%20the%20interface%20to%20follow%20a%20crafted%20link.%20A%20successful%20exploit%20could%20allow%20the%20attacker%20to%20perform%20arbitrary%20actions%20on%20a%20targeted%20device%20via%20a%20web%20browser%20and%20with%20the%20privileges%20of%20the%20user.%20Cisco%20Bug%20IDs%3A%20CSCvi85159.

The browser sends this as a GET request carrying the following parameters:

from urllib.parse import quote

# Parameter names follow the browser request: sl, tl, hl end in the letter "l"
baseUrl='https://translate.google.cn/translate_a/single'
baseUrl+='?client=t&'
baseUrl+='sl=auto&'
baseUrl+='tl=zh-CN&'
baseUrl+='hl=zh-CN&'
baseUrl+='dt=at&'
baseUrl+='dt=bd&'
baseUrl+='dt=ex&'
baseUrl+='dt=ld&'
baseUrl+='dt=md&'
baseUrl+='dt=qca&'
baseUrl+='dt=rw&'
baseUrl+='dt=rm&'
baseUrl+='dt=ss&'
baseUrl+='dt=t&'
baseUrl+='ie=UTF-8&'
baseUrl+='oe=UTF-8&'
baseUrl+='otf=1&'
baseUrl+='pc=1&'
baseUrl+='ssel=0&'
baseUrl+='tsel=0&'
baseUrl+='kc=2&'
baseUrl+='tk=831447.664573&'
# Percent-encode the text so spaces and punctuation are valid in a URL
baseUrl+='q='+quote(text)

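After building the URL in code and requesting it, I found that it could only translate that one sentence; any other text returned a 403 page. Changing the tk parameter made even that sentence fail, so tk appears to be the key parameter that authorizes the request.

A Baidu search showed that someone had already worked out how this token is generated. Using that algorithm did return correct results, so I wrapped it in a function and called it wherever translation was needed.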
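The hand-concatenated query string above can also be built with `urllib.parse.urlencode`, which handles the repeated `dt` keys and the percent-encoding in one place. This is only a sketch: `build_translate_url` and its arguments are illustrative names, the parameter names (`sl`, `tl`, `hl`) are the ones seen in the browser request, and the `tk` value must come from whatever token-generation routine is used, which is not reproduced here. The endpoint is undocumented and may behave differently today.

```python
from urllib.parse import quote, urlencode

TRANSLATE_ENDPOINT = "https://translate.google.cn/translate_a/single"

# The dt parameter is repeated once per value, as in the browser request.
DT_VALUES = ["at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t"]


def build_translate_url(text, tk, source="auto", target="zh-CN"):
    """Assemble the GET URL for one translation request.

    `tk` is the per-text token; computing it is outside this sketch.
    """
    params = [("client", "t"), ("sl", source), ("tl", target), ("hl", target)]
    params += [("dt", value) for value in DT_VALUES]
    params += [
        ("ie", "UTF-8"), ("oe", "UTF-8"), ("otf", "1"), ("pc", "1"),
        ("ssel", "0"), ("tsel", "0"), ("kc", "2"),
        ("tk", tk), ("q", text),
    ]
    # quote_via=quote encodes spaces as %20, matching the URLs shown earlier.
    return TRANSLATE_ENDPOINT + "?" + urlencode(params, quote_via=quote)
```

A list of pairs is used instead of a dictionary because a dictionary cannot hold the same key (`dt`) more than once.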

Making the code more flexible

I added input prompts so that users can choose the parameters to suit their needs:

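def main():
    # Ask for the year until a value in range is entered
    while True:
        year = int(input("Which year's CVEs do you want to crawl? (1988-2018) "))
        if 1988 <= year <= 2018:
            break
        else:
            print("Please try again")
    # Ask for the month until a value in range is entered
    while True:
        month = int(input("Which month's CVEs do you want to crawl? (1-12) "))
        if 1 <= month <= 12:
            break
        else:
            print("Please try again")
    # Ask whether to save the results or only display them
    while True:
        mod = int(input("Save the data locally?\n1 Save\n2 Do not save\n"))
        if mod == 1 or mod == 2:
            break
        else:
            print("Please try again")
    CVEs = fun.getCVEs(year,month)
    details = fun.getDetails(CVEs)
    if mod == 1:
        write(details)
    else:
        show(details)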
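The three prompt loops share the same shape, and `int(input(...))` raises `ValueError` when the user types something that is not a number. One possible refactoring, not part of the original program, is a helper like the hypothetical `ask_int` below; the `read` parameter is an assumption added so the function can be exercised without a keyboard.

```python
def ask_int(prompt, low, high, read=input):
    """Keep asking until the user enters an integer between low and high."""
    while True:
        raw = read(prompt)
        try:
            value = int(raw)
        except ValueError:
            # Non-numeric input: ask again instead of crashing
            print("Please enter a number")
            continue
        if low <= value <= high:
            return value
        print("Please try again")
```

With it, the year prompt could be written as `year = ask_int("Which year's CVEs do you want to crawl? (1988-2018) ", 1988, 2018)`, and the month and save-mode prompts follow the same pattern.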