Python网络爬虫技术与实战

3.5.3 CSS Selectors

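Beautiful Soup also provides another kind of selector: the CSS selector. If you are familiar with web development, CSS selectors will be nothing new. When writing CSS, a tag name is written without any decoration, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements. The method to call is soup.select(), which returns a list (recent Beautiful Soup releases return a ResultSet, which is a subclass of list).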
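The examples in this section rely on a soup object built earlier in the chapter. As a minimal sketch for running them in isolation, the block below rebuilds one. The html_doc string is an assumption: it is reconstructed from the outputs shown in this section and may differ from the book's exact listing. The built-in "html.parser" is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs in this section.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

titles = soup.select("title")      # always a list-like result, even for one match
anchors = soup.select("a")
no_match = soup.select("table")    # no match: an empty result, not None

print(isinstance(titles, list))
print(len(anchors))
print(len(no_match))
```

Because select() always returns a list-like object, indexing with [0] on a selector that matches nothing raises IndexError, so checking the length first is a reasonable habit.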

1. Finding by tag name

Find the elements whose tag is title:

print(soup.select('title'))

The output is:

[<title>The Dormouse's story</title>]

Find the elements whose tag is a:

print(soup.select('a'))

The output is:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2. Finding by class name

Write the class name with a leading dot, '.classname', as shown below:

print(soup.select('.sister'))

The output is:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3. Finding by id

Write the id with a leading hash, '#id', as shown below:

print(soup.select('#link1'))

The output is:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
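Since an id normally identifies a single element, it can be convenient to get that element directly instead of a one-item list. The sketch below uses select_one(), which Beautiful Soup provides alongside select(). The HTML snippet is a shortened stand-in for the sample document, and the id link9 is a hypothetical id chosen so that nothing matches.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption for illustration).
html = '<p><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.select_one("#link1")      # a single Tag rather than a list
missing = soup.select_one("#link9")   # hypothetical id; None when nothing matches

print(link["href"])
print(missing is None)
```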

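4. Combined lookups

Combined lookups follow the same rules as combining tag names, class names, and ids when writing a CSS file. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))

The output is: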

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Finding a direct child tag:

print(soup.select("head > title"))

The output is:

[<title>The Dormouse's story</title>]
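The difference between the space (any descendant) and > (direct child only) is easy to miss, so here is a small sketch that contrasts them. The HTML snippet is a shortened stand-in for the sample document, not the book's listing.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption for illustration).
html = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a></p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("body a")     # a at any depth inside body
children = soup.select("body > a")      # a directly under body only
direct = soup.select("p > a")           # a directly under p

print(len(descendants), len(children), len(direct))  # 1 0 1
```

Here the a tag sits inside p, so 'body > a' finds nothing while 'body a' and 'p > a' both find it.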

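5. Finding by attribute

A lookup can also include attributes, which are written inside square brackets. The attribute and the tag belong to the same node, so there must be no space between them; otherwise nothing will match.

Find the a tags whose class attribute is sister:

print(soup.select('a[class="sister"]'))

The output is:
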

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find the a tag whose href attribute is http://example.com/lacie:

print(soup.select('a[href="http://example.com/lacie"]'))

The output is:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
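The attribute lookups above match an exact value. Standard CSS also defines presence, prefix, suffix, and substring forms, and the sketch below shows how they might be used with select(). The HTML is a shortened stand-in for the sample document, with the link text simplified.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption for illustration).
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

with_href = soup.select("a[href]")                       # attribute is present
prefix = soup.select('a[href^="http://example.com/"]')   # value starts with
suffix = soup.select('a[href$="tillie"]')                # value ends with
contains = soup.select('a[href*="lacie"]')               # value contains

print(len(with_href), len(prefix))       # 3 3
print([a["id"] for a in suffix])         # ['link3']
print([a["id"] for a in contains])       # ['link2']
```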

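Attribute selectors can also be combined with the forms above. Parts that refer to different nodes are separated by a space, while parts that refer to the same node are written without one:

print(soup.select('p a[href="http://example.com/elsie"]'))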
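To make the spacing rule concrete, the following sketch runs the same attribute selector with and without a space after the tag name. The HTML is a shortened stand-in for the sample document.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption for illustration).
html = '<p class="story"><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# No space: the tag and the attribute describe the same node.
same_node = soup.select('a[href="http://example.com/elsie"]')

# With a space: this asks for a descendant of <a> carrying that href.
with_space = soup.select('a [href="http://example.com/elsie"]')

# A space between p and a is correct, because they are different nodes.
nested = soup.select('p a[href="http://example.com/elsie"]')

print(len(same_node), len(with_space), len(nested))  # 1 0 1
```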

All of the select() calls above return a list, so the results can be iterated over and their text retrieved with the get_text() method:

print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for p in soup.select('p'):
    print(p.get_text())

The output is:

<class 'list'>
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
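In a crawler, the text is usually collected together with an attribute such as href. The sketch below pairs each link's text with its URL. The HTML is a shortened stand-in for the sample document, and the title attribute looked up at the end is a hypothetical one that the tags do not carry.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption for illustration).
html = """
<p class="story">Their names were
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;</p>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims surrounding whitespace; get() reads an attribute.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.select("a.sister")]
print(links)
# [('Lacie', 'http://example.com/lacie'), ('Tillie', 'http://example.com/tillie')]

# get() returns None for an attribute the tag does not have (hypothetical 'title').
no_title = soup.select("a.sister")[0].get("title")
print(no_title is None)
```

Using get() instead of square-bracket access avoids a KeyError when an attribute is absent, which tends to matter on real pages where markup is inconsistent.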