Python网络爬虫技术与实战

3.5.1 Installing and Using the Beautiful Soup Library

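1. Installing Beautiful Soup with pip

Installing the Beautiful Soup library with pip is the recommended approach. The package is published on PyPI under the name beautifulsoup4, so the command is:

pip3 install beautifulsoup4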

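Then start the Python interpreter from the command prompt and import Beautiful Soup to test it:

$ python3
>>> import bs4

If no error is reported, Beautiful Soup has been installed successfully.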
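The same check can be scripted instead of typed into the interpreter. The following is a minimal sketch that uses only the standard library (Python 3.8 or later); the helper name installed_version is hypothetical and chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper, not part of Beautiful Soup.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The PyPI distribution is named beautifulsoup4; the import name is bs4.
    found = installed_version("beautifulsoup4")
    if found is None:
        print("beautifulsoup4 is not installed")
    else:
        print(f"beautifulsoup4 {found} is installed")
```

Querying the distribution metadata rather than importing bs4 keeps the script from raising an exception on machines where the package is missing.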

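2. Installing Beautiful Soup from a wheel

If pip cannot install the Beautiful Soup library for some reason, you can download the corresponding wheel file directly from https://pypi.python.org/pypi/beautifulsoup4. Choose the wheel that matches your Python version and operating system; for Python 3.7, for example, download beautifulsoup4-4.7.1-py3-none-any.whl (the procedure is similar on macOS and Linux).

Then install it with pip: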

pip3 install beautifulsoup4-4.7.1-py3-none-any.whl
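To see why this particular file suits Python 3.7 on any operating system, it helps to read the tags encoded in the wheel's file name. The sketch below splits the name into its parts; parse_wheel_filename is a hypothetical helper written for this illustration, and it assumes the common five-part name without the optional build tag.

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel file name into its name, version, and compatibility tags.

    Hypothetical helper for illustration; it assumes the form
    name-version-python_tag-abi_tag-platform_tag.whl (no build tag).
    """
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(suffix)].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel file name: {filename}")
    name, version, python_tag, abi_tag, platform_tag = parts
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
        # "none" + "any" marks a pure-Python wheel that is not tied
        # to a specific ABI or operating system.
        "pure_python": abi_tag == "none" and platform_tag == "any",
    }


info = parse_wheel_filename("beautifulsoup4-4.7.1-py3-none-any.whl")
print(info["python_tag"], info["abi_tag"], info["platform_tag"])
```

For this file the tags are py3, none, and any: the wheel targets Python 3 and is not bound to a particular platform.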

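The examples in this section parse HTML with the lxml parser, and Beautiful Soup's XML parsing also relies on lxml, so make sure the lxml library has been installed beforehand; see Section 3.4.1 for installation instructions.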
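If a script has to run on machines where lxml may be missing, one option is to choose the parser name at run time. This is a minimal sketch under that assumption; pick_parser is a hypothetical helper, and the fallback "html.parser" is the parser bundled with the Python standard library.

```python
import importlib.util


def pick_parser() -> str:
    """Return "lxml" when the lxml package is importable, else "html.parser".

    Hypothetical helper for illustration, not part of Beautiful Soup.
    """
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
print(f"parser to use: {parser_name}")
# The name can then be passed as the second argument, for example:
#     soup = BeautifulSoup(html_doc, parser_name)
```

Note that the two parsers can build slightly different trees from malformed markup, so results may differ between machines when the fallback is used.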

3. Using the Beautiful Soup Library

Before parsing HTML with BeautifulSoup, we first need a sample HTML document; it is used repeatedly in the rest of this section.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

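Taking the HTML above as an example, parsing it with the Beautiful Soup library yields a BeautifulSoup object, which can then be printed with standard indentation, as the following example shows.

[Example 3-39] Basic use of the Beautiful Soup library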

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup(html_doc, "lxml")
# Print the parsed document with standard indentation
print(soup.prettify())

The output is as follows:

<html>
    <head>
        <title>
            The Dormouse's story
        </title>
    </head>
    <body>
        <p class="title">
            <b>
                The Dormouse's story
            </b>
        </p>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                Elsie
            </a>
            ,
            <a class="sister" href="http://example.com/lacie" id="link2">
                Lacie
            </a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">
                Tillie
            </a>
            ;
and they lived at the bottom of a well.
        </p>
        <p class="story">
            ...
        </p>
    </body>
</html>

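The code above declares an HTML string. Note that it is not a complete HTML document, because the body and html elements are never closed. We pass it as the first argument to the BeautifulSoup constructor; the second argument names the parser to use (lxml here). This initializes the BeautifulSoup object, which is then assigned to the variable soup.

Next, we call the prettify() method, which returns the parsed document as a string with standard indentation. Note that the output contains closing tags for body and html: BeautifulSoup automatically repairs non-standard HTML. This repair is not done by prettify(); it is already complete when the BeautifulSoup object is initialized.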
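The claim that the repair happens at initialization rather than in prettify() can be checked directly. The sketch below is an illustration, not the book's listing: it uses a shorter, made-up fragment and the standard-library parser "html.parser" instead of lxml, so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# A deliberately incomplete fragment: p, body and html are never closed.
broken = "<html><body><p class='story'>Once upon a time"

# Assumption: "html.parser" is used here instead of lxml to avoid
# requiring a third-party parser.
soup = BeautifulSoup(broken, "html.parser")

# Serialize the tree before and after calling prettify().
before = str(soup)
pretty = soup.prettify()
after = str(soup)

# The closing tags are already present before prettify() is called.
print("</body>" in before and "</html>" in before)
# prettify() returns a new string and leaves the tree unchanged.
print(before == after)
print(pretty)
```

The serialized tree is identical before and after the call, which shows that prettify() only formats the output and does not modify the parsed document.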

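When parsing, Beautiful Soup actually relies on an underlying parser. Besides the HTML parser in the Python standard library, it supports several third-party parsers (such as lxml). Table 3-11 lists the parsers that Beautiful Soup supports.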
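Because each parser repairs malformed markup in its own way, it can be useful to compare what the installed parsers produce for the same input. The following is a minimal sketch; parse_with_available is a hypothetical helper, and the list of parser names it tries is an assumption for illustration. Parsers that are not installed are skipped by catching the FeatureNotFound exception that Beautiful Soup raises.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_available(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse markup with every parser that is installed.

    Hypothetical helper for illustration; returns a dict that maps
    each usable parser name to the serialized tree it produced.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The requested parser library is not installed; skip it.
            continue
    return results


# An invalid fragment: an unclosed <a> followed by a stray </p>.
for name, tree in parse_with_available("<a></p>").items():
    print(f"{name}: {tree}")
```

The standard-library parser is always available, so the result contains at least the "html.parser" entry; the other entries appear only when the corresponding library is installed.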

Table 3-11 Parsers supported by the Beautiful Soup library