Traversing the Document Tree with BeautifulSoup

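We have already covered the tree structure of HTML/XML and the node types with their attributes and methods. Put simply, traversing the document tree means getting from one node in the document to another.

This article again uses the following document as its example: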

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

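Child nodes

A Tag may contain several strings or other Tags; all of these are children of that Tag. Beautiful Soup provides a number of attributes for working with and iterating over children.

Note: string nodes in Beautiful Soup do not support these attributes, because a string cannot have children.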

- .find_all('a') - find every matching tag (navigating by tag name returns only the first match)

soup.body.a
# the first <a> inside <body>
soup.find_all('a')
# a list of all <a> tags

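- .contents - the tag's direct children, as a list

- .children - an iterator over the tag's direct children

- .descendants - a recursive iterator over all of the tag's descendants

- .string - the string contained in the tag

If a tag has exactly one child and that child is a NavigableString, .string returns it. If the only child is another tag that itself has a .string, that string is passed through.
If a tag has more than one child, .string returns None, because it is ambiguous which string is meant.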
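
The attributes above can be sketched on a trimmed-down fragment of the sample document. This is only an illustration: the names snippet, title_p and story_p are made up for the example, and html.parser is used instead of lxml purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, Tag

# A shortened fragment of the sample document (assumption: html.parser
# stands in for lxml so the sketch needs no extra install).
snippet = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(snippet, 'html.parser')

title_p = soup.find('p', class_='title')
story_p = soup.find('p', class_='story')

# .contents is a list of direct children: strings and tags mixed together.
direct = story_p.contents

# .children iterates over the same direct children without building a list.
child_kinds = ['tag' if isinstance(c, Tag) else 'str' for c in story_p.children]

# .descendants also visits the strings nested inside each <a>.
descendant_count = len(list(story_p.descendants))

# .string works when there is a single chain of only children...
title_string = title_p.string
# ...and is None when the tag has several children.
story_string = story_p.string

print(child_kinds, descendant_count, title_string, story_string)
```

Here story_p has four direct children but six descendants, which is the practical difference between .children and .descendants.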

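- .strings - an iterator over every string contained in the tag

- .stripped_strings - like .strings, but with surrounding whitespace stripped and whitespace-only strings skipped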
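
A minimal sketch of the difference between .strings and .stripped_strings, assuming a small hand-written fragment (snippet, raw and clean are illustrative names, and html.parser is used only to keep the example dependency-free):

```python
from bs4 import BeautifulSoup

# A fragment with the kind of indentation and newlines real pages contain.
snippet = """<p class="story">
  Once upon a time
  <a id="link1">Elsie</a>,
  <a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(snippet, 'html.parser')

# .strings yields every string as-is, including whitespace-only ones.
raw = list(soup.p.strings)

# .stripped_strings strips each string and drops the empty results.
clean = list(soup.p.stripped_strings)

print(raw)
print(clean)
```

The cleaned list is usually what you want when extracting visible text from indented markup.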

Parent nodes

- .parent - the parent of an element

- .parents - an iterator that walks up through all of the element's ancestors
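
As a rough illustration of walking upward, the sketch below uses a tiny complete document. The variable names are hypothetical, and html.parser is an assumption made to avoid requiring lxml.

```python
from bs4 import BeautifulSoup

snippet = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(snippet, 'html.parser')

link = soup.a

# .parent is the tag that directly contains this one.
parent_name = link.parent.name

# A string's parent is the tag that contains it.
text_parent_name = link.string.parent.name

# .parents climbs all the way up; the last item is the BeautifulSoup
# object itself, whose name is '[document]'.
ancestor_names = [ancestor.name for ancestor in link.parents]

print(parent_name, text_parent_name, ancestor_names)
```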

Sibling nodes

- .next_sibling / .previous_sibling - the next/previous sibling node

- .next_siblings / .previous_siblings - iterators over the siblings after/before the current node
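
One detail that often surprises people is that the sibling of a tag is frequently a string, not the next tag. The sketch below shows this on a shortened version of the links paragraph; the names are illustrative and html.parser is assumed in place of lxml.

```python
from bs4 import BeautifulSoup, Tag

snippet = (
    '<p><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(snippet, 'html.parser')

first = soup.find(id='link1')
last = soup.find(id='link3')

# The node right after the first <a> is the text ", ", not the second <a>.
after_first = first.next_sibling

# The first child of <p> has nothing before it.
before_first = first.previous_sibling

# Filter the iterator down to tags to skip the text in between.
later_ids = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]

# .previous_siblings walks backward, so the nearest sibling comes first.
earlier_ids = [s['id'] for s in last.previous_siblings if isinstance(s, Tag)]

print(repr(after_first), before_first, later_ids, earlier_ids)
```

If only tag siblings are of interest, find_next_sibling('a') is another way to skip over the intervening strings.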

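Going back and forth

- .next_element / .previous_element - the object (string or tag) that was parsed immediately after/before the current one

- .next_elements / .previous_elements - iterators that move forward or backward through the document in parse order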
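
To make the contrast with siblings concrete, here is a small sketch (hypothetical names, html.parser assumed rather than lxml). Parse order descends into a tag before moving on to whatever follows it.

```python
from bs4 import BeautifulSoup

snippet = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, 'html.parser')

first = soup.find(id='link1')

# The next thing parsed after the <a> start tag is its own text...
next_parsed = first.next_element
# ...whereas the next sibling is the text that follows the whole <a>.
next_sib = first.next_sibling

# The thing parsed just before the first <a> is the enclosing <p>.
prev_parsed_name = first.previous_element.name

# Everything parsed after the first <a>, in order:
# 'Elsie', ', ', the second <a>, 'Lacie'.
rest = list(first.next_elements)

print(repr(next_parsed), repr(next_sib), prev_parsed_name, len(rest))
```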