Beautiful Soup Objects

Beautiful Soup converts an HTML document into a tree structure in which every node is a Python object. These objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

This article uses the following document as its example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Tag - the same as a tag in XML/HTML

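type(tag) - the type of the tag
tag.name - the name of the tag
tag.attrs / tag[''] - all attributes of the tag / one specified attribute
- A tag's name can be edited;
- A tag can have multiple attributes, which are accessed the same way as a dictionary: to look up a particular attribute, use its name as the key;
- A tag's attributes can be added, deleted, or modified, again in the same way as dictionary entries;
- HTML defines a number of multi-valued attributes. The most common is class (a tag can have several CSS classes); others include rel, rev, accept-charset, headers, and accesskey. A multi-valued attribute is returned as a list.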
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
tag = soup.p
type(tag)
# <class 'bs4.element.Tag'>
tag.name
# 'p'
tag.attrs
# {'class': ['title']}
# class is a multi-valued attribute, so its value is a list
tag['class']
# ['title']
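
The points above about editing a tag's name and attributes can be sketched as follows. This is a minimal illustration, not part of the original article: the one-line sample markup and the attribute name data-note are assumptions made for the example, and the built-in "html.parser" is used only so that the snippet needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Assumption: a small sample tag, parsed with the built-in "html.parser".
soup = BeautifulSoup('<p class="title intro" id="first">Hello</p>', "html.parser")
tag = soup.p

# class is a multi-valued attribute, so it comes back as a list.
classes = tag["class"]
# id is single-valued, so it comes back as a plain string.
tag_id = tag["id"]

# Attributes are handled like dictionary entries.
tag["id"] = "renamed"        # modify an existing attribute
tag["data-note"] = "added"   # add a new (hypothetical) attribute
del tag["class"]             # delete an attribute

# Renaming the tag changes the markup that is generated for it.
tag.name = "div"
result = str(tag)
print(result)
```

Because the changes are made on the tree itself, the rendered markup reflects them immediately.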

NavigableString - a navigable string

type(tag.string) - the type of the string
tag.string - the string contained in the tag
tag.string.replace_with('') - a string cannot be edited in place, but it can be replaced

tag.string
# "The Dormouse's story"
type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with('No longer bold')
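
A short sketch of the "cannot be edited, but can be replaced" behavior may help here. It assumes a one-tag sample document and the built-in "html.parser"; the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: a one-tag sample document parsed with the built-in parser.
soup = BeautifulSoup("<b>The Dormouse's story</b>", "html.parser")
tag = soup.b

original = tag.string
is_navigable = isinstance(original, NavigableString)

# str() gives an ordinary Python string that no longer refers to the tree.
plain_text = str(original)

# replace_with() swaps the node in the tree; the old string is detached.
tag.string.replace_with("No longer bold")
new_text = str(tag.string)
print(new_text)
```

Converting to a plain str is useful when the text has to outlive the parsed tree, since a NavigableString keeps a reference to the tree it came from.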

BeautifulSoup - the entire document, a special kind of Tag object

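The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods for navigating and searching the document tree.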

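Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. It is sometimes convenient to look at its .name, though, so the BeautifulSoup object is given the special .name value "[document]".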

soup.name
# '[document]'
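
As a rough illustration of treating the BeautifulSoup object like a Tag, the sketch below searches and navigates directly from the document root. The shortened sample markup and the use of the built-in "html.parser" are assumptions for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: a shortened sample document parsed with the built-in parser.
markup = '<p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

# The BeautifulSoup class is a subclass of Tag.
is_tag = isinstance(soup, Tag)

# The special name given to the document object.
doc_name = soup.name

# Searching and navigating work from the document root as they do on a tag.
link_ids = [a["id"] for a in soup.find_all("a")]
first_child = soup.contents[0].name
print(doc_name, link_ids, first_child)
```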

Comment - comments and other special strings

A Comment object is a special kind of NavigableString.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
# name the parser explicitly, as in the earlier example
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
comment
# 'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>

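Beautiful Soup also defines other types that may appear in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are subclasses of NavigableString that add a little extra behavior to the string. Here is an example that replaces the comment with a CDATA block: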
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
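
Since Comment and the other special types are subclasses of NavigableString, code that collects text usually has to tell them apart by type. The following is one possible way to do that, sketched under assumptions: the sample markup is invented for the example, and the built-in "html.parser" is used instead of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Assumption: a small sample mixing ordinary text with a comment.
markup = "<p>Visible text<!-- hidden note --> and more</p>"
soup = BeautifulSoup(markup, "html.parser")

# Walk every node below <p> and sort the strings by type.
nodes = list(soup.p.descendants)
comments = [n for n in nodes if isinstance(n, Comment)]
visible = [
    str(n)
    for n in nodes
    if isinstance(n, NavigableString) and not isinstance(n, Comment)
]

# A Comment is still a NavigableString, so the subclass check must come first.
all_are_strings = all(isinstance(c, NavigableString) for c in comments)
print(visible, [str(c) for c in comments])
```

Checking for Comment before the more general NavigableString matters, because every Comment would also pass the general check.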
