
How to parse a table with rowspan and colspan


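You can't just count td or th cells, no. You have to scan across the table to get the number of columns in each row, adding to that count any rowspans still active from preceding rows.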

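In a different scenario, parsing a table with rowspans, I tracked the rowspan count per column number to make sure that data from different cells ended up in the right column. A similar technique can be used here.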

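First, count columns; keep only the highest number. Keep a list of rowspan numbers of 2 or greater, and subtract 1 from each for every row you process. That way you know how many "extra" columns there are in each row. Use the highest column count to build your output matrix.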
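The first pass can be sketched without any HTML parser. The minimal example below assumes each row has already been reduced to a list of (colspan, rowspan) pairs; the name count_columns and that input shape are illustrative assumptions. It leaves out 0-valued spans and treats each pending rowspan as one column wide.

```python
def count_columns(rows):
    """Return the widest column count over all rows.

    rows is a list of rows; each row is a list of (colspan, rowspan)
    tuples (a hypothetical, parser-free stand-in for <td>/<th> cells).
    """
    pending = []   # rows still to be covered by rowspans from earlier rows
    colcount = 0
    for cells in rows:
        # columns from this row's own cells plus rowspans reaching into it
        width = sum(colspan for colspan, _ in cells) + len(pending)
        colcount = max(colcount, width)
        # register new rowspans, then carry over only those that still
        # cover at least one more row
        pending += [rowspan for _, rowspan in cells]
        pending = [s - 1 for s in pending if s > 1]
    return colcount


# Header row, then a row whose first cell spans two rows, then two more rows
rows = [
    [(1, 1), (1, 1)],
    [(1, 2), (1, 1)],
    [(1, 1), (1, 1)],
    [(1, 1), (1, 1)],
]
print(count_columns(rows))  # 3
```

The third row has only two cells of its own, but the rowspan carried over from the second row adds one more column, so the matrix needs three columns.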

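Next, loop over the rows and cells again, this time tracking rowspans in a dictionary that maps column number to active count. Again, carry anything with a value of 2 or more over to the next row. Then shift the column numbers to account for any active rowspans; the first td in a row is actually the second if there is a rowspan active on column 0, and so on.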
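The column shift can be illustrated in isolation. In this sketch, assign_columns is a hypothetical helper, and every cell is assumed to have a colspan of 1; the dictionary plays the role of the column-to-count mapping described above.

```python
def assign_columns(cell_count, active):
    """Return the real column index of each cell in one row.

    active maps column number -> number of rows (including this one)
    still covered by a rowspan from an earlier row.
    """
    columns = []
    col = 0
    for _ in range(cell_count):
        # skip columns already occupied by a rowspan from above
        while active.get(col, 0):
            col += 1
        columns.append(col)
        col += 1
    return columns


print(assign_columns(2, {0: 1}))        # [1, 2]
print(assign_columns(2, {0: 1, 1: 1}))  # [2, 3]
```

With a rowspan active on column 0, the first cell of the row lands in column 1; with rowspans active on columns 0 and 1, it lands in column 2.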

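Your code copies the value of spanned columns and rows into the output repeatedly; I achieved the same by looping over the colspan and rowspan numbers of a given cell (each defaulting to 1) to copy the value multiple times. I'm ignoring overlapping cells; the HTML table specification states that overlapping cells are an error, and it is up to the user agent to resolve conflicts. In the code below, colspan trumps rowspan cells.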
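Before the full function, the copying step on its own might look like the following sketch. The helper name fill_span and the list-of-lists matrix are assumptions made for illustration; positions that fall outside the matrix are skipped.

```python
from itertools import product


def fill_span(table, row, col, value, rowspan=1, colspan=1):
    """Copy value into every position covered by a cell's spans."""
    for drow, dcol in product(range(rowspan), range(colspan)):
        try:
            table[row + drow][col + dcol] = value
        except IndexError:
            # span reaches past the matrix; ignore that position
            pass


grid = [[None] * 3 for _ in range(2)]
fill_span(grid, 0, 1, 'x', rowspan=3, colspan=2)
print(grid)  # [[None, 'x', 'x'], [None, 'x', 'x']]
```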

from itertools import product

def table_to_2d(table_tag):
    rowspans = []  # track pending rowspans
    rows = table_tag.find_all('tr')

    # first scan, see how many columns we need
    colcount = 0
    for r, row in enumerate(rows):
        cells = row.find_all(['td', 'th'], recursive=False)
        # count columns (including spanned).
        # add active rowspans from preceding rows
        # we *ignore* the colspan value on the last cell, to prevent
        # creating 'phantom' columns with no actual cells, only extended
        # colspans. This is achieved by hardcoding the last cell width as 1.
        # a colspan of 0 means "fill until the end" but can really only apply
        # to the last cell; ignore it elsewhere.
        colcount = max(
            colcount,
            sum(int(c.get('colspan', 1)) or 1 for c in cells[:-1])
            + len(cells[-1:])
            + len(rowspans))
        # update rowspan bookkeeping; 0 is a span to the bottom.
        rowspans += [int(c.get('rowspan', 1)) or len(rows) - r for c in cells]
        rowspans = [s - 1 for s in rowspans if s > 1]

    # it doesn't matter if there are still rowspan numbers 'active'; no extra
    # rows to show in the table means the larger than 1 rowspan numbers in the
    # last table row are ignored.

    # build an empty matrix for all possible cells
    table = [[None] * colcount for row in rows]

    # fill matrix from row data
    rowspans = {}  # track pending rowspans, column number mapping to count
    for row, row_elem in enumerate(rows):
        span_offset = 0  # how many columns are skipped due to row and colspans
        for col, cell in enumerate(row_elem.find_all(['td', 'th'], recursive=False)):
            # adjust for preceding row and colspans
            col += span_offset
            while rowspans.get(col, 0):
                span_offset += 1
                col += 1

            # fill table data
            rowspan = rowspans[col] = int(cell.get('rowspan', 1)) or len(rows) - row
            colspan = int(cell.get('colspan', 1)) or colcount - col
            # next column is offset by the colspan
            span_offset += colspan - 1
            value = cell.get_text()
            for drow, dcol in product(range(rowspan), range(colspan)):
                try:
                    table[row + drow][col + dcol] = value
                    rowspans[col + dcol] = rowspan
                except IndexError:
                    # rowspan or colspan outside the confines of the table
                    pass

        # update rowspan bookkeeping
        rowspans = {c: s - 1 for c, s in rowspans.items() if s > 1}

    return table

This parses your sample table correctly:

>>> from pprint import pprint
>>> pprint(table_to_2d(soup.table), width=30)
[['1', '2', '5'],
 ['3', '4', '4'],
 ['3', '6', '7']]

It also handles your other examples; the first table:

>>> table1 = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <th>A</th>
...     <th>B</th>
...   </tr>
...   <tr>
...     <td rowspan="2">C</td>
...     <td rowspan="1">D</td>
...   </tr>
...   <tr>
...     <td>E</td>
...     <td>F</td>
...   </tr>
...   <tr>
...     <td>G</td>
...     <td>H</td>
...   </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(table1.table), width=30)
[['A', 'B', None],
 ['C', 'D', None],
 ['C', 'E', 'F'],
 ['G', 'H', None]]

And the second:

>>> table2 = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <th>A</th>
...     <th>B</th>
...   </tr>
...   <tr>
...     <td rowspan="2">C</td>
...     <td rowspan="2">D</td>
...   </tr>
...   <tr>
...     <td>E</td>
...     <td>F</td>
...   </tr>
...   <tr>
...     <td>G</td>
...     <td>H</td>
...   </tr>
... </table>
... ''', 'html.parser')
>>> pprint(table_to_2d(table2.table), width=30)
[['A', 'B', None, None],
 ['C', 'D', None, None],
 ['C', 'D', 'E', 'F'],
 ['G', 'H', None, None]]

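Last but not least, the code correctly handles spans that extend beyond the actual table, as well as "0" spans (which extend to the end), as in the following example: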

<table border="1">
  <tr>
    <td rowspan="3">A</td>
    <td rowspan="0">B</td>
    <td>C</td>
    <td colspan="2">D</td>
  </tr>
  <tr>
    <td colspan="0">E</td>
  </tr>
</table>

There are two rows of 4 cells, even though the rowspan and colspan values would have you believe there could be 3 and 5:

+---+---+---+---+
|   |   | C | D |
| A | B +---+---+
|   |   |   E   |
+---+---+-------+

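Such overspanning is handled just as a browser would handle it: the excess is ignored, and the 0 spans extend to the remaining rows or columns: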

>>> span_demo = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <td rowspan="3">A</td>
...     <td rowspan="0">B</td>
...     <td>C</td>
...     <td colspan="2">D</td>
...   </tr>
...   <tr>
...     <td colspan="0">E</td>
...   </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(span_demo.table), width=30)
[['A', 'B', 'C', 'D'],
 ['A', 'B', 'E', 'E']]
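The span rules used in this example can also be written as one small function. This is only a sketch: resolve_span is a hypothetical helper, and it clamps oversized spans explicitly, whereas the function above gets the same effect by skipping positions that raise IndexError.

```python
def resolve_span(raw, remaining):
    """Turn a raw rowspan/colspan attribute into a usable count.

    raw is the attribute string, or None if the attribute is absent;
    remaining is the number of rows or columns left from the cell's
    position to the edge of the table.
    """
    span = int(raw) if raw is not None else 1
    # 0 means "extend to the edge"; larger values are cut off at the edge
    return min(span or remaining, remaining)


print(resolve_span(None, 2))  # 1
print(resolve_span('0', 2))   # 2
print(resolve_span('3', 2))   # 2
```

For cell A in the example, rowspan="3" with two rows remaining resolves to 2; for cell B, rowspan="0" also resolves to 2.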


Please credit the source when reposting: reposted from www.mshxw.com
Article URL: https://www.mshxw.com/it/645428.html