python-docx: misaligned rows when reading tables via document.tables

For work reasons I have recently been using python-docx to extract tables and embedded files from Word documents.

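I found that the usual approach is to loop over table.rows and row.cells (often with enumerate) and read the table row by row, cell by cell. This causes a problem: tables in Word are often drawn by hand or have been edited many times, so their rows and columns frequently do not line up, and extracting them this way puts values in the wrong rows.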
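
For reference, the usual pattern looks roughly like the sketch below. It only assumes a table object that exposes rows, cells and text the way python-docx tables do; the helper name read_table_by_cells and the file name in the comment are made up for illustration.

```python
def read_table_by_cells(table):
    """Return the table as a list of rows, each row a list of cell texts."""
    rows = []
    for row in table.rows:
        # row.cells is whatever the library reports for this row; on an
        # irregular table this is where the misalignment shows up.
        rows.append([cell.text for cell in row.cells])
    return rows


# Typical use with python-docx (assumed installed; "report.docx" is a placeholder):
#     from docx import Document
#     table = Document("report.docx").tables[0]
#     print(read_table_by_cells(table))
```

On a clean, rectangular table this works well enough; the trouble only starts when the table is irregular.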

So I decided to read each table's XML directly and extract the table contents from that XML instead.

# path: location of the .docx file
# tb_location: zero-based index of the table to extract from the document

import xml.etree.ElementTree as ET

from docx import Document


def extract_table(path, tb_location):
    document = Document(path)

    # Serialize the underlying XML element of every table to a string.
    proxy = [t._element.xml for t in document.tables]

    ns = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    wtr_str = ".//" + ns + "tr"  # w:tr, one table row
    wtc_str = ".//" + ns + "tc"  # w:tc, one cell
    wt_str = ".//" + ns + "t"    # w:t, one piece of text

    if tb_location < 0:
        return []

    root = ET.fromstring(proxy[tb_location])
    wtr_lt = []
    # Note: ".//" matches descendants at any depth, so rows and cells of
    # nested tables are picked up as well.
    for wtr in root.findall(wtr_str):
        cell_lt = []
        for cell in wtr.findall(wtc_str):
            # A cell's text can be split across several w:t elements; join them.
            # An empty cell yields "" rather than the previous cell's text.
            cell_lt.append("".join(wt.text or "" for wt in cell.findall(wt_str)))
        wtr_lt.append(cell_lt)
    return wtr_lt
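
To see what the XPath lookups above return on an irregular table, here is a small sketch that needs only the standard library. The XML fragment is hand-written for illustration (not taken from a real document), and rows_from_tbl_xml is a hypothetical helper that repeats the same row / cell / text traversal on an XML string.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = "{" + W + "}"

# Hand-written fragment: row 1 has two cells, row 2 has three, the first
# cell's text is split across two runs, and one cell is empty.
SAMPLE_TBL = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:p><w:r><w:t>Na</w:t></w:r><w:r><w:t>me</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>Value</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>
    <w:tc><w:p/></w:tc>
    <w:tc><w:p><w:r><w:t>1</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
"""


def rows_from_tbl_xml(tbl_xml):
    """Return one list of cell texts per w:tr found in the table XML."""
    root = ET.fromstring(tbl_xml.strip())
    rows = []
    for tr in root.findall(".//" + NS + "tr"):
        cells = []
        for tc in tr.findall(".//" + NS + "tc"):
            # Join every w:t inside the cell; an empty cell becomes "".
            cells.append("".join(t.text or "" for t in tc.findall(".//" + NS + "t")))
        rows.append(cells)
    return rows


print(rows_from_tbl_xml(SAMPLE_TBL))
# [['Name', 'Value'], ['a', '', '1']]
```

Each row keeps exactly the cells that appear inside its own w:tr element, so a row with fewer or more cells does not shift its neighbours.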
