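You have to track the rowspans from previous rows, one per column.

You can do this by copying the integer value of each rowspan into a dictionary; every subsequent row then decrements that value until it drops to 1 (or you can store the value minus 1 and count down to 0, which is easier to code). You can then adjust the column positions of cells in later rows based on the rowspans still in effect from earlier rows.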
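The bookkeeping described above can be sketched without any HTML at all. This is only an illustration, not the code for your table: the `place_cells` helper and the `(text, rowspan)` input format are made up here, and it uses the "store the value minus 1 and count down to 0" variant.

```python
def place_cells(rows):
    """Give every cell its true column number, honouring rowspans.

    `rows` is a list of rows; each row lists only the cells physically
    present in it, as (text, rowspan) tuples (hypothetical format).
    """
    remaining = {}  # column -> rows still covered by a cell from above
    placed = []
    for rownum, cells in enumerate(rows, 1):
        # columns occupied in this row by a cell that started earlier
        blocked = {col for col, left in remaining.items() if left > 0}
        for col in blocked:
            remaining[col] -= 1  # this row uses up one spanned row
        col = 0
        for text, rowspan in cells:
            col += 1
            while col in blocked:  # skip over columns filled from above
                col += 1
            if rowspan > 1:
                remaining[col] = rowspan - 1
            placed.append((rownum, col, text))
    return placed


rows = [
    [("a", 1), ("b", 3), ("c", 1)],  # "b" also covers rows 2 and 3
    [("d", 1), ("e", 1)],
    [("f", 2), ("g", 1)],            # "f" also covers row 4
    [("h", 1), ("i", 1)],
]
for entry in place_cells(rows):
    print(entry)
```

In row 2, `e` is the second cell in the markup but lands in column 3, because column 2 is still taken by `b`.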
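
Your table complicates this a little by using a default rowspan of 2 that increases in steps of 2, but dividing by 2 easily brings that back to manageable numbers.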
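As a small sketch of that normalisation (the `extra_blocks` name and its `step` parameter are made up for illustration), a raw `rowspan` attribute can be turned into the number of additional blocks a cell covers like this:

```python
def extra_blocks(raw_rowspan=None, step=2):
    """Return how many blocks a cell covers beyond its first one.

    `raw_rowspan` is the attribute value as found in the HTML (a string,
    or None when the attribute is missing); a missing attribute is
    treated as one block, i.e. a span of `step` table rows.
    """
    raw = step if raw_rowspan is None else int(raw_rowspan)
    return raw // step - 1


print(extra_blocks())     # no rowspan attribute: a single block
print(extra_blocks("4"))  # two blocks: one extra
print(extra_blocks("6"))  # three blocks: two extra
```

A result of 0 means the cell does not reach into later rows, so there is nothing to record for that column.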

Rather than use massive CSS selectors, select just the table rows and iterate over those:

    roster = []
    rowspans = {}  # track rowspanning cells
    # every second row in the table
    rows = page.select('html > body > center > table > tr')[1:21:2]
    for block, row in enumerate(rows, 1):
        # take direct child td cells, but skip the first cell:
        daycells = row.select('> td')[1:]
        rowspan_offset = 0
        for daynum, daycell in enumerate(daycells, 1):
            # rowspan handling; if there is a rowspan here, adjust to find correct position
            daynum += rowspan_offset
            while rowspans.get(daynum, 0):
                rowspan_offset += 1
                rowspans[daynum] -= 1
                daynum += 1
            # now we have a correct day number for this cell, adjusted for
            # rowspanning cells.
            # update the rowspan accounting for this cell
            rowspan = (int(daycell.get('rowspan', 2)) // 2) - 1
            if rowspan:
                rowspans[daynum] = rowspan

            texts = daycell.select("table > tr > td > font")
            if texts:
                # class info found
                teacher, classroom, course = (c.get_text(strip=True) for c in texts)
                roster.append({
                    'blok_start': block,
                    'blok_eind': block + rowspan,
                    'dag': daynum,
                    'leraar': teacher,
                    'lokaal': classroom,
                    'vak': course
                })

        # days that were skipped at the end due to a rowspan
        while daynum < 5:
            daynum += 1
            if rowspans.get(daynum, 0):
                rowspans[daynum] -= 1

This produces the correct output:

    [{'blok_eind': 2, 'blok_start': 1, 'dag': 5, 'leraar': u'BLEEJ002', 'lokaal': u'ALK B021', 'vak': u'WEBD'},
     {'blok_eind': 3, 'blok_start': 2, 'dag': 3, 'leraar': u'BLEEJ002', 'lokaal': u'ALK B021B', 'vak': u'WEBD'},
     {'blok_eind': 4, 'blok_start': 3, 'dag': 5, 'leraar': u'DOODF000', 'lokaal': u'ALK C212', 'vak': u'PROJ-T'},
     {'blok_eind': 5, 'blok_start': 4, 'dag': 3, 'leraar': u'BLEEJ002', 'lokaal': u'ALK B021B', 'vak': u'MENT'},
     {'blok_eind': 7, 'blok_start': 6, 'dag': 5, 'leraar': u'JONGJ003', 'lokaal': u'ALK B008', 'vak': u'BURG'},
     {'blok_eind': 8, 'blok_start': 7, 'dag': 3, 'leraar': u'FLUIP000', 'lokaal': u'ALK B004', 'vak': u'ICT algemeen Prakti'},
     {'blok_eind': 9, 'blok_start': 8, 'dag': 5, 'leraar': u'KOOLE000', 'lokaal': u'ALK B008', 'vak': u'NED'}]

Moreover, this code will keep working if a course spans more than two blocks, or only one block; any rowspan size is supported.



