Load an Excel file chunk by chunk with Python instead of loading the whole file into memory

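Because of the nature of xlsx files (essentially a bunch of xml files zipped together), you cannot seek to an arbitrary byte in the file and expect it to be the start of the Nth row of the table in the sheet you are interested in.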

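The best you can do is use pandas.read_excel with the skiprows (skips rows from the top of the file) and skip_footer (skips rows from the bottom; spelled skipfooter in newer pandas releases) arguments. However, this will still load the whole file into memory first and only then parse the requested rows.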

import pandas as pd

# if the file contains 300 rows, this will read the middle 100
# (on newer pandas releases, pass skipfooter=100 instead of skip_footer=100)
df = pd.read_excel('/path/excel.xlsx', skiprows=100, skip_footer=100,
                   names=['col_a', 'col_b'])

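Note that you have to set the headers manually with the names argument, otherwise the column names will be taken from a data row at the skip boundary (the first row read after the skipped ones) rather than from the real header.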

If you were willing to use csv instead, this would be a straightforward task, because csv files are plain text files.
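To illustrate why csv makes this easy, here is a minimal sketch that reads a csv file a fixed number of rows at a time using only the standard library. The function name iter_csv_chunks and the assumption that the first line is a header are mine, not part of the original answer.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size):
    """Yield (header, rows) pairs with at most chunk_size rows each.

    Only one chunk of rows is held in memory at a time.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # assumption: the first line is the header
        if header is None:
            return  # empty file, nothing to yield
        while True:
            rows = list(islice(reader, chunk_size))
            if not rows:
                break
            yield header, rows
```

If you are already working with pandas, pandas.read_csv accepts a chunksize argument that returns an iterator of DataFrames, which serves the same purpose.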

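But, and this is a big but, if you are really desperate, you can extract the relevant sheet's xml file from the xlsx archive and parse it yourself. However, this is far from trivial.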
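As a first step in that direction, the sketch below lists the worksheet parts inside an xlsx archive with the standard zipfile module. It assumes the conventional layout in which worksheets live under xl/worksheets/; strictly speaking the workbook's relationship files decide where each sheet is stored, so treat the prefix as an assumption. The function name list_worksheet_parts is hypothetical.

```python
import zipfile


def list_worksheet_parts(xlsx_path):
    """Return the names of the worksheet xml members in an xlsx archive.

    Assumption: sheets are stored under the conventional xl/worksheets/ folder.
    """
    with zipfile.ZipFile(xlsx_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("xl/worksheets/") and name.endswith(".xml")
        )
```

A member found this way can then be opened with ZipFile.open, which decompresses it as a stream rather than all at once.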

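A sample xml file representing a worksheet with a table of 2 columns by 3 rows. The <v> tags hold the value of each cell; for cells marked t="s" that value is an index into the workbook's shared strings table rather than the text itself.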

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officedocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
    <dimension ref="A1:B3"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0">
            <selection activeCell="C10" sqref="C10"/>
        </sheetView>
    </sheetViews>
    <sheetFormatPr defaultColWidth="11" defaultRowHeight="14.25" x14ac:dyDescent="0.2"/>
    <sheetData>
        <row r="1" spans="1:2" ht="15.75" x14ac:dyDescent="0.2">
            <c r="A1" t="s">
                <v>1</v>
            </c>
            <c r="B1" s="1" t="s">
                <v>0</v>
            </c>
        </row>
        <row r="2" spans="1:2" ht="15" x14ac:dyDescent="0.2">
            <c r="A2" s="2">
                <v>1</v>
            </c>
            <c r="B2" s="2">
                <v>4</v>
            </c>
        </row>
        <row r="3" spans="1:2" ht="15" x14ac:dyDescent="0.2">
            <c r="A3" s="2">
                <v>2</v>
            </c>
            <c r="B3" s="2">
                <v>5</v>
            </c>
        </row>
    </sheetData>
    <pageMargins left="0.75" right="0.75" top="1" bottom="1" header="0.5" footer="0.5"/>
</worksheet>
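A worksheet part with this structure could be read in chunks roughly as follows. This is only a sketch under stated assumptions: the function name iter_row_chunks and the sheet_member argument are hypothetical, values are returned as the raw text of the <v> tags (so shared-string cells yield an index, not the string), and styles, dates and formulas are ignored entirely.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the spreadsheetml elements, as declared on <worksheet>.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def iter_row_chunks(xlsx_path, sheet_member, chunk_size):
    """Yield lists of up to chunk_size rows from one worksheet part.

    Each row is a list of (cell_reference, raw_value) tuples, where
    raw_value is the unconverted text of the cell's <v> tag (or None).
    """
    with zipfile.ZipFile(xlsx_path) as archive:
        # ZipFile.open decompresses the member incrementally.
        with archive.open(sheet_member) as stream:
            chunk = []
            for _event, elem in ET.iterparse(stream, events=("end",)):
                if elem.tag != NS + "row":
                    continue
                cells = elem.findall(NS + "c")
                chunk.append([(c.get("r"), c.findtext(NS + "v")) for c in cells])
                elem.clear()  # drop the cells of the row we just processed
                if len(chunk) == chunk_size:
                    yield chunk
                    chunk = []
            if chunk:
                yield chunk  # final, possibly shorter chunk
```

A call such as iter_row_chunks('/path/excel.xlsx', 'xl/worksheets/sheet1.xml', 100) would then yield 100 rows at a time; the member name here is an assumed example and depends on the actual workbook.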
