Parsing a PDF with no /Root object using PDFMiner

Interesting problem. I did some research:

The function that parses the PDF (from the PDFMiner source code):

    def set_parser(self, parser):
        "Set the document to use a given PDFParser object."
        if self._parser: return
        self._parser = parser
        # Retrieve the information of each header that was appended
        # (maybe multiple times) at the end of the document.
        self.xrefs = parser.read_xref()
        for xref in self.xrefs:
            trailer = xref.get_trailer()
            if not trailer: continue
            # If there's an encryption info, remember it.
            if 'Encrypt' in trailer:
                #assert not self.encryption
                self.encryption = (list_value(trailer['ID']),
                                   dict_value(trailer['Encrypt']))
            if 'Info' in trailer:
                self.info.append(dict_value(trailer['Info']))
            if 'Root' in trailer:
                #  Every PDF file must have exactly one /Root dictionary.
                self.catalog = dict_value(trailer['Root'])
                break
        else:
            raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        if self.catalog.get('Type') is not LITERAL_CATALOG:
            if STRICT:
                raise PDFSyntaxError('Catalog not found!')
        return
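The error in question comes from the `else` branch of the `for` loop in `set_parser`: in Python, a loop's `else` runs only when the loop finishes without hitting `break`, that is, when no trailer contained a `Root` key. Here is a minimal sketch of that control flow on plain dictionaries. `find_catalog` and `NoRootError` are hypothetical names used for illustration, not part of PDFMiner.

```python
class NoRootError(Exception):
    """Hypothetical stand-in for PDFMiner's PDFSyntaxError."""


def find_catalog(trailers):
    """Return the /Root entry of the first trailer dict that has one."""
    for trailer in trailers:
        if not trailer:
            continue  # skip empty trailers, as set_parser does
        if 'Root' in trailer:
            catalog = trailer['Root']
            break  # leaving via break skips the else branch below
    else:
        # Reached only when the loop ended without break,
        # i.e. no trailer had a 'Root' key.
        raise NoRootError('No /Root object! - Is this really a PDF?')
    return catalog


# A trailer list where only the last trailer carries /Root.
print(find_catalog([{}, {'Info': 1}, {'Root': {'Type': 'Catalog'}}]))
```

So the exception means that every trailer PDFMiner managed to read lacked a `Root` key, not necessarily that the file has no catalog object at all.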

If the problem were an unexpected EOF, a different exception would be raised (this is another function from the source):

    def load(self, parser, debug=0):
        while 1:
            try:
                (pos, line) = parser.nextline()
                if not line.strip(): continue
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            if not line:
                raise PDFNoValidXRef('Premature eof: %r' % parser)
            if line.startswith('trailer'):
                parser.seek(pos)
                break
            f = line.strip().split(' ')
            if len(f) != 2:
                raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
            try:
                (start, nobjs) = map(long, f)
            except ValueError:
                raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
            for objid in xrange(start, start+nobjs):
                try:
                    (_, line) = parser.nextline()
                except PSEOF:
                    raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
                f = line.strip().split(' ')
                if len(f) != 3:
                    raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
                (pos, genno, use) = f
                if use != 'n': continue
                self.offsets[objid] = (int(genno), long(pos))
        if 1 <= debug:
            print >>sys.stderr, 'xref objects:', self.offsets
        self.load_trailer(parser)
        return
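To make the loop in `load` easier to follow, here is a rough Python 3 sketch of the same idea working on a plain list of text lines instead of a PDFMiner parser. `parse_xref_section` is a hypothetical helper, it raises `ValueError` where the original raises `PDFNoValidXRef`, and the sample table is made up for illustration.

```python
def parse_xref_section(lines):
    """Parse the lines following the 'xref' keyword, up to 'trailer'.

    Returns {object id: (generation number, byte offset)} for in-use entries.
    """
    offsets = {}
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line:
            continue
        if line.startswith('trailer'):
            break  # the trailer dictionary follows; stop here
        # Subsection header: "<first object id> <number of entries>"
        fields = line.split(' ')
        if len(fields) != 2:
            raise ValueError('Trailer not found: line=%r' % line)
        start, nobjs = map(int, fields)
        for objid in range(start, start + nobjs):
            try:
                entry = next(it)
            except StopIteration:
                raise ValueError('Unexpected EOF - file corrupted?')
            # Entry: "<10-digit offset> <5-digit generation> <n|f>"
            parts = entry.strip().split(' ')
            if len(parts) != 3:
                raise ValueError('Invalid XRef format: line=%r' % entry)
            pos, genno, use = parts
            if use != 'n':
                continue  # 'f' marks a free entry; only 'n' is in use
            offsets[objid] = (int(genno), int(pos))
    else:
        # The lines ran out before a 'trailer' keyword was seen.
        raise ValueError('Unexpected EOF - file corrupted?')
    return offsets


sample = [
    '0 3',
    '0000000000 65535 f ',
    '0000000017 00000 n ',
    '0000000081 00000 n ',
    'trailer',
]
print(parse_xref_section(sample))
```

A file that fails in this stage produces an xref error, which is a different failure from the "No /Root object!" message.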

From Wikipedia (on the PDF specification): a PDF file consists primarily of objects, of which there are eight types:

- Boolean values, representing true or false
- Numbers
- Strings
- Names
- Arrays, ordered collections of objects
- Dictionaries, collections of objects indexed by Names
- Streams, usually containing large amounts of data
- The null object

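Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file.
This design allows for efficient random access to the objects in the file, and also allows small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.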
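As a small illustration of the random access described above, the sketch below jumps straight to a byte offset (such as one taken from an xref table) and reads the `N G obj` header found there. `read_object_header` and the sample bytes are hypothetical and only meant to show the idea; real files need far more careful handling.

```python
import io
import re


def read_object_header(stream, offset):
    """Seek to a byte offset and return (object number, generation number)."""
    stream.seek(offset)
    line = stream.readline()
    match = re.match(rb'(\d+)\s+(\d+)\s+obj', line)
    if match is None:
        raise ValueError('No indirect object at offset %d' % offset)
    return int(match.group(1)), int(match.group(2))


# A tiny made-up fragment: the header line, then one indirect object.
data = b'%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n'
offset = data.index(b'1 0 obj')  # an xref table would normally supply this
print(read_object_header(io.BytesIO(data), offset))
```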

I think the problem is that your "damaged PDF" has a few "root elements" on the page.

Possible solution:

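You can download the source code and add print statements at every place where the xref objects are retrieved and where the parser tries to parse them. That should make it possible to trace the full sequence of steps leading up to this error.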
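Before instrumenting the library, a crude byte-level scan of the file can already show how many trailers it has and which `/Root` references appear in it. This is only a sketch using the standard library, not anything PDFMiner provides: `summarize_trailers` is a hypothetical helper, the sample bytes are made up, and the scan only sees `/Root` entries written as plain indirect references.

```python
import re

# Matches "/Root 12 0 R" style indirect references.
ROOT_REF = re.compile(rb'/Root\s+(\d+)\s+(\d+)\s+R')


def summarize_trailers(data):
    """Return (number of 'trailer' keywords, list of /Root references)."""
    trailer_count = len(re.findall(rb'\btrailer\b', data))
    roots = [(int(num), int(gen)) for num, gen in ROOT_REF.findall(data)]
    return trailer_count, roots


# Made-up example: an original trailer plus one incremental update.
sample_pdf_tail = (
    b'trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n0\n%%EOF\n'
    b'trailer\n<< /Size 4 /Root 1 0 R /Prev 0 >>\nstartxref\n99\n%%EOF\n'
)
print(summarize_trailers(sample_pdf_tail))

# For a real file (the path here is hypothetical):
# with open('damaged.pdf', 'rb') as fh:
#     print(summarize_trailers(fh.read()))
```

If the scan finds `/Root` references while PDFMiner still reports "No /Root object!", that points towards the trailers not being reached or parsed, which is where the print statements would help most.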

P.S. I think this is some kind of bug in the product.

