Extract references from PDF - Python

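PDF is a very complex format and I'm not an expert, but I took the source code of extractText() to see how it works, and by adding

print('>>>', operator, operands)

I could see which values it finds in the PDF.

This document uses the "Tm" operator to move the position to a new line, so I changed the original code of extractText() to add "\n" on every "Tm", and I got the text in lines: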

Arto Anttila. 1995. How to recognise subjects in
English. In Karlsson et al., chapt. 9, pp. 315-358.
Dekang Lin. 1996. evaluation of Principar with the
Susanne corpus. In John Carroll, editor, Work-
shop on Robust Parsing, pages 54-69, Prague.
Jason M. Eisner. 1996. Three new probabilistic
models for dependency parsing: An exploration.
In The 16th International Conference on Compu-
tational Linguistics, pages 340-345. Copenhagen.
David G. Hays. 1964. Dependency theory: A
formalism and some observations. Language,
40(4):511-525.

or, with --- printed between the lines:

---
Arto Anttila. 1995. How to recognise subjects in
---
English. In Karlsson et al., chapt. 9, pp. 315-358.
---
Dekang Lin. 1996. evaluation of Principar with the
---
Susanne corpus. In John Carroll, editor, Work-
---
shop on Robust Parsing, pages 54-69, Prague.
---
Jason M. Eisner. 1996. Three new probabilistic
---
models for dependency parsing: An exploration.
---
In The 16th International Conference on Compu-
---
tational Linguistics, pages 340-345. Copenhagen.
---
David G. Hays. 1964. Dependency theory: A
---
formalism and some observations. Language,
---
40(4):511-525.

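This is still not very useful, because every reference is split across several lines, but for now this is the code I used to get this result: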

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in the original `extractText`

# --- functions ---

def myExtractText(self):
    # code from the original `extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
        # new code to add `\n` when text moves to a new line
        elif operator == b_("Tm"):
            text += '\n'

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

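After some digging, it seems that Tm also has operands, and among them is the new position x,y. It can be used to calculate the distance between lines of text, so that "\n" is added only when the distance is bigger than some value. I tested different values, and with the value 17 I got the expected result: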

---
Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358.
---
Dekang Lin. 1996. evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague.
---
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. Copenhagen.
---
David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525.
---

The code for this version:

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in the original `extractText`

# --- functions ---

def myExtractText2(self):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    # position set by the previous `Tm`
    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
        elif operator == b_("Tm"):
            # the last two operands of `Tm` are the new position
            x = operands[-2]
            y = operands[-1]

            diff_x = prev_x - x
            diff_y = prev_y - y

            #print('>>>', diff_x, diff_y - y)
            #text += f'| {diff_x}, {diff_y - y} |'

            if diff_y > 17 or diff_y < 0:  # (bigger margin) or (move to top in next column)
                text += '\n'
                #text += '\n' # to add empty line between elements

            prev_x = x
            prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText2(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

It works for this PDF, but other files may have a different structure or a different distance between references, and may need other changes.
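To choose a threshold for a different PDF, one option is to count how often each vertical gap occurs between consecutive Tm positions: the most frequent gap is probably the normal line spacing, and the next larger one is probably the spacing between references. The sketch below is only an illustration on made-up y values; gap_histogram is a hypothetical helper, not part of PyPDF2, and in real use the y values would be collected from operands[-1] of every Tm.

```python
from collections import Counter

def gap_histogram(y_positions):
    """Count vertical gaps between consecutive text positions (hypothetical helper)."""
    gaps = Counter()
    prev_y = None
    for y in y_positions:
        if prev_y is not None:
            # same sign convention as `diff_y = prev_y - y` in the answer's code
            gaps[round(prev_y - y, 1)] += 1
        prev_y = y
    return gaps

# Made-up positions: lines 11 units apart, entries 20 units apart,
# then a jump back up to the top of the next column (negative gap).
ys = [700, 689, 678, 658, 647, 627, 616, 605, 720, 709]

hist = gap_histogram(ys)
for gap, count in hist.most_common():
    print(gap, count)
```

With these numbers any threshold between 11 and 20 would separate the entries; the negative gap corresponds to the "move to top in next column" case.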

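EDIT:

A more universal version - it has a second argument.

If you run it without the second argument

 text += myExtractText(page)

then it works like the original extractText() and you get everything in one string.

If the second argument is True

 text += myExtractText(page, True)

then it adds a new line after every Tm - like my first version.

If the second argument is an integer, e.g.

 text += myExtractText(page, 17)

then it adds a new line only when the distance is bigger than that value - like my second version.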
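The three modes can be tried without a PDF file. The following is only a simplified sketch: join_text and the ops list are made up for illustration, plain strings stand in for PyPDF2's operator and text objects, and only the Tj and Tm operators are handled. It assumes, as in the code above, that the last operand of Tm is the new y position.

```python
def join_text(operations, distance=None):
    """Join text from (operands, operator) pairs; hypothetical stand-in for the content stream."""
    text = ""
    prev_y = 0
    for operands, operator in operations:
        if operator == "Tj":
            text += operands[0]
        elif operator == "Tm":
            if distance is True:
                # mode 2: new line after every `Tm`
                text += "\n"
            elif isinstance(distance, int):
                # mode 3: new line only for a big gap or a jump upwards
                y = operands[-1]
                diff_y = prev_y - y
                if diff_y > distance or diff_y < 0:
                    text += "\n"
                prev_y = y
    return text

# Made-up operations: two lines of one entry (gap 11), then a new entry (gap 20).
ops = [
    ([1, 0, 0, 1, 72, 700], "Tm"), (["A. 1995. First "], "Tj"),
    ([1, 0, 0, 1, 72, 689], "Tm"), (["entry."], "Tj"),
    ([1, 0, 0, 1, 72, 669], "Tm"), (["B. 1996. Second entry."], "Tj"),
]

print(repr(join_text(ops)))        # everything in one string
print(repr(join_text(ops, True)))  # new line at every `Tm`
print(repr(join_text(ops, 17)))    # new line only between entries
```

Note that the very first Tm always produces a new line in the distance mode, because prev_y starts at 0 and the first difference is negative; this matches the empty first line in the output shown earlier.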

Full code:

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in the original `extractText`

# --- functions ---

def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    # position set by the previous `Tm`
    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        if operator == b_("Tm"):
            if distance is True:
                text += '\n'
            elif isinstance(distance, int):
                # the last two operands of `Tm` are the new position
                x = operands[-2]
                y = operands[-1]

                diff_x = prev_x - x
                diff_y = prev_y - y

                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'

                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n' # to add empty line between elements

                prev_x = x
                prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    #text += myExtractText(page)        # modified function (works like original version)
    #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
    text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger than `17`)

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

BTW: it is useful not only for the References section but also for the rest of the text - it seems to split it into paragraphs.

Result for the beginning of the PDF:

---
A non-projective dependency parser
---
Pasi Tapanainen and Timo J~irvinen University of Helsinki, Department of General Linguistics Research Unit for Multilingual Language Technology P.O. Box 4, FIN-00014 University of Helsinki, Finland {Pas i. Tapanainen, Timo. Jarvinen}@l ing. Hel s inki. f i
---
Abstract
---
We describe a practical parser for unre- stricted dependencies. The parser creates links between words and names the links according to their syntactic functions. We first describe the older Constraint Gram- mar parser where many of the ideas come from. Then we proceed to describe the cen- tral ideas of our new parser. Finally, the parser is evaluated.
---
1 Introduction
---
We are concerned with surface-syntactic parsing of running text. Our main goal is to describe syntac- tic analyses of sentences using dependency links that show the he~t-modifier relations between words. In addition, these links have labels that refer to the syntactic function of the modifying word. A simpli- fied example is in Figure 1, where the link between I and see denotes that I is the modifier of see and its syntactic function is that of subject. Similarly, a modifies bird, and it is a determiner.
---
see bi i ~ d'~b~ bird
---
Figure 1: Dependencies for sentence I see a bird.
---
First, in this paper, we explain some central con- cepts of the Constraint Grammar framework from which many of the ideas are derived. Then, we give some linguistic background to the notations we are using, with a brief comparison to other current de- pendency formalisms and systems. New formalism is described briefly, and it is utilised in a small toy grammar to illustrate how the formalism works. Fi- nally, the real parsing system, with a grammar of some 2 500 rules, is evaluated.
---
64
---
The parser corresponds to over three man-years of work, which does not include the lexical analyser and the morphological disambiguator, both parts of the existing English Constraint Grammar parser (Karls- son et al., 1995). The parsers can be tested via WWW t .
---
2 Background
---
Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson (1990). A de- tMled description of the English Constraint Gram- mar (ENGCG) is in Karlsson et al. (1995). The basic rule types of the Constraint Grammar (Tapanainen, 1996) 2 are REMOVE and SELECT for discarding and se- lecting an alternative reading of a word. Rules also have contextual tests that describe the condition ac- cording to which they may be applied. For example, the rule
---
