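Using `print('>>>', operator, operands)` you can see the values it finds in the PDF. This document uses `Tm` to move the position to a new line, so I changed the original code of `extractText()` to add `\n` on every `Tm`, and I got the text in lines: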

    Arto Anttila. 1995. How to recognise subjects in
    English. In Karlsson et al., chapt. 9, pp. 315-358.
    Dekang Lin. 1996. evaluation of Principar with the
    Susanne corpus. In John Carroll, editor, Work-
    shop on Robust Parsing, pages 54-69, Prague.
    Jason M. Eisner. 1996. Three new probabilistic
    models for dependency parsing: An exploration.
    In The 16th International Conference on Compu-
    tational Linguistics, pages 340-345. Copenhagen.
    David G. Hays. 1964. Dependency theory: A
    formalism and some observations. Language,
    40(4):511-525.

or with `---` between lines:

    ---
    Arto Anttila. 1995. How to recognise subjects in
    ---
    English. In Karlsson et al., chapt. 9, pp. 315-358.
    ---
    Dekang Lin. 1996. evaluation of Principar with the
    ---
    Susanne corpus. In John Carroll, editor, Work-
    ---
    shop on Robust Parsing, pages 54-69, Prague.
    ---
    Jason M. Eisner. 1996. Three new probabilistic
    ---
    models for dependency parsing: An exploration.
    ---
    In The 16th International Conference on Compu-
    ---
    tational Linguistics, pages 340-345. Copenhagen.
    ---
    David G. Hays. 1964. Dependency theory: A
    ---
    formalism and some observations. Language,
    ---
    40(4):511-525.

It is still not very useful, but here is the code I used to get this result for now:

    import PyPDF2
    from PyPDF2.pdf import *  # to import functions used in the original `extractText`

    # --- functions ---

    def myExtractText(self):
        # code from the original `extractText()`
        # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

        text = u_("")

        content = self["/Contents"].getObject()

        if not isinstance(content, ContentStream):
            content = ContentStream(content, self.pdf)

        for operands, operator in content.operations:
            # used only for test to see values in variables
            #print('>>>', operator, operands)

            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == b_("T*"):
                text += "\n"
            elif operator == b_("'"):
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == b_('"'):
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == b_("TJ"):
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i
                text += "\n"

            # new code to add `\n` when text moves to new line
            elif operator == b_("Tm"):
                text += '\n'

        return text

    # --- main ---

    pdfFileObj = open('A97-1011.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    text = ''

    for page in pdfReader.pages:
        #text += page.extractText()  # original function
        text += myExtractText(page)  # modified function

    # get only text after word `References`
    pos = text.lower().find('references')
    text = text[pos+len('references '):]

    # print all at once
    print(text)

    # print line by line
    for line in text.split('\n'):
        print(line)
        print('---')

After some digging, it seems that `Tm` also carries values with the new position `x, y`, which can be used to calculate the distance between lines of text, so `\n` can be added only when the distance is bigger than some value. I tested different values, and with `17` I got the expected result:

    ---
    Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358.
    ---
    Dekang Lin. 1996. evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague.
    ---
    Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. Copenhagen.
    ---
    David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525.
    ---

Here is the code:

    import PyPDF2
    from PyPDF2.pdf import *  # to import functions used in the original `extractText`

    # --- functions ---

    def myExtractText2(self):
        # original code from `page.extractText()`
        # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

        text = u_("")

        content = self["/Contents"].getObject()

        if not isinstance(content, ContentStream):
            content = ContentStream(content, self.pdf)

        prev_x = 0
        prev_y = 0

        for operands, operator in content.operations:
            # used only for test to see values in variables
            #print('>>>', operator, operands)

            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == b_("T*"):
                text += "\n"
            elif operator == b_("'"):
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == b_('"'):
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == b_("TJ"):
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i
                text += "\n"

            elif operator == b_("Tm"):
                # the last two operands of `Tm` are the new position
                x = operands[-2]
                y = operands[-1]

                diff_x = prev_x - x
                diff_y = prev_y - y

                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'

                if diff_y > 17 or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n'  # to add empty line between elements

                prev_x = x
                prev_y = y

        return text

    # --- main ---

    pdfFileObj = open('A97-1011.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    text = ''

    for page in pdfReader.pages:
        #text += page.extractText()  # original function
        text += myExtractText2(page)  # modified function

    # get only text after word `References`
    pos = text.lower().find('references')
    text = text[pos+len('references '):]

    # print all at once
    print(text)

    # print line by line
    for line in text.split('\n'):
        print(line)
        print('---')

It works for this PDF, but other files may have a different structure or different distances between references, and may need other changes.
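
Since the right threshold depends on the file, one way to pick it for another PDF is to look at which vertical gaps actually occur between consecutive `Tm` operations. The sketch below is only an illustration: `vertical_gaps` and `gap_histogram` are hypothetical helper names, the operand lists are made-up values rather than data from `A97-1011.pdf`, and it assumes (as the code above does) that the last operand of `Tm` is the new `y` position.

```python
from collections import Counter


def vertical_gaps(tm_operand_lists):
    """Yield prev_y - y for each pair of consecutive Tm operations.

    Assumption: the last operand of every Tm is the new y position.
    The first Tm has no predecessor, so it produces no gap.
    """
    prev_y = None
    for operands in tm_operand_lists:
        y = float(operands[-1])
        if prev_y is not None:
            yield prev_y - y
        prev_y = y


def gap_histogram(tm_operand_lists):
    """Count how often each vertical gap (rounded to an integer) occurs."""
    return Counter(round(gap) for gap in vertical_gaps(tm_operand_lists))


# Made-up Tm operand lists (a b c d x y), for illustration only.
sample = [
    [1, 0, 0, 1, 72, 700],
    [1, 0, 0, 1, 72, 688],   # gap 12: ordinary line spacing
    [1, 0, 0, 1, 72, 676],   # gap 12
    [1, 0, 0, 1, 72, 656],   # gap 20: larger margin between entries
    [1, 0, 0, 1, 72, 644],   # gap 12
    [1, 0, 0, 1, 300, 700],  # gap -56: jump to the top of the next column
]

histogram = gap_histogram(sample)
for gap, count in sorted(histogram.items()):
    print(gap, count)
```

For a real page the operand lists could be collected in the same `content.operations` loop used above, keeping only the entries whose operator is `Tm`. A frequent small gap is probably ordinary line spacing, so a value just above it and below the next larger gap is a candidate threshold. That is a rule of thumb, not something guaranteed for every layout.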
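EDIT:

A more general version: it takes a second argument.

If you run it without the second argument

    text += myExtractText(page)

then it works like the original `extractText()` and you get everything in one string.

If the second argument is `True`

    text += myExtractText(page, True)

then it adds a new line after every `Tm`, like my first version.

If the second argument is an integer, e.g. `17`

    text += myExtractText(page, 17)

then it adds a new line only when the distance is bigger than `17`, like my second version.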

    import PyPDF2
    from PyPDF2.pdf import *  # to import functions used in the original `extractText`

    # --- functions ---

    def myExtractText(self, distance=None):
        # original code from `page.extractText()`
        # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

        text = u_("")

        content = self["/Contents"].getObject()

        if not isinstance(content, ContentStream):
            content = ContentStream(content, self.pdf)

        prev_x = 0
        prev_y = 0

        for operands, operator in content.operations:
            # used only for test to see values in variables
            #print('>>>', operator, operands)

            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == b_("T*"):
                text += "\n"
            elif operator == b_("'"):
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == b_('"'):
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == b_("TJ"):
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i
                text += "\n"

            if operator == b_("Tm"):

                if distance is True:
                    text += '\n'

                elif isinstance(distance, int):
                    # the last two operands of `Tm` are the new position
                    x = operands[-2]
                    y = operands[-1]

                    diff_x = prev_x - x
                    diff_y = prev_y - y

                    #print('>>>', diff_x, diff_y - y)
                    #text += f'| {diff_x}, {diff_y - y} |'

                    if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                        text += '\n'
                        #text += '\n'  # to add empty line between elements

                    prev_x = x
                    prev_y = y

        return text

    # --- main ---

    pdfFileObj = open('A97-1011.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    text = ''

    for page in pdfReader.pages:
        #text += page.extractText()  # original function
        #text += myExtractText(page)  # modified function (works like original version)
        #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
        text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger than `17`)

    # get only text after word `References`
    pos = text.lower().find('references')
    text = text[pos+len('references '):]

    # print all at once
    print(text)

    # print line by line
    for line in text.split('\n'):
        print(line)
        print('---')

BTW: it is useful not only for the `References` text but also for the rest of the text; it seems to split it into paragraphs.
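
The decision taken on each `Tm` in the general version can also be written as a small pure function, which makes the three modes easy to check without a PDF. This is a sketch only: `needs_newline` is a hypothetical name that is not part of PyPDF2 or of the code above, and the positions in the examples are made-up numbers.

```python
def needs_newline(distance, prev_y, y):
    """Decide whether a `Tm` move should start a new output line.

    distance is None -> never break (behaves like the original extractText)
    distance is True -> break on every Tm
    distance is int  -> break when the vertical gap is bigger than
                        `distance`, or negative (jump back up the page,
                        e.g. to the top of the next column)
    """
    if distance is None:
        return False
    if distance is True:
        return True
    if isinstance(distance, int):
        diff_y = prev_y - y
        return diff_y > distance or diff_y < 0
    # any other type is ignored, as in the general version above
    return False


# Made-up positions, for illustration only.
print(needs_newline(None, 700, 680))  # False
print(needs_newline(True, 700, 699))  # True
print(needs_newline(17, 700, 688))    # False (gap 12 is not bigger than 17)
print(needs_newline(17, 700, 680))    # True  (gap 20 is bigger than 17)
print(needs_newline(17, 100, 700))    # True  (negative gap, moved back up)
```

Keeping this check separate from the content-stream loop also makes it easier to try a different rule (for example one that looks at the horizontal position as well) for a PDF with another layout.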
Result for the beginning of the PDF:

    ---
    A non-projective dependency parser
    ---
    Pasi Tapanainen and Timo J~irvinen University of Helsinki, Department of General Linguistics Research Unit for Multilingual Language Technology P.O. Box 4, FIN-00014 University of Helsinki, Finland {Pas i. Tapanainen, Timo. Jarvinen}@l ing. Hel s inki. f i
    ---
    Abstract
    ---
    We describe a practical parser for unre- stricted dependencies. The parser creates links between words and names the links according to their syntactic functions. We first describe the older Constraint Gram- mar parser where many of the ideas come from. Then we proceed to describe the cen- tral ideas of our new parser. Finally, the parser is evaluated.
    ---
    1 Introduction
    ---
    We are concerned with surface-syntactic parsing of running text. Our main goal is to describe syntac- tic analyses of sentences using dependency links that show the he~t-modifier relations between words. In addition, these links have labels that refer to the syntactic function of the modifying word. A simpli- fied example is in Figure 1, where the link between I and see denotes that I is the modifier of see and its syntactic function is that of subject. Similarly, a modifies bird, and it is a determiner.
    ---
    see bi i ~ d'~b~ bird
    ---
    Figure 1: Dependencies for sentence I see a bird.
    ---
    First, in this paper, we explain some central con- cepts of the Constraint Grammar framework from which many of the ideas are derived. Then, we give some linguistic background to the notations we are using, with a brief comparison to other current de- pendency formalisms and systems. New formalism is described briefly, and it is utilised in a small toy grammar to illustrate how the formalism works. Fi- nally, the real parsing system, with a grammar of some 2 500 rules, is evaluated.
    ---
    64
    ---
    The parser corresponds to over three man-years of work, which does not include the lexical analyser and the morphological disambiguator, both parts of the existing English Constraint Grammar parser (Karls- son et al., 1995). The parsers can be tested via WWW t .
    ---
    2 Background
    ---
    Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson (1990). A de- tMled description of the English Constraint Gram- mar (ENGCG) is in Karlsson et al. (1995). The basic rule types of the Constraint Grammar (Tapanainen, 1996) 2 are REMOVE and SELECT for discarding and se- lecting an alternative reading of a word. Rules also have contextual tests that describe the condition ac- cording to which they may be applied. For example, the rule
    ---


