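As plinth and David van Driessche have already pointed out in their answers, extracting text from a PDF file is non-trivial. Fortunately, the classes in the parser package of iText do most of the heavy lifting for you. You have already found at least one class from that package, PdfTextExtractor, but that class is essentially a convenience utility for using iText's parser functionality when you are only interested in the plain text of a page. In your case, you have to dig deeper into the classes of that package.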
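
A good starting point for information on text extraction with iText is section 15.3, Parsing PDFs, of iText in Action, 2nd Edition, in particular the method extractText of the sample ParsingHelloWorld.java: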

    public void extractText(String src, String dest) throws IOException
    {
        PrintWriter out = new PrintWriter(new FileOutputStream(dest));
        PdfReader reader = new PdfReader(src);
        RenderListener listener = new MyTextRenderListener(out);
        PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
        PdfDictionary pageDic = reader.getPageN(1);
        PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
        processor.processContent(ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
        out.flush();
        out.close();
    }

It makes use of the RenderListener implementation MyTextRenderListener.java:

    public class MyTextRenderListener implements RenderListener
    {
        [...]

        public void renderText(TextRenderInfo renderInfo) {
            out.print("<");
            out.print(renderInfo.getText());
            out.print(">");
        }
    }

While this RenderListener implementation merely outputs the text, the TextRenderInfo object it inspects offers more information:

    public LineSegment getBaseline();    // the baseline for the text (i.e. the line that the text 'sits' on)
    public LineSegment getAscentLine();  // the ascentline for the text (i.e. the line that represents the topmost extent that a string of the current font could have)
    public LineSegment getDescentLine(); // the descentline for the text (i.e. the line that represents the bottom most extent that a string of the current font could have)
    public float getRise();              // the rise which represents how far above the nominal baseline the text should be rendered
    public String getText();             // the text to render
    public int getTextRenderMode();      // the text render mode
    public DocumentFont getFont();       // the font
    public float getSingleSpaceWidth();  // the width, in user space units, of a single space character in the current font
    public List<TextRenderInfo> getCharacterRenderInfos(); // details useful if a listener needs access to the position of each individual glyph in the text render operation

Thus, if your RenderListener, in addition to inspecting the text with getText(), also considers getBaseline() or even getAscentLine() and getDescentLine(), you have all the coordinates you will likely need.
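
To illustrate how those lines turn into a usable position, here is a minimal sketch that builds an axis-aligned bounding box from an ascent line and a descent line. It is deliberately independent of iText so that it compiles on its own: Segment, PositionedText and TextBoxDemo are hypothetical names introduced for this example, not iText classes. In a real listener, the four numbers of each Segment would be taken from the start and end points of the LineSegment objects returned by getAscentLine() and getDescentLine().

```java
// Hypothetical stand-in for the two end points of an iText LineSegment,
// expressed in user space units. This is NOT an iText class.
final class Segment {
    final float startX, startY, endX, endY;

    Segment(float startX, float startY, float endX, float endY) {
        this.startX = startX;
        this.startY = startY;
        this.endX = endX;
        this.endY = endY;
    }
}

// A chunk of text together with the axis-aligned box that encloses it.
final class PositionedText {
    final String text;
    final float left, bottom, right, top;

    private PositionedText(String text, float left, float bottom, float right, float top) {
        this.text = text;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }

    // Takes the extremes over all four end points, so the result is still
    // an enclosing box when the text is rotated (then it is just looser).
    static PositionedText of(String text, Segment ascent, Segment descent) {
        float left = min4(ascent.startX, ascent.endX, descent.startX, descent.endX);
        float right = max4(ascent.startX, ascent.endX, descent.startX, descent.endX);
        float bottom = min4(ascent.startY, ascent.endY, descent.startY, descent.endY);
        float top = max4(ascent.startY, ascent.endY, descent.startY, descent.endY);
        return new PositionedText(text, left, bottom, right, top);
    }

    private static float min4(float a, float b, float c, float d) {
        return Math.min(Math.min(a, b), Math.min(c, d));
    }

    private static float max4(float a, float b, float c, float d) {
        return Math.max(Math.max(a, b), Math.max(c, d));
    }
}

final class TextBoxDemo {
    public static void main(String[] args) {
        // Made-up coordinates for a horizontal chunk of text.
        Segment ascent = new Segment(36f, 815f, 100f, 815f);
        Segment descent = new Segment(36f, 803f, 100f, 803f);
        PositionedText box = PositionedText.of("Hello", ascent, descent);
        System.out.println(box.text + " x=[" + box.left + ", " + box.right
                + "] y=[" + box.bottom + ", " + box.top + "]");
    }
}
```

A listener could collect one such PositionedText per renderText call and sort or filter them by position afterwards. If you only need the position of the text line rather than its full height, the baseline alone is enough.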

PS: There is a wrapper class for the code in ParsingHelloWorld.extractText(), PdfReaderContentParser, which allows you to simply write the following, given a PdfReader reader, an int page, and a RenderListener renderListener:

    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    parser.processContent(page, renderListener);



