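Add the dependency to pom.xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.24</version>
</dependency>
Code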
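import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;

public void pdf2Image(String pdfPath, String path) throws InterruptedException, IOException {
    File file = new File(pdfPath);
    // try-with-resources closes the document even if extraction or rendering fails
    try (PDDocument doc = PDDocument.load(file)) {
        int pageCount = doc.getNumberOfPages();
        // Extract the text first to check that the content itself parses correctly
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true);
        stripper.setStartPage(1);
        stripper.setEndPage(pageCount);
        String content = stripper.getText(doc);
        System.out.println("Parsed PDF content: " + content);
        // Render every page to an image
        PDFRenderer renderer = new PDFRenderer(doc);
        for (int i = 0; i < pageCount; i++) {
            // scale 1.0 = 72 DPI, so 1.5f renders at 108 DPI
            BufferedImage image = renderer.renderImage(i, 1.5f);
            // One file per page; writing every page to the same path would keep only the last one.
            // Here path is treated as a file-name prefix.
            File pageFile = new File(path + "_" + (i + 1) + ".jpg");
            ImageIO.write(image, "JPG", pageFile);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}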
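The second argument of renderImage is a scale factor, not a DPI value. PDF user space uses 72 units per inch, so a scale of 1.0 corresponds to 72 DPI. The small helper below is only an illustrative sketch of that arithmetic; the class and method names are made up for this example and are not part of PDFBox. (PDFRenderer also offers renderImageWithDPI, which takes the DPI directly.)

```java
public class RenderScale {
    // PDF user space is defined as 72 units per inch, so scale 1.0 means 72 DPI.
    private static final float PDF_BASE_DPI = 72f;

    /** Converts a renderImage scale factor to the resulting DPI. */
    public static float dpiForScale(float scale) {
        return scale * PDF_BASE_DPI;
    }

    /** Converts a target DPI to the scale factor to pass to renderImage. */
    public static float scaleForDpi(float dpi) {
        return dpi / PDF_BASE_DPI;
    }

    public static void main(String[] args) {
        System.out.println(dpiForScale(1.5f)); // prints 108.0
        System.out.println(scaleForDpi(144f)); // prints 2.0
    }
}
```

So the 1.5f used in the code above renders each page at 108 DPI; a larger scale gives sharper text at the cost of bigger images.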
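After running it, the converted image was missing one character, even though the extracted text contained it. The log showed:
Using fallback FZCHSJW--GB1-0 for CID-keyed font STSong-Light
No glyph for 27765 (CID 38ac) in font STSong-Light
So the STSong-Light font was not found at runtime and PDFBox substituted FZCHSJW--GB1-0 for it. That fallback font has no glyph for the character, which is why the character is missing from the image.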
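When a log contains many such warnings, it can help to summarize which fonts were substituted. The following is a minimal sketch that uses only the JDK and assumes the warning has exactly the wording quoted above; other PDFBox versions may phrase it differently. The class name FontFallbackLog and its method are hypothetical helpers, not PDFBox APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FontFallbackLog {
    // Matches the warning format quoted in this post (assumed, not a stable contract).
    private static final Pattern FALLBACK =
            Pattern.compile("Using fallback (\\S+) for CID-keyed font (\\S+)");

    /** Returns requested font name -> fallback font name, in order of first appearance. */
    public static Map<String, String> substitutions(String log) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = FALLBACK.matcher(log);
        while (m.find()) {
            // group(2) is the font the PDF asked for, group(1) the font actually used
            result.putIfAbsent(m.group(2), m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String log = "Using fallback FZCHSJW--GB1-0 for CID-keyed font STSong-Light\n"
                + "No glyph for 27765 (CID 38ac) in font STSong-Light";
        System.out.println(substitutions(log)); // prints {STSong-Light=FZCHSJW--GB1-0}
    }
}
```

Each key in the result is a font that was not available at runtime, which gives a list of fonts to look into.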
Solution:
https://www.jianshu.com/p/b8692da38692



