Converting between HTML and Word in Java with POI

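The project's back end uses Spring Boot and Maven, and the front end uses the CKEditor rich text editor. At present the Word file generated from HTML is in doc format, while the image handling only supports docx, so the doc file has to be saved as docx by hand before the image placeholders can be replaced.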

1. Add the Maven dependencies

The dependencies below are mainly the POI-related ones; jsoup is also included to make it easier to read the image elements in the HTML:


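<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi</artifactId>
  <version>3.14</version>
</dependency>

<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-scratchpad</artifactId>
  <version>3.14</version>
</dependency>

<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml</artifactId>
  <version>3.14</version>
</dependency>

<dependency>
  <groupId>fr.opensagres.xdocreport</groupId>
  <artifactId>xdocreport</artifactId>
  <version>1.0.6</version>
</dependency>

<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml-schemas</artifactId>
  <version>3.14</version>
</dependency>

<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>ooxml-schemas</artifactId>
  <version>1.3</version>
</dependency>

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.11.3</version>
</dependency>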

2. Converting Word to HTML

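Create a static folder under the resources directory of the Spring Boot project and copy the Word file to be converted (temp.docx) into it. Because static is one of Spring Boot's default static resource locations, no extra configuration is needed; if you use a different folder name, configure it in application.yml accordingly.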

Converting the doc format to HTML:

public static String docToHtml() throws Exception {
  File path = new File(ResourceUtils.getURL("classpath:").getPath());
  String imagePathStr = path.getAbsolutePath() + "\\static\\image\\";
  String sourceFileName = path.getAbsolutePath() + "\\static\\test.doc";
  String targetFileName = path.getAbsolutePath() + "\\static\\test2.html";
  // Make sure the folder that receives the extracted images exists
  File file = new File(imagePathStr);
  if (!file.exists()) {
    file.mkdirs();
  }
  HWPFDocument wordDocument = new HWPFDocument(new FileInputStream(sourceFileName));
  org.w3c.dom.Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
  WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(document);
  // Save each picture and return its path relative to the HTML file
  wordToHtmlConverter.setPicturesManager((content, pictureType, name, width, height) -> {
    try (FileOutputStream out = new FileOutputStream(imagePathStr + name)) {
      out.write(content);
    } catch (Exception e) {
      e.printStackTrace();
    }
    return "image/" + name;
  });
  wordToHtmlConverter.processDocument(wordDocument);
  org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
  // Serialize the generated DOM tree to an HTML file
  DOMSource domSource = new DOMSource(htmlDocument);
  StreamResult streamResult = new StreamResult(new File(targetFileName));
  TransformerFactory tf = TransformerFactory.newInstance();
  Transformer serializer = tf.newTransformer();
  serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
  serializer.setOutputProperty(OutputKeys.INDENT, "yes");
  serializer.setOutputProperty(OutputKeys.METHOD, "html");
  serializer.transform(domSource, streamResult);
  return targetFileName;
}

Converting the docx format to HTML:

public static String docxToHtml() throws Exception {
  File path = new File(ResourceUtils.getURL("classpath:").getPath());
  String imagePath = path.getAbsolutePath() + "\\static\\image";
  String sourceFileName = path.getAbsolutePath() + "\\static\\test.docx";
  String targetFileName = path.getAbsolutePath() + "\\static\\test.html";

  OutputStreamWriter outputStreamWriter = null;
  try {
    XWPFDocument document = new XWPFDocument(new FileInputStream(sourceFileName));
    XHTMLOptions options = XHTMLOptions.create();
    // Folder in which the extracted images are stored
    options.setExtractor(new FileImageExtractor(new File(imagePath)));
    // Path prefix used for images inside the generated HTML
    options.URIResolver(new BasicURIResolver("image"));
    outputStreamWriter = new OutputStreamWriter(new FileOutputStream(targetFileName), "utf-8");
    XHTMLConverter xhtmlConverter = (XHTMLConverter) XHTMLConverter.getInstance();
    xhtmlConverter.convert(document, outputStreamWriter, options);
  } finally {
    if (outputStreamWriter != null) {
      outputStreamWriter.close();
    }
  }
  return targetFileName;
}
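
Both conversion methods build their paths with Windows-style separators. As a minimal sketch, assuming the same static folder layout, the paths could instead be assembled with java.nio.file so that they also resolve on other platforms. The class name StaticPaths and the method underStatic are hypothetical and not part of the original project:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class StaticPaths {
    // Resolves a location under <classpathRoot>/static using the platform's own separator.
    public static Path underStatic(String classpathRoot, String... parts) {
        Path result = Paths.get(classpathRoot, "static");
        for (String part : parts) {
            result = result.resolve(part);
        }
        return result;
    }

    public static void main(String[] args) {
        // The root used here is only a placeholder for the resolved classpath directory.
        String root = "classes";
        System.out.println(underStatic(root, "image"));
        System.out.println(underStatic(root, "test.docx"));
        System.out.println(underStatic(root, "test.html"));
    }
}
```

The resulting Path can be turned back into a String with toString() or into a File with toFile() wherever the POI and XDocReport calls expect one.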

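After a successful conversion the corresponding HTML file is generated. To display it in the front end, read the file into a String and return that String to the front end.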

public static String readfile(String filePath) {
  try {
    // Read all bytes before decoding so that a multi-byte UTF-8 character
    // is never split across two buffer reads
    byte[] bytes = Files.readAllBytes(Paths.get(filePath));
    return new String(bytes, StandardCharsets.UTF_8);
  } catch (IOException e) {
    // Covers a missing file as well as read errors
    e.printStackTrace();
    return "";
  }
}
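
When the generated HTML contains non-ASCII text such as Chinese, the bytes should be decoded as one complete sequence. The following sketch, which uses only the JDK, compares decoding a UTF-8 byte array in fixed-size chunks with decoding it in one piece. The class name Utf8ChunkDemo and the chunk size of 4 are illustrative assumptions chosen so that a chunk boundary falls inside a three-byte character:

```java
import java.nio.charset.StandardCharsets;

public class Utf8ChunkDemo {
    // Decodes each chunk on its own, the way a buffer-by-buffer read loop would.
    public static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Decodes the complete byte sequence in one step.
    public static String decodeWhole(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is a two-character Chinese string; each character takes 3 bytes in UTF-8.
        String html = "<p>\u4e2d\u6587</p>";
        byte[] data = html.getBytes(StandardCharsets.UTF_8);
        System.out.println("whole:   " + decodeWhole(data).equals(html));
        System.out.println("chunked: " + decodeInChunks(data, 4).equals(html));
    }
}
```

With a 1024-byte buffer the same effect only appears when a character happens to straddle a buffer boundary, which makes it easy to miss in short test files.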

How the result renders in the CKEditor rich text editor:

3. Converting HTML to Word

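The idea is to first extract every image element from the HTML and replace each one with a placeholder variable of the form "${imgReplace}"; with several images the placeholders are numbered in order (${imgReplace1}, ${imgReplace2}, and so on). A doc file is then generated from the result (generating a docx file directly produced a file that could not be opened, and no good solution to this has been found yet). That file is saved as docx by hand, after which the placeholders can be replaced with the images: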

public static String writeWordFile(String content) {
  String path = "D:/wordFile";
  // Maps each placeholder such as ${imgReplace1} to the data of its image
  Map<String, Object> param = new HashMap<>();

  if (!"".equals(path)) {
    File fileDir = new File(path);
    if (!fileDir.exists()) {
      fileDir.mkdirs();
    }
    content = HtmlUtils.htmlUnescape(content);
    List<HashMap<String, String>> imgs = getImgStr(content);
    int count = 0;
    for (HashMap<String, String> img : imgs) {
      count++;
      // Replace img tags that end with "/>"
      content = content.replace(img.get("img"), "${imgReplace" + count + "}");
      // Replace img tags that end with ">"
      content = content.replace(img.get("img1"), "${imgReplace" + count + "}");
      Map<String, Object> header = new HashMap<>();

      try {
        File filePath = new File(ResourceUtils.getURL("classpath:").getPath());
        String imagePath = filePath.getAbsolutePath() + "\\static\\";
        imagePath += img.get("src").replace("/", "\\");
        // Fall back to 400*300 when the tag has no width or height attribute
        if (img.get("width") == null || img.get("height") == null) {
          header.put("width", 400);
          header.put("height", 300);
        } else {
          header.put("width", (int) (Double.parseDouble(img.get("width"))));
          header.put("height", (int) (Double.parseDouble(img.get("height"))));
        }
        header.put("type", "jpg");
        header.put("content", OfficeUtil.inputStream2ByteArray(new FileInputStream(imagePath), true));
      } catch (FileNotFoundException e) {
        e.printStackTrace();
      }
      param.put("${imgReplace" + count + "}", header);
    }
    try {
      // Generate a Word document in doc format; it has to be saved as docx by hand
      byte[] by = content.getBytes("UTF-8");
      ByteArrayInputStream bais = new ByteArrayInputStream(by);
      POIFSFileSystem poifs = new POIFSFileSystem();
      DirectoryEntry directory = poifs.getRoot();
      DocumentEntry documentEntry = directory.createDocument("WordDocument", bais);
      FileOutputStream ostream = new FileOutputStream("D:\\wordFile\\temp.doc");
      poifs.writeFilesystem(ostream);
      bais.close();
      ostream.close();

      // Temporary file (the docx file that was converted by hand)
      CustomXWPFDocument doc = OfficeUtil.generateWord(param, "D:\\wordFile\\temp.docx");
      // The final Word file with the images inserted
      FileOutputStream fopts = new FileOutputStream("D:\\wordFile\\final.docx");
      doc.write(fopts);
      fopts.close();
    } catch (Exception e) {
      e.printStackTrace();
    }

  }
  return "D:/wordFile/final.docx";
}

// Collect information about the image elements in the HTML
public static List<HashMap<String, String>> getImgStr(String htmlStr) {
  List<HashMap<String, String>> pics = new ArrayList<>();

  // Document, Elements and Element are the jsoup types here
  Document doc = Jsoup.parse(htmlStr);
  Elements imgs = doc.select("img");
  for (Element img : imgs) {
    HashMap<String, String> map = new HashMap<>();
    // width and height are expected in the form "400px"; strip the trailing "px"
    if (!"".equals(img.attr("width"))) {
      map.put("width", img.attr("width").substring(0, img.attr("width").length() - 2));
    }
    if (!"".equals(img.attr("height"))) {
      map.put("height", img.attr("height").substring(0, img.attr("height").length() - 2));
    }
    // "img" is the tag closed with "/>", "img1" the tag closed with ">"
    map.put("img", img.toString().substring(0, img.toString().length() - 1) + "/>");
    map.put("img1", img.toString());
    map.put("src", img.attr("src"));
    pics.add(map);
  }
  return pics;
}
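
The placeholder numbering can also be illustrated without jsoup. The sketch below is not the method used in this article: it swaps in a JDK regular expression in place of jsoup plus String.replace, and it is only adequate for simple editor output in which attribute values contain no ">" character. The class name ImgPlaceholders and the method replaceImages are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgPlaceholders {
    // Matches both <img ...> and <img .../> tags.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Captures the value of a double-quoted src attribute.
    private static final Pattern SRC =
            Pattern.compile("src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Replaces the n-th img tag with ${imgReplaceN} and appends its src to srcOut, in document order.
    public static String replaceImages(String html, List<String> srcOut) {
        Matcher m = IMG.matcher(html);
        StringBuffer sb = new StringBuffer();
        int count = 0;
        while (m.find()) {
            count++;
            Matcher s = SRC.matcher(m.group());
            srcOut.add(s.find() ? s.group(1) : "");
            // quoteReplacement is required because "$" is special in replacement strings
            m.appendReplacement(sb, Matcher.quoteReplacement("${imgReplace" + count + "}"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sources = new ArrayList<>();
        String html = "<p>a<img src=\"image/1.jpg\" width=\"400px\"/>b<img src=\"image/2.png\">c</p>";
        System.out.println(replaceImages(html, sources));
        System.out.println(sources);
    }
}
```

Because each match is replaced at its own position, two identical img tags still receive two different placeholder numbers.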

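The OfficeUtil utility class. The versions of this class found online only support replacing a single image and throw an exception when there are several. The reason is that adding a picture changes the size of runs inside processParagraphs, which triggers an ArrayList exception, just as removing elements from a list while looping over it does. The fix is to copy the runs into a new ArrayList and loop over the copy: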

package com.example.demo.util;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.poi.POIXMLDocument;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class OfficeUtil {

  // Opens the docx template and replaces every placeholder found in param
  public static CustomXWPFDocument generateWord(Map<String, Object> param, String template) {
    CustomXWPFDocument doc = null;
    try {
      OPCPackage pack = POIXMLDocument.openPackage(template);
      doc = new CustomXWPFDocument(pack);
      if (param != null && param.size() > 0) {

        // Process the paragraphs
        List<XWPFParagraph> paragraphList = doc.getParagraphs();
        processParagraphs(paragraphList, param, doc);

        // Process the tables
        Iterator<XWPFTable> it = doc.getTablesIterator();
        while (it.hasNext()) {
          XWPFTable table = it.next();
          List<XWPFTableRow> rows = table.getRows();
          for (XWPFTableRow row : rows) {
            List<XWPFTableCell> cells = row.getTableCells();
            for (XWPFTableCell cell : cells) {
              List<XWPFParagraph> paragraphListTable = cell.getParagraphs();
              processParagraphs(paragraphListTable, param, doc);
            }
          }
        }
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
    return doc;
  }

  public static void processParagraphs(List<XWPFParagraph> paragraphList, Map<String, Object> param, CustomXWPFDocument doc) {
    if (paragraphList != null && paragraphList.size() > 0) {
      for (XWPFParagraph paragraph : paragraphList) {
        // The spacing produced by the POI conversion is too large and has to be adjusted by hand
        if (paragraph.getSpacingBefore() >= 1000 || paragraph.getSpacingAfter() > 1000) {
          paragraph.setSpacingBefore(0);
          paragraph.setSpacingAfter(0);
        }
        // Set the left and right indentation in the Word document
        paragraph.setIndentationLeft(0);
        paragraph.setIndentationRight(0);
        List<XWPFRun> runs = paragraph.getRuns();
        // Adding a picture changes the size of the paragraph's runs, so loop over a copy instead of runs
        List<XWPFRun> allRuns = new ArrayList<>(runs);
        for (XWPFRun run : allRuns) {
          String text = run.getText(0);
          if (text != null) {
            boolean isSetText = false;
            for (Entry<String, Object> entry : param.entrySet()) {
              String key = entry.getKey();
              if (text.indexOf(key) != -1) {
                isSetText = true;
                Object value = entry.getValue();
                if (value instanceof String) { // text replacement
                  text = text.replace(key, value.toString());
                } else if (value instanceof Map) { // image replacement
                  text = text.replace(key, "");
                  Map<?, ?> pic = (Map<?, ?>) value;
                  int width = Integer.parseInt(pic.get("width").toString());
                  int height = Integer.parseInt(pic.get("height").toString());
                  int picType = getPictureType(pic.get("type").toString());
                  byte[] byteArray = (byte[]) pic.get("content");
                  ByteArrayInputStream byteInputStream = new ByteArrayInputStream(byteArray);
                  try {
                    String blipId = doc.addPictureData(byteInputStream, picType);
                    doc.createPicture(blipId, doc.getNextPicNameNumber(picType), width, height, paragraph);
                  } catch (Exception e) {
                    e.printStackTrace();
                  }
                }
              }
            }
            if (isSetText) {
              run.setText(text, 0);
            }
          }
        }
      }
    }
  }

  // Maps a file extension to the matching POI picture type constant
  private static int getPictureType(String picType) {
    int res = CustomXWPFDocument.PICTURE_TYPE_PICT;
    if (picType != null) {
      if (picType.equalsIgnoreCase("png")) {
        res = CustomXWPFDocument.PICTURE_TYPE_PNG;
      } else if (picType.equalsIgnoreCase("dib")) {
        res = CustomXWPFDocument.PICTURE_TYPE_DIB;
      } else if (picType.equalsIgnoreCase("emf")) {
        res = CustomXWPFDocument.PICTURE_TYPE_EMF;
      } else if (picType.equalsIgnoreCase("jpg") || picType.equalsIgnoreCase("jpeg")) {
        res = CustomXWPFDocument.PICTURE_TYPE_JPEG;
      } else if (picType.equalsIgnoreCase("wmf")) {
        res = CustomXWPFDocument.PICTURE_TYPE_WMF;
      }
    }
    return res;
  }

  public static byte[] inputStream2ByteArray(InputStream in, boolean isClose) {
    byte[] byteArray = null;
    try {
      // available() is only an estimate, so keep reading until the end of the stream
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
      byteArray = out.toByteArray();
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      if (isClose) {
        try {
          in.close();
        } catch (Exception e2) {
          System.out.println("Failed to close the stream");
        }
      }
    }
    return byteArray;
  }
}
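
The reason for looping over a copy can be reproduced with plain JDK collections. In the sketch below an ordinary ArrayList of strings stands in for the paragraph's run list, and adding an element stands in for the run that is created when a picture is inserted; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CopyBeforeIterate {
    // Iterates the list directly while adding to it; returns false if the iteration fails.
    public static boolean mutateWhileIterating(List<String> runs) {
        try {
            for (String run : runs) {
                if (run.startsWith("${")) {
                    runs.add("picture for " + run);
                }
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    // Iterates a snapshot, so additions to the original list do not disturb the loop.
    public static void mutateUsingSnapshot(List<String> runs) {
        for (String run : new ArrayList<>(runs)) {
            if (run.startsWith("${")) {
                runs.add("picture for " + run);
            }
        }
    }

    public static void main(String[] args) {
        List<String> direct = new ArrayList<>(Arrays.asList("text", "${imgReplace1}", "more"));
        System.out.println("direct loop succeeded: " + mutateWhileIterating(direct));

        List<String> copied = new ArrayList<>(Arrays.asList("text", "${imgReplace1}", "more"));
        mutateUsingSnapshot(copied);
        System.out.println("after snapshot loop: " + copied);
    }
}
```

The snapshot costs one extra list per paragraph, which is negligible next to the cost of parsing the document.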

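I believe the main reason image replacement is not supported for Word 2003 is that HWPFDocument, the class that handles the 2003 format, is declared final, so its methods cannot be overridden. The class that handles the 2007 format, XWPFDocument, can be extended, so image replacement can be implemented by subclassing XWPFDocument and providing a createPicture method. The corresponding CustomXWPFDocument class is shown below: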

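package com.example.demo.util;

import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.xmlbeans.XmlException;
import org.apache.xmlbeans.XmlToken;
import org.openxmlformats.schemas.drawingml.x2006.main.CTNonVisualDrawingProps;
import org.openxmlformats.schemas.drawingml.x2006.main.CTPositiveSize2D;
import org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTInline;

public class CustomXWPFDocument extends XWPFDocument {
  public CustomXWPFDocument(InputStream in) throws IOException {
    super(in);
  }

  public CustomXWPFDocument() {
    super();
  }

  public CustomXWPFDocument(OPCPackage pkg) throws IOException {
    super(pkg);
  }

  // Appends an inline picture to the paragraph; width and height are given in pixels
  public void createPicture(String blipId, int ind, int width, int height, XWPFParagraph paragraph) {
    // Convert pixels to EMU, the unit used by DrawingML
    final int EMU = 9525;
    width *= EMU;
    height *= EMU;
    CTInline inline = paragraph.createRun().getCTR().addNewDrawing().addNewInline();
    // DrawingML markup of an inline picture that references the embedded image through blipId
    String picXml = ""
        + "<a:graphic xmlns:a=\"http://schemas.openxmlformats.org/drawingml/2006/main\">"
        + "  <a:graphicData uri=\"http://schemas.openxmlformats.org/drawingml/2006/picture\">"
        + "   <pic:pic xmlns:pic=\"http://schemas.openxmlformats.org/drawingml/2006/picture\">"
        + "     <pic:nvPicPr>" + "      <pic:cNvPr id=\"" + ind + "\" name=\"Generated\"/>"
        + "      <pic:cNvPicPr/>"
        + "     </pic:nvPicPr>"
        + "     <pic:blipFill>"
        + "      <a:blip r:embed=\"" + blipId
        + "\" xmlns:r=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships\"/>"
        + "      <a:stretch>"
        + "        <a:fillRect/>"
        + "      </a:stretch>"
        + "     </pic:blipFill>"
        + "     <pic:spPr>"
        + "      <a:xfrm>"
        + "        <a:off x=\"0\" y=\"0\"/>"
        + "        <a:ext cx=\"" + width + "\" cy=\"" + height + "\"/>"
        + "      </a:xfrm>"
        + "      <a:prstGeom prst=\"rect\">"
        + "        <a:avLst/>"
        + "      </a:prstGeom>"
        + "     </pic:spPr>"
        + "   </pic:pic>"
        + "  </a:graphicData>" + "</a:graphic>";

    inline.addNewGraphic().addNewGraphicData();
    XmlToken xmlToken = null;
    try {
      xmlToken = XmlToken.Factory.parse(picXml);
    } catch (XmlException xe) {
      xe.printStackTrace();
    }
    inline.set(xmlToken);

    // No extra distance between the picture and the surrounding text
    inline.setDistT(0);
    inline.setDistB(0);
    inline.setDistL(0);
    inline.setDistR(0);

    CTPositiveSize2D extent = inline.addNewExtent();
    extent.setCx(width);
    extent.setCy(height);

    CTNonVisualDrawingProps docPr = inline.addNewDocPr();
    docPr.setId(ind);
    docPr.setName("Picture " + ind);
    docPr.setDescr("test");
  }
}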
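
The constant 9525 used in createPicture follows from the definition of the English Metric Unit: one inch is 914400 EMU, so at an assumed screen resolution of 96 pixels per inch one pixel corresponds to 914400 / 96 = 9525 EMU. A small sketch of that arithmetic, with a hypothetical class name and the 96 DPI figure as the stated assumption:

```java
public class EmuUnits {
    public static final int EMU_PER_INCH = 914400;
    // Assumption: the pixel dimensions refer to a 96 DPI display.
    public static final int DPI = 96;
    public static final int EMU_PER_PIXEL = EMU_PER_INCH / DPI;

    // Converts a length in pixels to EMU; long avoids overflow for large images.
    public static long pixelsToEmu(int pixels) {
        return (long) pixels * EMU_PER_PIXEL;
    }

    public static void main(String[] args) {
        // The default image size used when an img tag has no width or height
        System.out.println(pixelsToEmu(400) + " x " + pixelsToEmu(300));
    }
}
```

For the default 400*300 image this gives an extent of 3810000 by 2857500 EMU.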

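That is how to convert between HTML and Word with POI. The problem that HTML cannot be converted directly into a readable docx file remains unsolved; if you have a good solution, feel free to share it.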
