Java CAPTCHA Recognition and the Tesseract Training Process

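Contents
1. Download and install tesseract-ocr
2. Download and install jTessBoxEditor 2.0

1. Download and install Tesseract OCR (image recognition library) v4.0.0.20181030 (the installation provides the tessdata path that the Java code loads)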

1.1 Loading the Tesseract data path and running recognition from Java

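import java.io.File;
// Tess4J classes
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

private static void getIdentifyPictrue() {
    // Location of the CAPTCHA image (example path; adjust to your environment)
    File imageFile = new File("E:\\tessreact\\pictrue\\11.jpg");
    if (!imageFile.exists()) {
        System.out.println("Image does not exist");
        return; // nothing to recognize
    }
    Tesseract tesseract = new Tesseract();
    // Language model to load (eng.traineddata)
    tesseract.setLanguage("eng");
    // Directory that contains the .traineddata files
    tesseract.setDatapath("E:\\tessreact\\tesseract-ocr\\tessdata");

    try {
        String result = "Recognition result: " + tesseract.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        e.printStackTrace();
    }
}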
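Tesseract loads <language>.traineddata from the directory passed to setDatapath, so it can be useful to check for that file before calling doOCR. The following is a minimal sketch using only the JDK; the class name TessdataCheck and its methods are hypothetical helpers, not part of Tess4J.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Tess4J.
final class TessdataCheck {

    /** Path of the model file Tesseract would load for the given language. */
    static Path trainedDataFile(String dataPath, String language) {
        return Paths.get(dataPath, language + ".traineddata");
    }

    /** True when the language model is present in the tessdata directory. */
    static boolean isLanguageInstalled(String dataPath, String language) {
        return Files.isRegularFile(trainedDataFile(dataPath, language));
    }

    public static void main(String[] args) {
        // Assumed default; pass the real tessdata directory as the first argument.
        String dataPath = args.length > 0 ? args[0] : "tessdata";
        if (isLanguageInstalled(dataPath, "eng")) {
            System.out.println("Found " + trainedDataFile(dataPath, "eng"));
        } else {
            System.out.println("Missing " + trainedDataFile(dataPath, "eng"));
        }
    }
}
```

Calling isLanguageInstalled with the same two values given to setDatapath and setLanguage gives an early, readable error instead of a failure inside the OCR call.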

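2. Training on CAPTCHA images with jTessBoxEditor 2.0

2.1 Download jTessBoxEditor 2.0, a tool for correcting the recognized characters and their box positions on an image (download: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/)

2.2 CAPTCHA training steps

1. Add the Tesseract installation directory to the PATH environment variable. (Without it, you have to locate tesseract.exe yourself and give its path on every command line.)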

C:\Program Files (x86)\Tesseract-OCR
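To confirm that the installation directory is really on the PATH before running the training commands, a small check like the one below may help. It is only a sketch: the class TesseractPathCheck and the method findOnPath are made up for illustration, and it assumes the Windows executable name tesseract.exe.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper; not part of Tesseract or Tess4J.
final class TesseractPathCheck {

    /** Returns the first PATH entry that contains the given executable, if any. */
    static Optional<Path> findOnPath(String pathValue, String exeName) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, exeName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip malformed PATH entries.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> exe = findOnPath(System.getenv("PATH"), "tesseract.exe");
        System.out.println(exe.map(p -> "Found: " + p)
                .orElse("tesseract.exe is not on the PATH"));
    }
}
```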

2. Use jTessBoxEditor to merge the sample images into a tif file (Tools -> Merge TIFF)

Note the naming rule for the tif file, for example eng.normal.exp0.tif. The format is [lang].[fontname].exp[num].tif, where num is a number of your choice.
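The naming rule above can be expressed as a small helper that builds and checks file names. This is a sketch under assumptions: the class TrainingFileName is hypothetical, and the regular expression simply accepts any lang and fontname that contain no dots or whitespace, which is looser than what the tools may actually require.

```java
import java.util.regex.Pattern;

// Hypothetical helper for the [lang].[fontname].exp[num].tif convention.
final class TrainingFileName {

    // Assumption: lang and fontname are any run of characters without dots or spaces.
    private static final Pattern NAME =
            Pattern.compile("[^.\\s]+\\.[^.\\s]+\\.exp\\d+\\.tif");

    /** Builds a file name such as eng.normal.exp0.tif. */
    static String build(String lang, String fontName, int num) {
        return lang + "." + fontName + ".exp" + num + ".tif";
    }

    /** Checks whether a file name follows the convention. */
    static boolean isValid(String fileName) {
        return NAME.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String name = build("eng", "normal", 0);
        System.out.println(name + " valid: " + isValid(name));
    }
}
```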

3. Generate the box file

tesseract eng.normal.exp0.tif eng.normal.exp0 batch.nochop makebox

4. Open the tif file in jTessBoxEditor, correct the recognized characters and box positions, and save

5. Generate the tr file

tesseract eng.normal.exp0.tif eng.normal.exp0 nobatch box.train
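If you prefer to drive the commands from steps 3 and 5 from Java rather than typing them, they can be launched with ProcessBuilder. The sketch below assumes tesseract is on the PATH; the class BoxTraining and its methods are hypothetical names, and the manual correction in step 4 still has to happen between the two calls.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper; assumes the tesseract executable is on the PATH.
final class BoxTraining {

    /** Step 3: command that produces <base>.box from <base>.tif. */
    static List<String> makeBoxCommand(String base) {
        return Arrays.asList("tesseract", base + ".tif", base, "batch.nochop", "makebox");
    }

    /** Step 5: command that produces <base>.tr from the tif and the corrected box file. */
    static List<String> boxTrainCommand(String base) {
        return Arrays.asList("tesseract", base + ".tif", base, "nobatch", "box.train");
    }

    /** Runs one command in workDir and fails on a non-zero exit code. */
    static void run(List<String> command, File workDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .directory(workDir)
                .inheritIO()
                .start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("Command failed (" + exit + "): "
                    + String.join(" ", command));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: BoxTraining makebox|train");
            return;
        }
        String base = "eng.normal.exp0";
        File workDir = new File("."); // directory containing the tif file
        if (args[0].equals("makebox")) {
            run(makeBoxCommand(base), workDir);
        } else {
            // Run this only after the box file has been corrected in jTessBoxEditor.
            run(boxTrainCommand(base), workDir);
        }
    }
}
```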

6. Generate the character set file (unicharset)

unicharset_extractor eng.normal.exp0.box

7. Create the font_properties file (this file has no extension)

echo normal 0 0 0 0 0 > font_properties
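Each line of font_properties has the form <fontname> <italic> <bold> <fixed> <serif> <fraktur>, with 0 or 1 for each flag, so the echo command above declares a font named normal with no special style. As a rough sketch, the same file could be written from Java; the class FontProperties and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for writing a font_properties file.
final class FontProperties {

    /** Builds one line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>. */
    static String line(String fontName, boolean italic, boolean bold,
                       boolean fixed, boolean serif, boolean fraktur) {
        return String.join(" ", fontName, flag(italic), flag(bold),
                flag(fixed), flag(serif), flag(fraktur));
    }

    private static String flag(boolean on) {
        return on ? "1" : "0";
    }

    /** Writes the given lines to the file, one per line. */
    static void write(Path file, String... lines) throws IOException {
        String content = String.join("\n", lines) + "\n";
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Same content as: echo normal 0 0 0 0 0 > font_properties
        write(Paths.get("font_properties"),
                line("normal", false, false, false, false, false));
    }
}
```

The font name on each line has to match the fontname part of the training file name (normal in eng.normal.exp0.tif).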

8. Generate the shape file (after it runs, two files are generated: shapetable and eng.unicharset)

shapeclustering -F font_properties -U unicharset -O eng.unicharset eng.normal.exp0.tr

9. Generate the clustered character feature files (this produces inttemp, pffmtable, shapetable, and the unicharset file named by -O, here unicharset)

mftraining -F font_properties -U unicharset -O unicharset eng.normal.exp0.tr

10. Generate the character normalization feature file (this produces normproto)

cntraining eng.normal.exp0.tr

11. Rename the files: prefix inttemp, pffmtable, shapetable, and normproto with the language name, giving [lang].xxx

rename normproto eng.normproto
rename inttemp eng.inttemp
rename pffmtable eng.pffmtable
rename shapetable eng.shapetable
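The rename command is specific to the Windows command prompt. A portable alternative is to do the same renames with java.nio.file; the sketch below also checks that all four files exist first, which catches a failed earlier step. The class TrainingFileRenamer is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the four rename commands.
final class TrainingFileRenamer {

    static final List<String> OUTPUTS =
            Arrays.asList("normproto", "inttemp", "pffmtable", "shapetable");

    /** Renames each training output in dir to <lang>.<name>; fails early if one is missing. */
    static void addLangPrefix(Path dir, String lang) throws IOException {
        for (String name : OUTPUTS) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                throw new IOException("Missing training output: " + name);
            }
        }
        for (String name : OUTPUTS) {
            Files.move(dir.resolve(name), dir.resolve(lang + "." + name));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: TrainingFileRenamer <training-dir>");
            return;
        }
        addLangPrefix(Paths.get(args[0]), "eng");
    }
}
```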

12. Combine the training files (this produces eng.traineddata)

combine_tessdata eng
