How can I train a Python-based OCR with Tesseract to handle different national ID cards?

Steps to improve Pytesseract recognition:

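1) Clean the image array so that it contains only text (printed type, not handwriting). The edges of the letters should be free of distortion. Apply a threshold (try different values) and a smoothing filter. I would also recommend morphological opening/closing, but that is only a bonus. Here is an exaggerated example of what should be fed to pytesseract as an array: https://i.ytimg.com/vi/1ns8tGgdpLY/maxresdefault.jpg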

2) Resize the image containing the text you want to recognize to a higher resolution.

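3) Pytesseract should generally recognize any kind of lettering, but installing trained data for the font the text is written in can greatly improve accuracy.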

How to install a new font in pytesseract:

1) Get the desired font in TIFF format.

2) Upload it to http://trainyourtesseract.com/ and receive the trained data by email.

3) Add the trained data file (*.traineddata) to this folder: C:\Program Files (x86)\Tesseract-OCR\tessdata

4) Pass the font names as the lang string in the pytesseract recognition call.

Suppose you have two trained fonts, font1.traineddata and font2.traineddata. To use both at once:

txt = pytesseract.image_to_string(img, lang='font1+font2')
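Before calling the recognizer, it can help to confirm that the trained data files really are in the tessdata folder, since a missing file is a common cause of errors. The helper below is a hypothetical sketch (the name `build_lang` is not part of pytesseract); it only checks for the files and joins the names with `+`.

```python
from pathlib import Path


def build_lang(names, tessdata_dir):
    """Join trained-data names into a lang string such as 'font1+font2'.

    Hypothetical helper: raises FileNotFoundError if any
    <name>.traineddata file is missing from tessdata_dir.
    """
    tessdata = Path(tessdata_dir)
    missing = [n for n in names
               if not (tessdata / f"{n}.traineddata").is_file()]
    if missing:
        raise FileNotFoundError(
            f"missing in {tessdata}: {', '.join(missing)}")
    return "+".join(names)


# Example use (paths are those assumed elsewhere in this answer):
# lang = build_lang(["font1", "font2"],
#                   "C:/Program Files (x86)/Tesseract-OCR/tessdata")
# txt = pytesseract.image_to_string(img, lang=lang)
```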

Here is code to test recognition on an image from the web:

import urllib.request

import cv2
import numpy as np
import pytesseract

# Path to the Tesseract executable (adjust to your installation)
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
# Note: as a plain Python variable this has no effect on Tesseract,
# which reads TESSDATA_PREFIX from the environment.
TESSDATA_PREFIX = 'C:/Program Files (x86)/Tesseract-OCR'

def url_to_image(url):
    # Download the image and decode the raw bytes into an OpenCV BGR array
    resp = urllib.request.urlopen(url)
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

url = 'http://jeroen.github.io/images/testocr.png'
img = url_to_image(url)

# Smooth, then binarize with a fixed threshold
# img = cv2.GaussianBlur(img, (5, 5), 0)
img = cv2.medianBlur(img, 5)
retval, img = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)

txt = pytesseract.image_to_string(img, lang='eng')
print('recognition:', txt)

>>> txt
'This ts a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format\n\nThe quick brown dog jumped over the\nlazy fox The quick brown dog jumped\nover the lazy fox The quick brown dog\njumped over the lazy fox The quick\nbrown dog jumped over the lazy fox'
