Character Recognition | The War Of Mine

This is a project I did for someone else in one afternoon a while back… it earned me 600.

It is an optical character recognition (OCR) system.
The final result looks like this:

The training set consists of images like this:

The test-set images are then run through the recognizer.

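The idea is simple.

Gray-level threshold segmentation: first binarize the image, turning the original picture into a 0/1 bitmap. This step can be done directly with MATLAB's graythresh, which automatically picks a good threshold for you (it uses Otsu's method).

Binarization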

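%% gray test
% im is assumed to be an 8-bit grayscale image that is already loaded
level = graythresh(im);   % Otsu threshold, normalized to [0,1]
th = level*255;           % rescale to the 0-255 intensity range
im2 = uint8(im < th);     % dark pixels (the characters) become 1
figure
imagesc(~im2)             % invert for display so the characters appear dark
colormap gray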
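
To make the "automatically picks a threshold" step less of a black box, here is a minimal sketch of Otsu's method in base MATLAB. The tiny image is made-up data, and this is not the toolbox's own implementation: for example, when several gray levels tie for the best score, this sketch simply takes the first one, so its result can differ from graythresh.

```matlab
% Synthetic 8-bit image (hypothetical data): six dark and six bright pixels.
im = uint8([ 20  25  30 200;
             22 210 215 205;
             28 220  24 198]);

counts = accumarray(double(im(:)) + 1, 1, [256 1]);  % histogram of levels 0..255
p = counts / sum(counts);                            % probability of each level
omega = cumsum(p);                                   % probability of the "dark" class up to each level
mu = cumsum(p .* (0:255)');                          % cumulative mean up to each level
mu_total = mu(end);                                  % mean of the whole image

% Between-class variance for every candidate threshold.
sigma_b2 = (mu_total * omega - mu).^2 ./ (omega .* (1 - omega));
sigma_b2(~isfinite(sigma_b2)) = -Inf;                % discard 0/0 cases (one class empty)

[~, idx] = max(sigma_b2);                            % first level with the highest score
th = idx - 1;                                        % convert index 1..256 to level 0..255

im2 = uint8(im <= th);                               % dark pixels become 1
disp(th)
disp(im2)
```

On this toy image the dark and bright pixels are far apart, so any threshold between the two groups separates them equally well.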

Locate every character to be recognized to build the training set

feat = BoundingBox('z.bmp');

BoundingBox here is a function I wrote myself; it reads the image and detects the bounding box of every character automatically.

function Features = BoundingBox(im_file)
% Given the path to an image, detect the bounding box of every character,
% draw the boxes, and return one row of normalized features per character.
im = imread(im_file);
level = graythresh(im);
th = level*255; % level is in [0,1]
im2 = uint8(im < th); % dark pixels (the characters) become 1
[nrows, ncols] = size(im2);
L = bwlabel(im2);
Nc = max(max(L));
figure
imagesc(L)
hold on;
Features = [];
count = 0;
for i = 1:Nc
    [r,c] = find(L==i);
    maxr = max(r);
    minr = min(r);
    maxc = max(c);
    minc = min(c);
    % skip components that are too small (noise) or too large
    if maxr - minr < 10 || maxc - minc < 10 || maxr - minr > 100 || maxc - minc > 100
        continue
    else
        rectangle('Position',[minc,minr,maxc-minc+1,maxr-minr+1], 'EdgeColor','w');
        % crop with a one-pixel margin, clamped so that a component touching
        % the image border does not index outside the image
        cim = im2(max(minr-1,1):min(maxr+1,nrows), max(minc-1,1):min(maxc+1,ncols));
        [centroid, theta, roundness, inmo] = moments(cim, 1);
        Features = [Features; theta roundness inmo];
        count = count + 1;
    end
end
% z-score normalization of every feature column
feature_mean = mean(Features);
feature_std = std(Features);
for i = 1:size(Features,1)
    Features(i,:) = (Features(i,:) - feature_mean)./feature_std;
end
title([num2str(count),' components detected'])
figure
imagesc(L)
title([num2str(count),' components detected'])
end

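The function works as follows: first, MATLAB's bwlabel labels the connected components; then, for each component, the coordinates of its top-left and bottom-right corners are found and the rectangle function draws its bounding box. Note that, to keep oversized components and tiny noise components from interfering, a size threshold is applied first: components whose height or width is under 10 or over 100 pixels are skipped.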
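
The labeling, bounding-box and size-filter steps can be tried in isolation on a toy image. This is only a sketch: the image and the min_size value are made up for illustration (the real function uses 10 and 100 pixels), and bwlabel needs the Image Processing Toolbox, as in the rest of this post.

```matlab
% Toy binary image (hypothetical data): one 3x2 blob and one single-pixel speck.
bw = zeros(6, 8);
bw(2:4, 2:3) = 1;     % "character": rows 2-4, columns 2-3
bw(6, 7) = 1;         % "noise": a single pixel

L = bwlabel(bw);      % label the connected components
min_size = 2;         % hypothetical size threshold for this tiny image
boxes = [];           % one row per kept component: [x y width height]
for i = 1:max(L(:))
    [r, c] = find(L == i);
    h = max(r) - min(r) + 1;
    w = max(c) - min(c) + 1;
    if h < min_size || w < min_size
        continue      % too small: treat as noise
    end
    boxes = [boxes; min(c), min(r), w, h];
end
disp(boxes)           % the blob gives [2 2 2 3]; the speck is filtered out
```

Each row of boxes is in the same [x y width height] order that rectangle expects for its 'Position' argument.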

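With a connected component in hand, its pixels are cropped out and features are computed from them to serve as the training features. This is done with the moments function, shown below.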

% MOMENTS
%
% Function calculates the moments of a binary image and returns the centroid,
% the angle of axis of minimum inertia, a measure of 'roundness', and a vector
% of invariant moments. The function assumes that there is only one object in
% the binary image. Function also displays the image and overlays the position
% of the centroid and the axis of minimum inertia.
%
% Usage: [centroid, theta, roundness, inmo] = moments(im, plotchoice)
%
% Args: im         = binary image containing values of 0 or 1
%       plotchoice = display image and axis; 0 for display, 1 for no display
%
% Returns: centroid  = 2-element vector
%          theta     = angle of axis of minimum inertia (radians)
%          roundness = ratio of (minimum inertia)/(maximum inertia)
%          inmo      = 4-element vector containing the first four
%                      invariant moments
%
% Note: Positive x is to the right and positive y is downwards, thus
%       angles are positive clockwise.

function [centroid, theta, roundness, inmo] = moments(im, plotchoice)

    % calculate centroid
    area = sum(sum(im));
    [rows,cols] = size(im);
    x = ones(rows,1)*[1:cols];
    y = [1:rows]'*ones(1,cols);
    meanx = sum(sum(double(im).*x))/area;
    meany = sum(sum(double(im).*y))/area;
    centroid = [meanx,meany];

    % calculate theta and roundness
    xdash = x - meanx;
    ydash = y - meany;
    a = sum(sum(double(im).*xdash.*xdash));
    b = 2*(sum(sum(double(im).*xdash.*ydash)));
    c = sum(sum(double(im).*ydash.*ydash));
    aminusc = (a-c);
    denom = sqrt((b*b)+(aminusc*aminusc));
    if denom == 0
        theta = 0;
        roundness = 1;
    else
        sin2theta = b/denom;
        cos2theta = aminusc/denom;
        twotheta = atan2(sin2theta,cos2theta);
        theta = twotheta/2;

        costheta2 = cos(theta)*cos(theta);
        minaxis = a*(1-costheta2)+c*costheta2-b*sin2theta/2;
        % the axis of maximum inertia is perpendicular to theta; atan2 returns
        % twice its angle, so halve it before taking the cosine
        maxtheta = atan2(-1*sin2theta, -1*cos2theta)/2;
        costheta2 = cos(maxtheta)*cos(maxtheta);
        maxaxis = a*(1-costheta2)+c*costheta2+b*sin2theta/2;
        roundness = minaxis/maxaxis;
    end

    % display axis of minimum inertia
    if plotchoice == 0
        startx = 0;
        startrho = -meanx / cos(theta);
        starty = meany + startrho * sin(theta);

        endx = cols;
        endrho = (endx - meanx) / cos(theta);
        endy = meany + endrho * sin(theta);

        imagesc(im);
        hold on;
        plot([startx; meanx; endx], [starty; meany; endy],'gx--');
    end

    % calculate invariant moments
    % phi1
    mu20 = sum(sum(double(im).*xdash.*xdash));
    gamma = (2 + 2)/2;
    n20 = mu20/(area^gamma);
    mu02 = sum(sum(double(im).*ydash.*ydash));
    n02 = mu02/(area^gamma);
    phi1 = n20 + n02;
    % phi2
    mu11 = sum(sum(double(im).*xdash.*ydash));
    n11 = mu11/(area^gamma); % gamma is the same
    phi2 = (n20 - n02)^2 + 4*n11^2;
    % phi3 and phi4
    mu30 = sum(sum(double(im).*xdash.*xdash.*xdash));
    gamma = (3 + 2)/2;
    n30 = mu30/(area^gamma);
    mu12 = sum(sum(double(im).*xdash.*ydash.*ydash));
    n12 = mu12/(area^gamma);
    mu21 = sum(sum(double(im).*xdash.*xdash.*ydash));
    n21 = mu21/(area^gamma);
    mu03 = sum(sum(double(im).*ydash.*ydash.*ydash));
    n03 = mu03/(area^gamma);
    phi3 = (n30 - 3*n12)^2 + (3*n21 - n03)^2;
    phi4 = (n30 + n12)^2 + (n21 + n03)^2; % Hu's fourth invariant uses sums, not differences
    % inmo
    inmo = [phi1 phi2 phi3 phi4];
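
A quick way to see why these moments make reasonable features is to check that they do not change when a shape is rotated. The sketch below is self-contained and uses a made-up L-like shape: it recomputes only phi1 and phi2, with the same formulas as above, for the shape and for its 90-degree rotation.

```matlab
% Hypothetical test shape: a horizontal bar with a short leg.
shape = zeros(7, 9);
shape(2:3, 2:8) = 1;
shape(4:6, 2:3) = 1;

images = {shape, rot90(shape)};      % original and rotated by 90 degrees
phi = zeros(2, 2);                   % row k holds [phi1 phi2] of image k
for k = 1:2
    im = images{k};
    [rows, cols] = size(im);
    [x, y] = meshgrid(1:cols, 1:rows);       % x = column index, y = row index
    area = sum(im(:));
    xd = x - sum(sum(im .* x)) / area;       % coordinates relative to the centroid
    yd = y - sum(sum(im .* y)) / area;
    n20 = sum(sum(im .* xd.^2)) / area^2;    % normalized central moments, gamma = 2
    n02 = sum(sum(im .* yd.^2)) / area^2;
    n11 = sum(sum(im .* xd .* yd)) / area^2;
    phi(k, :) = [n20 + n02, (n20 - n02)^2 + 4*n11^2];
end
disp(phi)                            % the two rows agree up to rounding error
```

The rotation swaps the roles of n20 and n02 and flips the sign of n11, which leaves both expressions unchanged.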

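Save that, and wrap everything up in the BoundingBox function, which extracts every character in an image together with its features to build the training set. What remains is repetitive work.

Collecting all the training samples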

function res = MyExtractorFunction(plo,km)
% Extracts the features and labels of the training data.
% Usage (the code reads '*.bmp' from the current folder, so cd into the image
% folder first, e.g. ~/documents/matlab/assignment4_/H4-16images):
%   res = MyExtractorFunction(0, 0);
% plo: 0 for no plots; 1 plots diagnostics for image number km. I have not
%      tested the plotting much, so it may have problems; feel free to change it.
% res: one row per character, [theta roundness inmo label].
img_path_list = dir('*.bmp');
img_num = length(img_path_list);
Features = [];
Labels = [];
count = 0;
for kk = 1:img_num
    % only files with short names (fewer than 6 characters, e.g. 'z.bmp') are used
    if length(img_path_list(kk).name) >= 6
        continue
    end
    count = count + 1;
    im = imread(img_path_list(kk).name);
    %imshow(im)
    h = imhist(im);
    %triangle test
    %level=triangle_th(h,200);
    %gray test
    level = graythresh(im);
    th = level*255; % level is in [0,1]
    im2 = uint8(im < th);
    [nrows, ncols] = size(im2);

    %connected component analysis
    L = bwlabel(im2);
    %res = regionprops(im2, 'BoundingBox')
    Nc = max(max(L));
    do_plot = (plo == 1 && km == kk);
    if do_plot
        subplot(2,2,1)
        plot(h(1:220))
        title('Intensity Histogram')
        subplot(2,2,2)
        imagesc(im2)
        colormap gray
        subplot(2,2,3)
        imagesc(L)
        title('Connected Components')
        subplot(2,2,4)
        imagesc(L)
        hold on
    end
    for i = 1:Nc
        [r,c] = find(L==i);
        maxr = max(r);
        minr = min(r);
        maxc = max(c);
        minc = min(c);
        % skip components that are too small (noise) or too large
        if maxr - minr < 10 || maxc - minc < 10 || maxr - minr > 100 || maxc - minc > 100
            continue
        end
        % same one-pixel margin as in BoundingBox, clamped to the image border
        cim = im2(max(minr-1,1):min(maxr+1,nrows), max(minc-1,1):min(maxc+1,ncols));
        [centroid, theta, roundness, inmo] = moments(cim, 1);
        Features = [Features; theta roundness inmo];
        if do_plot
            rectangle('Position',[minc,minr,maxc-minc+1,maxr-minr+1], 'EdgeColor','w');
        end
        Labels = [Labels; count]; % the label is the index of the source image
    end
end
res = [Features Labels];
fprintf('%d images processed\n', count);

Algorithm

There were no strict requirements on the algorithm part.

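With the training set in hand, the simplest choice is naturally kNN. I tried k from 1 up to a few dozen; in my tests so far, k around 15 works best, with an accuracy of a little over 60%.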
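
For reference, a kNN classifier of this kind fits in a few lines of base MATLAB. This is a minimal sketch on made-up toy data, not the code used for the numbers above; with the real data one would presumably take train_X = res(:,1:end-1) and train_y = res(:,end) from MyExtractorFunction, and the names train_X, train_y and test_X are my own.

```matlab
% Hypothetical toy data: two features per sample, two classes.
train_X = [0.0 0.1; 0.2 0.0; 0.1 0.2; 1.0 1.1; 1.2 0.9; 0.9 1.0];
train_y = [1; 1; 1; 2; 2; 2];
test_X  = [0.1 0.1; 1.1 1.0];
k = 3;

pred = zeros(size(test_X, 1), 1);
for i = 1:size(test_X, 1)
    % difference between every training row and the i-th test row
    diffs = train_X - repmat(test_X(i, :), size(train_X, 1), 1);
    d2 = sum(diffs.^2, 2);                 % squared Euclidean distances
    [~, order] = sort(d2);                 % nearest training samples first
    pred(i) = mode(train_y(order(1:k)));   % majority vote among the k nearest labels
end
disp(pred')                                % 1 2
```

Because the features have very different scales, it probably helps to normalize the test features with the training mean and standard deviation before computing distances. Note also that mode breaks ties in favor of the smallest label.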