OCR with Julia

I am looking for suggestions on doing Optical Character Recognition. This does not necessarily have to be native Julia (see below). Also, I am not looking for a full-on Machine Learning framework which will recognise images.
I am looking for a library or an external program which, given an image of a typewritten page, will return a score: the higher the score, the more readable or more easily recognised the text is.
If this is best implemented using Kaggle (which I know next to nothing about) I will hoist that aboard.

To explain, I have had a programming exercise in my head for years. Take an A4 typewritten page. Send it through a straight-cut shredder. Take the set of strips you have and scan them as images. Of course as an exercise you can do this virtually. Then combine the strips into an image page, at random. How can we piece together the original typewritten page?
That leads to using optimisation, perhaps a genetic algorithm, to arrange the strips.
Or it may be that simple brute force is easier: just look at all possible arrangements of the strips.
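To make the exercise concrete, here is a minimal sketch of the "virtual shredder" and the brute-force search in plain Julia. All names (`shred`, `reassemble`, `best_order`, `smoothness`) are made up for illustration, and the scoring function is only a stand-in that rewards smooth seams on a synthetic gradient "page"; in the real exercise it would be replaced by an OCR-based readability score.

```julia
using Random

# Cut a page (a matrix of pixel values) into `n` vertical strips of equal width.
function shred(page::AbstractMatrix, n::Integer)
    w = size(page, 2) ÷ n
    return [page[:, (i - 1) * w + 1 : i * w] for i in 1:n]
end

# Glue the strips back together, left to right, in the given order.
reassemble(strips, order) = reduce(hcat, strips[order])

# Stand-in score (an assumption, not OCR): higher means smoother horizontal transitions.
smoothness(img) = -sum(abs.(diff(img, dims = 2)))

# Brute force: try every arrangement and keep the one with the highest score.
function best_order(strips, score)
    best, best_score = collect(1:length(strips)), -Inf
    function visit(prefix, remaining)
        if isempty(remaining)
            s = score(reassemble(strips, prefix))
            if s > best_score
                best, best_score = copy(prefix), s
            end
            return
        end
        for (k, r) in enumerate(remaining)
            visit(vcat(prefix, r), deleteat!(copy(remaining), k))
        end
    end
    visit(Int[], collect(1:length(strips)))
    return best, best_score
end

# Synthetic 4x12 "page" whose columns form a gradient, shredded into 4 strips and shuffled.
page = [Float64(j) for i in 1:4, j in 1:12]
strips = shred(page, 4)
shuffled = strips[randperm(MersenneTwister(1), length(strips))]
best, best_score = best_order(shuffled, smoothness)
restored = reassemble(shuffled, best)
```

Brute force visits all n! arrangements of n strips, so it only works for a handful of strips: 20 strips already give about 2.4e18 orderings. That is where a genetic algorithm or another heuristic search would have to take over.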

Perhaps bad form to answer my own question… Tesseract may be what I am looking for:
https://github.com/tesseract-ocr/tesseract

Other suggestions are of course welcomed. (I have searched for this - I just came across Tesseract after I asked on here)
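One possible way to get a single score out of Tesseract is to ask its command-line tool for TSV output, which has a per-word `conf` column, and average that column. The sketch below assumes the `tesseract` executable is on the PATH; `mean_confidence` and `readability_score` are hypothetical helper names, and whether mean word confidence is a good enough measure of a correct strip arrangement is untested.

```julia
using Statistics

# Mean word confidence (0 to 100) from Tesseract's TSV output.
# Rows that are not recognised words carry conf = -1 or an empty text and are skipped.
function mean_confidence(tsv::AbstractString)
    lines = split(tsv, '\n'; keepempty = false)
    isempty(lines) && return 0.0
    header = split(strip(lines[1]), '\t')
    ci = findfirst(==("conf"), header)
    ti = findfirst(==("text"), header)
    (ci === nothing || ti === nothing) && return 0.0
    confs = Float64[]
    for line in lines[2:end]
        cols = split(line, '\t')
        length(cols) >= max(ci, ti) || continue
        conf = tryparse(Float64, cols[ci])
        if conf !== nothing && conf >= 0 && !isempty(strip(cols[ti]))
            push!(confs, conf)
        end
    end
    return isempty(confs) ? 0.0 : mean(confs)
end

# Run `tesseract <image> stdout tsv` and reduce its output to one number.
readability_score(path::AbstractString) =
    mean_confidence(read(`tesseract $path stdout tsv`, String))
```

With that in place, `readability_score("page.png")` (the file name is only an example) would return a higher number for a page Tesseract reads confidently and 0.0 when it finds no words at all.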


I wish we could mobilize the community and build something in Julia that is better and simpler than this new LSTM-based OCR you just posted. Could we set up a repo to start mapping out its design? Maybe if I saw its beginnings I could contribute to completing it. In any case, I am completing A. Ng’s ML course, which teaches OCR in the eleventh week, so maybe I could start mapping out the design.


Any progress on this front?

There is a Julia wrapper for Tesseract: OCReract.jl (GitHub: leferrad/OCReract.jl), a simple Julia wrapper for Tesseract OCR.
However, it does not appear to be maintained.

I guess Tesseract is a good tool to start with at least (from version 4.0, it gives very good results on images with good enough resolution, even noisy ones). Then you can start thinking about adding good pre-processing steps to improve the image, or even better ML models for the OCR task.
As far as I know, there are no Julia frameworks for OCR tasks of the kind you asked about. I developed OCReract.jl as just a simple wrapper for Tesseract that lets you retrieve results in a Julia session and, as @uwechsler mentioned, it was not being maintained. But I have recently released a stable version with proper usage documentation, so you can try it and see whether it meets your needs.
Repo: GitHub - leferrad/OCReract.jl: A simple Julia wrapper for Tesseract OCR
Doc: Home · OCReract.jl


If tesseract.exe is not on your %PATH% on Windows, you can add something like this to your startup.jl:

# Append the Tesseract install directory to PATH unless it is already listed.
if !occursin(raw"C:\bin\Tesseract-OCR", ENV["path"])
    ENV["path"] = string(ENV["path"], ";", raw"C:\bin\Tesseract-OCR")
end
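A slightly more defensive variant of the same idea is sketched below. It uses `Sys.which` to check whether `tesseract` can already be found, and compares whole PATH entries instead of substrings. The helper name `add_to_path!` is hypothetical, and the install directory is the one from the snippet above; adjust it to your own setup. Note that the comparison is case-sensitive, which is stricter than Windows itself.

```julia
# Append `dir` to the PATH-like variable `key` in `env` unless it is already an entry.
# `env` can be ENV or any dictionary of strings, which makes the helper easy to try out.
function add_to_path!(env, dir::AbstractString;
                      key = "PATH", sep = Sys.iswindows() ? ";" : ":")
    entries = String.(split(get(env, key, ""), sep; keepempty = false))
    if !(dir in entries)
        env[key] = join(vcat(entries, String(dir)), sep)
    end
    return env[key]
end

# In startup.jl: only touch PATH on Windows, and only when Tesseract cannot be found.
if Sys.iswindows() && Sys.which("tesseract") === nothing
    add_to_path!(ENV, raw"C:\bin\Tesseract-OCR")
end
```

Calling the helper twice with the same directory leaves the variable unchanged the second time, so it is safe to keep in a startup file.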

Thanks for the suggestion @ellocco! Could you submit it as an issue here? That way it would get more attention, and there would be a thread for working out the changes required.