OCR with Julia

John_Hearns · July 19, 2017, 8:04am

I am looking for suggestions on doing Optical Character Recognition. This does not necessarily have to be native Julia (see below). Also, I am not looking for a full-on Machine Learning framework which will recognise images.
I am looking for a library or an external program which, given an image of a typewritten page, will return a score: the higher the score, the more readable or more easily recognised the text is.
If this is best implemented using Kaggle (which I know next to nothing about) I will hoist that aboard.

To explain, I have had a programming exercise in my head for years. Take an A4 typewritten page. Send it through a straight-cut shredder. Scan the resulting strips as images (as an exercise you can of course do this virtually). Then combine the strips into a page image in random order. How can we piece together the original typewritten page?
That leads to using optimisation, perhaps a genetic algorithm, to arrange the strips.
Or it may be that simple brute force is easier: just look at all possible arrangements of the strips.
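
For a sense of scale, n strips have n! possible orderings, so 20 strips already give about 2.4e18 arrangements, which makes plain brute force impractical beyond a handful of strips. Below is a minimal sketch of a cheaper alternative, assuming each strip is a grayscale matrix of pixel values. It is a greedy edge-matching heuristic, not the genetic algorithm mentioned above and not an OCR-based score, and the names `edge_cost`, `greedy_order` and `reassemble` are made up for illustration.

```julia
# Cost of placing strip `b` immediately to the right of strip `a`:
# squared difference between a's last pixel column and b's first.
edge_cost(a, b) = sum(abs2, a[:, end] .- b[:, 1])

# Greedy chain: begin with strip `start`, then repeatedly append the unused
# strip whose left edge best matches the current right edge.
function greedy_order(strips, start)
    order = [start]
    total = 0.0
    remaining = setdiff(eachindex(strips), start)
    while !isempty(remaining)
        costs = [edge_cost(strips[order[end]], strips[j]) for j in remaining]
        c, k = findmin(costs)
        total += c
        push!(order, remaining[k])
        deleteat!(remaining, k)
    end
    return order, total
end

# Try every strip as the leftmost one and keep the cheapest chain.
function reassemble(strips)
    best = greedy_order(strips, 1)
    for s in 2:length(strips)
        cand = greedy_order(strips, s)
        cand[2] < best[2] && (best = cand)
    end
    return best[1]
end
```

Given a vector `strips` of equally tall matrices, `reduce(hcat, strips[reassemble(strips)])` rebuilds a candidate page. On a real typewritten page many strip edges are mostly white, so edge matching alone will probably be ambiguous; an OCR readability score could then be used to choose between candidate orderings.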

John_Hearns · July 19, 2017, 8:35am

Perhaps it is bad form to answer my own question… Tesseract may be what I am looking for:
https://github.com/tesseract-ocr/tesseract

Other suggestions are of course welcomed. (I have searched for this - I just came across Tesseract after I asked on here)
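
As a rough sketch of how Tesseract could supply the scoring value asked for in the first post: the command-line tool can write per-word results as TSV, including a confidence column, and the mean of those confidences is one possible readability score. The code below assumes the `tesseract` executable is on the PATH, that it is invoked as `tesseract <image> stdout tsv`, and that the TSV has 12 tab-separated columns with `conf` in column 11 and `text` in column 12. The function names `mean_confidence` and `readability_score` are hypothetical.

```julia
# Mean word confidence from Tesseract TSV output.
# Assumed layout: one header row, then 12 tab-separated columns per row,
# with `conf` in column 11 and `text` in column 12.
function mean_confidence(tsv::AbstractString)
    confs = Float64[]
    for line in Iterators.drop(eachline(IOBuffer(tsv)), 1)  # skip the header
        cols = split(line, '\t')
        length(cols) >= 12 || continue
        conf = tryparse(Float64, cols[11])
        # Rows that are not recognised words carry conf = -1 and empty text.
        if conf !== nothing && conf >= 0 && !isempty(strip(cols[12]))
            push!(confs, conf)
        end
    end
    return isempty(confs) ? 0.0 : sum(confs) / length(confs)
end

# Hypothetical helper: run the CLI on an image file and score its output.
readability_score(path::AbstractString) =
    mean_confidence(read(`tesseract $path stdout tsv`, String))
```

Calling `readability_score("page.png")` (the file name is a placeholder) would then return a number between 0 and 100, with higher values meaning Tesseract was more confident about the words it found. Whether this correlates well enough with a correct strip arrangement would need to be tested.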

hpoit · June 7, 2018, 3:56pm

I wish we could mobilize the community and build something in Julia that is better and simpler than this new LSTM-based OCR you just posted. Could we build a repo to start mapping out its design? Maybe if I saw its beginnings I could contribute to completing it. In any case, I am completing A. Ng’s ML course, which teaches OCR in the eleventh week, so maybe I could start mapping out the design.

raja4u · September 6, 2018, 2:24pm

Any progress on this front?

uwechsler · March 9, 2020, 1:48pm

There is a Julia wrapper for Tesseract: see OCReract.jl on GitHub (leferrad/OCReract.jl), a simple Julia wrapper for Tesseract OCR.
However, it does not appear to be maintained.

leferrad · May 25, 2020, 8:46pm

I guess Tesseract is a good tool to start with at least (from version 4.0 it gives very good results on images with good enough resolution, even noisy ones). Then you can start thinking about adding good pre-processing steps to improve the image, or even better ML models for the OCR task.
As far as I know, there are no Julia frameworks for OCR tasks of the kind you asked about. I developed OCReract.jl to be just a simple wrapper for Tesseract that lets you retrieve results in a Julia session and, as @uwechsler mentioned, it was not being maintained. But I have recently released a stable version with proper usage documentation, so you can try it to see if it meets your needs.
Repo: GitHub - leferrad/OCReract.jl: A simple Julia wrapper for Tesseract OCR
Doc: Home · OCReract.jl

ellocco · December 7, 2021, 5:24pm

If tesseract.exe is not in %path% on Windows OS you may add something like this in your startup.jl:

# Append the Tesseract folder to the search path, with a ";" separator in front.
if !occursin(raw"C:\bin\Tesseract-OCR", ENV["path"])
    ENV["path"] = string(ENV["path"], ";", raw"C:\bin\Tesseract-OCR")
end

leferrad · December 12, 2021, 10:08pm

Thanks for the suggestion @ellocco! Could you submit it as an issue here? It would get more attention that way, and a thread there would make it easier to work out the changes required.
