Using Julia to extract information from a ballot

Nosferican · September 25, 2020, 2:39am

I would like to identify tools in the Julia ecosystem to parse a model ballot. For example, from the following PDF,

obtain

Candidature
Candidate’s name
Party Affiliation

Nosferican · September 25, 2020, 3:06am

@sambitdash would it be possible to use the PDF structure to scrape those fields with PDFIO.jl?

johnh · September 25, 2020, 4:53am

A search shows that Avik has written a package called Taro.

baggepinnen · September 25, 2020, 2:18pm

If all the rectangles have the same size, you can probably find them with a 2D matched filter.

Simply extract one rectangle and correlate the image with your little rectangle pattern (kernel). The correlation image will have peaks located on all rectangles in the image.
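
A minimal sketch of that idea in plain Julia, assuming the page has already been loaded as a grayscale matrix. Here the page is synthetic, and `draw_box!`, `matched_filter`, and `find_peaks` are hypothetical helper names made up for illustration, not functions from any package. The template is made zero-mean so that large uniform areas do not produce spurious peaks.

```julia
# Draw the outline of an h-by-w rectangle with top-left corner (r, c).
function draw_box!(img, r, c, h, w)
    img[r, c:c+w-1] .= 1.0        # top edge
    img[r+h-1, c:c+w-1] .= 1.0    # bottom edge
    img[r:r+h-1, c] .= 1.0        # left edge
    img[r:r+h-1, c+w-1] .= 1.0    # right edge
    return img
end

# Cross-correlate `img` with a zero-mean copy of `kernel`.
# Only positions where the kernel fits entirely inside the image are evaluated,
# so out[i, j] is the response with the kernel's top-left corner at (i, j).
function matched_filter(img, kernel)
    kh, kw = size(kernel)
    k = kernel .- sum(kernel) / length(kernel)
    out = zeros(size(img, 1) - kh + 1, size(img, 2) - kw + 1)
    for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = sum(k .* view(img, i:i+kh-1, j:j+kw-1))
    end
    return out
end

# Keep every position whose response is within `tol` of the global maximum.
function find_peaks(response; tol = 0.99)
    thresh = tol * maximum(response)
    return [Tuple(I) for I in CartesianIndices(response) if response[I] >= thresh]
end

# Synthetic page: background 0.0, rectangle outlines 1.0 (an assumption for the demo).
img = zeros(40, 60)
for (r, c) in [(5, 5), (5, 30), (25, 12)]
    draw_box!(img, r, c, 8, 14)
end

kernel = img[5:12, 5:18]                  # "extract one rectangle" as the template
response = matched_filter(img, kernel)
peaks = find_peaks(response)              # top-left corners of the matches
println(peaks)
```

On a real scan the peaks will not all be equal, so the threshold would need tuning, and for large images a library routine such as `imfilter` from ImageFiltering.jl is likely a better fit than these explicit loops.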

Nosferican · September 26, 2020, 12:39am

Ended up using OCReract.jl
Ref: https://github.com/Nosferican/CandidatosEleccionesGeneralesPR2020

sylvaticus · September 26, 2020, 12:58am

In this specific case, isn’t it possible to find the data in another format?

Nosferican · September 26, 2020, 1:17am

Highly unlikely. The PR government is known for not being great at government transparency / openness. I was able to programmatically get the political contributions by using Twitter to contact the Electoral Board Comptroller Office and have them tweak an internal API so I could get the data. Right now I have the executive and legislative ballots done. The municipal ballots are a bit trickier because the locations / dimensions are not consistent, since some minority parties don’t list candidates for all local governments.

sambitdash · November 12, 2020, 10:14am

Sorry for the delayed response; I do not log in to the forum very often. If you can estimate the rectangular regions, we could use the suggestion given in this issue: https://github.com/sambitdash/PDFIO.jl/issues/55
