LaTeX table reader in Julia

Is there a LaTeX table reader in Julia? I know there are several writers, but I haven’t been able to find something that takes the LaTeX source as input.

The reason I’m interested is that MathPix is really good at OCRing tables, but it outputs to LaTeX. (Example use case: I want to do some analysis with some data I find printed in a PDF.)

I know a basic version would be really easy to throw together, but I didn’t want to reinvent the wheel if it’s already out there.


Ha. This was not my first thought.

Not aware of anything. Is there an implementation in another language? You could start with that via some of the interop packages Julia has if so.


Can you give an example of how a table would look?

To make sure the example is representative, I went to Google image search and OCRed one of the first tables I saw. Here was the output from MathPix:

&\text { Table } 1.1 . \text { Nonlinear Model Results }\\
\hline \hline \text { Case } & \text { Method#1 } & \text { Method#2 } & \text { Method#3 } \\
\hline 1 & 50 & 837 & 970 \\
2 & 47 & 877 & 230 \\
3 & 31 & 25 & 415 \\
4 & 35 & 144 & 2356 \\
5 & 45 & 300 & 556 \\

Here’s how it renders:

The part I’d be interested in is just the part between the \begin{array} and \end{array}.

@tbeason I’m starting to think you’re right about the difficulties here. At the most basic level, you’re just splitting into rows and then into entries. For the simplest examples, this would be trivial. But I’m thinking it’s likely to happen fairly often that MathPix outputs something that the naive algorithm isn’t prepared for. Handling that complexity in a graceful way is almost certainly more work than it’s worth.
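For reference, the naive approach described above (split into rows, then into entries) might look roughly like the sketch below. The names `clean_cell` and `parse_latex_rows` are made up for illustration and do not come from any package. It assumes the input is only the body of the array, that rows end in `\\`, that cells are separated by unescaped `&`, and that the only markup to strip is `\hline` and non-nested `\text{...}`. Anything beyond that (multicolumn cells, nested braces, math in cells) is not handled.

```julia
# Naive sketch: parse the body of a MathPix-style LaTeX array into rows of strings.
# `clean_cell` and `parse_latex_rows` are hypothetical helper names.

# Remove \hline, unwrap non-nested \text{...}, and trim whitespace from one cell.
function clean_cell(cell::AbstractString)
    s = replace(cell, "\\hline" => "")
    s = replace(s, r"\\text\s*\{([^{}]*)\}" => s"\1")
    return String(strip(s))
end

function parse_latex_rows(body::AbstractString)
    rows = Vector{Vector{String}}()
    for line in split(body, "\\\\")              # rows end with a double backslash
        isempty(strip(line)) && continue         # skip the empty piece after the last row
        cells = split(line, r"(?<!\\)&")         # split on & not preceded by a backslash
        push!(rows, [clean_cell(c) for c in cells])
    end
    return rows
end

# Usage on a shortened copy of the table above (title row left out).
body = raw"""
\hline \hline \text { Case } & \text { Method#1 } & \text { Method#2 } & \text { Method#3 } \\
\hline 1 & 50 & 837 & 970 \\
2 & 47 & 877 & 230 \\
"""

rows = parse_latex_rows(body)
header = rows[1]
numbers = [parse.(Int, r) for r in rows[2:end]]  # assumes every data cell is an integer
```

The result is just a vector of string vectors plus a numeric conversion step; getting it into a DataFrame or similar is left out on purpose, since that is the easy part.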

My understanding is that there are LaTeX table readers in Python. That’s probably the right solution.

Considering the different commands and environments one can use to generate a LaTeX table, this is probably really difficult. I guess it would be somewhat easier to develop something for MathPix only, as you have some guarantee of what the output will include.

You could try using Pandoc to convert to a simpler format (like Markdown pipe tables) and then try to parse those (I think they’re basically CSVs with `|` as the delimiter).
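If the conversion step works, parsing the resulting pipe table is the simple part. Here is a minimal sketch, assuming the Markdown text is already in a string and that no cell contains an escaped `\|`. The function name `parse_pipe_table` is hypothetical, and the sketch does not check whether Pandoc can actually turn a given MathPix array into a table.

```julia
# Minimal sketch: parse a Markdown pipe table into rows of strings.
# `parse_pipe_table` is a hypothetical name, not part of any package.
function parse_pipe_table(md::AbstractString)
    rows = Vector{Vector{String}}()
    for line in split(md, '\n')
        t = strip(line)
        isempty(t) && continue
        # Skip the header separator row, e.g. |-----:|-----:|
        occursin(r"^[|:\s]*-[|\-:\s]*$", t) && continue
        cells = split(strip(t, '|'), '|')        # drop outer pipes, then split on the rest
        push!(rows, [String(strip(c)) for c in cells])
    end
    return rows
end

md = """
| Case | Method#1 | Method#2 | Method#3 |
|-----:|---------:|---------:|---------:|
|    1 |       50 |      837 |      970 |
|    2 |       47 |      877 |      230 |
"""

table = parse_pipe_table(md)
```

Here `table[1]` holds the header cells and the remaining entries hold the data rows, still as strings.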