Thank you for the references.
My intuition tells me that for now, it might be more effective to use LLMs to translate from natural language to formal code, then run symbolic inference with a theorem prover, and translate the result to natural language.
The tools we use for training might also help uncover proof strategies, intermediate theorems that might shorten the inference.