Does CSV.read allow another process to keep writing to file?

I’m running a pretty expensive job on a cluster. It generates plaintext output files by appending new lines of data regularly (roughly once a second).

I’ve written some Julia code to perform quick visualizations (also on the cluster!). But before I accidentally mess anything up, I want to double-check:

Is there any chance that CSV.read("filename.dat", DataFrame) could interfere with the other process writing to these files?
The CSV docs on input use a lot of words that I’m unfamiliar with.

In principle, nothing prevents a different process from opening a file you’ve already opened and writing to it at the same time as you’re reading from it. This is irrespective of CSV.jl.
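You can see this even within a single process. Here's a self-contained sketch (the path and data are made up) showing one handle appending while another handle reads:

```julia
# Single-process demo: a file can be appended to while another handle has it open.
path = tempname()
write(path, "a,b\n1,2\n")

reader = open(path, "r")       # open for reading first...
open(path, "a") do w           # ...then append through a second handle
    println(w, "3,4")
end

contents = read(reader, String)  # the read sees all bytes present now, including the appended row
println(contents)
close(reader)
```

The same applies across processes: the OS doesn't lock the file for you, so the reader simply sees whatever bytes exist at the moment it reads.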

I’m worried about the “in principle” part here ;-). But I guess if I’m rsyncing the files over to another drive while the job is running, that’s probably doing something similar…

You might want to use a front-end IO handler in the reader to make sure the CSV parser doesn’t choke on incomplete records.
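For example, something like this (an untested sketch — `read_complete_rows` is a made-up helper, not part of CSV.jl): snapshot the file's bytes and drop any trailing partial line before handing them to `CSV.read`:

```julia
using CSV, DataFrames

# Made-up helper: parse only the complete lines currently in the file,
# so a half-written last record can't trip up the CSV parser.
function read_complete_rows(path)
    raw = read(path)                     # snapshot of the file's bytes right now
    i = findlast(==(UInt8('\n')), raw)   # complete data ends at the last newline
    i === nothing && return DataFrame()  # nothing complete yet
    return CSV.read(IOBuffer(raw[1:i]), DataFrame)
end
```

If the writer appends whole lines at a time this will usually just return everything, but it protects you on the occasions when you catch a line mid-write.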

In the end I had to copy the data to a different drive anyway for more permanent storage. I’m making the figures now by reading the data from there.

In the future, if you want consistency and simultaneous read+write access, you might try DuckDB or SQLite.
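With SQLite, appends are transactional, so a reader never sees a half-written record. A rough sketch using SQLite.jl (the filename and schema here are invented for illustration):

```julia
using SQLite, DataFrames
import DBInterface

# Made-up database path and table layout, just to show the pattern.
db = SQLite.DB(joinpath(mktempdir(), "results.db"))
DBInterface.execute(db, "CREATE TABLE IF NOT EXISTS results (t REAL, value REAL)")

# Writer side: each INSERT is an atomic, committed transaction.
DBInterface.execute(db, "INSERT INTO results VALUES (?, ?)", (0.0, 1.23))

# Reader side: sees only committed rows, never a partial one.
df = DataFrame(DBInterface.execute(db, "SELECT * FROM results"))
```

Both processes would open the same database file; SQLite handles the locking for you.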

CSV.read will not mess with the process writing to the file, but the “quick visualizations” are not guaranteed to see the latest writes. That depends on whether and when the writing process flushes/syncs to the file system.
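If the writer is under your control, flushing after each record makes appends visible to readers promptly. A minimal Julia-side sketch, reusing the `filename.dat` name from the question (the data row is made up):

```julia
# Writer side: flush after each appended record so readers see it promptly.
open("filename.dat", "a") do io
    println(io, "0.1,0.2,0.3")  # made-up data row
    flush(io)                   # push Julia's buffer to the OS so other processes can read it
end
```

Without the `flush`, the line may sit in the writer's userspace buffer for a while, and a concurrent `CSV.read` would simply not see it yet.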