For my package JSONLines.jl I am considering a refactoring that would provide three options for reading a JSONLines file:
- Iterator over an mmapped file. Initially just returns the mmapped file and implements an iterator that produces the next row on each iteration (rows can be parsed or returned as `Vector{UInt8}`).
- Index of an mmapped file. Mmaps the file and iterates over it once, saving the indices of the newlines so that rows can be accessed via `getindex` (rows can be parsed or returned as `Vector{UInt8}`).
- Read and parse the whole file.
Would it be preferable to export three different functions, or one function with additional arguments specifying which version the user wants?
Separate functionality should go into separate functions. But you need at most 2.
You can have a function that returns an object that supports the iteration and abstract array protocols.
And then maybe a convenience function that just `collect`s over this.
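A minimal sketch of that shape, assuming the file contents are already available as a `Vector{UInt8}` (for example from `Mmap.mmap`). The names `LineIterator` and `readrows` are hypothetical and not part of JSONLines.jl; rows are returned as raw byte views, with no JSON parsing.

```julia
# Hypothetical type for illustration; not the JSONLines.jl API.
struct LineIterator
    data::Vector{UInt8}   # e.g. the result of Mmap.mmap(path)
end

# The number of rows is unknown until the whole buffer has been scanned.
Base.IteratorSize(::Type{LineIterator}) = Base.SizeUnknown()

# The state is the byte position where the next row starts.
function Base.iterate(it::LineIterator, pos::Int = 1)
    n = length(it.data)
    pos > n && return nothing
    nl = findnext(==(UInt8('\n')), it.data, pos)
    stop = nl === nothing ? n : nl - 1        # last byte of this row
    next = nl === nothing ? n + 1 : nl + 1    # first byte of the next row
    return view(it.data, pos:stop), next
end

# Convenience function: eagerly materialize every row as a String.
readrows(data::Vector{UInt8}) = [String(copy(row)) for row in LineIterator(data)]
```

With this split, the lazy object is the primary API and the eager reader is a one-liner built on top of it.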
Thanks for the input! The question is then in what order the operations should be performed. The “laziest” option would be to return the iterator and, if the user calls `getindex`, index the rows and return the appropriate row. This would make reading the file fast and the first `getindex` unexpectedly slow. Or break it up into multiple steps:
file = File("path/to.jsonl")
file[1] # error
iterate(file, 1) # return first row
index!(file)
file[1] # return first row
In any case there are two costly operations: indexing the rows and parsing the strings (rows). The main idea is to be able to defer both until needed.
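To make the trade-off concrete, here is a rough sketch of the "laziest" variant, where the first `getindex` triggers the indexing pass. `LazyLines` and `isindexed` are hypothetical names, `index!` mirrors the call in the snippet above, and the data is assumed to be a plain `Vector{UInt8}`. Parsing is deferred entirely: rows come back as byte views.

```julia
# Hypothetical type for illustration; not the JSONLines.jl API.
mutable struct LazyLines
    data::Vector{UInt8}
    starts::Union{Nothing,Vector{Int}}  # first byte of each row, built on demand
end
LazyLines(data::Vector{UInt8}) = LazyLines(data, nothing)

isindexed(f::LazyLines) = f.starts !== nothing

# One sequential pass that records where each row begins.
function index!(f::LazyLines)
    starts = Int[]
    n = length(f.data)
    n > 0 && push!(starts, 1)
    for i in 1:n-1
        f.data[i] == UInt8('\n') && push!(starts, i + 1)
    end
    f.starts = starts
    return f
end

function Base.getindex(f::LazyLines, i::Integer)
    isindexed(f) || index!(f)   # the first access pays the indexing cost
    starts = f.starts
    lo = starts[i]
    # Rows other than the last end just before their newline.
    hi = i < length(starts) ? starts[i + 1] - 2 : length(f.data)
    # The last row may or may not be terminated by a newline.
    if hi >= lo && f.data[hi] == UInt8('\n')
        hi -= 1
    end
    return view(f.data, lo:hi)  # raw bytes; parsing is left to the caller
end
```

Calling `index!(f)` explicitly before the first `f[i]` gives the multi-step behaviour instead; dropping the `isindexed(f) || index!(f)` line and throwing an error there would enforce it.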
I am only mildly familiar with the format, but if you need to find line breaks sequentially anyway, then a random access API makes little sense. Just support iteration.