Hi All,
I am working on a project where I want to do some machine learning over large data (about 50 GB) stored in AWS S3. I would like to ask people for their opinions on, and experiences with, the data formats they use and would recommend. My current approach uses JLD, but since other members of my team use Spark, they are understandably not keen on JLD. And because a big part of the preprocessing is already done in Spark, they are not keen on HDF5 either, because there is no good support for it.
Their preferred format is Avro, but I have found that the library is quite poorly written: reading a 3 GB file is 10 times slower than in Scala. What are people's experiences with Feather.jl and Parquet.jl? Are they in a good state (supported)? Are there other alternatives I am not aware of?
Thanks in advance for any answers,
Tomas