This thread is very clarifying, thanks so much!
Nevertheless, there is one case where I still cannot see the light: what happens when the source is itself an Arrow.Stream? In that case it is not obvious to me how to convert each table (i.e. each partition) to a DataFrame and then back into a new partition… in my naive understanding, something like this should work:
using Arrow
using DataFrames
using IterTools

MassivePartitionedTable_Input = Arrow.Stream("inputFile.arrow")

# Process the first partition separately so the output file gets created with Arrow.write;
# the remaining partitions are then appended to it.
firstPartition, restPartitions = firstrest(MassivePartitionedTable_Input)
df_firstPartition = DataFrame(firstPartition)
largerTableOutput = DoesSomeThingOnThisPartition(df_firstPartition)
Arrow.write("outputFile.arrow", largerTableOutput)

for eachPartition in restPartitions
    df_eachPartition = DataFrame(eachPartition)
    largerTableOutput = DoesSomeThingOnThisPartition(df_eachPartition)
    Arrow.append("outputFile.arrow", largerTableOutput)
end
However, I get the following error:
ERROR: MethodError: no method matching append(::WindowsPath, ::DataFrame)
I am not sure how to proceed… should I convert largerTableOutput back to a table with a single partition before appending?
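For example, something like this, continuing from the loop above? This is just my guess; I am assuming that Arrow.tobuffer followed by Arrow.Table gives me back the processed DataFrame as an Arrow table with one partition:

# Just my guess: serialize the processed DataFrame to an in-memory arrow buffer
# and read it back as an Arrow.Table (a single record batch) before appending.
arrowPartition = Arrow.Table(Arrow.tobuffer(largerTableOutput))
Arrow.append("outputFile.arrow", arrowPartition)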
Sorry… in this case I find the documentation a bit fuzzy. I would really appreciate your ideas.
Javier