How can I split large data using a faster and more efficient function (data science)?

For large files you want to stream through them, i.e., not load them into memory first, but iterate over the rows one by one. Others have already suggested CSV.Rows for that purpose. Then try to break the work into separate steps and use an independent tool for each step:

using CSV                                        # needed for CSV.Rows

rows = CSV.Rows(<my_file>)                       # 1. Get a lazy iterator over all rows
chunks = Base.Iterators.partition(rows, 427905)  # 2. Chunk into 427905 rows each (each chunk still has to fit into memory)
map(my_task, chunks)                             # 3. Do the actual work with each chunk
# for chunk in chunks                            # 3. Alternatively, loop if you do not want to store the results of my_task,
#     my_task(chunk)                             #    e.g., because each chunk is written to another file immediately
# end
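
If it helps, here is a minimal sketch of what my_task could look like, assuming a file named "my_file.csv" with a numeric column called amount (both names are made up for this example). Keep in mind that CSV.Rows returns values as strings by default, so you have to parse them yourself:

using CSV

# Hypothetical my_task: sum a numeric column over one chunk of rows.
# "my_file.csv" and the column name `amount` are assumptions for illustration.
function my_task(chunk)
    total = 0.0
    for row in chunk
        total += parse(Float64, row.amount)      # CSV.Rows yields string values by default
    end
    return total
end

rows = CSV.Rows("my_file.csv")                   # lazy iterator, rows are read on demand
chunks = Base.Iterators.partition(rows, 427905)  # groups of up to 427905 rows each
chunk_totals = map(my_task, chunks)              # one partial result per chunk
grand_total = sum(chunk_totals)                  # combine the per-chunk results

The point is that each chunk is an ordinary collection of rows, so my_task can do whatever fits your workflow, e.g., aggregate as above or write the chunk straight to another file.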