I am loading a large .dta file and then converting it to a DataFrame. Because of its size, the conversion takes a very long time (~20 minutes). I don’t need all of the columns in the file, so I am wondering whether anyone knows of a way to convert only a subset of them.
Specifically:
using DataFrames, StatFiles
df_1 = load("data.dta");
df_2 = DataFrame(df_1);
df_3 = df_2[!,:col1];
Step 2 takes a long time, so if there were a way to swap steps 2 and 3, namely to first select the columns I want and then convert to a DataFrame, that would speed things up a lot. Any help would be greatly appreciated.
Thanks!
            
I think you can convert the .dta file to .csv first, read the CSV (e.g. with CSV.jl), and then convert that to a DataFrame.
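Something along these lines might work once the data has been exported to CSV. This is only a sketch: the helper name `read_selected` and the temporary demo file are made up for illustration, and it relies on the `select` keyword of CSV.jl to parse only the listed columns.

```julia
using CSV, DataFrames

# Hypothetical helper: read only the requested columns from a CSV export
# of the .dta file. `path` and `cols` are placeholders for illustration.
function read_selected(path::AbstractString, cols::Vector{Symbol})
    # `select` asks CSV.jl to keep only these columns; the others are skipped.
    return CSV.read(path, DataFrame; select = cols)
end

# Small demo with a temporary file standing in for the real export.
demo_path = tempname() * ".csv"
write(demo_path, "col1,col2,col3\n1,a,1.5\n2,b,2.5\n")
df = read_selected(demo_path, [:col1])
```

With the real file this would be `read_selected("data.csv", [:col1])`, assuming the export is named `data.csv`.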
            
I don’t think there is a way to load a subset of the columns.
I agree that you should save the data as .csv. If you have access to Stata, that’s very easy. Otherwise, both R and pandas have .dta readers. However, I’m not confident they are any better than Julia’s, since I think StatFiles uses the same underlying library as they do.
            
Thank you both. I ended up calling R from Julia via RCall. That seems to work much faster than loading the .dta file directly in Julia.
Thanks!
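For anyone trying the same route, here is a rough sketch of what it could look like. It assumes R is installed with the haven and tidyselect packages, which is not necessarily the exact setup used here; the function name `read_dta_columns` is hypothetical. haven's `read_dta` accepts a `col_select` argument, so only the requested columns are read on the R side.

```julia
using RCall, DataFrames

# Hypothetical helper: read selected columns of a .dta file through R.
# Assumes the R packages haven and tidyselect are installed.
function read_dta_columns(path::AbstractString, cols::Vector{String})
    # `$path` and `$cols` interpolate the Julia values into the R call.
    # `col_select` restricts reading to the listed columns.
    robj = R"as.data.frame(haven::read_dta($path, col_select = tidyselect::all_of($cols)))"
    # Copy the R data.frame back into a Julia DataFrame.
    # Columns carrying Stata value labels may need extra handling in R first.
    return rcopy(DataFrame, robj)
end
```

Usage would be something like `df = read_dta_columns("data.dta", ["col1"])`.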
            
It’s weird that it takes much longer than R. @davidanthoff probably knows.