Loading Data


#1

After many days of frustrating attempts to load data from a directory into my JuliaPro environment, I have successfully done so. This is a huge accomplishment for me, and honestly it shouldn’t be. I am an R user trying to make the switch, but it is proving to be quite the challenge. Loading data in RStudio is actually easy…loading data in JuliaPro is seemingly easy.

If anyone on this forum is from Boston and open to helping me learn, please reach out to me!


#2

“Loading data” is pretty general. There are of course a huge number of ways of doing this in the general case. Is there something specific you need help with?


#3

Where do i start…

I did need help with loading a CSV file (I have managed to use CSVFiles), and that worked today.

I’ve been watching some tutorials, namely the Queryverse JuliaCon talk, and the presenter uses a load function with extreme ease that just isn’t working for me in JuliaPro (Queryverse won’t even precompile; it gives me a NodeJS error, but Node.js is installed on my machine).

Using the load function, I get this scary error:

This is all very frustrating =(


#4

I haven’t used Queryverse much, so I probably can’t be super useful, but have you tried CSV.jl?

There is (admittedly somewhat lacking) documentation for it here. You can load a CSV with, for example

using CSV, DataFrames
df = CSV.File(filename) |> DataFrame

That package is still not quite in a state where I’d call it “mature” but an enormous amount of really nice work has been done on it and it’s pretty flexible these days.

xlsx is a nightmare no matter which way you look at it. These files are notorious for not loading the same way in any two different programs or APIs. In my job as a data scientist I forbid anyone from giving me an xlsx because they are so horrible to deal with. That said, if you have some files that load consistently in R, there is probably some way of loading them consistently in Julia. You can give XLSX.jl a shot if you want. It is clearly maintained, but I’m not sure how many of Excel’s lunatic edge cases it handles.


#5

@Aeonglacial Thanks for your feedback! I’m the presenter from the Queryverse talk :slight_smile:

So my understanding is that CSVFiles.jl is working for you at this point? But the two errors you face are:

  1. You can’t load Queryverse.jl.
  2. ExcelFiles.jl doesn’t work for you, i.e. you get the error you pasted above.

Were there any other issues along the way? It would be great if you could let us know about those, even if you resolved them. At a minimum it can help us improve our documentation. Even things that weren’t bugs, but were just difficult for you to find out: please let us know, that kind of feedback really helps us improve things!

Issue 1) seems to be an installation issue. What platform are you on? Could you paste the exact error message you get? The most helpful place to post this is in a new issue over at https://github.com/queryverse/Queryverse.jl, but if that is too cumbersome, feel free to just post it here.

About issue 2) I’m less sure. The XLDateAmbiguous suggests that there is something in that file that makes the underlying Python Excel library choke. Is there a chance you could post the Excel file? The best place to post that would be an issue at https://github.com/queryverse/ExcelFiles.jl, but again, if that is too cumbersome, feel free to just post it here.

Apart from these bugs, you might find this helpful.


#6

Thank you for your reply! I have watched your presentations on VSCode and Queryverse at JuliaCon, as well as the live stream video of Queryverse, and I find them to be excellent presentations, so thank you for your work! I would love to reproduce the ease of working with and exploring my datasets that you demonstrated.

Issue 1: Windows 10, and I’m using JuliaPro 1.0.2.1.
Here is the error:

Here’s what happens when I run Pkg.build("NodeJS"):

I don’t mind posting the issue to github.

I would love to share the dataset, but it’s patient data and protected by HIPAA. I could reduce the patient identifiers, but maybe I can describe the dataset instead: 267,368 visits with arrival times, appointment dates, race, “gender”, zip codes, a running total of previous no-shows, and appointment status (arrived/no-show). Maybe this file is too large for ExcelFiles.jl?

Side note: in R I am able to correct erroneous zip codes (mislabeled, or maybe an ‘O’ instead of a 0) and get their geospatial coordinates to calculate distance from the clinic. Is there anything similar in Julia?


#7

There’s ZipCode.jl, but unfortunately it doesn’t seem to have been updated for Julia 1.0. You could extract useful functionality from there (e.g. zip code cleaning), and/or quite easily download a list of geocoded zip codes (e.g. from here) and use Geodesy.jl to calculate distance.
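To make that a bit more concrete, here is a minimal sketch in plain Julia of the two steps: cleaning a zip code and computing a distance. The helper names `clean_zip` and `haversine_km` are made up for illustration and are not taken from ZipCode.jl. To keep the example free of dependencies it uses the haversine great-circle formula on a spherical Earth rather than Geodesy.jl, so treat the distances as approximate. The coordinates in the lookup table are rough, illustrative values; a real table would come from a geocoded zip code list.

```julia
# Hypothetical helper: normalize a raw zip code to 5 digits, or return `missing`.
function clean_zip(raw::AbstractString)
    s = replace(strip(raw), r"[Oo]" => "0")   # letter O typed instead of zero
    m = match(r"^(\d{5})(-\d{4})?$", s)        # accept 12345 or 12345-6789
    return m === nothing ? missing : String(m.captures[1])
end

# Great-circle distance in km between two (lat, lon) points given in degrees.
# Assumes a spherical Earth with a mean radius of 6371 km.
function haversine_km(lat1, lon1, lat2, lon2; radius_km = 6371.0)
    p1, p2 = deg2rad(lat1), deg2rad(lat2)
    dlat = p2 - p1
    dlon = deg2rad(lon2 - lon1)
    a = sin(dlat / 2)^2 + cos(p1) * cos(p2) * sin(dlon / 2)^2
    return 2 * radius_km * asin(min(1.0, sqrt(a)))
end

# Illustrative lookup table (approximate coordinates, for demonstration only).
zip_coords = Dict(
    "02108" => (42.36, -71.07),
    "02139" => (42.36, -71.10),
)

clinic = zip_coords["02108"]
zip = clean_zip(" O2139 ")          # stray spaces and a letter O
if !ismissing(zip) && haskey(zip_coords, zip)
    lat, lon = zip_coords[zip]
    println("Distance from clinic: ", haversine_km(clinic..., lat, lon), " km")
end
```

Zip codes that cannot be repaired come back as `missing`, which fits the way DataFrames handles missing values, so they can be filtered or reviewed later instead of silently turning into a wrong location.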


#8

Alright, the short version of a long story is that the precompile problem seems to be due to a JuliaPro issue. I can reproduce the problem on JuliaPro, but everything works on a normal Julia install… I think the idea of JuliaPro is nice, but quite frankly, until it is more robust it might be a better idea to just use the normal Julia download and stay away from JuliaPro… From my non-scientific observations on this forum, JuliaPro just doesn’t seem to provide the “it just works” type of experience that it promises…

The Excel issue goes back to how the underlying xlrd Python library handles dates… There is a long discussion about that here, and apparently you are running into this problem. It is unclear to me right now whether there is a workaround, or what to do about that… I’m tracking it here. If anyone wants to dig in deeper, that would be great.
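For background, here is my understanding of where the ambiguity comes from, as a rough sketch rather than a description of what ExcelFiles.jl or xlrd actually do internally. In Excel's 1900 date system a date is stored as a day count, and Excel treats 1900 as a leap year even though it was not. Serial numbers from 61 upward map cleanly to calendar dates, but smaller ones do not, and as far as I know that is the range for which xlrd reports XLDateAmbiguous. A column holding only times of day (fractions below 1) or very small numbers formatted as dates could fall into that range. The function name `excel_serial_to_date` below is hypothetical, and the sketch assumes the 1900 date system.

```julia
using Dates

# Day zero that makes serial 61 land on 1900-03-01 in the 1900 date system.
const EXCEL_EPOCH = Date(1899, 12, 30)

# Hypothetical helper: convert an Excel serial number to a Date.
# Returns `missing` for serials below 61, where the calendar date is ambiguous
# because of Excel's fictitious 1900-02-29.
function excel_serial_to_date(serial::Real)
    serial < 0 && throw(ArgumentError("negative Excel serial: $serial"))
    days = floor(Int, serial)        # drop the time-of-day fraction
    days < 61 && return missing
    return EXCEL_EPOCH + Day(days)
end

println(excel_serial_to_date(43466))   # a serial from the modern range
println(excel_serial_to_date(0.5))     # a time-only value: no usable date
```

If the file does contain such values, one possible workaround is to store those columns as text or plain numbers in Excel before loading, though I have not verified that this avoids the error in ExcelFiles.jl.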