Entering xlsx columns into HypothosisTests

I’m trying to do Chi Squared tests of variables in a dataframe, from an xlsx file. When I use describe() I get a table:

────┼─────────┼──────────┼──────────┤
│ 1   │ Q1       │ 1.47727 │ 1     │ 1.0     │ 2     │ 2       │ 0        │ Any      │
│ 2   │ Q2       │ 1.65909 │ 1     │ 1.5     │ 3     │ 3       │ 0        │ Any      │
│ 3   │ Q3       │ 1993.14 │ 1988  │ 1993.0  │ 1996  │ 9       │ 0        │ Any      │

So, I know that Q1, Q2, Q3 are varables, from my dataframe df, but I’m not sure how to enter these into a Chi Squared, or any other hypothesis test.

ChisqTest(df,normalize(:Q1,1))

I’m not sure I understand what you’re trying to do, it would be helpful if you could provide a MWE. I’ll take a stab anyway:

I’m assuming you’re trying to perform Pearson’s chi-squared test from HypothesisTests.jl. As the documentation says, it expects either a vector or a matrix of integers, depending on what you are trying to test.

Below an example of how to do this assuming your data is in a DataFrame. Note that the tests I run below might not make any sense whatsoever, as I don’t understand what your data actually is!

julia> using DataFrames, HypothesisTests

julia> data = DataFrame(variable = ["Q1","Q2", "Q3"], x1 = [1.47, 1.65, 1993.14], x2 = [1, 1, 1988], x3 = [2,3,9])

julia> ChisqTest(data.x2)
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
    point estimate:          [0.0005025125628140704, 0.0005025125628140704, 0.9989949748743718]
    95% confidence interval: Tuple{Float64,Float64}[(0.0, 0.0016), (0.0, 0.0016), (0.998, 1.0)]

(...)

julia> ChisqTest(Matrix(data[:, [:x2, :x3]]))
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.001486547862359114, 0.001982063816478819, 0.9895453603770503, 1.0458125664837987e-5, 1.3944167553117318e-5, 0.006961625650893821]
    point estimate:          [0.000499001996007984, 0.000499001996007984, 0.9920159680638723, 0.000998003992015968, 0.0014970059880239522, 0.004491017964071856]
    95% confidence interval: Tuple{Float64,Float64}[(0.0, 0.0042), (0.0, 0.0042), (0.989, 0.9957), (0.0, 0.0047), (0.0, 0.0052), (0.0015, 0.0082)]
1 Like

I should have a tag for the XLSX package, (help with that would be awesome), but I’m not asking to do a chi squared. What I want to know is what each column is called in my XLSX, whether it’s the names from the Excel document, or different, and how to input them into a function like ChisqTest().

Ah I see - what is typeof(your_data)? How did you read the file? You probably want df = DataFrame(XLSX.readtable("myfile.xlsx", "mysheet")...) which returns a DataFrame.

1 Like

And note the ... at the end of the function. That’s actual code (for “splatting” the results), not an ellipsis meant as a place holder. The return type from XLSX.jl is a bit weird.

OT - Someone should really add a Tables interface to it…

It’s from an xlsx file and I made it into a data frame,
df = DataFrame(XLSX.readtable("C:/Users/brett/Downloads/Sampledataset.xlsx", "Sheet1")...)

I could attach the file if you need, but the first row is all strings, that indicate the variable name. Variables are is columns.

I did that, just not sure how to put a column in a test.

The lack of features like that are a disadvantage of new languages over established ones. This is why community is so important.

Have you read the Dataframes documentation? You can get a column as a vector by doing df.column_name or df[:, :column_name].

Though it sounds like things won’t work while your XLSX file isn’t being read properly. From your code above it looks like things are being read in correctly, so we can’t help you further without the file itself.

Do you absolutely have to use excel? A .csv file is more robust and easier to read in.

1 Like

Thank’s a sort of went through a tutorial, that wasn’t complete. Does Julia have a version of CRAN? That would be a huge help.

Excel saves as .csv, so if it’s better then I’ll just save the file as a csv. I’ve read lots of forums where people were complaining about the xlsx reader, but they were all from 2018, so I hoped the problems were sorted. The weird thing is that describe(df) works fine, and it’s clear that the columns are seperate from the data from when I run the file and describe.

I saved the sheet as Sampledata-Sheet1.csv (actually how Drive saved it). When I tried to put it into CSV:

tbl = CSV.File("Sampledataset-Sheet1.csv")|> DataFrame!

I’m told it’s not a valid file,

ERROR: ArgumentError: "Sampledataset-Sheet1.csv" is not a valid file
I also did a verson with the full destination, and I took this off of the CSV github, so it should be complete.

Can you show us what describe(df) means? If it works, as in, numeric columns are numeric instead of strings, then you should be all set.

XLSX.jl has worked well for me in the past, but obviously the excel sheet needs to be well organized for the output to be useful.

it’s clear that the columns are seperate from the data from when I run the file and describe.

Could you clarify what this means?

The DataFrames documentation discusses how to get access to the columns here. It should be easy to find.

Additionally, Julia has a single website with documentation in one place here. You can search DataFrames (or HypothesisTests etc) using the search on the left.

Thank you so much! I’ll be sure to read over all of those. The table above is for describe(df)

10×8 DataFrame
│ Row │ variable │ mean    │ min   │ median  │ max   │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Float64 │ Int64 │ Float64 │ Int64 │ Int64   │ Int64    │ DataType │
├─────┼──────────┼─────────┼───────┼─────────┼───────┼─────────┼──────────┼──────────┤
│ 1   │ Q1       │ 1.47727 │ 1     │ 1.0     │ 2     │ 2       │ 0        │ Any      │
│ 2   │ Q2       │ 1.65909 │ 1     │ 1.5     │ 3     │ 3       │ 0        │ Any      │
│ 3   │ Q3       │ 1993.14 │ 1988  │ 1993.0  │ 1996  │ 9       │ 0        │ Any      │
⋮
│ 7   │ Q7       │ 1.77273 │ 1     │ 1.5     │ 4     │ 4       │ 0        │ Any      │
│ 8   │ Q8       │ 1.68182 │ 1     │ 1.0     │ 3     │ 3       │ 0        │ Any      │
│ 9   │ Q9       │ 1.54545 │ 1     │ 2.0     │ 2     │ 2       │ 0        │ Any      │
│ 10  │ Q10      │ 1.5     │ 1     │ 1.5     │ 2     │ 2       │ 0        │ Any      │

I cut the top part off, because it didn’t line up in the chat window right (but it looks right in the published preview now: sorry). But anyhow, here is my description, and you can see that it shows the variable names as variables, and it does the mean, mind, medaian, and max, so that looks good.

This is how I tried to do Chi Squared and the error I got:

julia> ChisqTest(df,normalize(:Q1,1))
ERROR: MethodError: no method matching normalize(::Symbol, ::Int64)
Stacktrace:
 [1] top-level scope at none:0

I think my mistake is how I tried to enter the variable into the ChisqTest.

You need to post the command you used to get the value, not just the error.

Sorry, I posted the line I entered, but didn’t have it formatted. I fixed it now.

ChisqTest(df,normalize(:Q1,1))

So this error just tells you that you cannot normalize a symbol (:Q1). You probably meant to normalize the column :Q1 in your DataFrame. I’m not sure what normalize function you are using, i.e. which package it is from (as I said it would help a lot if you could post an MWE), but presumably you need to pass it the columns in your DataFrame.

Something like this might work (again subject to how the normalize function you’re using actually works):

df[!, :Q1_normalized] = normalize(df.Q1)

ChisqTest(df.Q1_normalized)

As the documentation for ChisqTest explains, this tests whether the population probabilities are all equal.

I highly recommend you have a look at the DataFrames docs or this tutorial to get the hang of working with DataFrames in Julia - the way you’ve used :Q1 in the function call above suggests you might have experience with R, where indexing works slightly differently.

using Printf, Statistics, XLSX
df = DataFrame(XLSX.readtable(
"C:/Users/brett/Downloads/Sampledataset.xlsx", "Sheet1")...)
describe(df)
ChisqTest(:Q1)

I think this is my MWE, everything works except the line I,m asking about. I added the packages I’m using, and if I need any others, that would be good to know.

Thanks, this is helpful - although to be nitpicky not quite an MWE, given that this relies on a file on your computer that others don’t have access to. For a complete MWE, you could have used XLSX to create a sample *.xlsx file, read that in, and move on from there.

But this is just for future reference to make it easier for others to help you - this works much better if you can actually run and debug the code on your own machine!

In this case the issue is apparent though: you are calling ChisqTest on the symbol :Q1, which is not the same as the column df.Q1 which holds your data. Try ChisqTest(df.Q1) as I said above.

Again, this might be motivated by a background in R or similar languages that have special parsing rules for characters used in functions related to tables. This is not the case in plain Julia, so you always need to select data from your table with the appropriate indexing strategy (broadly df[row_indexer, column_indexer]), unless you are using packages relying on macros such as Query.jl or DataFramesMeta

3 Likes

Hi Brett.