CSV.read seems to expect at least 2 columns

Axze-rgb · September 26, 2023, 8:58am

Hello, I am using CSV.read to load a single column dataframe. Here is the command:

df_MQBZ=CSV.read(“MQBZ.hist”, DataFrame, header=false)

That works but here is the result

> Column1	Column2

Missing String15
1 missing “.”
2 missing “.”
3 missing “0.11702”
4 missing “-0.854428”

I can select column2 only, but why is it expecting a first column to begin with? It should be able to realize the data are in a single column no?

thanks

nilshg · September 26, 2023, 9:09am

It doesn’t:

julia> using CSV

julia> CSV.write("onecol.csv", (x = rand(5),))
"onecol.csv"

julia> using DataFrames

julia> CSV.read("onecol.csv", DataFrame)
5×1 DataFrame
 Row │ x
     │ Float64
─────┼───────────
   1 │ 0.0121646
   2 │ 0.19368
   3 │ 0.0875001
   4 │ 0.440309
   5 │ 0.895765

Or put differently, the issue is with your file and/or parsing options, not CSV.jl

Axze-rgb · September 26, 2023, 9:14am

Thanks for the prompt answer, the problem is that I am sure I have a single column. I tested with awk

awk ‘{print NF; exit}’ MQBZ.hist
1

The data looks like this

head MQBZ.hist
.
.
0.11702
-0.854428
0.239624
-0.663206
0.229609
0.611315
-0.0516864
-0.0172187

is it interpreting a “.” or the “-” symbol in a specific way?

Dan · September 26, 2023, 9:20am

This awk script tests only the first line.

nilshg · September 26, 2023, 9:20am

No. Copying your data into a simple text file:

julia> CSV.read("mqbz.csv", DataFrame)
9×1 DataFrame
 Row │ .
     │ String15
─────┼────────────
   1 │ .
   2 │ 0.11702
   3 │ -0.854428
   4 │ 0.239624
   5 │ -0.663206
   6 │ 0.229609
   7 │ 0.611315
   8 │ -0.0516864
   9 │ -0.0172187

Dan · September 26, 2023, 9:23am

A more appropriate awk command would be:

awk -F, 'BEGIN {maxnf=0} {if (NF>maxnf) {maxnf = NF}} END {print "maxnf=", maxnf}' MQBZ.hist

Axze-rgb · September 26, 2023, 9:29am

Thanks for the improved awk but

awk -F, ‘BEGIN {maxnf=0} {if (NF>maxnf) {maxnf = NF}} END {print “maxnf=”, maxnf}’ MQBZ.hist
maxnf= 1

However, something not correct is that these data are interpreted as string instead of float, could it be the source of the error?

My data file is almost 2 million lines long so I can’t paste it here for testing

Dan · September 26, 2023, 9:48am

This is strange, can you give the output of:

awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' MQBZ.hist

which gives a count for each character in the file?

Axze-rgb · September 26, 2023, 9:56am

Sure, here it is

awk -vFS=“” ‘{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}’ MQBZ.hist
4 355365
5 347440
6 341754
7 335825
8 331927
9 326924

365027
620395
. 618861
0 626287
1 545963
2 423568
3 373330

It looks like it gives indeed 2 columns, really puzzling.

Dan · September 26, 2023, 10:07am

Thanks. It would be better to surround the output with triple quotes (```) on both sides to format it better.

Now, another request, can you post the output of:

od -xc MQBZ.hist | head

?

Axze-rgb · September 26, 2023, 10:11am

Here is the new output


0000000    2e20    200a    0a2e    3020    312e    3731    3230    200a
              .  \n       .  \n       0   .   1   1   7   0   2  \n    
0000020    302d    382e    3435    3234    0a38    3020    322e    3933
          -   0   .   8   5   4   4   2   8  \n       0   .   2   3   9
0000040    3236    0a34    2d20    2e30    3636    3233    3630    200a
          6   2   4  \n       -   0   .   6   6   3   2   0   6  \n    
0000060    2e30    3232    3639    3930    200a    2e30    3136    3331
          0   .   2   2   9   6   0   9  \n       0   .   6   1   1   3
0000100    3531    200a    302d    302e    3135    3836    3436    200a
          1   5  \n       -   0   .   0   5   1   6   8   6   4  \n  
```
 
  
In case you would know bioinformatics , the file was generated like this (yes I made a type in the output file name). 

> bcftools query -f '[ %BQBZ]\n' 0001.vcf > MQBZ.hist

Dan · September 26, 2023, 10:35am

Use:

df_MQBZ=CSV.read(“MQBZ.hist”, DataFrame, header=false, delim=',')

with the added delim parameter.

The reason for the problem (which I’ve reproduced) is the extra space at the beginning of each line. The space confused the delimiter auto-detect code.

The reason the problem didn’t reproduce, is that pasting the lines with a space at the beginning instructed Discourse website to format as quote and remove the initial space.

Additionally, you can get rid of the extra space character by removing it from the bcftools format string.

Axze-rgb · September 26, 2023, 10:39am

Ohhh, this is amazing. Thanks for having spent so much time to help me debug this. That works. It seems the program “bcftools” introduces spaces. I don’t know why exactly it does that, and it also confuses the type detection since the value are detected as string, but I can fix that with types=[Float64]. I can indeed remove that space, The doc contains the space but I believe it’s because they show a very specific case, and don’t show the simple case.

Again, thanks so much

Topic		Replies	Views
CSV.read not able to read properly a simple file General Usage	3	1083	December 2, 2018
Read file with CSV.read New to Julia	8	19793	September 9, 2019
CSV, DataFrame read data file with string and Float64 columns New to Julia dataframes	3	76	September 3, 2024
String7 type with read CSV? New to Julia	8	361	June 23, 2023
Problem with CSV.jl reading csv file with extra spaces General Usage csv	5	661	September 3, 2021

CSV.read seems to expect at least 2 columns

Related topics