CSV.read seems to expect at least 2 columns

Hello, I am using CSV.read to load a single column dataframe. Here is the command:

df_MQBZ=CSV.read(“MQBZ.hist”, DataFrame, header=false)

That works but here is the result

> Column1	Column2

Missing String15
1 missing “.”
2 missing “.”
3 missing “0.11702”
4 missing “-0.854428”

I can select column2 only, but why is it expecting a first column to begin with? It should be able to realize the data are in a single column no?

thanks

It doesn’t:

julia> using CSV

julia> CSV.write("onecol.csv", (x = rand(5),))
"onecol.csv"

julia> using DataFrames

julia> CSV.read("onecol.csv", DataFrame)
5×1 DataFrame
 Row │ x
     │ Float64
─────┼───────────
   1 │ 0.0121646
   2 │ 0.19368
   3 │ 0.0875001
   4 │ 0.440309
   5 │ 0.895765

Or put differently, the issue is with your file and/or parsing options, not CSV.jl

1 Like

Thanks for the prompt answer, the problem is that I am sure I have a single column. I tested with awk

awk ‘{print NF; exit}’ MQBZ.hist
1

The data looks like this

head MQBZ.hist
.
.
0.11702
-0.854428
0.239624
-0.663206
0.229609
0.611315
-0.0516864
-0.0172187

is it interpreting a “.” or the “-” symbol in a specific way?

This awk script tests only the first line.

No. Copying your data into a simple text file:

julia> CSV.read("mqbz.csv", DataFrame)
9×1 DataFrame
 Row │ .
     │ String15
─────┼────────────
   1 │ .
   2 │ 0.11702
   3 │ -0.854428
   4 │ 0.239624
   5 │ -0.663206
   6 │ 0.229609
   7 │ 0.611315
   8 │ -0.0516864
   9 │ -0.0172187

A more appropriate awk command would be:

awk -F, 'BEGIN {maxnf=0} {if (NF>maxnf) {maxnf = NF}} END {print "maxnf=", maxnf}' MQBZ.hist

Thanks for the improved awk but

awk -F, ‘BEGIN {maxnf=0} {if (NF>maxnf) {maxnf = NF}} END {print “maxnf=”, maxnf}’ MQBZ.hist
maxnf= 1

However, something not correct is that these data are interpreted as string instead of float, could it be the source of the error?

My data file is almost 2 million lines long so I can’t paste it here for testing

This is strange, can you give the output of:

awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' MQBZ.hist

which gives a count for each character in the file?

Sure, here it is

awk -vFS=“” ‘{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}’ MQBZ.hist
4 355365
5 347440
6 341754
7 335825
8 331927
9 326924

  • 365027
    620395
    . 618861
    0 626287
    1 545963
    2 423568
    3 373330

It looks like it gives indeed 2 columns, really puzzling.

Thanks. It would be better to surround the output with triple quotes (```) on both sides to format it better.

Now, another request, can you post the output of:

od -xc MQBZ.hist | head

?

Here is the new output


0000000    2e20    200a    0a2e    3020    312e    3731    3230    200a
              .  \n       .  \n       0   .   1   1   7   0   2  \n    
0000020    302d    382e    3435    3234    0a38    3020    322e    3933
          -   0   .   8   5   4   4   2   8  \n       0   .   2   3   9
0000040    3236    0a34    2d20    2e30    3636    3233    3630    200a
          6   2   4  \n       -   0   .   6   6   3   2   0   6  \n    
0000060    2e30    3232    3639    3930    200a    2e30    3136    3331
          0   .   2   2   9   6   0   9  \n       0   .   6   1   1   3
0000100    3531    200a    302d    302e    3135    3836    3436    200a
          1   5  \n       -   0   .   0   5   1   6   8   6   4  \n  
```
 
  
In case you would know bioinformatics , the file was generated like this (yes I made a type in the output file name). 

> bcftools query -f '[ %BQBZ]\n' 0001.vcf > MQBZ.hist

Use:

df_MQBZ=CSV.read(“MQBZ.hist”, DataFrame, header=false, delim=',')

with the added delim parameter.

The reason for the problem (which I’ve reproduced) is the extra space at the beginning of each line. The space confused the delimiter auto-detect code.

The reason the problem didn’t reproduce, is that pasting the lines with a space at the beginning instructed Discourse website to format as quote and remove the initial space.

Additionally, you can get rid of the extra space character by removing it from the bcftools format string.

1 Like

Ohhh, this is amazing. Thanks for having spent so much time to help me debug this. That works. It seems the program “bcftools” introduces spaces. I don’t know why exactly it does that, and it also confuses the type detection since the value are detected as string, but I can fix that with types=[Float64]. I can indeed remove that space, The doc contains the space but I believe it’s because they show a very specific case, and don’t show the simple case.

Again, thanks so much

3 Likes