Hello, I am using CSV.read to load a single column dataframe. Here is the command:
df_MQBZ=CSV.read(“MQBZ.hist”, DataFrame, header=false)
That works but here is the result
> Column1 Column2
Missing String15
1 missing “.”
2 missing “.”
3 missing “0.11702”
4 missing “-0.854428”
I can select column2 only, but why is it expecting a first column to begin with? It should be able to realize the data are in a single column no?
thanks
nilshg
September 26, 2023, 9:09am
2
It doesn’t:
julia> using CSV
julia> CSV.write("onecol.csv", (x = rand(5),))
"onecol.csv"
julia> using DataFrames
julia> CSV.read("onecol.csv", DataFrame)
5×1 DataFrame
Row │ x
│ Float64
─────┼───────────
1 │ 0.0121646
2 │ 0.19368
3 │ 0.0875001
4 │ 0.440309
5 │ 0.895765
Or put differently, the issue is with your file and/or parsing options, not CSV.jl
1 Like
Thanks for the prompt answer, the problem is that I am sure I have a single column. I tested with awk
awk ‘{print NF; exit}’ MQBZ.hist
1
The data looks like this
head MQBZ.hist
.
.
0.11702
-0.854428
0.239624
-0.663206
0.229609
0.611315
-0.0516864
-0.0172187
is it interpreting a “.” or the “-” symbol in a specific way?
Dan
September 26, 2023, 9:20am
4
Axze-rgb:
awk ‘{print NF; exit}’
This awk
script tests only the first line.
nilshg
September 26, 2023, 9:20am
5
No. Copying your data into a simple text file:
julia> CSV.read("mqbz.csv", DataFrame)
9×1 DataFrame
Row │ .
│ String15
─────┼────────────
1 │ .
2 │ 0.11702
3 │ -0.854428
4 │ 0.239624
5 │ -0.663206
6 │ 0.229609
7 │ 0.611315
8 │ -0.0516864
9 │ -0.0172187
Dan
September 26, 2023, 9:23am
6
A more appropriate awk
command would be:
awk -F, 'BEGIN {maxnf=0} {if (NF>maxnf) {maxnf = NF}} END {print "maxnf=", maxnf}' MQBZ.hist
Thanks for the improved awk but
awk -F, ‘BEGIN {maxnf=0} {if (NF>maxnf) {maxnf = NF}} END {print “maxnf=”, maxnf}’ MQBZ.hist
maxnf= 1
However, something not correct is that these data are interpreted as string instead of float, could it be the source of the error?
My data file is almost 2 million lines long so I can’t paste it here for testing
Dan
September 26, 2023, 9:48am
8
This is strange, can you give the output of:
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' MQBZ.hist
which gives a count for each character in the file?
Sure, here it is
awk -vFS=“” ‘{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}’ MQBZ.hist
4 355365
5 347440
6 341754
7 335825
8 331927
9 326924
365027
620395
. 618861
0 626287
1 545963
2 423568
3 373330
It looks like it gives indeed 2 columns, really puzzling.
Dan
September 26, 2023, 10:07am
10
Thanks. It would be better to surround the output with triple quotes (```) on both sides to format it better.
Now, another request, can you post the output of:
od -xc MQBZ.hist | head
?
Here is the new output
0000000 2e20 200a 0a2e 3020 312e 3731 3230 200a
. \n . \n 0 . 1 1 7 0 2 \n
0000020 302d 382e 3435 3234 0a38 3020 322e 3933
- 0 . 8 5 4 4 2 8 \n 0 . 2 3 9
0000040 3236 0a34 2d20 2e30 3636 3233 3630 200a
6 2 4 \n - 0 . 6 6 3 2 0 6 \n
0000060 2e30 3232 3639 3930 200a 2e30 3136 3331
0 . 2 2 9 6 0 9 \n 0 . 6 1 1 3
0000100 3531 200a 302d 302e 3135 3836 3436 200a
1 5 \n - 0 . 0 5 1 6 8 6 4 \n
```
In case you would know bioinformatics , the file was generated like this (yes I made a type in the output file name).
> bcftools query -f '[ %BQBZ]\n' 0001.vcf > MQBZ.hist
Dan
September 26, 2023, 10:35am
12
Use:
df_MQBZ=CSV.read(“MQBZ.hist”, DataFrame, header=false, delim=',')
with the added delim
parameter.
The reason for the problem (which I’ve reproduced) is the extra space at the beginning of each line. The space confused the delimiter auto-detect code.
The reason the problem didn’t reproduce, is that pasting the lines with a space at the beginning instructed Discourse website to format as quote and remove the initial space.
Additionally, you can get rid of the extra space character by removing it from the bcftools
format string.
1 Like
Ohhh, this is amazing. Thanks for having spent so much time to help me debug this. That works. It seems the program “bcftools” introduces spaces. I don’t know why exactly it does that, and it also confuses the type detection since the value are detected as string, but I can fix that with types=[Float64]. I can indeed remove that space, The doc contains the space but I believe it’s because they show a very specific case, and don’t show the simple case.
Again, thanks so much
3 Likes