Julia program reading CSV file from stdin

iwelch · August 28, 2018, 8:06pm

I am wondering whether it is possible to embed a CSV file in a julia program. It almost works with CSV.read(), except I do not see a way to signal “EOF” (like with ^D on the REPL).

julia> using DataFrames, CSV

julia> d= CSV.read( stdin )
yyyymmdd,days,rate
19960102,9,5.763067
19960102,15,5.745902
19960103,8,5.763067
19960103,14,5.747397
^D   ## please no unprintable control characters

any suggestions?

Iagoba_Apellaniz · August 28, 2018, 8:13pm

Have you tried using the head function from DataFrames? See Loading a Classic Data Set

iwelch · August 28, 2018, 8:23pm

I think head shortens output, and does little with input.

quinnj · August 28, 2018, 10:36pm

Hmmm…I’m not quite clear on the question here; is it really to read a CSV file from stdin or do you just want to “hard-code” a csv file in a script? For the latter, you could do something as simple as:

csv = CSV.read(IOBuffer("""
col1,col2
1,2
3,4
"""))

For the former, I think you’re right: I’m not sure how you would signal EOF so that CSV knows to stop reading (at least in a non-blocking way).
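A minimal sketch of the hard-coded approach, using only the standard-library DelimitedFiles module so that it runs without extra packages. The sample rows are taken from the opening post. With a recent CSV.jl the equivalent call takes a sink argument, i.e. `CSV.read(IOBuffer(csv_text), DataFrame)`; the one-argument form shown above reflects the 2018 API.

```julia
using DelimitedFiles  # standard library

# The newline right after the opening """ is dropped, so the text starts at the header.
csv_text = """
yyyymmdd,days,rate
19960102,9,5.763067
19960102,15,5.745902
"""

# Wrap the string in an IOBuffer so it can be read like a file.
# With header=true, readdlm returns a (data, header) tuple.
data, header = readdlm(IOBuffer(csv_text), ',', Float64; header = true)
```

Here `data` is a 2×3 `Float64` matrix and `header` holds the three column names.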

iwelch · August 28, 2018, 10:56pm

this is a pretty nice alternative, although I still would prefer my version.

PS: I am going to file a suggestion for an “endat=” keyword.
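The thread does not show that such a keyword exists. The idea can be approximated by copying lines from a stream until a sentinel line appears, then parsing the copy. A rough sketch, where `read_until` and the `__END__` sentinel are hypothetical names chosen for illustration, and the standard-library `readdlm` stands in for `CSV.read`:

```julia
using DelimitedFiles  # standard library

# Hypothetical helper: copy lines from `io` into a buffer until `sentinel` is seen.
function read_until(io::IO, sentinel::AbstractString = "__END__")
    buf = IOBuffer()
    for line in eachline(io)
        strip(line) == sentinel && break   # stop at the sentinel, do not copy it
        println(buf, line)
    end
    seekstart(buf)                         # rewind so the caller reads from the start
    return buf
end

# An in-memory stream stands in for stdin here.
io = IOBuffer("a,b\n1,2\n3,4\n__END__\nignored\n")
data, header = readdlm(read_until(io), ',', Int; header = true)
```

Passing `stdin` instead of `io` would let a user end the data by typing the sentinel line rather than a control character.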

Tamas_Papp · August 29, 2018, 8:51am

This is not an answer to your original question, but if I wanted to include nontrivial data in a package/project, I would just Serialization.serialize, and read it at runtime from a path I determined with @__DIR__.
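A minimal sketch of that suggestion, assuming a hypothetical file name `example_data.jls` and hypothetical helpers `save_data` and `load_data`. The round trip is demonstrated in a temporary directory so the sketch does not write next to the source file.

```julia
using Serialization  # standard library

# @__DIR__ expands to the directory of the current source file.
# The file name is an assumption made for this example.
data_path = joinpath(@__DIR__, "example_data.jls")

save_data(path, data) = serialize(path, data)
load_data(path = data_path) = deserialize(path)

# Round trip in a temporary directory.
example = (yyyymmdd = [19960102, 19960103], rate = [5.763067, 5.747397])
tmp_path = joinpath(mktempdir(), "example_data.jls")
save_data(tmp_path, example)
restored = load_data(tmp_path)
```

Note that the serialized format is not guaranteed to be readable across Julia versions, so this suits data shipped alongside the code rather than long-term storage.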

iwelch · August 29, 2018, 4:10pm

agreed. the principal use of __END__ would be for small data frames, such as illustrative data sets.

perl also has a useful __DATA__ feature that one can stick at the end of illustrative programs. but this is not a package feature that would be easy to implement. then again, perl is not so very good at dealing with multiple files packaged together and residing elsewhere for its quick-and-dirty uses.

regards,

/iaw

Tamas_Papp · August 29, 2018, 4:16pm

When small, why not just have it in the source directly evaluated, possibly in a file that is included? Eg

const example_data = DataFrame(a = ..., b = ...)

with appropriate linebreaks etc if applicable.

iwelch · August 29, 2018, 4:29pm

definitely not as good visually for the reader. see, our illustrative finance data sets are often not calculated, but real data (think interest rates and stock returns in different months), and can be up to, say, 24 months long. just aligning them this way is a pain.

Tamas_Papp · August 29, 2018, 5:06pm

The CSV you display in the opening post is not aligned either.

I am now not really sure what you want. For small datasets, you can just use code and align (and comment!) as you prefer. For larger datasets, this is presumably not a concern as they would not be eyeballed directly.

Perhaps you can also include an IJulia notebook that visualizes certain features of the data (eg distribution, lag-1 scatter plot).
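For a small dataset, the aligned-in-code approach might look like the sketch below: one observation per line, written as a vector of named tuples. The rows are taken from the opening post; the variable names are illustrative. If DataFrames is loaded, `DataFrame(example_data)` should accept this directly, since a vector of named tuples is a Tables.jl row table.

```julia
# One observation per line, columns aligned by hand.
example_data = [
    (yyyymmdd = 19960102, days =  9, rate = 5.763067),
    (yyyymmdd = 19960102, days = 15, rate = 5.745902),
    (yyyymmdd = 19960103, days =  8, rate = 5.763067),
    (yyyymmdd = 19960103, days = 14, rate = 5.747397),
]

# Columns can be pulled out without any package.
rates = [row.rate for row in example_data]
mean_rate = sum(rates) / length(rates)
```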

yuyichao · August 29, 2018, 5:38pm

An __END__ keyword won’t work the way you suggested, since the script is NOT read through stdin. You’ll never be able to read anything. What you are asking for is just a way to embed a string in a script, and I don’t see what’s missing from a normal multi-line string constant, as @quinnj suggests.

iwelch · August 29, 2018, 6:18pm

thx, yuyichao. if the script itself is not going through stdin, then this is moot…unless it is possible to designate the input stream.

tamas—think about

julia> using DataFrames, CSV

julia> d= CSV.read( scriptstream )
  yyyymmdd  xmkt    rf
1  20121227 -0.11 -0.19
2  20121228 -1.00 -0.03
3  20121231  1.71 -0.04
4  20130102  2.62  0.30
5  20130103 -0.14  0.11
6  20130104  0.55  0.48
7  20130107 -0.31 -0.42
8  20130108 -0.26  0.02
9  20130109  0.34 -0.40
10 20130110  0.66  0.45
11 20130111  0.02 -0.38
12 20130114 -0.06  0.00
__END__

julia> ## do many different types of data analysis.

the """ construct is a reasonable alternative. not as nice, but close enough. the listing data on long lines as in DataFrame( yyyymmdd= ..., ...) is also feasible, but again not as nice for this purpose.

yuyichao · August 29, 2018, 6:37pm

Still, are you talking about typing in data at the REPL or embedding data in a script? You always talk about a script and then give the REPL as an example.

What I don’t understand is what’s not nice about it, or really, what are you looking for. AFAICT it’s just the difference between

d = read()
<data>
__END__

and

str = """
<data>
""" # __END__
d = read(IOBuffer(str))

That’s roughly the same number of lines. You get your __END__ in the comment if you want. Syntax highlight already works. read comes after the data but if you really want you can create a string macro and do

data = csv"""
<data>
"""

if it hasn’t been done already. (Should be as simple as macro csv_str(str) :(read(IOBuffer($(esc(str))))) end)

As for designating the input stream: that still won’t work. You’d still need to convince the parser to ignore your data, which is almost certainly invalid code.
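The string-macro idea mentioned in this post can be sketched as follows. The bare `read` in the post is a placeholder, so this version swaps in the standard-library `readdlm`; the numeric-only `Float64` parsing is an assumption that fits the sample rows from this thread.

```julia
using DelimitedFiles  # standard library

# Defines the csv"..." string literal. The literal's text arrives as a plain
# string, which is spliced into a call that parses it at run time.
macro csv_str(str)
    :(readdlm(IOBuffer($str), ',', Float64; header = true))
end

data, header = csv"""
yyyymmdd,xmkt,rf
20121227,-0.11,-0.19
20121228,-1.00,-0.03
"""
```

With CSV.jl and DataFrames loaded, the macro body could call `CSV.read(IOBuffer($str), DataFrame)` instead, which would also handle non-numeric columns.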

iwelch · August 29, 2018, 7:56pm

on reflection, I think you are right.
