Julia program reading CSV file from stdin

I am wondering whether it is possible to embed a CSV file in a Julia program. It almost works with CSV.read(), except that I do not see a way to signal “EOF” (as with ^D on the REPL).

julia> using DataFrames, CSV

julia> d= CSV.read( stdin )
yyyymmdd,days,rate
19960102,9,5.763067
19960102,15,5.745902
19960103,8,5.763067
19960103,14,5.747397
^D   ## please no unprintable control characters

Any suggestions?

Have you tried using the head function from DataFrames? See Loading a Classic Data Set

I think head shortens output, and does little with input.

Hmmm… I’m not quite clear on the question here: is it really to read a CSV file from stdin, or do you just want to “hard-code” a CSV file in a script? For the latter, you could do something as simple as:

using CSV, DataFrames

csv = CSV.read(IOBuffer("""
col1,col2
1,2
3,4
"""), DataFrame)  # recent CSV.jl versions require a sink such as DataFrame

For the former, I think you’re right: I’m not sure how you would signal EOF so that CSV knows to stop reading (at least in a non-blocking way).

This is a pretty nice alternative, although I would still prefer my version.

PS: I am going to file a suggestion for an “endat=” keyword.
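
Until something like that exists, the idea can be approximated by hand. The following is only a sketch: `read_until` and the `__END__` sentinel are hypothetical names made up for illustration, not part of CSV.jl. It copies lines from any input stream into a buffer and stops at the sentinel line, so no control character is needed.

```julia
# Hypothetical helper (not part of CSV.jl): copy lines from `io` into an
# in-memory buffer until a line equal to `sentinel` is seen.
function read_until(io::IO, sentinel::AbstractString = "__END__")
    buf = IOBuffer()
    for line in eachline(io)
        strip(line) == sentinel && break   # stop at the sentinel line
        println(buf, line)
    end
    seekstart(buf)   # rewind so the next reader starts at the beginning
    return buf
end

# Demonstration with an in-memory stream standing in for stdin.
input = IOBuffer("yyyymmdd,days,rate\n19960102,9,5.763067\n__END__\nnot read\n")
text = read(read_until(input), String)
```

Assuming a recent CSV.jl, the returned buffer could then be passed on as `CSV.read(read_until(stdin), DataFrame)`.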

This is not an answer to your original question, but if I wanted to include nontrivial data in a package/project, I would just use Serialization.serialize and read it back at runtime from a path determined with @__DIR__.
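
A minimal sketch of that approach, using only the Serialization standard library. The file name `example_data.jls` and the sample values (taken from the opening post) are assumptions for illustration.

```julia
using Serialization

# Hypothetical file name; the file lives next to this source file.
const DATA_PATH = joinpath(@__DIR__, "example_data.jls")

# One-time step: write the data out.
data = (yyyymmdd = [19960102, 19960103], rate = [5.763067, 5.747397])
serialize(DATA_PATH, data)

# At runtime: read it back from the path derived from @__DIR__.
loaded = deserialize(DATA_PATH)
```

Note that the serialized format is not guaranteed to be readable across Julia versions, so this suits data that can be regenerated if needed.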

Agreed. The principal use of __END__ would be for small data frames, such as illustrative data sets.

Perl also has a useful __DATA__ feature that one can stick at the end of illustrative programs, but this is not a package feature that would be easy to implement. Then again, Perl is not so very good at dealing with multiple files packaged together and residing elsewhere for its quick-and-dirty uses.

regards,

/iaw

When the data is small, why not just have it evaluated directly in the source, possibly in a file that is included? E.g.,

const example_data = DataFrame(a = ..., b = ...)

with appropriate linebreaks etc if applicable.

Definitely not as good visually for the reader. See, our illustrative finance data sets are often not calculated, but real data (think interest rates and stock returns in different months), and can be up to, say, 24 months long. Just aligning them this way is a pain.

The CSV you display in the opening post is not aligned either.

I am now not really sure what you want. For small datasets, you can just use code and align (and comment!) as you prefer. For larger datasets, this is presumably not a concern as they would not be eyeballed directly.
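
As a rough illustration of aligned, commented data in code, here is one possible layout using a vector of named tuples, one observation per line. The values are the first rows of the returns table posted further down in this thread; the name `example_data` follows the snippet above.

```julia
# Row-wise literal data: one observation per line, columns aligned by hand.
const example_data = [
    (yyyymmdd = 20121227, xmkt = -0.11, rf = -0.19),
    (yyyymmdd = 20121228, xmkt = -1.00, rf = -0.03),
    (yyyymmdd = 20121231, xmkt =  1.71, rf = -0.04),
    (yyyymmdd = 20130102, xmkt =  2.62, rf =  0.30),
]

# Column access works with plain Base, no packages needed.
xmkt = [row.xmkt for row in example_data]
```

If DataFrames.jl is loaded, `DataFrame(example_data)` should convert this into a data frame, since a vector of named tuples is a Tables.jl-compatible row table.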

Perhaps you can also include an IJulia notebook that visualizes certain features of the data (eg distribution, lag-1 scatter plot).

An __END__ keyword won’t work the way you suggested, since the script is NOT read through stdin. You’ll never be able to read anything. What you are asking for is just a way to embed a string in a script, and I don’t see what’s missing from a normal multi-line string constant, as @quinnj suggests.

Thanks, yuyichao. If the script itself is not going through stdin, then this is moot… unless it is possible to designate the input stream.

Tamas, think about:

julia> using DataFrames, CSV

julia> d= CSV.read( scriptstream )
  yyyymmdd  xmkt    rf
1  20121227 -0.11 -0.19
2  20121228 -1.00 -0.03
3  20121231  1.71 -0.04
4  20130102  2.62  0.30
5  20130103 -0.14  0.11
6  20130104  0.55  0.48
7  20130107 -0.31 -0.42
8  20130108 -0.26  0.02
9  20130109  0.34 -0.40
10 20130110  0.66  0.45
11 20130111  0.02 -0.38
12 20130114 -0.06  0.00
__END__

julia> ## do many different types of data analysis.

The """ construct is a reasonable alternative: not as nice, but close enough. Listing the data on long lines, as in DataFrame( yyyymmdd= ..., ...), is also feasible, but again not as nice for this purpose.

Still, are you talking about typing in data at the REPL or embedding data in a script? You always talk about a script and then give the REPL as an example.

What I don’t understand is what’s not nice about it, or really, what you are looking for. AFAICT it’s just the difference between

d = read()
<data>
__END__

and

str = """
<data>
""" # __END__
d = read(IOBuffer(str))

That’s roughly the same number of lines. You get your __END__ in the comment if you want. Syntax highlighting already works. read comes after the data, but if you really want, you can create a string macro and do

data = csv"""
<data>
"""

if it hasn’t been done already. (Should be as simple as macro csv_str(str) :(read(IOBuffer($(esc(str))))) end)
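
Spelled out, such a string macro might look like the sketch below. It assumes CSV.jl and DataFrames.jl are installed and that the CSV.jl version takes a sink argument. The macro name `csv_str` follows the suggestion above; none of this is an existing feature of CSV.jl.

```julia
using CSV, DataFrames   # assumed to be installed

# Hypothetical string macro: csv"..." parses its literal text as CSV.
macro csv_str(str)
    # `str` is the literal String; CSV and DataFrame resolve in this module.
    :(CSV.read(IOBuffer($str), DataFrame))
end

data = csv"""
yyyymmdd,days,rate
19960102,9,5.763067
19960102,15,5.745902
"""
```

With this, the data sits directly after the assignment and the closing `"""` plays the role of the end marker.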

This still won’t work. You’ll still need to convince the parser to ignore your data, which is almost certainly invalid code.

On reflection, I think you are right.