Help with string mapping from input


#1

Hello

I’m trying to import a custom text file that isn’t well aligned and contains a mixture of data types.

The input data looks something like this:

inputtext = ["ABC=1,2,3,cats,dogs","DEF=1,apples,oranges,4,5,6.78,bananas","XYZ=Julia,R,Python,9.9999"]

Currently, my code is very manual and brute-force, and I was hoping there is a better way to build these tables, perhaps with some sort of mapping function:

using DataFrames

ABC = DataFrame(a=Int64[],b=Int64[],c=Int64[],animal=String[],put=String[])
DEF = DataFrame(a=Int64[],fruit=String[],lunch=String[],b=Int64[],c=Int64[],d=Float64[],breakfast=String[])
XYZ = DataFrame(Best=String[],OK=String[],Yuck=String[],a=Float64[])

for i in inputtext
	data = split(split(i,"=")[2],",")
	if startswith(i,"ABC")
		push!(ABC, [parse(Int64,data[1]),parse(Int64,data[2]),parse(Int64,data[3]),data[4],data[5]])
	elseif startswith(i,"DEF")
		push!(DEF, [parse(Int64,data[1]),data[2],data[3],parse(Int64,data[4]),parse(Int64,data[5]),parse(Float64,data[6]),data[7]])
	elseif startswith(i,"XYZ")
		push!(XYZ, [data[1],data[2],data[3],parse(Float64,data[4])])
	end
end

Can someone suggest a more elegant way to do this?


#2

You could define something crude like this:

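# Numeric fields are parsed from the string.
parsefield(T::Type{<:Real}, str::AbstractString) = parse(T, str)
# String fields are returned as-is when the type already matches...
parsefield(::Type{T}, str::T) where T <: AbstractString = str
# ...and converted otherwise (e.g. SubString to String).
parsefield(::Type{T}, str::AbstractString) where T <: AbstractString = T(str)

# One entry per record key, mapping column names to types.
schemas = Dict("ABC" => (a = Int64, b = Int64, c = Int64, animal = String, put = String))

function parsewithschema(row, schemas)
    key, fields = split(row, '=')
    schema = schemas[key]
    # Pair each type in the schema with the corresponding field and parse it.
    NamedTuple{keys(schema)}(map(parsefield, schema, split(fields, ',')))
end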

Then:

julia> parsewithschema(inputtext[1], schemas)
(a = 1, b = 2, c = 3, animal = "cats", put = "dogs")

This would need a bit of refinement (whitespace stripping, validation) for real-life applications, but you get the idea. The result is a NamedTuple per row, and a vector of those is a row table that Tables.jl understands.
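To make the refinements mentioned above concrete, here is a minimal sketch that strips whitespace, validates the key and field count, and groups the parsed rows by key. The schemas for DEF and XYZ are copied from the column names and types in the question; the names SCHEMAS, parserow and parseall are made up for this example and are not part of any library.

```julia
# Field parsers: numbers go through `parse`, strings are converted to T.
# Both strip surrounding whitespace first.
parsefield(::Type{T}, str::AbstractString) where {T<:Real} = parse(T, strip(str))
parsefield(::Type{T}, str::AbstractString) where {T<:AbstractString} = T(strip(str))

# One schema per record key (hypothetical name; columns taken from the question).
const SCHEMAS = Dict(
    "ABC" => (a = Int64, b = Int64, c = Int64, animal = String, put = String),
    "DEF" => (a = Int64, fruit = String, lunch = String, b = Int64, c = Int64,
              d = Float64, breakfast = String),
    "XYZ" => (Best = String, OK = String, Yuck = String, a = Float64),
)

# Parse one "KEY=f1,f2,..." line into (key, NamedTuple), with basic validation.
function parserow(row::AbstractString, schemas)
    parts = split(row, '='; limit = 2)
    length(parts) == 2 || throw(ArgumentError("missing '=' in row: $row"))
    key = String(strip(parts[1]))
    haskey(schemas, key) || throw(ArgumentError("unknown key: $key"))
    schema = schemas[key]
    fields = split(parts[2], ',')
    length(fields) == length(schema) ||
        throw(ArgumentError("expected $(length(schema)) fields for $key, got $(length(fields))"))
    vals = map(parsefield, Tuple(schema), Tuple(fields))
    return key, NamedTuple{keys(schema)}(vals)
end

# Collect the parsed rows into one vector of NamedTuples per key.
function parseall(rows, schemas)
    tables = Dict{String,Vector{NamedTuple}}()
    for row in rows
        key, nt = parserow(row, schemas)
        push!(get!(tables, key, NamedTuple[]), nt)
    end
    return tables
end

inputtext = ["ABC=1,2,3,cats,dogs",
             "DEF=1,apples,oranges,4,5,6.78,bananas",
             "XYZ=Julia,R,Python,9.9999"]

tables = parseall(inputtext, SCHEMAS)
tables["ABC"][1]  # (a = 1, b = 2, c = 3, animal = "cats", put = "dogs")
```

Assuming DataFrames.jl is loaded, something like DataFrame(map(identity, tables["ABC"])) should then turn each group into a table; the map(identity, ...) step only narrows the vector's element type to the concrete NamedTuple type.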

Alternatively, leverage the row-reading functionality from one of the CSV reading libraries.

EDIT: Try using Parsers.jl for the fields, as described here.