Request for comments: filenames with metadata

While managing some bills with a toy framework I’m developing in Julia, I’ve found it useful to embed some metadata in the filename. I wanted to share an example here for comments. Does anyone know of a library, in any language, that implements this idiomatic “fields separated by underscores” file naming strategy? I’ve used it several times but have always written a one-off parser for it.

I wonder if it’s worth making a package. If one existed, what should it be named? MetadataFilenames.jl? FilenameDataFields.jl? I’d appreciate any collaboration! I’m sure other people do this too…

using Dates

testdata = [
	# natural gas bills
	"semco2019-03-25to04-17_thm62p068_usd57p93.pdf",  "semco2019-04-17to05-17_thm52p650_usd104p61.pdf",
	"semco2019-05-17to06-18_thm25p344_usd29p48.pdf",  "semco2019-06-18to07-18_thm9p477_usd19p05.pdf",
	"semco2019-07-18to08-16_thm6p318_usd17p21.pdf",   "semco2019-08-16to09-17_thm6p288_usd17p18.pdf",
	"semco2019-09-17to10-18_thm23p034_usd26p93.pdf",  "semco2019-10-18to11-18_thm89p505_usd63p76.pdf",
	"semco2019-11-18to12-18_thm107p814_usd72p91.pdf", "semco2019-12-18to2020-01-17_thm124p726_usd79p72.pdf",
	"semco2020-01-17to02-18_thm141p504_usd87p84.pdf", "semco2020-02-18to03-18_thm104p445_usd66p87.pdf",
	"semco2020-03-18to04-16_thm83p661_usd60p27.pdf",  "semco2020-04-16to05-19_thm69p630_usd53p38.pdf",
	"semco2020-05-19to06-18_thm14p686_usd21p32.pdf",  "semco2020-06-18to07-20_thm6p312_usd16p42.pdf",
	"semco2020-07-20to08-18_thm17p901_usd23p19.pdf",  "semco2020-08-18to09-17_thm16p848_usd22p81.pdf",
	"semco2020-09-17to10-19_thm32p767_usd32p84.pdf",
]

p2f(p::AbstractString, t=Float64) = parse(t, replace(p, "p" => "."))

for x in testdata
	d = Dict{Symbol,Any}(:filename => x)
	name, d[:extension] = splitext(x)

	# terms (separated by underscores) have the following forms:
	# * foobar                       - flag, stores :foobar => true
	# * nofoobar                     - flag, stores :foobar => false
	# * foobar2000p0                 - numerical, stores :foobar => 2000.0
	# * foobar-2001p0                - numerical, stores :foobar => -2001.0
	# * foobar2000p1to-2001          - numerical range, stores :foobar => [2000.1, -2001.0]
	# * foobar2000-01-01to02         - date range, stores :foobar => [Date(2000, 1, 1), Date(2000, 1, 2)]
	# * foobar2000-01-01to02-01      - date range, stores :foobar => [Date(2000, 1, 1), Date(2000, 2, 1)]
	# * foobar2000-01-01to2001-01-01 - date range, stores :foobar => [Date(2000, 1, 1), Date(2001, 1, 1)]
	for term in split(name, "_")

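		# flag: a purely alphabetic term; a leading "no" negates it
		# (ambiguous for keys that themselves start with "no", e.g. "notes")
		m = match(r"^(no)?([A-Za-z]+)$", term)
		if !isnothing(m)
			negated, key = m.captures
			d[Symbol(key)] = isnothing(negated)
			continue
		end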

		# date
		m = match(r"^([A-Za-z]+)(\d{4}-\d{2}-\d{2})$", term)
		if !isnothing(m)
			key, val = m.captures
			d[Symbol(key)] = Date(val, "yyyy-mm-dd")
			continue
		end

		# date range
		m = match(r"^([A-Za-z]+)(\d{4}-\d{2}-\d{2})to((?:(?:\d{4}-)?\d{2}-)?\d{2})$", term)
		if !isnothing(m)
			key, val_start, val_end = m.captures
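			# the end date may omit leading fields; borrow them from the start date
			# (commas build a Vector, as documented above, rather than a 1×2 Matrix)
			d[Symbol(key)] = Date.([val_start, val_start[1:end-length(val_end)] * val_end], "yyyy-mm-dd")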
			continue
		end

		# numerical
		m = match(r"^([A-Za-z]+)(-?\d+(?:p\d+)?)$", term)  # fractional part after "p" is optional
		if !isnothing(m)
			key, val = m.captures
			d[Symbol(key)] = p2f(val)
			continue
		end

		# numerical range
		# fractional part is optional so that e.g. foobar2000p1to-2001 matches
		m = match(r"^([A-Za-z]+)(-?\d+(?:p\d+)?)to(-?\d+(?:p\d+)?)$", term)
		if !isnothing(m)
			key, val_start, val_end = m.captures
			d[Symbol(key)] = p2f.([val_start, val_end])
			continue
		end

		# anything unrecognised is collected, in filename order, under :unknown
		push!(get!(d, :unknown, String[]), String(term))
	end

	@show d
end

# note - perhaps add a helper to allow a Parameters.jl struct to be populated
# using fields from the filename?
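
For what it’s worth, here is a rough sketch of what that helper might look like. It is only an illustration under a few assumptions: `GasBill` and `fromdict` are hypothetical names, `Base.@kwdef` stands in for Parameters.jl’s `@with_kw` so the snippet needs no extra packages, and the dict is built by hand in the shape the parser above produces (with date ranges stored as two-element vectors).

```julia
using Dates

# Hypothetical record type for the gas-bill filenames above.
# Base.@kwdef is used here in place of Parameters.jl's @with_kw.
Base.@kwdef struct GasBill
	semco::Vector{Date}    # billing period, [start, end]
	thm::Float64           # therms used
	usd::Float64           # amount billed
	filename::String = ""  # original filename, if known
end

# Hypothetical helper: build a T from the entries of d whose keys are
# field names of T. Other keys (e.g. :extension, :unknown) are ignored.
function fromdict(::Type{T}, d::AbstractDict{Symbol}) where {T}
	kwargs = filter(p -> first(p) in fieldnames(T), d)
	return T(; kwargs...)
end

# A dict shaped like the parser output for the first test filename.
d = Dict{Symbol,Any}(
	:filename  => "semco2019-03-25to04-17_thm62p068_usd57p93.pdf",
	:extension => ".pdf",
	:semco     => [Date(2019, 3, 25), Date(2019, 4, 17)],
	:thm       => 62.068,
	:usd       => 57.93,
)

bill = fromdict(GasBill, d)
@show bill.semco bill.thm bill.usd
```

With this approach a filename that lacks a required field (say, no `thm` term) fails loudly with an `UndefKeywordError` instead of silently producing a half-filled record, which seems like the right default for bills.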

DrWatson.jl has some capabilities like this.
