Worst-case example on how NOT to publish your data

sylvaticus · September 15, 2024, 9:01am

I am a bit angry at the company managing the Electricity Grid of France (RTE).
Here you can download, inter alias, the statistics on electricity production by hour and source, arranged in yearly files.

What is wrong ?

You need to register. Thsi is just boring, but very common unfortuatly.
The file extention of the file inside the zip (“.xls”) is wrong. It isn’t a (old format) Excel file, it is a tabled separated text file (with multiple tabs sometimes)
The layout of the data is funny. Instead of a single table with day/production and then the by source as columns, there are 365 different tables (see this thread)
The structure of the table is different on the different days (sic!). Some days for example don’t have “Offshore wind farms”
The encoding instead of being utf is ISO-8859-1 (in 2023!) (use the StringEncodings.jl package)
Their “Total” computed column doesn’t consider (propagate) the presence of “missing” data, and so it’s useless

I get the feeling that these guys only published their data because they HAD to do it by some sort of legal requirement, but made no effort to present the data in a useful, straightforward way…

Marcelo_Simas · September 15, 2024, 10:58pm

At least they are publishing the data…

cjdoris · September 16, 2024, 4:17pm

Now you get to write and publish a nice Julia package which parses it into a nice form for follow-on analysis.

sylvaticus · September 17, 2024, 7:47am

Here is my script… by the way, do grid authorities in other countries publish their historical data of {production by source , consumption by sector} too ?

Script

using Pkg
Pkg.activate(@__DIR__)
using Statistics, CSV, DataFrames, StringEncodings, StatsPlots, Dates, ZipFile, Pipe

### Script parameters...
data_folder   = @__DIR__
base_filename = "RealisationDonneesProduction_"
years         = 2015:2023
allnames      = ["year","day","hour","Heures","Biomasse","Gaz","Charbon","Fioul","Hydraulique STEP","Hydraulique fil de l'eau / éclusée","Hydraulique lacs","Nucléaire","Solaire","Éolien terrestre","Éolien en mer","Déchets","Total"]

### Data container (start empty)...
df = DataFrame()

### Main loop to fill the data container...
for year in years

    println("** processing year $year ... ")
    zipfile = joinpath(data_folder,"$(base_filename)$(year).zip")

    data = read(ZipFile.Reader(zipfile).files[1],String,enc"ISO-8859-1")
    ndays = isleapyear(Date(year))  ? 366 : 365

    dheaders = [d+2 for d in (0:ndays-1).*27]
    #dheaders = [2,29]
    dfs = [CSV.read(IOBuffer(data), DataFrame; header=dheader, limit=24, delim="\t", missingstring="*", ntasks=1, buffer_in_memory=true, ignorerepeated=true) for dheader in dheaders]

    insertcols!.(dfs, 1, :hour => 0:23)
    [insertcols!(dfs[d], 1, :day => d)  for d in 1:365]
    insertcols!.(dfs, 1, :year => year)

    for (i,df) in enumerate(dfs)
        for n in allnames
            if !(n in names(df))
                insertcols!(df, 1, Symbol(n) => 0)
            end
        end
        dfs[i] = df[:,allnames] 
    end
    df = vcat(df,dfs...)
end

# Some further aggregate columns...
fossil_fuel      = ["Gaz","Charbon","Fioul"]
ren_controllable = ["Biomasse","Hydraulique STEP","Hydraulique lacs","Déchets"]
ren_variable     = ["Hydraulique fil de l'eau / éclusée","Solaire","Éolien terrestre","Éolien en mer"]

df.fossil_fuel      .= sum(df[:,c] for c in fossil_fuel ) 
df.ren_controllable .= sum(df[:,c] for c in ren_controllable)
df.ren_variable     .= sum(df[:,c] for c in ren_variable)

# Some checks
balance = df.Total .- (df.Nucléaire .+ df.fossil_fuel .+ df.ren_controllable .+ df.ren_variable)
minimum(skipmissing(balance))
maximum(skipmissing(balance))

# Some plotting...
# We work on 2023 for the plot
winterweek = df[in.(df.day,Ref(10:16)) .&& df.year .== 2023,:]
plot(hcat(winterweek.Total,winterweek.fossil_fuel,winterweek.ren_controllable,winterweek.ren_variable,winterweek.Nucléaire), labels=["Total" "FF" "RC" "RV" "Nuclear"], ylabel="MW", title="Winter week")
#savefig("winterweek.svg")
summerweek = df[in.(df.day,Ref(190:196)) .&& df.year .== 2023,:]
plot(hcat(summerweek.Total,summerweek.fossil_fuel,summerweek.ren_controllable,summerweek.ren_variable,summerweek.Nucléaire), labels=["Total" "FF" "RC" "RV" "Nuclear"], ylabel="MW", title="Summer week")
#savefig("summerweek.svg")

# Sum by sources, within year
avgyear = combine(groupby(df,["day"])) do subdf # slower
    (total = mean(subdf.Total), n = mean(subdf.Nucléaire), ff = mean(subdf.fossil_fuel), rc = mean(subdf.ren_controllable), rv = mean(subdf.ren_variable))
end

plot(hcat(avgyear.total,avgyear.ff,avgyear.rc,avgyear.rv,avgyear.n), labels=["Total" "FF" "RC" "RV" "Nuclear"], ylabel="MW", title="Seasonal variations in electricity production",legend=:top, xlabel="Day of the year")
#savefig("avgyear.svg")

# Sum by sources, within year
trend = combine(groupby(df,["year"])) do subdf # slower
    (total = mean(skipmissing(subdf.Total)), n = mean(skipmissing(subdf.Nucléaire)), ff = mean(skipmissing(subdf.fossil_fuel)), rc = mean(skipmissing(subdf.ren_controllable)), rv = mean(skipmissing(subdf.ren_variable)))
end
plot(trend.year,hcat(trend.total,trend.ff,trend.rc,trend.rv,trend.n), labels=["Total" "FF" "RC" "RV" "Nuclear"], ylabel="MW", title="Trend by years",legend=:top)
#savefig("trend.svg")

# Some stats.. last year vs first year...
tratio = trend.total[end]/trend.total[1]
nratio = trend.n[end]/trend.n[1]
ffratio = trend.ff[end]/trend.ff[1]
rcratio = trend.rc[end]/trend.rc[1]
rvratio = trend.rv[end]/trend.rv[1]

tdiff = 100*(trend.total[end]-trend.total[1])/trend.total[1]
ndiff = 100*(trend.n[end]-trend.n[1])/trend.n[1]
ffdiff = 100*(trend.ff[end]-trend.ff[1])/trend.ff[1]
rcdiff = 100*(trend.rc[end]-trend.rc[1])/trend.rc[1]
rvdiff = 100*(trend.rv[end]-trend.rv[1])/trend.rv[1]

fgerick · September 17, 2024, 9:09am

The German Federal Network Agency (including data for Luxembourg and Austria) does: SMARD | SMARD - Market data and I believe they do a better job at data accessibility. No registration, and you can choose the time span, resolution and data format.

gcv · September 17, 2024, 1:52pm

Dirty data are the norm in every industry and sector. I’ve dealt with US power data in the past (that data is available from independent system operators, ISOs), and it was a mess. Each ISO did (probably still does) something different to publish it. Wait til you discover that your loading code always breaks twice a year because of daylight saving time. Which the raw data encodes differently every single time, even from the same source.

mthelm85 · September 17, 2024, 3:31pm

That reminds me of this:

It’s a great watch for any programmer that hasn’t seen it before.

Tamas_Papp · September 17, 2024, 3:53pm

I sympathize, but as an economist I have seen panel data with much worse properties.

Fortunately, the Julia package ecosystem makes it relatively easy to ingest data in weird and irregular formats.

Boris · September 18, 2024, 4:38pm

I have downloaded exactly this data years ago. You didn’t need to register, it was in xls files and it was horribly organized back then. I literally deleted it about six months ago, otherwise I would have shared it to see how it was back then.

Btw, data from the German statistical office is also horrible arranged in a table where one year is split on four rows. The months are in columns in reverse order, three of them in each row…

Column 1	Column 2	Column 3	Column 4
2024	dec	nov	oct
merged cell to eff up your loading	sep	…
merged cell to eff up your loading

I have no idea who came up with this time series format but its exceptional.

Eben60 · September 19, 2024, 9:52am

pjuergens · September 23, 2024, 7:24am

Consider downloading the data from Energy-Charts France. There are tons of energy related data for all of Europe. You can change the country you get the data for in the top right corner. They also provide an API.

If there is any interface missing in the API or if there are any inconsistencies or missing data, don’t hesitate to contact them. They really make an effort to collect the data from various public sources and check for data errors, etc.

sylvaticus · September 23, 2024, 7:47am

That’s super cool, thanks…

Topic		Replies	Views
16000 cases lost in UK Covid reporting Offtopic csv , excel	16	1023	October 15, 2020
[ANN] Write Excel files with XLSX.jl v0.2.0 Community package , xlsx	40	6531	June 2, 2025
[ANN] PowerFlowData.jl: a parser for PSS/E-format `.raw` power flow data files Package Announcements package , announcement	9	1601	November 19, 2021
Question about my data General Usage question	23	596	August 4, 2020
[ANN] PowerSystems.jl Data structures in Julia to enable power systems analysis Package Announcements	0	699	November 24, 2020

Worst-case example on how NOT to publish your data

Related topics