Worst-case example on how NOT to publish your data

I am a bit angry at the company managing the Electricity Grid of France (RTE).
Here you can download, inter alias, the statistics on electricity production by hour and source, arranged in yearly files.

What is wrong ?

  1. You need to register. Thsi is just boring, but very common unfortuatly.
  2. The file extention of the file inside the zip (“.xls”) is wrong. It isn’t a (old format) Excel file, it is a tabled separated text file (with multiple tabs sometimes)
  3. The layout of the data is funny. Instead of a single table with day/production and then the by source as columns, there are 365 different tables (see this thread)
  4. The structure of the table is different on the different days (sic!). Some days for example don’t have “Offshore wind farms”
  5. The encoding instead of being utf is ISO-8859-1 (in 2023!) (use the StringEncodings.jl package)
  6. Their “Total” computed column doesn’t consider (propagate) the presence of “missing” data, and so it’s useless

I get the feeling that these guys only published their data because they HAD to do it by some sort of legal requirement, but made no effort to present the data in a useful, straightforward way…

30 Likes

At least they are publishing the data…

4 Likes

Now you get to write and publish a nice Julia package which parses it into a nice form for follow-on analysis.

5 Likes

Here is my script… by the way, do grid authorities in other countries publish their historical data of {production by source , consumption by sector} too ?

Script
using Pkg
Pkg.activate(@__DIR__)
using Statistics, CSV, DataFrames, StringEncodings, StatsPlots, Dates, ZipFile, Pipe

### Script parameters...
data_folder   = @__DIR__
base_filename = "RealisationDonneesProduction_"
years         = 2015:2023
allnames      = ["year","day","hour","Heures","Biomasse","Gaz","Charbon","Fioul","Hydraulique STEP","Hydraulique fil de l'eau / éclusée","Hydraulique lacs","Nucléaire","Solaire","Éolien terrestre","Éolien en mer","Déchets","Total"]

### Data container (start empty)...
df = DataFrame()

### Main loop to fill the data container...
for year in years

    println("** processing year $year ... ")
    zipfile = joinpath(data_folder,"$(base_filename)$(year).zip")

    data = read(ZipFile.Reader(zipfile).files[1],String,enc"ISO-8859-1")
    ndays = isleapyear(Date(year))  ? 366 : 365

    dheaders = [d+2 for d in (0:ndays-1).*27]
    #dheaders = [2,29]
    dfs = [CSV.read(IOBuffer(data), DataFrame; header=dheader, limit=24, delim="\t", missingstring="*", ntasks=1, buffer_in_memory=true, ignorerepeated=true) for dheader in dheaders]

    insertcols!.(dfs, 1, :hour => 0:23)
    [insertcols!(dfs[d], 1, :day => d)  for d in 1:365]
    insertcols!.(dfs, 1, :year => year)

    for (i,df) in enumerate(dfs)
        for n in allnames
            if !(n in names(df))
                insertcols!(df, 1, Symbol(n) => 0)
            end
        end
        dfs[i] = df[:,allnames] 
    end
    df = vcat(df,dfs...)
end

# Some further aggregate columns...
fossil_fuel      = ["Gaz","Charbon","Fioul"]
ren_controllable = ["Biomasse","Hydraulique STEP","Hydraulique lacs","Déchets"]
ren_variable     = ["Hydraulique fil de l'eau / éclusée","Solaire","Éolien terrestre","Éolien en mer"]

df.fossil_fuel      .= sum(df[:,c] for c in fossil_fuel ) 
df.ren_controllable .= sum(df[:,c] for c in ren_controllable)
df.ren_variable     .= sum(df[:,c] for c in ren_variable)

# Some checks
balance = df.Total .- (df.Nucléaire .+ df.fossil_fuel .+ df.ren_controllable .+ df.ren_variable)
minimum(skipmissing(balance))
maximum(skipmissing(balance))

# Some plotting...
# We work on 2023 for the plot
winterweek = df[in.(df.day,Ref(10:16)) .&& df.year .== 2023,:]
plot(hcat(winterweek.Total,winterweek.fossil_fuel,winterweek.ren_controllable,winterweek.ren_variable,winterweek.Nucléaire), labels=["Total" "FF" "RC" "RV" "Nuclear"], ylabel="MW", title="Winter week")
#savefig("winterweek.svg")
summerweek = df[in.(df.day,Ref(190:196)) .&& df.year .== 2023,:]
plot(hcat(summerweek.Total,summerweek.fossil_fuel,summerweek.ren_controllable,summerweek.ren_variable,summerweek.Nucléaire), labels=["Total" "FF" "RC" "RV" "Nuclear"], ylabel="MW", title="Summer week")
#savefig("summerweek.svg")

# Sum by sources, within year
avgyear = combine(groupby(df,["day"])) do subdf # slower
    (total = mean(subdf.Total), n = mean(subdf.Nucléaire), ff = mean(subdf.fossil_fuel), rc = mean(subdf.ren_controllable), rv = mean(subdf.ren_variable))
end

plot(hcat(avgyear.total,avgyear.ff,avgyear.rc,avgyear.rv,avgyear.n), labels=["Total" "FF" "RC" "RV" "Nuclear"], ylabel="MW", title="Seasonal variations in electricity production",legend=:top, xlabel="Day of the year")
#savefig("avgyear.svg")

# Sum by sources, within year
trend = combine(groupby(df,["year"])) do subdf # slower
    (total = mean(skipmissing(subdf.Total)), n = mean(skipmissing(subdf.Nucléaire)), ff = mean(skipmissing(subdf.fossil_fuel)), rc = mean(skipmissing(subdf.ren_controllable)), rv = mean(skipmissing(subdf.ren_variable)))
end
plot(trend.year,hcat(trend.total,trend.ff,trend.rc,trend.rv,trend.n), labels=["Total" "FF" "RC" "RV" "Nuclear"], ylabel="MW", title="Trend by years",legend=:top)
#savefig("trend.svg")

# Some stats.. last year vs first year...
tratio = trend.total[end]/trend.total[1]
nratio = trend.n[end]/trend.n[1]
ffratio = trend.ff[end]/trend.ff[1]
rcratio = trend.rc[end]/trend.rc[1]
rvratio = trend.rv[end]/trend.rv[1]

tdiff = 100*(trend.total[end]-trend.total[1])/trend.total[1]
ndiff = 100*(trend.n[end]-trend.n[1])/trend.n[1]
ffdiff = 100*(trend.ff[end]-trend.ff[1])/trend.ff[1]
rcdiff = 100*(trend.rc[end]-trend.rc[1])/trend.rc[1]
rvdiff = 100*(trend.rv[end]-trend.rv[1])/trend.rv[1]
2 Likes

The German Federal Network Agency (including data for Luxembourg and Austria) does: SMARD | SMARD - Market data and I believe they do a better job at data accessibility. No registration, and you can choose the time span, resolution and data format.

3 Likes

Dirty data are the norm in every industry and sector. I’ve dealt with US power data in the past (that data is available from independent system operators, ISOs), and it was a mess. Each ISO did (probably still does) something different to publish it. Wait til you discover that your loading code always breaks twice a year because of daylight saving time. Which the raw data encodes differently every single time, even from the same source.

4 Likes

That reminds me of this:

It’s a great watch for any programmer that hasn’t seen it before.

9 Likes

I sympathize, but as an economist I have seen panel data with much worse properties.

Fortunately, the Julia package ecosystem makes it relatively easy to ingest data in weird and irregular formats.

2 Likes

I have downloaded exactly this data years ago. You didn’t need to register, it was in xls files and it was horribly organized back then. I literally deleted it about six months ago, otherwise I would have shared it to see how it was back then.

Btw, data from the German statistical office is also horrible arranged in a table where one year is split on four rows. The months are in columns in reverse order, three of them in each row…

Column 1 Column 2 Column 3 Column 4
2024 dec nov oct
merged cell to eff up your loading sep
merged cell to eff up your loading

I have no idea who came up with this time series format but its exceptional.

3 Likes
6 Likes

Consider downloading the data from Energy-Charts France. There are tons of energy related data for all of Europe. You can change the country you get the data for in the top right corner. They also provide an API.

If there is any interface missing in the API or if there are any inconsistencies or missing data, don’t hesitate to contact them. They really make an effort to collect the data from various public sources and check for data errors, etc.

3 Likes

That’s super cool, thanks…