My crude attempt at the SIR model

Warning! I am nobody with no Medical qualification
I literally pulled the parameter values out of thin air
Results here are NOT to be used for any serious endeavor

S = Array{Float64}(undef,1000)
I = Array{Float64}(undef,1000)
R = Array{Float64}(undef,1000)

β = 1.0/3.317
γ = 1.0/22.0
detect_rate = 0.2
death_rate = 0.02

S_zero = 1.4E9
I_zero = 40
R_zero = 0.0
N = S_zero + I_zero

ΔS = -β * (S_zero * I_zero / N)
ΔI = β * (S_zero * I_zero / N) - γ * I_zero
ΔR = γ * I_zero

S[1] = S_zero + ΔS
I[1] = I_zero + ΔI
R[1] = R_zero + ΔR

for t=2:29
    ΔS = -β * (S[t-1] * I[t-1] / N)
    ΔI = β * (S[t-1] * I[t-1] / N) - γ * I[t-1]
    ΔR = γ * I[t-1]
    S[t] = S[t-1] + ΔS
    I[t] = I[t-1] + ΔI
    R[t] = R[t-1] + ΔR

daterange = 1:29
DEAD = death_rate .* R[daterange]
DETECT = detect_rate .* I[daterange] + DEAD[daterange]

for t in daterange
    println("$t Jan 2020  DETECT=",round(DETECT[t]),"  DEAD=",round(DEAD[t]))
println("R0 is ",round(β/γ,sigdigits=3))

The next thing on my mind is how to improve the crude model. I have an idea.

There are five parameters

  1. I_zero
  2. beta
  3. gamma
  4. detect_rate
  5. death_rate

I don’t have a clue what these parameter values should be. But there are only 5 unknowns. Instead I do have the number of cases in China and the number of deaths in China and the dates in which those values are given out.

So I could have a random number generator to generate random values for the five parameters and write a cost function to compare the output vector DETECT and vector DEAD with the values that comes out of China. And see what values of the five parameters produce the lowest cost.

If I have a cost function, then maybe I could even use Simulated Annealing to find the five parameters. Just some ideas on what the next steps could be.

Oh yes, Any ideas on how to get the index for the Vectors to start at 0 instead of 1?

There is

1 Like

You may be interested in e.g. for fitting parameters (the do it for a Lotka-Volterra model, but this is of very similar kind as the SIR model) and for simulation there is also

1 Like

If you are interested, I have a notebook on estimation for a SIR model plus Gaussian noise: Parameter inference for a simple SIR model with the package Bridge


I am sure you guys ae eagerly awaiting the results.
I have decideds to perform a quick scan of the parameter space to get a quick feel of the Cost surface.

Here is the code

function Generate_Detect_Dead(param; enddayval=29)
    S = Array{Float64}(undef,100)
    I = Array{Float64}(undef,100)
    R = Array{Float64}(undef,100)

    β = param[1]
    γ = param[2]
    detect_rate = param[4]
    death_rate = param[5]

    S_zero = 1.4E9
    I_zero = param[3]
    R_zero = 0.0
    N = S_zero + I_zero

    ΔS = -β * (S_zero * I_zero / N)
    ΔI = β * (S_zero * I_zero / N) - γ * I_zero
    ΔR = γ * I_zero

    S[1] = S_zero + ΔS
    I[1] = I_zero + ΔI
    R[1] = R_zero + ΔR

    for t=2:endday
        ΔS = -β * (S[t-1] * I[t-1] / N)
        ΔI = β * (S[t-1] * I[t-1] / N) - γ * I[t-1]
        ΔR = γ * I[t-1]
        S[t] = S[t-1] + ΔS
        I[t] = I[t-1] + ΔI
        R[t] = R[t-1] + ΔR

    daterange = 1:endday
    DEAD = death_rate .* R[daterange]
    DETECT = detect_rate .* I[daterange] + DEAD[daterange]
    return (DETECT,DEAD)

function CostFunc(param,ActualCases,ActualDeaths)
    cost = 0.0
    for k = 1:length(ActualCases)
        (day,case) = ActualCases[k]
        cost += (DETECT[day] - case)^2.0
    for k = 1:length(ActualDeaths)
        (day,death) = ActualDeaths[k]
        cost += day^2.8 * (DEAD[day] - death)^2.0
    return cost

function main()
    # Actual Data
    ActualCases = [ (16,45),(17,62),(18,121),(19,198),
    (26,2744),(27,4515),(28,5974),(29,7771) ];
    ActualDeaths = [ (19,3),(20,6),(21,9),(22,17),
    (28,132),(29,170) ];

    lowest_cost = Inf
    lowest_param = undef
    for β = range(0.001,0.5,length=40)
        for γ = range(0.001,0.5,length=40)
            for detect_rate = range(0.01,0.99,length=40)
                for death_rate = range(0.01,0.99,length=40)
                    for I_zero = range(1,100,length=40)
                        # Below is the parameter for the
                        parameter = (β, # β
                                     γ,  # γ
                                     I_zero, # I_zero
                                     detect_rate, # detect_rate
                                     death_rate # death_rate
                        cost = CostFunc(parameter,ActualCases,ActualDeaths)
                        if cost < lowest_cost
                            lowest_cost = cost
                            lowest_param = parameter

    println("lowest cost is ",lowest_cost)
    println("lowest param is ",lowest_param)
    println("β is ",lowest_param[1])
    println("γ is ",lowest_param[2])
    println("I_zero is ",lowest_param[3])
    println("detect_rate is ",lowest_param[4])
    println("death_rate is ",lowest_param[5])

    daterange = 1:29
    for t in daterange
        println("$t Jan 2020  DETECT=",round(DETECT[t]),"  DEAD=",round(DEAD[t]))
    println("R0 is ",round(lowest_param[1]/lowest_param[2],sigdigits=3))


Before I talk about the results, Here is what I have learned.

The Cost function must be constructed carefully. Because there are two sets of numbers coming out of China which are (1) num of cases and (2) num of Deaths

I must make sure that the cost of “num of cases” mismatch be roughly equal to the cost of “num of deaths” mismatch. Furthuremore, the mismatch later in the reporting timeline must have more weight than the mismatches earlier in the timeline

function CostFunc(param,ActualCases,ActualDeaths)
    cost = 0.0
    for k = 1:length(ActualCases)
        (day,case) = ActualCases[k]
        cost += (DETECT[day] - case)^2.0
    for k = 1:length(ActualDeaths)
        (day,death) = ActualDeaths[k]
        cost += day^2.8 * (DEAD[day] - death)^2.0
    return cost

And now the results
β is 0.3976
γ is 0.05217
I_zero is 3.5
detect_rate is 0.38
death_rate is 0.060

We have

lowest cost is 3.986469644672301e6
lowest param is (0.39764102564102566, 0.05217948717948718, 3.5384615384615383, 0.3869230769230769, 0.06025641025641026)
β is 0.39764102564102566
γ is 0.05217948717948718
I_zero is 3.5384615384615383
detect_rate is 0.3869230769230769
death_rate is 0.06025641025641026
1 Jan 2020  DETECT=2.0  DEAD=0.0
2 Jan 2020  DETECT=3.0  DEAD=0.0
3 Jan 2020  DETECT=3.0  DEAD=0.0
4 Jan 2020  DETECT=5.0  DEAD=0.0
5 Jan 2020  DETECT=6.0  DEAD=0.0
6 Jan 2020  DETECT=8.0  DEAD=0.0
7 Jan 2020  DETECT=11.0  DEAD=0.0
8 Jan 2020  DETECT=15.0  DEAD=0.0
9 Jan 2020  DETECT=20.0  DEAD=0.0
10 Jan 2020  DETECT=27.0  DEAD=1.0
11 Jan 2020  DETECT=37.0  DEAD=1.0
12 Jan 2020  DETECT=49.0  DEAD=1.0
13 Jan 2020  DETECT=66.0  DEAD=1.0
14 Jan 2020  DETECT=89.0  DEAD=2.0
15 Jan 2020  DETECT=120.0  DEAD=3.0
16 Jan 2020  DETECT=162.0  DEAD=4.0
17 Jan 2020  DETECT=217.0  DEAD=5.0
18 Jan 2020  DETECT=293.0  DEAD=7.0
19 Jan 2020  DETECT=394.0  DEAD=9.0
20 Jan 2020  DETECT=530.0  DEAD=12.0
21 Jan 2020  DETECT=713.0  DEAD=16.0
22 Jan 2020  DETECT=959.0  DEAD=22.0
23 Jan 2020  DETECT=1290.0  DEAD=30.0
24 Jan 2020  DETECT=1736.0  DEAD=40.0
25 Jan 2020  DETECT=2335.0  DEAD=54.0
26 Jan 2020  DETECT=3142.0  DEAD=72.0
27 Jan 2020  DETECT=4227.0  DEAD=97.0
28 Jan 2020  DETECT=5688.0  DEAD=131.0
29 Jan 2020  DETECT=7652.0  DEAD=176.0
R0 is 7.62

Do you have the numbers as csv or similar?

Here you go
cat chinacase.txt


cat chinadeath.txt


You can also get the numbers from this webpage

1 Like

I think you are on the right way, if I round my code over the data I get

beta = (0.33, 0.39)
gamma = (0.046, 0.050)

(credible intervals, I estimate the intrinsic noisiness of the data but disregard the reporting issue.) I’ll put some code online over the days, fitting a SIR model seems of general interest.

Here is a paper trying to model the spread of the coronavirus

To predict the future outbreak profile, we developed a model based on the deterministic Susceptible-Exposed-Infectious-Recovered-Death-Cumulative (SEIRDC) structure.

R0 = 4.08
CFR/deathrate = 6.50%


Would be interested in general updates on this.

You wanted an update on this. I am afraid I have some bad news. Presumedly you are interested in the novel 2019Coronavirus outbreak.

The sad news is that, to put it diplomatically, the numbers out of China is unreliable. It is not possible to model the outbreak with unreliable numbers.

For one, the number of cases only refer to cases where the patient tested positive on the test kit. A lot of suspected coronavirus victims, do not get tested by doctors because the hospital literally ran out of beds. These are sent home. So we don’t know what the true number of cases are. We don’t even know how many are sent home without testing. We don’t even know how many even don’t get to consult a doctor at the hospital because there isn’t enough doctors.

Even if they consulted a doctor, they may not be tested with a test kit because they ran out of test kit. We don’t know how many was not tested due to lack of test kit.

Because of the system in China, anyone not tested with a test kit Is not consider a coronavirus victim. The true number of victims/cases are much higher than the official reported cases.

All this makes modelling next to impossible, you just can’t compare your model to reality to refine your model.

It is the same with the number of dead coronavirus victims. The official number only tells you how many dead victims has tested positive on the test kit before they die. If a true coronavirus victim died before they are tested, the doctors are not going to waste a test kit on a corpse when the living are screaming for test kits.
China, could tell the world about how many corpses they are cremating each day and how the numbers compare to the same time last year but they don’t. So we can’t even estimate how many suspected dead are due to the coronavirus. So you see there are various ways we can do analysis but reliable numbers are not forth coming out of China.

The best bet is to model the spread of coronavirus outside of China because reliable data is much more trustworthy. But here the numbers don’t grow as fast as in Hubei, China. It’s very weird, Either outside of China has very very good health system or the virus inside Hubei is very different from the virus outside China or everyone reported their illiness as soon as possible outside China.


Another important fact is that in China there were only 57 deaths of flu or some small number like that, because on the death certificate they put pneumonia and other pre-existing conditions as the the cause of death instead of the flu. That’s another additional reason the virus statistics are not accurate.

Also, recall in Wuhan they had a shared food banquet of 100,000 people a few days prior to the official outbreak, which didn’t happen in the rest of the world. This likely caused a greater outbreak rate there.


The Devil is in the details and in Chinese Communist Party.

It’s crap like this that we had to deal with when trying to model coronavirus…

China changes counting scheme to lower Wuhan virus numbers

TAIPEI (Taiwan News) — The daily reports of Wuhan virus infections in China will
likely begin to drop as the government has decided to stop counting patients who test
positive for the disease but do not exhibit symptoms as “confirmed cases.”

In a notice issued by China’s National Health Commission (NHC) on Feb. 6, it wrote
that the classification of new Wuhan virus infections will be divided into four categories:
“suspected cases,” “clinically diagnosed cases,” “confirmed cases,” and “positive tests.”
Among these, “positive tests” refers to “asymptomatic infected patients” who test
positive for the disease but have no symptoms.

There is also a clear stipulation in the official document stating that “If the reported
‘asymptomatic infected patient’ has clinical manifestations, their status shall be revised
to ‘confirmed case’ in a timely manner” (highlighted text in Tweet below). This indicates
that even if a person tests positive for the disease but does not exhibit any symptoms,
they will no longer be included in the daily infection reports.

Heilongjiang reduced 14 confirmed cases? Screening for asymptomatic infections? Poke here

February 09, 2020 20:13:28 Source: Xinhuanet


According to Zhao Yuhui, a second-level investigator of the Heilongjiang Provincial Health and Health Committee’s Medical Administration and Medical Administration Division, on the 8th, 14 patients were deducted from Heilongjiang Province. Edition) ", the positive cases of nucleic acid testing are divided into confirmed cases and asymptomatic infection, rather than collectively referred to as confirmed cases.

In Heilongjiang Province, there were 13 cases with no positive symptoms in the past, but the positive cases were counted as asymptomatic infection by the National Health and Medical Commission on the 8th, and no more confirmed cases were counted.

Researchers in various fields have been routinely modeling processes with missing/censored/truncated data for decades now. There are standard techniques to handle data missing not at random etc. Sure, it is more difficult than having fully reliable data, but not at all impossible.

If you are in the process of building up your modelling skills, perhaps starting with a less time sensitive and important research question could make sense.

Also, it is my impression that epidemiologists working on this question have access to more disaggregated data.


I had similar thoughts about the “impossibility”. I think parametrising all the factors which @StevenSiew mentioned and estimating them (in worst case just rough guesses) could also be used to predict and also adopt to the new numbers. It’s clearly much more complicated than working with clean data, but where on Earth do we have clean data :wink:

1 Like

This is how the professionals do it

The Novel Coronavirus, 2019-nCoV, is Highly Contagious and More Infectious Than Initially Estimated

Steven Sanche, Yen Ting Lin, Chonggang Xu, Ethan Romero-Severson, Nick Hengartner, Ruian Ke

SS and RK would like to acknowledge funding from DARPA (HR0011938513). CX acknowledges the support from the Laboratory Directed Research and Development (LDRD) Program at Los Alamos National Laboratory. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


The novel coronavirus (2019-nCoV) is a recently emerged human pathogen that has spread widely since January 2020. Initially, the basic reproductive number, R0, was estimated to be 2.2 to 2.7. Here we provide a new estimate of this quantity. We collected extensive individual case reports and estimated key epidemiology parameters, including the incubation period. Integrating these estimates and high-resolution real-time human travel and infection data with mathematical models, we estimated that the number of infected individuals during early epidemic double every 2.4 days, and the R0 value is likely to be between 4.7 and 6.6. We further show that quarantine and contact tracing of symptomatic individuals alone may not be effective and early, strong control measures are needed to stop transmission of the virus.

Paper is avail at

I don’t know what kinda of computing power they have in Los Alamos but I think they have more than the iMac with 8Gb of memory that I have.


For those of us who craves nice clean reliable data to play with SIR model: Singapore


China’s Coronavirus Figures Don’t Add Up. ‘This Never Happens With Real Data.’

In terms of the virus data, the number of cumulative deaths reported is described by a simple mathematical formula to a very high accuracy, according to a quantitative-finance specialist who ran a regression of the data for Barron’s. A near-perfect 99.99% of variance is explained by the equation, this person said.

Put in an investing context, that variance, or so-called r-squared value, would mean that an investor could predict tomorrow’s stock price with almost perfect accuracy. In this case, the high r-squared means there is essentially zero unexpected variability in reported cases day after day.

Barron’s re-created the regression analysis of total deaths caused by the virus, which first emerged in the central Chinese city of Wuhan at the end of last year, and found similarly high variance. We ran it by Melody Goodman, associate professor of biostatistics at New York University’s School of Global Public Health.

“I have never in my years seen an r-squared of 0.99,” Goodman says. “As a statistician, it makes me question the data.”