Date: May 2022 | Marks available: 1 | Reference code: 22M.3.AHL.TZ1.1
Level: Additional Higher Level | Paper: Paper 3 | Time zone: Time zone 1
Command term: Write down | Question number: 1 | Adapted from: N/A

Question

This question is about modelling the spread of a computer virus to predict the number of computers in a city which will be infected by the virus.


A systems analyst defines the following variables in a model:

The following data were collected:

A model for the early stage of the spread of the computer virus suggests that

$Q'(t)=\beta N Q(t)$

where N is the total number of computers in a city and β is a measure of how easily the virus is spreading between computers. Both N and β are assumed to be constant.

The data above are taken from city X which is estimated to have 2.6 million computers.
The analyst looks at data for another city, Y. These data indicate a value of $\beta=9.64\times10^{-8}$.

An estimate for $Q'(t)$, $t\geq 5$, can be found by using the formula:

$Q'(t)\approx\frac{Q(t+5)-Q(t-5)}{10}$.

The following table shows estimates of Q'(t) for city X at different values of t.

An improved model for Q(t), which is valid for large values of t, is the logistic differential equation

$Q'(t)=kQ(t)\left(1-\frac{Q(t)}{L}\right)$

where k and L are constants.

Based on this differential equation, the graph of $\frac{Q'(t)}{Q(t)}$ against $Q(t)$ is predicted to be a straight line.

Find the equation of the regression line of Q(t) on t.

[2]
a.i.

Write down the value of r, Pearson’s product-moment correlation coefficient.

[1]
a.ii.

Explain why it would not be appropriate to conduct a hypothesis test on the value of r found in (a)(ii).

[1]
a.iii.

Find the general solution of the differential equation $Q'(t)=\beta N Q(t)$.

[4]
b.i.

Using the data in the table write down the equation for an appropriate non-linear regression model.

[2]
b.ii.

Write down the value of $R^2$ for this model.

[1]
b.iii.

Hence comment on the suitability of the model from (b)(ii) in comparison with the linear model found in part (a).

[2]
b.iv.

By considering large values of t write down one criticism of the model found in (b)(ii).

[1]
b.v.

Use your answer from part (b)(ii) to estimate the time taken for the number of infected computers to double.

[2]
c.

Find in which city, X or Y, the computer virus is spreading more easily. Justify your answer using your results from part (b).

[3]
d.

Determine the value of a and of b. Give your answers correct to one decimal place.

[2]
e.

Use linear regression to estimate the value of k and of L.

[5]
f.i.

The solution to the differential equation is given by

$Q(t)=\frac{L}{1+Ce^{-kt}}$

where C is a constant.

Using your answer to part (f)(i), estimate the percentage of computers in city X that are expected to have been infected by the virus over a long period of time.

[2]
f.ii.

Markscheme

$Q(t)=3090t-54000$  $(3094.27t-54042.3)$         A1A1


Note: Award at most A1A0 if answer is not an equation. Award A1A0 for an answer including either x or y.

 

[2 marks]

a.i.

0.755  (0.754741)         A1

 

[1 mark]

a.ii.

t is not a random variable OR it is not a (bivariate) normal distribution

OR data is not a sample from a population

OR data appears nonlinear

OR r only measures linear correlation         R1

 

Note: Do not accept “r is not large enough”.

 

[1 mark]

a.iii.

attempt to separate variables            (M1)

$\int\frac{1}{Q}\,dQ=\int\beta N\,dt$

$\ln Q=\beta Nt+c$           A1A1A1

 

Note: Award A1 for LHS, A1 for $\beta Nt$, and A1 for $+c$.

Award full marks for $Q=e^{\beta Nt+c}$  OR  $Q=Ae^{\beta Nt}$.

Award M1A1A1A0 for $Q=e^{\beta Nt}$.

 

[4 marks]

b.i.
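
The general solution in (b)(i) can be sanity-checked numerically. The sketch below is not part of the markscheme: it assumes the city X value of β from part (d), N = 2.6 million, and A = 1.15 as an illustrative constant, and compares a finite-difference derivative of $Q=Ae^{\beta Nt}$ with $\beta NQ$.

```python
import math

# Assumed illustrative values: beta and N for city X as quoted in the markscheme,
# A chosen to match the fitted coefficient in (b)(ii). Any A > 0 would do.
BETA = 1.12328e-7
N = 2.6e6
A = 1.15

def Q(t: float) -> float:
    """Candidate general solution Q(t) = A * exp(beta * N * t)."""
    return A * math.exp(BETA * N * t)

def dQ_dt(t: float, h: float = 1e-5) -> float:
    """Central-difference approximation to Q'(t)."""
    return (Q(t + h) - Q(t - h)) / (2 * h)

# Relative gap between the numerical derivative and beta*N*Q(t) at sample times.
residuals = [
    abs(dQ_dt(t) - BETA * N * Q(t)) / (BETA * N * Q(t))
    for t in (0.0, 10.0, 25.0)
]
print(residuals)
```

The residuals should be tiny for any choice of A, which is why the constant of integration must appear in the general solution.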

attempt at exponential regression           (M1)

$Q=1.15e^{0.292t}$  $(Q=1.14864e^{0.292055t})$           A1

OR

attempt at exponential regression           (M1)

$Q=1.15\times1.34^{t}$  $(1.14864\times1.33917^{t})$           A1

 

Note: Condone answers involving y or x. Condone absence of “Q=”. Award M1A0 for an incorrect answer in correct format.

 

[2 marks]

b.ii.

0.999  (0.999431)          A1

 

[1 mark]

b.iii.

comparing something to do with $R^2$ and something to do with $r$        M1

 

Note:   Examples of where the M1 should be awarded:

$R^2>r$
$R>r$
$0.999>0.755$
$0.999>0.755^2\ (=0.570)$
The “correlation coefficient” in the exponential model is larger.
Model B has a larger $R^2$

Examples of where the M1 should not be awarded:

The exponential model shows better correlation (since not clear how it is being measured)
Model 2 has a better fit
Model 2 is more correlated

 

an unambiguous comparison between $R^2$ and $r^2$ or $R$ and $r$ leading to the conclusion that the model in part (b) is more suitable / better          A1

 

Note: Condone candidates claiming that R is the “correlation coefficient” for the non-linear model.

 

[2 marks]

b.iv.

it suggests that there will be more infected computers than the entire population       R1

 

Note: Accept any response that recognizes unlimited growth. 

 

[1 mark]

b.v.
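
The criticism in (b)(v) can be made concrete by asking when the exponential model overtakes the total number of computers. This is a sketch only, assuming the unrounded regression coefficients quoted in the markscheme for (b)(ii) and N = 2.6 million for city X.

```python
import math

A = 1.14864        # coefficient from the exponential regression in (b)(ii)
GROWTH = 0.292055  # exponent coefficient from the same regression
N = 2.6e6          # estimated number of computers in city X

def q_model(t: float) -> float:
    """Exponential model Q(t) = A * exp(GROWTH * t)."""
    return A * math.exp(GROWTH * t)

# Solve A * exp(GROWTH * t) = N for t.
t_exceed = math.log(N / A) / GROWTH
print(round(t_exceed, 1))
```

Under these assumptions the model passes the whole population of computers after roughly 50 days, which is the unlimited growth the markscheme refers to.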

$1.15e^{0.292t}=2.3$  OR  $1.15\times1.34^{t}=2.3$  OR  $t=\frac{\ln 2}{0.292}$  OR using the model to find two specific times with values of $Q(t)$ which double          M1

$t=2.37$  (days)          A1

 

Note: Do not FT from a model which is not exponential. Award M0A0 for an answer of 2.13 which comes from using (10, 20) from the data or any other answer which finds a doubling time from figures given in the table.

 

[2 marks]

c.
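
As a quick check of part (c), the doubling time of an exponential model depends only on its growth rate. The sketch below assumes the unrounded exponent 0.292055 from the markscheme; the variable names are illustrative.

```python
import math

GROWTH = 0.292055  # exponent coefficient from the regression in (b)(ii)

# For Q = A * exp(GROWTH * t), doubling requires exp(GROWTH * t) = 2.
doubling_time = math.log(2) / GROWTH

# Cross-check: the model should grow by a factor of 2 over this interval.
ratio = math.exp(GROWTH * doubling_time)
print(round(doubling_time, 2), ratio)
```

The coefficient A cancels, so the result does not depend on which data point the doubling starts from. This is why reading a doubling time off the table earns no marks.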

an attempt to calculate β for city X          (M1)


$\beta=\frac{0.292055}{2.6\times10^{6}}$  OR  $\beta=\frac{\ln 1.33917}{2.6\times10^{6}}$

$=1.12328\times10^{-7}$          A1

this is larger than $9.64\times10^{-8}$ so the virus spreads more easily in city X         R1

 

Note: It is possible to award M1A0R1.
Condone “so the virus spreads faster in city X” for the final R1.

 

[3 marks]

d.
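
The comparison in part (d) follows from matching the regression exponent to βN in $Q=Ae^{\beta Nt}$. A minimal sketch, assuming the markscheme's regression values for city X and the stated β for city Y:

```python
import math

N_X = 2.6e6          # estimated number of computers in city X
GROWTH_X = 0.292055  # exponent from Q = 1.14864 * exp(0.292055 t)
BETA_Y = 9.64e-8     # value of beta given for city Y

# beta * N equals the exponent, so beta = exponent / N.
beta_x = GROWTH_X / N_X

# Equivalent route from the base form Q = 1.14864 * 1.33917**t.
beta_x_alt = math.log(1.33917) / N_X

more_easily = "X" if beta_x > BETA_Y else "Y"
print(beta_x, beta_x_alt, more_easily)
```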

a=38.3, b=3086.1          A1A1

 

Note: Award A1A0 if values are correct but not to 1 dp.

 

[2 marks]

e.
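
Part (e) applies the estimate $Q'(t)\approx\frac{Q(t+5)-Q(t-5)}{10}$ to entries of the data table. The table is not reproduced here, so the sketch below uses a made-up function `toy` purely to show the calculation. It is not the city X data, and `estimate_rate` is a hypothetical helper name.

```python
from typing import Callable

def estimate_rate(Q: Callable[[float], float], t: float, step: float = 5.0) -> float:
    """Central-difference estimate of Q'(t) using values at t - step and t + step."""
    if t < step:
        raise ValueError("the estimate needs a value at t - step, so t must be >= step")
    return (Q(t + step) - Q(t - step)) / (2 * step)

def toy(t: float) -> float:
    """Illustrative function only; stands in for tabulated Q(t) values."""
    return t ** 2

# For toy(t) = t**2 the central difference equals the exact derivative 2t.
estimate = estimate_rate(toy, 20.0)
print(estimate)
```

With the real table, `Q` would be replaced by a lookup of the tabulated values at t − 5 and t + 5.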

$\frac{Q'}{Q}=0.42228-2.5561\times10^{-6}Q$          (A1)(A1)


Note:
Award A1 for each coefficient seen – not necessarily in the equation. Do not penalize seeing in the context of y and x.


identifying that the constant is $k$ OR that the gradient is $-\frac{k}{L}$          (M1)

therefore $k=0.422$   $(0.42228)$          A1

$\frac{k}{L}=2.5561\times10^{-6}$

$L=165000$   $(165205)$          A1


Note:
Accept a value of L of 164843 from use of 3 sf value of k, or any other value from plausible pre-rounding.
Allow follow-through within the question part, from the equation of their line to the final two A1 marks.

 

[5 marks]

f.i.
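
Dividing the logistic equation by $Q$ gives $\frac{Q'}{Q}=k-\frac{k}{L}Q$, so the intercept of the regression line is $k$ and the gradient is $-\frac{k}{L}$. A minimal sketch of the final step in (f)(i), assuming the line coefficients quoted in the markscheme:

```python
# Coefficients of the regression line Q'/Q = INTERCEPT + GRADIENT * Q,
# as quoted in the markscheme.
INTERCEPT = 0.42228
GRADIENT = -2.5561e-6

# Matching with Q'/Q = k - (k/L) * Q:
k = INTERCEPT
L = -k / GRADIENT

print(k, round(L))
```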

recognizing that their L is the eventual number of infected        (M1)

$\frac{165205}{2600000}=6.35\%$    $(6.35403\%)$          A1


Note:
Accept any final answer consistent with their answer to part (f)(i) unless their L is less than 120146 in which case award at most M1A0.

 

[2 marks]

f.ii.
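
The long-run percentage in (f)(ii) relies on $Q(t)=\frac{L}{1+Ce^{-kt}}$ approaching $L$ as $t$ grows. The sketch below assumes the markscheme values of $k$ and $L$. The constant `C` is not given in the question, so the value used here is hypothetical and only serves to illustrate the limit.

```python
import math

L = 165205      # carrying capacity from (f)(i)
K = 0.42228     # growth constant from (f)(i)
N = 2.6e6       # estimated number of computers in city X
C = 1000.0      # hypothetical constant, chosen only for illustration

def logistic(t: float) -> float:
    """Logistic solution Q(t) = L / (1 + C * exp(-K * t))."""
    return L / (1 + C * math.exp(-K * t))

# As t grows, C * exp(-K * t) tends to 0, so Q(t) tends to L whatever C is.
long_run_percentage = 100 * L / N
print(logistic(200), round(long_run_percentage, 2))
```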

Examiners report

A significant minority were unable to attempt 1(a) which suggests poor preparation for the use of the GDC in this statistics-heavy course. Large numbers of candidates appeared to use y, x, Q and t interchangeably. Accurate use of notation is an important skill which needs to be developed.

1(a)(iii) was a question at the heart of the Applications and interpretations course. In modern statistics many of the calculations are done by a computer so the skill of the modern statistician lies in knowing which tests are appropriate and how to interpret the results. Very few candidates seemed familiar with the assumptions required for the use of the standard test on the correlation coefficient. Indeed, many candidates answered this by claiming that the value was either too large or too small to do a hypothesis test, indicating a major misunderstanding of the purpose of hypothesis tests.

1(b)(i) was done very poorly. It seems that perhaps adding parameters to the equation confused many candidates – if the equation had been Q'(t)=5Q(t) many more would have successfully attempted this. However, the presence of parameters is a fundamental part of mathematical modelling so candidates should practise working with expressions involving them.

1(b)(ii) and (iii) were done relatively well, with many candidates using the data to recognize an exponential model was a good idea. Part (iv) was often communicated poorly. Many candidates might have done the right thing in their heads but just writing that the correlation was better did not show which figures were being compared. Many candidates who did write down the numbers made it clear that they were comparing an $R^2$ value with an $r$ value.

1(c) was not meant to be such a hard question. There is a standard formula for half-life which candidates were expected to adapt. However, large numbers of candidates conflated the data and the model, finding the time for one of the data points (which did not lie on the model curve) to double. Candidates also thought that the value of t found was equivalent to the doubling time, often giving answers of around 40 days which should have been obviously wrong.

1(d) was quite tough. Several candidates realized that β was the required quantity to be compared but very few could calculate β for city X using the given information.

1(e) was meant to be relatively straightforward but many candidates were unable to interpret the notation given to do the quite straightforward calculation.

1(f) was meant to be a more unusual problem-solving question getting candidates to think about ways of linearizing a non-linear problem. This proved too much for nearly all candidates.


Syllabus sections

Topic 4—Statistics and probability » SL 4.4—Pearsons, scatter diagrams, eqn of y on x