Stats - Large Data Sets
Pearson Edexcel Mathematics 2022
The large data set and its sources
What data does Pearson’s large data set contain?
Information from weather stations.
When did the weather stations in Pearson’s large data set record data?
- May-Oct 1987
- May-Oct 2015
Measured variables and their units
To what accuracy are the units for total sunshine?
The nearest tenth of an hour.
What are the _ _ two _ _ different units for daily mean windspeed?
$\text{knots}/\text{kn}$ or the beaufort scale.
What does 100% represent in terms of humidity?
The maximum amount of water saturation in the air.
How is mean visibility measured?
The amount that can be seen into the horizon during daylight.
Wind direction and cardinal directions
What are the two ways wind direction is measured?
- A cardinal direction
- A bearing
What two data points about wind direction are recorded (not units)?
- Max gust direction
- Average wind direction
Weather station locations
Pascals, sampling and inference cautions
What shouldn’t you say as a reason why a simple random sample might not give the desired number of samples?
Don’t say it’s because there might be repeats.
If a small sample of large data set seems to agree with a hypothesis, should you jump to the conclusion that it’s good evidence for that hypothesis being true?
No, the sample size might be too small.
Mnemonic and listing the stations
What is mnemonic to remember the weather stations in the large data set?
Look! LLarge HHadron Collider, Bussin’ Job Piranha!
Using the mnemonic, can you list all weather stations in the large data set?
- Leuchars
- Leeming
- Heathrow
- Hurn
- Camborne
- Beijing
- Jacksonville
- Perth
Modelling variables with a normal distribution
What is a bad reason for stating you can’t model a variable in the large data set?
Saying that it is discrete.
Why can’t you model the wind speed (using the Beaufort scale) with a normal distribution for the large data set?
Because it is non-numeric (not because it is discrete!!)
Why can’t you model the daily rainfall with a normal distribution for the large data set?
Because the distribution is not bell shaped.
What’s another reason apart from values being non-numeric that a variable can’t be modelled using a normal distribution?
The actual distribution that describes it might not be symmetric.