Handling and describing data forum

Unit 1: issue of handling data and it helped me by refreshing the central limit theorem
by Paul Alexander - Thursday, 31 May 2007, 2:56 AM
 

Good day all, I hope everyone is doing great! It is hot here in Toronto, near 35, with a smog alert today. I wanted to share a bit as I work through unit 1:

This issue of 'random selection' is fascinating for the power it has in evening out, smoothing, or dampening any outrageous scores. Giving every member an equal chance of being selected and included is probably the most powerful tool in statistics for being able to generalize results to a parent population.

Also, as a refresher to me, as I had to go back to this issue, it remains fascinating how numbers can behave: the central limit theorem states that for large enough samples, the distribution of the sample mean approximates a normal curve, amazingly, regardless of the shape of the distribution from which it is sampled. The larger the sample size (n), the better the approximation to the normal. As I read this today, I came across a good visual example on this web site of what happens to the sampling distribution:

http://www.statisticalengineering.com/central_limit_theorem.htm
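
For anyone who wants to see it happen, here is a quick Python sketch of my own (a toy example, assuming numpy and scipy are installed; it is not taken from the site above). It draws samples from a heavily skewed exponential distribution and shows the skewness of the sample means shrinking towards zero, the value for a normal curve, as n grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

for n in (2, 5, 30, 100):
    # 10,000 sample means, each computed from a sample of size n drawn
    # from a heavily right-skewed exponential distribution
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # A normal distribution has skewness 0; watch the skewness shrink as n grows
    print(f"n = {n:3d}: skewness of the sample means = {stats.skew(means):+.3f}")
```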

I am hoping to spur some debate on this '30' as a rule of thumb, for it is both fascinating and still confusing at times, that is, what happens as the sample size approaches or exceeds 30.

Re: Unit 1: issue of handling data and it helped me by refreshing the central limit theorem
by Marialena Trivella - Friday, 1 August 2008, 3:10 PM
 

Thank you for an excellent contribution, Paul. This is exactly what an online course is all about.

Although both concepts, the sample of '30' and the central limit theorem, will come up again later in the course, I wonder what thoughts everyone has on them.

Any contributions?

Marialena

Re: Unit 1: issue of handling data and it helped me by refreshing the central limit theorem
by Tom O'Connell - Sunday, 10 June 2007, 6:16 PM
 

Hello Paul,

Sorry for the tardy response; I am trying to catch up. Is this related to the desired power of a test? That is, how powerful is the test in detecting a difference (effect), if a difference (effect) is present?

If I am right, the sample size is chosen to reduce the type I error (rejecting the initial hypothesis that there is no difference or effect), as opposed to the type II error of accepting that there is no difference when one really does exist. So do sample sizes need to exceed 30 if the distribution is really wide (flattened)? I am not sure about this.
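
To make my own question concrete, here is a rough simulation sketch (a toy example of my own, assuming scipy is available; it is not from the course materials). It estimates the type I error rate and the power of a two-sample t-test at different sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, alpha, effect = 5_000, 0.05, 0.5   # effect = true mean difference

for n in (10, 30, 100):
    # Type I error rate: both groups come from the SAME distribution,
    # so every rejection is a false positive
    false_pos = sum(
        stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
        for _ in range(n_trials)
    )
    # Power (1 - type II error rate): the groups really differ by 'effect'
    true_pos = sum(
        stats.ttest_ind(rng.normal(0, 1, n), rng.normal(effect, 1, n)).pvalue < alpha
        for _ in range(n_trials)
    )
    print(f"n = {n:3d}: type I rate ~ {false_pos / n_trials:.3f}, "
          f"power ~ {true_pos / n_trials:.3f}")
```

At least in this toy setup, the type I rate stays pinned near 0.05 whatever n is, while it is the power that climbs with n.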

Also, I think the rule of 30 requires the distribution to have a real mean, e.g. not to be bi-modal or multi-modal. Since the central limit theorem is based on the distribution of the means, the means have to be valid estimators of the central tendency.
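
One could probe this empirically with something like the following sketch (again a toy example of my own): draw from an obviously two-peaked mixture and test whether the resulting sample means look normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def bimodal(size):
    # 50/50 mixture of two well-separated normals: clearly two-peaked
    peaks = rng.choice([-3.0, 3.0], size=size)
    return rng.normal(loc=peaks, scale=0.5)

for n in (5, 30, 100):
    # Distribution of 10,000 sample means, each from a sample of size n
    means = bimodal((10_000, n)).mean(axis=1)
    # D'Agostino-Pearson normality test on the means (small p = non-normal)
    print(f"n = {n:3d}: normality-test p-value = {stats.normaltest(means).pvalue:.3f}")
```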

Another area where I get mixed up is how other properties of the underlying distribution affect the rule of 30. If there is some non-random component to the residuals, I would think you need a larger sample, so that the test's power is above 90% or whatever cutoff is needed. I am thinking of instances where there is autocorrelation or other issues with the residuals. Thus, if the residuals are not normally distributed, the results of a regression analysis can be biased, even if sample sizes are large (e.g. n > 100).
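
And as a concrete (made-up) illustration of the non-random residuals I mean, a small numpy-only sketch that compares independent residuals with autocorrelated AR(1) ones via the lag-1 autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def lag1_autocorr(x):
    # Correlation between each residual and the one before it
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

iid = rng.normal(size=n)          # well-behaved, independent residuals

ar = np.empty(n)                  # AR(1) residuals with phi = 0.8:
ar[0] = rng.normal()              # each one drags along 80% of the last
for t in range(1, n):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

print(f"lag-1 autocorrelation, independent residuals: {lag1_autocorr(iid):+.3f}")
print(f"lag-1 autocorrelation, AR(1) residuals:       {lag1_autocorr(ar):+.3f}")
```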

Marialena, is this correct? Or am I mixing up the concepts here? ;-)

Best,
Tom

Re: Unit 1: issue of handling data and it helped me by refreshing the central limit theorem
by Marialena Trivella - Monday, 28 July 2008, 3:24 PM
 

Hi All.

I'll touch upon the magic number of '30' in the simplest possible way.

The question of how 'large' is 'large enough' has occupied statisticians for quite a while. There are numerous different situations, and the truth is that one size does not fit all. However, in the simplest of cases (as in a textbook example), when we have 'about' 30 cases the distribution of the sample mean resembles the normal, and hence this number was somewhat arbitrarily taken as a 'cutpoint' for deciding whether to use the normal distribution or another.

It is important to emphasize that this does not apply to all situations. For instance, if we are looking at the number of lymph nodes affected by cancer following surgery in a number of patients, it is very possible to find that the majority have zero nodes affected, and hence the distribution of the node counts will have many zeros. This kind of distribution, whether it is constructed from 30 or 130 patients, is unlikely to be normal, so other methods should be considered.
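
To illustrate with purely hypothetical numbers (a sketch only, not real patient data), one could simulate such zero-heavy node counts and see how non-normal they look at either sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def node_counts(n_patients):
    # Hypothetical model: roughly 70% of patients have zero affected
    # nodes; the rest follow a Poisson distribution with mean 3
    zero_group = rng.random(n_patients) < 0.7
    counts = rng.poisson(lam=3.0, size=n_patients)
    return np.where(zero_group, 0, counts)

for n in (30, 130):
    x = node_counts(n)
    print(f"n = {n:3d}: {np.mean(x == 0):.0%} zeros, skewness = {stats.skew(x):+.2f}")
```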

Each case should be looked at on its merits.

Yes, it is true that the choice of sample size does not just depend on whether we have a large enough sample to apply the normal distribution; it is related to a host of other issues, such as the type and size of error we can tolerate, the variability of the population studied, the level of confidence required, etc.

There is a 'science' behind sample size calculations (we will see some of the issues in later units), and the 'magic number of 30' is only a guide.

I hope this helps

Marialena

Re: Unit 1: issue of handling data and it helped me by refreshing the central limit theorem
by Flavio Monteiro de Oliveira Jr - Sunday, 16 November 2008, 7:55 PM
 
Why 30? Why exactly this number? I didn't understand it.