## Handling and describing data forum

### Sample Randomness

I think whether a sample is representative of the target population is not purely a statistical issue; it needs an informed judgement relevant to the study and the target population. While simple randomization removes bias from the allocation procedure, it does not guarantee, e.g., that the subjects in each group have similar age distributions. Likewise, if you only recruit student volunteers for a study, you can't expect the outcomes to have general applicability.

The question of whether the sample is normally distributed plays a key role in statistical analysis, and there are ways to check it. A simple first check for me is the sample size (a large sample tends to be more representative of the population), followed by checking how close the mean and median are to each other. Drawing a histogram of the frequencies also helps, since it shows the shape of the distribution and outliers reveal themselves.
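As a rough illustration of the mean/median and histogram checks, here is a minimal Python sketch. The helper `describe`, the two simulated samples, and the bin count are all made up for the example; it is an informal look at symmetry, not a formal normality test.

```python
import random
import statistics

def describe(sample, bins=10):
    """Return the mean, the median, and histogram bin counts for a numeric sample."""
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * bins
    for x in sample:
        idx = min(int((x - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return mean, median, counts

rng = random.Random(0)  # fixed seed so the illustration is repeatable
symmetric = [rng.gauss(50, 10) for _ in range(1000)]     # roughly bell-shaped
skewed = [rng.expovariate(1 / 50) for _ in range(1000)]  # long right tail

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    mean, median, counts = describe(data)
    print(f"{name}: mean={mean:.1f} median={median:.1f}")
    for c in counts:
        print("#" * (c // 10))  # crude text histogram, one '#' per 10 observations
```

In the symmetric sample the mean and median should sit close together, while in the skewed sample the long tail pulls the mean away from the median and the histogram is visibly lopsided.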

True random samples, i.e. samples that lack any pattern, are difficult to achieve. Their selection procedure has to be indistinguishable from a sequence of truly random coin flips.

However, there are many ways in which bias may creep into the selection procedure, knowingly or unknowingly. For example, if one knew all the forces acting on a coin during a flip (e.g. gravity, wind resistance, the flipper's finger), the outcome could in principle be predicted. For that observer the procedure ceases to be random, as the two sides of the coin no longer have an equal probability of coming up.

In the same coin flip, though, a person who is unaware of how the external factors are "biasing" the outcome may still assume that the procedure is random.

If you would like to know more about this particular topic and how it affects statistical inference, the book on Multivariate Analysis by K.V. Mardia is a good place to start.

"Statistical randomness does not necessarily imply "true" randomness, i.e., objective unpredictability. Pseudorandomness is sufficient for many uses." (Wikipedia)

Many statistical methods incorporate an "error term" to account for random variation that the model does not explain, although this does not remove systematic bias.

For now, though, you needn't worry about this. In my opinion it would suffice to select samples randomly from a population, without using any knowledge of their distribution or any other property that would enable you to predict the outcome.
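A minimal sketch of that kind of selection in Python, assuming the population can be listed up front. The function name `simple_random_sample` and the subject IDs are hypothetical; `random.sample` draws without replacement and looks only at positions in the list, not at any property of the items.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct items from population, each subset being equally likely."""
    rng = random.Random(seed)  # a seed is optional; it only makes the draw repeatable
    return rng.sample(population, k)

population = list(range(1, 101))  # hypothetical subject IDs 1..100
chosen = simple_random_sample(population, 10, seed=42)
print(sorted(chosen))
```

Passing a seed is a convenience for documenting which subjects were drawn; leaving it as `None` lets Python seed the generator itself.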

Computational random number generators often use the computer's real-time clock as the seed (initial state). The numbers they produce are pseudorandom: the generator is deterministic, so the same seed always gives the same sequence, any bias stems from the hardware and algorithm used, and the pattern of numbers eventually repeats after a very large number of draws.
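To see the role of the seed, here is a small sketch using Python's standard `random` module. The seed value 12345 and the variable names are arbitrary choices for the example.

```python
import random
import time

# Two generators started from the same seed produce the same "random" sequence.
a = random.Random(12345)
b = random.Random(12345)
seq_a = [a.randint(1, 6) for _ in range(10)]
seq_b = [b.randint(1, 6) for _ in range(10)]
print(seq_a == seq_b)  # True

# Seeding from the clock gives a different sequence on each run, but anyone
# who knows the seed can replay the sequence exactly.
clock_seed = time.time_ns()
first = random.Random(clock_seed)
replay = random.Random(clock_seed)
draws_first = [first.random() for _ in range(5)]
draws_replay = [replay.random() for _ in range(5)]
print(draws_first == draws_replay)  # True
```

This is the sense in which the output is pseudorandom: it looks patternless, yet it is fully determined by the initial state.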

Hi Franz,

Why don't we use a computer to select the national lottery numbers? Why do we think that they are random when they come out of a machine that stirs the balls around?

If you have a sufficiently complex system, then even if it is deterministic, its output will be effectively random, because nobody can predict it in practice. This is why we have the balls in the lottery. So we get randomness from a non-random process!

With sampling, there are lots of problems in collecting data and in making sure that your own behaviour is not affecting the sample. We usually collect the easy data first, but the easy data might not be a faithful reflection of the whole sample space.

For example, very few genomes have been sequenced from algae, because they are hard work and not commercially significant. So our genome samples are biased towards medically or commercially important organisms. If this is the case, then you have to be careful about how you use the data and what questions you ask.
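The "easy data first" problem can be shown with a toy simulation in Python. Everything here is invented for illustration: each item in the hypothetical population has a collection effort and a measured value, and harder items are assumed to have larger values on average.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the illustration is repeatable

# Hypothetical population of (effort, value) pairs; value rises with effort.
population = []
for _ in range(10_000):
    effort = rng.random()
    value = 10 * effort + rng.gauss(0, 1)
    population.append((effort, value))

true_mean = statistics.mean(v for _, v in population)

# Convenience sample: the 500 items that take the least effort to collect.
easiest = sorted(population)[:500]  # tuples sort by effort first
convenience_mean = statistics.mean(v for _, v in easiest)

# Simple random sample of the same size.
random_mean = statistics.mean(v for _, v in rng.sample(population, 500))

print(f"population mean:  {true_mean:.2f}")
print(f"convenience mean: {convenience_mean:.2f}")
print(f"random mean:      {random_mean:.2f}")
```

Under these assumptions the convenience sample badly underestimates the population mean, while the random sample of the same size lands close to it.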

Andy