GLOSSARY AND DEFINITIONS

The following is a selection of definitions of statistical terms. Various omissions were identified by participants at a workshop in Cape Town. Some of these have been included. Others will be added in a second edition of the Teaching Resource.

Alternative hypothesis (H_a): The hypothesis that contrasts with the null hypothesis in a significance test.

Attribute: An attribute is a characteristic that an observational unit possesses or a category into which it falls.

Analysis of variance (ANOVA): A frequently used statistical technique that models the variation in the response variable (assumed to be normally distributed) in terms of different sources of variation that are believed to influence the response. It is a technique that separates the total variation in the response into different components, each component representing some source of variation that influences the response.

Arc sine: The transformation sin^-1(√x) applied to a proportion x derived from a binomially distributed variable, used to stabilise the variance and bring the distribution closer to normal.

Arithmetic mean: M = (x₁ + x₂ + ... + x_n) / n where n is the number of observations in the group.

Association: Two variables are said to be associated when there is a correlation between them. Usually neither of the variables takes on the role of the response variable. Both are independent variables.

Assumptions: Certain conditions of the data that are required to be met for a statistical test to be valid, for example, conditions of normality, independence or randomness.

Asymptotic: Refers to a curve that approaches the x or y axis but never actually reaches it, however large x or y becomes. The axis so approached is the asymptote. An example is the normal distribution curve, whose tails approach the x axis. Sometimes one might hear it said that a statistic approaches normality asymptotically.

Bias: The average error that arises when a population quantity is estimated from a sample. An unbiased estimator is one whose expected value equals the population value, so that any error is due to random sampling alone.

Binomial distribution: The binomial distribution gives the probability of obtaining exactly r successes in n independent trials, each with the same probability of success. An example might be the number of animals in a group becoming clinically sick (where `success' takes on the meaning of `clinically sick').
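
As an illustration, the probability of exactly r successes in n independent trials can be computed as in the minimal sketch below. The function name, the herd size of 10 and the success probability of 0.2 are assumptions made for the example and do not come from the glossary.

```python
from math import comb


def binomial_probability(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials,
    each with the same success probability p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Hypothetical example: 10 animals, each with a 0.2 chance of becoming sick.
prob_two_sick = binomial_probability(2, 10, 0.2)

# The probabilities over all possible values of r add up to 1.
total_probability = sum(binomial_probability(r, 10, 0.2) for r in range(11))
```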

Block: A homogeneous grouping of experimental units or subjects in experimental design (also sometimes referred to as Replicates). Responses measured on two units in the same block are likely to be more similar than the responses measured on two units in different blocks.

Blocking: The process of forming homogeneous groupings of experimental units, with the aim of reducing experimental error, and hence improving the precision with which statistical comparisons between treatment means can be made.

Chi-square (χ²): The chi-square statistic compares observed values with predicted (or expected) values computed from a statistical model and hence evaluates the `goodness of fit' of the model.

Chi-squared distribution: A distribution derived from the normal distribution. Chi-square (χ²) with v degrees of freedom has mean = v and variance = 2v.

Chi-squared test: A common test used for testing `goodness-of-fit' of a statistical model. When used for the analysis of a Contingency Table the chi-square value is obtained by summing (observed - expected)²/expected over all cells in the table.
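
The summation over cells can be sketched as follows for a two-way table. This is a minimal illustration that assumes the expected counts are computed under independence of rows and columns (row total x column total / grand total); the function name and the counts are hypothetical.

```python
def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells of a
    two-way table, with expected counts computed under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_sq += (observed - expected) ** 2 / expected
    return chi_sq


# Hypothetical 2x2 table of counts (rows: two groups; columns: sick, healthy).
observed_counts = [[10, 20], [30, 40]]
chi_sq = chi_square_statistic(observed_counts)

# Degrees of freedom for a two-way table: (rows - 1) x (columns - 1).
df = (len(observed_counts) - 1) * (len(observed_counts[0]) - 1)
```

The resulting value would then be compared with a chi-squared distribution on df degrees of freedom to obtain a P value.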

Coefficient of variation: A statistic that measures the spread of the data relative to the mean. It is calculated as 100 x (standard deviation/mean) and is expressed as a percentage.

Completely randomised design: An experiment that is designed with treatments allocated to experimental units at random. In other words no form of blocking is applied. The design is often referred to as CRD.

Confidence interval: The interval within which a population value is expected to lie with a stated level of confidence. It is calculated using estimates collected from a sample, usually the sample mean and its standard error. Confidence intervals are commonly quoted at the 95%, 99% or 99.9% level; the level is the proportion of such intervals, over repeated sampling, that would contain the population value.
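
A minimal sketch of the calculation from a sample mean and its standard error is given below. It assumes a large-sample normal approximation (a multiplier of about 1.96 for 95%); for small samples the multiplier would normally come from the t distribution instead. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_confidence_interval(values, level=0.95):
    """Approximate confidence interval for a population mean:
    sample mean +/- z * standard error (normal approximation)."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)   # stdev uses the n-1 divisor
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    centre = mean(values)
    return centre - z * standard_error, centre + z * standard_error


# Hypothetical sample of eight weights (kg).
weights = [52.1, 49.8, 51.3, 50.6, 48.9, 50.2, 51.7, 49.5]
lower, upper = normal_confidence_interval(weights)
lower_99, upper_99 = normal_confidence_interval(weights, level=0.99)
```

A higher confidence level gives a wider interval, which is why the 99% limits lie outside the 95% limits.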

Contingency table: A multi-way table of frequency counts that is often used in a Chi-squared test to evaluate the independence or otherwise of the factors by which the table is classified.

Contrast: Contrasts are used in analysis of variance to define particular comparisons (usually of a pre-determined nature defined during the design of a study). Each has one degree of freedom and can, for example, represent a comparison between two means, or a linear, quadratic or cubic representation of the shape of the response curve over a series of treatment levels. Significance testing of a contrast is equivalent to a t-test but on a different scale (the F value with one degree of freedom is the square of t).

Continuous variable: A variable that can take on any value within reason (for example within capability of measurement) in a continuous range.

Correlation coefficient (r): This statistic represents the degree of association between two variables. It can take any value between -1 and +1; 0 means no linear association, 1 or -1 that all points lie on a straight line with a positive or negative slope, respectively. The absolute size of r shows the strength of the association.

Counts: Studies often involve data collected in the form of counts, e.g. number of diseased plants in a plot, number of insects on a plant in a given period, etc. Such counts often follow a Poisson distribution and can be analysed in the form of a log-linear model. When a total number is known (e.g. the total number of plants in a plot), the data can follow a binomial distribution and be analysed in the form of a logistic regression model.

Covariance: A measure of the association between two variables. It is calculated as the average of the products of the deviations of the two variables from their respective means. This forms the numerator in the calculation of a correlation coefficient.
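
The link between covariance and the correlation coefficient can be sketched as below. The sketch assumes the usual sample versions with an n-1 divisor; the function names and the paired data are hypothetical.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance: sum of products of deviations from the means,
    divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Correlation coefficient r: the covariance divided by the product
    of the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
cov_xy = covariance(x, y)
r = correlation(x, y)
```

Note that covariance(x, x) is simply the variance of x, so r always lies between -1 and +1.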

Covariate: Generally used to describe an independent variable in the context of experimental design. It is a variable collected during a study that is not used for blocking, is thought to have a possible linear relationship with the response variable, and is unaffected by treatment.

Degrees of freedom (df): The degrees of freedom reflect the number of independent pieces of information available to estimate variability. For example, the standard deviation of n observations x₁, x₂, ..., x_n has (n-1) degrees of freedom. The reason for subtracting 1 is that the standard deviation is based on the sum of squares of the deviations (x_i - x̄), i = 1 ... n, where x̄ is the sample mean. Only the first (n-1) deviations are independent. The n^th can be calculated from the others because the deviations sum to zero.
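
The loss of one degree of freedom can be seen numerically: once the sample mean is known, the last deviation is fixed by the others. The following is a small sketch with hypothetical observations.

```python
from math import sqrt

observations = [4.0, 7.0, 6.0, 3.0, 5.0]  # hypothetical data
n = len(observations)
sample_mean = sum(observations) / n        # 5.0

# Deviations from the mean: -1, 2, 1, -2, 0. They always sum to zero.
deviations = [x - sample_mean for x in observations]

# The n-th deviation is therefore determined by the first n - 1.
last_deviation = -sum(deviations[:-1])

# The standard deviation divides the sum of squares by n - 1, not n.
sum_of_squares = sum(d * d for d in deviations)
standard_deviation = sqrt(sum_of_squares / (n - 1))
```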

Dependent variable: The y variable in a statistical model.

Descriptive statistics: Summaries of available data. These can be, for example, numbers of observations, means, ranges, variations, and so on. These summaries can be presented numerically or graphically.

Discrete variable: A variable that can only take selected values. These can be nominal (e.g. gender) or ordinal (e.g. low, medium, high).

Dummy variable: A variable used to represent a non-numeric variable, for example sex or a breed, so that their effects can be represented in a regression model. A set of n breeds would be represented by n-1 dummy (0,1) variables.
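
A minimal sketch of how n categories become n-1 dummy (0,1) variables is shown below. It assumes the first category in sorted order is taken as the reference level (coded as all zeros); the function name and the breed names are hypothetical.

```python
def dummy_code(values, reference=None):
    """Represent a categorical variable with n levels by n - 1 columns
    of 0/1 values. The reference level is coded as all zeros."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first sorted level is the reference
    dummy_levels = [level for level in levels if level != reference]
    rows = [[1 if v == level else 0 for level in dummy_levels] for v in values]
    return dummy_levels, rows


# Hypothetical breed recorded for four animals (three distinct breeds).
breeds = ["Boran", "Zebu", "Ankole", "Boran"]
columns, coded = dummy_code(breeds)
```

Here three breeds give two dummy columns; an animal of the reference breed has 0 in both.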

Error term: The term in a statistical model allowing for extra random variation not accounted for by the parameters in the model itself. The term can also be referred to as the residual term.

Error mean square: Also called the Residual mean square. It is a component of an analysis of variance table that reflects a natural (unexplained) variance in the response variable. The error mean square is used to estimate standard errors for treatment means. More than one error mean square will occur for multi-layer data.

Estimate: An indication of the value of an unknown quantity based on observed data.

Estimation: The process by which a sample is used to predict the value of an unknown quantity in a population.

Estimator: A quantity calculated from the sample data which is used to give information about an unknown quantity in the population. For example, the sample mean is an estimator of the population mean.

Experiment: A study designed to be conducted under controlled conditions. An experiment is generally aimed at comparing the effects of various alternative treatments, one of which is applied to each of a number of experimental units.

Experimental unit: The basic object upon which a study (usually an experiment) is carried out, for example, an animal, a tree, a sample of soil, a household, a patient in a clinical trial. Sampling Unit and Observational Unit are similarly defined, but tend to be applied to surveys and observational studies, respectively.

F distribution: A continuous probability distribution of the ratio of two independent random variables, each having a Chi-squared distribution, divided by their respective degrees of freedom. Its commonest use is to assign P values to mean square ratios in an analysis of variance.

F test: The test based on the F distribution used to test for statistical significance of the ratio of two mean squares (variance estimates).

Factor: A general term for a parameter in a statistical model generally used to describe a treatment with discrete levels. Two sets of treatments (or factors) occurring together give rise to what is known as a factorial design. When this happens each level of each factor is used in combination with each level of the other factor. Thus a 2x2 factorial involves 4 treatments (11,12, 21, 22).
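
The treatment combinations of a factorial arrangement can be enumerated as in the sketch below, which reproduces the 11, 12, 21, 22 labelling of a 2x2 factorial. The variable names are illustrative only.

```python
from itertools import product

# Levels of two hypothetical factors, each at two levels.
factor_a_levels = [1, 2]
factor_b_levels = [1, 2]

# Every level of one factor is combined with every level of the other.
treatments = list(product(factor_a_levels, factor_b_levels))
labels = [f"{a}{b}" for a, b in treatments]
```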

Factorial design: A factorial design describes an experiment that evaluates two or more factors simultaneously.

Fixed effect: A parameter in a model whose levels are fixed in contrast to parameters that are random.

Geometric mean: G = (x₁.x₂...x_n)^(1/n) where n is the sample size. This can also be expressed as antilog((1/n) Σ log x_i), which means the antilog of the mean of the logarithms of each value.
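
The two expressions for the geometric mean give the same result, as the following sketch illustrates. Natural logarithms are assumed here (any base works provided the antilog uses the same base); the function names and data are hypothetical.

```python
from math import exp, log, prod


def geometric_mean_product(values):
    """n-th root of the product of the n values."""
    return prod(values) ** (1 / len(values))


def geometric_mean_logs(values):
    """Antilog of the mean of the logarithms of the values."""
    return exp(sum(log(v) for v in values) / len(values))


# Hypothetical positive values: the product is 64, whose cube root is 4.
values = [2.0, 8.0, 4.0]
g_product = geometric_mean_product(values)
g_logs = geometric_mean_logs(values)
```

The logarithmic form is usually preferred in practice because the product of many values can overflow or underflow.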

General linear model: A linear model that assumes a normally distributed error structure.

Generalised linear model: A linear model that assumes a distribution other than normal for the error structure.

Goodness-of-fit test: A statistical test in which the validity of one hypothesis is tested without specification of an alternative hypothesis. It is usually used to describe how well a model fits the data.

Hierarchical model: A statistical model with different effects nested within each other. Such a model often occurs in genetics (e.g. offspring within dam within sire).

Histogram: A plot showing the distribution of values falling within specified classification intervals.

Independent variable: Independent variables (usually continuous and also referred to as explanatory variables) are used as parameters in a statistical model.

Inference: Information drawn from a sample that is used to draw conclusions about the population from which the sample is taken.

Interaction: If the effect of one factor on a response variable depends on the level of another factor, the two factors are said to interact.

Latin square: An experimental design that controls the variation in two directions (sometimes referred to as rows and columns). Latin square experiments are sometimes used in animal research where cows may be used for rows and periods of lactation for columns. Treatments are applied at random so that each treatment occurs once in each row and column.

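Least squares: The method generally used in fitting a statistical model when the response variable is a continuous variable. It chooses the parameter estimates that minimise the sum of squared differences between the observed and fitted values. The method results in an analysis of variance.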

Linear model: A relationship between a dependent variable and a series of independent variables or parameters for which the relationship is additive (i.e. the terms containing the parameters to be estimated on the right hand side of the equation are separated by + signs).

Linear regression: A linear model that represents the relationship between a dependent variable y and an independent variable x, i.e. y = ax + b, where a is the slope and b is the intercept.
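
As a sketch, the slope a and intercept b of the line y = ax + b can be estimated by least squares as shown below. The function name and the data are hypothetical, and the closed-form expressions used are the standard ones for a single independent variable.

```python
def fit_line(x, y):
    """Least squares estimates of slope a and intercept b in y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx             # slope (regression coefficient)
    b = mean_y - a * mean_x   # intercept: the line passes through the means
    return a, b


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)

# Residuals from a least squares fit with an intercept sum to zero.
residuals = [yi - (a * xi + b) for xi, yi in zip(x, y)]
```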

Logistic regression: A generalised linear model in which the error distribution is assumed to be binomial.

Log-linear model: A generalised linear model in which the error distribution is assumed to be Poisson.

LSD: LSD, standing for Least Significant Difference, is a statistic that allows any comparison between two treatment means to be assessed for statistical significance. If the difference between any two means is greater than the LSD, then that difference is declared to be significant.

Main effect: The effect of one factor alone averaged across the levels of other factors.

Margin: The row or column in a multiway table that represents the sum or average of one factor across all levels of the other factors.

Maximum likelihood: A general method for finding the estimated values of parameters that maximise the probability of obtaining the observed values.

Mean: A form of average. If not otherwise stated it is equivalent to the arithmetic mean.

Mean square: A sum of squares in an analysis of variance divided by its associated degrees of freedom.

Median: The value that divides the frequency distribution in half when the data values are listed in order.

Mixed model: A linear model (general or generalised) that contains both fixed and random effects.

Mode: The observed value that occurs most frequently.

Multiple regression: An extension of linear regression from one independent variable to several independent variables.

Multiway table: A table of counts, percentages or mean values tabulated in two or more dimensions.

Normal distribution: A distribution sometimes referred to as a Gaussian distribution which tends to be the underlying distribution for quantitative biological variables. It has two parameters - a mean and variance.

Null hypothesis (H_0): The hypothesis that there is no effect or no difference, for example between two groups. It is the hypothesis that a significance test sets out to assess, and the researcher states it when planning a study.

Observational Unit: see Experimental Unit.

One-way ANOVA: A comparison of several groups of observations, all of which are independent and subjected to different levels of a single treatment.

Outcome variable: See dependent variable.

Outlier: An extreme observation that is well separated from the remainder of the data.

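P value: The probability, calculated on the assumption that the null hypothesis is true, of obtaining a result at least as extreme as the one observed. A small value (e.g. P<0.05) is taken as evidence against the null hypothesis.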

Parameter: An element that is estimated on the right hand side of a general linear model during the statistical analysis.

Parametric test: A statistical test in which parameters are estimated assuming a basic distribution structure to the data.

Partial correlation: The correlation that remains between two variables after controlling for their association with other variables.

Pearson's correlation coefficient (r): A measure of the strength of the linear association between two quantitative variables.

Poisson distribution: The probability distribution of the number of (rare) occurrences of some random event in an interval of time or space.

Population: The `universe' of things (e.g. animals, plants, people) from which a sample is assumed to be drawn. It is the entire group one is interested in, that one wishes to describe or draw conclusions about.

Power: The power of a hypothesis test is defined as the probability of NOT making a Type II error, that is, the probability of rejecting the null hypothesis when it is indeed false.

Probability: The number of favourable outcomes expressed as a proportion of the total number of equally likely possible outcomes.

Qualitative: Qualitative variables, as distinct from quantitative variables, are not continuous and take categorical values.

Quantitative variable: Generally used to define a variable that is `continuous' and has a numeric value that can be measured.

Random sampling: A method of selecting a sample from a target population so that each unit has the same chance of being selected.

Randomisation: The process by which experimental or sampling units are selected or allocated by chance.

Randomised (complete) block design: An experimental design in which experimental units are put into groups (or blocks), with every treatment appearing in each block and treatments assigned at random to units within each block.

Range: The difference between the biggest and the smallest values in a group.

Regression line: The line (straight in the case of linear regression) fitted to a set of paired data following an analysis of regression.

Regression coefficient: A value (usually representing a slope) estimated in a regression equation.

REML (Restricted Maximum Likelihood): A statistical process that allows parameters for both fixed and random effects to be simultaneously estimated in a general linear model applied to multi-level sets of data.

Repeated measurement: Separate measurements taken over time from the same experimental or sampling unit.

Replication: The repetition in a study of a treatment or other factor.

Residual (error) sum of squares: The sum of squares that remains unexplained after a linear model has been fitted by analysis of variance.

Response variable: see dependent variable.

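R-squared (R²): Measures how well a regression equation fits the data, expressed as the proportion of the variation in the dependent variable that the equation explains. R² is the square of the multiple correlation coefficient. It is sometimes referred to as the coefficient of determination.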

Sample: A group of units collected for study.

Sample size (n): The number of units in a sample.

Sampling unit: See Experimental Unit.

Sensitivity: The proportion of true positives that are correctly identified by a diagnostic test.

Sensitivity analysis: Analysis of data in different ways to see how results depend on methods or assumptions.

Significance test: A statistical test that assesses how compatible the observed data are with a null hypothesis.

Skewness: The lack of symmetry about a central value of a distribution.

Standard deviation: A measure of spread (scatter) of a set of data calculated from the deviations between each individual value and the sample mean. It is the square root of the variance.

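Standard error: The standard deviation of an estimate such as a sample mean. For a mean of n observations it is the standard deviation of the observations divided by the square root of n.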

Standard normal distribution: A normal distribution with mean 0 and variance 1.

Statistic: A number that summarises data.

Stepwise regression: A method used in multiple regression studies that fits each independent variable one by one with the variable that accounts for most of the variation in the dependent variable being added at each step.

Stratified random sample: A sample from a population separated into relatively homogeneous groups (strata), from which subsets of the sample are drawn at random.

Survey: The process of collecting data from a sample drawn from a population with the purpose of making inferences about the population.

Treatment: Used in experimental design to describe an intervention or condition that is applied to a set of units.

t-statistic (t): The difference in sample means divided by the standard error of the difference between the sample means.
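
A minimal sketch of this calculation for two independent samples follows. It assumes the pooled-variance (equal variance) form of the standard error, which is one common choice and is not specified in the definition; the function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(a, b):
    """t = (mean of a - mean of b) / standard error of the difference,
    using a pooled estimate of the common variance."""
    n_a, n_b = len(a), len(b)
    degrees_of_freedom = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * variance(a) + (n_b - 1) * variance(b)
    ) / degrees_of_freedom
    se_difference = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / se_difference, degrees_of_freedom


# Hypothetical responses from two treatment groups of five units each.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]
t_value, df = two_sample_t(group_a, group_b)
```

The t value would then be referred to a t distribution on df degrees of freedom to obtain a P value.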

t-test or Student's t-test: A parametric test for assessing the significance of the difference between two means.

Two-way ANOVA: An analysis of variance that involves two sets of factors or treatments.

Type I error: The error associated with rejecting the null hypothesis when it is true.

Type II error: The error associated with accepting the null hypothesis when it is false.

Variable (Variate): A value or measurement that varies among experimental units.

Variance: A measure of variability of data. It is the square of the standard deviation.

Variance ratio: The ratio obtained by dividing the mean square associated with a parameter in the model by the residual mean square.

Wald test: A test similar to an F-test but based on the method of maximum likelihood and commonly used for assessing the statistical significance of parameters in a mixed model.

Wilcoxon test: A non-parametric significance test analogous to a paired t-test.