Statistical Treatment of Data, Chemistry tutorial

Introduction:

It is well known that errors are variations that generally accompany experiments and influence both the precision and the accuracy of the result.

Statistical treatment of data is necessary in order to make use of the data in the right form. Collecting raw data is only one aspect of an experiment; the data must also be organized in such a way that conclusions can be drawn from it. This is what statistical treatment of data is all about. An important part of it is the handling of errors. All experiments invariably generate errors and noise, and both systematic and random errors need to be taken into consideration.

Standard Deviation:

The need to make repeated measurements in analytical experiments in order to reveal the presence of random errors has already been emphasized. Suppose an experimentalist performed five replicate titrations and obtained the following results.

Table: Titration results

Burette reading (cm3)        First titre   Second titre   Third titre   Fourth titre   Fifth titre
Final reading (cm3)          10.08         10.31          15.19         10.12          20.10
Initial reading (cm3)        0.00          0.20           5.10          0.00           10.00
Volume of acid used (cm3)    10.08         10.11          10.09         10.12          10.10

Two criteria can be used to compare such results: the average value and the degree of spread. The average value used is the arithmetic mean, which is the sum of all the measurements divided by the number of measurements.

Mathematically, the mean, $\bar{x}$, of n measurements is given by:

$\bar{x} = \frac{\sum_{i} x_i}{n}$

The simplest measure of spread is the range, which is the difference between the highest and the lowest value. A more useful measure of spread, which uses all the values, is the standard deviation. The standard deviation, s, of n measurements is given by:

$s = \sqrt{\frac{\sum_{i} (x_i - \bar{x})^2}{n - 1}}$
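As a minimal sketch, the mean, range and standard deviation of the five titre volumes in the table above can be computed directly from these definitions. The variable names are illustrative only, and the cross-check against Python's `statistics.stdev` is added here for reassurance rather than taken from the tutorial.

```python
import math
import statistics

# Volumes of acid used (cm3), taken from the titration table above.
titres = [10.08, 10.11, 10.09, 10.12, 10.10]

n = len(titres)
mean = sum(titres) / n                  # arithmetic mean
spread = max(titres) - min(titres)      # range: highest minus lowest value
# Sample standard deviation, with n - 1 in the denominator.
s = math.sqrt(sum((x - mean) ** 2 for x in titres) / (n - 1))

# Cross-check against the standard library implementation.
assert math.isclose(s, statistics.stdev(titres), rel_tol=1e-9)

print(f"mean  = {mean:.2f} cm3")    # mean  = 10.10 cm3
print(f"range = {spread:.2f} cm3")  # range = 0.04 cm3
print(f"s     = {s:.4f} cm3")       # s     = 0.0158 cm3
```

Note that the range uses only two of the five values, whereas s uses all of them.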

Variance:

The variance is a very useful statistical quantity. It is the square of the standard deviation:

Variance = $s^2$

Coefficient of Variation (CV):

The coefficient of variation (CV), also termed the relative standard deviation (RSD), is given by $100 s/\bar{x}$ (expressed as a percentage) and is a widely used measure of relative spread.
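A short sketch of the variance and the coefficient of variation for the same five titre volumes is given below. It relies on Python's `statistics` module; the variable names are assumptions made for illustration.

```python
import statistics

# Volumes of acid used (cm3), taken from the titration table above.
titres = [10.08, 10.11, 10.09, 10.12, 10.10]

mean = statistics.mean(titres)
s = statistics.stdev(titres)             # sample standard deviation (n - 1)
variance = statistics.variance(titres)   # sample variance, equal to s squared
cv = 100 * s / mean                      # coefficient of variation (RSD), in percent

print(f"variance = {variance:.5f} cm6")
print(f"CV       = {cv:.3f} %")
```

Because the CV is a ratio, it has no units and allows the spread of results of different magnitude to be compared.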

Confidence Limit of the Mean:

For a sample of n measurements, the standard error of the mean (s.e.m.) is:

$\text{s.e.m.} = \sigma/\sqrt{n}$

where σ is the standard deviation of the population.

The confidence interval for the mean is the range of values within which the population mean, μ, is expected to lie with a given probability. The boundaries of this range are termed the confidence limits.


Fig: Sampling distribution of the mean

For the normal distribution, values within one standard deviation of the mean account for about 68% of the set, values within two standard deviations account for about 95%, and values within three standard deviations (that is, the light, medium and dark blue regions together) account for about 99.7%. If the distribution is assumed to be normal, then 95% of the sample means will lie in the range given by $\mu - 1.96(\sigma/\sqrt{n}) < \bar{x} < \mu + 1.96(\sigma/\sqrt{n})$.
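The percentages quoted above can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `coverage` is an assumption introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Fraction of a normal population lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3, 1.96):
    print(f"within {k} standard deviations: {100 * coverage(k):.1f}%")
# within 1 standard deviations: 68.3%
# within 2 standard deviations: 95.4%
# within 3 standard deviations: 99.7%
# within 1.96 standard deviations: 95.0%
```

The last line shows why the factor 1.96, rather than 2, appears in the 95% range for the sample mean.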

The confidence level is the probability that the true mean lies within a certain interval and is frequently expressed as a percentage. The confidence interval for the mean of N measurements can be computed as:

$\text{CI} = \bar{x} \pm \frac{t s}{\sqrt{N}}$

Here, $\bar{x}$ is the sample mean, s is the sample standard deviation and t is the t-statistic, also termed Student's t.

For a single measurement with result x, t is given by:

$t = \frac{x - \mu}{s}$

For N measurements, however, t is given by:

$t = \frac{\bar{x} - \mu}{s/\sqrt{N}}$

The value of t depends on the desired confidence level and on the number of degrees of freedom used in computing the standard deviation. It is found by consulting a t-table at N - 1 degrees of freedom. Values of Student's t are given in the table below:

Table: Values of t for confidence intervals

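As a hedged illustration, the 95% confidence interval for the mean of the five titre volumes from the titration table can be computed as sketched below. The value t = 2.776 (two-sided, 95%, 4 degrees of freedom) is an assumption taken from standard t-tables, and the function name `confidence_interval` is illustrative only.

```python
import math
import statistics

# Volumes of acid used (cm3), taken from the titration table.
titres = [10.08, 10.11, 10.09, 10.12, 10.10]

# Assumed value from standard t-tables: two-sided 95%, N - 1 = 4 degrees of freedom.
T_95_DF4 = 2.776

def confidence_interval(values, t_value):
    """Return (lower, upper) confidence limits: mean -/+ t * s / sqrt(N)."""
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean, estimated with s
    half_width = t_value * sem
    return mean - half_width, mean + half_width

low, high = confidence_interval(titres, T_95_DF4)
print(f"95% CI: {low:.3f} to {high:.3f} cm3")
```

For these data the interval is roughly 10.10 ± 0.02 cm3. A different confidence level or sample size requires a different t value.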

For large samples, the confidence limits of the mean are given by:

$\text{CI} = \bar{x} \pm \frac{z s}{\sqrt{n}}$

Here the value of z depends on the degree of confidence required. Values of z at different confidence levels are given in the table below.

Table: Confidence levels for various values of z

Confidence level, %        z
50                         ±0.67
68                         ±1.00
80                         ±1.28
90                         ±1.64
95                         ±1.96
95.4                       ±2.00
99                         ±2.58
99.7                       ±3.00
99.9                       ±3.29
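The z values can also be obtained from the inverse of the normal cumulative distribution, and then used in the large-sample formula. The sketch below uses `statistics.NormalDist` (Python 3.8 or later). The function names and the summary figures in the example (mean 10.10, s = 0.02, n = 50) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def z_for_confidence(level_percent):
    """Two-sided z value for a confidence level given in percent."""
    tail = (1 - level_percent / 100) / 2       # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

def large_sample_ci(mean, s, n, level_percent=95):
    """Confidence limits mean -/+ z * s / sqrt(n) for a large sample."""
    half_width = z_for_confidence(level_percent) * s / math.sqrt(n)
    return mean - half_width, mean + half_width

for level in (90, 95, 99):
    print(f"{level}%: z = {z_for_confidence(level):.2f}")

# Hypothetical large sample: mean 10.10 cm3, s = 0.02 cm3, n = 50.
low, high = large_sample_ci(10.10, 0.02, 50, level_percent=95)
print(f"95% CI: {low:.4f} to {high:.4f} cm3")
```

Printed tables round z to two decimal places, so small differences from the computed values are to be expected.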

Significance Tests:

A significance test examines whether the difference between two results is significant, or whether it can be accounted for merely by random variation. Some of the tests that are most useful to the analytical chemist are considered below.

Comparison of an experimental mean with a known value:

In every significance test, the truth of a hypothesis known as the null hypothesis, often denoted $H_0$, is tested. The word null is used to indicate that there is no difference between the observed and known values other than that which can be attributed to random variation. Assuming that the null hypothesis is true, statistical theory can be used to calculate the probability that the observed difference between the sample mean, $\bar{x}$, and the true value, μ, arises solely from random errors.

The null hypothesis is generally rejected if the probability of such a difference occurring by chance is less than 1 in 20 (that is, 0.05 or 5%). In other words, the difference is said to be significant at the 5% level. More stringent significance levels such as 1% or 0.1% can be used in order to be more certain that the correct decision is made. To test $H_0$: the population mean is equal to μ, the statistic t is computed as:

$t = \frac{(\bar{x} - \mu)\sqrt{n}}{s}$

Here, $\bar{x}$ is the sample mean, s is the sample standard deviation and n is the sample size. If the computed |t| exceeds the critical value for n - 1 degrees of freedom at the chosen significance level, the null hypothesis is rejected.
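A minimal sketch of this test, applied to the five titre volumes from the titration table, is shown below. The known value of 10.13 cm3 is hypothetical, the critical value 2.776 (two-sided, 5% level, 4 degrees of freedom) is an assumption taken from standard t-tables, and the function name is illustrative.

```python
import math
import statistics

def one_sample_t(values, mu):
    """t = (mean - mu) * sqrt(n) / s for comparing a sample mean with a known value."""
    n = len(values)
    return (statistics.mean(values) - mu) * math.sqrt(n) / statistics.stdev(values)

# Volumes of acid used (cm3), taken from the titration table.
titres = [10.08, 10.11, 10.09, 10.12, 10.10]

KNOWN_VALUE = 10.13  # hypothetical accepted value, for illustration only
T_CRIT = 2.776       # assumed from standard tables: 5% level, two-sided, 4 degrees of freedom

t = one_sample_t(titres, KNOWN_VALUE)
reject_h0 = abs(t) > T_CRIT

print(f"t = {t:.3f}, reject H0 at the 5% level: {reject_h0}")
```

With these assumed inputs |t| is about 4.24, which exceeds the critical value, so the difference from the hypothetical known value would be judged significant at the 5% level.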

Comparison of two experimental means:

The results of a new analytical method may be tested by comparing them with those obtained using a second method. In this case the null hypothesis is that the two methods give the same result, that is, $H_0: \mu_1 = \mu_2$, and we test whether $(\bar{x}_1 - \bar{x}_2)$ differs significantly from zero. A pooled estimate, s, of the standard deviation can be computed provided the two samples have standard deviations that are not significantly different.

To test the null hypothesis, $H_0: \mu_1 = \mu_2$, the statistic t is computed as:

$t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

Here, s is computed from:

$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

and t has $n_1 + n_2 - 2$ degrees of freedom.

The fundamental assumption of this procedure is that the samples are drawn from populations with equal standard deviations.
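The pooled comparison of two means can be sketched as follows. The first data set is the set of titre volumes from the titration table; the second data set is hypothetical and stands in for results from a second method. The critical value 2.306 (two-sided, 5% level, 8 degrees of freedom) is an assumption taken from standard t-tables, and the function name is illustrative.

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """Return (t, degrees of freedom) for comparing two means with a pooled s."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)
    var2 = statistics.variance(sample2)
    # Pooled variance, weighting each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    s = math.sqrt(pooled_var)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    t = diff / (s * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

method_a = [10.08, 10.11, 10.09, 10.12, 10.10]  # titre volumes from the titration table
method_b = [10.13, 10.15, 10.12, 10.16, 10.14]  # hypothetical second method

T_CRIT = 2.306  # assumed from standard tables: 5% level, two-sided, 8 degrees of freedom

t, dof = pooled_t(method_a, method_b)
print(f"t = {t:.2f} with {dof} degrees of freedom")
print("means differ significantly:", abs(t) > T_CRIT)
```

With these assumed data |t| is about 4.0, so the two means would be judged significantly different at the 5% level.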

F-Test:

The F-test is used to compare the standard deviations, that is, the random errors, of two sets of data. To test whether the difference between two sample variances is significant, that is, to test $H_0: \sigma_1^2 = \sigma_2^2$, the statistic F is computed as:

$F = \frac{s_1^2}{s_2^2}$

The numbers of degrees of freedom of the numerator and denominator are $n_1 - 1$ and $n_2 - 1$ respectively. The subscripts 1 and 2 are conventionally allocated so that F is greater than or equal to 1. The test assumes that the populations from which the samples are taken are normal. If the null hypothesis is true, the variance ratio should be close to 1. If the computed value of F exceeds the critical value, then the null hypothesis is rejected.
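A sketch of the variance-ratio calculation is given below. Placing the larger variance in the numerator, so that F is at least 1, is a common convention assumed here. The second data set is hypothetical, and the critical value 9.605 (two-sided, 5% level, 4 and 4 degrees of freedom) is an assumption taken from standard F-tables.

```python
import statistics

def f_statistic(sample1, sample2):
    """Return (F, numerator dof, denominator dof) with the larger variance on top."""
    var1 = statistics.variance(sample1)
    var2 = statistics.variance(sample2)
    n1, n2 = len(sample1), len(sample2)
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1

set_a = [10.08, 10.11, 10.09, 10.12, 10.10]  # titre volumes from the titration table
set_b = [10.05, 10.15, 10.08, 10.14, 10.08]  # hypothetical, more scattered results

F_CRIT = 9.605  # assumed from standard tables: two-sided 5% level, 4 and 4 degrees of freedom

f_value, dof_num, dof_den = f_statistic(set_a, set_b)
print(f"F = {f_value:.2f} with ({dof_num}, {dof_den}) degrees of freedom")
print("variances differ significantly:", f_value > F_CRIT)
```

With these assumed data F is about 7.4, below the critical value, so the difference between the two variances would not be judged significant at the 5% level.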

Outliers:

A situation may arise in which one (or more) of the results appears to differ unreasonably from the others in the set. Such a measurement is termed an outlier. The test for outliers recommended by ISO is Grubbs' test.

To use Grubbs' test for an outlier, the following null hypothesis is tested: all the measurements come from the same population.

G is then computed as:

$G = \frac{|\text{suspect value} - \bar{x}|}{s}$

Here $\bar{x}$ and s are computed with the suspect value included. The fundamental assumption of this test is that the population is normal.
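The calculation of G can be sketched as follows. The data set is hypothetical, and treating the value furthest from the mean as the suspect value is an assumption made for illustration. The computed G would then be compared with a tabulated critical value for the sample size and significance level in use; no critical value is given in the sketch.

```python
import statistics

def grubbs_statistic(values):
    """Return (suspect value, G), with the mean and s computed including the suspect value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    # Assumption: the suspect value is the one furthest from the mean.
    suspect = max(values, key=lambda x: abs(x - mean))
    return suspect, abs(suspect - mean) / s

# Hypothetical replicate results (cm3) with one suspiciously high value.
results = [10.08, 10.11, 10.09, 10.12, 10.30]

suspect, g = grubbs_statistic(results)
print(f"suspect value = {suspect}, G = {g:.3f}")
```

If G exceeds the tabulated critical value, the null hypothesis that all measurements come from the same population is rejected.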

Q-test: Detection of a single outlier:

a) Theory:

In a set of replicate measurements of a chemical or physical quantity, one or more of the values obtained may differ considerably from the majority of the rest. In this case there is always a strong temptation to eliminate those deviant values and not to include them in any subsequent calculation (for example, of the mean value and/or of the standard deviation). This is permitted only if the suspect values can be 'legitimately' characterized as outliers.

Generally, an outlier is defined as an observation that is generated by a different model or a different distribution than the main 'body' of the data. Although this definition implies that an outlier may be found anywhere within the range of observations, it is natural to suspect and examine as possible outliers only the extreme values.

The rejection of suspect observations must be based exclusively on an objective criterion and not on subjective or intuitive grounds. This can be achieved by using statistically sound tests for 'the detection of outliers'.

Dixon's Q-test is the simplest test of this kind. It allows us to examine whether one (and only one) observation from a small set of replicate observations (typically 3 to 10) can be 'legitimately' rejected or not.

The Q-test is based on the statistical distribution of 'subrange ratios' of ordered data samples drawn from the same normal population. Therefore, a normal (Gaussian) distribution of the data is assumed whenever this test is applied. If an outlier is detected and rejected, the Q-test cannot be reapplied to the set of remaining observations.

b) How the Q-test is applied:

The test is very simple and is applied as follows:

i) The N values comprising the set of observations under examination are arranged in ascending order:

$x_1 < x_2 < \dots < x_N$

ii) The experimental Q-value ($Q_{exp}$) is computed. This is a ratio defined as the difference between the suspect value and its nearest neighbour, divided by the range of the values (Q: rejection quotient). Therefore, for testing $x_1$ or $x_N$ (as possible outliers) we use the following $Q_{exp}$ values respectively:

$Q_{exp} = \frac{x_2 - x_1}{x_N - x_1}$     $Q_{exp} = \frac{x_N - x_{N-1}}{x_N - x_1}$

iii) The $Q_{exp}$ value obtained is compared with a critical Q-value ($Q_{crit}$) found in tables. This critical value must correspond to the confidence level (CL) at which we have decided to run the test (generally CL = 95%).

iv) If $Q_{exp} > Q_{crit}$, the suspect value can be characterized as an outlier and it can be rejected. If not, the suspect value must be retained and used in all subsequent calculations.

The null hypothesis associated with the Q-test is as follows: 'There is no significant difference between the suspect value and the rest of the values; any differences must be attributed exclusively to random errors'.

Critical Q values for CL 90%, 95% and 99% and N = 3 to 10 are given in the table below.

c) A general comment on the rejection of outliers:

All data rejection tests should be used judiciously. Some statisticians object to the rejection of data from any small data sample unless it is firmly established that something went wrong during the corresponding measurement. Others recommend the accommodation of outliers rather than their rejection, that is, they recommend including deviant values in all subsequent calculations but with reduced statistical weight (for example, Winsorized methods).

It must also be stressed that the use of the Q-test is increasingly discouraged in favour of other, more robust methods. One such method is the Huber method, which takes into consideration all the data present in the set, and not only three values as in the case of the Q-test.

The test is sometimes stated to be valid only for sample sizes of 3 to 7. If the computed value of Q exceeds the critical value, the suspect value is rejected. Critical values of Q for a two-sided test are given in the table below; the CL 95% column corresponds to P = 0.05.

Table: Critical values of Q

N     Qcrit (CL 90%)     Qcrit (CL 95%)     Qcrit (CL 99%)
3     0.941              0.970              0.994
4     0.765              0.829              0.926
5     0.642              0.710              0.821
6     0.560              0.625              0.740
7     0.507              0.568              0.680
8     0.468              0.526              0.634
9     0.437              0.493              0.598
10    0.412              0.466              0.568
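The Q-test steps described in this section, together with the CL 95% column of the table above, can be sketched as follows. The data set is hypothetical, and the function name `q_test` and the choice to test whichever extreme value has the larger gap to its neighbour are assumptions made for illustration.

```python
# Critical Q values at CL 95%, copied from the table above (N = 3 to 10).
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def q_test(values, q_crit_table=Q_CRIT_95):
    """Return (suspect value, Q_exp, is_outlier) for the more extreme end of the data."""
    ordered = sorted(values)                  # step i: ascending order
    n = len(ordered)
    if n not in q_crit_table:
        raise ValueError("no critical value tabulated for this sample size")
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        raise ValueError("all values are identical; Q is undefined")
    gap_low = ordered[1] - ordered[0]         # gap between x1 and its neighbour
    gap_high = ordered[-1] - ordered[-2]      # gap between xN and its neighbour
    # Step ii: compute Q_exp for whichever extreme value is further from its neighbour.
    if gap_high >= gap_low:
        suspect, q_exp = ordered[-1], gap_high / data_range
    else:
        suspect, q_exp = ordered[0], gap_low / data_range
    # Steps iii and iv: compare with the critical value for this N.
    return suspect, q_exp, q_exp > q_crit_table[n]

# Hypothetical replicate results (cm3) with one suspiciously high value.
results = [10.08, 10.11, 10.09, 10.12, 10.30]

suspect, q_exp, is_outlier = q_test(results)
print(f"suspect = {suspect}, Q_exp = {q_exp:.3f}, outlier at CL 95%: {is_outlier}")
```

For these assumed data Q_exp is about 0.818, which exceeds the critical value of 0.710 for N = 5, so the suspect value would be rejected at CL 95%. As noted in the theory section, the test should not then be reapplied to the remaining four values.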

Analysis of Variance:

Analysis of variance, abbreviated ANOVA, is a very powerful statistical method that can be used to separate and estimate the different causes of variation. ANOVA can also be used in situations where there is more than one source of random variation.
