Data analysis for scientists and engineers

Java is used for data analysis.

1. Introduction

Every experimental natural science, after passing through a naively descriptive phase, turns to a quantitative, measuring study of the phenomena of interest. Besides the design and execution of the experiments, their correct evaluation, in particular the full exploitation of the data obtained, is one of the essential tasks.

2. Probabilities

In this book we are concerned with the analysis of experimental data. We must therefore first be clear about what we mean by an experiment and its result. As in the laboratory, we define an experiment as the precise following of a procedure, at the end of which we obtain a quantity or a set of quantities that represent the result. These quantities are either continuous (temperature, length, current) or discrete (number of particles, a person's birthday, one of three possible colors).
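
To make the distinction concrete, here is a minimal Java sketch (not taken from this text; all names and values are illustrative) representing an experimental result with continuous and discrete quantities:

public class MeasurementExample {
    // Discrete quantity: one of three possible colors.
    enum Color { RED, GREEN, BLUE }

    // Continuous quantities as floating-point numbers,
    // discrete quantities as integers or enumeration values.
    record Result(double temperatureK, double lengthM,
                  int particleCount, Color color) {}

    public static void main(String[] args) {
        Result r = new Result(293.15, 0.042, 7, Color.GREEN);
        System.out.println(r);
    }
}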

3. Random variable. Distributions

We now consider not the probabilities of events, but the events themselves, and attempt a particularly convenient classification or ordering of them. For example, when we toss a coin, we can assign the number 0 to the event “heads” and the number 1 to the event “tails”. In general, a decomposition of the type (2.3.3) can be arranged in such a way that each event A_i is assigned the real number i.
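
As a sketch of this mapping of events to numbers, the following Java fragment (illustrative only, not a program of this text) simulates the coin toss and prints the value of the random variable together with the event it stands for:

import java.util.Random;

public class CoinToss {
    public static void main(String[] args) {
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            int x = rng.nextInt(2);               // value of the random variable: 0 or 1
            String event = (x == 0) ? "heads" : "tails";
            System.out.println("x = " + x + "  (" + event + ")");
        }
    }
}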

4. Computer generated random numbers. The Monte Carlo method

So far in this book we have described the observation of random variables, but not rules for generating them. For many applications, however, it is useful to have available a sequence of values of a randomly distributed variable x. Since operations often have to be carried out with a large number of such random numbers, it is particularly convenient to have them directly available in the computer. The truly correct method for generating such random numbers would be to use a genuinely random statistical process.
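
In practice one instead uses pseudo-random generators computed by the machine itself. A minimal Java sketch of a multiplicative linear congruential generator follows; it is an illustration, not the generator prescribed by this text. The constants are the well-known Park-Miller values, and the seed must not be zero:

public class Lcg {
    private long state;
    private static final long A = 16807;          // multiplier
    private static final long M = 2147483647L;    // modulus 2^31 - 1

    Lcg(long seed) { state = seed; }              // seed must be nonzero

    double next() {                               // uniform in (0, 1)
        state = (A * state) % M;
        return (double) state / M;
    }

    public static void main(String[] args) {
        Lcg gen = new Lcg(12345);
        for (int i = 0; i < 5; i++) System.out.println(gen.next());
    }
}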

5. Various important distributions and theorems

Some particular distributions will now be discussed in detail. You can think of this section as a collection of examples. In fact, all of these distributions are of great practical importance, and you will encounter them again in many applications. In addition, some important theorems will emerge incidentally in the discussion of these distributions.

6. Samples

In the previous chapter we became acquainted with a number of distributions, but we did not explain how such distributions are realized in individual cases. We only gave the probability that a random variable lies in a certain interval with the limits x and x + dx. This probability still depends on parameters which are characteristic of the distribution (such as λ in the case of the Poisson distribution) and which are generally unknown. We therefore have no direct knowledge of the probability distribution and must approximate it by an experimentally obtained frequency distribution.
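
A minimal Java sketch of such a frequency distribution (sample size and binning are arbitrary choices for illustration): a sample is drawn and counted in intervals with limits x and x + dx, and the relative frequencies approximate the underlying distribution:

import java.util.Random;

public class Histogram {
    public static void main(String[] args) {
        Random rng = new Random(42);
        int n = 10000, bins = 10;
        int[] count = new int[bins];
        for (int i = 0; i < n; i++) {
            double x = rng.nextDouble();          // x in [0, 1)
            count[(int) (x * bins)]++;            // bin with limits x, x + dx
        }
        for (int k = 0; k < bins; k++) {
            double freq = (double) count[k] / n;  // relative frequency
            System.out.printf("[%.1f, %.1f): %.3f%n",
                              k / (double) bins, (k + 1) / (double) bins, freq);
        }
    }
}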

7. The "Maximum Likelihood" method

In the last chapter we already dealt with the problem of estimating the parameters of a distribution from samples and discussed the desirable properties of estimators, but without giving a rule for how to find such estimators in individual cases. We gave estimators only for the important special cases of expectation value and variance. We now turn to the general problem.
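
As a preview of the method, consider the Poisson distribution mentioned above. For a sample k_1, ..., k_N the log-likelihood is ln L(λ) = Σ (k_i ln λ − λ − ln k_i!), and setting its derivative with respect to λ to zero gives the estimate λ = (k_1 + ... + k_N)/N, the sample mean. A small Java sketch with invented data illustrates this (the term ln k_i! is omitted in the code since it does not depend on λ):

public class PoissonMle {
    static double logLikelihood(int[] k, double lambda) {
        double ll = 0;
        for (int ki : k) ll += ki * Math.log(lambda) - lambda; // ln k_i! omitted: constant in lambda
        return ll;
    }

    public static void main(String[] args) {
        int[] k = {3, 5, 2, 4, 6, 3, 4};          // invented sample
        double sum = 0;
        for (int ki : k) sum += ki;
        double lambdaHat = sum / k.length;        // maximum-likelihood estimate: sample mean
        System.out.println("lambda-hat = " + lambdaHat);
        // a coarse scan confirms that the maximum lies near lambda-hat
        for (double lam = 2.0; lam <= 6.0; lam += 1.0)
            System.out.printf("lambda=%.1f  lnL=%.3f%n", lam, logLikelihood(k, lam));
    }
}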

8. Examination of statistical hypotheses (tests)

Often the problem in the statistical analysis of a sample does not lie in determining originally completely unknown parameters. Rather, one already has a quite definite notion of the value of these parameters: a hypothesis. In a sample taken for production control, for example, one will initially assume that certain critical variables are normally distributed within the tolerance limits around their nominal value.
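
As a concrete illustration (a standard test sketch, not necessarily the procedure developed in this chapter): for a normally distributed variable with known standard deviation sigma, the hypothesis that the mean equals the nominal value mu0 can be tested with the statistic z = (xbar − mu0)/(sigma/sqrt(n)); roughly, |z| > 1.96 rejects the hypothesis at the 5% level. In Java, with invented numbers:

public class ZTest {
    public static void main(String[] args) {
        double[] x = {10.2, 9.9, 10.4, 10.1, 10.3, 10.0}; // invented sample
        double mu0 = 10.0, sigma = 0.2;           // nominal value, known sigma
        double sum = 0;
        for (double xi : x) sum += xi;
        double xbar = sum / x.length;
        double z = (xbar - mu0) / (sigma / Math.sqrt(x.length));
        System.out.printf("xbar = %.3f, z = %.3f%n", xbar, z);
        System.out.println(Math.abs(z) > 1.96 ? "reject hypothesis (5% level)"
                                              : "no reason to reject");
    }
}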

9. The least squares method

The least squares method goes back to LEGENDRE and GAUSS.
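
The idea can be seen in its simplest case (a standard example, not necessarily the one used later in this text). If a quantity is measured N times with results y_1, ..., y_N, least squares determines the value a that minimizes

    M(a) = (y_1 − a)² + (y_2 − a)² + ... + (y_N − a)².

Setting dM/da = −2 [(y_1 − a) + ... + (y_N − a)] = 0 gives

    a = (y_1 + y_2 + ... + y_N) / N,

the arithmetic mean of the measurements.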

10. Minimizing a function

Finding extreme values is central to data analysis. The problem occurs in the solution of least squares problems in the form M(x, y) = min and in maximum likelihood problems as L = max. By a simple change of sign, the latter task can also be treated as a search for a minimum. We therefore always speak of minimization.
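
A minimal Java sketch of one-dimensional minimization by golden-section search (one of many possible methods, chosen here for illustration; bracketing interval and tolerance are arbitrary). Note how maximizing a function is handled by minimizing its negative, as stated above:

import java.util.function.DoubleUnaryOperator;

public class GoldenSection {
    static double minimize(DoubleUnaryOperator f, double a, double b, double tol) {
        final double g = (Math.sqrt(5) - 1) / 2;  // golden ratio minus 1
        double c = b - g * (b - a), d = a + g * (b - a);
        while (b - a > tol) {
            if (f.applyAsDouble(c) < f.applyAsDouble(d)) b = d; else a = c;
            c = b - g * (b - a);
            d = a + g * (b - a);
        }
        return (a + b) / 2;
    }

    public static void main(String[] args) {
        // minimum of (x - 2)^2 + 1, expected at x = 2
        double xMin = minimize(x -> (x - 2) * (x - 2) + 1, 0, 5, 1e-8);
        System.out.println("x_min = " + xMin);
        // maximizing f means minimizing -f; maximum of exp(-(x-1)^2) at x = 1
        double xMax = minimize(x -> -Math.exp(-(x - 1) * (x - 1)), -4, 6, 1e-8);
        System.out.println("x_max = " + xMax);
    }
}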

11. Analysis of Variance

The analysis of variance, which goes back to R. A. FISHER, is concerned with testing the hypothesis of equal means of a number of samples. Such problems occur, for example, when comparing series of measurements carried out under different conditions, or when checking the production of workpieces from different machines. The aim is to uncover the influence that changes in external variables (such as test conditions or the serial number of a machine) have on a sample.
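
A minimal Java sketch of the one-way case (data invented, e.g. the same quantity measured on workpieces from three machines): the ratio F of the between-group mean square to the within-group mean square is computed; a large F speaks against the hypothesis of equal means:

public class OneWayAnova {
    public static void main(String[] args) {
        double[][] groups = {
            {10.1, 10.3, 9.9, 10.2},   // machine 1
            {10.6, 10.8, 10.5, 10.7},  // machine 2
            {10.0, 10.2, 10.1, 9.8}    // machine 3
        };
        int k = groups.length, n = 0;
        double grand = 0;
        double[] mean = new double[k];
        for (int i = 0; i < k; i++) {
            for (double x : groups[i]) { mean[i] += x; grand += x; n++; }
            mean[i] /= groups[i].length;
        }
        grand /= n;
        double ssBetween = 0, ssWithin = 0;
        for (int i = 0; i < k; i++) {
            ssBetween += groups[i].length * (mean[i] - grand) * (mean[i] - grand);
            for (double x : groups[i]) ssWithin += (x - mean[i]) * (x - mean[i]);
        }
        double f = (ssBetween / (k - 1)) / (ssWithin / (n - k));
        System.out.printf("F = %.2f with (%d, %d) degrees of freedom%n", f, k - 1, n - k);
    }
}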

12. Linear and Polynomial Regression

Fitting a linear function (or, more generally, a polynomial) to measured values that depend on a controlled variable is arguably the most common task in data analysis. It is also known as linear (or polynomial) regression.
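
For the straight line y = a + b t the least squares solution has a well-known closed form, sketched below in Java (equal weights assumed; the data points are invented for illustration):

public class LineFit {
    public static void main(String[] args) {
        double[] t = {1, 2, 3, 4, 5};             // controlled variable
        double[] y = {2.1, 3.9, 6.2, 7.8, 10.1};  // measured values
        int n = t.length;
        double st = 0, sy = 0, stt = 0, sty = 0;
        for (int i = 0; i < n; i++) {
            st += t[i]; sy += y[i];
            stt += t[i] * t[i]; sty += t[i] * y[i];
        }
        double b = (n * sty - st * sy) / (n * stt - st * st); // slope
        double a = (sy - b * st) / n;                          // intercept
        System.out.printf("y = %.3f + %.3f t%n", a, b);
    }
}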

13. Time series analysis

In the last chapter we considered a random variable y as a function of a controlled variable t.
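
One of the most basic operations of time series analysis is smoothing. A minimal Java sketch of a simple moving average follows (the window length 2k + 1 = 3 and the data are arbitrary choices for illustration):

public class MovingAverage {
    public static void main(String[] args) {
        double[] y = {2.0, 2.4, 1.9, 2.6, 3.1, 2.8, 3.5, 3.2}; // invented series
        int k = 1;                                 // half-width of the window
        for (int i = k; i < y.length - k; i++) {
            double sum = 0;
            for (int j = i - k; j <= i + k; j++) sum += y[j];
            System.out.printf("t=%d  smoothed y=%.3f%n", i, sum / (2 * k + 1));
        }
    }
}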
