# Statistical Thinking

for Machine Learning with Python and R

As a **social scientist** (PDF) and fellow of the Royal Statistical Society, my skills include **exploratory data analysis** and **confirmatory data analysis**. Statistics was also the focus of my studies at the University of Cologne and Utrecht University (Netherlands). I earned my doctorate at the Justus Liebig University Giessen for performing one of the first **longitudinal media analyses** and took additional courses in statistics at the University of Leuven (Belgium) and Bielefeld University. Reading Herbert George Wells, who said "statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write", made me interested in the Johns Hopkins University (USA) **data science specialization** (PDF), the Stanford University (USA) **machine learning course** (PDF) as well as the more advanced **mathematics for machine learning specialization** (PDF) of Imperial College London (UK) on Coursera. Besides my current academic lectures, I advise public as well as governmental organisations on the application of multivariate statistics, machine learning algorithms and the current limitations of **artificial intelligence** by providing some catchy introductions to Python and R.

See you soon

**Prof. Dr. Dennis Klinkhammer**

University of Applied Sciences Teacher

Machine learning algorithms are meant to enable the computer to generalize from experience by using **mathematical models** generated from training data. For example, simple Python **code** (ZIP) based upon logistic regression can be used to differentiate between good and bad wines based upon their chemical composition. Another well-known machine learning algorithm is the **k-nearest neighbors algorithm**, a fairly simple, non-parametric method for classification and regression. Training data is used to generate **vectors in a multidimensional feature space** with appropriate class labels in order to measure distances between data points. The k-nearest neighbors algorithm assumes that the **class labels** (GIF) of nearer neighbors are more likely to be the same than the class labels of more distant neighbors. With the corresponding Python **code** (ZIP) this process can be illustrated via two-dimensional scatterplots that are merged into a three-dimensional **principal component analysis**. This specific machine learning algorithm as well as its accuracy is demonstrated using the famous IRIS dataset, for which R code is also available further down this page.
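The k-nearest neighbors idea described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code from the ZIP archive: the six training points, the labels "A" and "B", and the function names `knn_predict` and `accuracy` are made up for this example, and the IRIS dataset is not used here.

```python
import math
from collections import Counter

# Hypothetical toy training data: (feature vector, class label).
# These points are invented for illustration and are not the IRIS data.
TRAINING_DATA = [
    ((1.0, 1.1), "A"),
    ((1.2, 0.9), "A"),
    ((0.8, 1.0), "A"),
    ((5.0, 5.2), "B"),
    ((5.1, 4.8), "B"),
    ((4.9, 5.0), "B"),
]

# Hypothetical held-out points used to measure accuracy.
TEST_DATA = [((1.1, 1.0), "A"), ((4.7, 5.1), "B")]


def knn_predict(training_data, query, k=3):
    """Return the majority class label among the k nearest training points."""
    # Sort the training points by Euclidean distance to the query point.
    by_distance = sorted(training_data, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


def accuracy(training_data, test_data, k=3):
    """Share of test points whose predicted label matches the true label."""
    correct = sum(knn_predict(training_data, x, k) == y for x, y in test_data)
    return correct / len(test_data)


print(knn_predict(TRAINING_DATA, (1.5, 1.2)))
print(accuracy(TRAINING_DATA, TEST_DATA))
```

An odd value of k avoids tied votes when there are only two classes. With real data the features would usually be standardised first, because Euclidean distance is sensitive to the scale of each variable.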

Combining **statistical thinking** with a programming language like Python also makes it possible to create artificial neural networks. They are meant to **imitate neurons within the human brain** in order to recognise patterns automatically and learn something new without being explicitly programmed for it, just like a machine learning algorithm. Given a common situation, as shown below, artificial neural networks can **predict the correct output data** (on the right) when provided with some **corresponding input data** (on the left):

A human brain easily identifies that the first input column seems to affect the output column. Thus, a new row of input data (010) should correspond to (0) as output data and (110) should consistently correspond to (1). By using a **logistic regression model** with three predictors (one for each column of the input data) the output data can be predicted correctly, provided the **automated learning process** is capable of supplying adjusted weights for each predictor. That's it - an artificial neural network that recognises the patterns of each similar situation and adapts automatically. Furthermore, this Python **code** (ZIP) can be reprogrammed for linear and other non-linear contexts as well.
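A single artificial neuron of this kind can be sketched as a logistic regression trained by gradient descent. The following is a minimal sketch under stated assumptions rather than the code from the ZIP archive: the four training rows are assumed for illustration (chosen so that the output equals the first input column), an intercept term is added to the three predictors, and the learning rate, the number of epochs and all function names are hypothetical choices.

```python
import math

# Assumed training data: the output equals the first input column.
# These rows are an illustration, not the original table.
INPUTS = [(0, 0, 1), (1, 1, 1), (1, 0, 1), (0, 1, 1)]
OUTPUTS = [0, 1, 1, 0]


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, row):
    """Predicted probability that the output is 1 for one input row."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def train(inputs, outputs, learning_rate=0.1, epochs=10000):
    """Fit one weight per predictor plus an intercept by batch gradient descent."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, target in zip(inputs, outputs):
            # Gradient of the log-loss: (prediction - target) * input value.
            error = predict_probability(weights, bias, row) - target
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        # Adjust every weight against its gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


weights, bias = train(INPUTS, OUTPUTS)
for new_row in [(0, 1, 0), (1, 1, 0)]:
    p = predict_probability(weights, bias, new_row)
    print(new_row, round(p, 3), int(p >= 0.5))
```

After training, the weight of the first predictor dominates the others, so the two new rows (010) and (110) are assigned to 0 and 1 respectively, which matches the pattern a human reader spots in the data.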

Within social media, left- and right-wing extremism can be considered widespread phenomena with a growing amount of **radical content**. A quantitative **research focus** (PDF) on the process of radicalisation and **social structures** may uncover underlying mechanisms like frames and pull-factors within YouTube. **Extremist propaganda** placed there is frequently viewed and commented on and can be accessed via **application programming interfaces** in order to identify actors who might be relevant for criminal charges. Since social media provides large amounts of unstructured data, statistical methods common for **big data** have to be applied. However, the most relevant variables for **identifying extremist actors** seem to be the number of comments, likes and replies on YouTube as well as the content of each comment, which can be itemized via **natural language processing**. Due to security restrictions this specific code can't be made public, but more general tutorials on machine learning and artificial neural networks can be found on this page.
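How the content of a comment can be itemized may be illustrated with a very small bag-of-words sketch. It is a generic example and not the restricted research code: the comment records, their field names (`text`, `likes`, `replies`) and the helper functions are invented for illustration, and no API is called.

```python
import re
from collections import Counter

# Hypothetical, neutral comment records; the field names are assumptions.
COMMENTS = [
    {"text": "Great video, thanks!", "likes": 12, "replies": 2},
    {"text": "I disagree with this video.", "likes": 3, "replies": 5},
    {"text": "Thanks for sharing.", "likes": 1, "replies": 0},
]


def tokenize(text):
    """Split a comment into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(comments):
    """Count how often each token occurs across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment["text"]))
    return counts


def comment_features(comment):
    """Combine simple engagement figures with the token count of one comment."""
    return {
        "likes": comment["likes"],
        "replies": comment["replies"],
        "n_tokens": len(tokenize(comment["text"])),
    }


counts = word_counts(COMMENTS)
print(counts.most_common(3))
print([comment_features(c) for c in COMMENTS])
```

Real natural language processing pipelines usually add further steps such as stop-word removal or stemming; they are left out here to keep the sketch short.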

Let's gain some first experience with R by using the **SWISS** (ZIP) dataset for sociological analysis.

Create a multivariate model with the **MTCARS** (ZIP) dataset and get some insights into applied physics.

The **IRIS** (ZIP) dataset is a perfect playground in order to predict something that is really beautiful.

Learn to calculate the goodness of fit with **SIMULATED** (ZIP) datasets and nonlinear regression models.

Install the package corrplot and visualise the bivariate structure inside the **TREES** (ZIP) dataset.

A **TOOTHGROWTH** (ZIP) dataset for comparing the effects of ascorbic acid and orange juice in guinea pigs.
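The goodness of fit mentioned for the simulated datasets can be illustrated with the coefficient of determination (R squared). The tutorials listed here are written in R; the sketch below uses Python instead and is not taken from any of the ZIP archives. The simulated relationship (y = 2x² plus normally distributed noise), the random seed and the function names are assumptions chosen purely for illustration.

```python
import random

# Simulate data from an assumed nonlinear relationship y = 2 * x^2 + noise.
random.seed(42)
xs = [0.5 * i for i in range(21)]
ys = [2.0 * x ** 2 + random.gauss(0.0, 1.0) for x in xs]


def fit_quadratic_coefficient(xs, ys):
    """Least-squares estimate of a in the model y = a * x^2."""
    numerator = sum(x ** 2 * y for x, y in zip(xs, ys))
    denominator = sum(x ** 4 for x in xs)
    return numerator / denominator


def r_squared(ys, predictions):
    """R squared: 1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


a = fit_quadratic_coefficient(xs, ys)
predictions = [a * x ** 2 for x in xs]
print(round(a, 3), round(r_squared(ys, predictions), 4))
```

A value of R squared close to 1 means the fitted curve explains most of the variation in the simulated outcome; comparing it across competing models is one simple way to judge which nonlinear form fits better.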