Statistical methods for the social sciences 5th edition

What is collected in the bundle?

Summary of Statistical methods for the social sciences by Agresti, 5th edition, 2018. Summary in English.

Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

What are statistical methods? – Chapter 1

Statistics is used more and more often to study the behavior of people, not only in the social sciences but also by companies. Everyone can learn how to use statistics, even without much knowledge of mathematics and even with a fear of statistics. Most important are logical thinking and perseverance.

The first step in using statistical methods is collecting data. Data are collected observations of characteristics of interest, for instance the opinions of 1000 people on whether marijuana should be allowed. Data can be obtained through questionnaires, experiments, observations or existing databases.

But statistics is more than the numbers obtained from data. A broader definition of statistics covers all methods for obtaining and analyzing data. Before data can be analyzed, a design is made for how to obtain it. Next, there are two sorts of statistical analysis: descriptive statistics and inferential statistics. Descriptive statistics summarizes the information obtained from a collection of data, so the data is easier to interpret. Inferential statistics makes predictions with the help of data. Which kind of statistics is used depends on the goal of the research (summarizing or predicting).

To understand the differences better, a number of basic terms are important. The subjects are the entities that are observed in a research study, most often...

  • Read more about What are statistical methods? – Chapter 1

Which kinds of samples and variables are possible? – Chapter 2

All characteristics of a subject that can be measured are variables. These characteristics can vary between different subjects within a sample or within a population (like income, sex, opinion). Variables express the variability of a value; as an example, the number of beers consumed per week by students. The values of a variable constitute the measurement scale. Several measurement scales, or ways to distinguish variables, are possible.

The most important divide is that between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, number of brothers and sisters, or income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status or religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate the mean (i.e. the average age), but for categorical variables this isn't possible (i.e. there is no average sex).

There are also four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales. The nominal scale is purely descriptive. For instance, with sex as a variable the possible values are man and woman; there is no order or hierarchy, one value isn't higher than the other. The ordinal scale, on the other hand, assumes a certain order. For instance happiness: if the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a certain...
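
As a minimal illustration (with made-up data, not from the book), the divide determines which summaries make sense: a mean for a quantitative variable, counts per category for a categorical one:

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: one quantitative and one categorical variable
ages = [19, 22, 21, 25, 30]                                     # quantitative
marital = ["single", "married", "single", "single", "married"]  # categorical

print(mean(ages))        # a mean is meaningful for quantitative variables
print(Counter(marital))  # for categorical variables, count per category instead
```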

  • Read more about Which kinds of samples and variables are possible? – Chapter 2

What are the main measures and graphs of descriptive statistics? - Chapter 3

Descriptive statistics serves to create an overview or summary of data. There are two kinds of data, quantitative and categorical, and each has its own descriptive statistics.

To create an overview of categorical data, it's easiest if the categories are in a list including the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category compared to the sample. This can be calculated as a percentage or a proportion. The percentage is the number of observations within a certain category, divided by the total number of observations and multiplied by 100. Calculating a proportion works the same way, except the result isn't multiplied by 100. The sum of all proportions should be 1.00, the sum of all percentages should be 100.

Frequencies can be shown using a frequency distribution, a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also shows the comparison with the sample. Example of a (relative) frequency distribution:

Gender | Frequency | Proportion | Percentage
Male | 150 | 0.43 | 43%
Female | 200 | 0.57 | 57%
Total | 350 (= n) | 1.00 | 100%

Aside from tables, other visual displays are used...
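
A minimal sketch of how the proportions and percentages in such a table are computed, using the counts from the example above:

```python
# Frequencies from the example distribution above
counts = {"Male": 150, "Female": 200}
n = sum(counts.values())  # total sample size, n = 350

for category, freq in counts.items():
    proportion = freq / n          # relative frequency as a proportion
    percentage = proportion * 100  # relative frequency as a percentage
    print(f"{category}: frequency {freq}, "
          f"proportion {proportion:.2f}, {percentage:.0f}%")

# Check: the proportions sum to 1.00 (and the percentages to 100)
print(sum(freq / n for freq in counts.values()))
```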

  • Read more about What are the main measures and graphs of descriptive statistics? - Chapter 3

What role do probability distributions play in statistical inference? – Chapter 4

Randomization is important for collecting data: the possible observations are known, but it's not yet known which possibility will prevail. What will happen depends on probability. The probability is the proportion of times that a certain observation occurs in a long sequence of similar observations. That the sequence is long is important, because the longer the sequence, the more accurate the probability: the sample proportion then becomes more like the population proportion. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, called Bayesian statistics, deals with subjective probabilities. However, most of statistics is about regular probabilities.

A probability is written as P(A), where P is the probability and A is an outcome. If two outcomes are possible and they exclude each other, then the probability that B happens is 1 − P(A).

Imagine research into people's favorite colors, for instance whether that is mostly red or blue. Again the assumption is made that the possibilities exclude each other without overlapping. The probability that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

Next, imagine research...
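
The long-run idea and the two rules can be illustrated with a short simulation; the probabilities below are made-up numbers, not from the book:

```python
import random

random.seed(42)  # reproducible illustration

# Long-run relative frequency: simulate observations of an event with
# true probability 0.7 and watch the sample proportion approach it.
p_true = 0.7
for n in (100, 10_000, 1_000_000):
    hits = sum(random.random() < p_true for _ in range(n))
    print(n, hits / n)

# Rules for mutually exclusive outcomes A (red) and B (blue):
p_red, p_blue = 0.3, 0.2
print(1 - p_red)       # complement rule: P(not A) = 1 - P(A)
print(p_red + p_blue)  # addition rule: P(A or B) = P(A) + P(B)
```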

  • Read more about What role do probability distributions play in statistical inference? – Chapter 4

How can you make estimates for statistical inference? – Chapter 5

Sample data is used to estimate parameters that give information about the population, such as proportions and means. For quantitative variables the population mean is estimated (like how much money on average is spent on medicine in a certain year). For categorical variables the population proportions of the categories are estimated (like how many people do and don't have medical insurance in a certain year).

Two kinds of parameter estimates exist:

  • A point estimate is a single number that is the best prediction.
  • An interval estimate is an interval around a point estimate, which is believed to contain the population parameter.

There is a difference between the estimator (the way that estimates are made) and the point estimate (the estimated number itself). For instance, the sample proportion is an estimator of the population proportion, and 0.73 is a point estimate of the population proportion that believes in love at first sight.

A good estimator has a sampling distribution that is centered around the parameter and has a standard error that is as small as possible. An estimator is unbiased when its sampling distribution is centered around the parameter. This holds for the sample mean: ȳ (the sample mean) on average equals µ (the population mean), so ȳ is regarded a good estimator of µ. When an estimator is biased, the sample mean doesn't estimate the...
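
A sketch of the two kinds of estimates for the love-at-first-sight example; the sample size n = 1000 and the 95% level (z = 1.96) are assumptions made for illustration:

```python
from math import sqrt

p_hat, n = 0.73, 1000  # point estimate from the text; n is assumed

se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of a sample proportion
lower = p_hat - 1.96 * se           # 95% interval estimate around
upper = p_hat + 1.96 * se           # the point estimate

print(f"point estimate: {p_hat}")
print(f"interval estimate: ({lower:.3f}, {upper:.3f})")
```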

  • Read more about How can you make estimates for statistical inference? – Chapter 5

How do you perform significance tests? – Chapter 6

A hypothesis is a prediction that a parameter in the population takes a certain value or falls within a certain interval. A distinction can be made between two kinds of hypotheses. A null hypothesis (H0) is the assumption that a parameter takes a certain value. Opposite to it is the alternative hypothesis (Ha), the assumption that the parameter falls in a range outside of that value. Usually the null hypothesis represents no effect. A significance test (also called hypothesis test or test) determines whether enough evidence exists to support the alternative hypothesis. A significance test compares point estimates of parameters with the values expected under the null hypothesis.

Significance tests consist of five parts:

  • Assumptions. Each test makes assumptions about the type of data (quantitative/categorical), the required level of randomization, the population distribution (for instance the normal distribution) and the sample size.
  • Hypotheses. Each test has a null hypothesis and an alternative hypothesis.
  • Test statistic. This indicates how far the estimate lies from the parameter value under H0. Often this is expressed as the number of standard errors between the estimate and the value under H0.
  • P-value. This gives the weight of evidence against H0. The smaller the P-value is...
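
A sketch of these parts in action for a two-sided test about a proportion; the numbers (H0 value 0.50, sample proportion 0.56, n = 400) are hypothetical:

```python
from math import sqrt, erf

def normal_cdf(z):
    # standard normal cumulative distribution via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

p0, p_hat, n = 0.50, 0.56, 400  # H0 value, estimate, sample size (made up)

se = sqrt(p0 * (1 - p0) / n)            # standard error assuming H0 is true
z = (p_hat - p0) / se                   # test statistic: distance from H0 in SEs
p_value = 2 * (1 - normal_cdf(abs(z)))  # two-sided P-value

print(f"z = {z:.2f}, P = {p_value:.4f}")  # smaller P, stronger evidence against H0
```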

  • Read more about How do you perform significance tests? – Chapter 6

How do you compare two groups in statistics? - Chapter 7

In social science two groups are often compared. For quantitative variables means are compared, for categorical variables proportions. When comparing two groups, a binary variable is used: a variable with two categories (also called dichotomous), for instance sex with the categories men and women. This is an example of bivariate statistics.

Two groups can be dependent or independent. They are dependent when the respondents naturally match with each other, for example in longitudinal research, where the same group is measured at two moments in time. In an independent sample the groups don't match, for instance in cross-sectional research, where people are randomly selected from the population.

Imagine comparing two independent groups, men and women, on the time they spend sleeping. Men and women are two different groups, with two population means, two estimates and two standard errors. The standard error indicates how much the sample mean varies from sample to sample. Because the difference is what we want to investigate, this difference also has a standard error. The population difference is estimated by the sample difference: what you want to know is µ2 − µ1, and this is estimated by ȳ2 − ȳ1. This can be shown in a sampling distribution. The standard error of ȳ2 − ȳ1...
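
A sketch of how the difference and its standard error combine the two groups' summaries; the sleep numbers are invented:

```python
from math import sqrt

# Hypothetical summaries: hours of sleep in two independent groups
mean1, s1, n1 = 7.1, 1.2, 80   # group 1 (men): mean, std dev, size
mean2, s2, n2 = 7.5, 1.0, 90   # group 2 (women)

diff = mean2 - mean1                     # estimates mu2 - mu1
se_diff = sqrt(s1**2 / n1 + s2**2 / n2)  # combines both groups' standard errors

print(f"difference: {diff:.2f}, standard error: {se_diff:.3f}")
print(f"95% CI: ({diff - 1.96*se_diff:.2f}, {diff + 1.96*se_diff:.2f})")
```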

  • Read more about How do you compare two groups in statistics? - Chapter 7

How do you analyze the association between categorical variables? – Chapter 8

A contingency table contains the counts of all possible combinations of categorical data. A 4x5 contingency table has 4 rows and 5 columns. It often shows percentages; this is called relative data.

A conditional distribution shows the data as percentages of a subtotal, dependent on a certain condition, like women that have a cold. A marginal distribution contains the separate row and column totals. A simultaneous (joint) distribution shows the percentages with respect to the entire sample.

Two categorical variables are statistically independent when the probability that one occurs is unrelated to the probability that the other occurs, so when the probability distribution of one variable is not influenced by the outcome of the other variable. Otherwise they are statistically dependent. Independence is a statement about the population; the sample will probably show a similar pattern, but not necessarily, since the variability can be high. A significance test tells whether it's plausible that the variables really are independent in the population. The hypotheses for this test are:

  • H0: the variables are statistically independent
  • Ha: the variables are statistically dependent

A cell in a...
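
A sketch of this independence test on a made-up 2x2 table (gender by having a cold), assuming scipy is available:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = gender, columns = cold yes/no
observed = np.array([[30, 70],
                     [45, 55]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, df = {dof}, P = {p_value:.3f}")
# A small P-value is evidence against H0 (statistical independence)
```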

  • Read more about How do you analyze the association between categorical variables? – Chapter 8

How do linear regression and correlation work? – Chapter 9

Regression analysis is the process of researching associations between quantitative response variables and explanatory variables. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association and 3) making a regression equation to predict the value of the response variable from the explanatory variable.

The response variable is denoted as y and the explanatory variable as x. A linear function means that there is a straight line through the data points in a graph. A linear function is: y = α + βx. Here alpha (α) is the y-intercept and beta (β) is the slope. The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.

The y-intercept is the value of y when x = 0. In that case βx equals 0 and only y = α remains. The y-intercept is the location where the line crosses the y-axis. The slope (β) indicates the change in y for a change of 1 in x, so the slope indicates how steep the line is: the larger β is in absolute value, the steeper the line. When β is positive, y increases when x increases (a positive relationship). When β is negative, y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are...
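
A sketch of how α and β are estimated with the standard least squares formulas; the (x, y) data points are invented:

```python
# Least squares estimates for y = alpha + beta*x on made-up data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# slope: beta = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs))
alpha = y_bar - beta * x_bar  # the fitted line passes through (x_bar, y_bar)

print(f"fitted line: y = {alpha:.2f} + {beta:.2f}x")
```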

  • Read more about How do linear regression and correlation work? – Chapter 9

Which types of multivariate relationships exist? – Chapter 10

Many scientific studies research more than two variables, which requires multivariate methods. A lot of research focuses on causal relationships between variables, but finding proof of causality is difficult: a relationship that appears causal may be caused by another variable. Statistical control is the method of checking whether an association between variables changes or disappears when the influence of other variables is removed. In a causal relationship, x → y, the explanatory variable x causes the response variable y. This is asymmetrical, because y does not need to cause x.

There are three criteria for a causal relationship:

  • Association between the variables
  • An appropriate time order
  • Elimination of alternative explanations

An association is required for a causal relationship, but an association alone does not establish one. Usually it is immediately clear what the logical time order is, such as an explanatory variable preceding a response variable. Apart from x and y, extra variables may provide an alternative explanation. In observational studies it can almost never be proved that one variable causes another. Sometimes there are outliers or anecdotes that contradict causality, but usually a single anecdote isn't enough to refute it. It's easier to establish causality with randomized experiments than with observational studies. This is because randomization assigns the two groups randomly...

  • Read more about Which types of multivariate relationships exist? – Chapter 10

What is multiple regression? – Chapter 11

A multiple regression model has more than one explanatory variable and sometimes also one or more control variables: E(y) = α + β1x1 + β2x2. The explanatory variables are numbered: x1, x2, etc. When an explanatory variable is added, the equation is extended with β2x2. The parameters are α, β1 and β2. The y-axis is vertical, x1 is horizontal and x2 is perpendicular to x1. In this three-dimensional graph the multiple regression equation describes a flat surface, called a plane.

A partial regression equation describes only part of the possible observations: only those with a certain value of the other variable(s).

In multiple regression a coefficient indicates the effect of an explanatory variable on a response variable while controlling for the other variables. This is the basic difference between bivariate and multiple regression: bivariate regression completely ignores the other variables, while multiple regression holds them constant. The coefficient (like β1) of a predictor (like x1) gives the change in the mean of y when the predictor is raised by one unit, controlling for the other variables (like x2). In that case β1 is a partial regression coefficient. The parameter α...
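
A sketch of fitting E(y) = α + β1x1 + β2x2 by least squares with numpy; all numbers are invented:

```python
import numpy as np

# Made-up data: response y with two explanatory variables
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
y = np.array([3.1, 3.9, 6.2, 6.8, 9.1, 9.7])

# Design matrix with an intercept column: E(y) = alpha + b1*x1 + b2*x2
X = np.column_stack([np.ones(len(y)), x1, x2])
alpha, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"E(y) = {alpha:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")
# b1 is a partial coefficient: the change in the mean of y per unit of x1,
# controlling for x2
```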

  • Read more about What is multiple regression? – Chapter 11

What is ANOVA? – Chapter 12

For analyzing categorical variables without assigning a ranking, dummy variables are an option. This means that artificial variables are created from the observations:

  • z1 = 1 and z2 = 0: observations of category 1 (men)
  • z1 = 0 and z2 = 1: observations of category 2 (women)
  • z1 = 0 and z2 = 0: observations of category 3 (transgender and other identities)

The model is: E(y) = α + β1z1 + β2z2. The means follow from the model: μ1 = α + β1, μ2 = α + β2 and μ3 = α. Three categories only require two dummy variables, because whatever remains falls in category 3.

A significance test using the F-distribution tests whether the means are the same. The null hypothesis H0: μ1 = μ2 = μ3 is the same as H0: β1 = β2 = 0. A large F means a small P and much evidence against the null hypothesis.

The F-test is robust against small violations of normality and differences in the standard deviations, but it can't handle very skewed data. This is why randomization is important. A small P doesn't say which means differ or by how much; confidence intervals give more information. A confidence interval can be constructed for every mean, or for the difference between two means. An estimate of the difference in population means is:

(ȳi − ȳj) ± t s √(1/ni + 1/nj)

where s estimates the standard deviation within the groups. The degrees of freedom of the t-score are df = N – g,...
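
A sketch of the F-test on three made-up groups, assuming scipy; the same test could equivalently be run by regressing y on the two dummy variables:

```python
from scipy.stats import f_oneway

# Hypothetical observations for g = 3 categories
group1 = [5.1, 6.0, 5.8, 6.3]
group2 = [6.9, 7.2, 6.5, 7.8]
group3 = [5.5, 5.2, 6.1, 5.7]

F, p_value = f_oneway(group1, group2, group3)  # tests H0: mu1 = mu2 = mu3
print(f"F = {F:.2f}, P = {p_value:.4f}")
# A large F (small P) suggests at least two population means differ;
# confidence intervals are needed to see which ones, and by how much.
```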

  • Read more about What is ANOVA? – Chapter 12

How does multiple regression with both quantitative and categorical predictors work? – Chapter 13

Multiple regression is also feasible for a combination of quantitative and categorical predictors. In a lot of research it makes sense to control for a quantitative variable. A quantitative control variable is called a covariate, and it is studied using analysis of covariance (ANCOVA).

A graph helps to research the effect of a quantitative predictor x on the response y while controlling for a categorical predictor z. For two categories z can be a single dummy variable; otherwise more dummy variables are required (like z1 and z2). The values of z can be 1 ('agree') or 0 ('don't agree'). If there is no interaction, the lines that fit the data best are parallel and the slopes are the same. It's even possible that the regression lines are exactly the same. But if they aren't parallel, there is interaction.

The predictor can be quantitative and the control variable categorical, but it can also be the other way around. Software compares the means. A regression model with three categories is: E(y) = α + βx + β1z1 + β2z2, in which β is the effect of x on y for all groups z. For every additional quantitative variable a βx term is added. For every additional categorical variable a dummy variable is added (or several, depending on the number of categories). Cross-product terms are added in case of interaction....
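
A sketch of the no-interaction and interaction models with one covariate x and one dummy z; the data are invented:

```python
import numpy as np

# Made-up data: covariate x, dummy z (1 = 'agree', 0 = 'don't agree'), response y
x = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5])
z = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = np.array([2.0, 3.1, 3.9, 5.2, 3.4, 4.6, 5.3, 6.6])

# No interaction: E(y) = alpha + beta*x + beta1*z gives two parallel lines
X = np.column_stack([np.ones(len(y)), x, z])
a, b, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"z=0: y = {a:.2f} + {b:.2f}x   z=1: y = {a + b1:.2f} + {b:.2f}x")

# With a cross-product term x*z the slopes may differ (interaction)
Xi = np.column_stack([np.ones(len(y)), x, z, x * z])
a, b, b1, b2 = np.linalg.lstsq(Xi, y, rcond=None)[0]
print(f"slopes: {b:.2f} for z=0 vs {b + b2:.2f} for z=1")
```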

  • Read more about How does multiple regression with both quantitative and categorical predictors work? – Chapter 13

How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14

Three basic rules for selecting variables to add to a model are:

  • Select variables that serve the theoretical purpose (accepting/rejecting the null hypothesis), with sensible control variables and mediating variables
  • Add enough variables for good predictive power
  • Keep the model simple

The explanatory variables should be highly correlated with the response variable but not with each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. Backward elimination starts with all potential variables in the model, tests their P-values and step by step removes the non-significant variables. Forward selection starts from scratch, at each step adding the variable with the lowest P-value. Another version of this is stepwise regression; this method also removes variables that have become redundant when new variables are added.

Software helps, but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting from a theoretical model with known variables, or exploratory, openly looking for explanations of a phenomenon.

Several criteria are indications of a good model. To find a model with high predictive power but without an overabundance of variables, the adjusted...
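
A sketch of forward selection on synthetic data. Note one swap: instead of P-values, this version uses adjusted R² (the criterion this chapter's summary turns to at the end) to decide whether adding a variable is worthwhile:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y really depends on x1 and x2; x3 is pure noise
n = 100
data = {name: rng.normal(size=n) for name in ("x1", "x2", "x3")}
y = 2.0 + 1.5 * data["x1"] - 1.0 * data["x2"] + rng.normal(scale=0.5, size=n)

def adjusted_r2(names):
    # Fit y on an intercept plus the named predictors, return adjusted R^2
    X = np.column_stack([np.ones(n)] + [data[m] for m in names])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()
    p = len(names)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra predictors

# Forward selection: repeatedly add the predictor that helps most
selected, remaining = [], ["x1", "x2", "x3"]
while remaining:
    best = max(remaining, key=lambda m: adjusted_r2(selected + [m]))
    if adjusted_r2(selected + [best]) <= adjusted_r2(selected):
        break  # no candidate improves the model: keep it simple
    selected.append(best)
    remaining.remove(best)

print("selected predictors:", selected)
```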

  • Read more about How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14

What is logistic regression? – Chapter 15

A logistic regression model is a model with a binary response variable (like 'agree' or 'don't agree'). Logistic regression models can also have ordinal or nominal response variables. The mean is the proportion of responses that are 1. The linear probability model is P(y=1) = α + βx. This model often is too simple; a more extended version is:

P(y=1) = e^(α + βx) / (1 + e^(α + βx))

The logarithm can be calculated using software. The odds are P(y=1)/[1 − P(y=1)]. The log of the odds, or logistic transformation (abbreviated as logit), is the logistic regression model: logit[P(y=1)] = α + βx. To find the outcome for a certain value of a predictor, the formula above is used; e raised to a certain power is the antilog of that number.

A straight line drawn against the curve of a logistic graph helps to analyze it: the slope of the curve is greatest where P(y=1) = ½. For logistic regression the maximum likelihood method is used instead of the least squares method. The model expressed in odds is:

P(y=1) / [1 − P(y=1)] = e^(α + βx)

The estimate follows by plugging in the sample estimates of α and β; with this the odds ratio can be calculated.

There are two possibilities to present the data. For ungrouped data a normal contingency table suffices. For grouped data a row contains the data for every count in a cell, like just one row with the number of subjects that agreed, followed by...
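
A sketch of how logit, probability and odds relate; the coefficients α = −3.0 and β = 0.5 are made-up values standing in for maximum likelihood estimates:

```python
from math import exp, log

alpha, beta = -3.0, 0.5  # hypothetical fitted coefficients

def prob(x):
    # Inverse of the logit: P(y=1) = e^(alpha+beta*x) / (1 + e^(alpha+beta*x))
    return exp(alpha + beta * x) / (1 + exp(alpha + beta * x))

for x in (2, 6, 10):
    p = prob(x)
    odds = p / (1 - p)  # odds = e^(alpha + beta*x)
    print(f"x={x}: P(y=1)={p:.3f}, odds={odds:.3f}, logit={log(odds):.2f}")

# e^beta is the odds ratio: each 1-unit increase in x multiplies the odds by it
print("odds ratio per unit of x:", exp(beta))
```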

  • Read more about What is logistic regression? – Chapter 15
