This chapter is not written yet.

Regression Analysis

Regression analysis is a statistical process for estimating the relationships among variables. Regression (or, for categorical outcomes, logistic regression) can be used to predict characteristics or events associated with a household.

Analysis Type

The Multiple Linear Regression Model

The multiple linear regression model, estimated by ordinary least squares (OLS), is doubtless the most widely used tool in econometrics. It allows one to estimate the relation between a dependent variable and a set of explanatory variables by minimizing the sum of squared distances between the observed and the predicted values of the dependent variable.
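
As a rough illustration of the least squares idea described above, the following sketch fits a model with two regressors on simulated data. It uses Python with NumPy, which is an assumption of this example (the course material itself does not prescribe it); the data-generating values and the helper name `ols` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Return the coefficients that minimize the sum of squared residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated data (hypothetical): y = 1 + 2*x1 - 0.5*x2 + noise
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = ols(X, y)

# Residuals and the R-squared measure of goodness of fit
residuals = y - X @ beta_hat
r_squared = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
```

At the OLS solution the residuals are orthogonal to every column of `X`, which is the first-order condition of the minimization problem.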

Key concepts: OLS Assumptions and Estimation, Goodness of Fit, Small Sample Properties, Tests in Small Samples, Confidence Intervals in Small Samples, Asymptotic Properties of the OLS Estimator, Asymptotic Tests, Confidence Intervals in Large Samples, Small Sample vs. Asymptotic Properties, More Known Issues (Non-linear Functional Form, Aggregate Regressors, Omitted Variables, Irrelevant Regressors, Reverse Causality, Measurement Error, Multicollinearity).

Functional Form in the Linear Model

Despite its name, the classical linear regression model is not limited to a linear relationship between the dependent and the explanatory variables. It is possible to build a model that is linear in the parameters but includes non-linear functions of the regressors.
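
To make "linear in the parameters" concrete, here is a minimal sketch in Python with NumPy (an assumption of this example). The dependent variable enters in logs and the regressor enters with a squared term, yet the model is still estimated by ordinary least squares. The simulated coefficients and the helper `marginal_effect` are hypothetical.

```python
import numpy as np

# Simulated data (hypothetical): log(y) = 0.5 + 0.3*x - 0.02*x^2 + noise
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(1.0, 10.0, size=n)
log_y = 0.5 + 0.3 * x - 0.02 * x**2 + rng.normal(scale=0.1, size=n)

# The regressors are non-linear functions of x, but the model is
# linear in the parameters, so least squares applies unchanged.
X = np.column_stack([np.ones(n), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, log_y, rcond=None)

def marginal_effect(beta, x0):
    """Derivative of the fitted log(y) with respect to x, evaluated at x0."""
    return beta[1] + 2.0 * beta[2] * x0
```

With a polynomial term the effect of `x` is no longer a single coefficient: it depends on the point `x0` at which it is evaluated, for example `marginal_effect(beta_hat, 5.0)`.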

Key concepts: Log-Linear, Semi-Log, Polynomial, Inverse, Dummy Variables, Interaction Terms, Spline Functions.

Heteroskedasticity in the Linear Model

This chapter relaxes the homoskedasticity assumption of least squares estimation and shows how the parameters of the linear model can be correctly estimated and tested when the error terms are heteroskedastic, i.e. when their variance, conditional on the regressors, changes across observations.
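
The sketch below, written in Python with NumPy as an assumption of this example, compares classical OLS standard errors with heteroskedasticity-robust ones. It uses the basic White (HC0) sandwich formula without any small-sample correction, which is one common choice and not necessarily the variant used in the chapter. The simulated data and the function name are hypothetical.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with classical and White (HC0) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Classical formula: assumes one common error variance
    se_classical = np.sqrt(np.diag(XtX_inv) * (resid @ resid) / (n - k))
    # Sandwich formula: sum over i of e_i^2 * x_i x_i' in the middle
    meat = (X * resid[:, None] ** 2).T @ X
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_robust

# Simulated data (hypothetical): the error spread grows with |x|
rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
errors = rng.normal(size=n) * np.abs(x)
y = 1.0 + 2.0 * x + errors
X = np.column_stack([np.ones(n), x])
beta_hat, se_classical, se_robust = ols_with_robust_se(X, y)
```

In this design the coefficient estimates remain close to the true values, but the classical standard error of the slope is too small, so the robust standard error comes out noticeably larger.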

Key concepts: Groupwise Heteroskedasticity, Estimation with OLS, Estimating the Variance of the OLS Estimator, Testing for Heteroskedasticity, Estimation with GLS/WLS when the Variance Matrix is Known, Estimation with FGLS/FWLS when the Variance Matrix is Unknown.

Clustering in the Linear Model

This chapter relaxes the homoskedasticity assumption of least squares estimation and allows the error terms to be heteroskedastic and correlated within groups, or so-called clusters. It shows in which situations the parameters of the linear model can be consistently estimated by OLS and how the standard errors need to be corrected. Clustering might arise when the sampling mechanism first draws a random sample of groups (e.g. schools, households, towns) and then surveys all (or a random sample of) observations within each group.
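
A minimal sketch of cluster-robust standard errors follows, in Python with NumPy (an assumption of this example). It sums the scores within each cluster before forming the sandwich estimator and applies no finite-sample correction, so it should be read as an illustration under those stated assumptions. The simulated design (100 groups of 20 observations, a regressor that varies only across groups) and all names are hypothetical.

```python
import numpy as np

def ols_with_cluster_se(X, y, groups):
    """OLS coefficients with classical and cluster-robust standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    se_classical = np.sqrt(np.diag(XtX_inv) * (resid @ resid) / (n - k))
    # Sum x_i * e_i within each cluster, then take outer products
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        in_g = groups == g
        score = X[in_g].T @ resid[in_g]
        meat += np.outer(score, score)
    se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_cluster

# Simulated data (hypothetical): group-level regressor and group effect
rng = np.random.default_rng(3)
n_groups, group_size = 100, 20
groups = np.repeat(np.arange(n_groups), group_size)
x = rng.normal(size=n_groups)[groups]
group_effect = rng.normal(size=n_groups)[groups]
y = 1.0 + 0.5 * x + group_effect + rng.normal(size=groups.size)
X = np.column_stack([np.ones(groups.size), x])
beta_hat, se_classical, se_cluster = ols_with_cluster_se(X, y, groups)
```

Because both the regressor and part of the error are shared within a group, the classical standard error of the slope is far too small here; the clustered one is several times larger.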

Key concepts: Random Cluster-Specific Effects, Estimation with OLS, Estimating Correct Standard Errors, Efficient Estimation with GLS, Estimating Correct Standard Errors with Random Cluster-Specific Effects.

Instrumental Variables

In many applications of the linear model, we suspect that some regressors are endogenous, i.e. that one or more regressors are correlated with the error term. In this situation, OLS cannot consistently estimate the causal effect of the regressor on the dependent variable. Sometimes we are able to find exogenous variables that are correlated with the endogenous regressor but not with the error term. Such variables are called instrumental variables or instruments. If enough valid instruments are available, we can consistently estimate the causal effect of the regressor on the dependent variable.
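
The following sketch illustrates the idea with two-stage least squares (2SLS) on simulated data, in Python with NumPy (an assumption of this example). An unobserved confounder `u` makes the regressor endogenous, and `z` plays the role of the instrument. All variable names and coefficient values are hypothetical.

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """2SLS: project the regressors on the instruments, then run OLS."""
    # First stage: fitted values of X from a regression on Z
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress y on the fitted values
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Simulated data (hypothetical): the true effect of x on y is 2
rng = np.random.default_rng(4)
n = 5000
z = rng.normal(size=n)                 # instrument, unrelated to u
u = rng.normal(size=n)                 # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x + 2.0 * u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_iv = two_stage_least_squares(y, X, Z)
```

In this design the OLS slope is biased upward because `x` and the error share the component `u`, while the 2SLS slope is close to the true value. The second-stage residuals from this shortcut are not the right ones for computing standard errors, so the sketch reports coefficients only.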

Key concepts: Canonical Examples (Omitted Variables, Simultaneity and Reverse Causality, Measurement Errors), Estimation with OLS, Estimation with IV (2SLS), Asymptotic Properties of the IV Estimators, What are Valid Instruments, Testing for Exogeneity of the Instruments, Testing for Relevance of the Instruments, Testing for Exogeneity of the Regressors.

Panel Data: Fixed and Random Effects

In panel data, individuals (persons, firms, cities, …) are observed at several points in time (days, years, before and after treatment, …). This chapter focuses on panels with relatively few time periods (small T) and many individuals (large N). It introduces the two basic models for the analysis of panel data, the fixed effects model and the random effects model, and presents consistent estimators for both. Panel data are most useful when we suspect that the outcome variable depends on explanatory variables which are not observable but are correlated with the observed explanatory variables. If such omitted variables are constant over time, panel data estimators allow consistent estimation of the effect of the observed explanatory variables.
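
As an illustration of why time-constant omitted variables can be removed, the sketch below implements the within (fixed effects) estimator for a single regressor by demeaning each individual's observations. It is written in Python with NumPy, which is an assumption of this example, and the simulated panel (500 individuals, 4 periods) and all names are hypothetical.

```python
import numpy as np

def within_estimator(y, x, ids):
    """Fixed effects slope: OLS on individual-demeaned y and x."""
    _, index = np.unique(ids, return_inverse=True)
    counts = np.bincount(index)
    y_means = np.bincount(index, weights=y) / counts
    x_means = np.bincount(index, weights=x) / counts
    y_dm = y - y_means[index]   # removes anything constant over time
    x_dm = x - x_means[index]
    return (x_dm @ y_dm) / (x_dm @ x_dm)

# Simulated panel (hypothetical): alpha is unobserved, constant over
# time, and correlated with x; the true effect of x is 1.5
rng = np.random.default_rng(5)
n_individuals, n_periods = 500, 4
ids = np.repeat(np.arange(n_individuals), n_periods)
alpha = rng.normal(size=n_individuals)[ids]
x = alpha + rng.normal(size=ids.size)
y = 1.5 * x + 2.0 * alpha + rng.normal(size=ids.size)

X_pooled = np.column_stack([np.ones(ids.size), x])
beta_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][1]
beta_fe = within_estimator(y, x, ids)
```

Pooled OLS attributes part of the effect of `alpha` to `x` and overstates the slope, whereas the within estimator recovers a value close to the true one.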

Key concepts: The Random Effects Model, The Fixed Effects Model, Estimation with Pooled OLS, Random Effects Estimation, Fixed Effects Estimation, Least Squares Dummy Variable Estimation (LSDV), First Difference Estimator, Time Fixed Effects, Random Effects vs. Fixed Effects Estimation.

Binary Response Models

Many dependent variables of interest in economics and other social sciences can only take two values. The two possible outcomes are usually denoted by 0 and 1. Such variables are called dummy variables or dichotomous variables. As already seen in the course Introductory Econometrics, there are several ways to model these outcomes using regressions. This chapter specifically focuses on the interpretation of the Probit and Logit models, and of their estimated parameters.
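
To illustrate estimation and interpretation, the following sketch fits a Logit model by maximum likelihood using Newton-Raphson iterations and then computes an average marginal effect. It is written in Python with NumPy as an assumption of this example; the fixed iteration count, the simulated data and the function names are hypothetical simplifications.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Maximum likelihood Logit coefficients via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta

def average_marginal_effect(X, beta, j):
    """Average over the sample of dP(y=1|x)/dx_j for a continuous x_j."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.mean(p * (1.0 - p)) * beta[j]

# Simulated data (hypothetical): P(y=1|x) = logistic(-0.5 + 1.0*x)
rng = np.random.default_rng(6)
n = 5000
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))
y = (rng.uniform(size=n) < p_true).astype(float)

X = np.column_stack([np.ones(n), x])
beta_hat = fit_logit(X, y)
ame = average_marginal_effect(X, beta_hat, 1)
```

The coefficient itself is not the effect on the probability: the marginal effect scales it by `p * (1 - p)`, which is at most 0.25, so the average marginal effect is always smaller in magnitude than a quarter of the coefficient.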

Key concepts: The Econometric Model: Probit and Logit, Latent Variable Model, Interpretation of the Parameters, Estimation with Maximum Likelihood, Estimation with OLS.

Limited Dependent Variable Models

  • Truncation occurs when the observed data in the sample are drawn only from a subset of a larger population. The sampling of the subset is based on the value of the dependent variable.
  • Censoring occurs when the values of the dependent variable are restricted to a range of values. As in the case of truncation, the dependent variable is only observed for a subsample. However, there is information (the independent variables) about the whole sample.
  • The sample selection problem occurs when the observed sample is not a random sample but systematically chosen from the population. Truncation and censoring are special cases of sample selection or incidental truncation.

This chapter presents the econometric models that are used to deal with the above-mentioned situations.
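
The small simulation below suggests why dedicated models are needed in these situations. It applies plain OLS to a censored and to a truncated version of the same simulated data and compares the slopes with the one obtained from the fully observed latent variable. It is written in Python with NumPy (an assumption of this example), does not implement the truncated regression, Tobit or Heckman estimators, and uses a hypothetical threshold of zero and hypothetical coefficients.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple regression of y on a constant and x."""
    X = np.column_stack([np.ones(x.size), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Simulated latent outcome (hypothetical): y* = 1 + 2*x + noise
rng = np.random.default_rng(7)
n = 20000
x = rng.normal(size=n)
y_latent = 1.0 + 2.0 * x + rng.normal(size=n)

# Censoring: values below zero are recorded as zero, x is still observed
y_censored = np.maximum(y_latent, 0.0)

# Truncation: observations with y* <= 0 are missing from the sample
keep = y_latent > 0.0

slope_latent = ols_slope(x, y_latent)
slope_censored = ols_slope(x, y_censored)
slope_truncated = ols_slope(x[keep], y_latent[keep])
```

In this design both the censored and the truncated regressions understate the true slope of 2, while the regression on the latent variable, which is not available in practice, recovers it.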

Key concepts: Truncation, Truncated Regression, Interpretation of Parameters, Estimation; Censoring, Tobit Model Type I, Interpretation of Parameters, Estimation; Selection, Heckman Selection Model, Interpretation of Parameters, Estimation, Estimation with Maximum Likelihood, Estimation with Heckman’s Two-Step Procedure.

Quantile Regression

Quantile regression provides an alternative to ordinary least squares (OLS) regression and related methods, which typically assume that the association between the independent and dependent variables is the same at all levels of the dependent variable. Quantile methods allow the analyst to relax this common-slope assumption. In OLS regression, the goal is to minimize the sum of squared distances between the values predicted by the regression line and the observed values. In contrast, quantile regression weights the absolute distances asymmetrically, depending on whether an observation lies above or below the fitted line, and then minimizes the sum of the weighted distances.
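
The weighting idea can be sketched for the simplest case, a model with a constant only, where the minimizer of the weighted distances is a sample quantile. The code is Python with NumPy (an assumption of this example); the function names are hypothetical, and a full quantile regression with regressors would minimize the same loss over a whole coefficient vector, which is not shown here.

```python
import numpy as np

def check_loss(residuals, tau):
    """Sum of asymmetrically weighted absolute residuals.

    Positive residuals get weight tau, negative ones weight (1 - tau).
    """
    weights = np.where(residuals >= 0, tau, 1.0 - tau)
    return np.sum(weights * np.abs(residuals))

def best_constant(y, tau):
    """Constant that minimizes the check loss, searched over data points."""
    losses = [check_loss(y - c, tau) for c in y]
    return y[int(np.argmin(losses))]

# With tau = 0.5 the minimizer is the median, which ignores the outlier
y_small = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
median_fit = best_constant(y_small, 0.5)

# With tau = 0.9 the minimizer is the 90th percentile
y_grid = np.arange(101.0)
upper_fit = best_constant(y_grid, 0.9)
```

For comparison, the least squares fit of a constant to `y_small` is its mean, 22, which is pulled strongly toward the outlier, whereas the median fit stays at 3.
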
Two empirical applications are attached to illustrate why and when it may be appropriate to use this model:

Advanced Time Series Analysis

In addition to the forecasting models presented in the chapter Introduction to Time Series Regression and Forecasting of the previous course, other time series regressions used for forecasting involve lags of both the dependent variable and the error term; these are the so-called AR(I)MA models. Before these models are introduced, the concept of stationarity and the technique of differencing time series are discussed. The course also includes applications in R.
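
The sketch below illustrates differencing and the autoregressive part of such a model on a simulated series. It is written in Python with NumPy rather than R, which is an assumption of this example and not the tooling used in the course. It covers only an ARIMA(1,1,0) process estimated by least squares; moving average terms normally require maximum likelihood and are not shown. All parameter values are hypothetical.

```python
import numpy as np

# Simulate a series whose first difference follows an AR(1) process
rng = np.random.default_rng(8)
n, phi = 2000, 0.6
shocks = rng.normal(size=n)
growth = np.zeros(n)
for t in range(1, n):
    growth[t] = phi * growth[t - 1] + shocks[t]
level = np.cumsum(growth)   # non-stationary: integrated of order one

# Differencing the level recovers the stationary growth series
diffed = np.diff(level)

# Estimate the AR(1) coefficient by regressing d_t on d_{t-1}
phi_hat = (diffed[:-1] @ diffed[1:]) / (diffed[:-1] @ diffed[:-1])

# One-step-ahead forecast of the level: last level plus forecast change
forecast_next = level[-1] + phi_hat * diffed[-1]
```

The forecast is built in two steps: the stationary differenced series is forecast first, and the result is then added back to the last observed level.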

Key concepts: Stationarity and Differencing, Autoregressive Models, Moving Average Models, Non-seasonal AR(I)MA Models, Estimation and Order Selection, AR(I)MA Modelling in R, Forecasting, Seasonal AR(I)MA Model.

Multiple Hypothesis Testing

When a set of hypotheses is tested simultaneously, and the tests are independent of each other, the probability of rejecting at least one true null hypothesis can become excessively high, even if each individual test uses a conventional significance level. Methods for dealing with multiple testing typically adjust the significance level in some way, so that the probability of observing at least one significant result due to chance alone remains below the desired level.
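
The size of the problem and two common corrections can be sketched as follows, in Python with NumPy (an assumption of this example). The first function gives the probability of at least one false rejection among `m` independent tests of true null hypotheses; the other two apply the Bonferroni correction and the Benjamini-Hochberg procedure for controlling the false discovery rate. The function names and the example p-values are hypothetical.

```python
import numpy as np

def familywise_error_rate(alpha, m):
    """P(at least one false rejection) for m independent true nulls."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Reject when a p-value is at most alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for the false discovery rate."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank that passes
        last = np.max(np.nonzero(below)[0])
        reject[order[: last + 1]] = True
    return reject

p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = int(bonferroni_reject(p_values).sum())
n_bh = int(benjamini_hochberg_reject(p_values).sum())
```

With 20 independent tests at the 5% level, `familywise_error_rate(0.05, 20)` is about 0.64. For the example p-values, Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, which reflects the less conservative nature of false discovery rate control.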

Key concepts: the Problem of Multiple Testing, Bonferroni Correction, False Discovery Rate, Comparison of the Correction Methods.

For a further understanding of this topic, have a look at this interactive reading.