The use of machine learning in the social sciences
Conducting causal analysis from observational longitudinal data
Research summary
Dr Annalivia Polselli, in collaboration with Dr Spyros Samothrakis (IADS) and Prof Paul Clarke (ISER), is developing a methodology for estimating heterogeneous average treatment effects from observational panel/longitudinal data using high-performance machine learning.
The project is part of a wider MiSoC programme committed to bringing data science into the social sciences.
Research challenge
A main objective of applied social scientists is to consistently estimate the causal impact of a treatment (e.g. a policy or reform) on an outcome of interest. This consists of comparing the average outcomes of treated units with those of control units. However, no unit can be observed in both the treated and the control group at the same time. This is known as the “fundamental problem of causal inference”.
The effect of an intervention is unlikely to be homogeneous across the targeted population (e.g. individuals, regions, countries); it may vary across the units exposed to the treatment. There is growing interest in estimating how effects vary with a large set of unit characteristics. This can be done by estimating Conditional Average Treatment Effects (CATE).
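The quantities described above can be written compactly in potential-outcomes notation. The symbols are introduced here for illustration and do not come from the project itself: $Y_i(1)$ and $Y_i(0)$ are the outcomes unit $i$ would have with and without treatment, $D_i$ is the treatment indicator, and $X_i$ collects the unit's characteristics.

```latex
\begin{align*}
Y_i &= D_i\, Y_i(1) + (1 - D_i)\, Y_i(0)
  && \text{(only one potential outcome is ever observed)} \\
\mathrm{ATE} &= \mathbb{E}\left[\, Y_i(1) - Y_i(0) \,\right] \\
\mathrm{CATE}(x) &= \mathbb{E}\left[\, Y_i(1) - Y_i(0) \mid X_i = x \,\right] \\
\mathrm{ATE} &= \mathbb{E}\left[\, \mathrm{CATE}(X_i) \,\right]
\end{align*}
```

The first line is the fundamental problem of causal inference in symbols: the unit-level difference $Y_i(1) - Y_i(0)$ is never observed, so only averages such as the ATE or CATE can be targeted.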
More accurate estimates can be obtained with machine learning techniques, and valid confidence intervals for statistical inference remain available once two types of bias are dealt with. The first is due to overfitting and can be addressed with sample splitting; the second is due to regularization and can be handled by orthogonalizing the estimator.
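The two corrections can be illustrated with a minimal sketch. It is not the project's estimator: it assumes a cross-sectional, partially linear model with a single homogeneous effect, uses simulated data, and uses ridge regression as a stand-in for any regularized learner. All function and variable names (`ridge_fit_predict`, `cross_fitted_plr`, `true_theta`) are hypothetical. Sample splitting appears as K-fold cross-fitting, and orthogonalization as a residual-on-residual regression.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Ridge regression with an unpenalised intercept (stand-in for any ML learner)."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]),
                           Xc.T @ (y_train - y_mean))
    return y_mean + (X_test - x_mean) @ coef

def cross_fitted_plr(y, d, X, n_folds=5, alpha=1.0, seed=0):
    """Cross-fitted, orthogonalized estimate of theta in y = theta*d + g(X) + e."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res, d_res = np.empty(n), np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Sample splitting: nuisance functions are fitted on the other folds only,
        # so overfitting in the learner does not leak into the held-out residuals.
        y_res[test_idx] = y[test_idx] - ridge_fit_predict(X[train], y[train], X[test_idx], alpha)
        d_res[test_idx] = d[test_idx] - ridge_fit_predict(X[train], d[train], X[test_idx], alpha)
    # Orthogonalization: regress outcome residuals on treatment residuals, which
    # makes theta insensitive to small (regularization) errors in the nuisances.
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return theta, se

# Simulated data (assumption: linear confounding, true effect 0.5).
rng = np.random.default_rng(42)
n, p, true_theta = 2000, 10, 0.5
X = rng.normal(size=(n, p))
d = X @ rng.normal(size=p) + rng.normal(size=n)
y = true_theta * d + X @ rng.normal(size=p) + rng.normal(size=n)

theta_hat, se_hat = cross_fitted_plr(y, d, X)
print(f"theta_hat = {theta_hat:.3f}, 95% CI half-width = {1.96 * se_hat:.3f}")
```

Estimating heterogeneous effects from panel data, as the project intends, would require more than this sketch, for example handling unit-level unobserved heterogeneity and letting the effect depend on covariates.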
Their method builds on this theoretical framework.
Research focus
Because social scientists want to draw causal inferences about the phenomena they study, Dr Polselli, with Dr Samothrakis and Prof Clarke, is working on an estimator for heterogeneous average treatment effects using machine learning tools.
They are developing a methodology for estimating the dependence of average causal effects on subjects’ time-varying characteristics from observational panel/longitudinal data. They will contribute to the recent causal inference literature on causal random trees, causal random forests, and doubly debiased estimators.
In her doctoral research, Dr Polselli examined how to obtain valid statistical inference in panel data models. She focused on panel data sets with a small cross-sectional sample, heteroskedasticity, and good leveraged data. This structure, characterised by a relatively small number of cross-sectional units, is common in macroeconomic country-level studies and experimental work.
The presence of heteroskedasticity and good leveraged data undermines conventional cluster-robust standard errors, leading to the over-rejection of the null hypothesis. Jackknife-type standard errors mitigate the downward bias of conventional inference.
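As a rough illustration of the comparison described above, the sketch below computes a conventional cluster-robust standard error and a leave-one-cluster-out jackknife standard error for pooled OLS on a small simulated panel. It is a simplification under stated assumptions, not the estimator studied in the doctoral work: the data are simulated, the model has no fixed effects, the jackknife is centred on the full-sample estimate, and all names (`cluster_robust_se`, `cluster_jackknife_se`) are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def cluster_robust_se(X, y, groups):
    """Conventional cluster-robust (sandwich) standard errors, no small-sample correction."""
    beta = ols(X, y)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        score = X[groups == g].T @ resid[groups == g]
        meat += np.outer(score, score)
    return np.sqrt(np.diag(bread @ meat @ bread))

def cluster_jackknife_se(X, y, groups):
    """Leave-one-cluster-out jackknife standard errors, centred on the full-sample estimate."""
    beta = ols(X, y)
    labels = np.unique(groups)
    n_groups = len(labels)
    # Re-estimate the model once per cluster, each time dropping that cluster.
    deltas = np.array([ols(X[groups != g], y[groups != g]) - beta for g in labels])
    cov = (n_groups - 1) / n_groups * deltas.T @ deltas
    return np.sqrt(np.diag(cov))

# Simulated panel: few clusters, heteroskedastic errors, one high-leverage observation.
rng = np.random.default_rng(0)
n_clusters, n_periods = 10, 5
groups = np.repeat(np.arange(n_clusters), n_periods)
x = rng.normal(size=n_clusters * n_periods)
x[0] = 8.0                                      # high-leverage point in the first cluster
u = rng.normal(size=x.size) * (1 + np.abs(x))   # error variance grows with |x|
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones_like(x), x])

se_cr = cluster_robust_se(X, y, groups)
se_jk = cluster_jackknife_se(X, y, groups)
print("cluster-robust SE:", se_cr)
print("jackknife SE:     ", se_jk)
```

Comparing the two printed vectors across repeated simulations is one way to see how the conventional formula behaves when a single cluster carries much of the leverage.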
As an applied econometrician, she has an active interest in documenting gender gaps in the labour market and investigating the reasons behind them, contributing to projects related to labour and gender economics.
Notable outcomes
A recent paper, "An International Map of Gender Gaps", revisits stylized facts on female labour force participation, employment and unemployment in high-income and middle-low income countries. The working paper gained the media’s attention and was cited in the Italian economic newspaper Il Sole 24 Ore.
Dr Polselli and her co-authors are currently at a very early stage of the project on the use of machine learning for panel data models. They plan to publish the proposed methodology in a peer-reviewed journal and to make the computational package available in several programming languages (namely Python, R and Stata), building a bridge between the social and computer sciences.
In addition, she is writing up the papers related to the topics covered in her PhD, with the ultimate goal of publishing them in peer-reviewed journals.