• Bibtex: @su2023
  • Bibliography: Su, R., John, J. R., & Lin, P.-I. (2023). Machine learning-based prediction for self-harm and suicide attempts in adolescents. Psychiatry Research, 328, 115446. https://doi.org/10.1016/j.psychres.2023.115446

Example citation

A random forest model achieved an AUC of 0.73 when predicting adolescent self-harm two years later

Random forest prediction of suicide attempts in adolescents reached an AUC of 0.72

My notes

  • They started off with 497 variables
  • Correlation cutoff > 0.85
  • Recursive feature elimination haven’t heard of this before.

Recursive feature elimination with 10-fold cross validation was then used to select the optimal subset of predictors to be used in training of the model (Misra and Yadav, 2020). This method has also been validated in the context of highly correlated predictors (Gregorutti et al., 2017). To explore the effects of reducing dimensionality further, the top 20 vari ables from model training were then used to develop another RF model.

  • “Variable importance was calculated based on permutation importance”. The mean decrease in accuracy when this variable was excluded. Way simpler to understand than other methods, possible to implement in R?

Self-harm results

Suicide attempt results

n = 2809 in final analysis, 145 (5.2%) had attempted suicide.

Their strategy to increase performance despite an unbalanced dataset:

Stratified random sampling was used to ensure that the proportion of positive respondents were the same in both training and testing datasets. RF tends to perform poorly in the case of extremelyimbalanced datasets, as correct classifications of the majority class will yield an overall higher prediction accuracy (Boulesteix et al., 2012).

To address the issue of data imbalance in the dataset, we utilized the class weight parameter of the RF algorithm. This allowed us to assign weightings to each class that were inversely proportional to their respective frequencies. By doing so, we were able to favourably weigh the minority class and reduce selection bias in the model.


This study aimed to use machine learning (ML) models to predict the risk of self-harm and suicide attempts in adolescents. We conducted secondary analysis of cross-sectional data from the Longitudinal Study of Australian Children dataset. Several key variables at the age of 14–15 years were used to predict self-harm or suicide attempt at 16-17 years. Random forest classification models were used to select the optimal subset of predictors and subsequently make predictions. Among 2809 participants, 296 (10.54%) reported an act of self-harm and 145 (5.16%) reported attempting suicide at least once in the past 12 months. The area under the receiver operating curve was fair for self-harm (0.7397) and suicide attempt (0.7220), which outperformed the prediction strategy solely based on prior suicide or self-harm attempt (AUC: 0.6). The most important factors identified were similar, and included depressed feelings, strengths and difficulties questionnaire scores, perceptions of self, and school- and parent-related factors. The random forest classification algorithm, an ML technique, can effectively select the optimal subset of predictors from hundreds of variables to forecast the risks of suicide and self-harm among adolescents. Further research is needed to validate the utility and scalability of ML techniques in mental health research. PDF: su_2023_machine_learning-based_prediction_for_self-harm_and_suicide_attempts_in.pdf