Clustering semantic spaces of suicide notes and newsgroups
Pawel Matykiewicz, John P. Pestian, Wlodzislaw Duch
Introduction
Suicide is the third leading cause of death in adolescents and a leading cause of death in the United States. Those who attempt suicide usually arrive at the Emergency Department seeking help. These individuals are at risk for a repeated attempt that may lead to a completed suicide. Emergency Medicine clinicians are often left to manage suicidal patients by clinical judgment alone. This research aims to better understand a large collection of suicide notes in order to support clinicians in that judgment. To do so, the notes are compared with a non-suicidal control group.
Differences in semantic and syntactic spaces were tested using LIWC software. Documents were converted to a matrix representation using Perl Natural Language Processing modules. Vector space normalization, multidimensional scaling (MDS), and performance measures were calculated using R software. Clustering algorithms came from the Weka machine learning package.
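The document-to-matrix and normalization steps can be illustrated with a minimal sketch. It is written in Python rather than the Perl and R tools used in this work, and it assumes raw term counts, L2 normalization, and Euclidean distance, none of which are specified above. The function names and the toy documents are hypothetical. The resulting distance matrix is the kind of input an MDS routine would take; MDS itself is not shown.

```python
import math
import re
from collections import Counter

def build_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

def l2_normalize(row):
    """Scale a vector to unit length (assumed normalization; zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)

def distance_matrix(rows):
    """Pairwise Euclidean distances between row vectors."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Toy documents, purely illustrative.
docs = ["apple banana apple", "banana cherry", "apple banana apple"]
vocabulary, counts = build_matrix(docs)
normalized = [l2_normalize(row) for row in counts]
distances = distance_matrix(normalized)
```

After normalization, documents with the same relative term usage sit at distance zero regardless of their length, which is why the first and third toy documents coincide.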
The table below shows the differences in semantic and syntactic spaces reported by the LIWC software. There are five syntactic features (number of articles, words longer than six letters, pronouns, prepositions, and verbs) and four semantic features (biological, affective, cognitive, and social processes).
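To make two of the syntactic features concrete, the sketch below computes the percentage of articles and of words longer than six letters in a text. It is not LIWC and does not reproduce LIWC's dictionaries: the article list, the tokenizer, the function name, and the example sentence are all assumptions for illustration only.

```python
import re

# English articles; an assumed stand-in for a dictionary-based category.
ARTICLES = {"a", "an", "the"}

def syntactic_profile(text):
    """Return the percentage of articles and of words longer than six letters."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    if total == 0:
        return {"articles": 0.0, "long_words": 0.0}
    return {
        "articles": 100.0 * sum(w in ARTICLES for w in words) / total,
        "long_words": 100.0 * sum(len(w) > 6 for w in words) / total,
    }

# Illustrative sentence: 8 words, 3 articles, 3 words longer than six letters.
profile = syntactic_profile("The committee discussed a proposal about the budget")
```

Expressing each feature as a percentage of total words keeps documents of different lengths comparable.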
Clustering was performed on four data sets, each combining one newsgroup with the suicide notes: talk.politics.guns + suicide notes = guns, talk.politics.mideast + suicide notes = mideast, talk.politics.misc + suicide notes = politics, talk.religion.misc + suicide notes = religion. Figures and tables in the center of the poster show the MDS and clustering results.
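The construction of the four data sets can be sketched as follows. This is an illustrative Python sketch, not the code used in this work: the function name, the in-memory input format, and the 1/0 labels are assumptions. The labels are kept only so that clusters can be compared with the true document source afterwards; a clustering algorithm would not see them.

```python
# Data set name -> newsgroup combined with the suicide notes.
NEWSGROUPS = {
    "guns": "talk.politics.guns",
    "mideast": "talk.politics.mideast",
    "politics": "talk.politics.misc",
    "religion": "talk.religion.misc",
}

def build_datasets(suicide_notes, newsgroup_posts):
    """Pair every newsgroup with the full set of notes (label 1 = note, 0 = post)."""
    datasets = {}
    for name, group in NEWSGROUPS.items():
        posts = newsgroup_posts[group]
        texts = list(suicide_notes) + list(posts)
        labels = [1] * len(suicide_notes) + [0] * len(posts)
        datasets[name] = (texts, labels)
    return datasets

# Placeholder inputs, purely illustrative.
notes = ["note A", "note B"]
posts = {group: [f"{group} post {i}" for i in range(3)] for group in NEWSGROUPS.values()}
datasets = build_datasets(notes, posts)
```

Because the same notes appear in all four data sets, subgroups of notes found in one data set can be checked against those found in the others.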
Data for the suicide note database were collected from around the United States. The notes were either handwritten or typed. Once a note was acquired, it was scanned and typed into the database exactly as seen.
As a non-suicidal control group, four of the twenty newsgroups from the University of California, Irvine (UCI) Machine Learning Repository were chosen. The selected newsgroups were talk.politics.guns, talk.politics.mideast, talk.politics.misc, and talk.religion.misc.
The LIWC analysis showed statistically significant differences between suicide notes and newsgroup posts in both semantic and syntactic spaces. Sequential information bottleneck clustering was able to find the same subgroups of suicide notes even when different types of newsgroups were present. In our analysis, one subgroup showed no emotional content while the other was emotionally charged. This finding is consistent with Tuckman et al. (1959), who showed that suicide notes fall into six emotional categories: emotionally neutral, emotionally positive, emotionally negative directed inward, emotionally negative directed outward, emotionally negative directed inward and outward.