Online content mining to improve psychological profiling of individuals

This thesis explores computational methods for the early detection and analysis of mental health conditions from social media text, addressing the challenges that distinguish this domain from conventional text classification. In mental health, accurate prediction alone is insufficient: useful systems must also act early, before signs of distress escalate into a clinical crisis, while remaining interpretable enough to complement clinical judgment. At the same time, the linguistic signals associated with mental health conditions are often subtle, noisy, highly contextual, and strongly dependent on how text is represented computationally. To address these challenges, unsupervised and representation-based approaches such as clustering, topic modeling, coherence evaluation, and temporal word embeddings are applied to Reddit data on depression, anorexia, self-harm, and pathological gambling. These methods extract latent structure directly from the data, revealing recurring thematic and temporal patterns, and provide complementary insights that predefined labels are not designed to capture. Evaluation, however, cannot be detached from representation and task. Clustering metrics depend on the geometry of the embedding space; coherence metrics depend on the reference corpus; temporal signals depend on the vocabulary and contextual properties of each condition. Metrics, in this sense, are not neutral instruments but operational assumptions about the nature of the signal and the structure that captures it. Progress in this field depends as much on understanding those assumptions as on building stronger models. Reproducibility is also a structural property of the research process, not an after-the-fact convenience. In complex experimental pipelines, each dataset, model, and result arises from a chain of configuration choices, including preprocessing steps, hyperparameters, and training data. Without explicit preservation of this provenance, findings cannot be reliably compared, audited, or extended. Overall, the work highlights how representation, evaluation, and reproducibility jointly shape what can be extracted from social media data and how it can be interpreted, supporting more robust and methodologically grounded approaches to early mental health detection.

Keywords: Text Mining, Mental Health, Early Risk Prediction, Evaluation, Reproducible Research