You can also join our Discord server!
- Psychologists and social scientists often have to match items in different questionnaires, such as "I often feel anxious" and "Feeling nervous, anxious or afraid".
- This is called harmonisation.
- Harmonisation is a time-consuming and subjective process.
- Going through long PDFs of questionnaires and putting the questions into Excel is no fun.
- Enter Harmony, a tool that uses natural language processing and generative AI models to help researchers harmonise questionnaire items, even in different languages.
💁You can run the walkthrough Python notebook in Google Colab with a single click:
🇷You can also download an R Markdown notebook to run in RStudio:
🇷You can run the walkthrough R notebook in Google Colab with a single click:
Harmony is a tool using AI which allows you to compare items from questionnaires and identify similar content. You can try Harmony at https://app.harmonydata.ac.uk and you can read our blog at https://harmonydata.ac.uk/blog/.
The source code for Harmony is on GitHub at https://github.com/harmonydata/harmony.
Here's a walkthrough video on how you can use Harmony online at harmonydata.ac.uk. Click to view:
- 🖱️ The Harmony app, which runs at https://harmonydata.ac.uk/app, is in this repo: https://github.com/harmonydata/app.
- 👨‍💻 The Harmony Python library source code is here: https://github.com/harmonydata/harmony.
- 🇷 The Harmony R library source code is here: https://github.com/harmonydata/harmony_r.
- 💻 The Harmony API source code is here: https://github.com/harmonydata/harmonyapi.
- 📰 The code for training the PDF extraction is here: https://github.com/harmonydata/pdf-questionnaire-extraction
- 📔 Finally, the source code of the Harmony static blog at https://harmonydata.ac.uk is in this repo: https://github.com/harmonydata/harmonydata.github.io. It is hosted with GitHub Pages.
Information about Harmony's server setup and deployment is in the private repo harmony_deployment_ulster_private.
The Harmony project uses natural language processing to help researchers make better use of existing data by supporting the harmonisation of measures and items used in different studies.
Harmony is a collaboration between Ulster University, University College London, the Universidade Federal de Santa Maria, and Fast Data Science. Harmony has been funded by Wellcome as part of the Wellcome Data Prize in Mental Health and by the Economic and Social Research Council (ESRC).
The core team at Harmony is made up of:
- Dr Bettina Moltrecht, PhD (UCL)
- Dr Eoin McElroy (Ulster University)
- Dr George Ploubidis (UCL)
- Dr Mauricio Scopel Hoffmann (Universidade Federal de Santa Maria, Brazil)
- Thomas Wood (Fast Data Science)
You can contact us at https://harmonydata.ac.uk/contact/.
McElroy, E., Moltrecht, B., Ploubidis, G.B., Scopel Hoffmann, M., Wood, T.A., Harmony [Computer software], Version 1.0, accessed at https://app.harmonydata.ac.uk. Ulster University (2022)
If you upload a questionnaire or instrument, Harmony does not store or save it. You can read more on our Privacy Policy page.
Harmony passes the text of each questionnaire item through a neural network called Sentence-BERT, in order to convert it into a vector. The similarity of two texts is then measured as the similarity between their vectors. Two identical texts have a similarity of 100% while two completely different texts have a similarity of 0%. You can read more in this technical blog post and you can even download and run Harmony’s source code.
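As a rough illustration of the vector comparison described above, the sketch below computes cosine similarity in plain Python and converts it to a percentage. It is not Harmony's implementation: in Harmony the vectors come from a Sentence-BERT model, whereas the three-dimensional vectors here are made-up stand-ins, and the function names `cosine_similarity` and `as_percentage` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def as_percentage(similarity):
    """Express a similarity in [-1, 1] as a whole-number percentage."""
    return round(similarity * 100)


# Toy stand-ins for sentence embeddings (hypothetical values, not model output).
item_a = [0.2, 0.7, 0.1]
item_b = [0.2, 0.7, 0.1]   # same direction as item_a
item_c = [0.7, -0.2, 0.0]  # orthogonal to item_a

same = as_percentage(cosine_similarity(item_a, item_b))
unrelated = as_percentage(cosine_similarity(item_a, item_c))
```

Vectors pointing in the same direction score 100%, and orthogonal vectors score 0%, which matches the two extremes described in the text.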
Harmony was able to reconstruct the matches of the questionnaire harmonisation tool developed by McElroy et al. in 2020 with the following AUC scores: childhood 81%, adulthood 77%. Harmony was able to match the questions of the English and Portuguese GAD-7 instruments with AUC 100%. You can read more in this blog post.
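For readers unfamiliar with AUC, the sketch below shows the standard pairwise definition: the probability that a randomly chosen true match receives a higher similarity score than a randomly chosen non-match. The scores and labels are invented toy values, not Harmony's validation data, and the function name `auc` is hypothetical. The source does not say which tooling was used for the published figures.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: similarity scores, one per candidate item pair.
    labels: 1 if the pair is a true match in the gold standard, else 0.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one match and one non-match")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # true match ranked above non-match
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Invented toy data for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3]
perfect = auc(toy_scores, [1, 1, 0, 0])  # every match outranks every non-match
mixed = auc(toy_scores, [1, 0, 1, 0])    # one match is ranked below a non-match
```

An AUC of 1.0 (100%) means every true match outranks every non-match, while 0.5 corresponds to random ranking.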
The numbers are the cosine similarities of document vectors. The cosine similarity of two vectors can range from -1 to 1, depending on the angle between the two vectors being compared. We have converted these to percentages. We have also used a preprocessing stage to convert positive sentences to negative and vice versa (e.g. I feel anxious -> I do not feel anxious). If the match between two sentences improves once this preprocessing has been applied, then the items are assigned a negative similarity.
At this time Harmony does not give p-values, but you can interpret the percentage matches like correlation coefficients. In future we hope to provide more statistical data to Harmony’s users.
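The negation step can be sketched as follows. This is a minimal illustration under stated assumptions, not Harmony's code: `signed_similarity`, `toy_negate` and `toy_similarity` are hypothetical names, the negation rule is deliberately crude, and word overlap stands in for the cosine similarity of sentence vectors. The text above does not say what magnitude the negative score takes, so this sketch assumes it is the improved match with its sign flipped.

```python
def toy_negate(text):
    """Crude, hypothetical negation: toggle 'do not' after a leading 'I'."""
    if text.startswith("I do not "):
        return "I " + text[len("I do not "):]
    if text.startswith("I "):
        return "I do not " + text[len("I "):]
    return text


def toy_similarity(a, b):
    """Word overlap (Jaccard), a stand-in for embedding cosine similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


def signed_similarity(text_a, text_b, similarity=toy_similarity, negate=toy_negate):
    """Return a negative score when negating one item improves the match."""
    plain = similarity(text_a, text_b)
    flipped = similarity(negate(text_a), text_b)
    if flipped > plain:
        # Assumption: report the improved match with a minus sign.
        return -flipped
    return plain


opposite = signed_similarity("I feel anxious", "I do not feel anxious")
identical = signed_similarity("I feel anxious", "I feel anxious")
```

With these toy functions, the pair of opposite items receives a negative score and the pair of identical items keeps its positive score.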
You can cite our validation paper:
McElroy, Wood, Bond, Mulvenna, Shevlin, Ploubidis, Scopel Hoffmann, Moltrecht, Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data. BMC Psychiatry 24, 530 (2024), https://doi.org/10.1186/s12888-024-05954-2
A BibTeX entry for LaTeX users is:
{{< rawhtml >}}
@article{mcelroy2024using,
  title={Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data},
  author={McElroy, Eoin and Wood, Thomas and Bond, Raymond and Mulvenna, Maurice and Shevlin, Mark and Ploubidis, George B and Hoffmann, Mauricio Scopel and Moltrecht, Bettina},
  journal={BMC psychiatry},
  volume={24},
  number={1},
  pages={530},
  year={2024},
  publisher={Springer}
}
{{< /rawhtml >}}