Machine Learning for disambiguation of scientific article authors

This project is an open-source implementation of a classifier whose goal is to predict whether two scientific articles (biomedical articles from the PubMed dataset) were written by the same author.
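To make the pairwise framing concrete, here is a minimal sketch of how two article records might be turned into a single feature vector for such a classifier. The `Article` fields, the `jaccard` helper, and the four features are hypothetical illustrations chosen for this example; they are not the 15 features used in the project.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Article:
    # Hypothetical record layout; the project's real schema may differ.
    title: str
    journal: str
    year: int
    coauthors: frozenset = field(default_factory=frozenset)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(x: Article, y: Article) -> list:
    """Build an illustrative feature vector for a pair of articles."""
    title_x = set(x.title.lower().split())
    title_y = set(y.title.lower().split())
    return [
        jaccard(title_x, title_y),                    # overlap of title words
        jaccard(set(x.coauthors), set(y.coauthors)),  # overlap of coauthors
        1.0 if x.journal == y.journal else 0.0,       # same journal?
        float(abs(x.year - y.year)),                  # publication year gap
    ]


# Two made-up articles sharing one coauthor and one journal.
a = Article("Gene expression in yeast", "Cell", 2010, frozenset({"Rossi M", "Kim J"}))
b = Article("Yeast gene regulation", "Cell", 2012, frozenset({"Kim J"}))
features = pair_features(a, b)
print(features)
```

A vector like this, computed for every labeled pair, is the kind of input a binary classifier would receive, with the label stating whether both articles share an author.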

The final classifier (a Random Forest) used 15 features and reached an accuracy of 87% under 10-fold cross-validation. Further analysis of the datasets revealed that some combinations of last name and first-name initial (namespaces) cover more than 100,000 articles. This finding explains the need for a classifier able to distinguish between the authors who share a namespace.
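The namespace idea can be illustrated with a short sketch that groups author entries by last name plus first-name initial and counts how many articles fall into each group. The `namespace` function, its lowercase key format, and the sample names are assumptions made for this example, not the project's actual code or data.

```python
from collections import Counter


def namespace(last_name: str, first_name: str) -> str:
    """Return a namespace key: last name plus first-name initial."""
    return f"{last_name.strip().lower()} {first_name.strip()[:1].lower()}"


# Hypothetical (last name, first name) author entries, one per article.
authors = [
    ("Wang", "Jing"),
    ("Wang", "Jun"),
    ("Wang", "J."),
    ("Rossi", "Marco"),
    ("Rossi", "Maria"),
    ("Keller", "Anna"),
]

# Count articles per namespace; distinct people can collapse into one key.
counts = Counter(namespace(last, first) for last, first in authors)
for key, n in counts.most_common():
    print(key, n)
```

In this toy data, three entries that may belong to different people collapse into the same key, which is the ambiguity the classifier is meant to resolve.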

The project was my bachelor's thesis work, commissioned by Hoffmann-La Roche A.G.

Scientific disambiguation

You can visit the project's repository at the following link.
You can also visit the study on the PubMed dataset at the following link.
Documentation of the bachelor's thesis (Italian only) can be downloaded at this link.