TRN: Investigating the feasibility of extracting academic profile information from web sources

The Research Network (TRN) is based in Kent and provides pharmaceutical R&D consultancy, project management, outsourcing, training and due diligence services. Their expertise in drug discovery and development disciplines covers both small molecules and biotherapeutics. They regularly collaborate with universities and research organisations to progress new drug discovery ventures from lead identification through to early clinical development.

The project between TRN and University of Kent academics was funded through EIRA’s Innovation Voucher scheme. It investigated whether data mining techniques could extract accurate profile information about potential collaborators. The extracted information would then be used to find similarities between academics who might collaborate in the future.

The Challenge

Finding the right kind of scientific expertise to contribute to cutting-edge drug discovery research presents a huge challenge. Information on relevant expertise is often scattered across disparate sources, such as company websites and university profile pages. This project explored the possibility of using data mining techniques to extract academic profile information, providing a simple and effective way of shortlisting potential collaborators.

The Approach

University of Kent academics Dr James Bentham and Dr Jennifer Hiscock are experts in the fields of machine learning, data extraction and bioscience.

Their approach to the project was:

  1. To identify potential sources of information on academics and their collaborations
  2. To carry out pilot web scraping (data extraction)
  3. To consider data storage and access requirements
  4. To carry out pilot pre-processing and data mining

Various websites were investigated to identify salient attributes that could be extracted, stored and used as the basis for data mining to identify potential scientific investigators.

One way of identifying individual academics and areas of research was to generate lists of university departments, which could then be used for further data gathering. Web scraping was carried out for five universities:

  • University of Kent
  • University of Warwick
  • University of Sheffield
  • Imperial College London
  • King’s College London
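As an illustration of this kind of staff-list extraction, the following is a minimal sketch in Python using only the standard library. It is not the code used in the study. The sample markup, the `class="staff"` attribute and the names `StaffListParser` and `extract_staff` are all assumptions made for the example; real university pages differ and would each need their own extraction rules.

```python
from html.parser import HTMLParser

# Hypothetical markup: real department pages differ and need their own rules.
SAMPLE_PAGE = """
<ul class="staff-list">
  <li><a class="staff" href="/people/1">Dr A. Example</a></li>
  <li><a class="staff" href="/people/2">Prof B. Sample</a></li>
  <li><a href="/contact">Contact us</a></li>
</ul>
"""


class StaffListParser(HTMLParser):
    """Collects (name, profile URL) pairs from links marked class="staff"."""

    def __init__(self):
        super().__init__()
        self.staff = []
        self._href = None  # set only while inside a staff link
        self._text = []    # text fragments of the current staff link

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("class") == "staff":
            self._href = attr.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.staff.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_staff(html):
    """Return the list of (name, profile URL) pairs found in the page."""
    parser = StaffListParser()
    parser.feed(html)
    return parser.staff


staff = extract_staff(SAMPLE_PAGE)
```

In practice the page would be downloaded first, and the profile links collected here would be followed to gather the free text from each personal webpage.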

The text from the personal webpages was pre-processed in R (a programming language used for statistical analysis and visualisation) to remove extraneous information. A vector space model was used to analyse the words in each description, and a graphical plot (word cloud) could be produced from the results. A vector space model represents each word or description as a point in a multi-dimensional space, so that the distance between two points indicates how similar they are. This could then be used to cluster related skills, in order to match academics who shared interests in similar topics.
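The idea of comparing descriptions in a vector space can be sketched as follows. The study used R; this example uses Python for illustration only and is not the project's code. It assumes a simple term-count representation compared by cosine similarity, which is one common choice among several. The stop-word list, the profile texts and the function names are all hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "my", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Lower-case the text, keep alphabetic tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def to_vector(text):
    """Represent a description as term counts, one dimension per distinct word."""
    return Counter(preprocess(text))


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical profile descriptions.
profiles = {
    "academic_a": "Machine learning for drug discovery and protein structure",
    "academic_b": "Protein structure prediction with machine learning",
    "academic_c": "Medieval history of Kent",
}
vectors = {name: to_vector(text) for name, text in profiles.items()}
sim_ab = cosine_similarity(vectors["academic_a"], vectors["academic_b"])
sim_ac = cosine_similarity(vectors["academic_a"], vectors["academic_c"])
```

Here the first two descriptions share several terms and so score higher than the unrelated pair, which shares none. Pairwise scores of this kind could feed a clustering step that groups academics with similar interests.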

The Result

The study showed that rich information was available on individual academics and their collaborations. However, this information was found to be fragmented. Web scraping proved feasible for university department and staff lists, free text from personal webpages, university repositories, and research council websites.

Data storage and access were straightforward. Natural language processing methods were applied to the data successfully, finding similarities between academics based on free text. Network analysis produced meaningful and useful results, which described potential collaboration networks. The recommended next phase of development was further work to refine and combine these methods.
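A collaboration network of the kind described can be sketched as below. This is an assumption-laden illustration in Python, not the analysis performed in the study. It supposes that the scraped data yields lists of people who share an output, and it suggests as potential collaborators those who share a collaborator but have no direct link. The records and the names `build_network` and `suggest_collaborators` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: each entry lists the people behind one shared output.
outputs = [
    ["A", "B"],
    ["B", "C"],
    ["C", "D"],
    ["A", "B", "E"],
]


def build_network(records):
    """Undirected graph: an edge joins two people who share at least one output."""
    graph = defaultdict(set)
    for authors in records:
        for x, y in combinations(sorted(set(authors)), 2):
            graph[x].add(y)
            graph[y].add(x)
    return graph


def suggest_collaborators(graph, person):
    """People two steps away: they share a collaborator with `person` but no direct link."""
    direct = graph[person]
    suggestions = set()
    for neighbour in direct:
        suggestions |= graph[neighbour]
    return sorted(suggestions - direct - {person})


network = build_network(outputs)
suggestions_for_a = suggest_collaborators(network, "A")
```

With these sample records, person A has worked with B and E, and the only suggestion returned is C, who is linked to A through B. Text-based similarity scores could be combined with this kind of network to rank the suggestions.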

Andy McElroy, the CEO of TRN, had this to say about the project:

“This short project provided useful insights into available information on academic expertise and projects and has positioned us well for the second project to pilot the analysis and use of this information to highlight collaboration opportunities.”

Next Steps

TRN have applied for an EIRA Innovation Voucher to build on the work of the feasibility study, collaborating with the academics who carried out this project. The aim of the new project is to take the concept of extracting researcher profile data through to a testable prototype system, which can be used to find the suitably skilled scientists needed to contribute to cutting-edge research projects.