By Amit Garg, vice president of analytics, Gramener.
The pandemic and the vaccination efforts that followed mean there has never been a time when more people's Personally Identifiable Information (PII) and Protected Health Information (PHI) was in circulation in health records and medical documentation. It's vital to protect patients' PII and PHI, which can include information on their age, race, or medical history, especially given that cyberthreats and fraudulent activity related to the Covid-19 vaccine are increasing.
With so much patient and clinical trial data being stored and shared at any given time, it’s becoming increasingly challenging for pharmaceutical companies to efficiently ensure that patient information is protected. Experts say that medical data is up to 50 times more valuable than credit card data.
Here's where artificial intelligence comes into play: AI and machine learning (ML) solutions can not only automatically identify which information in a given record counts as PII, but can also then automatically redact or anonymize that data so that no adversary can identify the patients.
Joining the dots with AI
AI algorithms use advanced methods such as entity detection, entity extraction, and entity-relationship management to handle patients' PII and PHI in a given document. This involves identifying and categorizing key information in the text using Named Entity Recognition (NER), a form of Natural Language Processing (NLP).
Useful libraries for teams using NER include Stanford NER, spaCy’s EntityRecognizer, CliNER (a domain-specific NER tool that has been trained on clinical texts), or BioBERT (a domain-specific language representation model pre-trained on large-scale biomedical corpora).
NER works to safeguard PII in the healthcare context by identifying different elements of a single patient's PII across multiple health records. One document may contain the name and age; another may contain age and race but not the name; another religion and race; and so on.
Any hacker with access to each document and each data point would be able to join the dots to match all PII to the single person. To prevent this, an intelligent entity detection and extraction solution can identify this information across the documents and redact and anonymize the correct data to prevent reidentification of the patient.
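The linking-and-redacting idea described above can be sketched in a few lines. This is a toy illustration, not a production de-identification pipeline: the record layout, field names, and the `[REDACTED]` placeholder are all assumptions for the example, and in practice the join key itself would also be pseudonymized.

```python
def link_records(records, join_key):
    """Group records that share a quasi-identifier (e.g. a patient ID),
    mimicking how an attacker could join partial PII across documents."""
    linked = {}
    for rec in records:
        linked.setdefault(rec[join_key], []).append(rec)
    return linked

def redact(records, pii_fields):
    """Replace the values of PII fields with a placeholder."""
    return [
        {k: ("[REDACTED]" if k in pii_fields else v) for k, v in rec.items()}
        for rec in records
    ]

# Two documents that each expose a different slice of one patient's PII.
docs = [
    {"patient_id": "P-001", "name": "Jane Doe", "age": 54},
    {"patient_id": "P-001", "age": 54, "race": "Asian"},
]
linked = link_records(docs, "patient_id")   # an adversary's view
clean = redact(docs, pii_fields={"name", "age", "race"})
```

Once the same fields are redacted consistently across every document, the "join the dots" attack has nothing left to join on.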
If the solution is not able to de-identify (or redact) the information completely, it will score the subjects based on the probability of re-identification. The publishing organization then knows the risk in advance and can take the appropriate action: accept the risk and publish, or hold back until a specific confidence threshold is reached.
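A minimal sketch of such a score, assuming a k-anonymity-style measure (one common approach, not necessarily the one any particular product uses): a subject's re-identification risk is 1/k, where k is the number of records sharing the same combination of quasi-identifiers.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Score each record as 1/k, where k is the size of its
    equivalence class over the chosen quasi-identifiers."""
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    class_size = Counter(keys)
    return [1.0 / class_size[k] for k in keys]

subjects = [
    {"age": 54, "race": "Asian"},
    {"age": 54, "race": "Asian"},
    {"age": 71, "race": "White"},
]
risks = reidentification_risk(subjects, ["age", "race"])
# The subject with a unique (age, race) combination scores 1.0,
# the highest possible risk; the two matching subjects score 0.5.
```

Records that score above the publisher's risk threshold would then be held back for further redaction or generalization.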
Building powerful AI models
Let’s break down step by step how data science teams can apply and train these AI models to successfully protect patients’ PII.
AI can help with document reading because ML models can consume significant amounts of data, especially unstructured data, far faster than humans can. Graphics Processing Units (GPUs) can process multiple documents in parallel, at a rate of around 10 gigabytes per second.
The next critical step is for the solution to skim through the document and search for the relevant entities, extracting and storing them securely. Companies must make sure they have selected the right NER model for their use case: convolutional neural networks (CNNs) are better suited to image recognition, whereas recurrent neural networks (RNNs), which use sequence modeling, are better suited to text tasks such as sentiment analysis and part-of-speech (POS) tagging, making them more appropriate for the healthcare and pharma context.
Every entity that's extracted needs to be tagged to an identifying attribute of the document, including things like patient name, age, residential address, patient ID, medical history, Serious Adverse Events, date of death, etc. Once the PII/PHI is extracted, the algorithm decides which information should be redacted or anonymized and which should be preserved.
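The tag-then-decide step can be sketched as a per-attribute policy table. The tag names, the policy assignments, and the hash-based pseudonym below are illustrative assumptions, not a standard taxonomy; in particular, Python's built-in `hash` is not a cryptographically safe way to anonymize and is used here only to keep the example self-contained.

```python
# Hypothetical per-attribute policy: which extracted entities get
# redacted, anonymized (replaced with a pseudonym), or preserved.
POLICY = {
    "PATIENT_NAME": "redact",
    "PATIENT_ID": "anonymize",
    "AGE": "anonymize",
    "MEDICAL_HISTORY": "preserve",
}

def apply_policy(entities):
    """entities: list of (tag, value) pairs produced by the NER step."""
    out = []
    for tag, value in entities:
        action = POLICY.get(tag, "redact")  # unknown tags default to the safest action
        if action == "redact":
            out.append((tag, "[REDACTED]"))
        elif action == "anonymize":
            # Illustrative pseudonym only; not a secure anonymization scheme.
            out.append((tag, f"ANON-{abs(hash(value)) % 10000:04d}"))
        else:
            out.append((tag, value))
    return out

entities = [("PATIENT_NAME", "Jane Doe"),
            ("AGE", "54"),
            ("MEDICAL_HISTORY", "asthma")]
safe = apply_policy(entities)
```

Preserving clinically relevant fields (like medical history) while stripping direct identifiers is what keeps the redacted documents useful for regulators and researchers.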
But the work doesn’t stop there. In order to ensure the algorithm continuously delivers accurate results, the data science team must enrich the AI model. This can be done by training the algorithm further on labeled pharmaceutical or medical documents to keep it familiarized with the necessary language and entity-specific information.
The data science team must also measure the accuracy of the AI model's results. They can do this by understanding the standard deviation and the margin of error within the sample of patients, which will allow them to eventually generalize the solution's efficacy to the wider population. This is called document risk scoring.
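As a sketch of the statistics involved, assuming per-subject risk scores in [0, 1] and the usual margin-of-error formula for a sample mean (the 1.96 factor is the z-value for a 95% confidence interval; the scores below are made up for illustration):

```python
import statistics

def margin_of_error(scores, z=1.96):
    """Margin of error for the mean of a sample of risk scores:
    z * (sample standard deviation) / sqrt(n)."""
    n = len(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation
    return z * stdev / (n ** 0.5)

scores = [0.10, 0.15, 0.05, 0.20, 0.10]
moe = margin_of_error(scores)  # roughly 0.05 for this sample
```

A small margin of error means the sample-level risk estimate can be trusted when extrapolating to the full patient population.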
How does this look in action?
Let’s say a pharmaceutical company is trying to introduce a new drug to the market. After conducting various stages of drug development and testing, the company is then ready to conduct clinical trials on humans.
The company selects people covering various demographics, races, ages, and existing medical conditions, amongst other factors. Everyone who goes through the phases of the clinical trial will have their PII, along with details like dosage and the drug's health effects, recorded in the trial documentation.
All of the reports—be they research reports, approval reports, or drug success reports—are created before the application is submitted to the government body for approval for public use. However, if the patient data and PII within the documentation are leaked, adversaries might misuse that information for personal gain, and those patients could face harassment or discrimination. Should this happen, the pharma company may also face legal penalties due to its negligence in handling patients' personal data.
This is when the company introduces a well-trained, accurate AI model. The AI solution is able to automatically identify, redact and anonymize the PII of trial participants—saving countless hours that were previously spent manually going through reports to do the same. This speeds up drug approval, giving the public faster access to novel pharmaceutical products that could end up saving lives.
AI applications in the pharmaceutical context are growing year on year. While broader AI adoption is often slow due to heavy regulation, AI's ability to help safeguard patients' PII actually makes it easier for pharma companies to comply with data regulations like HIPAA. Players in the pharmaceutical industry that adopt AI to protect patients' PII will not only drive efficiencies in their own processes, but more importantly, they'll put patients' minds at ease, secure in the knowledge that their data is in safe hands.