By Amit Garg, vice president of analytics, Gramener.
The pandemic and the more recent vaccination efforts mean that more people’s Personally Identifiable Information (PII) and Personal Health Information (PHI) is circulating in health records and medical documentation than ever before. It’s vital to protect patients’ PII and PHI, which can include their age, race, or medical history, especially given that cyberthreats and fraudulent activity related to the Covid-19 vaccine are increasing.
With so much patient and clinical trial data being stored and shared at any given time, it’s becoming increasingly challenging for pharmaceutical companies to efficiently ensure that patient information is protected. Experts say that medical data is up to 50 times more valuable than credit card data.
Here’s where artificial intelligence comes into play: AI and machine learning (ML) solutions can not only automatically identify which information in a given record is classed as PII, but can also automatically redact or anonymize that data to keep adversaries from identifying the patients.
Joining the dots with AI
AI algorithms use methods such as entity detection, entity extraction, and entity-relationship management to find and manage patients’ PII and PHI in a given document. This involves identifying and categorizing key information in the text using Named Entity Recognition (NER), a form of Natural Language Processing (NLP).
Useful libraries and models for teams using NER include Stanford NER, spaCy’s EntityRecognizer, CliNER (a domain-specific NER tool that has been trained on clinical texts), and BioBERT (a domain-specific language representation model pre-trained on large-scale biomedical corpora).
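To make the detect-then-redact step concrete, here is a minimal sketch in Python. It uses a few regular expressions as a stand-in for a trained NER model, so the patterns, the labels (NAME, AGE, DATE), and the function names are all illustrative assumptions rather than the API of any of the tools named above. A real pipeline would swap `detect_entities` for a model's predictions and keep the redaction logic.

```python
import re

# Hypothetical patterns standing in for the labels a trained NER model would emit.
PATTERNS = {
    "AGE": re.compile(r"\b\d{1,3}-year-old\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "NAME": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)?"),
}


def detect_entities(text):
    """Return (start, end, label) spans found in text, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def redact(text):
    """Replace every detected span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in detect_entities(text):
        if start < cursor:
            # Skip a span that overlaps one already redacted.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Mrs. Jane Doe, a 54-year-old patient, was admitted on 2021-03-14."
print(redact(note))
# [NAME], a [AGE] patient, was admitted on [DATE].
```

Replacing each span with its label, rather than deleting it, keeps the redacted text readable while removing the identifying value.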
NER helps safeguard PII in healthcare by identifying the different elements of a single patient’s PII across multiple health records. One document may contain the name and age; another, the age and race but not the name; a third, religion and race; and so on.
A hacker with access to every document and every data point could join the dots and match all of the PII to a single person. To prevent this, an intelligent entity detection and extraction solution can identify this information across the documents and redact or anonymize the relevant data to prevent reidentification of the patient.
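The "join the dots" risk can be sketched in a few lines of Python. The example below is a simplified illustration, not a description of any particular product: the records, the field values, and the `link_records` and `drop_fields` helpers are hypothetical, and the greedy merge depends on record order, which a real linkage method would need to handle more carefully.

```python
def link_records(records):
    """Greedily merge records that share a field and agree on every shared field."""
    profiles = []
    for record in records:
        for profile in profiles:
            shared = profile.keys() & record.keys()
            if shared and all(profile[k] == record[k] for k in shared):
                profile.update(record)
                break
        else:
            # No existing profile matched, so start a new one.
            profiles.append(dict(record))
    return profiles


def drop_fields(records, fields):
    """Remove the given fields from every record (a crude form of redaction)."""
    return [{k: v for k, v in r.items() if k not in fields} for r in records]


# Three documents, each holding only part of one patient's information.
docs = [
    {"name": "Jane Doe", "age": 54},
    {"age": 54, "race": "group-A"},
    {"race": "group-A", "religion": "faith-X"},
]

linked = link_records(docs)
print(linked)
# [{'name': 'Jane Doe', 'age': 54, 'race': 'group-A', 'religion': 'faith-X'}]

# Redacting the fields that connect the documents breaks the chain.
safe = link_records(drop_fields(docs, {"age", "race"}))
print(any("name" in p and "religion" in p for p in safe))
# False
```

The point of the sketch is that no single document reveals the full profile, yet the overlapping fields (age and race here) act as links. Redaction therefore has to be applied consistently across documents, not one document at a time.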
If the solution cannot de-identify (or redact) the information completely, it scores each subject by the probability of re-identification. The publisher would then know the risk in advance and could take the appropriate action, i.e., either accept the risk and publish, or hold publication until a specified confidence level is reached.
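One simple way to put a number on that risk, shown here only as an illustrative sketch, is to count how many subjects share the same combination of remaining quasi-identifiers and take one divided by that count as each subject's risk. This is a common textbook measure and is not necessarily the scoring any given solution uses; the field names, the sample rows, and the 0.5 threshold are assumptions made for the example.

```python
from collections import Counter


def reidentification_risk(rows, quasi_identifiers):
    """Return one risk score per row: 1 / (number of rows sharing its quasi-identifier values)."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    group_sizes = Counter(keys)
    return [1 / group_sizes[key] for key in keys]


# Hypothetical de-identified rows with generalized fields.
rows = [
    {"age_band": "50-59", "region": "North"},
    {"age_band": "50-59", "region": "North"},
    {"age_band": "30-39", "region": "South"},
    {"age_band": "50-59", "region": "North"},
]

risks = reidentification_risk(rows, ["age_band", "region"])

# Hypothetical policy: publish only if no subject's risk exceeds the threshold.
THRESHOLD = 0.5
publishable = max(risks) <= THRESHOLD
print(publishable)
# False
```

In this sample the third subject is the only one in the 30-39/South group, so that subject's risk is 1.0 and the dataset fails the check. Further generalizing or suppressing that record would lower the maximum risk before publication.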