On Organizing Information: De-Identification of Personal Health Information

I recently started working on de-identification of personal health information, and thought I would put together this primer on the topic.

Medical researchers often need access to patients' medical records for their investigations. However, these records may contain information that compromises the identity of the individual and thus violates their right to privacy. Personal health information (PHI) must therefore be removed from medical records when they are released to the larger research community. The HIPAA regulation lays down the rules for the handling of PHI.

Under HIPAA, PHI must be removed from medical records before they are released to the research community. Any information that may reveal the identity of the patient, such as name, address, doctor's name, social security number, or telephone number, must be removed. This process of removing PHI from medical records is termed de-identification.

There are 18 PHI identifiers that must be de-identified to meet HIPAA regulations. These include names, addresses, etc. (Entire list here). Recognizing these identifiers poses an interesting text mining problem. Identifying names may seem to be a Named Entity Recognition task, but there are additional complexities involved: a device or a disease named after a person is not PHI, and removing it would cost the researcher valuable information. Addresses are a challenge to de-identify sufficiently to prevent re-identification. There is also a wide range of numeric identifiers that must be recognized (SSN, MRN, admission number, accession number, telephone/fax number, room number, etc.) among the many numbers that a report may contain. What makes the task challenging is that very high recall must be obtained to ensure compliance, while keeping false positives low so that valuable non-PHI information is not removed along with the PHI.
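To make the rule-based side of this concrete, here is a minimal sketch in Python. The patterns, the placeholder labels, and the `deidentify` function are all my own illustrative assumptions, not taken from any of the systems mentioned in this post, and they cover only a tiny fraction of what a real report would need. The point is just to show how a name rule keyed on a title (Mr., Dr.) can leave an eponym like "Parkinson's disease" untouched, while number patterns pick specific identifiers out of the text.

```python
import re

# Hypothetical, deliberately small rule set for illustration only.
# Each entry is (placeholder label, compiled pattern).
PATTERNS = [
    # US social security number written as 123-45-6789
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # Phone or fax number written as 617-555-0123 (also accepts . or space)
    ("PHONE", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    # Medical record number introduced by the literal prefix "MRN"
    ("MRN", re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE)),
    # A capitalized word following a title is treated as a person name.
    # Eponyms such as "Parkinson's disease" carry no title, so they survive.
    ("NAME", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b")),
]


def deidentify(text):
    """Replace every match of every pattern with a bracketed placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub("[" + label + "]", text)
    return text


sample = (
    "Dr. Adams saw Mr. Smith, who has Parkinson's disease. "
    "SSN 123-45-6789, phone 617-555-0123, MRN: 00123456."
)
print(deidentify(sample))
# [NAME] saw [NAME], who has Parkinson's disease. SSN [SSN], phone [PHONE], [MRN].
```

Rules like these are easy to read and audit, but they miss anything that does not fit the expected format (a name without a title, a phone number with an extension), which is one reason statistical approaches are also used.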

A number of rule-based as well as statistical systems have been developed to tackle the problem. You can find a good survey of the research work in this paper. Here are a few de-identification systems that are available:

PhysioNet DeId (Open Source)
Harvard Medical School Scrubber (Open Source)
Data Corp DeId (Commercial)

For research purposes, a gold standard data set containing surrogate PHI data is available on the PhysioNet page.
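A gold standard is what lets you measure the recall and false-positive trade-off described earlier. As a rough sketch (the span representation and the `precision_recall` function are my own assumptions, and the numbers are made up for illustration), PHI annotations can be treated as sets of (start, end) character offsets and compared against what a system flagged:

```python
def precision_recall(gold, predicted):
    """Compare two collections of (start, end) spans by exact match.

    Recall is the fraction of gold PHI spans the system found; precision
    is the fraction of flagged spans that really were PHI.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Made-up offsets: the system finds two of three gold spans
# and flags one span that is not PHI.
gold_spans = {(0, 9), (14, 23), (60, 71)}
system_spans = {(0, 9), (14, 23), (33, 42)}

precision, recall = precision_recall(gold_spans, system_spans)
print("precision=%.2f recall=%.2f" % (precision, recall))
# precision=0.67 recall=0.67
```

Exact span matching is the strictest choice; token-level or overlap-based matching would give partial credit when a system catches only part of an identifier.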

Sunday, April 26, 2009