Sunday, April 26, 2009

De-Identification of Personal Health Information

I recently started some work on de-identification of personal health information, and thought of putting together this primer on de-identification.

Medical researchers often need access to patients' medical records for their investigations. However, these records may contain information that compromise the identity of the individual and thus violate his right to privacy. It is thus required that personal health information (PHI) be removed from medical records, when they are released for the larger research community. The HIPAA regulation lays down the rules for the handling of PHI.

Under HIPAA, PHI must be removed from the medical records before releasing them to the research community. Thus any information that may reveal the identity of the patient like his name, address, doctor's name, social security numbers, telephone numbers, etc. must be removed. This process of removing PHI from medical records is termed as de-identification.

There are 18 PHI identifiers that must be de-identified to meet HIPAA regulations. These include names, addresses, etc. (Entire list here). Identifying these records poses an interesting text mining problem. Identifying names may seem to be a Named Entity Recognition task, but there are additional complexities involved - a device or a disease named after a person is not PHI, and it would be loss of valuable information to the researcher if it is lost. Addresses are a challenge to de-identify sufficiently to prevent re-identification. There is a wide range of identifiers that must be recognized: SSN, MRN, Admission No, Accension No, Telephone/Fax no, room numbers, etc. out of the many numbers that a report may contain. What makes the task challenging is that a very high recall must be obtained to ensure compliance, at the same time making sure that there aren't too many false postives which de-identifies valuable, non-PHI information.

A number of rule-based as well as statistical systems have been developed to tackle the problem. You can find a good survey of the research work in this paper. Here are a few de-identification systems that are available:
For research purposes, a gold standard data set containing surrogate PHI data is available on the PhysioNet page.

Saturday, April 25, 2009

Yet Another Blog On Organizing Information

Data and information everywhere. The digital age is generating so much information, that it has fast outgrown our ability to comprehend it. 'Information Overload', we call it. These are the questions that are posed to us:
  • How do I find information that I want?
  • What information is relevant to my need?
  • Ok, this is way too much information than I can handle. I would like to have summary of the same.
  • In this huge infobase, is there some useful information that isn't obvious? Some patterns, trends that may be useful.
  • There are a lot of smart people generating content. How can the collective intelligence of these people augment my search for information?
These questions have had us hooked for a long time, and so have the solutions people have developed to tackle these questions. Search engines to help you find information, business intelligence tools to make find patterns in huge volumes of data, information extraction systems to summarize information in human generated content, recommendation systems to bring information relevant to your need and study of social networks to harness the "collective intelligence" of the crowd.

The rabbit hole goes deeper. These solutions are built on the more fundamental sciences of statistics, pattern recognition, artificial intelligence and natural language understanding.

This is not the end, for the more fundamental questions we are posed with are about the nature of cognition, the understanding of language, the organization of the knowledge and the active role of the human observer in the perception of information. I think this is the holy grail that we are all in pursuit of.

We are beginners in this exciting field,. This is a place to share what we learn, what we do and to benefit from the "collective intelligence" of all who visit this page.

While the challenges span many problems, there are some that we are currently working on. Dhaval currently works on optimizing ad-networks and takes an active interest in search engines. I currently work on information extraction from text and medical informatics. So for now you may find a certain bias towards these topics, and related topics on this blog.