A major problem in health care is the amount of time spent on paperwork. Many facilities still rely on paper records and migrating them to electronic health records is not that simple.
Digitized paper records are full of complexities like bad handwriting, handwritten notes and markings, a variety of page layouts from forms to tables, image quality and noise issues, irrelevant text like footers, and more. Overcoming such problems requires complex processing pipelines that combine the latest techniques in large language models, natural language processing, and computer vision.
In this article, we explore one such pipeline, GPT-4 medical record summarization, which is capable of reliably digitizing paper health records and offers substantial potential cost and time savings.
We'll briefly explain some basics of health records and their contents here with a focus on hospital inpatient health records.
A medical record is a record of a patient's single encounter with medical care in a hospital.
In contrast, a health record is a comprehensive collection of all aspects of a patient's health over an extended period of time and across multiple care providers. Health care is broader than medical care, covering mental health, nutrition plans, health insurance, and other such aspects.
A health record typically contains many medical records, lab reports, and medical images.
In this article, we use the terms "health record," "medical record," and "medical report" interchangeably.
A typical health record consists of different kinds of information:
Other important aspects are the media and formats in which health records are stored:
In this section, we'll briefly go over how health records are used and by whom.
The primary use of patient records is to support clinical decisions and treatment plans by health care professionals. The records enable health care providers to deliver high-quality patient care because they contain the patient's entire, objective history, so providers don't have to rely on patients to recall that information.
Medical report summarization helps providers optimize their time by focusing on the most important details of a patient's record.
Since GPT-4 is capable of generating structured data, it also enables understaffed and under-resourced medical practices in rural areas to speedily convert all their paper records into queryable EHRs.
Another way the health care industry uses patient records is to inform overall improvements to the health care system through patient surveys, experimental treatments, new medicines, and so on.
Health records are also useful for administrative purposes like automated invoice generation for patients. Patients can also avoid billing fraud by running their patient records through third-party evaluations.
In some countries, health insurance is a major and essential component of the health care system. Patient records help both patients and care providers with tasks like insurance policy evaluations and prior authorizations.
Finally, health records and their data processing come up in the context of protecting patient privacy and confidentiality to comply with regulations like the Health Insurance Portability and Accountability Act (HIPAA). The data management concerns here focus on anonymizing the data in health records and analyzing it at aggregate levels rather than individual levels.
In the sections below, we explain a typical medical report processing pipeline using medical summarization as an example.
The illustration below shows our medical report summarization pipeline based on GPT-4 and other deep-learning models.
The pipeline consists of these components:
Each of these components uses different substeps and even different deep learning models.
The digitized health record is processed page by page in the initial stages. Related pages can be recombined later by the page classification and combinator modules. But first, the record is split into pages or chunks suited to the document's file format. PDF documents are easy to split into pages. For image formats like TIFF, we use object localization models to identify and locate the page boundaries.
Text extraction is done using custom optical character recognition (OCR) approaches focused on learning both text extraction and positional OCR information. Applying suitable preprocessing to the digitized pages of a health record improves the accuracy of text extraction in the next stage.
In this section, we cover some of the common preprocessing techniques used to improve text extraction.
The following image preprocessing operations are applied to the digitized health record images to improve character recognition for both printed and handwritten text:
Image preprocessing techniques to improve handwriting recognition are based on the paper, "Continuous Offline Handwriting Recognition Using Deep Learning Models". They include:
Text extraction identifies all the printed or handwritten characters on a digitized image, combines them into syntactic elements like words and punctuation marks, and returns them as text fragments like words, phrases, and sentences.
Character identification using just the image pixels can often be inaccurate if the image has text with poor handwriting (a widespread problem in the medical field), image defects, blurring, missing pixels, glare, and similar imaging artifacts.
State-of-the-art text extraction uses multi-modal language-vision models. They don't identify characters based on just the image pixels. Instead, they use a lot of additional information like:
All these additional criteria drastically improve the accuracy of the text extraction, including from the tough handwritten sections of a digitized health record.
Below, we explore some state-of-the-art text extraction approaches.
LayoutLMv3 is an OCR-based image-text model that's been trained for document layout tasks. Given a document image, it identifies layout elements like sections, field-value pairs, or tables and produces their bounding boxes and text as results.
The architecture is a pure transformer model with no convolutional elements. During training and fine-tuning, it must be supplied with the word embeddings, image patches, and position embeddings (from an off-the-shelf OCR package) of the training documents. Its multi-modal transformer model learns to associate patterns in the document text with their appropriate layout elements.
For fine-tuning, we supply a small dataset of annotated digitized health records. The pre-trained model adjusts its layer weights to activate for the layouts, layout elements, and text found in health records.
The document understanding transformer (Donut) is an alternative text extraction approach that is OCR-free. That means it does not use or generate information at the character level. Instead, it learns to directly generate text sequences from visual features without producing intermediate information like character labels and text bounding boxes.
Donut has a typical encoder-decoder transformer architecture:
For example, for the downstream task of document layout understanding, the decoder produces a structured output sequence like “<layout><section><heading>medications</heading><line><fragment>Aspirin</fragment> <fragment>10mg</fragment></line></section></layout>”.
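To make the structured output concrete, here is a minimal sketch that turns the example sequence above into nested Python data. It assumes the decoder output is well-formed, XML-like markup exactly as in the example; real model output may use different tag tokens or be truncated, so the function name `parse_layout_sequence` and the tag names are illustrative assumptions rather than Donut's own API.

```python
import xml.etree.ElementTree as ET


def parse_layout_sequence(sequence):
    """Convert an XML-like layout sequence into a list of section dicts.

    Assumption: the sequence is well-formed and uses the tag names from the
    article's example (layout, section, heading, line, fragment).
    """
    root = ET.fromstring(sequence)
    sections = []
    for section in root.iter("section"):
        heading = (section.findtext("heading") or "").strip()
        lines = []
        for line in section.findall("line"):
            # Keep the fragments of one line together, in reading order.
            fragments = [f.text.strip() for f in line.findall("fragment") if f.text]
            lines.append(fragments)
        sections.append({"heading": heading, "lines": lines})
    return sections


example = (
    "<layout><section><heading>medications</heading>"
    "<line><fragment>Aspirin</fragment> <fragment>10mg</fragment></line>"
    "</section></layout>"
)
sections = parse_layout_sequence(example)
print(sections)
# [{'heading': 'medications', 'lines': [['Aspirin', '10mg']]}]
```

Keeping fragments grouped per line preserves the association between a medication and its dosage for later stages.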
Since Donut doesn't use OCR at all, this approach is faster and lighter, with far fewer parameters than OCR-based models. Its authors also report high accuracy.
The text extraction must not only understand the regular text but also special marks like check marks and circles. In the example below, a doctor has selected "Y" as their choice but the text extraction model has ignored it.
Such unusual data outside the purview of regular text extraction is the reason we fine-tune image-language models for our document understanding pipelines. Recognition of these marks can be trained into the image-to-text architectures outlined above to produce a single pipeline. We recommend converting the recognized marks into text that the language model can then learn from.
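As a rough illustration of converting a recognized mark into text, the sketch below assigns a detected circle or check mark to the printed option it overlaps most and emits a plain-text field. It assumes a mark detector and the text extractor both return pixel bounding boxes; the `Box` type, the function names, the field name "Allergies", and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(width, 0.0) * max(height, 0.0)


def selected_option(mark, options) -> Optional[str]:
    """Return the option label whose box overlaps the mark the most."""
    best_label, best_area = None, 0.0
    for label, box in options.items():
        area = overlap_area(mark, box)
        if area > best_area:
            best_label, best_area = label, area
    return best_label


def mark_to_text(field, mark, options):
    """Render the marked choice as text that a language model can read."""
    choice = selected_option(mark, options)
    return f"{field}: {choice}" if choice else f"{field}: [unmarked]"


# Hypothetical boxes: printed "Y" and "N" options, and a circle drawn around "Y".
options = {"Y": Box(100, 50, 112, 64), "N": Box(130, 50, 142, 64)}
circle = Box(96, 46, 116, 68)
print(mark_to_text("Allergies", circle, options))  # Allergies: Y
```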
The extracted text fragments may not be in an ideal state for medical processing use cases. For example, the text in signatures, page numbers, seals, logos, or letterheads just acts as noisy text that doesn't add any relevant information to health report summaries but may affect the quality of the summaries or extracted information fields.
Text preprocessing uses deep learning models to ignore such noisy text. One approach we use is to adapt the approach from a paper on language model pretraining over clinical notes to our text extraction model, teaching it to recognize the text patterns unique to health records.
Unlike the paper, our approach does not use long short-term memory models but instead adapts its sequential, hierarchical, and pretraining approaches to our multi-modal text extraction model. The approach involves fine-tuning the layout understanding model with additional embeddings for medical information like section labels and diagnosis codes.
Determining section labels for each page is an essential step for the accurate processing of digitized paper records.
As we saw earlier, every section in a health record has a distinct structure and set of fields. Not all sections can be processed the same way. The processing heavily depends on the nature of the medical information in a section, its structure, and the goals of the health care professional doing the processing. For example:
The appropriate GPT prompts and models for each section and use case are also different. So, every page is labeled with appropriate section labels and additional goal-specific labels by a page classification model.
Some examples of labels are:
The classification models that label an input health record are implemented in one of two ways explained next.
GPT-4 is already trained on medical corpora and is capable of scoring high in medical examinations. As such, it's inherently capable of classifying each page of a health record based on that page's text contents. For labels that are simple and obvious, straightforward prompt instructions are sufficient; we don't even have to provide any examples as few-shot guidance.
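A zero-shot classification prompt of this kind can be sketched as follows. The prompt wording, the label names, and both function names are assumptions for illustration; the call to the model itself is omitted, and only the prompt construction and reply parsing are shown.

```python
def build_classification_prompt(page_text, labels):
    """Build a zero-shot prompt asking for exactly one label (no examples given)."""
    label_list = "\n".join(f"- {label}" for label in labels)
    return (
        "You are labeling pages of a hospital inpatient health record.\n"
        "Choose the single best label for the page from this list:\n"
        f"{label_list}\n"
        "Answer with the label only.\n\n"
        f"Page text:\n{page_text}"
    )


def parse_label(reply, labels):
    """Map the model's reply back to a known label, or None if it doesn't match."""
    cleaned = reply.strip().strip(".").lower()
    for label in labels:
        if label.lower() == cleaned:
            return label
    return None


# Hypothetical label set based on the page types discussed in this article.
labels = ["medications", "focus notes", "patient details"]
prompt = build_classification_prompt("Aspirin 10mg once daily", labels)
print(prompt)
print(parse_label(" Medications.\n", labels))  # medications
```

Returning `None` for an unrecognized reply gives the pipeline a clear signal to fall back to another classification method.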
For some use cases, we need special labels that zero-shot classification can't assign accurately. To handle them, we maintain a reference set of manually labeled sections and examine how similar an input record's section is to each section in that set. The reference sections that score high on content similarity with the input section are selected, and their manually assigned labels are chosen as the labels for the input section.
Implementation-wise, we determine content similarity using vector similarity metrics like cosine similarity. The reference sections as well as the input sections are converted to embedding vectors using either OpenAI embeddings or Sentence-BERT. The reference embeddings are stored in a vector database like Pinecone and queried for vector similarity with an input section. The database returns the most similar reference sections and their labels.
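The lookup can be sketched in plain Python, with an in-memory list standing in for the vector database. The three-dimensional vectors and the label names below are made up for illustration; real embeddings from OpenAI or Sentence-BERT have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_labels(query, reference, top_k=1):
    """Return the top_k (label, score) pairs, most similar reference first."""
    scored = sorted(
        ((cosine_similarity(query, vector), label) for label, vector in reference),
        reverse=True,
    )
    return [(label, score) for score, label in scored[:top_k]]


# Toy reference set: (manually assigned label, embedding of the reference section).
reference = [
    ("discharge summary", [0.9, 0.1, 0.0]),
    ("lab report", [0.1, 0.9, 0.1]),
    ("medications", [0.0, 0.2, 0.9]),
]
query = [0.1, 0.3, 0.8]  # embedding of the input section (made up)

best_label, best_score = most_similar_labels(query, reference)[0]
print(best_label)  # medications
```

A vector database performs the same ranking, but with indexing that scales to large reference sets.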
In this stage, GPT-4 prompts are used to summarize the information on each page.
For some pages, this involves abstractive summarization of the clinical text on the page. GPT-4 rephrases that text to a shorter abstract without losing any critical details.
For other pages, GPT-4 is used for extractive summarization. Key information is extracted verbatim from a page's content.
We show some page examples and their respective prompts in the sections below.
The medications page of a sample health record annotated by the text extraction model is shown below:
We ask GPT-4 to summarize the medications page with this prompt: "Summarize the list of medications in this extract from a medications page of a health record."
GPT-4 generates the following summary:
We can see that the dosages are missing from the summary. This is because the text extraction pipeline we used here did not keep all the information on a line together, even though the extraction model provides the pixel coordinates needed to do so. So this is a drawback of the pipeline rather than of the extraction or summarization model, and it can be easily fixed.
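One possible fix is to regroup the extracted fragments into lines using their pixel coordinates before building the prompt. The sketch below is an assumption about how that could be done, not the pipeline's actual code: the `Fragment` type, the tolerance value, and the sample coordinates (including the second medication) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge, in pixels
    y: float  # vertical center, in pixels


def group_into_lines(fragments, y_tolerance=8.0):
    """Group fragments whose vertical centers are close, then order each line left to right."""
    lines = []
    for fragment in sorted(fragments, key=lambda f: f.y):
        # Compare against the first fragment of the current line to avoid drift.
        if lines and abs(fragment.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(fragment)
        else:
            lines.append([fragment])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Made-up fragments arriving out of reading order.
fragments = [
    Fragment("10mg", 220, 101),
    Fragment("Metformin", 40, 131),
    Fragment("Aspirin", 40, 100),
    Fragment("500mg", 220, 129),
]
text_lines = group_into_lines(fragments)
print(text_lines)  # ['Aspirin 10mg', 'Metformin 500mg']
```

With each medication and its dosage on the same line of the prompt, the summarization model no longer has to guess which dosage belongs to which drug.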
This focus notes page of a sample health record contains a lot of difficult-to-read handwritten text and has been annotated by the text extraction model:
Note that the text extraction model has misidentified words like "Pt." (for "Patient") as a meaningless "R t." This is because the model used here has not been fine-tuned on medical records.
We ask GPT-4 to summarize the focus notes page with this prompt: "Summarize the following extract from the focus notes of a health record:"
GPT-4 generates the following summary:
GPT-4 has done a great job of summarizing the focus notes here. Although the extracted text is not in the same order as the page layout, GPT-4 recombined and organized the information in a coherent and structured way by itself while ignoring the unnecessary details.
The crucial patient details form of a sample health record has been annotated by the text extraction model as follows:
Notice how it has accurately identified both printed and handwritten text.
We ask GPT-4 to summarize the patient details with this prompt: "Summarize the details in this patient details form from a health record:"
GPT-4 generates the following patient details summary:
Note that even in cases where the field name and field value are not together in the extracted text because of pipeline drawbacks, GPT-4 has intelligently correlated them:
Medical record number in the original record on the left. Its locations in the extracted text. GPT-4 has correctly correlated them again in the summary! (Original source: AHIMA)
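The three prompts used in the examples above can be collected into a lookup keyed by page label, so that each classified page gets the matching instruction. This is a minimal sketch: the label keys and the function name are assumptions, and the call to GPT-4 is left out.

```python
# Prompts quoted from the examples above, keyed by hypothetical page labels.
SUMMARY_PROMPTS = {
    "medications": (
        "Summarize the list of medications in this extract from a "
        "medications page of a health record."
    ),
    "focus notes": (
        "Summarize the following extract from the focus notes of a health record:"
    ),
    "patient details": (
        "Summarize the details in this patient details form from a health record:"
    ),
}


def build_summary_prompt(label, page_text):
    """Combine the label-specific instruction with the page's extracted text."""
    if label not in SUMMARY_PROMPTS:
        raise ValueError(f"No summarization prompt registered for label {label!r}")
    return f"{SUMMARY_PROMPTS[label]}\n\n{page_text}"


summary_prompt = build_summary_prompt("medications", "Aspirin 10mg")
print(summary_prompt)
```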
The combinator module generates the final section-level summaries. For sections that span multiple pages, it combines their page summaries into a single coherent section summary. While doing so, it doesn't just squish multiple summaries together naively. Instead, it condenses their information by removing any duplicated details and generates a concise section summary that does not feel choppy.
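The bookkeeping around the combinator can be sketched as below: consecutive pages with the same section label are grouped, and their page summaries are assembled into a single input for the combining model. The model itself is not shown; the function names, the instruction wording, and the sample summaries are illustrative assumptions.

```python
from itertools import groupby


def group_page_summaries(pages):
    """Group consecutive (section_label, page_summary) pairs by section label."""
    return [
        (label, [summary for _, summary in group])
        for label, group in groupby(pages, key=lambda page: page[0])
    ]


def build_combinator_input(label, summaries):
    """Assemble one section's page summaries into a single model input."""
    numbered = "\n".join(
        f"[page {i}] {summary}" for i, summary in enumerate(summaries, start=1)
    )
    return (
        f"Section: {label}\n"
        "Combine these page summaries into one concise summary "
        "without repeating details:\n"
        f"{numbered}"
    )


# Made-up page summaries in page order.
pages = [
    ("medications", "Patient takes aspirin."),
    ("medications", "Aspirin continued; no new medications."),
    ("focus notes", "Patient resting comfortably."),
]
groups = group_page_summaries(pages)
for label, summaries in groups:
    print(build_combinator_input(label, summaries))
```

Note that `groupby` only merges adjacent pages, which matches sections that span consecutive pages; a section split across non-adjacent pages would need the pairs sorted by label first.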
The combinator is implemented as a custom fine-tuned transformer model with either GPT-4 or another language model like BERT as the base model. Fine-tuning enables us to generate high-quality final summaries. It also lets users tweak the size and quality of each summary because everyone has a different idea of what an ideal summary looks like for their specific use case.
Using medical report summarization as a use case, this article showed a typical report processing pipeline that uses incredible advances in large language models to streamline health care operations.
In addition to summarization, many other high-quality artificial intelligence (AI) and natural language processing solutions for health records are possible now, like question-answering chatbots and powerful search engines. Contact us to explore how you can streamline operations in your hospital, lab, or medical practice with modern AI technologies.