PHI Redaction

Protected Health Information (PHI) redaction is a critical compliance and data governance practice for any organization that handles health-related records. As healthcare systems increasingly rely on digital documents, automated workflows, and AI-assisted processing, accurately identifying and removing PHI before disclosure has become significantly more complex. Understanding what PHI redaction is, what it covers, and how to implement it is essential for maintaining HIPAA compliance and protecting patient privacy.

What PHI Redaction Is and Why It Matters

PHI redaction is the process of permanently removing or obscuring Protected Health Information from documents, records, or media before sharing or disclosure. It is a foundational practice under HIPAA, which governs how covered entities—such as hospitals, insurers, and healthcare clearinghouses—and their business associates handle individually identifiable health information.

How HIPAA Defines PHI

Under HIPAA, PHI refers to any individually identifiable health information that is created, received, maintained, or transmitted by a covered entity or business associate. This includes information relating to an individual's past, present, or future physical or mental health condition, the provision of healthcare, or payment for healthcare services.

PHI is defined broadly by design. Even data that appears innocuous in isolation—such as a ZIP code or a date—can qualify as PHI when combined with other information that makes an individual identifiable. That broad scope is one reason many organizations begin with strong PII detection in documents before applying healthcare-specific redaction rules.

What Redaction Involves in Practice

Redaction is the targeted removal or permanent obscuring of specific data points from a document or record. The goal is to prevent unauthorized disclosure of identifying information while preserving the remaining content for its intended use.

Common examples of PHI subject to redaction include:

  • Patient names
  • Dates of service, admission, or discharge
  • Social Security numbers
  • Home addresses and geographic data
  • Medical record numbers and account numbers
  • Phone numbers and email addresses
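As a rough illustration of how pattern-based detection works for a few of the identifiers listed above, the sketch below replaces Social Security numbers, US-style phone numbers, and email addresses with labeled placeholders. The patterns, the `PATTERNS` table, and the `redact` function are assumptions made for this example, not a complete or production-grade detector. Names, addresses, and dates in free text generally need more than regular expressions.

```python
import re

# Illustrative patterns only (assumed for this sketch); real PHI detection
# needs much broader coverage and contextual analysis.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace every pattern match with a bracketed label such as [SSN]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "SSN 123-45-6789, call 555-867-5309 or mail jane.doe@example.org"
    print(redact(sample))
```

Replacing a value with a label rather than deleting it keeps the surrounding sentence readable, which matches the goal of preserving the remaining content for its intended use.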

How Redaction Differs from De-Identification and Anonymization

These three terms are frequently used interchangeably, but they carry distinct legal and technical meanings with different compliance implications. Misunderstanding the differences can lead to serious compliance errors.

The following table clarifies the key distinctions and includes pseudonymization, a related technique, for comparison:

| Concept | Definition | What Happens to the Data | Is Re-Identification Possible? | HIPAA Recognition / Regulatory Status | Common Use Case |
| --- | --- | --- | --- | --- | --- |
| PHI Redaction | The targeted removal or obscuring of specific identifying data points from a document or record | Specific fields or values are permanently removed or blacked out; surrounding content remains intact | Conditional: depends on thoroughness; residual data may still identify an individual if redaction is incomplete | Not a formally defined HIPAA standard, but a common way to meet Privacy Rule limits on disclosure, including in litigation | Litigation document review; sharing records with researchers or third parties |
| De-Identification | The transformation of a dataset so that no reasonable basis exists to identify any individual | Identifiers are removed or generalized using either Safe Harbor (removal of the 18 listed identifiers) or Expert Determination (a qualified expert concludes that the re-identification risk is very small) | Unlikely: when properly executed under 45 CFR §164.514, re-identification risk is very small, though not zero | Formally defined under 45 CFR §164.514; de-identified data is no longer subject to HIPAA's Privacy Rule | Releasing datasets for public health research or analytics |
| Anonymization | The irreversible removal of all identifying information so that re-identification is not possible by any means | All direct and indirect identifiers are stripped with no retained linkage key or re-identification path | No: by definition, anonymized data cannot be traced back to an individual | Not explicitly defined under HIPAA; more commonly referenced under GDPR and other international frameworks | Publishing aggregate health statistics; open data initiatives |
| Pseudonymization | The replacement of identifying information with artificial identifiers (pseudonyms) while retaining a linkage key | Identifiers are substituted with tokens or codes; original data is retained separately and can be re-linked | Yes: re-identification is possible using the retained linkage key | Not defined under HIPAA; recognized under GDPR as a data protection technique | Clinical trials; longitudinal research requiring re-linkage |

Redaction is the most targeted of these approaches—it removes specific data points rather than altering an entire dataset. This distinction matters because a redacted document is not automatically de-identified, and organizations should not treat redaction as a substitute for formal de-identification when the latter is required.
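To make the pseudonymization row of the table more concrete, here is a minimal sketch of token substitution with a retained linkage key. The `Pseudonymizer` class, its method names, and the `PT-` token prefix are all hypothetical. In practice the linkage table would be stored separately from the pseudonymized data and under stricter access controls.

```python
import secrets


class Pseudonymizer:
    """Swap identifiers for random tokens while keeping a linkage key."""

    def __init__(self) -> None:
        self._forward = {}  # identifier -> token
        self._reverse = {}  # token -> identifier (the linkage key)

    def tokenize(self, identifier: str) -> str:
        # Reuse the existing token so one patient always maps to one pseudonym.
        if identifier not in self._forward:
            token = "PT-" + secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def relink(self, token: str) -> str:
        # Only possible for whoever holds the linkage key.
        return self._reverse[token]


if __name__ == "__main__":
    p = Pseudonymizer()
    token = p.tokenize("MRN-0001")
    print(token, "->", p.relink(token))
```

Because `relink` exists, this data remains re-identifiable, which is the difference from redaction or anonymization that the table describes.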

The 18 HIPAA-Defined Identifiers That Require Redaction

HIPAA's Privacy Rule lists 18 specific data elements that must be removed for a dataset to qualify as de-identified under the Safe Harbor method. These identifiers can appear in any data format, including structured records, unstructured clinical notes, scanned images, audio recordings, and video, and any combination of these elements that could reasonably identify an individual must be treated as PHI.

PHI does not only appear in obvious places. A clinical note, a scanned intake form, a voicemail recording, or a video consultation may all contain identifiable information that requires redaction before the content can be shared or disclosed.

The following table provides a reference for all 18 HIPAA-defined identifiers, including their categories, real-world examples, and the data formats in which they most commonly appear:

| Identifier # | PHI Identifier Name | Category | Common Examples | Typical Data Format(s) Where Found |
| --- | --- | --- | --- | --- |
| 1 | Names | Demographic | Patient full name, maiden name, name on insurance card | Text documents, scanned forms, audio, video |
| 2 | Geographic Data Smaller Than State | Demographic | Street address, city, county, ZIP code, GPS coordinates | Text documents, databases, scanned forms |
| 3 | Dates (Except Year) | Temporal | Birthdate, admission date, discharge date, date of death, date of service, ages over 89 | Text documents, databases, scanned records, audio |
| 4 | Phone Numbers | Contact | Home phone, mobile number, work phone | Text documents, scanned forms, databases |
| 5 | Fax Numbers | Contact | Office fax, clinic fax number | Text documents, scanned forms |
| 6 | Email Addresses | Contact | Personal or work email address | Text documents, emails, scanned forms |
| 7 | Social Security Numbers | Financial / Government ID | Full or partial SSN | Text documents, databases, scanned forms |
| 8 | Medical Record Numbers | Administrative | EHR record ID, hospital chart number | Text documents, databases, scanned records |
| 9 | Health Plan Beneficiary Numbers | Administrative | Insurance member ID, Medicare/Medicaid number | Text documents, scanned insurance cards, databases |
| 10 | Account Numbers | Financial | Bank account number, billing account ID | Text documents, databases, scanned forms |
| 11 | Certificate / License Numbers | Administrative | Medical license number, driver's license number | Text documents, scanned documents |
| 12 | Vehicle Identifiers and Serial Numbers | Identifying | License plate number, VIN | Text documents, scanned records |
| 13 | Device Identifiers and Serial Numbers | Identifying | Implanted device serial number, medical equipment ID | Text documents, databases |
| 14 | Web URLs | Digital | Personal website, patient portal URL linked to an individual | Text documents, emails, digital records |
| 15 | IP Addresses | Digital | IPv4 or IPv6 address associated with an individual | System logs, digital records, emails |
| 16 | Biometric Identifiers | Biometric | Fingerprints, retinal scans, voiceprints | Databases, audio, video, scanned records |
| 17 | Full-Face Photographs and Comparable Images | Visual | Patient photos, facial scans, identifying images | Scanned documents, image files, video |
| 18 | Any Other Unique Identifying Number or Code | Miscellaneous | Custom patient IDs, unique codes that link to an individual | Databases, text documents, scanned records |
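Identifiers 2 and 3 in the table are often generalized rather than deleted outright. The sketch below assumes the usual reading of Safe Harbor: dates are reduced to the year, and a ZIP code is reduced to its first three digits unless that three-digit area has a small population, in which case it becomes `000`. The function names are hypothetical, and the set of low-population prefixes is a caller-supplied placeholder rather than an authoritative list.

```python
from datetime import date


def generalize_date(d: date) -> str:
    """Keep only the year of a date tied to an individual."""
    return str(d.year)


def generalize_zip(zip_code: str, low_population_prefixes: set) -> str:
    """Keep the 3-digit ZIP prefix unless it is flagged as low-population."""
    prefix = zip_code[:3]
    return "000" if prefix in low_population_prefixes else prefix


if __name__ == "__main__":
    flagged = {"999"}  # placeholder value for illustration only
    print(generalize_date(date(2021, 3, 14)))
    print(generalize_zip("94107", flagged))
    print(generalize_zip("99950", flagged))
```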

Redaction Requirements Across Data Formats

PHI appears in both structured and unstructured data, and redaction requirements apply equally across all formats.

Structured data—such as databases, forms, and spreadsheets—stores identifiers in discrete fields, which makes automated detection more straightforward. Unstructured data, including clinical notes, emails, and free-text records, embeds identifiers within narrative text, making detection significantly more challenging and error-prone.
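For structured data, where identifiers sit in discrete fields, redaction can be as simple as masking known columns. The sketch below is a minimal example under that assumption. The field names in `PHI_FIELDS` and the `redact_record` helper are hypothetical and would need to match an actual schema.

```python
# Hypothetical field names; a real schema would define its own.
PHI_FIELDS = {"name", "ssn", "mrn", "phone", "email", "address"}


def redact_record(record: dict) -> dict:
    """Return a copy of the record with known PHI fields masked."""
    return {
        key: "[REDACTED]" if key in PHI_FIELDS else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    row = {"name": "Jane Doe", "mrn": "A-1042", "diagnosis": "asthma"}
    print(redact_record(row))
```

This approach does not cover free-text columns such as a notes field, which behave like unstructured data and need content-level detection.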

Scanned documents and images present a different problem: PHI may be embedded in image layers rather than machine-readable text, which means optical character recognition must be applied before redaction can begin. Because scanned records remain common in healthcare, many teams evaluate clinical data extraction solutions for OCR to improve text capture before PHI detection and redaction are applied. Audio and video add further complexity, as PHI may be spoken, visible on screen, or embedded in metadata, requiring specialized transcription and frame-level analysis tools.

Any combination of data elements that could reasonably identify an individual qualifies as PHI, even if each element appears benign on its own. For example, a ZIP code combined with a birthdate and a general diagnosis may be sufficient to identify a specific individual in a small population.
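One way to check for this kind of combination risk is to count how many records share each combination of values and flag those that are unique. The sketch below does that; the function names and sample field names are assumptions for illustration, and a group size of one is only a rough indicator of risk, not a formal de-identification test.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def unique_records(records, quasi_identifiers):
    """Return records whose combination of values appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] == 1
    ]


if __name__ == "__main__":
    data = [
        {"zip": "94107", "birthdate": "1980-01-02", "dx": "asthma"},
        {"zip": "94107", "birthdate": "1980-01-02", "dx": "asthma"},
        {"zip": "59001", "birthdate": "1975-06-30", "dx": "diabetes"},
    ]
    print(unique_records(data, ["zip", "birthdate", "dx"]))
```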

Choosing the Right PHI Redaction Method and Tools

Organizations have several approaches available for redacting PHI, ranging from fully manual processes to AI-powered automated systems. The appropriate method depends on document volume, data format complexity, available resources, and compliance requirements. For higher-volume environments, document redaction automation can reduce repetitive manual review work and improve consistency across large document sets.

One of the most significant technical challenges in automated PHI redaction is document preparation. Before detection can be applied, documents must be converted into clean, machine-readable formats. Scanned clinical records, multi-column PDFs, and image-heavy documents are particularly difficult for standard text extraction tools. Purpose-built document parsing systems can serve as the document preparation layer in a broader PHI redaction pipeline, converting complex document formats into structured, machine-readable output before PHI detection is applied.
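The layering described above, with document preparation first and detection second, can be sketched as a small pipeline. Everything here is hypothetical: `parse` stands in for whatever parsing or OCR layer is used, `detect` stands in for a PHI detector that returns character spans, and the sketch assumes those spans do not overlap.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def apply_spans(text: str, spans: List[Span]) -> str:
    """Replace each span with its label, assuming spans do not overlap."""
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def run_pipeline(
    raw: bytes,
    parse: Callable[[bytes], str],
    detect: Callable[[str], List[Span]],
) -> str:
    """Parse the raw document to text, detect PHI spans, then redact them."""
    text = parse(raw)
    return apply_spans(text, detect(text))


if __name__ == "__main__":
    def toy_detect(text: str) -> List[Span]:
        # Stand-in detector that only knows one name.
        start = text.find("Jane Doe")
        return [(start, start + 8, "NAME")] if start >= 0 else []

    raw = b"Patient Jane Doe was admitted."
    print(run_pipeline(raw, lambda b: b.decode("utf-8"), toy_detect))
```

Keeping parsing and detection as separate steps means either one can be replaced without touching the other, for example when a better OCR layer becomes available.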

Comparing Manual, Semi-Automated, and Fully Automated Redaction

The following table compares the three primary redaction approaches across key evaluation dimensions:

| Redaction Method | How It Works | Best Suited For | Key Advantages | Key Limitations / Risks | Supported Data Types | HIPAA Compliance Considerations |
| --- | --- | --- | --- | --- | --- | --- |
| Manual Redaction | A trained human reviewer physically or digitally identifies and obscures PHI using redaction tools or markup software | Low-volume, highly sensitive, or one-off documents requiring full human judgment | Complete human control; no software dependency; suitable for complex edge cases | High labor cost; not scalable; prone to human error and inconsistency; slow throughput | Text documents, scanned images (with effort), some audio/video | Requires documented review procedures; human error must be accounted for in compliance audits |
| Semi-Automated (Human-in-the-Loop) | Software flags candidate PHI instances for human review and approval before final redaction is applied | Medium-volume workflows where accuracy and human oversight are both required | Balances automation speed with human accuracy checks; reduces reviewer burden while maintaining oversight | Slower than full automation; still requires trained reviewers; software errors may propagate if reviewers are inattentive | Text documents, scanned images (with OCR), structured data | Audit trails typically generated by software; human review steps should be logged and documented |
| Fully Automated / AI-Powered Redaction | AI or NLP-based software detects and removes PHI across large document sets without per-instance human intervention | Large-scale, high-throughput environments processing thousands of documents | High throughput; consistent application at scale; reduces labor costs significantly | Risk of false positives and false negatives; requires model training or configuration; higher upfront implementation cost | Text documents, scanned images (with OCR), audio (with transcription), video (with frame analysis) | Vendor BAA required under HIPAA; audit trail and logging capabilities are essential evaluation criteria |
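One possible way to combine the semi-automated and fully automated rows is to route detections by confidence: high-confidence candidates are redacted automatically and the rest go to a reviewer. The sketch below illustrates that design. The `Candidate` type, the `route` function, and the 0.95 threshold are assumptions for this example, not a recommended setting.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str          # the detected value
    label: str         # identifier type, such as "NAME"
    confidence: float  # detector score between 0 and 1


def route(
    candidates: List[Candidate], auto_threshold: float = 0.95
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split candidates into (auto-redact, needs human review)."""
    auto, review = [], []
    for c in candidates:
        (auto if c.confidence >= auto_threshold else review).append(c)
    return auto, review


if __name__ == "__main__":
    found = [
        Candidate("123-45-6789", "SSN", 0.99),
        Candidate("May", "NAME", 0.55),  # ambiguous: a name or a month
    ]
    auto, review = route(found)
    print(len(auto), "auto-redacted;", len(review), "sent to review")
```

A lower threshold reduces reviewer workload but raises the chance that a wrong detection is applied without oversight, so the value is a policy decision as much as a technical one.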

What to Look for When Evaluating Redaction Tools

When assessing redaction tools—whether semi-automated or fully automated—organizations should evaluate the following criteria:

Accuracy is the tool's ability to correctly identify all 18 HIPAA identifiers across structured and unstructured data formats, including edge cases such as partial identifiers or contextually embedded PHI.
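Accuracy in this sense is usually measured against a hand-labeled sample using precision and recall. The sketch below computes both from sets of detected and true PHI spans. The function name and the convention of representing spans as `(start, end)` tuples with exact matching are assumptions for illustration.

```python
def precision_recall(predicted: set, actual: set):
    """Return (precision, recall) for detected spans versus labeled spans."""
    true_positives = len(predicted & actual)
    # Precision drops with false positives (over-redaction).
    precision = true_positives / len(predicted) if predicted else 1.0
    # Recall drops with false negatives (missed PHI).
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


if __name__ == "__main__":
    detected = {(0, 8), (25, 35), (40, 44)}
    labeled = {(0, 8), (25, 35), (50, 60)}
    print(precision_recall(detected, labeled))
```

For redaction, recall is typically the more critical number, since a missed identifier is a potential disclosure while an extra redaction only removes some non-sensitive text.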

Audit trail capabilities are non-negotiable. The tool must generate detailed, tamper-evident logs of all redaction actions, including what was redacted, when, by whom, and from which document.
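One common technique for making a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below is a minimal illustration of that idea, not a description of how any particular product implements audit logging. The function names and entry fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash used before the first entry


def _digest(prev_hash: str, entry: dict) -> str:
    payload = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, entry: dict) -> None:
    """Append an entry whose hash also covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "entry": entry,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, entry),
    })


def verify(log: list) -> bool:
    """Recompute the chain; any edited or removed entry breaks it."""
    prev_hash = GENESIS
    for record in log:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, {"doc": "intake.pdf", "who": "reviewer1", "what": "SSN"})
    append_entry(log, {"doc": "intake.pdf", "who": "reviewer1", "what": "NAME"})
    print(verify(log))
```

A chain like this detects changes to past entries but does not prevent them, so it is normally paired with access controls and an externally stored copy of the latest hash.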

HIPAA compliance support means the vendor should be willing to execute a Business Associate Agreement and demonstrate that the tool's architecture supports HIPAA's Security Rule requirements.

Compatibility with existing workflows matters because the tool should connect with existing document management systems, EHR platforms, and data repositories without requiring significant infrastructure changes.

Support for multiple data formats is also essential. Given that PHI appears across text, images, audio, and video, tools that handle only one format may leave significant compliance gaps.

Final Thoughts

PHI redaction is a legally and technically demanding practice that requires a clear understanding of what constitutes protected health information, where it appears across data formats, and which redaction methods are appropriate for a given organizational context. HIPAA's 18 defined identifiers span a wide range of data types—from names and dates to biometric identifiers and IP addresses—and can appear in structured databases, unstructured clinical notes, scanned documents, audio recordings, and video. Selecting the right redaction approach, whether manual, semi-automated, or fully automated, depends on document volume, format complexity, and the compliance infrastructure an organization has in place.

LlamaParse delivers VLM-powered agentic OCR that goes beyond simple text extraction, boasting industry-leading accuracy on complex documents without custom training. By leveraging advanced reasoning from large language and vision models, its agentic OCR engine intelligently understands layouts, interprets embedded charts, images, and tables, and enables self-correction loops for higher straight-through processing rates over legacy solutions. LlamaParse employs a team of specialized document understanding agents working together for unrivaled accuracy in real-world document intelligence, outputting structured Markdown, JSON, or HTML. It's free to try today and gives you 10,000 free credits upon signup.
