What Is PHI Redaction?

Protected Health Information (PHI) redaction is a critical compliance and data governance practice for any organization that handles health-related records. As healthcare systems increasingly rely on digital documents, automated workflows, and AI-assisted processing, accurately identifying and removing PHI before disclosure has become significantly more complex. Understanding what PHI redaction is, what it covers, and how to implement it is essential for maintaining HIPAA compliance and protecting patient privacy.

What PHI Redaction Is and Why It Matters

PHI redaction is the process of permanently removing or obscuring Protected Health Information from documents, records, or media before sharing or disclosure. It is a foundational practice under HIPAA, which governs how covered entities—such as hospitals, insurers, and healthcare clearinghouses—and their business associates handle individually identifiable health information.

How HIPAA Defines PHI

Under HIPAA, PHI refers to any individually identifiable health information that is created, received, maintained, or transmitted by a covered entity or business associate. This includes information relating to an individual's past, present, or future physical or mental health condition, the provision of healthcare, or payment for healthcare services.

PHI is defined broadly by design. Even data that appears innocuous in isolation—such as a ZIP code or a date—can qualify as PHI when combined with other information that makes an individual identifiable. That broad scope is one reason many organizations begin with strong PII detection in documents before applying healthcare-specific redaction rules.

What Redaction Involves in Practice

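Redaction is the targeted removal or permanent obscuring of specific data points from a document or record. The goal is to prevent unauthorized disclosure of identifying information while preserving the remaining content for its intended use. Permanence is the key requirement: drawing a black box over text in a PDF hides it visually but can leave the underlying text recoverable by copy and paste, so the text itself and any related metadata must actually be removed.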

Common examples of PHI subject to redaction include:

Patient names
Dates of service, admission, or discharge
Social Security numbers
Home addresses and geographic data
Medical record numbers and account numbers
Phone numbers and email addresses
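
As a rough illustration of how pattern-based redaction handles a few of these items, the sketch below replaces SSN-shaped, phone-shaped, and email-shaped strings with labeled placeholders. The patterns, the `PATTERNS` table, and the `redact` function are assumptions made for this example and cover only simple US-style formats.

```python
import re

# Illustrative patterns only. Real PHI detection needs much broader coverage
# (names, addresses, many date formats) and human validation.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace every pattern match with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label}]", text)
    return text


sample = "Reach Jane at 555-123-4567 or jane.doe@example.com. SSN 123-45-6789."
print(redact(sample))
```

The patient's first name survives in the output. Names, addresses, and free-form dates rarely follow a fixed pattern, which is why regular expressions alone are not sufficient for PHI redaction.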

How Redaction Differs from De-Identification and Anonymization

These three terms are frequently used interchangeably, but they carry distinct legal and technical meanings with different compliance implications. Misunderstanding the differences can lead to serious compliance errors.

The following table clarifies the key distinctions and includes pseudonymization, a fourth related technique, for comparison:

Concept	Definition	What Happens to the Data	Is Re-Identification Possible?	HIPAA Recognition / Regulatory Status	Common Use Case
PHI Redaction	The targeted removal or obscuring of specific identifying data points from a document or record	Specific fields or values are permanently removed or blacked out; surrounding content remains intact	Conditional: depends on thoroughness; residual data may still identify an individual if redaction is incomplete	Not a formally defined HIPAA standard; commonly used to limit disclosures to the minimum necessary under the Privacy Rule, including in litigation	Litigation document review; sharing records with researchers or third parties
De-Identification	The transformation of a dataset so that no reasonable basis exists to identify any individual	Either all 18 HIPAA identifiers are removed (Safe Harbor) or a qualified expert determines and documents that the re-identification risk is very small (Expert Determination)	Unlikely but not impossible: when properly executed under 45 CFR §164.514(b), the residual re-identification risk is very small	Formally defined under 45 CFR §164.514; de-identified data is no longer subject to HIPAA's Privacy Rule	Releasing datasets for public health research or analytics
Anonymization	The irreversible removal of all identifying information so that re-identification is not possible by any means	All direct and indirect identifiers are stripped with no retained linkage key or re-identification path	No — by definition, anonymized data cannot be traced back to an individual	Not explicitly defined under HIPAA; more commonly referenced under GDPR and other international frameworks	Publishing aggregate health statistics; open data initiatives
Pseudonymization	The replacement of identifying information with artificial identifiers (pseudonyms) while retaining a linkage key	Identifiers are substituted with tokens or codes; original data is retained separately and can be re-linked	Yes — re-identification is possible using the retained linkage key	Not defined under HIPAA; recognized under GDPR as a data protection technique	Clinical trials; longitudinal research requiring re-linkage

Redaction is the most targeted of these approaches—it removes specific data points rather than altering an entire dataset. This distinction matters because a redacted document is not automatically de-identified, and organizations should not treat redaction as a substitute for formal de-identification when the latter is required.
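
The practical difference between redaction and pseudonymization can be sketched in a few lines. The example below is illustrative only: `SECRET_KEY`, the `PT-` token format, and the in-memory `linkage_table` are assumptions. A real system would keep the key in a managed secret store and the linkage table in a separately secured location.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical key for this sketch


def pseudonymize(value: str, linkage: dict[str, str]) -> str:
    """Swap an identifier for a keyed token and record the mapping."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    token = f"PT-{digest[:12]}"
    linkage[token] = value  # retained separately so authorized staff can re-link
    return token


def redact_value(_value: str) -> str:
    """Discard the identifier; no mapping is kept, so there is no path back."""
    return "[REDACTED]"


linkage_table: dict[str, str] = {}
mrn = "MRN-0048213"
token = pseudonymize(mrn, linkage_table)
print(token, "->", linkage_table[token])  # the token can be mapped back
print(redact_value(mrn))                  # the redacted value cannot
```

Because the same input always yields the same token, pseudonymized records can still be joined across datasets, which is what makes the technique useful for longitudinal research and also why it remains re-identifiable.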

The 18 HIPAA-Defined Identifiers That Require Redaction

HIPAA's Privacy Rule lists 18 types of identifiers that must be removed for data to qualify as de-identified under the Safe Harbor method (45 CFR §164.514(b)(2)). The list covers identifiers of the individual and of the individual's relatives, employers, and household members. These identifiers can appear in any data format, including structured records, unstructured clinical notes, scanned images, audio recordings, and video, and health information linked to any of them must be treated as PHI.

PHI does not only appear in obvious places. A clinical note, a scanned intake form, a voicemail recording, or a video consultation may all contain identifiable information that requires redaction before the content can be shared or disclosed.

The following table provides a reference for all 18 HIPAA-defined identifiers, including their categories, real-world examples, and the data formats in which they most commonly appear:

Identifier #	PHI Identifier Name	Category	Common Examples	Typical Data Format(s) Where Found
1	Names	Demographic	Patient full name, maiden name, name on insurance card	Text documents, scanned forms, audio, video
2	Geographic Data Smaller Than State	Demographic	Street address, city, county, ZIP code, GPS coordinates (Safe Harbor permits keeping the first three ZIP digits when that area has more than 20,000 residents)	Text documents, databases, scanned forms
3	Dates (Except Year)	Temporal	Birthdate, admission date, discharge date, date of death, date of service; also ages over 89, which must be grouped as 90 or older	Text documents, databases, scanned records, audio
4	Phone Numbers	Contact	Home phone, mobile number, work phone	Text documents, scanned forms, databases
5	Fax Numbers	Contact	Office fax, clinic fax number	Text documents, scanned forms
6	Email Addresses	Contact	Personal or work email address	Text documents, emails, scanned forms
7	Social Security Numbers	Financial / Government ID	Full or partial SSN	Text documents, databases, scanned forms
8	Medical Record Numbers	Administrative	EHR record ID, hospital chart number	Text documents, databases, scanned records
9	Health Plan Beneficiary Numbers	Administrative	Insurance member ID, Medicare/Medicaid number	Text documents, scanned insurance cards, databases
10	Account Numbers	Financial	Bank account number, billing account ID	Text documents, databases, scanned forms
11	Certificate / License Numbers	Administrative	Medical license number, driver's license number	Text documents, scanned documents
12	Vehicle Identifiers and Serial Numbers	Identifying	License plate number, VIN	Text documents, scanned records
13	Device Identifiers and Serial Numbers	Identifying	Implanted device serial number, medical equipment ID	Text documents, databases
14	Web URLs	Digital	Personal website, patient portal URL linked to an individual	Text documents, emails, digital records
15	IP Addresses	Digital	IPv4 or IPv6 address associated with an individual	System logs, digital records, emails
16	Biometric Identifiers	Biometric	Fingerprints, retinal scans, voiceprints	Databases, audio, video, scanned records
17	Full-Face Photographs and Comparable Images	Visual	Patient photos, facial scans, identifying images	Scanned documents, image files, video
18	Any Other Unique Identifying Number, Characteristic, or Code	Miscellaneous	Custom patient IDs, unique codes that link to an individual	Databases, text documents, scanned records
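
Not every identifier has to be deleted outright. For dates, geography, and age, Safe Harbor permits a generalized form: year only, the first three ZIP digits for sufficiently populated areas, and a single 90-or-older age category. The sketch below illustrates those three rules. The function names are hypothetical, and `LOW_POPULATION_ZIP3` holds placeholder values that would need to be replaced with a current list derived from Census data.

```python
from datetime import date

# Placeholder values for illustration. In practice this set must list every
# three-digit ZIP prefix whose area contains 20,000 or fewer people.
LOW_POPULATION_ZIP3 = {"036", "059", "102"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits, or 000 for sparsely populated areas."""
    prefix = zip_code[:3]
    return "000" if prefix in LOW_POPULATION_ZIP3 else prefix


def generalize_date(d: date) -> str:
    """Keep only the year."""
    return str(d.year)


def generalize_age(age: int) -> str:
    """Group all ages over 89 into a single category."""
    return "90+" if age >= 90 else str(age)


print(generalize_zip("94110"), generalize_zip("03601"))
print(generalize_date(date(2023, 4, 17)))
print(generalize_age(93), generalize_age(42))
```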

Redaction Requirements Across Data Formats

PHI appears in both structured and unstructured data, and redaction requirements apply equally across all formats.

Structured data—such as databases, forms, and spreadsheets—stores identifiers in discrete fields, which makes automated detection more straightforward. Unstructured data, including clinical notes, emails, and free-text records, embeds identifiers within narrative text, making detection significantly more challenging and error-prone.
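
The contrast between the two can be seen in a small sketch. The field names in `PHI_FIELDS` and the `redact_record` helper are assumptions for this example, not a standard schema.

```python
# Hypothetical column names for this sketch; a real schema will differ.
PHI_FIELDS = {"name", "ssn", "mrn", "phone", "email", "address"}


def redact_record(record: dict) -> dict:
    """Return a copy of the record with known PHI fields masked."""
    return {
        key: "[REDACTED]" if key.lower() in PHI_FIELDS else value
        for key, value in record.items()
    }


row = {
    "name": "Jane Doe",
    "mrn": "0048213",
    "diagnosis": "asthma",
    "notes": "Jane Doe reports improved breathing.",
}
clean = redact_record(row)
print(clean)
```

The discrete `name` and `mrn` fields are masked by a simple lookup, but the patient's name inside the free-text `notes` field passes through untouched. Field-level rules handle structured columns well and do nothing for narrative text.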

Scanned documents and images present a different problem: PHI may be embedded in image layers rather than machine-readable text, which means optical character recognition (OCR) must be applied before redaction can begin. Because scanned records remain common in healthcare, many teams evaluate OCR-based clinical data extraction tools to improve text capture before PHI detection and redaction are applied. Audio and video add further complexity, as PHI may be spoken, visible on screen, or embedded in metadata, requiring specialized transcription and frame-level analysis tools.

Any combination of data elements that could reasonably identify an individual qualifies as PHI, even if each element appears benign on its own. For example, a ZIP code combined with a birthdate and a general diagnosis may be sufficient to identify a specific individual in a small population.
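
One way to surface this combination risk is to count how many records share each combination of values and flag combinations that occur fewer than k times. The sketch below is a simplified check of that kind. The `small_groups` function, the field names, and the default k of 2 are assumptions for illustration, not a HIPAA requirement.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=2):
    """Return value combinations shared by fewer than k records."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [key for key, n in counts.items() if n < k]


records = [
    {"zip": "59001", "birthdate": "1948-03-02", "diagnosis": "diabetes"},
    {"zip": "10001", "birthdate": "1990-07-15", "diagnosis": "asthma"},
    {"zip": "10001", "birthdate": "1990-07-15", "diagnosis": "asthma"},
]
print(small_groups(records, ["zip", "birthdate", "diagnosis"]))
```

The first record is the only one with its combination of values, so it is flagged even though no single field names the patient.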

Choosing the Right PHI Redaction Method and Tools

Organizations have several approaches available for redacting PHI, ranging from fully manual processes to AI-powered automated systems. The appropriate method depends on document volume, data format complexity, available resources, and compliance requirements. For higher-volume environments, document redaction automation can reduce repetitive manual review work and improve consistency across large document sets.

One of the most significant technical challenges in automated PHI redaction is document preparation. Before detection can be applied, documents must be converted into clean, machine-readable formats. Scanned clinical records, multi-column PDFs, and image-heavy documents are particularly difficult for standard text extraction tools. Purpose-built document parsing systems can serve as the document preparation layer in a broader PHI redaction pipeline, converting complex document formats into structured, machine-readable output before PHI detection is applied.
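
The stages described above can be sketched as three separable steps: extract text, detect PHI spans, and apply redactions. Everything here is a stand-in. `extract_text` simply decodes bytes where a real pipeline would call an OCR or parsing service, and `detect_phi` recognizes only SSN-shaped strings.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    start: int
    end: int
    label: str


def extract_text(raw: bytes) -> str:
    """Placeholder for OCR or document parsing; here it only decodes UTF-8."""
    return raw.decode("utf-8")


def detect_phi(text: str) -> list[Span]:
    """Toy detector that finds SSN-shaped strings only."""
    pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    return [Span(m.start(), m.end(), "SSN") for m in pattern.finditer(text)]


def apply_redactions(text: str, spans: list[Span]) -> str:
    """Replace each span, working right to left so earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


def redact_document(raw: bytes) -> str:
    text = extract_text(raw)
    return apply_redactions(text, detect_phi(text))


print(redact_document(b"Patient SSN: 123-45-6789"))
```

Keeping the stages separate means a better parser or detector can be swapped in without touching the rest of the pipeline. It also makes the dependency visible: any text the extraction step misses never reaches the detector.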

Comparing Manual, Semi-Automated, and Fully Automated Redaction

The following table compares the three primary redaction approaches across key evaluation dimensions:

Redaction Method	How It Works	Best Suited For	Key Advantages	Key Limitations / Risks	Supported Data Types	HIPAA Compliance Considerations
Manual Redaction	A trained human reviewer physically or digitally identifies and obscures PHI using redaction tools or markup software	Low-volume, highly sensitive, or one-off documents requiring full human judgment	Complete human control; no software dependency; suitable for complex edge cases	High labor cost; not scalable; prone to human error and inconsistency; slow throughput	Text documents, scanned images (with effort), some audio/video	Requires documented review procedures; human error must be accounted for in compliance audits
Semi-Automated (Human-in-the-Loop)	Software flags candidate PHI instances for human review and approval before final redaction is applied	Medium-volume workflows where accuracy and human oversight are both required	Balances automation speed with human accuracy checks; reduces reviewer burden while maintaining oversight	Slower than full automation; still requires trained reviewers; software errors may propagate if reviewers are inattentive	Text documents, scanned images (with OCR), structured data	Audit trails typically generated by software; human review steps should be logged and documented
Fully Automated / AI-Powered Redaction	AI or NLP-based software detects and removes PHI across large document sets without per-instance human intervention	Large-scale, high-throughput environments processing thousands of documents	High throughput; consistent application at scale; reduces labor costs significantly	Risk of false positives and false negatives; requires model training or configuration; higher upfront implementation cost	Text documents, scanned images (with OCR), audio (with transcription), video (with frame analysis)	Business Associate Agreement (BAA) required when the vendor handles PHI on the organization's behalf; audit trail and logging capabilities are essential evaluation criteria
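
A common way to combine the semi-automated and fully automated approaches is to route detections by model confidence. In the sketch below, the `route` function, the `score` field, and the 0.95 threshold are assumptions for illustration. An appropriate threshold depends on the detector and on the organization's risk tolerance.

```python
def route(detections, auto_threshold=0.95):
    """Split detections into auto-redact and human-review queues by score."""
    auto, review = [], []
    for detection in detections:
        if detection["score"] >= auto_threshold:
            auto.append(detection)
        else:
            review.append(detection)
    return auto, review


detections = [
    {"text": "123-45-6789", "label": "SSN", "score": 0.99},
    {"text": "May", "label": "NAME", "score": 0.62},
]
auto, review = route(detections)
print("auto-redact:", [d["text"] for d in auto])
print("needs review:", [d["text"] for d in review])
```

Here the ambiguous token "May", which could be a name or a month, goes to a reviewer instead of being redacted or ignored automatically.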

What to Look for When Evaluating Redaction Tools

When assessing redaction tools—whether semi-automated or fully automated—organizations should evaluate the following criteria:

Accuracy is the tool's ability to correctly identify all 18 HIPAA identifiers across structured and unstructured data formats, including edge cases such as partial identifiers or contextually embedded PHI.
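
Accuracy is usually measured against a hand-labeled sample by comparing the spans a tool detects with the spans a reviewer marked. The sketch below computes precision and recall over exact-match character spans. The `precision_recall` function and the sample spans are illustrative, and real evaluations often also give credit for partial overlaps.

```python
def precision_recall(predicted: set, actual: set) -> tuple[float, float]:
    """Compute precision and recall over exact-match (start, end) spans."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


actual = {(0, 8), (20, 31), (40, 52)}     # spans marked by a human reviewer
predicted = {(0, 8), (20, 31), (60, 64)}  # spans flagged by the tool
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For PHI, recall is typically the more important figure: a false positive removes some harmless text, while a false negative discloses protected information.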

Audit trail capabilities are non-negotiable. The tool must generate detailed, tamper-evident logs of all redaction actions, including what was redacted, when, by whom, and from which document.
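
One common technique for making a log tamper-evident is hash chaining, where each entry's hash covers the previous entry's hash. The sketch below shows the idea with an in-memory list. The entry fields and function names are hypothetical, and a production system would also need secure storage, trusted timestamps, and access controls.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


def _digest(prev_hash: str, entry: dict) -> str:
    payload = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, entry: dict) -> None:
    """Append an entry whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev_hash": prev_hash, "hash": _digest(prev_hash, entry)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or removed entry breaks it."""
    prev_hash = GENESIS
    for record in log:
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True


audit_log: list = []
append_entry(audit_log, {"doc": "intake-001.pdf", "user": "reviewer1", "label": "SSN", "page": 2})
append_entry(audit_log, {"doc": "intake-001.pdf", "user": "reviewer1", "label": "NAME", "page": 1})
print(verify(audit_log))
```

If any earlier entry is altered after the fact, its recomputed hash no longer matches the stored one and `verify` returns False.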

HIPAA compliance support means the vendor should be willing to execute a Business Associate Agreement and demonstrate that the tool's architecture supports HIPAA's Security Rule requirements.

Compatibility with existing workflows means the tool connects with existing document management systems, EHR platforms, and data repositories without requiring significant infrastructure changes.

Support for multiple data formats is also essential. Given that PHI appears across text, images, audio, and video, tools that handle only one format may leave significant compliance gaps.

Final Thoughts

PHI redaction is a legally and technically demanding practice that requires a clear understanding of what constitutes protected health information, where it appears across data formats, and which redaction methods are appropriate for a given organizational context. HIPAA's 18 defined identifiers span a wide range of data types—from names and dates to biometric identifiers and IP addresses—and can appear in structured databases, unstructured clinical notes, scanned documents, audio recordings, and video. Selecting the right redaction approach, whether manual, semi-automated, or fully automated, depends on document volume, format complexity, and the compliance infrastructure an organization has in place.
