Protegrity Data Discovery Description

Date: 2026-01-01

Protegrity Data Discovery is a data classification tool that identifies and classifies sensitive data across structured and unstructured sources. It uses a dual-model architecture that combines a machine learning language model (RoBERTa) with a rules-based engine (Presidio) to detect PII, PHI, PCI, and intellectual property.

The tool processes unstructured data such as natural language text, transcripts, documents, chatbot logs, support tickets, and free-text fields. It supports real-time redaction of chatbot conversations, automated cleanup of call center transcripts and medical notes, and pre-processing for GenAI RAG pipelines to keep PII out of LLM prompts.

Protegrity Data Discovery exposes a REST API and a Python SDK for integration into applications and workflows. It can be deployed in Docker containers or in Kubernetes environments, including AWS EKS, for cloud-native scalability.

Classification outputs include standard entity types such as PERSON, EMAIL, PHONE, ADDRESS, and CREDIT_CARD, along with confidence scores and character positions for targeted redaction or masking. Discovery results can be fed into Protegrity Governance for policy creation and protection rule refinement.
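
To illustrate how entity types, confidence scores, and character positions can drive targeted redaction, here is a minimal Python sketch. It does not call the Protegrity REST API or Python SDK, whose request and response schemas are not described here. The `Finding` class, its field names, the `redact` function, and the `min_score` threshold are all hypothetical and stand in for whatever shape the real classification output takes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One classified span. Field names are illustrative assumptions,
    not the product's documented response schema."""
    entity_type: str  # e.g. "PERSON", "EMAIL"
    start: int        # inclusive character offset into the text
    end: int          # exclusive character offset into the text
    score: float      # confidence in the range [0, 1]


def redact(text: str, findings: list[Finding], min_score: float = 0.5) -> str:
    """Replace each sufficiently confident span with a <TYPE> placeholder."""
    kept = [f for f in findings if f.score >= min_score]
    # Work from right to left so that earlier offsets remain valid
    # after each replacement changes the string length.
    kept.sort(key=lambda f: f.start, reverse=True)
    out = text
    next_start = len(text)  # start of the most recently redacted span
    for f in kept:
        if f.end > next_start:
            # Overlaps a span that was already redacted; skip it.
            continue
        out = out[:f.start] + f"<{f.entity_type}>" + out[f.end:]
        next_start = f.start
    return out


# Hypothetical sample input; offsets are computed rather than hard-coded.
text = "Contact Jane Doe at jane.doe@example.com."
name, email = "Jane Doe", "jane.doe@example.com"
findings = [
    Finding("PERSON", text.index(name), text.index(name) + len(name), 0.92),
    Finding("EMAIL", text.index(email), text.index(email) + len(email), 0.99),
    Finding("PHONE", 0, 7, 0.20),  # low-confidence hit, filtered out
]

print(redact(text, findings))
print(redact(text, findings, min_score=0.95))
```

Raising `min_score` trades recall for precision: at 0.95 only the email span is replaced in this sample. The overlap rule here simply keeps the span that starts later; a real pipeline might instead prefer the higher-scoring or longer span.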