
Using Large Language Models (LLMs) for Data Privacy


In today’s data-driven landscape, safeguarding Personally Identifiable Information (PII) is more critical than ever. PII refers to any information that can be used, directly or indirectly, to identify an individual. Recent studies indicate that PII constitutes a significant portion of organizational data stores. As a result, Chief Information Officers (CIOs) and other C-suite executives are investing substantial time and resources to manage PII, including masking or redacting sensitive information to make data accessible for business purposes without compromising privacy.

The Challenges of Traditional PII Masking

PII Masking is a fundamental practice in data privacy. It involves techniques that protect PII by rendering it unreadable or unusable by unauthorized parties. Traditionally, PII masking is performed manually or with specialized tools. However, manual masking is both costly and time-consuming, while conventional tools may lack the adaptability to handle evolving types of PII.
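A minimal rule-based masker illustrates both the appeal and the brittleness of such conventional tools: fixed patterns catch well-formed identifiers like email addresses, but every new PII type needs another hand-written rule. The patterns below are a small illustrative sketch, not a production rule set:

```python
import re

# Illustrative patterns only -- real tools maintain far larger rule sets.
PII_PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
}

def mask_with_rules(text: str) -> str:
    """Replace every match of a known pattern with a [REDACTED] token."""
    for pattern in PII_PATTERNS.values():
        text = re.sub(pattern, "[REDACTED]", text)
    return text

print(mask_with_rules("Reach Alice at alice@example.com, SSN 123-45-6789."))
# Note: the name "Alice" survives -- free-form names do not fit a regex.
```

The gap is exactly where statistical approaches come in: anything that does not follow a rigid format, such as a person's name, slips straight through.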

In the past decade, Machine Learning (ML) models have gained traction for PII masking. Techniques like Named Entity Recognition (NER)—a natural language processing method that identifies and categorizes entities in unstructured text—have been used to automate the classification and removal of PII. Despite their utility, these models have notable drawbacks:

  • Continuous Training Requirements: ML models need regular retraining to recognize new types of PII. For example, the scope of PII has expanded to include crypto wallet addresses and API tokens, necessitating updates to the models.
  • Complexity with Custom Entities: Organizations often have specific identifiers, such as employee IDs or proprietary classifiers, adding layers of complexity to the masking process.

These challenges underscore the need for more flexible and efficient solutions.

Introducing Large Language Models (LLMs) for PII Masking

Large Language Models (LLMs), such as OpenAI’s GPT-4, offer a transformative approach to PII masking. LLMs excel in Zero-Shot Learning, enabling them to perform tasks without explicit prior training on specific datasets. This capability makes them highly adaptable to new and evolving types of PII without the need for retraining.

Advantages of Using LLMs:

  • Rapid Deployment: Implementing PII masking with LLMs eliminates the lengthy training phase associated with traditional ML models.
  • Reduced Effort and Cost: Automating the masking process reduces manual workload and accelerates data processing.
  • Adaptability: LLMs can recognize and redact new forms of PII as they emerge, ensuring ongoing compliance with privacy regulations.

Implementing PII Masking with LLMs

By combining Prompt Engineering and Function Calling, organizations can leverage LLMs to ingest text containing PII and output a redacted version. Prompt engineering involves crafting specific instructions that guide the LLM to perform the desired task effectively. Function calling adds structure to the output, simplifying integration with existing systems and reducing the need for extensive exception handling.

Key Implementation Steps:

  1. Data Preparation: Utilize datasets that pair raw text containing actual PII with its masked counterpart, so the model's output can be validated. Publicly available datasets, such as those on HuggingFace, can be used for initial testing.
  2. Prompt Engineering: Develop prompts that instruct the LLM to identify and redact all forms of PII, including incomplete names, organizational information, hash keys, crypto addresses, and API keys.
  3. Function Calling: Define functions that structure the LLM’s output, ensuring consistency and ease of integration.
  4. Local Deployment: To maintain data security, deploy the LLM within your organization’s infrastructure. This approach prevents sensitive data from being transmitted to external APIs or third-party services.

Note: While the GPT-4o model from OpenAI is used here for illustration, it’s crucial to choose an LLM that aligns with your organization’s security policies.

# Load Dataset
import datasets

dataset_name = 'ai4privacy/pii-masking-65k'
dataset = datasets.load_dataset(dataset_name, split="train[:10%]")
dataset

# Function Calling
from pydantic import BaseModel, Field

class RedactTextParams(BaseModel):
    redacted_text: str = Field(description="Returned text after masking all PII data with [REDACTED] token")

# model_json_schema() already lists redacted_text as required,
# so no separate "required" key is needed in the tool definition.
tool_definitions = [
    {
        "type": "function",
        "function": {
            "name": "redact_pii_from_text",
            "description": "Generates masked text after removing all PII information related to persons \
             and companies to be uploaded into the datalake",
            "parameters": RedactTextParams.model_json_schema()
        }
    }
]

# Prompt Engineering
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def redact_pii_fc(text):
    completion = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are an AI assistant, skilled in masking \
        personally identifiable information including incomplete names or other information \
        related to persons, organizations, including hash keys, crypto addresses, API keys \
        in text blocks. Do not return as a code block. Mask all PII related words in the following \
        text with [REDACTED] token.'''" + text + "'''"}
      ],
      seed=42,
      tools=tool_definitions,
      tool_choice={"type": "function", "function": {"name": "redact_pii_from_text"}}
    )
    # The forced tool call returns a JSON argument object matching RedactTextParams.
    response = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)
    return response["redacted_text"]

Complete code for reference is available here.

Evaluating Performance

To assess the effectiveness of the LLM in PII masking, organizations can:

  • Use Levenshtein Distance: This metric measures the difference between the expected masked text and the LLM’s output, providing a quantitative evaluation of accuracy.
  • Iteratively Refine Prompts: By analyzing low-scoring examples, prompts can be adjusted to improve performance, squeezing more accuracy from the LLM. We noticed a significant improvement in accuracy when we prompted the LLM to redact Crypto Wallet addresses and hash keys.
  • Sample Testing: Conduct tests on random samples (e.g., 100 rows) to gauge the model’s efficiency and reliability.
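As a sketch of the Levenshtein-based evaluation, the distance between the expected and generated masked text can be computed with a few lines of standard-library Python and normalized into a 0–1 similarity score (the helper names here are our own):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(expected: str, actual: str) -> float:
    """Normalize distance into a 0..1 score (1.0 = exact match)."""
    if not expected and not actual:
        return 1.0
    return 1 - levenshtein(expected, actual) / max(len(expected), len(actual))

print(levenshtein("kitten", "sitting"))  # -> 3
```

Scoring each row of the validation set with `similarity` makes low-scoring examples easy to surface for the prompt-refinement loop described above.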

Below is the similarity distribution between the expected and actual outputs of the LLM-based PII masking model on a sample.

Benefits and Limitations

Benefits:

  • Quick Turnaround: LLMs can process and mask large volumes of data rapidly, enhancing operational efficiency.
  • Low Effort: Minimal manual intervention is required, freeing up resources for other critical tasks.
  • Adaptability: LLMs stay effective even as definitions of PII evolve, without the need for retraining.

Limitations:

  • Exception Handling: Despite structured outputs, LLMs may produce unexpected results, necessitating additional exception handling mechanisms.
  • Output Variability: LLMs can generate different outputs for the same input due to their stochastic nature, so consistency strategies (e.g., fixed seeds, low temperature) need to be in place.
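One simple mitigation for output variability—a sketch, not the only option—is to run the redaction several times and keep the most frequent result; combined with a fixed seed and low temperature this reduces, though never eliminates, run-to-run drift. The `stable_redact` helper below is our own illustration and works with any redaction callable, including `redact_pii_fc` above:

```python
from collections import Counter

def stable_redact(redact_fn, text: str, n: int = 3) -> str:
    """Call redact_fn n times on the same input and return the most common output."""
    outputs = [redact_fn(text) for _ in range(n)]
    return Counter(outputs).most_common(1)[0][0]

# Demo with a deterministic stand-in for an LLM call:
fake_outputs = iter([
    "[REDACTED] logged in",
    "[REDACTED]  logged in",   # spurious double space in one run
    "[REDACTED] logged in",
])
print(stable_redact(lambda t: next(fake_outputs), "alice logged in"))
# -> "[REDACTED] logged in"
```

Majority voting trades latency and cost (n calls instead of one) for consistency, so it is best reserved for high-stakes fields rather than bulk processing.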

Ensuring Data Security with Guardgen.AI

Data privacy is non-negotiable, especially when dealing with sensitive information. We strongly recommend performing PII masking using local LLMs to ensure data does not leave your secure environment. Guardgen.AI offers a solution that allows you to run LLMs securely within your infrastructure. Our platform facilitates PII masking and other generative AI solutions with ease and confidence.

Why Choose Guardgen.AI:

  • Secure Deployment: Keep your data in-house with on-premise LLM deployment.
  • Scalable Solutions: Adapt to growing data needs without compromising on performance.
  • Expert Support: Leverage our expertise to integrate LLMs seamlessly into your workflows.

Conclusion

Implementing PII masking with Large Language Models presents a significant opportunity for organizations to enhance data privacy while reducing operational costs and effort. By embracing this innovative approach, CXOs can ensure their organizations remain compliant with evolving privacy regulations and protect sensitive information effectively.


Take the Next Step

Ready to transform your data privacy strategy? Contact us today to learn how we can help you deploy LLMs securely within your infrastructure and unlock the full potential of generative AI solutions.
