Mitigating Prompt Injection Attacks on an LLM based Customer support App

TL;DR:

Table of Contents

Introduction

A typical end user interaction with ohandle

Background

Prompt Engineering

Prompt Injection

Jason Allen’s A.I.-generated work, “Théâtre D’opéra Spatial,” took first place in the digital category at the Colorado State Fair.Credit…via Jason Allen. Mr. Allen declined to share the exact text prompt he had submitted to Midjourney to create “Théâtre D’opéra Spatial.” Reference: AI-Generated Art Won a Prize. Artists Aren’t Happy. — The New York Times (nytimes.com)
There is an XKCD for everything! Ref: xkcd: Exploits of a Mom .
Fábio Perez and Ian Ribeiro shared the above modalities of the Prompt Injection Attack in their paper “Ignore Previous Prompt: Attack Techniques for Language Models” presented in the “36th Conference on Neural Information Processing Systems (NeurIPS 2022).” Reference: 2211.09527.pdf (arxiv.org).

Prompt Leaking

https://twitter.com/kliu128/status/1623472922374574080
https://twitter.com/kliu128/status/1623472922374574080

Goal Hijacking

Experiments

Before Mitigation

The first internal revision

Context Leakage
Coaxing into baking a cake
Confabulation
Hijacking

After Mitigation

Didn't fall for it.
Did not fall for the gambit, yay!
Foiled Again!
Sorry, Try Again!

Mitigations

Possible Approaches

Prefiltering the user input

from nltk import pos_tag, word_tokenize
import nltk
# nltk.download('averaged_perceptron_tagger')

text = "The quick brown fox jumps over the lazy dog."
nouns_adjectives = [word for word, pos in pos_tag(word_tokenize(text))
if pos.startswith('N') or pos.startswith('J')]

print(nouns_adjectives)

OUTPUT: ['quick', 'brown', 'fox', 'lazy', 'dog']

Prompt Engineering

Response filtering

Invitation to play

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store