Analyzing Jailbreaking in LLMs Through Pragmatics and Bias Studies
TORCHIO, FRANCESCA
2023/2024
Abstract
This work focuses on Large Language Models (LLMs) and some of the issues they pose. In particular, we examine jailbreaking: the use of specific prompts to bypass an LLM's restrictions and elicit problematic content. We approach jailbreaking from three perspectives: first, we review the existing literature that categorizes jailbreaking prompts; second, we compare the literature on jailbreaking with the literature on bias in Natural Language Processing (NLP); third, we provide a pragmatic analysis of jailbreaking based on human deception strategies.

File | Size | Format |
---|---|---|
Torchio_Tesi.pdf (open access) | 7.88 MB | Adobe PDF |
Users may download and share the full-text documents available in UNITESI UNIPV, in accordance with the Creative Commons CC BY NC ND license.
For more information, or to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/26344