You can't guarantee an LLM does anything. Custom data can often subvert the mach...

danlitt · 2026-06-10T21:51:21 1781128281

> You can't guarantee an LLM does anything.

Agreed.

> But that doesn't mean that separation between instructions and data is impossible.

Yes it does! The comments you are replying to are concerned that it is not possible to be sure that data and instructions have been separated. With certain kinds of automated systems (traditional ones), unless you write them incorrectly, you can be sure of this. And it is possible to engage in a productive incremental process where mistakes can be identified and removed, in a way people comprehend and can plan around.

LLMs do not have this. They have heuristics and guesses. Nobody knows what will work ahead of time, nor even a probability that it will work. That is not a doomer comment by the way! The same is true when you talk to a person. But it is a fundamental limitation, it cannot be removed.

Dylan16807 · 2026-06-11T02:30:41 1781145041

This is conflating different problems, in my opinion.

Can you make sure the instructions and data are separated and the machine follows only the instructions and doesn't change its behavior based on the data? No.

But the part that's impossible is not "the instructions and data are separated". The part that's impossible is "the machine follows only the instructions".

Separating instructions and data is not impossible, but it doesn't solve your problems.

One really important consequence of this is that even if the data doesn't have anything that looks like instructions, it can poison the machine anyway! If you get too focused on "instructions" then you miss that security flaw!

Even if you don't give the machine any data at all, it might not follow the instructions. It's not instruction/data conflation as the root cause, it's that instructions don't really work in the first place.

Terr_ · 2026-06-10T22:17:34 1781129854

What we have is a machine trained on many old documents that takes one new document and dreams up stuff to append. The LLM algorithm cannot specially recognize contents as "instructions" to itself-the-author.

Even if special tokens are used absolutely perfectly (somehow avoiding escapes or ambiguities or reflected attacks) they are ultimately the same as highlighting all the parts of the document in different colors. You've saved the signal, but there's no mind to receive the intended meaning.

This means that your markers--while far more exclusive--ultimately exist on the same data-level as punctuation and using ? to indicate a question.

> you can prevent the output tokens from ever using instruction formatting

The right words may still outweigh the formatting around them, the same way that they can already outweigh other words around them.