Validity, Breadth, and Density in a Model of Language Generation
The recent successes of large language models (LLMs) have led to a surge of theoretical research into language generation. A recent line of work proposes an abstract view, called language generation in the limit, in which generation is a game between an adversary and an algorithm: the adversary enumerates strings from an unknown language K, chosen from a countable collection of candidate languages, and after seeing a finite sample of these strings, the algorithm must generate strings from K that it has not yet seen. This formalism highlights a key tension: the trade-off between validity (the algorithm should produce only strings from the language) and breadth (it should be able to produce many of the language's strings). The same trade-off is central in applied uses of language generation, where it appears as a balance between hallucination (generating invalid utterances) and mode collapse (generating only a restricted set of outputs). Despite its importance, this trade-off has been challenging to study quantitatively. We survey recent work in this model, including a set of results that quantifies the trade-off between validity and breadth using measures of density.
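To make the abstract game concrete, the following is a minimal sketch of the interaction under illustrative assumptions: the countable collection is shrunk to two integer languages, and the generator is a deliberately naive consistency rule (not one of the algorithms from the surveyed results). Because the naive rule commits to the first candidate consistent with the sample, it can overshoot the true language and lose validity, which is exactly the hallucination side of the trade-off.

```python
# Toy simulation of the generation-in-the-limit game. The candidate
# collection, the target K, and naive_generate are all illustrative
# stand-ins, not the algorithms from the work being surveyed.

def evens():
    n = 0
    while True:
        yield str(n)
        n += 2

def multiples_of_four():
    n = 0
    while True:
        yield str(n)
        n += 4

# A (here: tiny) collection of candidate languages, each given by an
# enumerator; membership is approximated by a long finite prefix.
CANDIDATES = {"evens": evens, "multiples_of_four": multiples_of_four}

def prefix(lang, k=1000):
    gen = CANDIDATES[lang]()
    return [next(gen) for _ in range(k)]

def consistent(lang, sample):
    members = set(prefix(lang))
    return all(s in members for s in sample)

def naive_generate(sample):
    # Commit to the first candidate consistent with the sample and emit
    # its first enumerated string not yet seen -- a deliberately naive
    # rule that sacrifices validity when the guessed language is too large.
    for lang in CANDIDATES:
        if consistent(lang, sample):
            for s in prefix(lang):
                if s not in sample:
                    return s
    return None

# The adversary enumerates the unknown K (here: multiples of four, a
# sublanguage of the evens); the algorithm must output unseen strings of K.
K = "multiples_of_four"
K_members = set(prefix(K))
sample = []
for s in prefix(K, 10):
    sample.append(s)
    guess = naive_generate(sample)
    print(f"after {len(sample)} strings: guess={guess!r}, valid={guess in K_members}")
```

In this run the naive rule keeps proposing "2", a string of the larger candidate but not of K, so every guess is invalid; pursuing breadth carelessly costs validity, and the surveyed density results make this tension quantitative.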
This talk is based on joint work with Sendhil Mullainathan and Fan Wei.