Guided Synthetic Data Generation for Zero-Shot Information Extraction
Neil De La Fuente · Oscar Sainz · Iker García-Ferrero · Eneko Agirre
Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models (LLMs) have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GuideX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GuideX sets a new state of the art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GuideX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 3 F1 points when combined with it. Models trained on GuideX demonstrate enhanced comprehension of complex, domain-specific annotation schemas.
GuideX is a fully automatic pipeline that transforms raw documents into executable (guidelines + annotations) pairs. A single Llama 3.1 70B Instruct model is queried four times in succession, with each stage's output feeding directly into the next prompt. The design follows the four-step recipe described in Section 3 and Figure 2 of the paper.
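The sequential querying described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: `run_chain`, the `{input}` placeholder, and the `llm` callable are hypothetical names, and the actual GuideX prompts are not reproduced here.

```python
from typing import Callable, List


def run_chain(
    document: str,
    prompt_templates: List[str],
    llm: Callable[[str], str],
) -> List[str]:
    """Query the model once per template, feeding each output into the next prompt.

    Returns every intermediate output so later stages can be inspected.
    `llm` is any function mapping a prompt string to a completion string
    (hypothetical stand-in for a call to the underlying model).
    """
    outputs: List[str] = []
    current = document
    for template in prompt_templates:
        prompt = template.format(input=current)  # previous output becomes the new input
        current = llm(prompt)
        outputs.append(current)
    return outputs
```

With four templates, one per stage, the last element of the returned list would be the final annotated output and the earlier elements the intermediate artefacts.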
{...} JSON sketch whose keys are coarse, on-the-fly entity/slot names and whose values are the shortest source-aligned spans that realise each fact. A dedup-and-merge pass keeps the schema compact.
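One way a dedup-and-merge pass over such a sketch could work is shown below. The merging rule (keys that normalise to the same lowercase snake_case name are merged, and repeated spans are dropped) is an assumption made for illustration; the source does not specify how GuideX performs this step, and `normalise_key` and `dedup_and_merge` are hypothetical helpers.

```python
import re
from typing import Dict, List


def normalise_key(key: str) -> str:
    """Map variants such as 'Drug Name' and 'drug_name' to one canonical key."""
    return re.sub(r"[^a-z0-9]+", "_", key.lower()).strip("_")


def dedup_and_merge(sketch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Merge keys with the same canonical name and drop duplicate spans.

    Insertion order is preserved for both keys and spans.
    """
    merged: Dict[str, List[str]] = {}
    for key, spans in sketch.items():
        bucket = merged.setdefault(normalise_key(key), [])
        for span in spans:
            if span not in bucket:
                bucket.append(span)
    return merged


# Hypothetical sketch for a short clinical note.
sketch = {
    "Drug Name": ["aspirin"],
    "drug_name": ["aspirin", "ibuprofen"],
    "Dosage": ["50 mg"],
}
compact = dedup_and_merge(sketch)
```

Here `compact` ends up with two keys, `drug_name` and `dosage`, instead of three.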
@dataclass file. Every key becomes its own dataclass with a long natural-language docstring that lists definitions and edge cases; attributes are typed as str, int, List[str] or Optional[⋅] so the file can be imported later.
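For illustration, a generated guideline file might resemble the following. The `Medication` class, its docstring, and its attributes are invented for this example and are not taken from the GuideX data; only the overall shape (one dataclass per key, a docstring carrying the guideline, typed attributes) follows the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Medication:
    """A drug or therapeutic substance that the text says was prescribed or taken.

    Annotate the shortest span that names the substance (brand or generic name).
    Edge cases: do not annotate drug classes such as "antibiotics", and do not
    annotate substances that are only mentioned as allergies.
    """

    mention: str  # span naming the drug, copied verbatim from the document
    dose_mg: Optional[int] = None  # numeric dose in milligrams, if stated
    side_effects: List[str] = field(default_factory=list)  # reported adverse effects
```

Because the guideline lives in the docstring, it travels with the schema: anything that imports the class can read `Medication.__doc__` and build a prompt from it.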
result_instances = [...]. Any hallucinated value is rejected by the prompt template itself.
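The snippet below shows what such a list could look like, together with a simple post-hoc grounding check. The check is an addition for illustration only: the source states that hallucinated values are handled by the prompt template, so `is_grounded`, the `Medication` class, and the sample text are all assumptions.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Medication:
    """A drug mentioned in the text (hypothetical guideline class)."""

    mention: str


def is_grounded(instance, text: str) -> bool:
    """True if every string attribute of the instance occurs verbatim in the text."""
    values = (getattr(instance, f.name) for f in fields(instance))
    return all(v in text for v in values if isinstance(v, str))


text = "The patient was given aspirin after surgery."

# The second instance does not appear in the text, so it is treated as hallucinated.
result_instances: List[Medication] = [
    Medication(mention="aspirin"),
    Medication(mention="ibuprofen"),
]

kept = [inst for inst in result_instances if is_grounded(inst, text)]
```

After filtering, `kept` contains only the `aspirin` instance.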
Each generated file is immediately import-ed inside a unit test. Samples that fail to compile or violate type hints are discarded, eliminating spurious labels and mis-aligned spans. This filter keeps only schema-consistent, executable pairs and reduces noise before training.
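A filter of this kind could be sketched as below. It is an assumption-laden illustration rather than the authors' test harness: the generated file is executed from a string with `exec` instead of being imported from disk, only the hint kinds named earlier (str, int, List[...], Optional[...]) are supported, and `matches` and `is_valid_sample` are hypothetical names. Executing model-generated code should only be done in a sandboxed environment.

```python
import types
from dataclasses import fields, is_dataclass
from typing import Any, Union, get_args, get_origin, get_type_hints

# `X | None` has its own origin type on Python 3.10+; fall back to Union otherwise.
_UNION_ORIGINS = (Union, getattr(types, "UnionType", Union))


def matches(value: Any, hint: Any) -> bool:
    """Check a runtime value against a (restricted) type hint."""
    origin = get_origin(hint)
    if origin in _UNION_ORIGINS:  # covers Optional[X]
        return any(matches(value, arg) for arg in get_args(hint))
    if origin is list:  # List[X]
        args = get_args(hint)
        if not isinstance(value, list):
            return False
        return all(matches(v, args[0]) for v in value) if args else True
    if isinstance(hint, type):  # plain classes: str, int, NoneType, ...
        return isinstance(value, hint)
    return False  # unsupported hint kinds are rejected in this sketch


def is_valid_sample(source: str) -> bool:
    """Return True if the generated file runs and its instances respect their hints."""
    namespace: dict = {"__name__": "generated_sample"}
    try:
        exec(compile(source, "<generated>", "exec"), namespace)  # sandbox in practice
    except Exception:
        return False  # syntax errors and runtime failures discard the sample
    instances = namespace.get("result_instances")
    if not isinstance(instances, list):
        return False
    for inst in instances:
        if isinstance(inst, type) or not is_dataclass(inst):
            return False
        hints = get_type_hints(type(inst), globalns=namespace)
        if not all(matches(getattr(inst, f.name), hints[f.name]) for f in fields(inst)):
            return False
    return True


GOOD = '''
from dataclasses import dataclass
from typing import Optional

@dataclass
class Medication:
    """A drug mentioned in the text."""
    mention: str
    dose_mg: Optional[int] = None

result_instances = [Medication(mention="aspirin", dose_mg=50)]
'''
BAD_TYPE = GOOD.replace('mention="aspirin"', "mention=42")
BAD_SYNTAX = "result_instances = ["

valid = [s for s in (GOOD, BAD_TYPE, BAD_SYNTAX) if is_valid_sample(s)]
```

Only `GOOD` survives: `BAD_TYPE` assigns an integer to a `str` attribute, and `BAD_SYNTAX` does not compile.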
@misc{delafuente2025guidexguidedsyntheticdata,
title={GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction},
author={Neil De La Fuente and Oscar Sainz and Iker García-Ferrero and Eneko Agirre},
year={2025},
eprint={2506.00649},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.00649},
}