
Guided Synthetic Data Generation for Zero-Shot Information Extraction

Neil De La Fuente · Oscar Sainz · Iker García-Ferrero · Eneko Agirre


ACL Findings 2025

Abstract

Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GuideX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning LLaMA 3.1 with GuideX sets a new state of the art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GuideX gain up to 7 F1 points over previous methods when no human-labeled data is used, and nearly 3 F1 points more when combined with it. Models trained on GuideX also demonstrate enhanced comprehension of complex, domain-specific annotation schemas.

Figure: zero-shot NER F1 across seven domains. Training on GuideX (+Gold) beats the Gold-only model on five of the seven domains.

Methodology: 4 stages + QA

GuideX is a fully automatic pipeline that transforms raw documents into executable (guidelines + annotations) pairs. A single LLaMA-3.1 70B Instruct model is queried four times in succession, with each prompt feeding directly into the next stage, as in the sketch below. The design follows the four-step recipe described in Section 3 and Figure 2 of the paper.
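Conceptually, the chaining reduces to four successive calls to the same model, with each stage's output spliced into the next prompt. The sketch below shows that control flow only; query_llm and the prompt strings are placeholders, not the paper's actual templates.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to LLaMA-3.1 70B Instruct."""
    raise NotImplementedError

def guidex_pipeline(document: str) -> tuple[str, str]:
    # Stage 1: compress the article into atomic bullet-point facts.
    bullets = query_llm(f"Summarise this text into atomic facts:\n{document}")

    # Stage 2: re-encode the bullets as a minimal JSON sketch
    # (coarse entity/slot keys, shortest source-aligned value spans).
    sketch = query_llm(f"Convert these facts into a compact JSON sketch:\n{bullets}")

    # Stage 3: derive Python @dataclass guidelines, one class per key,
    # each with a natural-language docstring and type hints.
    guidelines = query_llm(f"Write @dataclass guidelines for:\n{sketch}")

    # Stage 4: instantiate the dataclasses with verbatim spans from the
    # source, yielding `result_instances = [...]`.
    annotations = query_llm(
        f"Using these guidelines:\n{guidelines}\nExtract instances from:\n{document}"
    )
    return guidelines, annotations
```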

  1. Document Summarisation
    The model receives the entire article (roughly 1–3k words on average) and is constrained, via a bullet template and a tight token budget, to emit only atomic facts: key actors, dates, figures, and claims, stripping away narrative filler.
  2. Structured Representation
    The bullets are re-encoded into a minimal JSON sketch whose keys are coarse, on-the-fly entity/slot names and whose values are the shortest source-aligned spans that realise each fact. A dedup-and-merge pass keeps the schema compact.
  3. Guideline Generation
    From that JSON the LLM automatically writes a Python @dataclass file. Every key becomes its own dataclass with a long natural-language docstring that lists definitions and edge cases; attributes are typed as str, int, List[str], or Optional[...] so the file can be imported later.
  4. Instance Extraction
    A final prompt instantiates those dataclasses with concrete, verbatim spans from the source text, returning a single Python list, result_instances = [...]. Hallucinated values are rejected by the prompt template itself. An illustrative end-to-end example follows this list.

Automated Quality Assurance

Each generated file is immediately imported inside a unit test. Samples that fail to compile or violate their type hints are discarded, eliminating spurious labels and misaligned spans. This filter keeps only schema-consistent, executable pairs and reduces noise before training.
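The paper does not detail the exact unit test, but a filter in this spirit is straightforward to sketch: execute the generated file in a scratch namespace, discard it on any compile or runtime error, then verify that every object in result_instances is a dataclass whose field values match the declared type hints. passes_qa and the granularity of the checks below are assumptions, not the released code.

```python
import typing
from dataclasses import fields, is_dataclass

def passes_qa(source_code: str) -> bool:
    """Keep a generated (guidelines + annotations) file only if it
    compiles, runs, and its instances respect the declared type hints.
    A sketch; the paper's exact checks may differ."""
    namespace: dict = {}
    try:
        exec(compile(source_code, "<generated>", "exec"), namespace)
        instances = namespace["result_instances"]
        assert isinstance(instances, list)
        for obj in instances:
            assert is_dataclass(obj)
            hints = typing.get_type_hints(type(obj))
            for f in fields(obj):
                _check(getattr(obj, f.name), hints[f.name])
    except Exception:
        return False  # syntax error, runtime failure, or type violation
    return True

def _check(value, expected) -> None:
    """Crude structural type check covering plain types, List[...]
    and Optional[...]; raises AssertionError on mismatch."""
    origin = typing.get_origin(expected)
    if origin is typing.Union:  # Optional[T] is Union[T, None]
        allowed = typing.get_args(expected)
        assert value is None or isinstance(
            value, tuple(t for t in allowed if t is not type(None)))
    elif origin is list:
        assert isinstance(value, list)
    elif isinstance(expected, type):
        assert isinstance(value, expected)
```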

Figure: the GuideX four-stage pipeline.

Results

Table 2 – ablation
Table 3 – zero-shot results
Table 4 – per-label gains

Step-by-Step Prompts

Full prompt templates

Citation

@misc{delafuente2025guidexguidedsyntheticdata,
  title={GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction},
  author={Neil De La Fuente and Oscar Sainz and Iker García-Ferrero and Eneko Agirre},
  year={2025},
  eprint={2506.00649},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.00649},
}