Corpus annotation

  • The annotation process for NLP consists of the following steps:
    (1) Identifying and preparing a selection of representative texts as starting material for the ‘training corpus’ (sometimes called the ‘training suite’).
    (2) Instantiating a given linguistic theory or linguistic concept, to specify the set of tags to use, their conditions of applicability, etc. This step includes beginning to write the annotator instructions (often called the Codebook or Manual).
    (3) Annotating some fragment of the training corpus, in order to determine the feasibility of both the instantiation and the annotator Manual.
    (4) Measuring the results (comparing the annotators’ decisions) and deciding which measures are appropriate, and how they should be applied.
    (5) Determining what level of agreement is to be considered satisfactory (too little agreement means too little consistency in the annotation to enable machine learning algorithms to be trained successfully). If the agreement is not (yet) satisfactory, the process repeats from step 2, with appropriate changes to the theory, its instantiation, the Manual, and the annotator instructions. Otherwise, the process continues to step 6.
    (6) Annotating a large portion of the corpus, possibly over several months or years, with many intermediate checks, improvements, etc.
    (7) When sufficient material has been annotated, training the automated NLP machine learning technology on a portion of the training corpus and measuring its performance on the remainder (i.e., comparing its results when applied to the remaining text, often called the ‘held-out data’, to the decisions of the annotators).
    (Eduard Hovy, p. 4)
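  • Steps 4 and 5 above hinge on measuring inter-annotator agreement and comparing it to a threshold. A minimal sketch of one common chance-corrected measure, Cohen’s kappa, for two annotators (the labels and the 0.67 threshold below are illustrative assumptions, not from the source):

    ```python
    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two annotators' label sequences."""
        assert len(labels_a) == len(labels_b) and labels_a
        n = len(labels_a)
        # Observed agreement: fraction of items where the two annotators match.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Expected chance agreement, from each annotator's label distribution.
        freq_a = Counter(labels_a)
        freq_b = Counter(labels_b)
        p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Two annotators tagging the same six tokens (hypothetical POS labels).
    ann1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
    ann2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]

    kappa = cohen_kappa(ann1, ann2)  # ≈ 0.478 for this toy data
    # Step 5: compare against a project-chosen threshold before scaling up.
    if kappa < 0.67:  # hypothetical cutoff; projects set their own
        print("Agreement too low: revise the Manual and re-annotate (back to step 2)")
    else:
        print("Agreement satisfactory: proceed to large-scale annotation (step 6)")
    ```

    In practice projects often use multi-annotator generalizations such as Fleiss’ kappa or Krippendorff’s alpha, but the loop is the same: measure, compare to the agreed threshold, and either revise the instructions or proceed.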


  • Corpus annotation, sometimes called ‘tagging’, can be broadly conceptualized as the process of enriching a corpus by adding linguistic and other information, inserted by humans or machines (or a combination of them) in service of a theoretical or practical goal. Neither manual nor automated annotation is infallible, and both have advantages. (Eduard Hovy, p. 1)