Visit us on GitHub

© 2022 JULIE Lab





The FSU PRotein GEne corpus was developed at the JULIE Lab Jena under the supervision of Prof. Udo Hahn.
The executing scientist was Dr. Joachim Wermter.
The main annotator was Dr. Rico Pusch, who is an expert in biology.
The corpus was developed in the context of the StemNet project.

The goals of the annotation project were

The corpus has the following annotation levels / entity types:

For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file that should be found in the same archive as this readme.

To achieve broad coverage of biological subdomains, documents from multiple other protein/gene corpora were reannotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is the union of all documents in the different subcorpora. Each subcorpus is stored in its own directory as follows:

All documents are delivered as MMAX2 annotation projects.
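
MMAX2 projects keep annotations as stand-off markables that refer to tokens through a span attribute. The following is a minimal sketch, assuming the usual MMAX2 span notation (a range written as word_3..word_5, with discontinuous parts separated by commas). The function name expand_span is hypothetical, and the word_ ID prefix is an assumption that should be checked against the files in the archive.

```python
from typing import List


def expand_span(span: str) -> List[str]:
    """Expand an MMAX2-style span string into the list of token IDs it covers.

    Assumed notation: 'word_3..word_5' is an inclusive range,
    and commas separate the parts of a discontinuous span.
    """
    token_ids: List[str] = []
    for part in span.split(","):
        part = part.strip()
        if not part:
            continue
        if ".." in part:
            start, end = part.split("..")
            # Split 'word_3' into the prefix 'word' and the number '3'.
            prefix, first = start.rsplit("_", 1)
            last = end.rsplit("_", 1)[1]
            token_ids.extend(
                f"{prefix}_{i}" for i in range(int(first), int(last) + 1)
            )
        else:
            # A single token ID such as 'word_9'.
            token_ids.append(part)
    return token_ids


if __name__ == "__main__":
    print(expand_span("word_3..word_5,word_9"))
```

The resulting IDs can then be looked up in the token file of the same document to recover the annotated text.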

Corpus statistics:

   documents 3309 (3308 since v1.1)
   sentences 36223
   tokens 960757
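
For orientation, average sizes can be derived from the figures above. This is a small sketch that assumes the sentence and token counts belong to the v1.0 document count of 3309; the readme does not say which version they were measured on, and the variable names are illustrative.

```python
# Figures copied from the corpus statistics above.
DOCUMENTS = 3309  # v1.0 count; v1.1 lists 3308 (assumption: other counts match v1.0)
SENTENCES = 36223
TOKENS = 960757

# Simple averages over the whole corpus.
sentences_per_document = SENTENCES / DOCUMENTS
tokens_per_sentence = TOKENS / SENTENCES
tokens_per_document = TOKENS / DOCUMENTS

if __name__ == "__main__":
    print(f"sentences per document: {sentences_per_document:.2f}")
    print(f"tokens per sentence:    {tokens_per_sentence:.2f}")
    print(f"tokens per document:    {tokens_per_document:.2f}")
```

Under that assumption, an abstract has roughly 11 sentences and a sentence has roughly 26.5 tokens on average.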


March 26, 2019, v1.1:

Download the corpus: v1.0 (10MB), v1.1 (19.5MB).
