# EXMARaLDA (Dulko)

This repository used to provide EXMARaLDA (Dulko) – a toolset for the EXMARaLDA Partitur-Editor for the annotation of learner data in learner corpora. From 2016 to 2021, it was developed separately from EXMARaLDA mainline by Andreas Nolda for the Dulko learner-corpus project. In 2018, this work was awarded the Innovation Prize in the engineering category by the University of Szeged.

Since July 2023, the Dulko toolset has been an integrated part of the release version of the EXMARaLDA Partitur-Editor. Development and support of the Dulko tools continue on the EXMARaLDA GitHub repository.

Any existing user of EXMARaLDA (Dulko) is strongly encouraged to use the current EXMARaLDA release with the integrated Dulko tools instead. Installation and configuration are much easier now, and various Dulko components have been improved and generalised for the annotation of data beyond German learner corpora.

The original README below is provided for reference only.

EXMARaLDA (Dulko) is a set of tools for the EXMARaLDA Partitur-Editor with transformation scenarios (actually, XSLT 2.0 stylesheets) for the annotation of learner data in learner corpora, supporting tokenisation, part-of-speech tagging, lemmatisation, sentence-span computation, editing target hypotheses, detection of differences between target hypotheses and the learner text, error analysis, and metadata management (Hirschmann and Nolda 2019, Nolda 2019). It has been developed for the Dulko learner-corpus project at the University of Szeged.

This repository provides the sources of EXMARaLDA (Dulko). The latest release is available as a ZIP archive exmaralda-dulko-<VERSION>.zip which contains, in particular, an executable for Microsoft Windows (exmaralda-dulko.exe) as well as start-up scripts for MacOS (exmaralda-dulko.command) and Linux (exmaralda-dulko.sh).

# Installation instructions

  1. Unless already installed, install a Java runtime environment (JRE) or Java development kit (JDK), e.g. Oracle Java[^1] or Amazon Corretto; on Linux, you can also use OpenJDK. Note that currently, Java version 8 is required.

  2. Unless already installed, download TreeTagger and install it into some directory <DIR1>.

    On Microsoft Windows, extract the downloaded ZIP archive into C:\Program Files\TreeTagger or another directory. Note this directory for future reference.

    On MacOS, there should be a directory called tree-tagger-MacOSX-<version> or similar in the Downloads folder. Drag this directory into the Applications folder or onto the desktop and rename it to tree-tagger.

    On Linux, extract the downloaded TAR.GZ archive into /opt/tree-tagger or $HOME/tree-tagger.

    After the installation, there should be a directory <DIR1>/bin with the binary tree-tagger.exe on Microsoft Windows or tree-tagger on MacOS and Linux.

  3. Create a subdirectory lib in the TreeTagger directory <DIR1>.

  4. Download the German parameter file for TreeTagger.

    On Microsoft Windows, uncompress this GZ file with 7-Zip or another tool, rename it to german-utf8.par and copy or move this file to <DIR1>\lib.

    On MacOS, there should be a file called german.par in the Downloads folder. Rename it to german-utf8.par and drag it into <DIR1>/lib.

    On Linux, uncompress the GZ file, rename it to german-utf8.par, and copy or move it into <DIR1>/lib.

  5. Unless already installed, download the release version of EXMARaLDA (1.6.1) corresponding to your system and install it into some directory <DIR2>. Note that EXMARaLDA (Dulko) no longer works with older versions of EXMARaLDA.

    On Microsoft Windows, it is recommended to use the default path for <DIR2> (typically, C:\Program Files\EXMARaLDA).

    If you are running MacOS and have Oracle Java installed on your system, you only need the Partitur-Editor disk image for Oracle Java, which you can install by dragging the icon called PartiturEditor_OJ in PartiturEditor_OJ.dmg into the Applications folder or onto the desktop.

    On Linux, install EXMARaLDA into /opt/exmaralda or $HOME/exmaralda.

  6. Download EXMARaLDA (Dulko) and install it into some directory <DIR3>.

    On Microsoft Windows, you can use the setup program exmaralda-dulko-<VERSION>-setup.exe for this task; it is included in the downloaded ZIP archive. Please note that on this system, the installation directory <DIR3> must be a sibling directory of <DIR2>, which is the setup program’s default (typically, C:\Program Files\EXMARaLDA (Dulko)).

    On MacOS, there should be a directory called exmaralda-dulko-<VERSION> in the Downloads folder. While you may run EXMARaLDA (Dulko) from there, it is recommended to drag the directory into the Applications folder or onto the desktop and rename it to exmaralda-dulko.

    On Linux, extract the ZIP archive to /opt/exmaralda-dulko, $HOME/exmaralda-dulko, or another directory of your choice.
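
Before moving on to the configuration, it can help to confirm that steps 2 to 4 left TreeTagger in the layout described above. The following is a minimal sketch in Python, not part of EXMARaLDA (Dulko); the function name `missing_treetagger_files` is hypothetical, and the file names are the ones given in the steps above.

```python
from pathlib import Path


def missing_treetagger_files(treetagger_dir, windows=False):
    """Return the expected TreeTagger files that are absent below treetagger_dir."""
    root = Path(treetagger_dir)
    # The binary is called tree-tagger.exe on Microsoft Windows, tree-tagger elsewhere.
    binary = "tree-tagger.exe" if windows else "tree-tagger"
    expected = [root / "bin" / binary, root / "lib" / "german-utf8.par"]
    return [str(path) for path in expected if not path.is_file()]
```

Calling `missing_treetagger_files("/opt/tree-tagger")` returns an empty list when both the binary and the German parameter file are in place, and the list of missing paths otherwise.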

# Configuration instructions

  1. On Microsoft Windows, search for SystemPropertiesAdvanced and create a system environment variable named TREETAGGER_HOME whose value is the path to the TreeTagger directory <DIR1> that you noted when installing TreeTagger.

    On MacOS and Linux, the environment variable TREETAGGER_HOME is set by the start-up script exmaralda-dulko.command or exmaralda-dulko.sh in <DIR3> (unless already set by the environment). If you have installed TreeTagger into one of the directories recommended in the installation instructions above, nothing needs to be done. If you have installed it into a non-standard directory <DIR1>, open the start-up script with a text editor and set the variable TREETAGGER_HOME to <DIR1>.

  2. If you have installed EXMARaLDA into a non-standard directory <DIR2> on MacOS or Linux, set the variable EXMARALDADIR in the start-up script exmaralda-dulko.command or exmaralda-dulko.sh to <DIR2>.

  3. Run EXMARaLDA (Dulko).

    In order to run EXMARaLDA (Dulko) on Microsoft Windows, click on the EXMARaLDA (Dulko) icon on the desktop or run it from the EXMARaLDA submenu in the start menu.

    On MacOS, run the start-up script exmaralda-dulko.command in <DIR3>. If the script cannot be run with a double click, right-click on it and open it with the terminal.

    On Linux, run the start-up script exmaralda-dulko.sh in <DIR3>. If you add export PATH=<DIR3>:$PATH to /etc/profile or $HOME/.profile and copy or move the desktop file exmaralda-dulko.desktop from <DIR3> to /usr/local/share/applications or $HOME/.local/share/applications, you can also run EXMARaLDA (Dulko) from your desktop’s application menu.

  4. Open the annotation panel (‘View’ > ‘Annotation panel’) and open the file <DIR3>/annotation-panel.xml.

  5. Optionally, open the preferences (‘Edit’ > ‘Preferences’), switch to the ‘Stylesheets’ tab, and set the ‘Transcription to format table’ stylesheet to <DIR3>/format-table.xsl.
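
The rule from step 1 (use TREETAGGER_HOME if the environment already sets it, otherwise fall back to a recommended installation directory) can be pictured as follows. This is only an illustrative Python sketch: the actual start-up scripts are shell scripts whose logic may differ, and `resolve_treetagger_home` is a hypothetical name.

```python
from pathlib import Path


def resolve_treetagger_home(environ, candidates):
    """Return TREETAGGER_HOME from environ, else the first existing candidate, else None."""
    preset = environ.get("TREETAGGER_HOME")
    if preset:
        # A value set by the environment takes precedence over any default.
        return preset
    for candidate in candidates:
        if Path(candidate).is_dir():
            return str(candidate)
    return None
```

On Linux, for instance, `resolve_treetagger_home(os.environ, ["/opt/tree-tagger", Path.home() / "tree-tagger"])` would try the two directories recommended in the installation instructions (this call additionally needs `import os`).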

# Usage instructions

  1. Open <DIR3>/dulko.template.exb in EXMARaLDA (Dulko) (‘File’ > ‘Open’) and save it under a new name (‘File’ > ‘Save as’).[^2]

  2. Open the metainformation dialog (‘Transcription’ > ‘Metainformation’) and edit general metadata.

  3. Open the speakertable (‘Transcription’ > ‘Speakertable’) and edit the speaker metadata.[^3]

  4. In the main window, write or paste the learner text into one or several cells of the first tier. You can also first work on a proper part of the learner text (e.g. the first sentence) and add further parts later on.

  5. Apply the transformation scenario ‘Dulko: word-Spur (Lernertext)’ (‘Transcription’ > ‘Transformation’), which tokenises the learner text and normalises punctuation marks.

  6. If you want to annotate editorial changes by the learner, apply the transformation scenario ‘Dulko: orig-Spur (Lernertext)’, which adds a tier for the original, unchanged learner text. When editing this tier, you can use the symbols ¶, |, -, and _ for marking paragraph breaks, line breaks, hyphenations, and omissions, respectively.[^4]

  7. Apply the transformation scenario ‘Dulko: S-, pos- und lemma-Spuren (Lernertext)’ for part-of-speech tagging, lemmatisation, and sentence-span identification of the learner text.[^5]

  8. If you have added a tier for the original learner text in step 6, apply the transformation scenario ‘Dulko: Diff-Spur (Lernertext)’, which detects editorial changes.

  9. If you have used some of the symbols ¶, |, -, or _ mentioned above in step 6 on the tier for the original learner text, apply the transformation scenario ‘Dulko: Layout-Spur (Lernertext)’, which automatically tags those symbols.

  10. Optionally, apply the transformation scenario ‘Dulko: Graph-Spur (Lernertext)’, which adds a tier on which you can tag graphical renditions of the learner text by means of the annotation panel.

  11. Apply the transformation scenario ‘Dulko: trans-Spur (Lernertext)’ in case the learner text is a translation. Write or paste the text translated by the learner into the cells of the new tier.

  12. Apply the transformation scenario ‘Dulko: ZH- und Fehler-Spuren (1. Zielhypothese)’, which adds tiers for a target hypothesis and for error analysis. Edit the target hypothesis, and tag errors by means of the annotation panel.

  13. Apply the transformation scenario ‘Dulko: ZHS-, ZHpos- und ZHlemma-Spuren (1. Zielhypothese)’ for part-of-speech tagging, lemmatisation, and sentence-span identification of the target hypothesis.

  14. Finally, apply the transformation scenario ‘Dulko: ZHDiff-Spur (1. Zielhypothese)’, which detects differences between the target hypothesis and the learner text.[^6]
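
The sentence-span identification in steps 7 and 13 relies on the part-of-speech tags: a span is closed after every token that TreeTagger tags as `$.`. The Python sketch below illustrates only this basic rule. It is not the XSLT stylesheet itself, it ignores the abbreviation rule, and the function name and the handling of trailing tokens are assumptions made for illustration.

```python
def sentence_spans(pos_tags, final_tag="$."):
    """Return (start, end) token index pairs, end exclusive, one per sentence span."""
    spans = []
    start = 0
    for index, tag in enumerate(pos_tags):
        if tag == final_tag:
            # Close the current span right after the sentence-final punctuation.
            spans.append((start, index + 1))
            start = index + 1
    if start < len(pos_tags):
        # Assumption: trailing tokens without final punctuation form their own span.
        spans.append((start, len(pos_tags)))
    return spans
```

For the tag sequence `["PPER", "VVFIN", "$.", "PPER", "VVFIN"]`, the function returns `[(0, 3), (3, 5)]`.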

In order to annotate further target hypotheses, apply the transformation scenarios for ‘2. Zielhypothese’, ‘3. Zielhypothese’, or ‘weitere Zielhypothese’. These transformation scenarios do not operate on the learner text but on the preceding target hypothesis.
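
Difference detection compares two token sequences: the first target hypothesis with the learner text, and each further target hypothesis with the one preceding it. The following Python sketch shows the general idea using the standard library's `difflib`. It is not the algorithm of the Dulko stylesheet: it does not detect movements, and the tag name `CHA` for replacements is a hypothetical label chosen for this example.

```python
from difflib import SequenceMatcher


def diff_tokens(source_tokens, target_tokens):
    """Return (tag, source_slice, target_slice) triples for every non-matching stretch."""
    # Illustrative mapping from difflib opcodes to tags; "CHA" is a made-up name.
    tag_names = {"delete": "DEL", "insert": "INS", "replace": "CHA"}
    matcher = SequenceMatcher(a=source_tokens, b=target_tokens, autojunk=False)
    differences = []
    for opcode, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if opcode != "equal":
            differences.append(
                (tag_names[opcode], source_tokens[a_start:a_end], target_tokens[b_start:b_end])
            )
    return differences


learner = ["Ich", "habe", "gestern", "gegangen", "."]
zh1 = ["Ich", "bin", "gestern", "gegangen", "."]
zh2 = ["Ich", "bin", "gestern", "nach", "Hause", "gegangen", "."]

print(diff_tokens(learner, zh1))  # first hypothesis against the learner text
print(diff_tokens(zh1, zh2))      # second hypothesis against the first one
```

The first call reports the replacement of "habe" by "bin"; the second reports the insertion of "nach Hause", because it is compared with the first target hypothesis and not with the learner text.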

Note that you can re-apply any of the above transformation scenarios in case you want to update the corresponding tiers, e.g. in order to revise the annotations or annotate further parts of the learner text.[^7]

If required, additional timeline items can be inserted by clicking on the next timeline item and choosing ‘Timeline’ > ‘Insert timeline item’. The transformation scenario ‘Dulko: Zeitachse’, in turn, removes unused timeline items.
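
What "unused" means here can be illustrated with a short Python sketch: a timeline item counts as unused when no event starts or ends at it. The sketch assumes the element and attribute names of the EXB basic-transcription format (`common-timeline`, `tli`, `event`, `start`, `end`); it is not the ‘Dulko: Zeitachse’ stylesheet, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def remove_unused_timeline_items(exb_xml):
    """Return EXB XML text without <tli> elements that no event starts or ends at."""
    root = ET.fromstring(exb_xml)
    # Collect every timeline item id referenced by an event boundary.
    used = set()
    for event in root.iter("event"):
        used.add(event.get("start"))
        used.add(event.get("end"))
    for timeline in root.iter("common-timeline"):
        for item in list(timeline):
            if item.tag == "tli" and item.get("id") not in used:
                timeline.remove(item)
    return ET.tostring(root, encoding="unicode")
```

Given a timeline with the items T0 to T3 and events spanning T0 to T1 and T1 to T3, the function drops T2 and keeps the other three items.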

Apply the transformation scenario ‘Dulko: HTML-Version’ for exporting the table sentence-wise into an HTML file, which can be viewed and printed by means of your favourite browser.

Run ‘Transcription’ > ‘Export segmented transcription’ for exporting the table to an EXS file, which can be used in COMA and EXAKT.[^8]

Apply the transformation scenarios ‘Dulko: ANNIS-kompatible Version’ and ‘Dulko: Pepper-kompatible Metadaten-Liste’ before exporting the final EXMARaLDA file to ANNIS via Pepper. The former transformation scenario deletes redundant annotations and adds namespace prefixes like ZH1 and ZH2 to the target-hypothesis and error tiers; those namespace prefixes are needed for properly ordering the tiers in ANNIS. The latter transformation scenario outputs an attribute-value list with corpus-level metadata for Pepper (cf. Pepper’s customisation property pepper.before.readMeta).[^9]

Andreas Nolda (andreas@nolda.org)

[^1]: A user of Microsoft Windows 8.1 reported that the installation program of the Oracle Java runtime environment does not set the system environment variable JAVA_HOME to the JRE installation path, which prevented EXMARaLDA from running. Cf. the configuration instructions in this README on how to set such variables. Alternatively, you can install the Oracle Java development kit or Amazon Corretto, which both appear to properly set this variable.

[^2]: Alternatively, you may start from a blank table (‘File’ > ‘New’). Metadata can be imported from <DIR3>/dulko.template.exb by applying the transformation scenario ‘Dulko: Metadaten’.

[^3]: Part of the speaker metadata (viz. the value of the ‘Abbreviation’ field) is used to generate tier names. If changed, the tier names can be updated by means of the transformation scenario ‘Dulko: Spurnamen’.

[^4]: In order to mark a hyphenation in the learner text, the corresponding word on the tier for the original learner text has to be split into three events consisting of the first part of the word, the symbol -, and the second part of the word, respectively. Optionally, you can add a further event with the symbol | after - as an explicit line-break mark.

[^5]: The stylesheet for sentence-span tiers (Satzspannen) automatically identifies sentence spans ending in a punctuation character that TreeTagger tags as $. or ending in an abbreviation followed by a capitalised version of a non-noun. Sentence spans with different endings have to be tagged manually by splitting the corresponding sentence-span event inserted by the stylesheet; the sentence-span names can then be regenerated with the transformation scenario ‘Dulko: Satzspannen’.

[^6]: The stylesheet for difference tiers (Differenz-Spuren) tries hard to detect movement source and target pairs, which are tagged with MOV[EMENT]S[OURCE] and MOV[EMENT]T[ARGET], respectively. If unsure, it tags potential movement sources and targets with the tags MOVS/DEL and MOVT/INS, which have to be manually disambiguated (e.g. by means of the annotation panel).

[^7]: The only exception is the transformation scenario ‘Dulko: ZH- und Fehler-Spuren (weitere Zielhypothese)’, which always creates new tiers.

[^8]: In EXMARaLDA (Dulko), this menu entry runs the XSLT stylesheet exb2exs.xsl on the current EXB file.

[^9]: A build system for generating ANNIS data from EXMARaLDA sources annotated with EXMARaLDA (Dulko) is available at makeDulko.