Example of full training

Warning

the training of Sulci is a hard North face, be sure to have the minimum of French knowledge, some time, some pre-categorized texts, some fast computer, before taking this way

Warning

each algorithm needs the previous algorithm to work, so remember to train the algorithms in the order they are called.

Note

The trained data provided with the Sulci alpha version has been made with a corpus of :

  1. 30000 POS tagged words
  2. 3500 words in lexicon (lexicon must be smaller than POS corpus)
  3. FIXME: 2000 lemmatized words
  4. 40000 semantical tagged texts
  5. 17000 descriptors in thesaurus

Lexical training

First, we need to create some text corpus, in two groups:

  • one group with texts where only the POS tag for each word is set. Example:

    Tout/PRV:sg était/ECJ:sg tellement/ADV absurde/ADJ:sg et/COO compliqué/ADJ:sg

    These texts need to have the .crp extension ; this group must be bigger.

  • one other with texts where both the POS tag and the lemme are set. Example:

    Dans/PREP/dans les/DTN:pl/le faits/SBC:pl/fait ,/, la/DTN:sg/le répression/SBC:sg
    est/ECJ:sg/être contrebalancée/PAR:sg/contrebalancer

    These texts will be used to build the lexicon ; the valid extension is .lxc.lem.crp ; this group must be smaller.

Note

You can also use the algorithm to help you create the corpus : give a text to the algorithm, and correct the output.

Then, we can build the lexicon:

./manage.py sulci_train -x

This will write the new lexicon in temporary .pdg (pending) file. For now, we have to manually rename it in lexicon.lxc if the result is ok for us.

Now, we can launch the lexical training:

./manage.py sulci_train -e

or, to load-balance the work in more than one process (using zmq), here one master and 4 slaves subprocesses:

./manage.py sulci_train -e -s 4

Another time, we have to manually rename the file generated in /corpus/ from lexical_rules.pdg to lexical_rules.rls.

Then, we can launch the contextual training (remember to rename the file after):

./manage.py sulci_train -c -s 4

Lemmatization

Now, the lemmatizer trainer:

./manage.py sulci_train -r -s 4

Semantical training

Now, the last step, but the bigger : the semantical training. Here a big corpus of categorized texts is needed. For example, in Libération we are using now a corpus of 35000 texts.

Make sure you have configured the needed settings (see Installation below).

Then launch the command line:

./manage.py sulci_train -n -s 4

Postprocessing

Finally, we can clean manually to reduce noise and remove useless rows, for example, removing all synapses that have been seen just one time (triggertodescriptor.weight == 1) or those where the pondered_weight is too low (triggertodescriptor.pondered_weight < 0.01 for example). And after that, triggers with no synapse can be also deleted.

Project Versions

Table Of Contents

Previous topic

Overview

Next topic

How can I help?

This Page