Example of full training
========================

.. warning:: Training Sulci is a hard north face: make sure you have at least a minimum knowledge of French, some time, some pre-categorized texts, and a fast computer before taking this route.

.. warning:: Each algorithm depends on the previous one, so remember to train the algorithms in the order they are called.

.. note:: The trained data provided with the Sulci alpha version has been made with a corpus of:

   #. 30000 POS-tagged words
   #. 3500 words in the lexicon (the lexicon must be smaller than the POS corpus)
   #. FIXME: 2000 lemmatized words
   #. 40000 semantically tagged texts
   #. 17000 descriptors in the thesaurus

Lexical training
----------------

First, we need to create a text corpus, split into two groups:

* one group of texts where only the POS tag of each word is set. Example::

      Tout/PRV:sg était/ECJ:sg tellement/ADV absurde/ADJ:sg et/COO compliqué/ADJ:sg

  These texts need to have the `.crp` extension; this group must be the bigger one.

* another group of texts where both the POS tag and the lemma are set. Example::

      Dans/PREP/dans les/DTN:pl/le faits/SBC:pl/fait ,/, la/DTN:sg/le répression/SBC:sg est/ECJ:sg/être contrebalancée/PAR:sg/contrebalancer

  These texts will be used to build the lexicon; the valid extension is `.lxc.lem.crp`; this group must be the smaller one.

.. note:: You can also use the algorithm to help you create the corpus: give a text to the algorithm, and correct its output.

Then, we can build the lexicon::

    ./manage.py sulci_train -x

This will write the new lexicon in a temporary `.pdg` (pending) file. For now, we have to rename it manually to `lexicon.lxc` if the result is OK for us.

Now, we can launch the lexical training::

    ./manage.py sulci_train -e

or, to load-balance the work over more than one process (using zmq), here with one master and 4 slave subprocesses::

    ./manage.py sulci_train -e -s 4

Once again, we have to manually rename the file generated in `/corpus/`, from `lexical_rules.pdg` to `lexical_rules.rls`.

Then, we can launch the contextual training (remember to rename the file afterwards)::

    ./manage.py sulci_train -c -s 4

Lemmatization
-------------

Now, the lemmatizer trainer::

    ./manage.py sulci_train -r -s 4

Semantical training
-------------------

Now, the last step, but the biggest: the semantical training. Here a big corpus of categorized texts is needed; at Libération, for example, we are now using a corpus of 35000 texts.

Make sure you have configured the needed settings (see Installation below). Then launch the command line::

    ./manage.py sulci_train -n -s 4

Postprocessing
--------------

Finally, we can clean the data manually to reduce noise and remove useless rows: for example, by removing all synapses that have been seen only once (`triggertodescriptor.weight == 1`) or whose pondered weight is too low (`triggertodescriptor.pondered_weight < 0.01`, for example). After that, triggers left with no synapse can also be deleted. A sketch of this cleanup is given below.
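
For instance, assuming the synapses and triggers are exposed as Django models named `TriggerToDescriptor` and `Trigger` (hypothetical names here, chosen to match the table and column names above), the cleanup could look like this minimal sketch::

    # A minimal sketch, not part of Sulci itself: the model names, the
    # `sulci.models` import path and the reverse relation name used below
    # are assumptions.
    from sulci.models import Trigger, TriggerToDescriptor

    # Drop synapses seen only once, and those with a negligible pondered weight.
    TriggerToDescriptor.objects.filter(weight=1).delete()
    TriggerToDescriptor.objects.filter(pondered_weight__lt=0.01).delete()

    # Then drop the triggers left without any synapse.
    Trigger.objects.filter(triggertodescriptor__isnull=True).delete()

This kind of snippet can be run from a Django shell (`./manage.py shell`); back up the database first, as these deletions are not reversible.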