Bases: object
Simple abstract class to manage objects that are stored in RAM and can be retrieved.
Here, objects are created within a parent container, for example the text, a sample, a lexicon, etc. The store field is built from the name of the class.
Bases: sulci.base.RetrievableObject
A sentence of the text.
Retrieve errors by comparing attr and verified_attr. Possible values are: tag, lemme.
This method has to be called by the trainer each time a token of this sample is modified.
Bases: object
This is an abstract class for every kind of “text”, i.e. a collection of samples and tokens.
Bases: sulci.base.RetrievableObject
Simplest element of a text.
Returns the token's neighbors in the sample at the positions passed as args, if available.
E.g. token.get_neighbors(1, 2) will return the next token and the one after it.
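A minimal sketch of the neighbor lookup described above. It assumes a sample is a plain list of tokens and that positions are offsets relative to the current token. The standalone function and its signature are hypothetical, and skipping positions that fall outside the sample is an assumption that may differ from what the library does.

```python
def get_neighbors(sample, index, *positions):
    """Return the tokens of `sample` located at `index + offset` for each offset.

    Offsets that fall outside the sample are skipped (assumption of this sketch).
    """
    neighbors = []
    for offset in positions:
        target = index + offset
        if 0 <= target < len(sample):
            neighbors.append(sample[target])
    return neighbors


sample = ["Le", "chat", "dort", "bien"]
next_two = get_neighbors(sample, 1, 1, 2)   # neighbors of "chat"
near_end = get_neighbors(sample, 3, 1, 2)   # nothing exists after "bien"
```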
Bases: sulci.base.RetrievableObject
Compute scores that will be used to order, select and deduplicate key entities.
Let us define that an ngram of 10 words in a text of 100 words (a ratio of 0.1) means a confidence of 1.
Return the probability that all the terms of the ngram appear together. The point is to understand whether the terms are dependent or independent. If only some terms appear outside this context, that may be normal (for example a name, which sometimes appears with both first name and last name and sometimes with just the last name). And if these terms appear many times while the others appear only in this context, the number does not count. If NO term appears outside this context, we have a good probability of a collocation. If every term appears outside this context, and especially if this occurs often, we can doubt this collocation candidate. Should the stop_words be considered? This may affect the main confidence both negatively and positively.
Two key entities are considered duplicates if one is contained in the other.
Other is merged into self. Merging amounts to saying that other and self are the same KeyEntity, and that self is the “delegate” of other. So (this is the case if other is smaller than self) each time other appears without the specific terms of self, we consider that it is the same concept. We therefore keep the highest frequency_confidence.
This is the frequency of the entity relative to the possible entities of its length.
Number of occurrences of the ngram / number of possible ngrams / probability of each member of the ngram.
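As an illustration, the first ratio of this formula (occurrences of the ngram over the number of possible ngrams of the same length) can be sketched as follows. The function name is hypothetical, the text is assumed to be a flat list of words, and the final division by the probability of each member is deliberately left out because the source does not say how that probability is computed.

```python
def ngram_relative_frequency(words, ngram):
    """Occurrences of `ngram` divided by the number of ngrams of that length in `words`."""
    n = len(ngram)
    possible = len(words) - n + 1  # number of windows of length n in the text
    if n == 0 or possible <= 0:
        return 0.0
    target = tuple(ngram)
    occurrences = sum(
        1 for i in range(possible) if tuple(words[i:i + n]) == target
    )
    return occurrences / possible


words = "a b a b c".split()
# The 4 possible bigrams are (a, b), (b, a), (a, b), (b, c).
frequency = ngram_relative_frequency(words, ("a", "b"))
```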
Try to find a descriptor in the thesaurus, compute the Levenshtein distance, and derive a score. The score must not be lower than 1: a matching descriptor is a good sign for the collocation, but the absence of one does not mean that this is not a real collocation.
Define the probability of an ngram being a title. Factor is the maximum confidence coefficient. This must not have a negative effect, only a positive one: a title is a good candidate for a collocation, but not being a title does not mean that it is not a collocation. Two things matter here: the proportion of titles AND the number of titles. Ex.:
- “Jérôme Bourreau” is “more” title than “Bourreau”
- “projet de loi Création et Internet” is “less” title than “loi Création et Internet”
Bases: object
Main class.
If a KeyEntity is contained in another, longer one (same stemms in the same place), delete the one with the smaller confidence, or the shorter one if the confidences are equal. We have to begin with the shortest ones.
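The deduplication rule above can be sketched as follows. This is only an illustration under stated assumptions: key entities are represented as tuples of stems mapped to a confidence score, and "contained" is read as "appears as a contiguous run". Both function names are hypothetical.

```python
def is_contained(short, long):
    """True if `short` appears as a contiguous run of stems inside `long`."""
    n = len(short)
    return any(
        tuple(long[i:i + n]) == tuple(short) for i in range(len(long) - n + 1)
    )


def deduplicate(entities):
    """entities: dict mapping a tuple of stems to a confidence score."""
    kept = dict(entities)
    # Begin with the shortest entities, as the description requires.
    for short in sorted(entities, key=len):
        if short not in kept:
            continue
        for long in list(kept):
            if len(long) <= len(short) or not is_contained(short, long):
                continue
            if kept[long] < kept[short]:
                del kept[long]   # the longer entity has the smaller confidence
            else:
                del kept[short]  # smaller or equal confidence: drop the shorter one
                break
    return kept


entities = {
    ("loi",): 0.4,
    ("loi", "creation", "internet"): 0.9,
    ("bourreau",): 0.8,
    ("jerome", "bourreau"): 0.8,
}
result = deduplicate(entities)
```

With these made-up scores, the two single-stem entities are dropped: the first has a lower confidence than the entity containing it, the second has an equal confidence and is the shorter one.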
Bases: sulci.base.RetrievableObject
Subpart of text, grouped by meaning (stem). It tries to be the core meaning of a word, so many tokens can point to the same stemm. Should be renamed to Lemm, because we are talking about lemmatisation, not stemming.
Should it be taken into account as a potential KeyEntity? If count is less than x but main_occurrence is a title, we try to keep it.
Bases: sulci.base.TextManager
Basic text class, with tokens, samples, etc.
Bases: sulci.corpus.CorpusMonitor
The corpus is a collection of manually categorised texts.
We have different kinds of categorised texts:
When loading a Corpus, you’ll need to specify the kind of texts to load.
Bases: object
Convenience class to store the methods common to Corpus and TextCorpus.
Check the texts of the corpus and try to detect errors by comparing them with the lexicon.
Find occurrences of a word, a tag, or both in the loaded corpus.
Display tags usage stats.
Bases: sulci.base.TextManager, sulci.corpus.CorpusMonitor
One single text of the corpus.
This is not a raw text, but a manually categorised text.
The normalised format is: word/TAG/lemme word2/TAG2/lemme2, etc.
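A minimal sketch of how a line in this format could be read, assuming every token carries all three fields and that tokens are separated by whitespace. The TaggedToken type and the parse_tagged_text function are hypothetical names, not part of the library.

```python
from collections import namedtuple

TaggedToken = namedtuple("TaggedToken", "word tag lemme")


def parse_tagged_text(line):
    """Parse 'word/TAG/lemme word2/TAG2/lemme2' into a list of TaggedToken."""
    tokens = []
    for chunk in line.split():
        # Split from the right so that a "/" inside the word itself is preserved.
        parts = chunk.rsplit("/", 2)
        if len(parts) != 3:
            raise ValueError("malformed token: %r" % chunk)
        tokens.append(TaggedToken(*parts))
    return tokens


parsed = parse_tagged_text("chat/SBC:sg/chat vraiment/ADV/vraiment")
```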
Export tokens in a file.
Use force to export with the valid extension; otherwise the pending extension is used.
Bases: django.db.models.base.Model
Entries of the Thesaurus.
Bases: django.core.exceptions.ObjectDoesNotExist
Bases: django.core.exceptions.MultipleObjectsReturned
Bases: django.db.models.base.Model
The trigger is a key entity that suggests some descriptors when it occurs in a text. It is linked to one or more descriptors, and the distance of the link between the trigger and a descriptor is stored in the relation. This score is populated during the semantic training.
Bases: django.core.exceptions.ObjectDoesNotExist
Bases: django.core.exceptions.MultipleObjectsReturned
Create a connection with the descriptor if it does not exist yet. In every case, update the connection weight. Delete the connection if the score is negative.
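The create, update and delete logic above can be sketched with a plain dictionary standing in for the database relation. This assumes that the update is additive and that "the score is negative" refers to the weight obtained after the update. The function signature is hypothetical.

```python
def connect(connections, descriptor, score):
    """connections: dict mapping a descriptor name to its weight for one trigger."""
    # Create the connection if it does not exist yet, then update its weight.
    weight = connections.get(descriptor, 0.0) + score
    if weight < 0:
        # A negative weight means the link is no longer worth keeping.
        connections.pop(descriptor, None)
    else:
        connections[descriptor] = weight
    return connections


links = {}
connect(links, "sport", 2.0)
connect(links, "sport", 1.0)
after_updates = dict(links)
connect(links, "sport", -5.0)
```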
Bases: django.db.models.base.Model
This is the “synapse” of the trigger to descriptor relation.
Bases: django.core.exceptions.ObjectDoesNotExist
Bases: django.core.exceptions.MultipleObjectsReturned
Give the weight of the relation, relative to the max weight of the trigger and the max weight of the descriptor.
Bases: sulci.trainers.POSTrainer
alias of ContextualTemplateGenerator
Bases: sulci.trainers.RuleTrainer
Train the Lemmatizer.
Given a set of candidate rules for correcting some errors, select the one that corrects the most cases and creates the fewest errors.
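One way to read this selection step is as a score equal to the number of corrections minus the number of new errors. The sketch below assumes each candidate comes with those two counts already computed. The function name, the tie-break on fewer new errors, and the counts in the example are all made up for illustration.

```python
def select_best_rule(candidates):
    """candidates: iterable of (rule, corrected, introduced) tuples.

    Returns a (rule, score) tuple where score = corrected - introduced.
    """
    # Prefer the best net score; on a tie, prefer the rule creating fewer errors.
    best = max(candidates, key=lambda c: (c[1] - c[2], -c[2]))
    rule, corrected, introduced = best
    return rule, corrected - introduced


candidates = [
    ("ment hassuf 4 ADV", 12, 3),
    ("re deletepref 2 VNCFF", 10, 0),
    ("er addpref 2 VNCFF", 4, 1),
]
best = select_best_rule(candidates)
```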
alias of LemmatizerTemplateGenerator
Bases: sulci.trainers.POSTrainer
alias of LexicalTemplateGenerator
Bases: sulci.trainers.RuleTrainer
POS tagger trainer.
Bases: object
Main class for rule-based trainers, used to factor out common code.
Bases: object
Create and update triggers, and compute the trigger-to-descriptor weighting.
Bases: sulci.rules_templates.LemmatizerBaseTemplate
Lowercase the original word if the tag is x.
Bases: sulci.rules_templates.WordBasedTemplate
The word is X. I have doubts about the usefulness of this template...
Bases: sulci.rules_templates.RuleTemplate
Base class for the contextual rules.
Bases: type
Export rules to the provisional config file.
rules are tuples (rule, score).
Returns an instance of a rule, from a template name or a rule string.
s can be template name or rule.
Bases: sulci.rules_templates.LemmatizerBaseTemplate
Give lemme y, if the tag is x.
Bases: sulci.rules_templates.WordBasedTemplate
One of the next three tokens is word X.
Bases: sulci.base.RetrievableObject
For the Lemmatizer training, there is just one template: it creates as many possible rules as there are letters in the tested token. MAKELOWER, GIVELEMME, CHANGESUFFIX
Bases: type
Bases: sulci.rules_templates.RuleTemplate
Base class for the lexical rules.
Bases: type
Bases: sulci.rules_templates.LexicalBaseTemplate
Base template for rules that have to check the lexicon.
Bases: sulci.rules_templates.LemmatizerBaseTemplate
Lowercase the original word if the tag is x.
Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.TagBasedTemplate
One of the next three words is tagged X.
Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.TagBasedTemplate
One of the next three tokens is tagged X.
Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.WordBasedTemplate
One of the next two tokens is word X.
Bases: sulci.rules_templates.TagBasedTemplate
The token after the next token is tagged X.
Bases: sulci.rules_templates.WordBasedTemplate
One of the next three tokens is word X.
Bases: sulci.rules_templates.WordBasedTemplate
The next two words are X and Y.
Bases: sulci.rules_templates.TagBasedTemplate
The next token is tagged X.
Bases: sulci.rules_templates.WordBasedTemplate
One of the next three tokens is word X.
Bases: sulci.rules_templates.ContextualBaseTemplate
Abstract class for templates that do not check a specific position.
Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.TagBasedTemplate
One of the next three tokens is tagged X.
Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.TagBasedTemplate
One of the next three tokens is tagged X.
Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.WordBasedTemplate
One of the next two tokens is word X.
Bases: sulci.rules_templates.TagBasedTemplate
The token after the next token is tagged X.
Bases: sulci.rules_templates.WordBasedTemplate
One of the next three tokens is word X.
Bases: sulci.rules_templates.WordBasedTemplate
The previous two words are X and Y.
Bases: sulci.rules_templates.TagBasedTemplate
The next token is tagged X.
Bases: sulci.rules_templates.WordBasedTemplate
One of the next three tokens is word X.
Bases: sulci.rules_templates.WordBasedTemplate
One of the next three tokens is word X.
Bases: sulci.base.RetrievableObject
Class for managing rules creation and analysis.
Bases: sulci.rules_templates.TagBasedTemplate
The preceding word is tagged x and the following word is tagged y.
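A contextual rule built from such a template can be illustrated as a transformation over a sequence of tags. The sketch below assumes a rule of the form "change tag A to tag B when the preceding tag is x and the following tag is y". The function name and argument order are hypothetical, and reading the conditions from the original sequence rather than the partially updated one is a choice of this sketch.

```python
def apply_surround_tag_rule(tags, from_tag, to_tag, prev_tag, next_tag):
    """Return a copy of `tags` where from_tag becomes to_tag between prev_tag and next_tag."""
    result = list(tags)
    for i in range(1, len(tags) - 1):
        # Conditions are read from the original sequence, not the updated copy.
        if (
            tags[i] == from_tag
            and tags[i - 1] == prev_tag
            and tags[i + 1] == next_tag
        ):
            result[i] = to_tag
    return result


tags = ["DTC:sg", "VNCFF", "ADV"]
retagged = apply_surround_tag_rule(tags, "VNCFF", "SBC:sg", "DTC:sg", "ADV")
```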
Bases: sulci.rules_templates.ContextualBaseTemplate
Abstract class for tag-based templates.
Bases: sulci.rules_templates.ContextualBaseTemplate
Abstract class for mixed templates: tag, then word.
Bases: sulci.rules_templates.WordBasedTemplate
One of the next three tokens is word X.
Bases: sulci.rules_templates.WordBasedTemplate
One of the next three tokens is word X.
Bases: sulci.rules_templates.WordTagBasedTemplate
Current word, and tag of the token two positions after.
Bases: sulci.rules_templates.TagWordBasedTemplate
Current word, and tag of the token two positions before.
Bases: sulci.rules_templates.WordTagBasedTemplate
Current word, and tag of the next token.
Bases: sulci.rules_templates.TagWordBasedTemplate
Current word, and tag of the previous token.
Bases: sulci.rules_templates.ContextualBaseTemplate
Abstract class for word-based templates.
Bases: sulci.rules_templates.ContextualBaseTemplate
Abstract class for mixed templates: word, then tag.
Bases: sulci.rules_templates.LexiconCheckTemplate
Change current tag to tag X, if adding prefix Y leads to an entry of the lexicon.
Prefix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Y addpref len(Y) X
Ex.: er addpref 2 VNCFF
Bases: sulci.rules_templates.LexiconCheckTemplate
Change current tag to tag X, if adding suffix Y leads to an entry of the lexicon.
Suffix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Y addsuf len(Y) X
Ex.: re addsuf 2 VNCFF
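To make the rule syntax concrete, here is a minimal sketch of how an addsuf rule string could be parsed and applied. It assumes the lexicon can be treated as a set of known words. The function name and the validation it performs are hypothetical, not the library's implementation.

```python
def apply_addsuf(rule, word, tag, lexicon):
    """Apply a rule such as 're addsuf 2 VNCFF' to one (word, tag) pair."""
    suffix, name, length, new_tag = rule.split()
    if name != "addsuf" or len(suffix) != int(length):
        raise ValueError("not a valid addsuf rule: %r" % rule)
    # Change the tag only if word + suffix is a known lexicon entry.
    if word + suffix in lexicon:
        return new_tag
    return tag


lexicon = {"prendre"}
changed = apply_addsuf("re addsuf 2 VNCFF", "prend", "SBC:sg", lexicon)
unchanged = apply_addsuf("re addsuf 2 VNCFF", "chat", "SBC:sg", lexicon)
```

The other lexicon-checking templates (addpref, deletepref, deletesuf) would differ only in how the candidate word is built before the lookup.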
Bases: sulci.rules_templates.LexiconCheckTemplate
Change current tag to tag X, if removing prefix Y leads to an entry of the lexicon.
Prefix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Y deletepref len(Y) X
Ex.: re deletepref 2 VNCFF
Bases: sulci.rules_templates.LexiconCheckTemplate
Change current tag to tag X, if removing suffix Y leads to an entry of the lexicon.
Bases: sulci.rules_templates.addpref
Change current tag to tag X, if adding prefix Y leads to an entry of the lexicon and if current tag is Z.
Prefix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Z Y faddpref len(Y) X
Ex.: SBC:sg re faddpref 2 VNCFF
Bases: sulci.rules_templates.addsuf
Change current tag to tag X, if adding suffix Y leads to an entry of the lexicon and if current tag is Z.
Suffix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Z Y faddsuf len(Y) X
Ex.: SBC:sg re faddsuf 2 VNCFF
Bases: sulci.rules_templates.deletepref
Change current tag to tag X, if removing prefix Y leads to an entry of the lexicon and if current tag is Z.
Prefix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Z Y fdeletepref len(Y) X
Ex.: ADV re fdeletepref 2 VNCFF
Bases: sulci.rules_templates.deletesuf
Change current tag to tag X, if removing suffix Y leads to an entry of the lexicon and if current tag is Z.
Bases: sulci.rules_templates.haspref
Change current tag to tag X, if prefix is Y and current tag is Z.
Prefix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Z Y fhaspref len(Y) X
Ex.: ADV bla fhaspref 3 DTC:sg
Bases: sulci.rules_templates.hassuf
Change current tag to tag X, if suffix is Y and current tag is Z.
Suffix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Z Y fhassuf len(Y) X
Ex.: SBC:sg ment fhassuf 4 ADV
Bases: sulci.rules_templates.NoLexiconCheckTemplate, sulci.rules_templates.ProximityCheckTemplate
The current word is to the right of the word X.
Bases: sulci.rules_templates.NoLexiconCheckTemplate, sulci.rules_templates.ProximityCheckTemplate
The current word is to the right of the word X.
Bases: sulci.rules_templates.NoLexiconCheckTemplate
Change current tag to tag X, if prefix is Y.
Prefix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Y haspref len(Y) X
Ex.: pro haspref 3 SBC:sg
Bases: sulci.rules_templates.NoLexiconCheckTemplate
Change current tag to tag X, if suffix is Y.
Suffix Y has a length from 1 to 4 (len(Y) <= 4).
Syntax: Y hassuf len(Y) X
Ex.: ment hassuf 4 ADV
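Unlike the lexicon-checking templates, a hassuf rule only inspects the word itself. A minimal sketch, with a hypothetical function name and validation that are assumptions of this example:

```python
def apply_hassuf(rule, word, tag):
    """Apply a rule such as 'ment hassuf 4 ADV' to one (word, tag) pair."""
    suffix, name, length, new_tag = rule.split()
    if name != "hassuf" or len(suffix) != int(length) or not 1 <= len(suffix) <= 4:
        raise ValueError("not a valid hassuf rule: %r" % rule)
    # No lexicon lookup here: the suffix of the word alone decides.
    return new_tag if word.endswith(suffix) else tag


with_suffix = apply_hassuf("ment hassuf 4 ADV", "rapidement", "SBC:sg")
without_suffix = apply_hassuf("ment hassuf 4 ADV", "chat", "SBC:sg")
```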
Define the Lexicon class.
For now, the lexicon is stored in a flat file, with a special syntax:
Bases: sulci.base.TextManager
The lexicon is a list of unique words and their possible POS tags.
Build the list of factors (pieces of word).
These factors are used by the POStagger to determine whether an unknown word could be a derivative of another.
Utility method to help spot errors in the Lexicon. It displays the entries that have several tags, in case they are erroneous duplicates.
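The check described above can be sketched as a filter over the lexicon, assuming it is available as a mapping from each word to its collection of POS tags. The function name and the in-memory representation are hypothetical; the source only says the lexicon is stored in a flat file.

```python
def entries_with_several_tags(lexicon):
    """lexicon: dict mapping a word to a collection of POS tags.

    Returns only the entries that carry more than one distinct tag.
    """
    return {
        word: sorted(set(tags))
        for word, tags in lexicon.items()
        if len(set(tags)) > 1
    }


lexicon = {
    "rapidement": ["ADV"],
    "manger": ["VNCFF", "SBC:sg"],
}
suspects = entries_with_several_tags(lexicon)
```

Entries returned this way are only candidates for review, since a word can legitimately have several tags.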