Sulci Package

base Module

class sulci.base.RetrievableObject[source]

Bases: object

Simple abstract class to manage RAM-stored, retrievable objects.

classmethod get_or_create(ref, parent_container, **kwargs)[source]

Here, objects are created within a parent container: for example the text, a sample, or a lexicon. The store field is built from the name of the class.

classmethod make_key(expression)[source]

Standardize the expression and return a tuple that maximises matching possibilities. expression must be a list, a tuple, a string or a unicode object.
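A minimal sketch of what such a normalization might look like (illustrative only; sulci's actual make_key may differ): split a plain string into words, then lowercase and strip each word into a tuple.

```python
def make_key(expression):
    # Accept a plain string as well as a list/tuple of words.
    if isinstance(expression, str):
        expression = expression.split()
    # Lowercase and strip each word so that lookups match loosely.
    return tuple(word.strip().lower() for word in expression)
```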

classmethod sort(seq, attr, reverse=True)[source]
class sulci.base.Sample(pk, parent=None, **kwargs)[source]

Bases: sulci.base.RetrievableObject

A sentence of the text.

append(item)[source]
get_errors(attr='tag')[source]

Retrieve errors by comparing attr and verified_attr. Possible values are: tag, lemme.

has_position(pos)[source]
is_token(stemm, position)[source]

Check if there is stemm “stemm” in position “position”.

meaning_words_count()[source]
reset_trainer_status()[source]

This method has to be called by the trainer each time a token of this sample is modified.

set_trained_position(pos)[source]

This method has to be called by the trainer each time a token is processed but not corrected.

show_context(position)[source]

Returns a string of the tokens around some position of the sample.

class sulci.base.TextManager[source]

Bases: object

This is an abstract class for all the “texts”, i.e. collections of samples and tokens.

PENDING_EXT = None
VALID_EXT = None
get_files(kind)[source]
instantiate_text(text)[source]

Return samples and tokens. The text is tokenized; each token is: original + optional verified_tag (for training).

load_valid_files()[source]
pending_files[source]
tokenize(text)[source]
valid_files[source]
class sulci.base.Token(pk, original, parent=None, position=0, **kwargs)[source]

Bases: sulci.base.RetrievableObject

Simplest element of a text.

begin_of_sample(previous_token)[source]
get_neighbors(*args)[source]

Returns the token’s neighbors in the sample at the positions passed as args, if available.

E.g. token.get_neighbors(1, 2) will return the next token and the one after it.
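The lookup described above can be sketched as a standalone function; returning an empty list when any requested neighbor is missing is an assumption, not sulci's documented contract.

```python
def get_neighbors(tokens, position, *offsets):
    # Collect the token at position + offset for each requested offset.
    neighbors = []
    for offset in offsets:
        index = position + offset
        if not 0 <= index < len(tokens):
            return []  # one requested neighbor is missing: return nothing
        neighbors.append(tokens[index])
    return neighbors
```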

has_meaning()[source]

What about isdigit?

has_meaning_alone()[source]

Do we take it into account if it is alone?

has_verified_tag(tag)[source]
is_avoir()[source]
is_closing_quote()[source]
is_etre()[source]
is_neighbor(candidates)[source]

Return True if the word appears with the right neighbours, False otherwise. candidates is a tuple (Stemm object, distance).

is_opening_quote()[source]
is_strong_punctuation()[source]
is_tagged(tag)[source]
is_tool_word()[source]

Try to determine if this word is a “mot outil” (function word).

is_verb()[source]

We don’t take into account the verbs être and avoir.

istitle()[source]

Determine if the token is a title, using its tag.

lower()[source]
next_bigram[source]

Return the next two tokens, or None if there are not two tokens after.

previous_bigram[source]

Return the previous two tokens, or None if there are not two tokens before.

sample[source]

For backward compatibility.

show_context()[source]

textmining Module

class sulci.textmining.KeyEntity(pk, **kwargs)[source]

Bases: sulci.base.RetrievableObject

collocation_confidence[source]
compute_confidence()[source]

Compute the scores that will be used to order, select, and deduplicate keyentities.

confidence[source]
count = 0
frequency_confidence()[source]

Let’s define that an ngram count of 10 in a text of 100 words means a confidence of 1, so each occurrence counts for 0.1.

frequency_relative_pmi_confidence[source]
heuristical_mutual_information_confidence()[source]

Return the probability of all the terms of the ngram appearing together. The point is to understand the dependence or independence of the terms. If just some terms appear out of this context, it may be normal (for example a name, which appears sometimes with both firstname and lastname and sometimes with just the lastname). And if these terms appear many, many times while some others appear just in this context, the number doesn’t count. If NO term appears out of this context, we have a good probability of a collocation. If each term appears out of this context, and especially if this occurs often, we can doubt this collocation candidate. Should we consider the stop words? This may affect the main confidence both negatively and positively.

index(key)[source]
is_duplicate(KeyEntity)[source]

Say two keyentities are duplicates if one is contained in the other.
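A hypothetical sketch of this containment test, representing a keyentity as a tuple of stemm strings (an assumption made for illustration):

```python
def is_duplicate(stemms_a, stemms_b):
    # Normalize both keyentities to tuples and order them by length.
    shorter, longer = sorted((tuple(stemms_a), tuple(stemms_b)), key=len)
    n = len(shorter)
    # One is a duplicate of the other if the shorter sequence occurs
    # contiguously ("same stemms in the same place") inside the longer.
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))
```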

is_equal(other)[source]

This is for confidence and length comparison, NOT for content comparison.

istitle()[source]

A KeyEntity is a title when all its stemms are titles.

keyconcept_confidence[source]
merge(other)[source]

Other is merged into self. Merging means that other and self are considered the same KeyEntity, with self the “delegate” of other. So (this is the case when other is smaller than self) each time other appears without the specific terms of self, we consider that it is the same concept. So we keep the highest frequency_confidence.

nrelative_frequency_confidence()[source]

This is the frequency of the entity relative to the possible entities of its length.

pos_confidence()[source]

Give a score linked to the POS of the subelements.

statistical_mutual_information_confidence()[source]

Number of occurrences of the ngram / number of possible ngrams / probability of each member of the ngram.
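Read literally, the formula above could be sketched as a hypothetical standalone function (not sulci's code): the observed joint probability divided by the probability expected under independence of the members.

```python
from functools import reduce

def statistical_mi_confidence(ngram_count, possible_ngrams, term_probs):
    # Observed probability of the ngram: occurrences / possible ngrams.
    joint = ngram_count / possible_ngrams
    # Expected probability if the members were independent.
    independent = reduce(lambda a, b: a * b, term_probs, 1.0)
    return joint / independent
```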

thesaurus_confidence()[source]

Try to find a descriptor in the thesaurus, compute the Levenshtein distance, and make a score. This score must not be < 1: a matching descriptor is a good point for the collocation, but the absence of one doesn’t mean that this is not a real collocation.

title_confidence()[source]

Define the probability of an ngram being a title. Factor is the maximum confidence coefficient. This must not have a negative effect, only a positive one: a title is a good candidate for a collocation, but not being a title doesn’t mean it is not a collocation. Two things matter here: the proportion of titles AND the number of titles. Ex.:

  • “Jérôme Bourreau” is “more” title than “Bourreau”
  • “projet de loi Création et Internet” is “less” title than “loi Création et Internet”

trigger_score[source]

Score used by the trigger; may become the final confidence?

class sulci.textmining.SemanticalTagger(text, thesaurus=None, pos_tagger=None, lemmatizer=None, lexicon=None)[source]

Bases: object

Main class.

debug()[source]
deduplicate_keyentities()[source]

If a KeyEntity is contained in another, longer one (same stemms in the same place), delete the one with the smaller confidence, or the shorter one if the confidences are equal. We have to begin with the shortest ones.

descriptors[source]
filter_ngram(candidate)[source]

Here we try to keep the right ngrams to make keyentities.

get_descriptors(min_score=10)[source]

Final descriptors for the text.

Only descriptors triggered up to min_score will be returned.

keyentities_for_trainer()[source]
keystems(min_count=3)[source]
make_keyentities(min_length=2, max_length=10, min_count=2)[source]
ngrams(min_length=2, max_length=15, min_count=2)[source]
triggers[source]

Select triggers available for the current keyentities.

class sulci.textmining.Stemm(pk, **kwargs)[source]

Bases: sulci.base.RetrievableObject

Subpart of text, grouped by meaning (stem). This tries to be the core meaning of a word, so many tokens can point to the same stemm. Should be renamed Lemm, because we are talking about lemmatisation, not stemmatisation.

count[source]

Number of occurrences of this stemm.

has_interest()[source]

Do we take it into account as a potential KeyEntity? If count is less than x, but main_occurrence is a title, we try to keep it.

has_interest_alone()[source]

Do we take it into account if it is alone? If count is less than x, but main_occurrence is a title, we try to keep it.

is_valid()[source]
is_valid_alone()[source]
istitle()[source]
main_occurrence[source]

Returns the “main” one from the linked tokens.

tag[source]
class sulci.textmining.StemmedText(text, pos_tagger=None, lemmatizer=None, lexicon=None)[source]

Bases: sulci.base.TextManager

Basic text class, with tokens, samples, etc.

create_stemm()[source]
distinct_words()[source]
distincts_meaning_words()[source]
make()[source]

The text is expected to be tokenized. And filtered?

meaning_words[source]
meaning_words_count()[source]

Return the number of meaning words in the text.

medium_word_count[source]
stemms[source]
words[source]
words_count()[source]

Return the number of words in the text.

corpus Module

class sulci.corpus.Corpus(extension='.crp', tagger=None)[source]

Bases: sulci.corpus.CorpusMonitor

The corpus is a collection of manually categorised texts.

We have different kinds of categorised texts:

  • .crp => just POS tags
  • .lem... => also manually lemmatized
  • .lcx... => will be used to make the Lexicon

When loading a Corpus, you’ll need to specify the kind of texts to load.

LEXICON_EXT = '.lxc'
NEW_EXT = '.new'
PATH = 'corpus'
PENDING_EXT = '.pdg'
VALID_EXT = '.crp'
files[source]

Return a list of files for the corpus extension.

samples[source]
texts[source]
tokens[source]
class sulci.corpus.CorpusMonitor[source]

Bases: object

Convenience class to store common methods between Corpus and TextCorpus.

check(lexicon, check_lemmes=False)[source]

Check the texts of the corpus and try to determine if there are some errors, comparing with the lexicon.

check_usage(word=None, tag=None, lemme=None, case_insensitive=False)[source]

Find occurrences of a word, a tag, or both in the loaded corpus.

tags_stats(word=None, case_insensitive=None)[source]

Display tags usage stats.

class sulci.corpus.TextCorpus(path=None)[source]

Bases: sulci.base.TextManager, sulci.corpus.CorpusMonitor

One single text of the corpus.

This is not a raw text, but a manually categorized text.

The normalisation is: word/TAG/lemme word2/TAG2/lemme2, etc.
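A sketch of how this annotation format can be parsed (illustrative only; treating a missing lemme part as None is an assumption, since plain .crp files carry only word/TAG):

```python
def parse_corpus_sample(line):
    tokens = []
    for chunk in line.split():
        parts = chunk.split("/")
        word, tag = parts[0], parts[1]
        # The lemme part may be absent in POS-only files (assumption).
        lemme = parts[2] if len(parts) > 2 else None
        tokens.append({"word": word, "tag": tag, "lemme": lemme})
    return tokens
```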

LEXICON_EXT = '.lxc.lem.crp'
PATH = 'corpus'
PENDING_EXT = '.pdg'
VALID_EXT = '.crp'
export(name, force=False, add_lemmes=False)[source]

Export tokens in a file.

force exports with the valid extension; otherwise the pending one is used.

has_verified_lemmes[source]

Returns True if the text is supposed to contain verified lemmes.

load()[source]
prepare(text, tagger, lemmatizer)[source]

Given a raw text, clean it, and make tokens and samples.

(Maybe this method should be in the TextManager class.)

samples[source]
tokens[source]

thesaurus Module

class sulci.thesaurus.Descriptor(*args, **kwargs)[source]

Bases: django.db.models.base.Model

Entries of the Thesaurus.

exception DoesNotExist

Bases: django.core.exceptions.ObjectDoesNotExist

exception Descriptor.MultipleObjectsReturned

Bases: django.core.exceptions.MultipleObjectsReturned

Descriptor.aliases
Descriptor.children
Descriptor.is_alias_of
Descriptor.max_weight[source]
Descriptor.objects = <django.db.models.manager.Manager object at 0x2ee4ed0>
Descriptor.original[source]
Descriptor.parent
Descriptor.primeval[source]

Returns the primeval descriptor when self is alias of another.

Descriptor.trigger_set
Descriptor.triggertodescriptor_set
class sulci.thesaurus.Thesaurus(path='thesaurus.txt')[source]

Bases: object

load_triggers()[source]
normalize_item(item)[source]
classmethod reset_triggers()[source]

For full training, we need to remove previous triggers.

triggers[source]
class sulci.thesaurus.Trigger(*args, **kwargs)[source]

Bases: django.db.models.base.Model

The trigger is a keyentity that suggests some descriptors when it occurs in a text. It is linked to one or more descriptors, and the distance of the link between the trigger and a descriptor is stored in the relation. This score is populated during the semantical training.

exception DoesNotExist

Bases: django.core.exceptions.ObjectDoesNotExist

exception Trigger.MultipleObjectsReturned

Bases: django.core.exceptions.MultipleObjectsReturned

classmethod Trigger.clean_all_connections()[source]
Trigger.clean_connections()[source]

Remove the negative connections.

Trigger.connect(descriptor, score)[source]

Create a connection with the descriptor if it doesn’t exist yet. In each case, update the connection weight. Delete the connection if the score is negative.
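This behaviour can be modelled with a toy in-memory class; the real Trigger persists connections via Django models, and applying the deletion test to the cumulative weight (rather than the passed score) is an assumption.

```python
class TriggerSketch:
    def __init__(self):
        self.connections = {}  # descriptor -> cumulative weight

    def connect(self, descriptor, score):
        # Create the connection if missing, then update its weight.
        self.connections[descriptor] = self.connections.get(descriptor, 0) + score
        # Drop the connection once its weight becomes negative.
        if self.connections[descriptor] < 0:
            del self.connections[descriptor]
```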

Trigger.descriptors
Trigger.export()[source]

Return a string for file storage.

Trigger.items()[source]
Trigger.max_weight[source]
Trigger.objects = <django.db.models.manager.Manager object at 0x2e95ad0>
Trigger.triggertodescriptor_set
class sulci.thesaurus.TriggerToDescriptor(*args, **kwargs)[source]

Bases: django.db.models.base.Model

This is the “synapse” of the trigger to descriptor relation.

exception DoesNotExist

Bases: django.core.exceptions.ObjectDoesNotExist

exception TriggerToDescriptor.MultipleObjectsReturned

Bases: django.core.exceptions.MultipleObjectsReturned

TriggerToDescriptor.descriptor
TriggerToDescriptor.objects = <django.db.models.manager.Manager object at 0x2e95550>
TriggerToDescriptor.pondered_weight[source]

Give the weight of the relation, relative to the max weight of the trigger and the max weight of the descriptor.

TriggerToDescriptor.trigger

trainers Module

class sulci.trainers.ContextualTrainer(tagger, corpus, mode='full')[source]

Bases: sulci.trainers.POSTrainer

pretrain()[source]

Tag the tokens, but without using the contextual rules, as we are training them.

template_generator

alias of ContextualTemplateGenerator

class sulci.trainers.LemmatizerTrainer(lemmatizer, mode='full')[source]

Bases: sulci.trainers.RuleTrainer

Train the Lemmatizer.

attr_name = 'lemme'
get_template_instance(tpl)[source]
log_error(token)[source]
pretrain()[source]

We need to have the right tags here.

select_one_rule(rules)[source]

Given a set of candidate rules for correcting some error, select the one correcting the most cases and creating the fewest errors.

template_generator

alias of LemmatizerTemplateGenerator

class sulci.trainers.LexicalTrainer(tagger, corpus, mode='full')[source]

Bases: sulci.trainers.POSTrainer

get_errors()[source]

We don’t care about tokens in the Lexicon, for the lexical trainer.

pretrain()[source]

Tag the tokens, but without using the POS rules, as we are training them.

template_generator

alias of LexicalTemplateGenerator

class sulci.trainers.POSTrainer(tagger, corpus, mode='full')[source]

Bases: sulci.trainers.RuleTrainer

Pos Tagger trainer.

attr_name = 'tag'
get_template_instance(tpl)[source]
log_error(token)[source]
pretrain()[source]
select_one_rule(rules)[source]
class sulci.trainers.RuleTrainer[source]

Bases: object

Main trainer class for the rule-based trainers, for factorisation.

display_errors()[source]

Display errors in current step.

do()[source]
get_errors()[source]

Retrieve the tokens where tag != verified_tag.
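The comparison can be sketched as a standalone filter; token objects carrying tag / verified_tag attributes are assumed for illustration.

```python
def get_errors(tokens, attr="tag"):
    # Keep the tokens whose predicted attribute differs from the
    # human-verified one (verified_tag or verified_lemme).
    return [t for t in tokens
            if getattr(t, attr) != getattr(t, "verified_" + attr)]
```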

get_template_instance(tpl)[source]
log_error(token)[source]
pretrain()[source]

Trainer specific training session preparation.

select_one_rule(rules)[source]
setup_socket_master()[source]

Configure the sockets for the master trainer.

setup_socket_slave()[source]

Configure sockets for the workers (slaves).

slave()[source]
test_rule(rule)[source]
test_rules(rules_candidates)[source]
train()[source]

Main factorized train method.

class sulci.trainers.SemanticalTrainer(thesaurus, pos_tagger, mode='full')[source]

Bases: object

Create and update triggers, and make the trigger-to-descriptor ponderation.

PENDING_EXT = '.pdg'
VALID_EXT = '.trg'
begin()[source]

Make one trigger for each descriptor of the thesaurus. Has to be called once at the beginning, and that’s all.

clean_connections()[source]

Delete all the connections where score < 0.

do(*args)[source]
setup_socket_master()[source]

Configure the sockets for the master trainer.

setup_socket_slave()[source]

Configure sockets for the workers (slaves).

slave()[source]
train(inst)[source]

For the moment, human-defined descriptors are a string with a “,” separator.

rules_templates Module

class sulci.rules_templates.CHANGESUFFIX(pk, **kwargs)[source]

Bases: sulci.rules_templates.LemmatizerBaseTemplate

Change the suffix of the original (delete to_delete, add to_add), if the tag is x.

apply_rule(tokens, rule)[source]
compile_rule(tag, to_delete, to_add)[source]
is_candidate(token, rule)[source]
make_rules(token)[source]

We make one rule for each possible transformation that produces verified_lemme from token.original.

test_rule(token, rule)[source]
uncompile_rule(rule)[source]
class sulci.rules_templates.CURWD(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The word is X. I have doubts about the interest of this template...

get_target()[source]
class sulci.rules_templates.ContextualBaseTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.RuleTemplate

Base class for the contextual rules.

compile_rule(from_tag, to_tag, complement)[source]

Make the final rule string. complement must be an iterable.
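Assuming the classic Brill rule layout (from_tag, to_tag, template name, then the complement fields — an assumption about sulci's exact field order), such a compile step might be sketched as:

```python
def compile_rule(from_tag, to_tag, template_name, complement):
    # from_tag, to_tag, the template name, then the complement fields.
    return " ".join([from_tag, to_tag, template_name] + list(complement))
```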

classmethod uncompile_rule(rule)[source]
class sulci.rules_templates.ContextualTemplateGenerator[source]

Bases: type

classmethod export(rules)[source]

Export rules to the provisional config file.

rules are tuples (rule, score).

classmethod get_instance(s, **kwargs)[source]

Returns an instance of a rule, from a template name or a rule string.

s can be template name or rule.

classmethod load()[source]

Load rules from config file.

register = {'NEXTWD': <class 'sulci.rules_templates.NEXTWD'>, 'WDAND2AFT': <class 'sulci.rules_templates.WDAND2AFT'>, 'PREV1OR2OR3TAG': <class 'sulci.rules_templates.PREV1OR2OR3TAG'>, 'NEXT1OR2OR3TAG': <class 'sulci.rules_templates.NEXT1OR2OR3TAG'>, 'CURWD': <class 'sulci.rules_templates.CURWD'>, 'NEXT1OR2WD': <class 'sulci.rules_templates.NEXT1OR2WD'>, 'SURROUNDTAG': <class 'sulci.rules_templates.SURROUNDTAG'>, 'PREV1OR2TAG': <class 'sulci.rules_templates.PREV1OR2TAG'>, 'WDAND2BFR': <class 'sulci.rules_templates.WDAND2BFR'>, 'PREVBIGRAM': <class 'sulci.rules_templates.PREVBIGRAM'>, 'NEXT2WD': <class 'sulci.rules_templates.NEXT2WD'>, 'PREV2TAG': <class 'sulci.rules_templates.PREV2TAG'>, 'WDAND2TAGBFR': <class 'sulci.rules_templates.WDAND2TAGBFR'>, 'LBIGRAM': <class 'sulci.rules_templates.LBIGRAM'>, 'NEXTBIGRAM': <class 'sulci.rules_templates.NEXTBIGRAM'>, 'NEXT1OR2TAG': <class 'sulci.rules_templates.NEXT1OR2TAG'>, 'RBIGRAM': <class 'sulci.rules_templates.RBIGRAM'>, 'WDAND2TAGAFT': <class 'sulci.rules_templates.WDAND2TAGAFT'>, 'WDNEXTTAG': <class 'sulci.rules_templates.WDNEXTTAG'>, 'WDPREVTAG': <class 'sulci.rules_templates.WDPREVTAG'>, 'PREV1OR2WD': <class 'sulci.rules_templates.PREV1OR2WD'>, 'PREV2WD': <class 'sulci.rules_templates.PREV2WD'>, 'PREVTAG': <class 'sulci.rules_templates.PREVTAG'>, 'NEXTTAG': <class 'sulci.rules_templates.NEXTTAG'>, 'PREVWD': <class 'sulci.rules_templates.PREVWD'>, 'NEXT2TAG': <class 'sulci.rules_templates.NEXT2TAG'>}
class sulci.rules_templates.FORCELEMME(pk, **kwargs)[source]

Bases: sulci.rules_templates.LemmatizerBaseTemplate

Give lemme y, if the tag is x.

apply_rule(tokens, rule)[source]
compile_rule(tag, lemme)[source]
is_candidate(token, rule)[source]
make_rules(token)[source]
test_rule(token, rule)[source]
class sulci.rules_templates.LBIGRAM(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The previous word and the current word are X and Y.

get_target()[source]
class sulci.rules_templates.LemmatizerBaseTemplate(pk, **kwargs)[source]

Bases: sulci.base.RetrievableObject

For the Lemmatizer training, there is just one template: it creates as many possible rules as there are letters in the tested token. MAKELOWER GIVELEMME CHANGESUFFIX

compile_rule()[source]
is_candidate(token, rule)[source]
make_rules(token)[source]
test_rule(token, rule)[source]
uncompile_rule(rule)[source]
class sulci.rules_templates.LemmatizerTemplateGenerator[source]

Bases: type

classmethod export(rules)[source]

Rules are tuples (rule, score)

classmethod get_instance(s, **kwargs)[source]

s can be template name or rule.

classmethod load()[source]
register = {'FORCELEMME': <class 'sulci.rules_templates.FORCELEMME'>, 'MAKELOWER': <class 'sulci.rules_templates.MAKELOWER'>, 'CHANGESUFFIX': <class 'sulci.rules_templates.CHANGESUFFIX'>}
class sulci.rules_templates.LexicalBaseTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.RuleTemplate

Base class for the lexical rules.

compile_rule(from_tag, to_tag, complement)[source]
test_complement(token, complement)[source]
classmethod uncompile_rule(rule)[source]
class sulci.rules_templates.LexicalTemplateGenerator[source]

Bases: type

classmethod export(rules)[source]

Rules are tuples (rule, score)

classmethod get_instance(s, lexicon)[source]

s can be template name or rule.

classmethod load()[source]
register = {'faddsuf': <class 'sulci.rules_templates.faddsuf'>, 'haspref': <class 'sulci.rules_templates.haspref'>, 'fdeletepref': <class 'sulci.rules_templates.fdeletepref'>, 'deletepref': <class 'sulci.rules_templates.deletepref'>, 'fhassuf': <class 'sulci.rules_templates.fhassuf'>, 'hassuf': <class 'sulci.rules_templates.hassuf'>, 'fhaspref': <class 'sulci.rules_templates.fhaspref'>, 'goodright': <class 'sulci.rules_templates.goodright'>, 'goodleft': <class 'sulci.rules_templates.goodleft'>, 'fdeletesuf': <class 'sulci.rules_templates.fdeletesuf'>, 'faddpref': <class 'sulci.rules_templates.faddpref'>, 'deletesuf': <class 'sulci.rules_templates.deletesuf'>, 'fgoodleft': <class 'sulci.rules_templates.fgoodleft'>, 'fgoodright': <class 'sulci.rules_templates.fgoodright'>, 'addpref': <class 'sulci.rules_templates.addpref'>, 'addsuf': <class 'sulci.rules_templates.addsuf'>}
class sulci.rules_templates.LexiconCheckTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.LexicalBaseTemplate

Base template for those that have to check the lexicon.

make_rules(token)[source]
test_complement(token, complement)[source]

For the Lexicon Check rules, we need to check if the modified word is in the lexicon.

class sulci.rules_templates.MAKELOWER(pk, **kwargs)[source]

Bases: sulci.rules_templates.LemmatizerBaseTemplate

Make the original lower, if the tag is x.

apply_rule(tokens, rule)[source]
compile_rule(tag)[source]
make_rules(token)[source]
test_rule(token, rule)[source]
class sulci.rules_templates.NEXT1OR2OR3TAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.TagBasedTemplate

One of the next three words is tagged X.

get_target()[source]
class sulci.rules_templates.NEXT1OR2TAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.TagBasedTemplate

One of the next two tokens is tagged X.

get_target()[source]
class sulci.rules_templates.NEXT1OR2WD(pk, **kwargs)[source]

Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.WordBasedTemplate

One of the next two tokens is word X.

get_target()[source]
class sulci.rules_templates.NEXT2TAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.TagBasedTemplate

The token after the next token is tagged X.

get_target()[source]
class sulci.rules_templates.NEXT2WD(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The token after the next token is word X.

get_target()[source]
class sulci.rules_templates.NEXTBIGRAM(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The next two words are X and Y.

get_target()[source]
class sulci.rules_templates.NEXTTAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.TagBasedTemplate

The next token is tagged X.

get_target()[source]
class sulci.rules_templates.NEXTWD(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The next token is word X.

get_target()[source]
class sulci.rules_templates.NoLexiconCheckTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.LexicalBaseTemplate

make_rules(token)[source]
class sulci.rules_templates.OrTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.ContextualBaseTemplate

Abstract class for templates where we check no specific position.

make_rules(token)[source]
test_complement(token, complement)[source]
class sulci.rules_templates.PREV1OR2OR3TAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.TagBasedTemplate

One of the previous three tokens is tagged X.

get_target()[source]
class sulci.rules_templates.PREV1OR2TAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.TagBasedTemplate

One of the previous two tokens is tagged X.

get_target()[source]
class sulci.rules_templates.PREV1OR2WD(pk, **kwargs)[source]

Bases: sulci.rules_templates.OrTemplate, sulci.rules_templates.WordBasedTemplate

One of the previous two tokens is word X.

get_target()[source]
class sulci.rules_templates.PREV2TAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.TagBasedTemplate

The token before the previous token is tagged X.

get_target()[source]
class sulci.rules_templates.PREV2WD(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The token before the previous token is word X.

get_target()[source]
class sulci.rules_templates.PREVBIGRAM(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The previous two words are X and Y.

get_target()[source]
class sulci.rules_templates.PREVTAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.TagBasedTemplate

The previous token is tagged X.

get_target()[source]
class sulci.rules_templates.PREVWD(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The previous token is word X.

get_target()[source]
class sulci.rules_templates.ProximityCheckTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.LexicalBaseTemplate

compile_rule(from_tag, to_tag, complement)[source]

No len...

class sulci.rules_templates.RBIGRAM(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The current word and the next word are X and Y.

get_target()[source]
class sulci.rules_templates.RuleTemplate(pk, **kwargs)[source]

Bases: sulci.base.RetrievableObject

Class for managing rules creation and analysis.

apply_rule(tokens, rule)[source]

Apply rule to candidates in a set of tokens.

get_to_tag(rule)[source]
is_candidate(token, rule)[source]
make_rules(token)[source]
classmethod select_one(rules, MAX, minval=2)[source]

Select one rule from a set of tested rules.

rules is an iterable of tuples (rule, good, bad), where good is the number of errors corrected and bad the number of errors generated.
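A sketch of such a selection, keeping the rule with the best net score good - bad above a minimal threshold (the exact tie-breaking and threshold semantics are assumptions):

```python
def select_one(rules, minval=2):
    best = None
    for rule, good, bad in rules:
        net = good - bad  # errors corrected minus errors generated
        if net >= minval and (best is None or net > best[1]):
            best = (rule, net)
    return best[0] if best is not None else None
```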

test_complement(token, complement)[source]
test_rule(token, rule)[source]
class sulci.rules_templates.SURROUNDTAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.TagBasedTemplate

The preceding word is tagged x and the following word is tagged y.

get_target()[source]
class sulci.rules_templates.TagBasedTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.ContextualBaseTemplate

Abstract class for tag-based templates.

get_complement(token)[source]
class sulci.rules_templates.TagWordBasedTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.ContextualBaseTemplate

Abstract class for mixed templates: tag, then word.

get_complement(token)[source]
class sulci.rules_templates.WDAND2AFT(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The current word is X and the word two after is Y.

get_target()[source]
class sulci.rules_templates.WDAND2BFR(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordBasedTemplate

The current word is X and the word two before is Y.

get_target()[source]
class sulci.rules_templates.WDAND2TAGAFT(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordTagBasedTemplate

Current word, and tag of the token two after.

get_target()[source]
class sulci.rules_templates.WDAND2TAGBFR(pk, **kwargs)[source]

Bases: sulci.rules_templates.TagWordBasedTemplate

Current word, and tag of the token two before.

get_target()[source]
class sulci.rules_templates.WDNEXTTAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.WordTagBasedTemplate

Current word, and tag of the token after.

get_target()[source]
class sulci.rules_templates.WDPREVTAG(pk, **kwargs)[source]

Bases: sulci.rules_templates.TagWordBasedTemplate

Current word, and tag of the token before.

get_target()[source]
class sulci.rules_templates.WordBasedTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.ContextualBaseTemplate

Abstract class for word-based templates.

get_complement(token)[source]
class sulci.rules_templates.WordTagBasedTemplate(pk, **kwargs)[source]

Bases: sulci.rules_templates.ContextualBaseTemplate

Abstract class for mixed templates: word, then tag.

get_complement(token)[source]
class sulci.rules_templates.addpref(pk, **kwargs)[source]

Bases: sulci.rules_templates.LexiconCheckTemplate

Change the current tag to tag X, if adding prefix Y leads to an entry of the lexicon.

Prefix Y length from 1 to 4 (len(Y) <= 4). Syntax: Y addpref len(Y) X. Ex.: er addpref 2 VNCFF

get_complement(token)[source]
modified_token(token, complement)[source]
class sulci.rules_templates.addsuf(pk, **kwargs)[source]

Bases: sulci.rules_templates.LexiconCheckTemplate

Change the current tag to tag X, if adding suffix Y leads to an entry of the lexicon.

Suffix Y length from 1 to 4 (len(Y) <= 4). Syntax: Y addsuf len(Y) X. Ex.: re addsuf 2 VNCFF

get_complement(token)[source]
modified_token(token, complement)[source]
class sulci.rules_templates.deletepref(pk, **kwargs)[source]

Bases: sulci.rules_templates.LexiconCheckTemplate

Change the current tag to tag X, if removing prefix Y leads to an entry of the lexicon.

Prefix Y length from 1 to 4 (len(Y) <= 4). Syntax: Y deletepref len(Y) X. Ex.: re deletepref 2 VNCFF

get_complement(token)[source]
test_complement(token, complement)[source]

Tests if the token has the right prefix, and if deleting it results in a word in the lexicon.

class sulci.rules_templates.deletesuf(pk, **kwargs)[source]

Bases: sulci.rules_templates.LexiconCheckTemplate

Change the current tag to tag X, if removing suffix Y leads to an entry of the lexicon.

get_complement(token)[source]

Return a tuple of (affix, ceased_token).

test_complement(token, complement)[source]

Tests if the token has the right suffix, and if deleting it results in a word in the lexicon.

class sulci.rules_templates.faddpref(pk, **kwargs)[source]

Bases: sulci.rules_templates.addpref

Change the current tag to tag X, if adding prefix Y leads to an entry of the lexicon and the current tag is Z.

Prefix Y length from 1 to 4 (len(Y) <= 4). Syntax: Z Y faddpref len(Y) X. Ex.: SBC:sg re faddpref 2 VNCFF

class sulci.rules_templates.faddsuf(pk, **kwargs)[source]

Bases: sulci.rules_templates.addsuf

Change the current tag to tag X, if adding suffix Y leads to an entry of the lexicon and the current tag is Z.

Suffix Y length from 1 to 4 (len(Y) <= 4). Syntax: Z Y faddsuf len(Y) X. Ex.: SBC:sg re faddsuf 2 VNCFF

class sulci.rules_templates.fdeletepref(pk, **kwargs)[source]

Bases: sulci.rules_templates.deletepref

Change the current tag to tag X, if removing prefix Y leads to an entry of the lexicon and the current tag is Z.

Prefix Y length from 1 to 4 (len(Y) <= 4). Syntax: Z Y fdeletepref len(Y) X. Ex.: ADV re fdeletepref 2 VNCFF

class sulci.rules_templates.fdeletesuf(pk, **kwargs)[source]

Bases: sulci.rules_templates.deletesuf

Change the current tag to tag X, if removing suffix Y leads to an entry of the lexicon and the current tag is Z.

class sulci.rules_templates.fgoodleft(pk, **kwargs)[source]

Bases: sulci.rules_templates.goodleft

class sulci.rules_templates.fgoodright(pk, **kwargs)[source]

Bases: sulci.rules_templates.goodright

class sulci.rules_templates.fhaspref(pk, **kwargs)[source]

Bases: sulci.rules_templates.haspref

Change the current tag to tag X, if the prefix is Y and the current tag is Z.

Prefix Y length from 1 to 4 (len(Y) <= 4). Syntax: Z Y fhaspref len(Y) X. Ex.: ADV bla fhaspref 3 DTC:sg

class sulci.rules_templates.fhassuf(pk, **kwargs)[source]

Bases: sulci.rules_templates.hassuf

Change the current tag to tag X, if the suffix is Y and the current tag is Z.

Suffix Y length from 1 to 4 (len(Y) <= 4). Syntax: Z Y fhassuf len(Y) X. Ex.: SBC:sg ment fhassuf 4 ADV

class sulci.rules_templates.goodleft(pk, **kwargs)[source]

Bases: sulci.rules_templates.NoLexiconCheckTemplate, sulci.rules_templates.ProximityCheckTemplate

The current word is at the left of the word X.

get_complement(token)[source]
class sulci.rules_templates.goodright(pk, **kwargs)[source]

Bases: sulci.rules_templates.NoLexiconCheckTemplate, sulci.rules_templates.ProximityCheckTemplate

The current word is at the right of the word X.

get_complement(token)[source]
class sulci.rules_templates.haspref(pk, **kwargs)[source]

Bases: sulci.rules_templates.NoLexiconCheckTemplate

Change the current tag to tag X, if the prefix is Y.

Prefix Y length from 1 to 4 (len(Y) <= 4). Syntax: Y haspref len(Y) X. Ex.: pro haspref 3 SBC:sg

get_complement(token)[source]
class sulci.rules_templates.hassuf(pk, **kwargs)[source]

Bases: sulci.rules_templates.NoLexiconCheckTemplate

Change the current tag to tag X, if the suffix is Y.

Suffix Y length from 1 to 4 (len(Y) <= 4). Syntax: Y hassuf len(Y) X. Ex.: ment hassuf 4 ADV

get_complement(token)[source]

Return a tuple of (affix, ceased_token).
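For example, the rule `ment hassuf 4 ADV` reads: tag a word ADV when it ends with "ment". A hypothetical helper (not sulci's API) illustrating this:

```python
def hassuf_applies(word, suffix, new_tag, current_tag):
    # Propose the new tag only when the word carries the suffix.
    return new_tag if word.endswith(suffix) else current_tag
```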

lexicon Module

Define the Lexicon class.

For now, the lexicon is stored in a flat file, with a special syntax:

  • word[TAB]POStag1/lemme1[TAB]POStag2/lemme2
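A sketch of parsing one line of this flat-file format into the word plus its list of tag/lemme pairs (a hypothetical helper written for illustration):

```python
def parse_lexicon_line(line):
    fields = line.rstrip("\n").split("\t")
    word, entries = fields[0], fields[1:]
    # Each entry is POStag/lemme; keep them as a list of one-item dicts.
    tags = [dict([entry.split("/", 1)]) for entry in entries]
    return word, tags
```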
class sulci.lexicon.Lexicon(path='corpus')[source]

Bases: sulci.base.TextManager

The lexicon is a list of unique words and their possible POS tags.

add_factors(token)[source]

Build the list of factors (pieces of word).

These factors are used by the POS tagger to determine if an unknown word could be a derivative of another.

check()[source]

Utility method to try to spot errors in the Lexicon. For this, we display the entries with several tags, in case they are wrong duplicates.

create_afixes()[source]

We determine here the most frequent prefixes and suffixes.

get_entry(entry)[source]
items()[source]
loaded[source]

Load the lexicon in RAM, from the file.

The representation will be a dict {“word1”: [{tag1: lemme1}]}

make(force=False)[source]

Build the lexicon.

prefixes[source]
suffixes[source]
class sulci.lexicon.LexiconEntity(raw_data, **kwargs)[source]

Bases: object

One word of a lexicon.

lemmatizer Module

class sulci.lemmatizer.Lemmatizer(lexicon)[source]

Bases: sulci.base.TextManager

This class gives a lemma for a token, using its tag.

PATH = 'corpus'
VALID_EXT = '.lem.crp'
content[source]
do(token)[source]

A Token object or a list of Token objects is expected. Return the token or the list.

samples[source]
tokens[source]
