Using a simple rule language, the tokens of a CoNLL-U (`.conllu`) file can be edited whenever a condition is met:
bin/replace.sh rules.txt input.conllu [--nostrict] > output.conllu
The rule file has one line per rule:
condition > new_values
`condition` is a logical expression which is evaluated for each word; if it is true, the new values are assigned to the token which satisfies the condition.
A condition is built from `key:value` pairs, operators such as `and`, `or` and `not`, and parentheses. It may contain whitespace.
Example:

* `Upos:ADP and !Deprel:case`: true if the current token has `ADP` as UPOS and its deprel is not `case`.

Available keys:

* `Upos:` (values: `[A-Z]+`)
* `Xpos:` (values: string of any character except whitespace, `)` and `&`)
* `Lemma:` (values: string of any character except whitespace, `)` and `&`)
* `Form:` (values: string of any character except whitespace, `)` and `&`)
* `Deprel:` (values: string of any character except whitespace, `)` and `&`, optionally followed by `:` and a string of any character except whitespace, `)` and `&`)
* `HeadId:` (values: `[+-][0-9]+` (relative from current head) or `[0-9]+` (absolute head id)); true if the head of the current token matches
* `EUD:` (values: `[+-][0-9]+:` followed by a deprel; if EudHeadId is `*`, any head position is accepted; without `-` or `+`, the EudHead is interpreted as an absolute value)
* `Feat:` (values: `FeatureName=Value` or `FeatureName:Value`; the FeatureName must match `[A-Za-z_\[\]]+`, the Value `[A-Za-z0-9]+`)
* `Misc:` (values: `MiscName=Value` or `MiscName:Value`; the MiscName must match `[A-Za-z_]+`, the Value can be any string without whitespace, `)` and `&`)
* `Id:` (values: integer)
* `MWT:` (values: length of the multi-word token, `[2-9]`)
* `IsEmpty` (no value; true if the current node is empty)
* `IsMWT` (no value; true if the current node is a MWT)
`Form:`, `Lemma:` and `Xpos:` can contain simple regular expressions (only the character `)` cannot be used).
To check for any Feat or Misc value, leave the value empty:

* `Feat:Gender:` true if the current word has the feature `Gender` with any value

To check for the absence of a given feature name in the Feature or Misc column, use the following:

* `not Feat:Gender:` true if the current word has no feature `Gender`
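The presence and absence checks above can be pictured with a small Python sketch. It is only an illustration of the described semantics, not the tool's implementation; the helper names `parse_feats` and `has_feature` are hypothetical, and the input is assumed to be a standard CoNLL-U FEATS string (`Name=Value` pairs separated by `|`, or `_` when empty).

```python
def parse_feats(column: str) -> dict:
    """Turn a CoNLL-U FEATS string such as 'Gender=Fem|Number=Sing' into a dict."""
    if column in ("", "_"):
        return {}
    return dict(item.split("=", 1) for item in column.split("|"))


def has_feature(column: str, name: str) -> bool:
    """Mimics `Feat:<name>:` (value left empty): true for any value of the feature."""
    return name in parse_feats(column)


# `Feat:Gender:` on a word that carries a Gender feature
print(has_feature("Gender=Fem|Number=Sing", "Gender"))   # True
# `not Feat:Gender:` on a word without a Gender feature
print(not has_feature("Number=Sing", "Gender"))          # True
```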
`EUD` cannot (yet) deal with empty word ids (`n.m`).
`Lemma` and `Form` can take as argument either a regex or the name of a file which contains a list of forms or lemmas:

* `Lemma:sing.* > misc:"Value=Sing"`
* `Lemma:#mylemmas.txt > misc:"Value=Sing"` (if the file `mylemmas.txt` does not exist, the condition is false)
In addition to the keys listed above, four functions are available to take the context of the token into account:

* `child()`: child of the current token
* `head()`: head of the current token
* `prec()`: preceding token
* `next()`: following token
For example:
* `head(head(Upos:VERB and Feat:Tense=Past))`: true if the current token has a head which has a head with UPOS `VERB` and the feature `Tense=Past`
* `child(Upos:VERB && Feat:VerbForm=Part) and child(Upos:DET)`: true if the current token has a dependant with UPOS `VERB` and a feature `VerbForm=Part` and another child with UPOS `DET`
* `head(next(Upos:NOUN))`: true if the current token has a head which is followed by a token with UPOS `NOUN`
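To make the context functions more concrete, here is a minimal Python sketch of how `head()` and `child()` conditions could be evaluated over a sentence. The `Token` class and all helper names are hypothetical and exist only for this illustration; the actual tool evaluates conditions through its own grammar.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    form: str
    upos: str
    head: int  # 0 means the token is the root


def head_of(sentence, tok):
    """Token governing `tok`, or None if `tok` is the root."""
    return next((t for t in sentence if t.id == tok.head), None)


def children_of(sentence, tok):
    """All dependants of `tok`."""
    return [t for t in sentence if t.head == tok.id]


def head_head_has_upos(sentence, tok, upos):
    """Mimics `head(head(Upos:<upos>))`."""
    h = head_of(sentence, tok)
    hh = head_of(sentence, h) if h is not None else None
    return hh is not None and hh.upos == upos


def has_child_with_upos(sentence, tok, upos):
    """Mimics `child(Upos:<upos>)`."""
    return any(c.upos == upos for c in children_of(sentence, tok))


sentence = [
    Token(1, "the", "DET", 3),
    Token(2, "old", "ADJ", 3),
    Token(3, "cat", "NOUN", 4),
    Token(4, "sleeps", "VERB", 0),
]
the, old, cat, sleeps = sentence

print(head_head_has_upos(sentence, the, "VERB"))   # True: the -> cat -> sleeps
print(head_head_has_upos(sentence, cat, "VERB"))   # False: the head of "cat" is the root
print(has_child_with_upos(sentence, cat, "DET"))   # True: "the" depends on "cat"
```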
Functions can be nested (even though `child(head())` does not make sense, does it? :-)
Values can also be compared, for instance to check whether subject-verb agreement is OK. The access operator `@` gives access to column values (e.g. `@Upos` or `@Feat:Number`), and `=` is used to compare them.
If any of the accessed columns is empty (_) the comparison is evaluated as false.
For example:
* `@Feat:Number=head(@Feat:Number)`: true if the current word and its head both have a feature `Number` with the same value
* `@Upos=@Xpos`: true if the current word has the same value for `UPOS` and `XPOS`
* `@Deprel=prec(@Deprel)`: true if the current word and the preceding word have the same `deprel` value
* `@Xpos=head(head(@Feat:Featname))`: true if the `XPOS` of the current word has the same value as the feature `Featname` of the head of its head
* `@Feat:Gender=head(@Feat:Gender) and not Upos:DET`: true if the head and the current word have the same value for the feature `Gender` and the current word is not a `DET`. If either of the two words has no feature `Gender`, the whole expression is evaluated as false.
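The comparison rule (equal values, and false as soon as one side is empty) can be sketched in Python as follows. The token layout and the names `same_value` and `number_agrees_with_head` are hypothetical. Treating a root token, which has no head to compare with, as a failed comparison is an assumption of this sketch.

```python
def same_value(a, b):
    """Mimics `@X=@Y`: false when either side is missing or empty ('_')."""
    if a in (None, "", "_") or b in (None, "", "_"):
        return False
    return a == b


# Tiny sentence keyed by token id; feats are already split into dicts.
tokens = {
    1: {"form": "cats", "head": 2, "feats": {"Number": "Plur"}},
    2: {"form": "sleep", "head": 0, "feats": {"Number": "Plur"}},
    3: {"form": "soundly", "head": 2, "feats": {}},
}


def number_agrees_with_head(token_id):
    """Mimics `@Feat:Number=head(@Feat:Number)` for one token."""
    tok = tokens[token_id]
    head = tokens.get(tok["head"])
    if head is None:  # root token: nothing to compare with (assumption)
        return False
    return same_value(tok["feats"].get("Number"), head["feats"].get("Number"))


print(number_agrees_with_head(1))  # True: both are Number=Plur
print(number_agrees_with_head(3))  # False: "soundly" has no Number feature
```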
The same search language is used for complex search and replace.
For more information check the formal grammar for conditions.
`new_values` is a whitespace-separated list of `targeted_column:value` pairs which modify the tokens matching the condition.
The targeted_column indicates which column of the word a new value is assigned to:
Possible keys:
* `Form`
* `Lemma`
* `Upos`
* `Xpos`
* `Deprel`
* `HeadId`
* `Feat`
* `Eud`
* `Misc`

(The `Id` column cannot be changed.)
`value` is a combination (using `+`) of strings or functions which give access to other columns of the current word or its head. Strings must be enclosed in double quotes, e.g. `"NOUN"`.
column_name to retrieve a value from can be:
* `Form`
* `Lemma`
* `Upos`
* `Xpos`
* `Feat_<FeatureName>`
* `Deprel`
* `Misc_<KeyName>`
* `HeadId`
Available functions are:
* `this(<column_name>)`: value of the given column of the current token
* `head(<column_name>)`: value of the given column of the head of the current token
* `head(head(<column_name>))`: value of the given column of the head's head of the current token
* `substring(this()/head(), start, end)`: take the substring of the result of the this/head expression from `start` to `end`
* `substring(this()/head(), start)`: take the substring of the result of the this/head expression from `start` until the end of the string
* `upper(this()/head())`: uppercase the result of the this/head expression
* `lower(this()/head())`: lowercase the result of the this/head expression
* `cap(this()/head())`: capitalize (first character uppercase, rest lowercase) the result of the this/head expression
* `replace(this()/head(), regex, newstring)`: replace the matches of `regex` in the result of the this/head expression by `newstring`
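As a rough illustration of what the string functions compute, the following Python sketch gives stand-ins for `upper`, `lower`, `cap` and `replace`. These are not the tool's own code, and Python's `re` dialect may differ from the regex engine the tool uses. `substring` is left out on purpose, since the text above does not say whether `start` and `end` are 0-based or 1-based, or whether `end` is inclusive.

```python
import re


def upper(value: str) -> str:
    return value.upper()


def lower(value: str) -> str:
    return value.lower()


def cap(value: str) -> str:
    """First character uppercase, rest lowercase."""
    return value[:1].upper() + value[1:].lower()


def replace(value: str, regex: str, newstring: str) -> str:
    """Replace every match of `regex` in `value` by `newstring`."""
    return re.sub(regex, newstring, value)


form = "pARIS"
print(upper(form))                 # PARIS
print(lower(form))                 # paris
print(cap(form))                   # Paris
print(replace("été", "é", "e"))    # ete
```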
If a token has head 0, its deprel will always be `root` unless the option `--nostrict` is used with `replace.sh`.
Examples:

* `Upos:"NOUN"`: set Upos to `NOUN`
* `Eud:"+2:dep"`: add an enhanced UD relation "dep" using the current id + 2 (the offset must be a non-zero negative or positive integer; if the resulting head id is outside the sentence, the head id is not modified)
* `Eud:head(HeadId)+":"+head(Deprel)`: set EUD to the head and deprel of the head word
* `HeadId:"+2"`: set head to current id + 2 (the offset must be a non-zero negative or positive integer; if the resulting head id is outside the sentence, the head id is not modified)
* `HeadId:"-1"`: set head to current id - 1
* `HeadId:"5"`: set head to 5 (the value must be 0 or a positive integer)
* `HeadId:head(HeadId)`: set head to the head id of the head node
* `Feat:"Number=Sing"`: add a feature `Number=Sing` (`Number:` deletes the feature)
* `Lemma:this(Form)`: set lemma to the form of the current token
* `Lemma:this(Misc_Translit)`: set lemma to the key `Translit` of the `Misc` column
* `Lemma:this(Form)+"er"`: set lemma to the form + "er"
* `Lemma:"de"+this(Form)`: set lemma to "de" + form
* `Feat:"Featname"+this(Lemma)`: set the feature Featname to the value of Lemma
* `Feat:"Gender"+this(Misc_Special)`: set the feature Gender to the value of the Misc key Special
* `Misc:"Keyname"+head(head(Upos))`: set the key "Keyname" of the `Misc` column to the Upos of the head of the head
* `Lemma:substring(this(Form),1,3)`: set lemma to the substring (1 - 3) of the form
* `Lemma:substring(this(Form),1)`: set lemma to the substring (1 - end) of the form
* `Form:replace(this(Form),"é","e")`: replace all occurrences of `é` in the form by `e`
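The three `HeadId` forms (relative `+n`, relative `-n`, absolute `n`) can be summarised with a short Python sketch. The function `resolve_head` is hypothetical. Two points are assumptions of this sketch and not statements about the tool: "outside the sentence" is read as "not between 1 and the sentence length", and absolute values are returned without any range check.

```python
import re


def resolve_head(current_id: int, value: str, sentence_length: int, old_head: int) -> int:
    """Return the head id a `HeadId:"<value>"` replacement would produce."""
    if re.fullmatch(r"[+-][0-9]+", value):
        offset = int(value)  # int() accepts a leading sign: int("+2") == 2
        if offset == 0:
            raise ValueError("a relative offset must not be 0")
        new_head = current_id + offset
        # Assumption: a target outside 1..sentence_length leaves the head unchanged.
        if not 1 <= new_head <= sentence_length:
            return old_head
        return new_head
    if re.fullmatch(r"[0-9]+", value):
        return int(value)  # absolute head id; 0 means root
    raise ValueError(f"not a valid HeadId value: {value!r}")


# Token 3 in a six-token sentence, currently attached to token 1.
print(resolve_head(3, "+2", 6, 1))  # 5
print(resolve_head(3, "-1", 6, 1))  # 2
print(resolve_head(3, "5", 6, 1))   # 5
print(resolve_head(5, "+2", 6, 1))  # 1 (7 is outside the sentence, head kept)
```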
N.B. No whitespace is allowed in a value expression! Therefore `Lemma:substring(this(Form), 1, 3)` and `Lemma:this(Form) + "er"` are invalid; use `Lemma:substring(this(Form),1,3)` and `Lemma:this(Form)+"er"` instead.
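One way to see why whitespace breaks a value expression: since `new_values` is a whitespace-separated list, any blank inside an expression cuts it into several pieces. The Python sketch below only illustrates this. `parse_rule` is a hypothetical helper, the real tool parses rules with a formal grammar, and splitting the line at the first `>` assumes the condition itself contains no `>`.

```python
def parse_rule(line: str):
    """Split 'condition > new_values' into the condition and its assignments."""
    condition, _, new_values = line.partition(">")
    return condition.strip(), new_values.split()


good = 'Upos:ADP and !Deprel:case > Upos:"NOUN" Lemma:this(Form)+"er"'
bad = 'Upos:ADP > Lemma:this(Form) + "er"'

print(parse_rule(good))
# ('Upos:ADP and !Deprel:case', ['Upos:"NOUN"', 'Lemma:this(Form)+"er"'])

print(parse_rule(bad))
# ('Upos:ADP', ['Lemma:this(Form)', '+', '"er"'])
```

In the second rule the single intended assignment falls apart into three fragments, none of which is a complete `targeted_column:value` pair after the first.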
In order to empty a column, just set it to "_": Feat:"_", Xpos:"_", Eud:"_" etc.
For more information check the formal grammar for replacements (the part after the first :).