Similarity is a measure of the similarity/dissimilarity between Text1 and Text2. For example:
?- isub('E56.Language', 'languange', D, [normalize(true)]).
D = 0.4226950354609929. % [-1,1] range
?- isub('E56.Language', 'languange', D, [normalize(true),zero_to_one(true)]).
D = 0.7113475177304964. % [0,1] range
?- isub('E56.Language', 'languange', D, []). % without normalization
D = 0.19047619047619047. % [-1,1] range
?- isub(aa, aa, D, []). % does not work for short substrings
D = -0.8.
?- isub(aa, aa, D, [substring_threshold(0)]). % works with short substrings
D = 1.0. % but may give unwanted values
% between e.g. 'store' and 'spore'.
?- isub(joe, hoe, D, [substring_threshold(0)]).
D = 0.5315315315315314.
?- isub(joe, hoe, D, []).
D = -1.0.
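The two normalized results above (0.4226... in the [-1,1] range and 0.7113... in the [0,1] range) are consistent with a simple linear rescaling, (S + 1) / 2. The following is a minimal sketch under that assumption; the source does not state the mapping explicitly, and `to_unit_range/2` is a hypothetical helper name, not part of the library.

```prolog
% to_unit_range(+S, -U)
% Hypothetical helper (not part of library(isub)).
% ASSUMPTION: the [0,1] value is the [-1,1] value rescaled linearly,
% as suggested by the example outputs in this section.
to_unit_range(S, U) :-
    U is (S + 1) / 2.
```

For instance, `?- to_unit_range(0.4226950354609929, U).` gives a value of approximately 0.7113, in line with the zero_to_one(true) example. In practice, passing zero_to_one(true) to isub/4 is the documented way to get this range.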
This is a new version of isub/4
which replaces the old version while providing backwards compatibility.
This new version allows several options to tweak the algorithm.
Text1 and Text2 are either an atom, string or a list of characters or character codes.
Similarity is a float in the range [-1,1], where 1.0 means most similar. The range can be set to [0,1] with the zero_to_one option described below.
Options is a list with elements described below. Please note that the options are processed at compile time using goal_expansion to provide much better speed. Supported options are:
- normalize(+Boolean)
- Applies string normalization as implemented by the original authors: Text1
and Text2 are mapped to lowercase and the characters "._ "
are removed. Lowercase mapping is done with the C-library function
towlower().
In general, the required normalization is domain dependent and is better
left to the caller. See, e.g., unaccent_atom/2.
The default is to skip normalization (false).
- zero_to_one(+Boolean)
- The old isub implementation deviated from the original algorithm by
returning a value in the [0,1] range. This new isub/4
implementation defaults to the original range of [-1,1], but this option
can be set to true to change the output range to [0,1].
- substring_threshold(+Nonneg)
- The original algorithm was meant to compare terms in semantic web
ontologies, and it had a hard-coded parameter that only considered
substring similarities greater than 2 characters. This caused the
similarity between, for example, 'aa' and 'aa' to
be -0.8, which is not expected. This option allows the user to set
any threshold, such as 0, so that the similarity between short
substrings can be properly recognized. The default value is 2, which is
what the original algorithm used.
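Since normalization is described above as domain dependent and better left to the caller, the sketch below shows one way a caller might normalize text before calling isub/4 with literal options. It mimics the normalization described for normalize(true) (lowercasing and dropping the characters "._ "), but it is only an approximation: it uses string_lower/2 rather than towlower(), and `normalize_label/2`, `dropped_char/1` and `similarity_normalized/3` are hypothetical names introduced for illustration.

```prolog
:- use_module(library(isub)).
:- use_module(library(apply)).   % exclude/3

% normalize_label(+Text, -Normalized)
% Hypothetical caller-side normalization: lowercase the text and
% remove the characters '.', '_' and ' '.
% Text may be an atom, string, or list of characters or codes.
normalize_label(Text, Normalized) :-
    text_to_string(Text, String),
    string_lower(String, Lower),
    string_chars(Lower, Chars),
    exclude(dropped_char, Chars, Kept),
    atom_chars(Normalized, Kept).

% Characters removed by normalize_label/2.
dropped_char('.').
dropped_char('_').
dropped_char(' ').

% similarity_normalized(+Text1, +Text2, -Similarity)
% Normalizes both inputs, then compares them in the [0,1] range.
% The option list is written literally at the call site.
similarity_normalized(Text1, Text2, Similarity) :-
    normalize_label(Text1, N1),
    normalize_label(Text2, N2),
    isub(N1, N2, Similarity, [zero_to_one(true)]).
```

A query such as `?- similarity_normalized('E56.Language', languange, S).` would then compare `e56language` with `languange`. A domain that needs different rules (for example, stripping accents) can replace normalize_label/2 without touching the comparison itself.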