I've published my mappings from Wordnet 3.0 synsets to Wordnet 2.0 synsets in the
wn20mappings folder.
Note: The mappings in this folder are not based on any Princeton source file.
All erroneous mappings are my responsibility, not Princeton's.
Mapping statistics and origin:
The mappings in this folder have been created in multiple steps.
The result of each step is reflected in a separate file.
In the RDF version we have 117,657 Wordnet 3.0 synsets to be mapped.
- Step one: detecting synsets with identical label and gloss (103,339)
- I've detected 103,339 Wordnet 3.0 synsets with a unique one-to-one mapping to a Wordnet 2.0 synset
on the basis of having both an identical label and gloss. I assume these synsets correspond.
Note that this first step already covers around 88% of all synsets.
Results are in file: glossmatches-m.ttl
An additional twelve Wordnet 3.0 synsets were found that each had a mapping to two Wordnet 2.0 synsets,
based on identical gloss and label,
as well as two Wordnet 3.0 synsets that were both mapped to the same Wordnet 2.0 synset.
Results are in file: glossmatches-p.ttl
Since the mappings in the file above are ambiguous, we will ignore them in the following steps.
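Step one can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script that produced the files: it assumes every synset is available as an (id, label, gloss) tuple, and the function name gloss_matches is hypothetical.

```python
from collections import defaultdict

def gloss_matches(wn30, wn20):
    """Split label+gloss matches into one-to-one and ambiguous mappings.

    wn30, wn20: iterables of (synset_id, label, gloss) tuples (assumed layout).
    """
    # Index the 2.0 synsets by their (label, gloss) pair.
    index20 = defaultdict(list)
    for sid, label, gloss in wn20:
        index20[(label, gloss)].append(sid)

    # Collect every candidate pair and count how often each side is used.
    candidates = []
    uses30 = defaultdict(int)
    uses20 = defaultdict(int)
    for sid30, label, gloss in wn30:
        for sid20 in index20.get((label, gloss), []):
            candidates.append((sid30, sid20))
            uses30[sid30] += 1
            uses20[sid20] += 1

    one_to_one = {}
    ambiguous = []
    for sid30, sid20 in candidates:
        # A pair is unique only if neither synset occurs in another pair.
        if uses30[sid30] == 1 and uses20[sid20] == 1:
            one_to_one[sid30] = sid20
        else:
            ambiguous.append((sid30, sid20))
    return one_to_one, ambiguous
```

In this sketch the first return value plays the role of glossmatches-m.ttl and the second that of glossmatches-p.ttl.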
- Step two: detecting synsets with identical label
and strong family resemblances.
- For all the 3.0 synsets that do not yet have a one-to-one mapping, I've
looked at 2.0 synsets that have identical labels and:
- Have both a broader and a narrower synset (hyponym axis)
that were already matched by an earlier step.
Results are in file: label-childparent-matches.ttl (1,272/1,550).
- Only have a broader (based on hyponym, meronym or instance) match.
Results are in file: label-parent-matches.ttl (3,396/3,682).
Results are in file: label-meronym-matches.ttl (1,403/1,561).
Results are in file: label-instance-matches.ttl (507/486).
- Only have a narrower (hyponym axis) match.
Results are in file: label-child-matches.ttl (309/141).
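The "identical label plus matched broader synset" case can be sketched as below. The data layout (plain dicts and sets) and the name label_parent_matches are assumptions made for illustration; the narrower-synset case would follow the same pattern with child sets instead of parent sets. The sketch only rejects candidates that are ambiguous on the 2.0 side of a single lookup.

```python
def label_parent_matches(unmapped30, wn20_by_label, parents30, parents20, mapping):
    """Match synsets with identical labels whose broader synsets are already mapped.

    unmapped30:    {3.0 synset id: label} for synsets without a mapping
    wn20_by_label: {label: [2.0 synset ids]}
    parents30/20:  {synset id: set of broader synset ids}
    mapping:       {3.0 id: 2.0 id} produced by earlier steps
    """
    found = {}
    for sid30, label in unmapped30.items():
        # Translate the 3.0 parents into 2.0 ids via the existing mappings.
        mapped_parents = {mapping[p] for p in parents30.get(sid30, ()) if p in mapping}
        # Keep 2.0 candidates that share at least one translated parent.
        hits = [sid20 for sid20 in wn20_by_label.get(label, [])
                if mapped_parents & parents20.get(sid20, set())]
        if len(hits) == 1:  # keep unambiguous candidates only
            found[sid30] = hits[0]
    return found
```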
- If none of the above applies, but a label occurs
only once in wn30 and also only once in wn20
(within the same part of speech), we consider the corresponding
synsets to match as well.
Results are in file: label-unique-matches.ttl (1,562/1,200).
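A minimal sketch of the unique-label rule, assuming synsets are given as (id, label, part of speech) tuples; the function name and the tuple layout are hypothetical.

```python
from collections import Counter

def unique_label_matches(unmapped30, wn30, wn20):
    """Match labels that occur once per part of speech in both versions.

    wn30, wn20: lists of (synset_id, label, pos) tuples (assumed layout).
    unmapped30: set of 3.0 synset ids still lacking a mapping.
    """
    count30 = Counter((label, pos) for _, label, pos in wn30)
    count20 = Counter((label, pos) for _, label, pos in wn20)
    # Keeping one id per key is safe: only keys with count 1 are looked up.
    id20 = {(label, pos): sid for sid, label, pos in wn20}

    found = {}
    for sid30, label, pos in wn30:
        key = (label, pos)
        if sid30 in unmapped30 and count30[key] == 1 and count20[key] == 1:
            found[sid30] = id20[key]
    return found
```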
- If none of the above applies, but the labels match
and the glosses are very similar, we consider the corresponding
synsets to match as well.
Results are in file: label-neargloss-matches.ttl (823/666).
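The text does not say how "very similar" was measured. One possible way to test it, shown purely as an assumption, is a character-level similarity ratio from Python's standard difflib module; the 0.9 threshold and the function name are hypothetical and not taken from the mappings themselves.

```python
from difflib import SequenceMatcher

def near_gloss_match(gloss30, gloss20, threshold=0.9):
    """Return True when two glosses are nearly identical.

    threshold is a hypothetical cut-off, not a value from the source.
    """
    # autojunk=False disables difflib's heuristic for long sequences,
    # so long glosses are compared character by character as well.
    matcher = SequenceMatcher(None, gloss30.lower(), gloss20.lower(), autojunk=False)
    return matcher.ratio() >= threshold
```

A call such as near_gloss_match(g30, g20) would then be combined with the identical-label check described above.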
- Before saving the above three results, we
removed the synsets for which this step resulted in ambiguous alignments,
and saved these ambiguous mappings in a separate file.
Results are in file: ambiguous-label-pc-matches.ttl (253/279).
- Step three: rerun step two
- We reran step two multiple times to take
advantage of the newly generated mappings, repeating
until no new mappings were found (this was the case
after three repetitions). The second number in the
statistics above shows the number at which each count stabilizes.
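The rerun-until-stable idea is a simple fixed-point loop. The sketch below assumes each matching rule is wrapped as a function that takes the current mapping and returns proposed new mappings; that interface and the name run_until_stable are hypothetical.

```python
def run_until_stable(mapping, matchers):
    """Apply matcher functions repeatedly until no new mappings appear.

    mapping:  {3.0 id: 2.0 id}, updated in place
    matchers: functions taking the current mapping and returning a dict
              of proposed {3.0 id: 2.0 id} mappings
    Returns the number of passes, counting the final pass that added nothing.
    """
    passes = 0
    while True:
        passes += 1
        added = 0
        for matcher in matchers:
            for sid30, sid20 in matcher(mapping).items():
                if sid30 not in mapping:  # never overwrite an earlier match
                    mapping[sid30] = sid20
                    added += 1
        if added == 0:
            return passes
```

Each pass can use the mappings found in the previous one, which is why synsets whose neighbours were unmapped at first may still be matched later.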
Analysis of recall
This leaves us with
117657 - 103339 - 1550 - 3682 - 1561 - 486 - 141 - 1200 - 666 - 165 = 4869
unmapped synsets.
These are in the file:
to_be_mapped.ttl (4869)
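The subtraction can be rechecked with a few lines of Python. The counts are copied from the figures quoted in this text; the dictionary keys are illustrative labels only. Evaluated as written, the expression gives 4,867, two fewer than the 4,869 reported for to_be_mapped.ttl, and the text does not say which figure accounts for the difference.

```python
# Counts copied from the figures quoted in the text above.
total_wn30 = 117657
mapped = {
    "glossmatches-m": 103339,
    "label-childparent-matches": 1550,
    "label-parent-matches": 3682,
    "label-meronym-matches": 1561,
    "label-instance-matches": 486,
    "label-child-matches": 141,
    "label-unique-matches": 1200,
    "label-neargloss-matches": 666,
    "final term (not tied to a file in the text)": 165,
}
unmapped = total_wn30 - sum(mapped.values())
print(unmapped)
```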
A quick manual inspection showed that many of
these unmapped synsets are new senses of existing words.
Improvements to the mappings will be posted on this site.