Project 2b: NGordNet (WordNet)

This project was more real-world like, open-ended, since the instructions ask for a specific outcome, but don't tell you how to get there.

WordNetPrerequisites to running the code. Additionally, you need to download the data files.

Instructions: https://fa22.datastructur.es/materials/proj/proj2b/

Solution: https://github.com/tomthestrom/cs61b/tree/master/proj2b

About the project

Great project without too many instructions on how to do it.

Included:

  1. Implementing lookups to support the functionality of the Graph.

Using the WordNet Dataset #

Before we can incorporate WordNet into our project, we first need to understand the WordNet dataset.

WordNet is a “semantic lexicon for the English language” that is used extensively by computational linguists and cognitive scientists; for example, it was a key component in IBM’s Watson. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, “change” is a hypernym of “demotion”, since “demotion” is-a (type of) “change”. “ change” is in turn a hyponym of “action”, since “change” is-a (type of) “action”. A visual depiction of some hyponym relationships in English is given below:

Each node in the graph above is a synset. Synsets consist of one or more words in English that all have the same meaning. For example, one synset is “jump, parachuting” , which represents the act of descending to the ground with a parachute. “jump, parachuting” is a hyponym of “descent”, since “jump, parachuting” is-a “descent”.

Words in English may belong to multiple synsets. This is just another way of saying words may have multiple meanings. For example, the word “jump” also belongs to the synset “jump, leap” , which represents the more figurative notion of jumping (e.g. a jump in attendance) rather the literal meaning of jump from the other synset (e.g. a jump over a puddle). The hypernym of the synset “jump, leap” is “increase”, since “jump, leap” is-an “increase”. Of course, there are other ways to “increase” something: for example, we can increase something through “augmentation,” and thus it is no surprise that we have an arrow pointing downwards from “increase” to “augmentation” in the diagram above.

Synsets may include not just words, but also what are known as collocations. You can think of these as single words that occur next to each other so often that they are considered a single word, e.g. nasal_decongestant . To avoid ambiguity, we will represent the constituent words of collocations as being separated with an underscore _ instead of the usual convention in English of separating them with spaces. For simplicity, we will refer to collocations as simply “words” throughout this document.

A synset may be a hyponym of multiple synsets. For example, “actifed” is a hyponym of both “antihistamine” and “ nasal_decongestant”, since “actifed” is both of these things.

If you’re curious, you can browse the Wordnet database by using the web interface , though this is not necessary for this project.

Challenges, Description, Implementation

I decided to divide this section into the most significant challenges I faced.

Challenge 1 - Coming up with a design for the supporting graph

Description

The first challenge was coming up with a design for a Graph to support the requested operations. The data was provided in 2 types of text files:

  • File type #1: List of noun synsets. The file synsets.txt (and other smaller files with synset in the name) lists all the synsets in WordNet. The first field is the synset id (an integer), the second field is the synonym set (or synset) , and the third field is its dictionary definition (also called its “gloss” for some reason). For example, the line

  • means that the synset { AND_circuit, AND_gate } has an id number of 36 and its definition is “a circuit in a computer that fires only when all of its inputs fire”. The individual nouns that comprise a synset are separated by spaces (and a synset element is not permitted to contain a space). The S synset ids are numbered 0 through S − 1; the id numbers will appear consecutively in the synset file. You will not (officially) use the definitions in this project, though you’re welcome to use them in some interesting way if you’d like if you decide to add optional features at the end of this project. The id numbers are useful because they also appear in the hyponym files, described as file type #2.

  • File type #2: List of hyponyms. The file hyponyms.txt (and other smaller files with hyponym in the name) contains the hyponym relationships: The first field is a synset id; subsequent fields are the id numbers of the synset’s direct hyponyms. For example, the following line

means that the synset 79537 (“viceroy vicereine”) has two hyponyms: 38611 (“exarch”) and 9007 (“Khedive”), representing that exarchs and Khedives are both types of viceroys (or vicereine). The synsets are obtained from the corresponding lines in the file synsets.txt:

There may be more than one line that starts with the same synset ID. For example, in hyponyms16.txt, we have

This indicates that both synsets 12 and 13 are direct hyponyms of synset 11. These two could also have been combined on to one line, i.e. the line below would have the exact same meaning, namely that synsets 12 and 13 are direct hyponyms of synset 11.

You might ask why there are two ways of specifying the same thing. Real world data is often messy, and we have to deal with it.

Implementation

if we "remove" all the "noise", the data provided by the instructors was basically in 2 formats.

The first was a "map" an Integer ID associated with a String word.

The second was already sort of a "graph" since it provided an adjacency list in a form of an Integer ID as a parent/hypernym and an Integer list of 1 or more of it's children/hyponyms. I used an IdToIdsMap<Integer, HashSet<Integer>> within the Graph class to store these relationships.

I proceeded by creating a WordNet class which would act as an intermediary between the only required class in this exam - HyponymHandler - the class that handles queries from the frontend and the Graph class, for which WordNet acts as a wrapper.

In order to support lookup by words provided by the user and to be able to use the results of the lookup (all the ids associated with the word) to query the Graph, I implemented 2 HashMaps that acted as instance properties of the WordNet class:

  1. A HashMap WordToIdsMap<String, HashSet<Integer>>:

    Upon lookup returns a HashSet of ids associated with the requested word, so that we can look them up in the Graph.

  2. A HashMap of IdToWordsMap<Integer, HashSet<String>>:

    Upon lookup returns a HashSet of words associated with the requested id, so that we can associate the results returned by querying the Graph with actual words.

Challenge 2 - Handling List of Words

Description

More verbose description: https://fa22.datastructur.es/materials/proj/proj2b/#handling-lists-of-words

Based on the description of the Graph functionality in Challenge nr 1, it's not hard to understand how retrieving hyponyms for one hypernym works (check WordNet - getHyponyms method for the exact implementation).

The challenge of handling a list of words layed in only returning words that are the hyponyms of all of the provided (the list) hypernyms. If the user types “female, animal” in the words box. Then, clicking “Hyponyms” should display [amazon, bird, cat, chick, dam, demoiselle, female, female_mammal, filly, hag, hen, nanny, nymph, siren], as these are all the words which are hyponyms of female and animal.

Implementation

I implemented this in WordNet - getHyponymsIntersect, where I first call the getHyponyms method for each word on the list, save each result in a TreeSet. These TreeSets are stored in an array which I iterate over and use retainAll to determine which words (hyponyms) appear in all TreeSets.

The result is then returned to the user.

Challenge 3 - Handling k > 0

Description

The methods implemented in the previous challenges returned as many "proper" results as possible. However - what if the user wanted to specify the maximum amount of words they're interested in? In that case, I have to find a way to return the the top k most popular words (possibly in a certain time period).

Implementation

To get the k most popular hyponyms of a hypernym (possibly a list of hypernyms), I implemented the getKPopularHyponyms method in WordNet class.

getKPopularHyponyms

  • uses a MaxPQ for sorting the hyponyms by popularity (weight) via WordNet - getWordWeightPQ

  • WordWeight is a public record with 2 props - String word and Double weight

    • implements Comparable interface - compareTo uses the weight property for comparison

  • is the method that is ultimately used to handle all the requests (called in HyponymHandler.handle), be it 1 or a list of hypernyms, a k = 0 or a k > 0.

  • The data flow is such as:

    HyponymHandler.handle => Wordnet.getKPopularHyponyms()

Summary

Those were the most challenging parts of the project, with the most challenging being designing a proper structure to represent the Directed Acyclic Graph that would represent the relationship between the hypernyms and hyponyms.

The only required part of the project was the HyponymHandler class and its handle method. The rest was to be decided by the students.

Extensive unit test were provided to test the HyponymHandler handle method and I also wrote some unit tests for the classes I implemented to verify their correctness.

Tests:

All tests can be found here: https://github.com/tomthestrom/cs61b/tree/master/proj2b/ngordnet/proj2b_testing. Additionally I'm linking the specific unit test I implemented above the code of each class.

Class HyponymHandler

Extensive tests for this class had already provided (see link above).

Class WordNet

Unit test: https://github.com/tomthestrom/cs61b/blob/master/proj2b/ngordnet/proj2b_testing/WordNetTest.java

Class Graph

Unit Test: https://github.com/tomthestrom/cs61b/blob/master/proj2b/ngordnet/proj2b_testing/TestGraph.java

Class IdToIdsMap

No test, class used within Graph whose proper functionality is tested.

Class WordToIdsMap

Unit test: https://github.com/tomthestrom/cs61b/blob/master/proj2b/ngordnet/proj2b_testing/TestWordToIdsMap.java

Class IdToWordsMap

Unit test: https://github.com/tomthestrom/cs61b/blob/master/proj2b/ngordnet/proj2b_testing/TestIdToWordsMap.java

Public record WordWeight

Last updated