The links below cover everything from downloading WordNet to running the code.
Instructions:
Solution:
About the project
A great project without too many instructions on how to do it.
Included:
- Implementing a Directed Acyclic Graph with an adjacency list.
- Implementing lookups to support the functionality of the Graph.
Using the WordNet Dataset
Before we can incorporate WordNet into our project, we first need to understand the WordNet dataset.
WordNet is a “semantic lexicon for the English language” that is used extensively by computational linguists and cognitive scientists; for example, it was a key component in IBM’s Watson. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, “change” is a hypernym of “demotion”, since “demotion” is-a (type of) “change”. “change” is in turn a hyponym of “action”, since “change” is-a (type of) “action”. A visual depiction of some hyponym relationships in English is given below:
Source: UC Berkeley - CS61B FA 22
A synset may be a hyponym of multiple synsets. For example, “actifed” is a hyponym of both “antihistamine” and “nasal_decongestant”, since “actifed” is both of these things.
Challenges, Description, Implementation
I decided to divide this section into the most significant challenges I faced.
Challenge 1 - Coming up with a design for the supporting graph
Description
The first challenge was coming up with a design for a Graph to support the requested operations. The data was provided in 2 types of text files:
File type #1: List of noun synsets. The file synsets.txt (and other smaller files with synset in the name) lists all the synsets in WordNet. The first field is the synset id (an integer), the second field is the synonym set (or synset), and the third field is its dictionary definition (also called its “gloss” for some reason). For example, the line
36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire
means that the synset { AND_circuit, AND_gate } has an id number of 36 and its definition is “a circuit in a computer that fires only when all of its inputs fire”. The individual nouns that comprise a synset are separated by spaces (and a synset element is not permitted to contain a space). The S synset ids are numbered 0 through S − 1; the id numbers will appear consecutively in the synset file. You will not (officially) use the definitions in this project, though you’re welcome to use them in some interesting way if you decide to add optional features at the end of this project. The id numbers are useful because they also appear in the hyponym files, described as file type #2.
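As an aside, parsing such a line is straightforward. The sketch below is a standalone illustration (not the project's actual parser); it splits the line into at most 3 fields so that commas inside the gloss stay intact:

```java
import java.util.Arrays;
import java.util.List;

public class SynsetLineDemo {
    // Splits into at most 3 fields so commas inside the gloss stay intact.
    static int parseId(String line) {
        return Integer.parseInt(line.split(",", 3)[0]);
    }

    static List<String> parseWords(String line) {
        return Arrays.asList(line.split(",", 3)[1].trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String line = "36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire";
        System.out.println(parseId(line));    // 36
        System.out.println(parseWords(line)); // [AND_circuit, AND_gate]
    }
}
```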
File type #2: List of hyponyms. The file hyponyms.txt (and other smaller files with hyponym in the name) contains the hyponym relationships: The first field is a synset id; subsequent fields are the id numbers of the synset’s direct hyponyms. For example, the following line
79537,38611,9007
means that the synset 79537 (“viceroy vicereine”) has two hyponyms: 38611 (“exarch”) and 9007 (“Khedive”), representing that exarchs and Khedives are both types of viceroys (or vicereine). The synsets are obtained from the corresponding lines in the file synsets.txt:
79537,viceroy vicereine,governor of a country or province who rules...
38611,exarch,a viceroy who governed a large province in the Roman Empire
9007,Khedive,one of the Turkish viceroys who ruled Egypt between...
There may be more than one line that starts with the same synset ID. For example, in hyponyms16.txt, we have
11,12
11,13
This indicates that both synsets 12 and 13 are direct hyponyms of synset 11. These two could also have been combined on to one line, i.e. the line below would have the exact same meaning, namely that synsets 12 and 13 are direct hyponyms of synset 11.
11,12,13
You might ask why there are two ways of specifying the same thing. Real world data is often messy, and we have to deal with it.
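To see why the two encodings are interchangeable, here is a minimal standalone sketch (hypothetical helper name, not the project's code) that accumulates edges into a set-valued adjacency map; once edges land in sets, the two spellings produce identical maps:

```java
import java.util.*;

public class HyponymLinesDemo {
    // Merges one hyponyms.txt line's edges into a set-valued adjacency map;
    // repeated parent ids simply land in the same set.
    static void addLine(Map<Integer, Set<Integer>> adj, String line) {
        String[] fields = line.split(",");
        Set<Integer> kids = adj.computeIfAbsent(Integer.parseInt(fields[0]), p -> new HashSet<>());
        for (int i = 1; i < fields.length; i++) {
            kids.add(Integer.parseInt(fields[i]));
        }
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> twoLines = new HashMap<>();
        addLine(twoLines, "11,12");
        addLine(twoLines, "11,13");

        Map<Integer, Set<Integer>> oneLine = new HashMap<>();
        addLine(oneLine, "11,12,13");

        System.out.println(twoLines.equals(oneLine)); // true
    }
}
```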
Implementation
If we "remove" all the "noise", the data provided by the instructors came in 2 formats.
The first was a "map": an Integer ID associated with a String word.
In order to support lookups by words provided by the user, and to be able to use the results of the lookup (all the ids associated with the word) to query the Graph, I implemented 2 HashMaps that act as instance properties of the WordNet class:
- WordToIdsMap: upon lookup, returns a HashSet of ids associated with the requested word, so that we can look them up in the Graph.
- IdToWordsMap: upon lookup, returns a HashSet of words associated with the requested id, so that we can associate the results returned by querying the Graph with actual words.
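The round trip through the two maps can be sketched with plain HashMaps and a hypothetical helper (toy data, not the project's classes):

```java
import java.util.*;

public class TwoMapDemo {
    // word -> ids -> words: the same two-step lookup the WordNet wrapper
    // performs, here with plain maps and a made-up helper name.
    static TreeSet<String> wordsFor(String word,
                                    Map<String, Set<Integer>> wordToIds,
                                    Map<Integer, Set<String>> idToWords) {
        TreeSet<String> result = new TreeSet<>();
        for (int id : wordToIds.getOrDefault(word, Set.of())) {
            result.addAll(idToWords.getOrDefault(id, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        // "jump" belongs to two synsets in this toy data
        Map<String, Set<Integer>> wordToIds = Map.of("jump", Set.of(3, 7));
        Map<Integer, Set<String>> idToWords = Map.of(
                3, Set.of("jump", "parachuting"),
                7, Set.of("jump", "leap"));
        System.out.println(wordsFor("jump", wordToIds, idToWords)); // [jump, leap, parachuting]
    }
}
```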
Challenge 2 - Handling List of Words
Description
The challenge of handling a list of words lay in returning only the words that are hyponyms of all of the provided hypernyms. If the user types “female, animal” in the words box, then clicking “Hyponyms” should display [amazon, bird, cat, chick, dam, demoiselle, female, female_mammal, filly, hag, hen, nanny, nymph, siren], as these are all the words which are hyponyms of both female and animal.
Implementation
The result is then returned to the user.
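The core of the intersection is TreeSet.retainAll; a minimal sketch on made-up data (not the real dataset or the project's exact method):

```java
import java.util.*;

public class IntersectDemo {
    // Keeps only the words present in every set; seeds from a copy of the
    // first set so the inputs are not mutated.
    static TreeSet<String> intersect(List<TreeSet<String>> sets) {
        TreeSet<String> result = new TreeSet<>(sets.get(0));
        for (int i = 1; i < sets.size(); i++) {
            result.retainAll(sets.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<String> femaleHyponyms = new TreeSet<>(List.of("daisy", "female", "filly", "hen"));
        TreeSet<String> animalHyponyms = new TreeSet<>(List.of("animal", "bird", "filly", "hen"));
        System.out.println(intersect(List.of(femaleHyponyms, animalHyponyms))); // [filly, hen]
    }
}
```

Copying the first set before calling retainAll is a small design choice: it leaves the per-hypernym sets untouched in case they are needed again.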
Challenge 3 - Handling k > 0
Description
The methods implemented in the previous challenges returned as many "proper" results as possible. However - what if the user wanted to specify the maximum amount of words they're interested in? In that case, I have to find a way to return the top k most popular words (possibly in a certain time period).
Implementation
To get the k most popular hyponyms of a hypernym (or a list of hypernyms), I implemented the getKPopularHyponyms method in the WordNet class.
getKPopularHyponyms:
- uses a MaxPQ to sort the hyponyms by popularity (weight) via WordNet - getWordWeightPQ
- WordWeight implements the Comparable interface - compareTo uses the weight property for comparison
- is the method that is ultimately used to handle all the requests (called in HyponymHandler.handle), be it 1 hypernym or a list of hypernyms, k = 0 or k > 0.
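The same top-k idea can be sketched with java.util.PriorityQueue standing in for algs4's MaxPQ (a standalone approximation with a hypothetical topK helper, not the project's code):

```java
import java.util.*;

public class TopKDemo {
    // Stand-in for the project's WordWeight record.
    record WordWeight(String word, double weight) {}

    // Returns the k heaviest words by polling a max-ordered priority queue.
    static List<String> topK(Map<String, Double> weights, int k) {
        PriorityQueue<WordWeight> pq = new PriorityQueue<>(
                Comparator.comparingDouble(WordWeight::weight).reversed());
        weights.forEach((word, weight) -> pq.add(new WordWeight(word, weight)));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < k && !pq.isEmpty(); i++) {
            result.add(pq.poll().word());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = Map.of("cat", 5.0, "hen", 2.0, "filly", 9.0);
        System.out.println(topK(weights, 2)); // [filly, cat]
    }
}
```

Note that java.util.PriorityQueue's iterator is not ordered, so the sketch polls the queue instead; algs4's MaxPQ, by contrast, documents a descending-order iterator, which is why the project code can iterate it directly.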
Those were the most significant challenges of the project, the hardest being the design of a proper structure to represent the Directed Acyclic Graph of the relationships between hypernyms and hyponyms.
The only required part of the project was the HyponymHandler class and its handle method. The rest was to be decided by the students.
Extensive unit tests were provided to test the HyponymHandler handle method, and I also wrote some unit tests for the classes I implemented to verify their correctness.
Tests:
Class HyponymHandler
Extensive tests for this class had already been provided (see link above).
Class WordNet
package ngordnet.main;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.MaxPQ;
import ngordnet.ngrams.NGramMap;
public class WordNet {
//wrapper for a graph
/**
* Keeps track of the ids associated with the word
*/
private WordToIdsMap wordToIdsMap;
/**
* Keeps track of the words associated with the id
*/
private IdToWordsMap idToWordsMap;
/**
* Keeps track of id to id connection hypernym -> hyponym
*/
private Graph graph;
public WordNet(String synsetsFileName, String hyponymsFileName) {
graph = new Graph();
wordToIdsMap = new WordToIdsMap();
idToWordsMap = new IdToWordsMap();
setMapsFromFile(synsetsFileName);
setGraphFromFile(hyponymsFileName);
}
/**
* Returns all the hyponyms of the hypernym, including the hypernym itself - in alphabetical order thanks to being a TreeSet
* @param hypernym
*/
public TreeSet<String> getHyponyms(String hypernym) {
HashSet<Integer> parentIds = wordToIdsMap.get(hypernym);
HashSet<Integer> childIds = new HashSet<>();
TreeSet<String> hyponyms = new TreeSet<>();
for (int parentId : parentIds) {
//find and get all child ids from the graph
childIds.addAll(graph.getAllChildren(parentId));
//add all words associated with the parent ids to hyponyms set
hyponyms.addAll(idToWordsMap.get(parentId));
}
for (int childId : childIds) {
//add all words associated with the found child ids to hyponyms set
hyponyms.addAll(idToWordsMap.get(childId));
}
return hyponyms;
}
/**
* Returns the hyponyms that are common between all requested hypernyms (sets intersection)
* @param hypernyms
*/
public TreeSet<String> getHyponymsIntersect(List<String> hypernyms) {
TreeSet<String> hyponyms;
TreeSet<String>[] hyponymsSets = new TreeSet[hypernyms.size()];
if (hyponymsSets.length == 0) {
//no hypernyms requested - nothing to intersect
return new TreeSet<>();
}
//create hypernyms.size() amount of hyponyms sets
for (int i = 0; i < hypernyms.size(); i++) {
hyponymsSets[i] = getHyponyms(hypernyms.get(i));
}
if (hyponymsSets[0] == null) {
hyponyms = new TreeSet<>();
} else {
//init hyponyms with the value of the 0th set
hyponyms = hyponymsSets[0];
}
//iterate over the rest of hyponyms sets - removing items not found in the other ones
for (int i = 1; i < hyponymsSets.length; i++) {
if (hyponymsSets[i] != null) {
hyponyms.retainAll(hyponymsSets[i]);
}
}
return hyponyms;
}
/**
* Returns the k most popular hyponyms between startYear and endYear (all of them if k <= 0)
*/
public TreeSet<String> getKPopularHyponyms(NGramMap ngramMap, List<String> hypernyms, int k, int startYear, int endYear) {
TreeSet<String> requestedHyponyms = getHyponymsIntersect(hypernyms);
if (k <= 0) {
return requestedHyponyms;
}
TreeSet<String> kPopularHyponyms = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
MaxPQ<WordWeight> maxPQ = getWordWeightPQ(ngramMap, requestedHyponyms, startYear, endYear);
Iterator<WordWeight> iterator = maxPQ.iterator();
int i = 0;
while (iterator.hasNext() && i < k) {
kPopularHyponyms.add(iterator.next().word());
i++;
}
return kPopularHyponyms;
}
private void addWordToMap(String word, Integer id) {
wordToIdsMap.addWordIdPair(word, id);
idToWordsMap.addIdWordPair(id, word);
}
private HashSet<Integer> getWordIdSet(String word) {
return wordToIdsMap.get(word);
}
private HashSet<String> getIdWordSet(int id) {
return idToWordsMap.get(id);
}
/**
* Sets both word to ids and ids to word mappings from the provided synsets.txt file
* @param fileName
*/
private void setMapsFromFile(String fileName) {
In in = new In(fileName);
ArrayList<String> line;
int id;
while (in.hasNextLine()) {
line = Stream.of(in.readLine().split(",")).collect(Collectors.toCollection(ArrayList<String>::new));
id = Integer.parseInt(line.get(0));
String[] words = (line.get(1)).trim().split("\\s+");
for (String word: words) {
addWordToMap(word, id);
}
}
}
/**
* Sets id to id mappings from the provided hyponyms.txt
* @param fileName
*/
private void setGraphFromFile(String fileName) {
In in = new In(fileName);
ArrayList<String> line;
int parentId;
while (in.hasNextLine()) {
line = Stream.of(in.readLine().split(",")).collect(Collectors.toCollection(ArrayList<String>::new));
parentId = Integer.parseInt(line.get(0));
for (int i = 1; i < line.size(); i++) {
int childId = Integer.parseInt(line.get(i));
graph.addNode(parentId, childId);
}
}
}
/**
* Returns a PQ of WordWeight objects - each word's weight is based on its popularity in the NGramMap
* between the given years
*/
private MaxPQ<WordWeight> getWordWeightPQ(NGramMap nGramMap, Collection<String> words, int startYear, int endYear) {
MaxPQ<WordWeight> maxPQ = new MaxPQ<>();
for (String word : words) {
/*
* summedWeightHistory only takes a list as its words argument
* and returns a TimeSeries mapping year -> summed weight;
* data() returns the sums, and .stream().mapToDouble().sum()
* gives the total over the whole period
*/
double weight = nGramMap.summedWeightHistory(Collections.singletonList(word), startYear, endYear)
.data()
.stream()
.mapToDouble(Double::doubleValue).sum();
WordWeight wordWeight = new WordWeight(word, weight);
maxPQ.insert(wordWeight);
}
return maxPQ;
}
}
Class Graph
package ngordnet.main;
import java.util.HashSet;
public class Graph {
private IdToIdsMap idToIdsMap;
public Graph() {
idToIdsMap = new IdToIdsMap();
}
public void addNode(int parentId, int childId) {
idToIdsMap.addIdIdPair(parentId, childId);
}
public HashSet<Integer> getNode(int parentId) {
return idToIdsMap.get(parentId);
}
/**
* Returns all children of the parent, not including the parent
* @param parentId
* @return
*/
public HashSet<Integer> getAllChildren(int parentId) {
HashSet<Integer> children = new HashSet<>();
if (getNode(parentId) != null) {
for (int childId: getNode(parentId)) {
children.addAll(getAllChildrenHelper(childId, children));
}
}
return children;
}
/**
* Recursively collects all children
* @param parentId
* @param children
*/
private HashSet<Integer> getAllChildrenHelper(int parentId, HashSet<Integer> children) {
children.add(parentId);
if (getNode(parentId) != null) {
for (int childId: getNode(parentId)) {
getAllChildrenHelper(childId, children);
}
}
return children;
}
}
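One caveat worth noting: since a node in a DAG can be reached through multiple parents, the recursive helper above may re-traverse subtrees it has already collected. An iterative traversal with a visited check avoids the repeated work; the sketch below is a standalone version using a plain adjacency map rather than the project's IdToIdsMap:

```java
import java.util.*;

public class ReachableDemo {
    // Iteratively collects every id reachable from start (excluding start),
    // visiting each node at most once thanks to the seen set.
    static Set<Integer> allChildren(Map<Integer, Set<Integer>> adj, int start) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int node = stack.pop();
            for (int child : adj.getOrDefault(node, Set.of())) {
                if (seen.add(child)) { // add() returns false if already visited
                    stack.push(child);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // diamond-shaped DAG: 1 -> {2, 3}, 2 -> {4}, 3 -> {4}
        Map<Integer, Set<Integer>> adj = Map.of(1, Set.of(2, 3), 2, Set.of(4), 3, Set.of(4));
        System.out.println(new TreeSet<>(allChildren(adj, 1))); // [2, 3, 4]
    }
}
```

In the diamond above, the recursive version would visit node 4 twice (once via 2, once via 3); the seen set ensures each node is pushed at most once.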
Class IdToIdsMap
package ngordnet.main;
import java.util.HashMap;
import java.util.HashSet;
public class IdToIdsMap extends HashMap<Integer, HashSet<Integer>> {
public IdToIdsMap() {
super();
}
public void addIdIdPair(int parentId, int childId) {
HashSet<Integer> childIdSet;
if (this.containsKey(parentId)) {
childIdSet = this.get(parentId);
childIdSet.add(childId);
return;
}
childIdSet = new HashSet<>();
childIdSet.add(childId);
this.put(parentId, childIdSet);
}
}
Class WordToIdsMap
package ngordnet.main;
import java.util.HashMap;
import java.util.HashSet;
/**
* Lookup of a word associated with a set of ids (a word can have multiple ids, depending on its meanings in WordNet)
*/
public class WordToIdsMap extends HashMap<String, HashSet<Integer>> {
public WordToIdsMap() {
super();
}
public void addWordIdPair(String word, Integer id) {
HashSet<Integer> idSet;
if (this.containsKey(word)) {
idSet = this.get(word);
idSet.add(id);
return;
}
idSet = new HashSet<>();
idSet.add(id);
this.put(word, idSet);
}
@Override
public HashSet<Integer> get(Object key) {
HashSet<Integer> set = super.get(key);
if (set == null) {
set = new HashSet<>();
}
return set;
}
}
Class IdToWordsMap
package ngordnet.main;
import java.util.HashMap;
import java.util.HashSet;
public class IdToWordsMap extends HashMap<Integer, HashSet<String>> {
public IdToWordsMap() {
super();
}
public void addIdWordPair(Integer id, String word) {
HashSet<String> wordSet;
if (this.containsKey(id)) {
wordSet = this.get(id);
wordSet.add(word);
return;
}
wordSet = new HashSet<>();
wordSet.add(word);
this.put(id, wordSet);
}
}
Public record WordWeight
package ngordnet.main;
public record WordWeight(String word, Double weight) implements Comparable<WordWeight> {
@Override
public int compareTo(WordWeight wordWeight) {
return weight().compareTo(wordWeight.weight());
}
}
Each node in the graph above is a synset. Synsets consist of one or more words in English that all have the same meaning. For example, one synset is “jump, parachuting”, which represents the act of descending to the ground with a parachute. “jump, parachuting” is a hyponym of “descent”, since “jump, parachuting” is-a “descent”.
Words in English may belong to multiple synsets. This is just another way of saying words may have multiple meanings. For example, the word “jump” also belongs to the synset “jump, leap”, which represents the more figurative notion of jumping (e.g. a jump in attendance) rather than the literal meaning of jump from the other synset (e.g. a jump over a puddle). The hypernym of the synset “jump, leap” is “increase”, since “jump, leap” is-an “increase”. Of course, there are other ways to “increase” something: for example, we can increase something through “augmentation,” and thus it is no surprise that we have an arrow pointing downwards from “increase” to “augmentation” in the diagram above.
Synsets may include not just words, but also what are known as collocations. You can think of these as single words that occur next to each other so often that they are considered a single word, e.g. “nasal_decongestant”. To avoid ambiguity, we will represent the constituent words of collocations as being separated with an underscore _ instead of the usual convention in English of separating them with spaces. For simplicity, we will refer to collocations as simply “words” throughout this document.
If you’re curious, you can browse the WordNet database online, though this is not necessary for this project.
The second was already sort of a "graph", since it provided an adjacency list in the form of an Integer ID as a parent/hypernym and a list of 1 or more Integer ids of its children/hyponyms. I used an IdToIdsMap within the Graph class to store these relationships.
I proceeded by creating a WordNet class which would act as an intermediary between the only required class in this project - HyponymHandler, the class that handles queries from the frontend - and the Graph class, for which WordNet acts as a wrapper.
A HashMap of String word to a HashSet of Integer ids (WordToIdsMap).
A HashMap of Integer id to a HashSet of String words (IdToWordsMap).
More verbose description:
Based on the description above, it's not hard to understand how retrieving hyponyms for one hypernym works (check the WordNet getHyponyms method for the exact implementation).
I implemented this in WordNet - getHyponymsIntersect, where I first call the getHyponyms method for each word on the list and save each result in a TreeSet. These TreeSets are stored in an array, which I iterate over to determine which words (hyponyms) appear in all TreeSets.
WordWeight is a public record with 2 props - String word and Double weight.
All tests can be found here: . Additionally, I'm linking the specific unit tests I implemented above the code of each class.
Unit test:
Unit Test:
No test; this class is used within the classes above, whose proper functionality is tested.