Project 2b: NGordNet (WordNet)
This project felt more real-world and open-ended, since the instructions ask for a specific outcome but don't tell you how to get there.
Prerequisites to running the code: besides the code itself, you need to download the WordNet data files.
Instructions: https://fa22.datastructur.es/materials/proj/proj2b/
Solution: https://github.com/tomthestrom/cs61b/tree/master/proj2b
About the project
A great project without too many instructions on how to do it.
Included:
Designing & Implementing a Directed Acyclic Graph with an adjacency list.
Implementing lookups to support the functionality of the Graph.
Using the WordNet Dataset
Before we can incorporate WordNet into our project, we first need to understand the WordNet dataset.
WordNet is a “semantic lexicon for the English language” that is used extensively by computational linguists and cognitive scientists; for example, it was a key component in IBM’s Watson. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, “change” is a hypernym of “demotion”, since “demotion” is-a (type of) “change”. “change” is in turn a hyponym of “action”, since “change” is-a (type of) “action”. A visual depiction of some hyponym relationships in English is given below:
[Figure: hyponym relationship diagram from the project spec]
Each node in the graph above is a synset. Synsets consist of one or more words in English that all have the same meaning. For example, one synset is “jump, parachuting”, which represents the act of descending to the ground with a parachute. “jump, parachuting” is a hyponym of “descent”, since “jump, parachuting” is-a “descent”.
Words in English may belong to multiple synsets. This is just another way of saying words may have multiple meanings. For example, the word “jump” also belongs to the synset “jump, leap”, which represents the more figurative notion of jumping (e.g. a jump in attendance) rather than the literal meaning of jump from the other synset (e.g. a jump over a puddle). The hypernym of the synset “jump, leap” is “increase”, since “jump, leap” is-an “increase”. Of course, there are other ways to “increase” something: for example, we can increase something through “augmentation,” and thus it is no surprise that we have an arrow pointing downwards from “increase” to “augmentation” in the diagram above.
Synsets may include not just words, but also what are known as collocations. You can think of these as words that occur next to each other so often that they are considered a single word, e.g. nasal_decongestant. To avoid ambiguity, we will represent the constituent words of collocations as being separated with an underscore _ instead of the usual convention in English of separating them with spaces. For simplicity, we will refer to collocations as simply “words” throughout this document.
A synset may be a hyponym of multiple synsets. For example, “actifed” is a hyponym of both “antihistamine” and “nasal_decongestant”, since “actifed” is both of these things.
If you’re curious, you can browse the WordNet database using the web interface, though this is not necessary for this project.
Challenges, Description, Implementation
I decided to divide this section into the most significant challenges I faced.
Challenge 1 - Coming up with a design for the supporting graph
Description
The first challenge was coming up with a design for a Graph to support the requested operations. The data was provided in 2 types of text files:
File type #1: List of noun synsets. The file synsets.txt (and other smaller files with synset in the name) lists all the synsets in WordNet. The first field is the synset id (an integer), the second field is the synonym set (or synset), and the third field is its dictionary definition (also called its “gloss” for some reason). For example, the line
36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire
means that the synset { AND_circuit, AND_gate } has an id number of 36 and its definition is “a circuit in a computer that fires only when all of its inputs fire”. The individual nouns that comprise a synset are separated by spaces (and a synset element is not permitted to contain a space). The S synset ids are numbered 0 through S − 1; the id numbers will appear consecutively in the synset file. You will not (officially) use the definitions in this project, though you’re welcome to use them in some interesting way if you decide to add optional features at the end of this project. The id numbers are useful because they also appear in the hyponym files, described as file type #2.
File type #2: List of hyponyms. The file hyponyms.txt (and other smaller files with hyponym in the name) contains the hyponym relationships: The first field is a synset id; subsequent fields are the id numbers of the synset’s direct hyponyms. For example, the following line
79537,38611,9007
means that the synset 79537 (“viceroy vicereine”) has two hyponyms: 38611 (“exarch”) and 9007 (“Khedive”), representing that exarchs and Khedives are both types of viceroys (or vicereines). The synsets are obtained from the corresponding lines in the file synsets.txt:
79537,viceroy vicereine,governor of a country or province who rules...
38611,exarch,a viceroy who governed a large province in the Roman Empire
9007,Khedive,one of the Turkish viceroys who ruled Egypt between...
There may be more than one line that starts with the same synset ID. For example, in hyponyms16.txt, we have
11,12
11,13
This indicates that both synsets 12 and 13 are direct hyponyms of synset 11. These two could also have been combined onto one line, i.e. the line below would have the exact same meaning, namely that synsets 12 and 13 are direct hyponyms of synset 11.
11,12,13
You might ask why there are two ways of specifying the same thing. Real world data is often messy, and we have to deal with it.
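To make the two formats concrete, here is a minimal, standalone parsing sketch using only plain String.split. The class name ParseSketch and the split(",", 3) limit are my own choices for illustration; the actual parsing lives in the WordNet class shown further below.
import java.util.Arrays;
import java.util.List;

public class ParseSketch {
    public static void main(String[] args) {
        // A synsets.txt line: id, space-separated words, gloss.
        // The gloss may itself contain commas, so only the first two fields are split off.
        String synsetLine = "36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire";
        String[] synsetFields = synsetLine.split(",", 3);
        int synsetId = Integer.parseInt(synsetFields[0]);
        List<String> words = Arrays.asList(synsetFields[1].split("\\s+"));
        System.out.println(synsetId + " -> " + words); // 36 -> [AND_circuit, AND_gate]

        // A hyponyms.txt line: a parent synset id followed by the ids of its direct hyponyms.
        String hyponymLine = "79537,38611,9007";
        String[] idFields = hyponymLine.split(",");
        int parentId = Integer.parseInt(idFields[0]);
        for (int i = 1; i < idFields.length; i++) {
            System.out.println(parentId + " -> " + idFields[i]); // edges 79537 -> 38611, 79537 -> 9007
        }
    }
}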
Implementation
if we "remove" all the "noise", the data provided by the instructors was basically in 2 formats.
The first was a "map" an Integer
ID associated with a String
word.
The second was already sort of a "graph" since it provided an adjacency list in a form of an Integer
ID as a parent/hypernym and an Integer
list of 1 or more of it's children/hyponyms. I used an
IdToIdsMap<Integer, HashSet<Integer>>
within the
Graph
class to store these relationships.
I proceeded by creating a WordNet class, which acts as an intermediary between the only required class in this project - HyponymHandler, the class that handles queries from the frontend - and the Graph class, for which WordNet acts as a wrapper.
In order to support lookups by the words provided by the user, and to be able to use the results of those lookups (all the IDs associated with a word) to query the Graph, I implemented 2 HashMaps as instance properties of the WordNet class (see the sketch after this list):
A WordToIdsMap (a HashMap<String, HashSet<Integer>>): upon lookup it returns a HashSet of IDs associated with the requested word, so that we can look them up in the Graph.
An IdToWordsMap (a HashMap<Integer, HashSet<String>>): upon lookup it returns a HashSet of words associated with the requested ID, so that we can associate the results returned by querying the Graph with actual words.
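A minimal, self-contained sketch of that lookup flow using plain JDK maps instead of my wrapper classes. All names and the toy data here are illustrative only, and only direct hyponyms are followed; the real Graph walks the hierarchy transitively.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LookupFlowSketch {
    public static void main(String[] args) {
        Map<String, Set<Integer>> wordToIds = new HashMap<>();           // word -> synset ids
        Map<Integer, Set<String>> idToWords = new HashMap<>();           // synset id -> words
        Map<Integer, Set<Integer>> hypernymToHyponyms = new HashMap<>(); // adjacency list: id -> child ids

        // toy data: synset 1 = {jump, leap}, synset 2 = {increase}, edge 2 -> 1
        wordToIds.computeIfAbsent("jump", w -> new HashSet<>()).add(1);
        wordToIds.computeIfAbsent("leap", w -> new HashSet<>()).add(1);
        wordToIds.computeIfAbsent("increase", w -> new HashSet<>()).add(2);
        idToWords.put(1, Set.of("jump", "leap"));
        idToWords.put(2, Set.of("increase"));
        hypernymToHyponyms.computeIfAbsent(2, id -> new HashSet<>()).add(1);

        // lookup: word -> ids -> hyponym ids -> words
        Set<String> hyponyms = new HashSet<>();
        for (int id : wordToIds.getOrDefault("increase", Set.of())) {
            hyponyms.addAll(idToWords.get(id));                      // the hypernym's own words
            for (int childId : hypernymToHyponyms.getOrDefault(id, Set.of())) {
                hyponyms.addAll(idToWords.get(childId));             // words of its direct hyponyms
            }
        }
        System.out.println(hyponyms); // contains increase, jump, leap (unordered)
    }
}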
Challenge 2 - Handling List of Words
Description
More verbose description: https://fa22.datastructur.es/materials/proj/proj2b/#handling-lists-of-words
Based on the description of the Graph functionality in Challenge 1, it's not hard to understand how retrieving the hyponyms of a single hypernym works (check the WordNet.getHyponyms method for the exact implementation).
The challenge of handling a list of words lay in returning only the words that are hyponyms of all of the provided hypernyms. For example, if the user types “female, animal” in the words box, clicking “Hyponyms” should display [amazon, bird, cat, chick, dam, demoiselle, female, female_mammal, filly, hag, hen, nanny, nymph, siren], as these are all the words which are hyponyms of both female and animal.
Implementation
I implemented this in WordNet.getHyponymsIntersect, where I first call the getHyponyms method for each word on the list and save each result in a TreeSet. These TreeSets are stored in an array, which I iterate over, using retainAll to determine which words (hyponyms) appear in all TreeSets (a minimal sketch of this step follows below).
The result is then returned to the user.
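A minimal sketch of just the intersection step, with made-up hyponym sets standing in for the results of getHyponyms:
import java.util.List;
import java.util.TreeSet;

public class IntersectSketch {
    public static void main(String[] args) {
        // pretend these came from getHyponyms("female") and getHyponyms("animal")
        TreeSet<String> hyponymsOfFemale = new TreeSet<>(List.of("female", "filly", "hen", "woman"));
        TreeSet<String> hyponymsOfAnimal = new TreeSet<>(List.of("animal", "bird", "filly", "hen"));

        // start from the first set and keep only the words present in every other set
        TreeSet<String> intersection = new TreeSet<>(hyponymsOfFemale);
        intersection.retainAll(hyponymsOfAnimal);

        System.out.println(intersection); // [filly, hen] - sorted, since it's a TreeSet
    }
}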
Challenge 3 - Handling k > 0
Description
The methods implemented in the previous challenges returned as many "proper" results as possible. However, what if the user wanted to specify the maximum number of words they're interested in? In that case, I have to find a way to return the top k most popular words (possibly within a certain time period).
Implementation
To get the k most popular hyponyms of a hypernym (possibly a list of hypernyms), I implemented the getKPopularHyponyms method in the WordNet class.
getKPopularHyponyms uses a MaxPQ, built via WordNet.getWordWeightPQ, to sort the hyponyms by popularity (weight).
WordWeight is a public record with 2 props - a String word and a Double weight - that implements the Comparable interface; its compareTo uses the weight property for comparison.
getKPopularHyponyms is ultimately used to handle all requests (it is called in HyponymHandler.handle), be it one hypernym or a list of hypernyms, k = 0 or k > 0. The data flow is:
HyponymHandler.handle => WordNet.getKPopularHyponyms()
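The core of the top-k step, sketched in isolation with algs4's MaxPQ and a Comparable record. The words and weights here are made up; the real weights come from NGramMap, and my actual code drains the queue through its descending iterator rather than delMax, which yields the same ordering.
import edu.princeton.cs.algs4.MaxPQ;
import java.util.ArrayList;
import java.util.List;

public class TopKSketch {
    // same idea as the WordWeight record shown further below: ordered by weight
    record WeightedWord(String word, double weight) implements Comparable<WeightedWord> {
        public int compareTo(WeightedWord other) {
            return Double.compare(weight, other.weight);
        }
    }

    public static void main(String[] args) {
        MaxPQ<WeightedWord> pq = new MaxPQ<>();
        pq.insert(new WeightedWord("cat", 0.8));
        pq.insert(new WeightedWord("hen", 0.1));
        pq.insert(new WeightedWord("filly", 0.3));

        int k = 2;
        List<String> topK = new ArrayList<>();
        while (!pq.isEmpty() && topK.size() < k) {
            topK.add(pq.delMax().word()); // delMax pops the most popular remaining word
        }
        System.out.println(topK); // [cat, filly]
    }
}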
Summary
Those were the most challenging parts of the project, with the biggest one being the design of a proper structure for the Directed Acyclic Graph representing the relationships between hypernyms and hyponyms.
The only required part of the project was the HyponymHandler class and its handle method. The rest was to be decided by the students.
Extensive unit tests were provided to test the HyponymHandler handle method, and I also wrote some unit tests for the classes I implemented to verify their correctness.
Tests:
All tests can be found here: https://github.com/tomthestrom/cs61b/tree/master/proj2b/ngordnet/proj2b_testing. Additionally, I'm linking the specific unit test I implemented above the code of each class.
Class HyponymHandler
Extensive tests for this class had already been provided (see link above); the class itself can be found in the solution repository linked at the top.
Class WordNet
Unit test: https://github.com/tomthestrom/cs61b/blob/master/proj2b/ngordnet/proj2b_testing/WordNetTest.java
package ngordnet.main;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.MaxPQ;
import ngordnet.ngrams.NGramMap;
public class WordNet {
//wrapper for a graph
/**
* Keeps track of the ids associated with the word
*/
private WordToIdsMap wordToIdsMap;
/**
* Keeps track of the words associated with the id
*/
private IdToWordsMap idToWordsMap;
/**
* Keeps track of id to id connection hypernym -> hyponym
*/
private Graph graph;
public WordNet(String synsetsFileName, String hyponymsFileName) {
graph = new Graph();
wordToIdsMap = new WordToIdsMap();
idToWordsMap = new IdToWordsMap();
setMapsFromFile(synsetsFileName);
setGraphFromFile(hyponymsFileName);
}
/**
* Returns all the hyponyms of the hypernym, including the hypernym itself - in sorted order thanks to being a TreeSet
* @param hypernym
*/
public TreeSet<String> getHyponyms(String hypernym) {
HashSet<Integer> parentIds = wordToIdsMap.get(hypernym);
HashSet<Integer> childIds = new HashSet<>();
TreeSet<String> hyponyms = new TreeSet<>();
for (int parentId : parentIds) {
//find and get all child ids from the graph
childIds.addAll(graph.getAllChildren(parentId));
//add all words associated with the parent ids to hyponyms set
hyponyms.addAll(idToWordsMap.get(parentId));
}
for (int childId : childIds) {
//add all words associated with the found child ids to hyponyms set
hyponyms.addAll(idToWordsMap.get(childId));
}
return hyponyms;
}
/**
* Returns the hyponyms that are common between all requested hypernyms (sets intersection)
* @param hypernyms
*/
public TreeSet<String> getHyponymsIntersect(List<String> hypernyms) {
TreeSet<String> hyponyms;
TreeSet<String>[] hyponymsSets = new TreeSet[hypernyms.size()];
//create hypernyms.size() amount of hyponyms sets
for (int i = 0; i < hypernyms.size(); i++) {
hyponymsSets[i] = getHyponyms(hypernyms.get(i));
}
if (hyponymsSets[0] == null) {
hyponyms = new TreeSet<>();
} else {
//init hyponyms with the value of the 0th set
hyponyms = hyponymsSets[0];
}
//iterate over the rest of hyponyms sets - removing items not found in the other ones
for (int i = 1; i < hyponymsSets.length; i++) {
if (hyponymsSets[i] != null) {
hyponyms.retainAll(hyponymsSets[i]);
}
}
return hyponyms;
}
/**
* Returns k most popular hyponyms between years
*/
public TreeSet<String> getKPopularHyponyms(NGramMap ngramMap, List<String> hypernyms, int k, int startYear, int endYear) {
TreeSet<String> requestedHyponyms = getHyponymsIntersect(hypernyms);
if (k <= 0) {
return requestedHyponyms;
}
TreeSet<String> kPopularHyponyms = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
MaxPQ<WordWeight> maxPQ = getWordWeightPQ(ngramMap, requestedHyponyms, startYear, endYear);
Iterator<WordWeight> iterator = maxPQ.iterator();
int i = 0;
while (iterator.hasNext() && i < k) {
kPopularHyponyms.add(iterator.next().word());
i++;
}
return kPopularHyponyms;
}
private void addWordToMap(String word, Integer id) {
wordToIdsMap.addWordIdPair(word, id);
idToWordsMap.addIdWordPair(id, word);
}
private HashSet<Integer> getWordIdSet(String word) {
return wordToIdsMap.get(word);
}
private HashSet<String> getIdWordSet(int id) {
return idToWordsMap.get(id);
}
/**
* Sets both word to ids and ids to word mappings from the provided synsets.txt file
* @param fileName
*/
private void setMapsFromFile(String fileName) {
In in = new In(fileName);
ArrayList<String> line;
int id;
while (in.hasNextLine()) {
line = Stream.of(in.readLine().split(",")).collect(Collectors.toCollection(ArrayList<String>::new));
id = Integer.parseInt(line.get(0));
String[] words = (line.get(1)).trim().split("\\s+");
for (String word: words) {
addWordToMap(word, id);
}
}
}
/**
* Sets id to id mappings from the provided hyponyms.txt
* @param fileName
*/
private void setGraphFromFile(String fileName) {
In in = new In(fileName);
ArrayList<String> line;
int parentId;
while (in.hasNextLine()) {
line = Stream.of(in.readLine().split(",")).collect(Collectors.toCollection(ArrayList<String>::new));
parentId = Integer.parseInt(line.get(0));
for (int i = 1; i < line.size(); i++) {
int childId = Integer.parseInt(line.get(i));
graph.addNode(parentId, childId);
}
}
}
/**
* Returns a PQ of WordWeight objects for the given words - weight is based on their popularity in NGramMap
* between the given years
*/
private MaxPQ<WordWeight> getWordWeightPQ(NGramMap nGramMap, Collection<String> words, int startYear, int endYear) {
MaxPQ<WordWeight> maxPQ = new MaxPQ<>();
for (String word : words) {
/*
* summedWeightHistory only takes a list as its argument and returns
* a TimeSeries mapping year -> summed weight;
* data() returns those sums, and the stream adds them up into a single total weight
*/
double weight = nGramMap.summedWeightHistory(Collections.singletonList(word), startYear, endYear)
.data()
.stream()
.mapToDouble(Double::doubleValue).sum();
WordWeight wordWeight = new WordWeight(word, weight);
maxPQ.insert(wordWeight);
}
return maxPQ;
}
}
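A small, hypothetical usage example of the class above (WordNetDemo is not part of the project; the file paths are assumptions and need to point at wherever the downloaded data files live):
package ngordnet.main;

import java.util.List;

public class WordNetDemo {
    public static void main(String[] args) {
        // build the maps and the graph from the two data files
        WordNet wn = new WordNet("data/wordnet/synsets.txt", "data/wordnet/hyponyms.txt");

        // all hyponyms of a single hypernym, in sorted order
        System.out.println(wn.getHyponyms("change"));

        // only the words that are hyponyms of every hypernym in the list
        System.out.println(wn.getHyponymsIntersect(List.of("female", "animal")));
    }
}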
Class Graph
Unit Test: https://github.com/tomthestrom/cs61b/blob/master/proj2b/ngordnet/proj2b_testing/TestGraph.java
package ngordnet.main;
import java.util.HashSet;
public class Graph {
private IdToIdsMap idToIdsMap;
public Graph() {
idToIdsMap = new IdToIdsMap();
}
public void addNode(int parentId, int childId) {
idToIdsMap.addIdIdPair(parentId, childId);
}
public HashSet<Integer> getNode(int parentId) {
return idToIdsMap.get(parentId);
}
/**
* Returns all children of the parent, not including the parent
* @param parentId
* @return
*/
public HashSet<Integer> getAllChildren(int parentId) {
HashSet<Integer> children = new HashSet<>();
if (getNode(parentId) != null) {
for (int childId: getNode(parentId)) {
//the helper fills the children set in place (and returns that same set), so a plain call suffices
getAllChildrenHelper(childId, children);
}
}
return children;
}
/**
* Recursively collects all children
* @param parentId
* @param children
*/
private HashSet<Integer> getAllChildrenHelper(int parentId, HashSet<Integer> children) {
children.add(parentId);
if (getNode(parentId) != null) {
for (int childId: getNode(parentId)) {
getAllChildrenHelper(childId, children);
}
}
return children;
}
}
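A small usage example of the Graph class above (GraphDemo is a hypothetical name), mirroring the 11 -> 12, 11 -> 13 example from the file-format section:
package ngordnet.main;

public class GraphDemo {
    public static void main(String[] args) {
        Graph graph = new Graph();
        graph.addNode(11, 12);
        graph.addNode(11, 13);
        graph.addNode(12, 14); // a grandchild, to show that getAllChildren is transitive

        System.out.println(graph.getNode(11));        // direct children only, e.g. [12, 13]
        System.out.println(graph.getAllChildren(11)); // all descendants, e.g. [12, 13, 14] (HashSet order not guaranteed)
    }
}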
Class IdToIdsMap
No dedicated test; this class is used within Graph, whose proper functionality is tested.
package ngordnet.main;
import java.util.HashMap;
import java.util.HashSet;
public class IdToIdsMap extends HashMap<Integer, HashSet<Integer>> {
public IdToIdsMap() {
super();
}
public void addIdIdPair(int parentId, int childId) {
HashSet<Integer> childIdSet;
if (this.containsKey(parentId)) {
childIdSet = this.get(parentId);
childIdSet.add(childId);
return;
}
childIdSet = new HashSet<>();
childIdSet.add(childId);
this.put(parentId, childIdSet);
}
}
Class WordToIdsMap
package ngordnet.main;
import java.util.HashMap;
import java.util.HashSet;
/**
* LookUp of word associated with a set of ids (a word can have multiple ids, depending on its meaning in the WordNet)
*/
public class WordToIdsMap extends HashMap<String, HashSet<Integer>> {
public WordToIdsMap() {
super();
}
public void addWordIdPair(String word, Integer id) {
HashSet<Integer> idSet;
if (this.containsKey(word)) {
idSet = this.get(word);
idSet.add(id);
return;
}
idSet = new HashSet<>();
idSet.add(id);
this.put(word, idSet);
}
@Override
public HashSet<Integer> get(Object key) {
HashSet<Integer> set = super.get(key);
if (set == null) {
set = new HashSet<>();
}
return set;
}
}
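A quick illustration of why the get override matters (WordToIdsMapDemo is a hypothetical name; it assumes the class above): callers such as WordNet.getHyponyms can iterate over the result without a null check.
package ngordnet.main;

public class WordToIdsMapDemo {
    public static void main(String[] args) {
        WordToIdsMap map = new WordToIdsMap();
        map.addWordIdPair("jump", 1);
        map.addWordIdPair("jump", 5);

        System.out.println(map.get("jump"));    // both ids, e.g. [1, 5]
        System.out.println(map.get("missing")); // [] - an empty set instead of null
    }
}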
Class IdToWordsMap
package ngordnet.main;
import java.util.HashMap;
import java.util.HashSet;
public class IdToWordsMap extends HashMap<Integer, HashSet<String>> {
public IdToWordsMap() {
super();
}
public void addIdWordPair(Integer id, String word) {
HashSet<String> wordSet;
if (this.containsKey(id)) {
wordSet = this.get(id);
wordSet.add(word);
return;
}
wordSet = new HashSet<>();
wordSet.add(word);
this.put(id, wordSet);
}
}
Public record WordWeight
package ngordnet.main;
public record WordWeight(String word, Double weight) implements Comparable<WordWeight> {
@Override
public int compareTo(WordWeight wordWeight) {
return weight().compareTo(wordWeight.weight());
}
}