Project 2a: NGordNet (NGrams)

See the prerequisites for running the code. Additionally, you need to download the project 2 data files.

Instructions: https://fa22.datastructur.es/materials/proj/proj2a/

Solution: https://github.com/tomthestrom/cs61b/tree/master/proj2a

About the project

The Google Ngram dataset provides many terabytes of information about the historical frequencies of all observed words and phrases in English (or, more precisely, all observed ngrams). Google provides the Google Ngram Viewer on the web, allowing users to visualize the relative historical popularity of words and phrases. For example, the viewer can plot the relative popularity of the phrases “global warming” (a 2gram) and “to the moon” (a 3gram).

In project 2a, I built a version of this tool that only handles 1grams. In other words, the code only handles individual words.

The following classes had to be implemented:

TimeSeries
- used as a map from an (int) year to a (double) historical frequency
NGramMap
- the most interesting and most difficult part
- I had to come up with my own design to support the predefined queries from the frontend
- supports querying by storing, for each year, the words recorded in that year and their mentions (counts)
- implemented as a HashMap with the year as key and a TreeMap as value; the TreeMap maps each word to its count
- the TreeMap lives in a separate class (a custom type) called WordsCountMap, so the nested map type does not have to be spelled out everywhere

Some additional code was written to hook these up with the frontend.

TimeSeries

Instructions

For this class, you will fill in the methods of the ngordnet/ngrams/TimeSeries.java file that we’ve provided in the skeleton.

Note: there are two constructors for this class, and you must complete them both.

A TimeSeries is a special purpose extension of the existing TreeMap class where the key type parameter is always Integer, and the value type parameter is always Double. Each key will correspond to a year, and each value a numerical data point for that year.

For example, the following code would create a TimeSeries and associate the number 3.6 with 1992 and 9.2 with 1993.

TimeSeries ts = new TimeSeries();
ts.put(1992, 3.6);
ts.put(1993, 9.2);

The TimeSeries class provides additional utility methods to the TreeMap class which it extends:

List<Integer> years(): Returns all years in the TimeSeries.
List<Double> data(): Returns all data as a List<Double>.
TimeSeries plus(TimeSeries x): Returns the yearwise sum of x and this.
TimeSeries dividedBy(TimeSeries x): Returns the yearwise quotient of this and x.
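
To make the yearwise semantics of plus and dividedBy concrete, here is a minimal sketch on plain TreeMaps. The class name YearwiseDemo and the sample numbers are made up for illustration; the project's TimeSeries class is not used here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical demo class; works on plain TreeMaps, not on the project's TimeSeries. */
class YearwiseDemo {
    /** Yearwise sum: a year present in only one series keeps its single value. */
    static TreeMap<Integer, Double> plus(Map<Integer, Double> a, Map<Integer, Double> b) {
        TreeMap<Integer, Double> sum = new TreeMap<>(a);
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            // merge inserts the value if the year is new, otherwise adds it to the old one
            sum.merge(e.getKey(), e.getValue(), Double::sum);
        }
        return sum;
    }

    /** Yearwise quotient a / b: every year of a must exist in b, extra years in b are ignored. */
    static TreeMap<Integer, Double> dividedBy(Map<Integer, Double> a, Map<Integer, Double> b) {
        TreeMap<Integer, Double> quotient = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double divisor = b.get(e.getKey());
            if (divisor == null) {
                throw new IllegalArgumentException("missing year " + e.getKey());
            }
            quotient.put(e.getKey(), e.getValue() / divisor);
        }
        return quotient;
    }

    public static void main(String[] args) {
        // Made-up sample data
        TreeMap<Integer, Double> cats = new TreeMap<>(Map.of(1992, 3.0, 1993, 9.0));
        TreeMap<Integer, Double> dogs = new TreeMap<>(Map.of(1993, 1.0, 1994, 4.0));
        TreeMap<Integer, Double> totals = new TreeMap<>(Map.of(1992, 6.0, 1993, 18.0, 1994, 8.0));

        System.out.println(plus(cats, dogs));        // {1992=3.0, 1993=10.0, 1994=4.0}
        System.out.println(dividedBy(cats, totals)); // {1992=0.5, 1993=0.5}
    }
}
```

Neither method modifies its arguments, which mirrors the requirement that plus and dividedBy return a new TimeSeries.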

Implementation

About

Not much to say here, just some simple data manipulation. The instructions and the code are fairly self-explanatory.

Code

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TimeSeries extends TreeMap<Integer, Double> {
    /** Constructs a new empty TimeSeries. */
    public TimeSeries() {
        super();
    }

    /** Creates a copy of TS, but only between STARTYEAR and ENDYEAR,
     *  inclusive of both end points. */
    public TimeSeries(TimeSeries ts, int startYear, int endYear) {
        super();
        int i = startYear;

        while (i <= endYear) {
            if (ts.containsKey(i)) {
                this.put(i, ts.get(i));
            }
            i++;
        }
    }

    /** Returns all years for this TimeSeries (in any order). */
    public List<Integer> years() {
        List<Integer> allYears = new ArrayList<Integer>();

        for(Map.Entry<Integer,Double> entry : this.entrySet()) {
            Integer key = entry.getKey();
            allYears.add(key);
        }

        return allYears;
    }

    /** Returns all data for this TimeSeries (in any order).
     *  Must be in the same order as years(). */
    public List<Double> data() {
        List<Double> allData = new ArrayList<Double>();

        for(Map.Entry<Integer,Double> entry : this.entrySet()) {
            Double value = entry.getValue();
            allData.add(value);
        }

        return allData;
    }

    /** Returns the yearwise sum of this TimeSeries with the given TS. In other words, for
     *  each year, sum the data from this TimeSeries with the data from TS. Should return a
     *  new TimeSeries (does not modify this TimeSeries). */
    public TimeSeries plus(TimeSeries ts) {
        TimeSeries summedSeries = new TimeSeries();
        summedSeries.putAll(this);

        for(Map.Entry<Integer,Double> entry : ts.entrySet()) {
            Integer key = entry.getKey();
            Double value = entry.getValue();

            //if key exists, sum the value in summedSeries(copy of this) with the value in ts;
            if (summedSeries.containsKey(key)) {
                value = summedSeries.get(key) + value;
            }

            summedSeries.put(key, value);
        }

        return summedSeries;
    }

    /** Returns the quotient of the value for each year this TimeSeries divided by the
     *  value for the same year in TS. If TS is missing a year that exists in this TimeSeries,
     *  throw an IllegalArgumentException. If TS has a year that is not in this TimeSeries, ignore it.
     *  Should return a new TimeSeries (does not modify this TimeSeries). */
    public TimeSeries dividedBy(TimeSeries ts) {
        TimeSeries dividedSeries = new TimeSeries();
        dividedSeries.putAll(this);

        for (Map.Entry<Integer, Double> entry : dividedSeries.entrySet()) {
            Integer key = entry.getKey();

            if (!ts.containsKey(key)) {
                throw new IllegalArgumentException();
            }
            Double quotient = dividedSeries.get(key) / ts.get(key);

            dividedSeries.put(key, quotient);
        }

        return dividedSeries;
    }
}
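
The range-copy constructor above visits every year between startYear and endYear. Since a TimeSeries is a TreeMap, the same copy could probably also be expressed with the standard subMap method. The sketch below is an alternative, not the code used in the project; the class name RangeCopyDemo and the method copyRange are hypothetical, and it operates on a plain TreeMap.

```java
import java.util.TreeMap;

/** Hypothetical demo class showing a range copy via TreeMap.subMap. */
class RangeCopyDemo {
    /** Copies the entries of source whose keys lie in [startYear, endYear], both inclusive. */
    static TreeMap<Integer, Double> copyRange(TreeMap<Integer, Double> source,
                                              int startYear, int endYear) {
        // subMap returns a live view of source, so it is wrapped in a new TreeMap
        // to get an independent copy.
        return new TreeMap<>(source.subMap(startYear, true, endYear, true));
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> ts = new TreeMap<>();
        ts.put(1990, 1.0);
        ts.put(1992, 3.6);
        ts.put(1993, 9.2);
        ts.put(2000, 5.0);

        System.out.println(copyRange(ts, 1992, 1993)); // {1992=3.6, 1993=9.2}
    }
}
```

One behavioral difference to keep in mind: subMap throws an IllegalArgumentException when startYear is greater than endYear, whereas the loop in the constructor simply produces an empty series.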

NGramMap

Instructions

The NGramMap has the following constructors and functions you’ll need to fill in for this part:

NGramMap(String wordsFilename, String countsFilename): The constructor for a NGramMap.
TimeSeries countHistory(String word): Returns yearwise count of the given word for all available years.
TimeSeries totalCountHistory(): Returns yearwise count of all words for all time. This data should come from the data in the file specified by countsFilename, not from summing all words in the file given by wordsFilename.
TimeSeries weightHistory(String word): Returns yearwise relative frequency (a.k.a. normalized count) of the given word for all time. For example, if there were 100,000 words across all volumes in 1575 and 2,100 occurrences of the word “guitar”, then weightHistory("guitar") would have a value of 2100/100000 for the year 1575. If a word does not appear in a given year, that year should not be included in the TimeSeries.
TimeSeries summedWeightHistory(Collection<String> words): Returns the yearwise sum of the relative frequencies (a.k.a. normalized counts) for the given words for all time.
Additionally, there are overloads of countHistory, weightHistory, and summedWeightHistory that take starting and ending year arguments.
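
The relative frequency described above is a single division per year. A minimal worked sketch of the arithmetic follows; the class WeightDemo and its methods are hypothetical, and the count of 900 for a second word is a made-up value.

```java
/** Hypothetical demo class for the relative-frequency arithmetic. */
class WeightDemo {
    /** Relative frequency: occurrences of a word divided by all words recorded that year. */
    static double weight(double wordCount, double totalCount) {
        return wordCount / totalCount;
    }

    /** Summed weight of several words in one year: add the counts, then normalize once. */
    static double summedWeight(double[] wordCounts, double totalCount) {
        double sum = 0;
        for (double count : wordCounts) {
            sum += count;
        }
        return sum / totalCount;
    }

    public static void main(String[] args) {
        // Figures from the instructions: 2,100 occurrences of "guitar" among 100,000 words in 1575.
        System.out.println(weight(2100, 100000)); // 0.021

        // 900 is a made-up count for a hypothetical second word in the same year.
        System.out.println(summedWeight(new double[] {2100, 900}, 100000)); // 0.03
    }
}
```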

Implementation

About

This class required a bit of thought: it's basically the brain of the whole app, providing an interface to query the data. Construction takes a while, since both data files are read in the constructor.

To create a suitable backbone for querying the data, I decided to combine a HashMap keyed by year with a TreeMap as its value: HashMap<Integer, WordsCountMap> wordsInYears:

- the Integer key is the year
- the value, a WordsCountMap, is a TreeMap<String, Integer> that holds all words recorded in that year and their counts:
  - the String key is the word itself
  - the Integer value is its count, the number of times it was mentioned

This way I was able to represent the years, the words that occurred in them, and the count, i.e. the number of times a certain word was mentioned.

Reasoning: every year occurs just once, which makes it a great key. Every word is unique within a year, which makes it another great key. The number of times a word was mentioned might not be unique, which makes it a good fit for a value. :)

The rest of the code is more or less just processing the data for the different queries.
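
As a rough illustration of the year-to-words layout, here is a minimal sketch. WordsCountMap is assumed to simply extend TreeMap<String, Integer>; the class YearWordIndexDemo, its record and count methods, and the sample data are hypothetical and not part of the project.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical demo of the year -> (word -> count) layout. */
class YearWordIndexDemo {
    /** Stand-in for the project's WordsCountMap, assumed to be a thin TreeMap subclass. */
    static class WordsCountMap extends TreeMap<String, Integer> { }

    private final Map<Integer, WordsCountMap> wordsInYears = new HashMap<>();

    /** Stores the count of a word for a year, creating the year's map on first use. */
    void record(String word, int year, int count) {
        wordsInYears.computeIfAbsent(year, y -> new WordsCountMap()).put(word, count);
    }

    /** Returns the count of a word in a year, or 0 if the year or the word is unknown. */
    int count(String word, int year) {
        WordsCountMap words = wordsInYears.get(year);
        if (words == null) {
            return 0;
        }
        return words.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        YearWordIndexDemo index = new YearWordIndexDemo();
        // Made-up sample rows in the order word, year, count
        index.record("cat", 2000, 5);
        index.record("dog", 2000, 7);
        index.record("cat", 2001, 6);

        System.out.println(index.count("cat", 2000)); // 5
        System.out.println(index.count("dog", 2001)); // 0
    }
}
```

Looking up one word in one year costs a hash lookup for the year plus a tree lookup for the word, which is why range queries can simply walk the years one by one.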

Code

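// In is the input utility from the course library (assumed here to be the algs4 class)
import edu.princeton.cs.algs4.In;

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NGramMap {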
    private HashMap<Integer, WordsCountMap> wordsInYears;
    private TimeSeries counts;

    /**
     * Constructs an NGramMap from WORDSFILENAME and COUNTSFILENAME.
     */
    public NGramMap(String wordsFilename, String countsFilename) {
        wordsInYears = new HashMap<>();
        counts = new TimeSeries();
        setWordsInYearsFromFile(wordsFilename);
        setCountsFromFile(countsFilename);
    }

    /**
     * Provides the history of WORD. The returned TimeSeries should be a copy,
     * not a link to this NGramMap's TimeSeries. In other words, changes made
     * to the object returned by this function should not also affect the
     * NGramMap. This is also known as a "defensive copy".
     */
    public TimeSeries countHistory(String word) {
        TimeSeries counts = new TimeSeries();

        for (Map.Entry<Integer, WordsCountMap> entry : wordsInYears.entrySet()) {
            Integer year = entry.getKey();
            WordsCountMap wordsCountMap = entry.getValue();

            if (wordsCountMap.containsKey(word)) {
                double wordCount = wordsCountMap.get(word).doubleValue();
                counts.put(year, wordCount);
            }
        }
        return counts;
    }

    /**
     * Provides the history of WORD between STARTYEAR and ENDYEAR, inclusive of both ends. The
     * returned TimeSeries should be a copy, not a link to this NGramMap's TimeSeries. In other words,
     * changes made to the object returned by this function should not also affect the
     * NGramMap. This is also known as a "defensive copy".
     */
    public TimeSeries countHistory(String word, int startYear, int endYear) {
        TimeSeries counts = new TimeSeries();

        int currYear = startYear;

        while (currYear <= endYear) {
            if (yearRecordExists(currYear) && yearContainsWord(currYear, word)) {
                double wordCount = wordCountInYear(currYear, word);
                counts.put(currYear, wordCount);
            }
            currYear++;
        }
        return counts;
    }

    /**
     * Returns a defensive copy of the total number of words recorded per year in all volumes.
     */
    public TimeSeries totalCountHistory() {
        TimeSeries totalCount = new TimeSeries();

        for (Map.Entry<Integer, Double> entry : counts.entrySet()) {
            Integer year = entry.getKey();
            Double wordsPerYear = entry.getValue();

            totalCount.put(year, wordsPerYear);
        }
        return totalCount;
    }

    /**
     * Provides a TimeSeries containing the relative frequency per year of WORD compared to
     * all words recorded in that year.
     */
    public TimeSeries weightHistory(String word) {
        TimeSeries relativeFreq = new TimeSeries();

        for (Map.Entry<Integer, Double> entry : counts.entrySet()) {
            Integer year = entry.getKey();

            if (yearRecordExists(year) && yearContainsWord(year, word)) {
                double wordFrequency = wordCountInYear(year, word) / totalWordsInYear(year);
                relativeFreq.put(year, wordFrequency);
            }
        }

        return relativeFreq;
    }

    /**
     * Provides a TimeSeries containing the relative frequency per year of WORD between STARTYEAR
     * and ENDYEAR, inclusive of both ends.
     */
    public TimeSeries weightHistory(String word, int startYear, int endYear) {
        TimeSeries relativeFreq = new TimeSeries();
        int currYear = startYear;

        while (currYear <= endYear) {
            if (yearRecordExists(currYear) && yearContainsWord(currYear, word)) {
                double wordFrequency = wordCountInYear(currYear, word) / totalWordsInYear(currYear);
                relativeFreq.put(currYear, wordFrequency);
            }

            currYear++;
        }

        return relativeFreq;
    }

    /**
     * Returns the summed relative frequency per year of all words in WORDS.
     */
    public TimeSeries summedWeightHistory(Collection<String> words) {
        TimeSeries summedFreq = new TimeSeries();

        for (Map.Entry<Integer, WordsCountMap> entry : wordsInYears.entrySet()) {
            Integer year = entry.getKey();
            WordsCountMap wordsCountMap = entry.getValue();
            double summedWordCount = 0;

            for (String word : words) {
                if (wordsCountMap.containsKey(word)) {
                    summedWordCount += wordCountInYear(year, word);
                }
            }

            if (summedWordCount > 0) {
                summedFreq.put(year, summedWordCount / totalWordsInYear(year));
            }
        }

        return summedFreq;
    }

    /**
     * Provides the summed relative frequency per year of all words in WORDS
     * between STARTYEAR and ENDYEAR, inclusive of both ends. If a word does not exist in
     * this time frame, ignore it rather than throwing an exception.
     */
    public TimeSeries summedWeightHistory(Collection<String> words,
                                          int startYear, int endYear) {

        TimeSeries summedFreq = new TimeSeries();
        int currYear = startYear;

        while (currYear <= endYear) {
            double summedWordCount = 0;

            if (yearRecordExists(currYear)) {
                for (String word : words) {
                    if (yearContainsWord(currYear, word)) {
                        summedWordCount += wordCountInYear(currYear, word);
                    }
                }
            }

            if (summedWordCount > 0) {
                summedFreq.put(currYear, summedWordCount / totalWordsInYear(currYear));
            }

            currYear++;
        }
        return summedFreq;
    }

    /**
     * Sets up the HashMap wordsInYears
     */
    private void setWordsInYearsFromFile(String fileName) {
        WordsCountMap wordsCountMap;
        In in = new In(fileName);
        String word;
        int year;
        int wordCount;

        while (in.hasNextLine()) {
            word = in.readString();
            year = in.readInt();
            wordCount = in.readInt();

            //if no entry for year - create a new entry with key as year and value as a blank TreeMap
            if (!wordsInYears.containsKey(year)) {
                wordsCountMap = new WordsCountMap();
                wordsInYears.put(year, wordsCountMap);
            }

            //get the TreeMap for the corresponding year
            wordsCountMap = wordsInYears.get(year);
            //save the word for the current year with its count
            wordsCountMap.put(word, wordCount);


            //read the rest of the line - all the data we need is in the 1st 3 cols
            in.readLine();
        }
    }

    /**
     * Sets up the TimeSeries counts
     */
    private void setCountsFromFile(String fileName) {
        In in = new In(fileName);
        ArrayList<String> line;
        int year;
        double wordCount;

        while (in.hasNextLine()) {
            //unfortunately In utility class doesn't contain a nice method to spec separator, gotta read file like this
            line = Stream.of(in.readLine().split(",")).collect(Collectors.toCollection(ArrayList<String>::new));
            year =  Integer.parseInt(line.get(0));
            wordCount = Double.parseDouble(line.get(1));

            counts.put(year, wordCount);
        }
    }

    private boolean yearContainsWord(int year, String word) {
        return wordsInYears.get(year).containsKey(word);
    }

    private boolean yearRecordExists(int year) {
        return wordsInYears.containsKey(year);
    }

    private double wordCountInYear(int year, String word) {
        return wordsInYears.get(year).get(word).doubleValue();
    }

    private double totalWordsInYear(int year) {
        // check counts itself: a year present in the words file may be missing from the counts file
        if (counts.containsKey(year)) {
            return counts.get(year);
        }

        return 0;
    }
}
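
setCountsFromFile above splits each line on commas and uses only the first two fields. A minimal standalone sketch of that parsing step follows; the class CountsLineDemo, the method parseCountsLine, and the sample line are made up for illustration, and any columns after the second are simply ignored.

```java
import java.util.Map;

/** Hypothetical demo of parsing one comma-separated line of a counts file. */
class CountsLineDemo {
    /** Parses "year,total,..." into a (year, total) pair; later columns are ignored. */
    static Map.Entry<Integer, Double> parseCountsLine(String line) {
        String[] fields = line.split(",");
        int year = Integer.parseInt(fields[0].trim());
        double total = Double.parseDouble(fields[1].trim());
        return Map.entry(year, total);
    }

    public static void main(String[] args) {
        // Made-up sample line: year, total word count, then two columns that are not used
        Map.Entry<Integer, Double> parsed = parseCountsLine("2000,1500,12,3");
        System.out.println(parsed.getKey());   // 2000
        System.out.println(parsed.getValue()); // 1500.0
    }
}
```

Splitting into a plain String[] is enough here; collecting the fields into an ArrayList, as the method above does, gives the same result.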

HistoryTextHandler

Instructions

Full instructions: https://fa22.datastructur.es/materials/proj/proj2a/#historytexthandler

The point is to implement the handle method so that it returns the data for the given word(s) as a string in the format shown below. In the example, the user requested the words cat and dog, with start year 2000 and end year 2020.

Expected result for the query above:

 cat: {2000=1.71568475416827E-5, 2001=1.6120939684412677E-5, 2002=1.61742010630623E-5, 2003=1.703155141714967E-5, 2004=1.7418408946715716E-5, 2005=1.8042211615010028E-5, 2006=1.8126126955841936E-5, 2007=1.9411504094739293E-5, 2008=1.9999492186117545E-5, 2009=2.1599428349729816E-5, 2010=2.1712564894218663E-5, 2011=2.4857238078766228E-5, 2012=2.4198586699546612E-5, 2013=2.3131865569578688E-5, 2014=2.5344693375481996E-5, 2015=2.5237182007765998E-5, 2016=2.3157514119191215E-5, 2017=2.482102172595473E-5, 2018=2.3556758130732888E-5, 2019=2.4581322086049953E-5}
   dog: {2000=3.127517699525712E-5, 2001=2.99511426723737E-5, 2002=3.0283458650225453E-5, 2003=3.1470761877596034E-5, 2004=3.2890514515432536E-5, 2005=3.753038415155302E-5, 2006=3.74430614362125E-5, 2007=3.987077208249744E-5, 2008=4.267197824115907E-5, 2009=4.81026086549733E-5, 2010=5.30567576173992E-5, 2011=6.048536820577008E-5, 2012=5.179787485962082E-5, 2013=5.0225599367200654E-5, 2014=5.5575537540090384E-5, 2015=5.44261096781696E-5, 2016=4.805214145459557E-5, 2017=5.4171157785607686E-5, 2018=5.206751570646653E-5, 2019=5.5807040409297486E-5}

Implementation

About

The NGramMap is passed into the constructor and queried whenever handle is called.
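
The "word: {year=value, ...}" shape of the expected result matches what a TreeMap prints through its toString method. A minimal sketch of the formatting step, independent of the course classes, could look like this; HistoryTextDemo, its format method, and the sample values are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical demo of building the text response, one line per word. */
class HistoryTextDemo {
    static String format(List<String> words, Map<String, TreeMap<Integer, Double>> histories) {
        StringBuilder response = new StringBuilder();
        for (String word : words) {
            // A word without any history is printed with an empty map: "word: {}"
            TreeMap<Integer, Double> history = histories.getOrDefault(word, new TreeMap<>());
            response.append(word)
                    .append(": ")
                    .append(history) // TreeMap.toString() yields {2000=1.5, 2001=2.0}
                    .append(System.lineSeparator());
        }
        return response.toString();
    }

    public static void main(String[] args) {
        // Made-up sample histories
        TreeMap<Integer, Double> cat = new TreeMap<>(Map.of(2000, 1.5, 2001, 2.0));
        TreeMap<Integer, Double> dog = new TreeMap<>(Map.of(2000, 3.0));

        System.out.print(format(List.of("cat", "dog"), Map.of("cat", cat, "dog", dog)));
        // cat: {2000=1.5, 2001=2.0}
        // dog: {2000=3.0}
    }
}
```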

Code

/**
 * Handles a request in format <List> words, startYear, endYear and returns a string representation with the word
 * in front and timeseries y=usage count per each year between startYear and endYear
 */
public class HistoryTextHandler extends NgordnetQueryHandler {
    private NGramMap ngm;
    public HistoryTextHandler(NGramMap ngm) {
       this.ngm = ngm;
    }


    @Override
    public String handle(NgordnetQuery q) {
        List<String> words = q.words();
        int startYear = q.startYear();
        int endYear = q.endYear();


        StringBuilder response = new StringBuilder();

        for (String word : words) {
            response.append(word);
            response.append(": ");
            response.append(ngm.countHistory(word, startYear, endYear));
            response.append(System.getProperty("line.separator"));
        }

        return response.toString();
    }
}

HistoryHandler

Instructions

Full instructions: https://fa22.datastructur.es/materials/proj/proj2a/#historyhandler

Basically, I was expected to return a String containing a Base64-encoded image of a plot based on the user's query.

Implementation

About

The NGramMap is passed into the constructor and then queried based on the query passed into the handle method.

I used Plotter to create the plot.
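
How Plotter encodes the chart internally is not shown here. As a rough idea of what "an image as a Base64 string" means, the sketch below encodes a blank image with the standard library only. It is an assumption about the general technique, not a description of Plotter; Base64ImageDemo and encodeAsBase64Png are hypothetical names.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

/** Hypothetical demo: turn an in-memory image into a Base64 string. */
class Base64ImageDemo {
    static String encodeAsBase64Png(BufferedImage image) {
        try {
            // Write the image as PNG into memory, then Base64-encode the raw bytes.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ImageIO.write(image, "png", bytes);
            return Base64.getEncoder().encodeToString(bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A tiny blank image stands in for the rendered chart.
        BufferedImage blank = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        String encoded = encodeAsBase64Png(blank);

        // Decoding gives back the PNG bytes, which start with 0x89 'P' 'N' 'G'.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println((char) decoded[1] + "" + (char) decoded[2] + (char) decoded[3]); // PNG
    }
}
```

A string like this can be sent to a browser as plain text and displayed there as an image, which is presumably why the handler returns a String rather than raw bytes.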

Code

/**
 * Handles a request - NgordnetQuery in format <List> words, startYear, endYear
 *  returns an encoded plot for each word that has any history
 *  if no data for any word - return an empty plot
 */
public class HistoryHandler extends NgordnetQueryHandler {
    private NGramMap ngm;
    public HistoryHandler(NGramMap ngm) {
        this.ngm = ngm;
    }

    /**
     * Returns an encoded plot with x, y axis
     */
    @Override
    public String handle(NgordnetQuery q) {
        List<String> words = q.words();
        ArrayList<String> wordsToPlot = new ArrayList<>();

        ArrayList<TimeSeries> wordsHistory = new ArrayList<>();
        int startYear = q.startYear();
        int endYear = q.endYear();

        for (String word : words) {
            TimeSeries countHistory = ngm.countHistory(word, startYear, endYear);

            //make sure both lists are the same length - only add word if it has any history - can't display no data
            if (!countHistory.isEmpty()) {
                wordsToPlot.add(word);
                wordsHistory.add(countHistory);
            }
        }

        XYChart chart = Plotter.generateTimeSeriesChart(wordsToPlot, wordsHistory);

        return Plotter.encodeChartAsString(chart);
    }
}

