Project 2a: NGordNet (NGrams)

Prerequisites to running the code. Additionally, you need to download the project 2 datafiles.

Instructions: https://fa22.datastructur.es/materials/proj/proj2a/

Solution: https://github.com/tomthestrom/cs61b/tree/master/proj2a

About the project

The Google Ngram dataset provides many terabytes of information about the historical frequencies of all observed words and phrases in English (or more precisely all observed ngrams). Google provides the Google Ngram Viewer on the web , allowing users to visualize the relative historical popularity of words and phrases. For example, the link above plots the relative popularity of the phrases “global warming” (a 2gram) and “to the moon” (a 3gram).

In project 2a, I built a version of this tool that only handles 1grams. In other words, the code only handles individual words.

The following classes were required to be implemented:

  • TimeSeries:

    • used as a map for holding an (int) year => a (double) historical frequency

  • NGramMap

    • actually the most interesting/difficult part

    • I had to come up with my own design to support predefined queries from the frontend

    • supports querying by storing an underlying representation of distribution of words, their mentions (counts) over the years

    • a HashMap with year as a key and a TreeMap for a value, the TreeMap holds words and their mentions

    • actually created a separate class (custom defined type) for the TreeMap called WordsCountMap to avoid nested mapping.

Some additional code was written to hook these up with the frontend.

TimeSeries

Instructions

For this class, you will fill in the methods of the ngordnet/ngrams/TimeSeries.java file that we’ve provided in the skeleton.

Note: there are two constructors for this class, and you must complete them both.

A TimeSeries is a special purpose extension of the existing TreeMap class where the key type parameter is always Integer, and the value type parameter is always Double. Each key will correspond to a year, and each value a numerical data point for that year.

For example, the following code would create a TimeSeries and associate the number 3.6 with 1992 and 9.2 with 1993.

The TimeSeries class provides additional utility methods to the TreeMap class which it extends:

  • List<Integer> years(): Returns all years in the TimeSeries.

  • List<Double> data(): Returns all data as a List<Double>.

  • TimeSeries plus(TimeSeries x): Returns the yearwise sum of x and this.

  • TimeSeries dividedBy(TimeSeries x): Returns the yearwise quotient of this and x.

Implementation

About

Not much to say, just some easy data manipulation. I think reading the instructions and seeing the code is quite self-explenatory.

Code

NGramMap

Instructions

The NGramMap has the following constructors and functions you’ll need to fill in for this part:

  • NGramMap(String wordsFilename, String countsFilename): The constructor for a NGramMap.

  • TimeSeries countHistory(String word): Returns yearwise count of the given word for all available years.

  • TimeSeries totalCountHistory(): Returns yearwise count of all words for all time. This data should come from the data in the file specified by countsFilename, not from summing all words in the file given by wordsFilename.

  • TimeSeries weightHistory(String word): Returns yearwise relative frequency (a.k.a. normalized count) of the given word for all time. For example, if there were 100,000 words across all volumes in 1575 and 2,100 occurrences of the word “guitar”, then weightHistory("guitar") would have a value of 2100/100000 for the year 1575. If a word does not appear in a given year, that year should not be included in the TimeSeries.

  • TimeSeries summedWeightHistory(Collection<String> words): Returns the yearwise sum of the relative frequencies ( a.k.a. normalized counts) for the given words for all time.

  • Additionally, another version of countHistory, weightHistory, and summedWeightHistory that take starting and ending year arguments.

Implementation

About

This class required a bit of a thought - it's basically the brain of the whole app, providing an interface to query the data. The construction takes longer - since I had to read the files in the constructor.

After that, to create a suitable backbone for querying this data structure for data, I decided to use a combination of a HashMap with years as its keys and a TreeMap as its value - HashMap<Integer, WordsCountMap> wordsInYears:

  • the Integer key is the year

  • the value - WordsCountMap is a TreeMap<String, Integer>

    • holds records of all words and their counts:

      • The String value stands for the words itself

      • The Integer value stands for its count - the number of times it was mentioned

  • This way I was able to represent the years, the words that "occured" in them and the "count" - number times a certain word was mentioned.

Reasoning: every year occured just once - great value for a key. Every word is unique - another great value for a key, the number of times a word was mentioned might not be unique - great value for a value. :)

The rest of the code is more/less just processing the data based on different queries.

Code

HistoryTextHandler

Instructions

Full instructions: https://fa22.datastructur.es/materials/proj/proj2a/#historytexthandler

The point is to implement the handle method such that it returns data as a string for a given word(s) in the following format. In the example, the user requested words cat, dog, start year 2000, end year 2020:

Expected result for the query above:

Implementation

About

The NgramMap is passed into the constructor and upon calling handle, it's queried.

Code

HistoryHandler

Instructions

Full instructions: https://fa22.datastructur.es/materials/proj/proj2a/#historyhandler

Basically I was expected to return a String that was a base 64 encoded image representing a plot based on a user's query.

Implementation

About

NGramMap passed into the constructor and then queried based on the query passed into the handle method.

I used Plotter to create the plot.

Code

Last updated