Project 2a: NGordNet (NGrams)
Prerequisites to running the code. Additionally, you need to download the project 2 datafiles.
Instructions: https://fa22.datastructur.es/materials/proj/proj2a/
Solution: https://github.com/tomthestrom/cs61b/tree/master/proj2a
About the project
The Google Ngram dataset provides many terabytes of information about the historical frequencies of all observed words and phrases in English (or more precisely all observed ngrams). Google provides the Google Ngram Viewer on the web , allowing users to visualize the relative historical popularity of words and phrases. For example, the link above plots the relative popularity of the phrases “global warming” (a 2gram) and “to the moon” (a 3gram).
In project 2a, I built a version of this tool that only handles 1grams. In other words, the code only handles individual words.
The following classes were required to be implemented:
used as a map for holding an (int) year => a (double) historical frequency
actually the most interesting/difficult part
I had to come up with my own design to support predefined queries from the frontend
supports querying by storing an underlying representation of distribution of words, their mentions (counts) over the years
a
HashMapwith year as a key and aTreeMapfor a value, theTreeMapholds words and their mentionsactually created a separate class (custom defined type) for the
TreeMapcalledWordsCountMapto avoid nested mapping.
Some additional code was written to hook these up with the frontend.
TimeSeries
Instructions
For this class, you will fill in the methods of the ngordnet/ngrams/TimeSeries.java file that we’ve provided in the skeleton.
Note: there are two constructors for this class, and you must complete them both.
A TimeSeries is a special purpose extension of the existing TreeMap class where the key type parameter is always Integer, and the value type parameter is always Double. Each key will correspond to a year, and each value a numerical data point for that year.
For example, the following code would create a TimeSeries and associate the number 3.6 with 1992 and 9.2 with 1993.
The TimeSeries class provides additional utility methods to the TreeMap class which it extends:
List<Integer> years(): Returns all years in the TimeSeries.List<Double> data(): Returns all data as aList<Double>.TimeSeries plus(TimeSeries x): Returns the yearwise sum of x and this.TimeSeries dividedBy(TimeSeries x): Returns the yearwise quotient of this and x.
Implementation
About
Not much to say, just some easy data manipulation. I think reading the instructions and seeing the code is quite self-explenatory.
Code
NGramMap
Instructions
The NGramMap has the following constructors and functions you’ll need to fill in for this part:
NGramMap(String wordsFilename, String countsFilename): The constructor for a NGramMap.TimeSeries countHistory(String word): Returns yearwise count of the given word for all available years.TimeSeries totalCountHistory(): Returns yearwise count of all words for all time. This data should come from the data in the file specified bycountsFilename, not from summing all words in the file given bywordsFilename.TimeSeries weightHistory(String word): Returns yearwise relative frequency (a.k.a. normalized count) of the given word for all time. For example, if there were 100,000 words across all volumes in 1575 and 2,100 occurrences of the word “guitar”, thenweightHistory("guitar")would have a value of 2100/100000 for the year 1575. If a word does not appear in a given year, that year should not be included in the TimeSeries.TimeSeries summedWeightHistory(Collection<String> words): Returns the yearwise sum of the relative frequencies ( a.k.a. normalized counts) for the given words for all time.Additionally, another version of
countHistory,weightHistory, andsummedWeightHistorythat take starting and ending year arguments.
Implementation
About
This class required a bit of a thought - it's basically the brain of the whole app, providing an interface to query the data. The construction takes longer - since I had to read the files in the constructor.
After that, to create a suitable backbone for querying this data structure for data, I decided to use a combination of a HashMap with years as its keys and a TreeMap as its value - HashMap<Integer, WordsCountMap> wordsInYears:
the Integer key is the year
the value -
WordsCountMapis aTreeMap<String, Integer>holds records of all words and their counts:
The
Stringvalue stands for the words itselfThe
Integer value stands for its count - the number of times it was mentioned
This way I was able to represent the years, the words that "occured" in them and the "count" - number times a certain word was mentioned.
Reasoning: every year occured just once - great value for a key. Every word is unique - another great value for a key, the number of times a word was mentioned might not be unique - great value for a value. :)
The rest of the code is more/less just processing the data based on different queries.
Code
HistoryTextHandler
Instructions
Full instructions: https://fa22.datastructur.es/materials/proj/proj2a/#historytexthandler
The point is to implement the handle method such that it returns data as a string for a given word(s) in the following format. In the example, the user requested words cat, dog, start year 2000, end year 2020:
Expected result for the query above:
Implementation
About
The NgramMap is passed into the constructor and upon calling handle, it's queried.
Code
HistoryHandler
Instructions
Full instructions: https://fa22.datastructur.es/materials/proj/proj2a/#historyhandler
Basically I was expected to return a String that was a base 64 encoded image representing a plot based on a user's query.
Implementation
About
NGramMap passed into the constructor and then queried based on the query passed into the handle method.
I used Plotter to create the plot.
Code
Last updated