This was a three-hour coding project that took a set of reviews from the Yelp website and used them as training data for a program that predicts the number of stars a review will give.
Each item in the dataset represents one review: its full text, the number of stars given, the business being reviewed, the reviewer, and more.
For each word that appeared in the reviews, we averaged the number of stars given by the reviews that used it. We then threw out any word whose average was too close to the middle (3 stars) and kept the rest.
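As a rough illustration of this word-scoring step, here is a minimal sketch. The function name `word_star_averages`, the tokenizing regex, the choice to count a word once per review, and the 0.5-star cutoff are all assumptions made for the example; the original project may have handled each of these differently.

```python
import re
from collections import defaultdict

def word_star_averages(reviews, neutral=3.0, min_distance=0.5):
    """Map each word to the mean star rating of the reviews that contain it.

    Words whose mean is closer than `min_distance` to `neutral` are dropped.
    Both thresholds are illustrative assumptions, not values from the project.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, stars in reviews:
        # Assumption: a word counts once per review, however often it repeats.
        for word in set(re.findall(r"[a-z']+", text.lower())):
            totals[word] += stars
            counts[word] += 1
    averages = {word: totals[word] / counts[word] for word in totals}
    # Keep only words that lean clearly positive or clearly negative.
    return {
        word: avg
        for word, avg in averages.items()
        if abs(avg - neutral) >= min_distance
    }

# Tiny made-up sample: (review text, stars)
reviews = [
    ("great food and great service", 5),
    ("terrible service", 1),
    ("food was okay", 3),
]
word_scores = word_star_averages(reviews)
```

In this toy sample "service" appears in a 5-star and a 1-star review, so it averages exactly 3 and is discarded, while "great" and "terrible" are kept.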
To predict the rating of a review, we first take the average rating given to that business and then raise or lower that value based on the words used in the review.
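The prediction step could look something like the sketch below. The text does not say exactly how much each word moved the rating, so the adjustment rule here (the mean offset of the scored words from 3 stars, scaled by a hypothetical `weight` and clamped to the 1 to 5 range) is an assumption, as are the function name `predict_stars` and the sample word scores.

```python
import re

def predict_stars(text, business_avg, word_scores, neutral=3.0, weight=0.5):
    """Start from the business's average rating and nudge it by the review's words.

    `word_scores` maps a word to its average star rating. The adjustment rule
    and the `weight` parameter are illustrative assumptions.
    """
    words = re.findall(r"[a-z']+", text.lower())
    # How far each scored word sits above or below the neutral 3 stars.
    offsets = [word_scores[w] - neutral for w in words if w in word_scores]
    if not offsets:
        # No informative words: fall back to the business average.
        return business_avg
    adjustment = weight * sum(offsets) / len(offsets)
    # Keep the result inside the valid star range.
    return min(5.0, max(1.0, business_avg + adjustment))

# Made-up word scores for the example
sample_scores = {"great": 4.5, "terrible": 1.5}
prediction = predict_stars("great tacos", 3.5, sample_scores)
```

A review with no scored words simply gets the business average back, which keeps the prediction sensible for short or unusual reviews.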
With this we were able to get a root mean squared error (RMSE) of around 1.079, which was an improvement on the naïve approach.
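For reference, root mean squared error is the square root of the average squared difference between predicted and actual stars, so an error of about 1.079 means predictions were typically off by roughly one star. A minimal sketch of the calculation follows; the function name `rmse` and the sample numbers are illustrative and not taken from the project.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions and true ratings
error = rmse([4.0, 2.0], [5.0, 1.0])
```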
You can view the code for this project here.