Journal Reflection #2: Insights into your first ML project

Please use your reply to this blog post to detail the following:

  1. A full description of the nature of your first ML project.
  2. Some insights into what predictions you were able to make on the dataset you used for your project.
  3. What new skills did you pick up from this project? For example, had you used Jupyter Notebooks before? Did you encounter any weird bugs, twists, or turns in your dataset that caused issues? How did you resolve those issues?
  4. What types of conclusions can you derive from the images/graphs you’ve created with your project? If you didn’t create charts or graphs and instead explored things like Markov Chains, how much work do you think you need to do to further refine your project to make its output more realistic?
  5. Did you create any formulas to examine or rate the data you parsed?

Take the time to look through the project posts of your classmates. You don’t have to comment on the posts of your classmates, but if you want to give them praise or just comment on their project if you found it cool, please do.


2 Responses to Journal Reflection #2: Insights into your first ML project

  1. Gil Mebane says:

    1) For my first ML project, I decided to make a sentiment analyzer to determine the sentiment of different Amazon reviews. I initially pursued this project hoping to eventually predict stock prices based on the sentiment of related tweets, but I quickly realized that sentiment analysis is a fairly involved process in its own right, so I decided my efforts were best directed toward first understanding how sentiment analysis works.

    I began by importing a CSV file containing 20,000 Amazon reviews and their respective sentiments. From there I ‘sanitized’/cleaned the data: I converted the words in each review into tokens, removed stop words (very common English words like is, and, but, or, etc. that do not inherently contribute to sentiment), lemmatized the tokens (stripping endings such as -ing and -ed to isolate the root of each word, so my model would not treat words like ‘boring’ and ‘bored’ as two separate terms), and collected the tokens into lists, with each list representing one review. I also converted the sentiments from int values (0 representing negative and 1 representing positive) to strings, tokenized that data, and added it to a list.

    Next I used the pre-made sentiment analysis tool from the NLTK library to get a baseline for the accuracy of non-AI sentiment analysis (about 80% in this case). This tool simply sums the sentiment of each word in a review by looking the words up in a dictionary of sentiment scores and then reports whether the review leans more positive or negative.

    Turning toward the AI side of things, I decided to use a BOW (bag of words) model in order to measure how each individual term contributes to sentiment across all 20,000 reviews (it looks at how terms contribute to the sentiment of the reviews as a whole, rather than using a dictionary to score an individual review). To implement the BOW model, I first split my data into a train and a test set (a 70/30 ratio), then split the train and test sets into their independent and dependent variables (independent = review, dependent = sentiment of the review). I then vectorized the words in each review, using an r'[a-zA-Z]+' token pattern to remove punctuation, stray symbols, and random strings of letters that weren’t actual words, and a min_df of 10 to drop outlier words that occurred fewer than 10 times across all 20,000 reviews and therefore would not signal sentiment reliably. Finally, I tried to convert these into a BOW model fit only on the train set, so that words the model had never seen would simply be ignored at test time instead of causing errors.

    Sadly, this did not go as intended: although the data was tokenized and in list form, it was still stored in a pandas DataFrame, whose columns are Series objects rather than plain lists. To resolve this issue, I converted the data set into a list of (review, sentiment) tuples. With that done, I was able to pass the data into the BOW model successfully, and I fit the resulting BOW features and their sentiments to a logistic regression model (a good fit here, since sentiment is a binary positive/negative outcome). All in all, I was left with a sentiment analysis model that could predict the sentiment of a review with 89% accuracy.
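
    Roughly, that pipeline looks like the sketch below. This is a simplified outline rather than my exact notebook: the file name reviews.csv and the column names review/sentiment are placeholders, the stop-word removal and lemmatization steps are omitted for brevity, the NLTK lexicon analyzer stands in for the pre-made baseline step, and the labels are kept as 0/1 ints.

        import pandas as pd
        from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon')
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        # Load the labeled reviews (placeholder file/column names).
        df = pd.read_csv("reviews.csv")  # columns assumed: 'review' (text), 'sentiment' (0/1)

        # Dictionary-based baseline: NLTK's built-in analyzer scores each review
        # without any training, giving a non-ML accuracy to beat.
        sia = SentimentIntensityAnalyzer()
        baseline_preds = [1 if sia.polarity_scores(text)["compound"] > 0 else 0
                          for text in df["review"]]
        baseline_acc = (pd.Series(baseline_preds) == df["sentiment"]).mean()
        print(f"Lexicon baseline accuracy: {baseline_acc:.2%}")

        # Convert the DataFrame columns (pandas Series) into plain Python lists,
        # the step that fixed the Series-versus-list error described above.
        reviews = df["review"].tolist()
        labels = df["sentiment"].tolist()

        # 70/30 train/test split.
        X_train, X_test, y_train, y_test = train_test_split(
            reviews, labels, test_size=0.3, random_state=42)

        # Bag-of-words features: keep alphabetic tokens only and drop words
        # seen fewer than 10 times across the corpus.
        vectorizer = CountVectorizer(token_pattern=r"[a-zA-Z]+", min_df=10)
        X_train_bow = vectorizer.fit_transform(X_train)  # vocabulary fit on train only
        X_test_bow = vectorizer.transform(X_test)        # unseen words are ignored here

        # Logistic regression on the BOW features (binary positive/negative outcome).
        model_lg = LogisticRegression(max_iter=1000)
        model_lg.fit(X_train_bow, y_train)
        print(f"BOW + logistic regression accuracy: {model_lg.score(X_test_bow, y_test):.2%}")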

    2) While I could certainly talk about which words I learned are more positive or negative in sentiment, I do not think that was the most surprising takeaway from this project. Instead, I was very surprised to learn that punctuation is essentially neutral in sentiment (which is why I removed punctuation from the reviews before training my model). This came as a surprise because I initially assumed that punctuation such as exclamation points would signal a much more positive sentiment; I failed to consider that exclamation points only signify strong sentiment, not positive or negative sentiment, since they can be (and were) used in both negative and positive reviews.

    3) There were many new skills that I picked up from this project. First and foremost, I learned what a BOW model is (essentially a matrix that records how many times each word occurs in a string, list, data frame, etc.) and how to build one; I learned what tokenization is (essentially assigning identifiers to things like words so the machine can work with them more easily) and how to perform it; I learned how to vectorize data and how vectorizing can also filter data (see my response to question 1); and in general I learned how sentiment analysis actually works (again, see question 1). Turning toward the debugging side of this project, I learned that the columns of a pandas DataFrame are Series objects even when they contain lists. From this, I also learned how to convert Series to lists and how to build lists of tuples in order to preserve both pieces of data from the original DataFrame.
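
    To make the Series-versus-list point concrete, here is a tiny made-up example of that conversion (placeholder rows and column names, not the real 20,000-review dataset):

        import pandas as pd

        # Tiny illustrative DataFrame (made-up rows).
        df = pd.DataFrame({
            "review": [["great", "product", "love"], ["bored", "waste", "money"]],
            "sentiment": ["positive", "negative"],
        })

        # Each column of a DataFrame is a pandas Series, not a plain list...
        print(type(df["review"]))  # <class 'pandas.core.series.Series'>

        # ...so zip the two columns into a list of (review, sentiment) tuples,
        # keeping both pieces of information from the original DataFrame together.
        labeled_reviews = list(zip(df["review"].tolist(), df["sentiment"].tolist()))
        print(labeled_reviews[0])  # (['great', 'product', 'love'], 'positive')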

    4) While I did not actually graph any of my results, I think one main way I could further refine my project would be to remove product names from the reviews. Although my model was still very accurate (89%), part of the remaining error most likely stemmed from multiple good or bad reviews of the same product: if a product had mostly positive reviews, any mention of its name might lead my model to predict that other reviews mentioning that name were also positive.

    5) While I didn’t exactly ‘create’ a formula to examine or rate the data I parsed, I did use the .score method of my sklearn.linear_model logistic regression model (model_lg.score) to test my model’s accuracy against the test data set. That method reported that my model was 89% accurate.
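
    For reference, scikit-learn’s .score on a classifier is just mean accuracy on the data you pass it. A self-contained toy illustration (made-up features, not the review data):

        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score

        # Toy data just to show what .score reports.
        X = [[0, 1], [1, 0], [1, 1], [0, 0]]
        y = [1, 0, 1, 0]

        model_lg = LogisticRegression().fit(X, y)

        # For classifiers, .score returns mean accuracy on the supplied data...
        print(model_lg.score(X, y))
        # ...which matches computing accuracy from the predictions directly.
        print(accuracy_score(y, model_lg.predict(X)))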

  2. Anand Jayashankar says:

    1) My first ML project used a linear regression algorithm to try to predict the rookie-year stat lines of top-5 picks in the NBA draft based on various metrics from their college stats and athleticism, paired with how good the team that drafted them is. I manually created a dataset of the top 5 picks from the last ten NBA drafts that incorporated a multitude of the aforementioned factors. I then used the scikit-learn library in PyCharm to split the dataset 70% for “training” and 30% for “testing”. From there, I was able to run the linear regression algorithm and get an accuracy score. When I tried to predict the “total” stats for NBA rookies (ppg+rpg+apg), my accuracy was relatively low (47%). However, when I split the dataset up to individually predict ppg, apg, and rpg, I got some higher percentages, especially for apg, which had an accuracy of over 77%.
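
    A rough sketch of that setup is below. The file and column names are placeholders standing in for my hand-built dataset, and it assumes scikit-learn’s LinearRegression with its .score method, which reports R^2 on the held-out picks:

        import pandas as pd
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split

        # Placeholder file/column names for the hand-built dataset of
        # college stats, athleticism metrics, and team strength.
        df = pd.read_csv("draft_top5_last10.csv")
        feature_cols = ["college_ppg", "college_rpg", "college_apg",
                        "athleticism", "team_win_pct"]
        X = df[feature_cols]
        y = df["rookie_ppg"] + df["rookie_rpg"] + df["rookie_apg"]  # "total" production

        # 70/30 train/test split.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42)

        model = LinearRegression().fit(X_train, y_train)

        # For regression, .score returns R^2 on the test set.
        print(f"R^2 on held-out picks: {model.score(X_test, y_test):.2f}")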

    2) The predictions I was able to make were simply the stat lines for rookies in the NBA. Something that stood out to me is that assist numbers appear to be relatively predictable from college to the NBA. When I thought about it further, this made sense: playmaking is an ability that often translates regardless of competition, while something like rebounding or scoring is often not as easily translatable when moving “up” levels in basketball.

    3) I had used Jupyter before, but just for the Iris project, which had directions for what to do at each step. This was the first time I used Jupyter “on my own”, which was a really cool experience. I didn’t have any major “bugs”; the biggest issue I had was that the accuracy score when predicting “total” production was relatively low. The way I went about solving this was making three separate linear regression models for apg, ppg, and rpg, rather than combining them all into one category of “total” production. This did help increase my accuracy percentages, but it also showed me that assists are more translatable than ppg or rpg.
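
    Splitting “total” production into three separate targets might look roughly like this (same placeholder file and column names as the sketch under question 1):

        import pandas as pd
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split

        df = pd.read_csv("draft_top5_last10.csv")  # placeholder file name
        feature_cols = ["college_ppg", "college_rpg", "college_apg",
                        "athleticism", "team_win_pct"]  # placeholder column names
        X = df[feature_cols]

        # One independent linear regression per stat, instead of a single "total" target.
        for target in ["rookie_ppg", "rookie_apg", "rookie_rpg"]:
            X_train, X_test, y_train, y_test = train_test_split(
                X, df[target], test_size=0.3, random_state=42)
            model = LinearRegression().fit(X_train, y_train)
            print(f"{target}: R^2 = {model.score(X_test, y_test):.2f}")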

    4) I didn’t actually graph any of my results. I think the biggest thing that would increase my accuracy, which I mentioned in the presentation, is taking into account a team’s returning production at the position a player was drafted to fill. For example, if a team drafts a PG but already has Steph Curry, it would make sense that the draftee’s rookie-year production would be relatively low compared to the same player being drafted to a team without a good returning PG. So, if I were trying to improve my model, I would add this as one of the factors in the independent variables of the linear regression algorithm.

    5) I didn’t really create any formula. Similar to Gil, I used the .score method of my sklearn.linear_model model(s) to figure out how accurate they were.
