R Package: uniqueread

GitHub: https://github.com/christilly/uniqueread

Goals of the package

The goal of the uniqueread package is to analyze Goodreads data and give insight into how a Goodreads user rates the books they read. The package computes a residual for each book by subtracting the book's average Goodreads rating from the user's rating. The output includes the average residual (whether the user tends to rate books higher or lower than the Goodreads average), the standard deviation of the residuals (how widely the user's ratings vary relative to the average), the title of the book with the maximum residual (the book the user rated furthest above the average Goodreads rating, i.e., the book they loved more than average), and the title of the book with the minimum residual (the book the user rated furthest below the average Goodreads rating, i.e., the book they hated more than average). Lastly, the package displays a histogram of the residuals. Books the user scored higher than the average Goodreads rating fall to the right of 0, while books the user scored lower than the average fall to the left of 0.

How to use the package and functions

It is beneficial to read through the document in this link to learn how to use the package. This provides a rundown of how to use the code and what it accomplishes. I will also describe the purpose of each function from the package below.

read_GR_data

The purpose of this function is to load in the Goodreads data from the user into a data frame and remove unnecessary columns from the data for further processing. I made it so that if a user uploads a file that cannot be found, the code throws an error message and encourages the user to double-check the file path. If it uploads successfully, the program will tell the user that it worked and what file they uploaded.

I made a list of columns to remove from the initial data frame and removed them with -any_of(), which drops the listed columns but still lets the program run if some of them are not found in the original data file. I removed these columns so that unnecessary data is not analyzed and to keep the results simple to inspect while I created and tested my package.
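To illustrate the loading step described above, here is a minimal sketch. It is not the package's actual source: the function name read_gr_sketch and the list of dropped columns are assumptions made for this example.

```r
library(dplyr)

# Hypothetical sketch of the loading step; name and column list are assumptions.
read_gr_sketch <- function(path) {
  # Stop early with a helpful message if the file cannot be found
  if (!file.exists(path)) {
    stop("File not found: ", path, ". Please double-check the file path.")
  }
  data <- read.csv(path, stringsAsFactors = FALSE)
  message("Successfully loaded: ", path)

  # Illustrative columns to drop; any_of() silently skips names that are absent
  drop_cols <- c("ISBN", "ISBN13", "Publisher", "Binding")
  select(data, -any_of(drop_cols))
}

# Usage with a small temporary file that has only some of the listed columns
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(Title = c("A", "B"), ISBN = c("1", "2"),
             Publisher = c("X", "Y"), My.Rating = c(4, 0)),
  tmp, row.names = FALSE
)
books <- read_gr_sketch(tmp)
names(books)
```

Because any_of() tolerates missing names, the call still works even though ISBN13 and Binding are not in the example file.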

clean_GR_ratings

The main goal of clean_GR_ratings is to remove any books that have a 0-star rating. The lowest rating a user can give a book on Goodreads is 1 star, so books with 0 stars are unread or unrated by the user and should not be used in the analysis.

I used the dplyr package because it is useful in editing data frames. My process was to change ratings of 0 to NA, and then remove data with NA values in the My.Rating column. The cleaned data can then be placed into a new data frame.
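The two-step process described above (turn 0 into NA, then drop the NA rows) could look roughly like the sketch below. The function name clean_sketch is an assumption for illustration, and the real function may differ in its details.

```r
library(dplyr)

# Hypothetical sketch of the cleaning step; the function name is an assumption.
clean_sketch <- function(data) {
  data %>%
    mutate(My.Rating = na_if(My.Rating, 0)) %>%  # 0 stars means unrated, so mark as NA
    filter(!is.na(My.Rating))                    # keep only books the user rated
}

books <- data.frame(
  Title = c("A", "B", "C"),
  My.Rating = c(5, 0, 3),
  stringsAsFactors = FALSE
)
cleaned <- clean_sketch(books)
cleaned
```

In this example, book B has a rating of 0 and is removed, leaving books A and C.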

calculate_residuals

The calculate_residuals function creates a new column in the data that holds the residuals. Each residual is calculated by subtracting the average rating from the user rating for each book left in the data frame.

Functions from dplyr were used to create a new column in the data frame with the residuals data.
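As a small worked example of the residual calculation, the sketch below uses dplyr's mutate(). The function name residual_sketch and the output column name Residual are assumptions, and Average.Rating is assumed to be the name of the average rating column after the export is read into R.

```r
library(dplyr)

# Hypothetical sketch; function and column names other than My.Rating are assumptions.
residual_sketch <- function(data) {
  # Residual = user's rating minus the average Goodreads rating
  mutate(data, Residual = My.Rating - Average.Rating)
}

books <- data.frame(
  Title = c("A", "B"),
  My.Rating = c(5, 2),
  Average.Rating = c(4.25, 3.5),
  stringsAsFactors = FALSE
)
with_res <- residual_sketch(books)
with_res$Residual
```

Here book A gets a residual of 5 - 4.25 = 0.75 (rated above average) and book B gets 2 - 3.5 = -1.5 (rated below average).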

res_data

This function summarizes the residuals with their average, standard deviation, highest positive value, and lowest negative value. The average residual and standard deviation show how the user scored books compared to the average ratings and how widely spread their ratings are. The highest residual identifies the book the user rated highest compared to the average, and the lowest residual identifies the book the user rated lowest compared to the average. The associated titles are found using which().

The goal here is to give insight into the user's ratings. The highest residual can also be seen as finding the book the user loved more than others, while the lowest residual can be seen as finding the book the user hated more than others.
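The summary described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the function name summary_sketch, the Residual column name, and the returned list structure are all hypothetical.

```r
# Hypothetical sketch of the summary step; names and return structure are assumptions.
summary_sketch <- function(data) {
  res <- data$Residual
  list(
    mean_residual = mean(res),  # above 0: the user rates higher than average overall
    sd_residual   = sd(res),    # how spread out the residuals are
    most_loved    = data$Title[which(res == max(res))],  # title(s) with the highest residual
    most_hated    = data$Title[which(res == min(res))]   # title(s) with the lowest residual
  )
}

books <- data.frame(
  Title = c("A", "B", "C"),
  Residual = c(0.75, -1.5, 0.25),
  stringsAsFactors = FALSE
)
out <- summary_sketch(books)
out$most_loved
out$most_hated
```

Using which() with an equality test returns every title that ties for the maximum or minimum, rather than only the first one.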

dist_hist

This function creates a visual display of how often a user rates the books they read above or below average. Books the user tended to score higher than the average Goodreads rating are on the right of the 0, while books the user tended to score lower than the average Goodreads rating are on the left of the 0.

I used ggplot2 with theme_classic() to create a clean and neat histogram. The line intersecting x=0 is a visual aid to show where user ratings were above and below average.
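A histogram like the one described above could be built along these lines. The function name hist_sketch, the Residual column name, the bin width, and the axis labels are assumptions chosen for this sketch, not the package's actual settings.

```r
library(ggplot2)

# Hypothetical sketch of the plotting step; names and styling are assumptions.
hist_sketch <- function(data) {
  ggplot(data, aes(x = Residual)) +
    geom_histogram(binwidth = 0.5, fill = "grey70", color = "black") +
    geom_vline(xintercept = 0, linetype = "dashed") +  # visual aid at x = 0
    labs(x = "User rating minus average rating", y = "Number of books") +
    theme_classic()
}

p <- hist_sketch(data.frame(Residual = c(0.75, -1.5, 0.25, 1.1, -0.3)))
p
```

Bars to the right of the dashed line are books rated above the Goodreads average, and bars to the left are books rated below it.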

Important files in the package

R/uniqueread.R - File with R code

man/ - Folder with a description of each function including name, function, and arguments

tests - Folder for the tests written during package development; I tested clean_GR_ratings

vignettes - Folder where uniqueread.RMD is held

DESCRIPTION - File explaining details of package including author, version, license, and recommended imports

LICENSE - Explains licensing; I chose the GNU General Public License, version 3 (GPL-3)

NAMESPACE - File listing the functions the package exports and the functions it imports from other packages

README.md - File explaining the goals of the package and how to use it

sample_export.csv - Example data with which to test the package

Development process

Generating an idea

My first step in the package development process was to generate an idea. I wanted to do something that would be fun and interesting to work on. I chose to focus on Goodreads because I love reading and have used the website for a long time, and I found that it was really easy to download my data. I then had a few ideas of how to find interesting information from the data. At first, I wanted to make a program that analyzes how long a person takes to read books of different lengths, but I quickly realized that the Goodreads export file does not show when a person started reading a book. Therefore, I shifted towards the uniqueread idea of analyzing a person's book ratings. Dr. Friedman was an excellent resource in helping me visualize what kind of analysis I could do.

Planning the package

I started planning the package by creating a rough outline of how I wanted the program to progress. I wanted to upload the data into a data frame, clean the data, calculate residuals between user and average rating, and do analysis on the residuals. I started by making simple code to accomplish these tasks, learning new and important function calls along the way. I did not make any of my code into functions yet and simply tested if my idea would work. I downloaded my own Goodreads data and ran analysis on it. Then I created the sample file to play around with as well.

Creating and editing functions

Once I was convinced my code would work, I transitioned the rough code I started with into formal functions. I tested them along the way to make sure there were no errors and the code accomplished what I wanted it to. I created the descriptions of each function and used roxygen and devtools to update all of my documentation while I worked. Whenever I was done with a session, I uploaded my work to GitHub and gave brief descriptions of what I did that day.

Package development

During package development, I created a document for myself where I kept all of the most useful devtools function calls and their purpose. I used devtools::load_all() to load my package and see if the functions work whenever I edited them. I used devtools::document() to automatically update the documentation when I changed the functions. Closer to when I was finishing the package, I used devtools::check() to ensure there were no errors when running the package and to test it. I used devtools::use_testthat() to create the test files, and I created a test for clean_GR_ratings which passed the devtools::check() call.
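For readers who have not written a package test before, a testthat test for a cleaning function might look like the sketch below. This is not the actual test from the package: clean_sketch is a hypothetical base R stand-in for clean_GR_ratings, defined here only so the example runs on its own.

```r
library(testthat)

# Hypothetical stand-in for the cleaning function, so the example is self-contained.
clean_sketch <- function(data) {
  data[data$My.Rating != 0, , drop = FALSE]
}

test_that("books with a 0-star rating are removed", {
  books <- data.frame(
    Title = c("A", "B", "C"),
    My.Rating = c(5, 0, 3),
    stringsAsFactors = FALSE
  )
  cleaned <- clean_sketch(books)

  expect_equal(nrow(cleaned), 2)            # one unrated book dropped
  expect_true(all(cleaned$My.Rating > 0))   # no 0-star rows remain
})
```

A test file like this would typically live under tests/testthat/ so that devtools::check() runs it along with the other package checks.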

I created the RMD file as part of an assignment for this class and made it as detailed as possible so that people using the package can clearly follow a guideline of how to use it.

I used a lot of outside resources to learn how to use devtools, how to organize a package and the kinds of files it needs, how to format functions to be read by roxygen, and how to use functions specific to the kind of analysis I was doing.

Role of AI

I made my rough code and transitioned it into functions using the knowledge I gained from this class, previous assignments, and outside resources. When there were errors that I could not figure out on my own, I used ChatGPT to help me identify and fix them. I also used ChatGPT as a tool to tell me whether my finished functions were correctly formatted and whether there was anything else I should consider. For instance, ChatGPT advised me that calls to functions from another package need to name that package to avoid errors (such as by writing ggplot2:: followed by the specific call).

What I learned

"Anyone who has never made a mistake has never tried anything new." - Albert Einstein

While creating the package, and throughout the entire semester, I learned a lot about R and programming in general. I previously took a coding class in Python as an undergraduate student, so I had a background in how programming languages generally operate. At first, I was sure nothing could get me to like R more than Python, but I have since learned how easy R is to use, especially when it comes to data analysis, and I have already used it outside of class for data analysis in research. Therefore, most importantly, I learned the usefulness of R and how to apply it to my own life.

Another very valuable lesson I learned is how to use GitHub. I learned how to edit code on my computer and have it uploaded to GitHub so I can show my workflow in real time. It has always been clear to me that knowing how to use GitHub is extremely important when working as a programmer, so I was glad to spend time learning how to use it this semester.

Lastly, I learned a lot about what packages are, how they are created, their contents, and how to use them. It is great to learn the basics behind a programming language, but it is hard to truly maximize the value of the code without being able to use packages. I am especially grateful to have learned very useful R packages, including dplyr and ggplot2, as I have used these outside of the class as well.

In reference to the quote, I failed many times during this semester by getting a bunch of errors from my code or not understanding how to do an assignment. However, each time I made a mistake, I learned something new. This class reminded me that learning a new skill is hard and involves a lot of failure, but the payoff is in feeling rewarded for having tried something new.

LIS 6371 Open Source R Blog - Christine Tran