Homework Assignment 1
Overview
The purpose of this assignment is to get you working with the recipes package and preprocessing the variables in two different datasets. You will use the same datasets with processed variables to build models in the next assignments.
Please prepare your assignment using R Markdown. You may submit it in any of the following ways, depending on your preference.
- Knit the R Markdown document to a PDF document and submit both the .Rmd and PDF files by uploading them to Canvas.
- Knit the R Markdown document to an HTML document and host it on your website, blog, or any publicly available platform. Then submit the .Rmd file by uploading it to Canvas and include the link to the HTML document as a comment.
- If you store all your work for this class in a GitHub repo, create a folder for this assignment in that repo and put the .Rmd file and PDF document in that folder. Then submit the link to the GitHub repo on Canvas.
To receive full credit, you must complete the following tasks. Please make sure that all the R code you wrote for completing these tasks and any associated output are explicitly printed in your submitted document. If the task asks you to submit the data files you created, please upload these datasets along with your submission.
If you have any questions, please do not hesitate to reach out to me.
Part 1: Preprocessing Text Data
Description
For this part of the assignment, you will work with a Twitter dataset that was randomly sampled from a larger dataset on the Kaggle platform (see this link for the original data). This subset contains 1,500 tweets and three variables.
- sentiment: a character string variable with two values (Positive and Negative) for the outcome variable to predict.
- time: a character string variable indicating the time of a tweet (e.g., Thu Jun 18 07:35:01 PDT 2009).
- tweet: a character string variable that provides the full text of a tweet.
Our ultimate goal is to build a model to predict whether or not a tweet has a positive sentiment by using the information from time of the tweet and text of the tweet. We will do this in the following assignments. For this assignment, we will only engineer features to use them later for building our models and prepare the dataset for model development.
Please complete the following tasks. Provide the R code you wrote and any associated output for each task.
Tasks
Task 1.1 Import the tweet data from this link.
Task 1.2 The time variable in this dataset is a character string such as Thu Jun 18 07:35:01 PDT 2009. Create four new columns in the dataset using this time variable to show the day, date, month, and hour of a tweet. The table below provides some examples of what these four new columns would look like given time as a character string. Make sure that the day column is a numeric variable from 1 to 7 (Monday = 1, Sunday = 7), the date column is a numeric variable from 1 to 31, the hour column is a numeric variable from 0 to 23, and the month column is a factor variable. Calculate and print the frequencies for each new column (day, month, date, and hour).
| time | day | month | date | hour |
|---|---|---|---|---|
| Thu Jun 18 07:35:01 PDT 2009 | 4 | Jun | 18 | 7 |
| Sun May 10 00:31:52 PDT 2009 | 7 | May | 10 | 0 |
| Sun May 31 09:15:19 PDT 2009 | 7 | May | 31 | 9 |
| Fri May 22 07:25:52 PDT 2009 | 5 | May | 22 | 7 |
| Sun May 31 02:09:52 PDT 2009 | 7 | May | 31 | 2 |
| Sun Jun 07 09:13:08 PDT 2009 | 7 | Jun | 7 | 9 |
# Hint: suppose x is a vector of strings in format like 'Sun May 10 00:31:52 PDT 2009'
x <- c("Thu Jun 18 07:35:01 PDT 2009",
"Sun May 10 00:31:52 PDT 2009",
"Sun May 31 09:15:19 PDT 2009",
"Fri May 22 07:25:52 PDT 2009")
x
[1] "Thu Jun 18 07:35:01 PDT 2009" "Sun May 10 00:31:52 PDT 2009"
[3] "Sun May 31 09:15:19 PDT 2009" "Fri May 22 07:25:52 PDT 2009"
# You can extract the days by subsetting from character 1 to 3
substr(x,1,3)
[1] "Thu" "Sun" "Sun" "Fri"
# You can extract the months by subsetting from character 5 to 7
substr(x,5,7)
[1] "Jun" "May" "May" "May"
# You can extract the dates by subsetting from character 9 to 10
substr(x,9,10)
[1] "18" "10" "31" "22"
# You can extract the hours by subsetting from character 12 to 13
substr(x,12,13)
[1] "07" "00" "09" "07"
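Building on the hint above, the following is a minimal sketch of one way to turn the extracted substrings into the required column types. It uses base R only. The data frame name toy and the lookup vector day_levels are illustrative assumptions, not part of the assignment data.

```r
# Illustrative strings in the same format as the time column
x <- c("Thu Jun 18 07:35:01 PDT 2009",
       "Sun May 10 00:31:52 PDT 2009",
       "Sun May 31 09:15:19 PDT 2009",
       "Fri May 22 07:25:52 PDT 2009")

# Assumed lookup order so that Monday = 1 and Sunday = 7
day_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

toy <- data.frame(time = x, stringsAsFactors = FALSE)
toy$day   <- as.numeric(match(substr(toy$time, 1, 3), day_levels)) # numeric, 1 to 7
toy$month <- factor(substr(toy$time, 5, 7))                        # factor
toy$date  <- as.numeric(substr(toy$time, 9, 10))                   # numeric, 1 to 31
toy$hour  <- as.numeric(substr(toy$time, 12, 13))                  # numeric, 0 to 23

# Frequencies for each new column
table(toy$day)
table(toy$month)
table(toy$date)
table(toy$hour)
```

match() returns the position of each abbreviation in day_levels, so any abbreviation that is not in the lookup vector would come back as NA and show up when you inspect the frequencies.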
Task 1.3 Recode the outcome variable (sentiment) into a binary variable such that Positive is equal to 1 and Negative is equal to 0. Calculate and print the frequencies for tweets with positive and negative sentiments.
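As a rough illustration of this kind of recoding, the sketch below uses a small made-up vector rather than the assignment data. The object names sentiment and sentiment_bin are assumptions chosen for the example.

```r
# Illustrative outcome values with the same two labels as the real column
sentiment <- c("Positive", "Negative", "Negative", "Positive", "Positive")

# 1 for Positive, 0 for everything else (here, only Negative)
sentiment_bin <- ifelse(sentiment == "Positive", 1, 0)

# Frequencies of the recoded outcome
table(sentiment_bin)
```

Note that ifelse() maps any label other than "Positive" to 0, so it is worth checking the unique values of the original column first.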
Task 1.4 Load the reticulate package and Python libraries (torch, numpy, transformers, nltk, and tokenizers). Then, load the text package. Using these packages, generate tweet embeddings for each tweet in this dataset using the last layer (layer = 12) from the roberta-base model, a pre-trained NLP model. Tweet embeddings for each tweet should be a vector of numbers with length 768. Append these embeddings to the original data.
Task 1.5 Remove the two columns time and tweet from the dataset as you do not need them anymore.
Task 1.6 Prepare a recipe using the recipe() and prep() functions from the recipes package for final transformation of the variables in this dataset.
First, make sure you have the most recent development version of the recipes package from GitHub. If not, install it from GitHub.
devtools::install_github("tidymodels/recipes")
Your recipe should have the following specifications:
- each cyclic variable (day, date, and hour) is recoded into two new variables of sin and cos terms (?step_harmonic),
- the month variable is recoded into dummy variables using one-hot encoding (?step_dummy),
- all numerical embeddings (Dim1 - Dim768) are standardized (?step_normalize).
Print the blueprint. Your blueprint should look like the following.
Recipe
Inputs:
role #variables
outcome 1
predictor 772
Training data contained 1500 data points and no missing data.
Operations:
Dummy variables from month [trained]
Harmonic numeric variables for <none> [trained]
Harmonic numeric variables for <none> [trained]
Harmonic numeric variables for <none> [trained]
Centering and scaling for Dim1, Dim2, Dim3, Dim4, Dim5, Dim6, Dim7, Dim8,... [trained]
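The sketch below shows how a recipe with these three kinds of steps could be assembled. It is only an illustration under stated assumptions: the toy data frame is made up, it has two embedding columns instead of 768, and the choices frequency = 1 and cycle_size = 7, 31, and 24 are assumptions about how to encode one full cycle per week, month, and day.

```r
library(recipes)

# Made-up data standing in for the tweet data after Tasks 1.2 to 1.5
toy <- data.frame(
  sentiment = c(1, 0, 1, 0, 1, 0),
  day       = c(4, 7, 7, 5, 7, 7),
  date      = c(18, 10, 31, 22, 31, 7),
  hour      = c(7, 0, 9, 7, 2, 9),
  month     = factor(c("Jun", "May", "May", "May", "May", "Jun")),
  Dim1      = c(0.12, -0.40, 0.33, 0.05, -0.21, 0.18),
  Dim2      = c(1.50, 1.10, 0.90, 1.70, 1.30, 1.00)
)

blueprint <- recipe(sentiment ~ ., data = toy) %>%
  step_dummy(month, one_hot = TRUE) %>%                    # one-hot encoding
  step_harmonic(day,  frequency = 1, cycle_size = 7) %>%   # sin and cos terms
  step_harmonic(date, frequency = 1, cycle_size = 31) %>%
  step_harmonic(hour, frequency = 1, cycle_size = 24) %>%
  step_normalize(starts_with("Dim"))                       # center and scale

# Estimate the required statistics, then apply the recipe
prepared <- prep(blueprint, training = toy)
prepared

baked <- bake(prepared, new_data = toy)
names(baked)
```

Depending on the installed version of recipes, step_harmonic() may or may not keep the original day, date, and hour columns in the baked data; see the keep_original_cols argument in ?step_harmonic if your column count differs from the one expected in this assignment.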
Task 1.7 Finally, apply this recipe to the whole dataset and obtain the final version of the dataset with transformed variables. The final dataset should have 1500 rows and 781 columns, as follows:
- one column representing the outcome variable, sentiment,
- one column representing the original day variable,
- one column representing the original date variable,
- one column representing the original hour variable,
- 768 columns for tweet embeddings,
- three columns for dummy variables representing the variable month,
- two columns for the sin and cos terms representing the variable day,
- two columns for the sin and cos terms representing the variable date,
- two columns for the sin and cos terms representing the variable hour.
Task 1.8 Remove the original day, date, and hour variables from the dataset. We no longer need them because we have already created sin and cos terms for each of them.
Task 1.9 Export the final dataset (1500 x 778) as a .csv file and upload it to Canvas along with your submission.
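For reference, dropping columns by name and writing a .csv file can be sketched in base R as follows. The data frame final, its values, and the file name tweet_final.csv are hypothetical and used only for illustration.

```r
# Made-up stand-in for a baked dataset
final <- data.frame(
  sentiment = c(1, 0),
  day       = c(4, 7),
  day_sin_1 = c(-0.43, 0),
  Dim1      = c(0.5, -0.5)
)

# Drop the original cyclic columns if they are still present
drop_cols <- c("day", "date", "hour")
final <- final[, !(names(final) %in% drop_cols), drop = FALSE]

# Write without row names so the file has exactly one column per variable
out_file <- file.path(tempdir(), "tweet_final.csv")  # hypothetical file name
write.csv(final, out_file, row.names = FALSE)

# Read the file back to confirm the dimensions
check <- read.csv(out_file)
dim(check)
```

Setting row.names = FALSE avoids an extra unnamed index column, which would otherwise make the exported file one column wider than the dataset.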
Part 2: Preprocessing Continuous and Categorical Variables
Description
For the second part of the assignment, we are going to use a dataset compiled by Dr. Daniel Anderson. These data are simulated from an actual statewide testing administration across the state of Oregon, and the overall distributions are highly similar to those of the original data. The dataset has 189,426 observations and 29 variables. The table below is the data dictionary for the variables in this dataset.
| # | Variable name | Description |
|---|---|---|
| 1 | id | Student identifier |
| 2 | sex | Code indicating the biological sex of the student (F = Female; M = Male) |
| 3 | ethnic_cd | Code representing the racial/ethnic reporting subgroup category for the student: A = Asian race, non-Hispanic; B = Black/African American race, non-Hispanic; H = Hispanic ethnicity; I = American Indian/Alaskan Native race, non-Hispanic; M = Multi-racial, non-Hispanic; P = Pacific Islander race, non-Hispanic; W = White race, non-Hispanic |
| 4 | enrl_grd | Code indicating the enrolled grade level of the student; or a grade level assigned to an ungraded student based on student age. |
| 5 | tst_bnch | Code indicating the benchmark level of the administered test: 1B = Benchmark 1 (grade 3); 2B = Benchmark 2 (grade 5); 3B = Benchmark 3 (grade 8); G4 = Grade 4 benchmark; G6 = Grade 6 benchmark; G7 = Grade 7 benchmark; X3 = Extended Grade 3; X4 = Extended Grade 4; X5 = Extended Grade 5; X6 = Extended Grade 6; X7 = Extended Grade 7; X8 = Extended Grade 8 |
| 6 | tst_dt | Date the test was taken (mm/dd/yyyy) |
| 7 | migrant_ed_fg | Indicates student participation in a program designed to assure that migratory children receive full and appropriate opportunity to meet the state academic content and student academic achievement standards. |
| 8 | ind_ed_fg | Indicates student participation in a program designed to meet the unique educational and culturally related academic needs of American Indians. |
| 9 | sp_ed_fg | Indicates student participation in an Individualized Education Plan (IEP/IFSP). |
| 10 | tag_ed_fg | Indicates student participation in a Talented and Gifted program. |
| 11 | econ_dsvntg | Indicates student eligibility for a Free or Reduced Lunch program. |
| 12 | ayp_lep | Indicates a student who received services or was eligible to receive services in a Limited English Proficient program: Blank = Not eligible or served by an LEP program; A = First year LEP student without ELPA; B = First year LEP student with ELPA; E = Experienced LEP student (more than 5 years); F = Former LEP (student exited LEP program more than two years ago; new in 2016-17); M = Monitored Year 1 (student exited LEP program in the prior year; new in 2016-17); N = Not eligible or served by an LEP program; S = Monitored Year 2 (student exited LEP program two years ago; new in 2016-17); T = Transitioning (student exited LEP program in the prior year or two years ago; discontinued in 2016-17); W = Student exited an LEP program on or before May 1 of the current year; X = Student exited an LEP program after May 1 of the current year; Y = Student in LEP program between 2 and 5 years |
| 13 | stay_in_dist | Indicates that the student has been enrolled for more than 50% of the days in the school year as of the first school day in May at the district where the student is resident on the first school day in May. |
| 14 | stay_in_schl | Indicates that the student has been enrolled for more than 50% of the days in the school year as of the first school day in May at the school where the student is resident on the first school day in May. |
| 15 | dist_sped | Indicates that the student was enrolled in a district special education program during the school year and received general education classroom instruction for less than 40% of the time as of the first school day in May. |
| 16 | trgt_assist_fg | Flag indicating the record is included in Title 1 Targeted Assistance for the Adequate Yearly Progress (AYP) school performance calculations. |
| 17 | ayp_dist_partic | Flag indicating the record is included in the denominator of Adequate Yearly Progress (AYP) district participation calculations. |
| 18 | ayp_schl_partic | Flag indicating the record is included in the denominator of Adequate Yearly Progress (AYP) school participation calculations. |
| 19 | ayp_dist_prfrm | Flag indicating the record is included in the denominator of Adequate Yearly Progress (AYP) district performance calculations. |
| 20 | ayp_schl_prfrm | Flag indicating the record is included in the denominator of Adequate Yearly Progress (AYP) school performance calculations. |
| 21 | rc_dist_partic | Flag indicating the record is included in the denominator of Report Card (RC) district participation calculations. |
| 22 | rc_schl_partic | Flag indicating the record is included in the denominator of Report Card (RC) school participation calculations. |
| 23 | rc_dist_prfrm | Flag indicating the record is included in the denominator of Report Card (RC) district performance calculations. |
| 24 | rc_schl_prfrm | Flag indicating the record is included in the denominator of Report Card (RC) school performance calculations. |
| 25 | grp_rpt_dist_partic | Flag indicating the record is included in the denominator of Group Report district participation calculations. |
| 26 | grp_rpt_schl_partic | Flag indicating the record is included in the denominator of Group Report school participation calculations. |
| 27 | grp_rpt_dist_prfrm | Flag indicating the record is included in the denominator of Group Report district performance calculations. |
| 28 | grp_rpt_schl_prfrm | Flag indicating the record is included in the denominator of Group Report school performance calculations. |
| 29 | score | Scale Score for Total test |
Tasks
Task 2.1 Import the Oregon testing data from this link.
Task 2.2 The tst_dt variable is a character string such as 5/14/2018 0:00. Create two new columns in the dataset using this variable to show the date and month the test was taken. The table below provides some examples of what these two new columns would look like given tst_dt as a character string. Make sure that both the date and month columns are numeric variables. Once you create these two new columns, remove the column tst_dt from the dataset as you do not need it anymore. Calculate and print the frequencies for the new columns (date and month).
| tst_dt | month | date |
|---|---|---|
| 5/14/2018 0:00 | 5 | 14 |
| 6/5/2018 0:00 | 6 | 5 |
| 5/1/2018 0:00 | 5 | 1 |
| 5/1/2018 0:00 | 5 | 1 |
| 5/22/2018 0:00 | 5 | 22 |
| 5/25/2018 0:00 | 5 | 25 |
# Hint: suppose x is a vector of strings in format of MM/DD/YYYY H:MM
x <- c("5/14/2018 0:00","6/5/2018 0:00","5/1/2018 0:00","5/1/2018 0:00","5/22/2018 0:00","5/25/2018 0:00")
x
[1] "5/14/2018 0:00" "6/5/2018 0:00" "5/1/2018 0:00" "5/1/2018 0:00"
[5] "5/22/2018 0:00" "5/25/2018 0:00"
# You can extract the date and month using the following code
strsplit(x,'/') # returns a list of vectors with each element of x split by /
[[1]]
[1] "5" "14" "2018 0:00"
[[2]]
[1] "6" "5" "2018 0:00"
[[3]]
[1] "5" "1" "2018 0:00"
[[4]]
[1] "5" "1" "2018 0:00"
[[5]]
[1] "5" "22" "2018 0:00"
[[6]]
[1] "5" "25" "2018 0:00"
sapply(strsplit(x,'/'),`[`,1) # calls the first element of each list element
[1] "5" "6" "5" "5" "5" "5"
as.numeric(sapply(strsplit(x,'/'),`[`,1)) # makes them numeric
[1] 5 6 5 5 5 5
as.numeric(sapply(strsplit(x,'/'),`[`,2)) # calls the second element of each list element
[1] 14 5 1 1 22 25
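As an alternative to the strsplit() approach in the hint, the same two columns can be obtained by parsing the strings as dates. This is only a sketch of a different technique, not the method the hint describes. The object names d, month, and date are assumptions for the example.

```r
x <- c("5/14/2018 0:00", "6/5/2018 0:00", "5/1/2018 0:00",
       "5/1/2018 0:00", "5/22/2018 0:00", "5/25/2018 0:00")

# The format covers month/day/year only; the trailing " 0:00" is ignored
d <- as.Date(x, format = "%m/%d/%Y")

# format() returns zero-padded strings such as "05", so convert to numeric
month <- as.numeric(format(d, "%m"))
date  <- as.numeric(format(d, "%d"))

# Frequencies for the new columns
table(month)
table(date)
```

Parsing as a date has the side benefit that a malformed string becomes NA instead of silently producing an odd value, which makes problems easier to spot in the frequency tables.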
Task 2.3 Using the ff_glimpse() function from the finalfit package, provide a snapshot of missingness in this dataset. This function also returns the number of levels for categorical variables. If there is any variable with a large amount of missingness (e.g., more than 75%), remove this variable from the dataset.
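The rule for dropping a variable based on its share of missing values can be sketched in base R as shown below. This does not use ff_glimpse(); it is a separate, simplified illustration of the 75% rule on made-up data, and the names toy, prop_missing, and keep are assumptions.

```r
# Made-up data; ayp_lep is missing in 4 of 5 rows (80%)
toy <- data.frame(
  score   = c(2400, 2450, NA, 2500, 2380),
  sex     = c("F", "M", NA, "F", "M"),
  ayp_lep = c(NA, NA, NA, "E", NA),
  stringsAsFactors = FALSE
)

# Proportion of missing values in each column
prop_missing <- colMeans(is.na(toy))
prop_missing

# Keep only the columns with at most 75% missing values
keep <- names(prop_missing)[prop_missing <= 0.75]
toy  <- toy[, keep, drop = FALSE]
names(toy)
```

is.na() only counts values that were read in as NA. If blank cells were imported as empty strings, they would need to be converted to NA before this calculation.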
Task 2.4 Most of the variables in this dataset are categorical, and many of them are binary variables with a Yes or No response. Check the frequency of unique values for all categorical variables. If any of these variables is coded inconsistently (e.g., Yes is coded as both ‘y’ and ‘Y’), fix the coding. Also, check the distribution of the numeric variables and make sure there are no anomalies.
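A small illustration of this kind of check is given below. The vectors flag and score are made up, and using toupper() with trimws() is just one possible way to harmonize inconsistent codes, assuming the inconsistency is limited to letter case and stray spaces.

```r
# Made-up flag with inconsistent coding
flag <- c("Y", "y", "N", "n ", "N", "Y")

# Frequencies before cleaning, including any missing values
table(flag, useNA = "ifany")

# Remove surrounding whitespace and convert to upper case
flag_clean <- toupper(trimws(flag))
table(flag_clean, useNA = "ifany")

# Made-up numeric variable; summary() helps reveal implausible values
score <- c(2400, 2450, 2380, 9999)
summary(score)
```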
Task 2.5 Prepare a recipe using the recipe() and prep() functions from the recipes package for final transformation of the variables in this dataset.
Suppose that we categorize the variables in this dataset as follows:
- id is the ID variable,
- score is the outcome variable,
- enrl_grd is a numeric predictor,
- date and month are cyclic predictors,
- sex, ethnic_cd, tst_bnch, migrant_ed_fg, ind_ed_fg, sp_ed_fg, tag_ed_fg, econ_dsvntg, stay_in_dist, stay_in_schl, dist_sped, trgt_assist_fg, ayp_dist_partic, ayp_schl_partic, ayp_dist_prfrm, ayp_schl_prfrm, rc_dist_partic, rc_schl_partic, rc_dist_prfrm, rc_schl_prfrm, grp_rpt_dist_partic, grp_rpt_schl_partic, grp_rpt_dist_prfrm, grp_rpt_schl_prfrm are all categorical predictors.
Your recipe should have the following specifications in the order below:
- create an indicator variable for missingness for all predictors,
- remove the numeric predictors with zero variance,
- replace missing values with mean for numeric predictors,
- replace missing values with mode for categorical predictors,
- recode cyclic predictors into two new variables of sin and cos terms,
- expand numeric predictors using natural splines with three degrees of freedom and standardize them,
- recode categorical predictors into dummy variables using one-hot encoding.
Print the blueprint. Your blueprint should look like the following.
Recipe
Inputs:
role #variables
id 1
outcome 1
predictor 27
Training data contained 189426 data points and 538 incomplete rows.
Operations:
Creating missing data variable indicators for sex, ethnic_cd, tst_bnch, migrant_ed_fg, ind_ed... [trained]
Zero variance filter removed na_ind_sex, na_ind_ethnic_cd, na_ind_ts... [trained]
Mean Imputation for enrl_grd, month, date [trained]
Mode Imputation for sex, ethnic_cd, tst_bnch, migrant_ed_fg, ind_ed... [trained]
Harmonic numeric variables for <none> [trained]
Harmonic numeric variables for <none> [trained]
Natural Splines on enrl_grd [trained]
Centering and scaling for enrl_grd_ns_1, enrl_grd_ns_2, enrl_grd_ns_3 [trained]
Dummy variables from sex, ethnic_cd, tst_bnch, migrant_ed_fg, ind_ed_fg, sp_ed... [trained]
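To show how steps of this kind chain together, here is a minimal sketch on a tiny made-up data frame with only a few of the variables. It is an illustration under stated assumptions rather than a reference solution: the toy values are invented, frequency = 1 with cycle_size = 31 and 12 is an assumed encoding of the cycles, and step_ns() is assumed to be available in your installed version of recipes (newer releases also offer step_spline_natural()).

```r
library(recipes)

# Made-up data with a few missing values
toy <- data.frame(
  id       = 1:8,
  score    = c(2400, 2450, 2380, 2500, 2420, 2390, 2470, 2440),
  enrl_grd = c(3, 4, 5, 6, 7, 8, NA, 5),
  month    = c(5, 6, 5, 5, 6, 5, 5, NA),
  date     = c(14, 5, 1, 1, 22, 25, 10, 3),
  sex      = factor(c("F", "M", "F", NA, "M", "F", "M", "F")),
  sp_ed_fg = factor(c("N", "Y", "N", "N", "N", "Y", "N", "N"))
)

blueprint <- recipe(score ~ ., data = toy) %>%
  update_role(id, new_role = "id") %>%                    # id is not a predictor
  step_indicate_na(all_predictors()) %>%                  # missingness indicators
  step_zv(all_numeric_predictors()) %>%                   # drop constant columns
  step_impute_mean(enrl_grd, month, date) %>%             # mean imputation
  step_impute_mode(all_nominal_predictors()) %>%          # mode imputation
  step_harmonic(date,  frequency = 1, cycle_size = 31) %>%
  step_harmonic(month, frequency = 1, cycle_size = 12) %>%
  step_ns(enrl_grd, deg_free = 3) %>%                     # natural splines
  step_normalize(starts_with("enrl_grd_ns")) %>%          # standardize splines
  step_dummy(all_nominal_predictors(), one_hot = TRUE)    # one-hot encoding

prepared <- prep(blueprint, training = toy)
baked    <- bake(prepared, new_data = toy)
names(baked)
```

The order matters here: the missingness indicators have to be created before imputation, because after imputation there are no missing values left to flag. Indicators for columns without any missing values are constant, which is why the zero-variance filter follows immediately.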
Task 2.6 Finally, apply this recipe to the whole dataset and obtain the final version of the dataset with transformed variables. The final dataset should have 189,426 rows and 76 columns, as follows:
- one column representing the ID variable, id,
- one column representing the outcome variable, score,
- one column representing the original date variable,
- one column representing the original month variable,
- eight columns representing missing indicator variables,
- two columns for the sin and cos terms representing the variable date,
- two columns for the sin and cos terms representing the variable month,
- three columns for natural splines of enrl_grd,
- two columns for dummy variables representing sex,
- seven columns for dummy variables representing ethnic_cd,
- six columns for dummy variables representing tst_bnch,
- two columns for dummy variables representing migrant_ed_fg,
- two columns for dummy variables representing ind_ed_fg,
- two columns for dummy variables representing sp_ed_fg,
- two columns for dummy variables representing tag_ed_fg,
- two columns for dummy variables representing econ_dsvntg,
- two columns for dummy variables representing stay_in_dist,
- two columns for dummy variables representing stay_in_schl,
- two columns for dummy variables representing dist_sped,
- two columns for dummy variables representing trgt_assist_fg,
- two columns for dummy variables representing ayp_dist_partic,
- two columns for dummy variables representing ayp_schl_partic,
- two columns for dummy variables representing ayp_dist_prfrm,
- two columns for dummy variables representing ayp_schl_prfrm,
- two columns for dummy variables representing rc_dist_partic,
- two columns for dummy variables representing rc_schl_partic,
- two columns for dummy variables representing rc_dist_prfrm,
- two columns for dummy variables representing rc_schl_prfrm,
- two columns for dummy variables representing grp_rpt_dist_partic,
- two columns for dummy variables representing grp_rpt_schl_partic,
- two columns for dummy variables representing grp_rpt_dist_prfrm,
- two columns for dummy variables representing grp_rpt_schl_prfrm.
Task 2.7 Remove the original date and month variables from the dataset as we do not need them anymore because we already created sin and cos terms for each one of them.
Task 2.8 Export the final dataset (189,426 x 74) as a .csv file and upload it to Canvas along with your submission.