[Updated: Sat, Oct 23, 2021 - 12:45:51 ]

There are two datasets we will analyze throughout the whole course. The first dataset has a continuous outcome and the second dataset has a binary outcome. We will apply several methods and algorithms to these two datasets during the course. This will give us an opportunity to compare and contrast the prediction outcomes from several models and methods on the same datasets. This section provides some background information and context for these two datasets.

Readability

The readability dataset comes from a recent Kaggle Competition (CommonLit Readability Prize). You can directly download the training dataset from the competition website, or you can import it from the course website.

readability <- read.csv('https://raw.githubusercontent.com/uo-datasci-specialization/c4-ml-fall-2021/main/data/readability.csv',
                        header=TRUE)

str(readability)
'data.frame':   2834 obs. of  6 variables:
 $ id            : chr  "c12129c31" "85aa80a4c" "b69ac6792" "dd1000b26" ...
 $ url_legal     : chr  "" "" "" "" ...
 $ license       : chr  "" "" "" "" ...
 $ excerpt       : chr  "When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an inte"| __truncated__ "All through dinner time, Mrs. Fayre was somewhat silent, her eyes resting on Dolly with a wistful, uncertain ex"| __truncated__ "As Roger had predicted, the snow departed as quickly as it came, and two days after their sleigh ride there was"| __truncated__ "And outside before the palace a great garden was walled round, filled full of stately fruit-trees, gray olives "| __truncated__ ...
 $ target        : num  -0.34 -0.315 -0.58 -1.054 0.247 ...
 $ standard_error: num  0.464 0.481 0.477 0.45 0.511 ...

There are 2,834 observations in total. Each observation represents a reading passage. The most important variables are the excerpt and target columns. The excerpt column contains plain text, and the target column contains the corresponding readability measure for each excerpt.

readability[1,]$excerpt
[1] "When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.\nThe floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.\nAt each end of the room, on the wall, hung a beautiful bear-skin rug.\nThese rugs were for prizes, one for the girls and one for the boys. And this was the game.\nThe girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.\nThis would have been an easy matter, but each traveller was obliged to wear snowshoes."
readability[1,]$target
[1] -0.3402591

According to the data owner, ‘the target value is the result of a Bradley-Terry analysis of more than 111,000 pairwise comparisons between excerpts. Teachers spanning grades 3-12 served as the raters for these comparisons.’ A lower target value indicates a more difficult text to read. The highest target score is equivalent to the 3rd grade level, while the lowest target score is equivalent to the 12th grade level. The purpose is to develop a model that predicts a readability score for a given text in order to identify an appropriate reading level.
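To make the Bradley-Terry idea concrete, here is a minimal sketch of how such a model turns two scores into a comparison probability. It assumes the target values sit directly on the Bradley-Terry logit scale, which the competition description quoted above does not confirm (the published scores may have been rescaled). The helper name bt_prob is hypothetical.

```r
# Hypothetical helper: under a Bradley-Terry model, the probability that
# passage A is judged easier to read than passage B depends only on the
# difference between their scores (assumed here to be on the logit scale).
bt_prob <- function(score_a, score_b) {
  plogis(score_a - score_b)   # same as exp(a) / (exp(a) + exp(b))
}

# Target values of the fifth and first excerpts, taken from the str() output
p <- bt_prob(0.247, -0.34)
p
```

Under this assumption, the excerpt with the higher target value would be picked as the easier one more than half of the time, and two excerpts with equal scores would each be picked half of the time.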

We will not consider the standard error variable in our models, although it has a strong relationship with the target outcome, because standard errors would not be available for the new observations we would like to predict. There may be creative ways to make use of the standard error in a multi-step prediction model (e.g., develop a separate prediction model for standard errors in the first step, and then use the predicted standard errors to predict target scores in the second step); however, we will not get into that in this course.

In the following weeks, we will cover how to generate features from plain text data and whether or not these features can successfully predict the target scores. These features will include universal POS tags, morphological features, syntactic annotations, and some other simple text features (e.g., number of words, number of syllables).
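As a rough preview of what a simple text feature looks like, the sketch below counts characters, words, and sentences with base R only. The helper simple_text_features() is hypothetical, and its whitespace and punctuation rules are crude approximations, not the tokenization that udpipe or quanteda would apply.

```r
# Hypothetical helper: crude, base-R approximations of simple text features.
simple_text_features <- function(text) {
  # Words: split on runs of whitespace
  words <- strsplit(trimws(text), "\\s+")
  # Sentences: each run of ., !, or ? is treated as one sentence boundary
  sentence_ends <- regmatches(text, gregexpr("[.!?]+", text))
  data.frame(n_char     = nchar(text),
             n_word     = lengths(words),
             n_sentence = lengths(sentence_ends))
}

# First two sentences of the first excerpt shown earlier
excerpt <- paste("When the young people returned to the ballroom,",
                 "it presented a decidedly changed appearance.",
                 "Instead of an interior scene, it was a winter landscape.")

features <- simple_text_features(excerpt)
features$avg_word_per_sentence <- features$n_word / features$n_sentence
features
```

The function is vectorized, so it could in principle be applied to the whole excerpt column to obtain one row of features per passage.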

In addition, we will get a brief introduction to the world of Natural Language Processing (NLP) through some pre-trained language models (e.g., RoBERTa). Our coverage of this material will be at the surface level. We will primarily cover how to derive numerical sentence embeddings from a pre-trained language model using Python through R.

You will need to install the following packages for the coming weeks:

install.packages(pkgs = c('udpipe',
                          'quanteda',
                          'quanteda.textstats'), 
                 dependencies = TRUE)
                 
# Make sure to install the developer version of the text package from Github 

# install.packages("devtools")

devtools::install_github("oscarkjell/text")

You can run the following code on your computer to prepare for the coming weeks. You only need to run it once to install the necessary packages.

# Install and load the reticulate package

install.packages(pkgs = 'reticulate',
                 dependencies = TRUE)

require(reticulate)

# Install Miniconda

install_miniconda()

# Install the Python modules 

conda_install(envname = 'r-reticulate', packages = 'torch', pip = TRUE)

conda_install(envname = 'r-reticulate', packages = 'transformers', pip = TRUE)

conda_install(envname = 'r-reticulate', packages = 'nltk', pip = TRUE)

conda_install(envname = 'r-reticulate', packages = 'tokenizers', pip = TRUE)

Once you install the Python packages using the code above, you can run the following code. If you are seeing the same output as below, you should be all set to explore some very exciting NLP tools using the Readability dataset.

require(reticulate)
Loading required package: reticulate
# Import the modules

reticulate::import('torch')
reticulate::import('numpy')
reticulate::import('transformers')
reticulate::import('nltk')
reticulate::import('tokenizers')

# Load udpipe

require(udpipe)
Loading required package: udpipe
# Load quanteda

require(quanteda)
Loading required package: quanteda
Package version: 3.1.0
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 40 of 40 threads used.
See https://quanteda.io for tutorials and examples.
# Load quanteda text stats

require(quanteda.textstats)
Loading required package: quanteda.textstats
# Load the text package

require(text)
Loading required package: text
Registered S3 method overwritten by 'tune':
  method                   from   
  required_pkgs.model_spec parsnip
This is text (version 0.9.12). 
Text is new and still rapidly improving. 
Newer versions may have improved functions and updated defaults to reflect current understandings of the state-of-the-art. 
Please send us feedback based on your experience.
Module(torch)
Module(numpy)
Module(transformers)
Module(nltk)
Module(tokenizers)

Recidivism

The Recidivism dataset comes from the National Institute of Justice’s (NIJ) Recidivism Forecasting Challenge. The challenge aims to increase public safety and improve the fair administration of justice across the United States. It had three stages of prediction, and all three stages require modeling a binary outcome (recidivated vs. not recidivated in Year 1, Year 2, and Year 3). In this class, we will work only on the second stage and develop a model for predicting the probability that an individual will recidivate in the second year after initial release.

You can directly download the training dataset from the competition website, or you can import it from the course website. Either way, please make sure you read the Terms of Use at this link before working with this dataset.

recidivism <- read.csv('https://raw.githubusercontent.com/uo-datasci-specialization/c4-ml-fall-2021/main/data/recidivism_full.csv',header=TRUE)

str(recidivism)
'data.frame':   25835 obs. of  54 variables:
 $ ID                                               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Gender                                           : chr  "M" "M" "M" "M" ...
 $ Race                                             : chr  "BLACK" "BLACK" "BLACK" "WHITE" ...
 $ Age_at_Release                                   : chr  "43-47" "33-37" "48 or older" "38-42" ...
 $ Residence_PUMA                                   : int  16 16 24 16 16 17 18 16 5 16 ...
 $ Gang_Affiliated                                  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Supervision_Risk_Score_First                     : int  3 6 7 7 4 5 2 5 7 5 ...
 $ Supervision_Level_First                          : chr  "Standard" "Specialized" "High" "High" ...
 $ Education_Level                                  : chr  "At least some college" "Less than HS diploma" "At least some college" "Less than HS diploma" ...
 $ Dependents                                       : chr  "3 or more" "1" "3 or more" "1" ...
 $ Prison_Offense                                   : chr  "Drug" "Violent/Non-Sex" "Drug" "Property" ...
 $ Prison_Years                                     : chr  "More than 3 years" "More than 3 years" "1-2 years" "1-2 years" ...
 $ Prior_Arrest_Episodes_Felony                     : chr  "6" "7" "6" "8" ...
 $ Prior_Arrest_Episodes_Misd                       : chr  "6 or more" "6 or more" "6 or more" "6 or more" ...
 $ Prior_Arrest_Episodes_Violent                    : chr  "1" "3 or more" "3 or more" "0" ...
 $ Prior_Arrest_Episodes_Property                   : chr  "3" "0" "2" "3" ...
 $ Prior_Arrest_Episodes_Drug                       : chr  "3" "3" "2" "3" ...
 $ Prior_Arrest_Episodes_PPViolationCharges         : chr  "4" "5 or more" "5 or more" "3" ...
 $ Prior_Arrest_Episodes_DVCharges                  : logi  FALSE TRUE TRUE FALSE TRUE FALSE ...
 $ Prior_Arrest_Episodes_GunCharges                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Prior_Conviction_Episodes_Felony                 : chr  "3 or more" "3 or more" "3 or more" "3 or more" ...
 $ Prior_Conviction_Episodes_Misd                   : chr  "3" "4 or more" "2" "4 or more" ...
 $ Prior_Conviction_Episodes_Viol                   : logi  FALSE TRUE TRUE FALSE TRUE FALSE ...
 $ Prior_Conviction_Episodes_Prop                   : chr  "2" "0" "1" "3 or more" ...
 $ Prior_Conviction_Episodes_Drug                   : chr  "2 or more" "2 or more" "2 or more" "2 or more" ...
 $ Prior_Conviction_Episodes_PPViolationCharges     : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ Prior_Conviction_Episodes_DomesticViolenceCharges: logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
 $ Prior_Conviction_Episodes_GunCharges             : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ Prior_Revocations_Parole                         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Prior_Revocations_Probation                      : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ Condition_MH_SA                                  : logi  TRUE FALSE TRUE TRUE TRUE FALSE ...
 $ Condition_Cog_Ed                                 : logi  TRUE FALSE TRUE TRUE TRUE FALSE ...
 $ Condition_Other                                  : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
 $ Violations_ElectronicMonitoring                  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Violations_Instruction                           : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
 $ Violations_FailToReport                          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Violations_MoveWithoutPermission                 : logi  FALSE FALSE TRUE FALSE FALSE TRUE ...
 $ Delinquency_Reports                              : chr  "0" "4 or more" "4 or more" "0" ...
 $ Program_Attendances                              : chr  "6" "0" "6" "6" ...
 $ Program_UnexcusedAbsences                        : chr  "0" "0" "0" "0" ...
 $ Residence_Changes                                : chr  "2" "2" "0" "3 or more" ...
 $ Avg_Days_per_DrugTest                            : num  612 35.7 93.7 25.4 23.1 ...
 $ DrugTests_THC_Positive                           : num  0 0 0.333 0 0 ...
 $ DrugTests_Cocaine_Positive                       : num  0 0 0 0 0 0 0 0 NA 0 ...
 $ DrugTests_Meth_Positive                          : num  0 0 0.1667 0 0.0588 ...
 $ DrugTests_Other_Positive                         : num  0 0 0 0 0 0 0 0 NA 0 ...
 $ Percent_Days_Employed                            : num  0.489 0.425 0 1 0.204 ...
 $ Jobs_Per_Year                                    : num  0.448 2 0 0.719 0.929 ...
 $ Employment_Exempt                                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Recidivism_Within_3years                         : logi  FALSE TRUE TRUE FALSE TRUE FALSE ...
 $ Recidivism_Arrest_Year1                          : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ Recidivism_Arrest_Year2                          : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ Recidivism_Arrest_Year3                          : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ Training_Sample                                  : int  1 1 1 1 1 0 1 0 1 1 ...

There are 25,835 observations and 54 variables in this dataset, including a unique ID variable, four outcome variables (recidivism in Year 1, in Year 2, in Year 3, and within 3 years), and a filter variable that indicates whether an observation was included in the training dataset or the test dataset. The remaining 48 variables are potential predictive features. A full list of these variables can be found at this link.
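The bookkeeping behind the count of 48 features can be sketched as follows. The example uses only a handful of the column names from the str() output so that it runs on its own; with the data loaded, names(recidivism) would presumably take the place of the toy vector all_vars, which is a name introduced here for illustration.

```r
# A few of the column names shown in the str() output above (toy subset)
all_vars <- c('ID', 'Gender', 'Race', 'Age_at_Release',
              'Recidivism_Within_3years', 'Recidivism_Arrest_Year1',
              'Recidivism_Arrest_Year2', 'Recidivism_Arrest_Year3',
              'Training_Sample')

# Columns that are not predictive features:
# the ID, the four outcomes, and the training/test filter
non_features <- c('ID',
                  'Recidivism_Within_3years',
                  'Recidivism_Arrest_Year1',
                  'Recidivism_Arrest_Year2',
                  'Recidivism_Arrest_Year3',
                  'Training_Sample')

# Whatever remains is a candidate predictor
predictors <- setdiff(all_vars, non_features)
predictors

# Same arithmetic for the full dataset: 54 columns minus the non-features
n_features <- 54 - length(non_features)
n_features
```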

We will work on developing a model to predict the outcome variable Recidivism_Arrest_Year2 using the 48 potential predictive variables. Before moving forward, we have to remove the individuals who had already recidivated in Year 1. As you can see below, about 29.9% of the individuals recidivated in Year 1. I am removing these individuals from the original dataset.

table(recidivism$Recidivism_Arrest_Year1)

FALSE  TRUE 
18111  7724 
recidivism2 <- recidivism[recidivism$Recidivism_Arrest_Year1 == FALSE,]
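The 29.9% figure can be reproduced from the counts in the table above. The sketch below hard-codes those two counts so that it runs without downloading the data; with the data loaded, prop.table(table(recidivism$Recidivism_Arrest_Year1)) should give the same proportions.

```r
# Counts copied from the table() output above
year1 <- c('FALSE' = 18111, 'TRUE' = 7724)

# Total number of individuals in the full dataset
sum(year1)

# Percentage in each category, rounded to one decimal place
pct <- round(100 * year1 / sum(year1), 1)
pct
```

The FALSE count is also the number of rows we expect to remain in recidivism2 after the Year 1 cases are removed.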

I will also recode some variables before saving the new dataset for later use in class.

  • First, some variables in the dataset are coded as TRUE and FALSE. When these variables are imported into R, R automatically recognizes them as logical variables. I will recode all these variables such that FALSE = 0 and TRUE = 1.
# Find the columns recognized as logical

  cols <- sapply(recidivism, is.logical)

# Convert them to numeric 0s and 1s

  recidivism2[,cols] <- lapply(recidivism2[,cols], as.numeric)
  • Second, the highest value of some variables is coded as 3 or more, 4 or more, 10 or more, etc. These variables can be treated as numeric, but R reads them as character vectors because of the phrase ‘or more’ in the highest value. We will recode these variables so that ‘X or more’ is equal to X.
require(dplyr)

# ?recode for more info

# Dependents

  recidivism2$Dependents <- recode(recidivism2$Dependents,
                                   '0'=0,
                                   '1'=1,
                                   '2'=2,
                                   '3 or more'=3)

# Prior Arrest Episodes Felony

  recidivism2$Prior_Arrest_Episodes_Felony <- recode(recidivism2$Prior_Arrest_Episodes_Felony,
                                                     '0'=0,
                                                     '1'=1,
                                                     '2'=2,
                                                     '3'=3,
                                                     '4'=4,
                                                     '5'=5,
                                                     '6'=6,
                                                     '7'=7,
                                                     '8'=8,
                                                     '9'=9,
                                                     '10 or more'=10)
# Prior Arrest Episodes Misd

  recidivism2$Prior_Arrest_Episodes_Misd <- recode(recidivism2$Prior_Arrest_Episodes_Misd,
                                                   '0'=0,
                                                   '1'=1,
                                                   '2'=2,
                                                   '3'=3,
                                                   '4'=4,
                                                   '5'=5,
                                                   '6 or more'=6)
  
# Prior Arrest Episodes Violent

  recidivism2$Prior_Arrest_Episodes_Violent <- recode(recidivism2$Prior_Arrest_Episodes_Violent,
                                                      '0'=0,
                                                      '1'=1,
                                                      '2'=2,
                                                      '3 or more'=3)

# Prior Arrest Episodes Property

  recidivism2$Prior_Arrest_Episodes_Property <- recode(recidivism2$Prior_Arrest_Episodes_Property,
                                                       '0'=0,
                                                       '1'=1,
                                                       '2'=2,
                                                       '3'=3,
                                                       '4'=4,
                                                       '5 or more'=5)
  
# Prior Arrest Episodes Drug

  recidivism2$Prior_Arrest_Episodes_Drug <- recode(recidivism2$Prior_Arrest_Episodes_Drug,
                                                   '0'=0,
                                                   '1'=1,
                                                   '2'=2,
                                                   '3'=3,
                                                   '4'=4,
                                                   '5 or more'=5) 
# Prior Arrest Episodes PPViolationCharges

  recidivism2$Prior_Arrest_Episodes_PPViolationCharges <- recode(recidivism2$Prior_Arrest_Episodes_PPViolationCharges,
                                                                 '0'=0,
                                                                 '1'=1,
                                                                 '2'=2,
                                                                 '3'=3,
                                                                 '4'=4,
                                                                 '5 or more'=5)  
  
# Prior Conviction Episodes Felony

  recidivism2$Prior_Conviction_Episodes_Felony <- recode(recidivism2$Prior_Conviction_Episodes_Felony,
                                                         '0'=0,
                                                         '1'=1,
                                                         '2'=2,
                                                         '3 or more'=3)

# Prior Conviction Episodes Misd

  recidivism2$Prior_Conviction_Episodes_Misd <- recode(recidivism2$Prior_Conviction_Episodes_Misd,
                                                       '0'=0,
                                                       '1'=1,
                                                       '2'=2,
                                                       '3'=3,
                                                       '4 or more'=4)
  
# Prior Conviction Episodes Prop

  recidivism2$Prior_Conviction_Episodes_Prop <- recode(recidivism2$Prior_Conviction_Episodes_Prop,
                                                       '0'=0,
                                                       '1'=1,
                                                       '2'=2,
                                                       '3 or more'=3)

# Prior Conviction Episodes Drug

  recidivism2$Prior_Conviction_Episodes_Drug <- recode(recidivism2$Prior_Conviction_Episodes_Drug,
                                                       '0'=0,
                                                       '1'=1,
                                                       '2 or more'=2)

# Delinquency Reports

  recidivism2$Delinquency_Reports <- recode(recidivism2$Delinquency_Reports,
                                            '0'=0,
                                            '1'=1,
                                            '2'=2,
                                            '3'=3,
                                            '4 or more'=4)

# Program Attendances

  recidivism2$Program_Attendances <- recode(recidivism2$Program_Attendances,
                                            '0'=0,
                                            '1'=1,
                                            '2'=2,
                                            '3'=3,
                                            '4'=4,
                                            '5'=5,
                                            '6'=6,
                                            '7'=7,
                                            '8'=8,
                                            '9'=9,
                                            '10 or more'=10)

# Program Unexcused Absences

  recidivism2$Program_UnexcusedAbsences <- recode(recidivism2$Program_UnexcusedAbsences,
                                                  '0'=0,
                                                  '1'=1,
                                                  '2'=2,
                                                  '3 or more'=3)

# Residence Changes

  recidivism2$Residence_Changes <- recode(recidivism2$Residence_Changes,
                                          '0'=0,
                                          '1'=1,
                                          '2'=2,
                                          '3 or more'=3)  
#############################################################
  
str(recidivism2)  
'data.frame':   18111 obs. of  54 variables:
 $ ID                                               : int  1 2 3 4 6 7 8 11 13 15 ...
 $ Gender                                           : chr  "M" "M" "M" "M" ...
 $ Race                                             : chr  "BLACK" "BLACK" "BLACK" "WHITE" ...
 $ Age_at_Release                                   : chr  "43-47" "33-37" "48 or older" "38-42" ...
 $ Residence_PUMA                                   : int  16 16 24 16 17 18 16 5 18 5 ...
 $ Gang_Affiliated                                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Supervision_Risk_Score_First                     : int  3 6 7 7 5 2 5 3 3 7 ...
 $ Supervision_Level_First                          : chr  "Standard" "Specialized" "High" "High" ...
 $ Education_Level                                  : chr  "At least some college" "Less than HS diploma" "At least some college" "Less than HS diploma" ...
 $ Dependents                                       : num  3 1 3 1 0 2 3 1 1 1 ...
 $ Prison_Offense                                   : chr  "Drug" "Violent/Non-Sex" "Drug" "Property" ...
 $ Prison_Years                                     : chr  "More than 3 years" "More than 3 years" "1-2 years" "1-2 years" ...
 $ Prior_Arrest_Episodes_Felony                     : num  6 7 6 8 4 10 6 3 8 9 ...
 $ Prior_Arrest_Episodes_Misd                       : num  6 6 6 6 0 6 6 6 4 3 ...
 $ Prior_Arrest_Episodes_Violent                    : num  1 3 3 0 1 1 3 2 0 2 ...
 $ Prior_Arrest_Episodes_Property                   : num  3 0 2 3 3 5 1 1 5 2 ...
 $ Prior_Arrest_Episodes_Drug                       : num  3 3 2 3 0 1 2 1 2 4 ...
 $ Prior_Arrest_Episodes_PPViolationCharges         : num  4 5 5 3 0 5 5 3 1 4 ...
 $ Prior_Arrest_Episodes_DVCharges                  : num  0 1 1 0 0 0 0 1 0 0 ...
 $ Prior_Arrest_Episodes_GunCharges                 : num  0 0 0 0 0 1 0 0 0 1 ...
 $ Prior_Conviction_Episodes_Felony                 : num  3 3 3 3 1 3 1 0 1 3 ...
 $ Prior_Conviction_Episodes_Misd                   : num  3 4 2 4 0 1 4 3 0 2 ...
 $ Prior_Conviction_Episodes_Viol                   : num  0 1 1 0 0 0 1 0 0 1 ...
 $ Prior_Conviction_Episodes_Prop                   : num  2 0 1 3 2 3 0 0 2 1 ...
 $ Prior_Conviction_Episodes_Drug                   : num  2 2 2 2 0 0 2 0 1 1 ...
 $ Prior_Conviction_Episodes_PPViolationCharges     : num  0 1 0 0 0 1 1 1 0 1 ...
 $ Prior_Conviction_Episodes_DomesticViolenceCharges: num  0 1 1 0 0 0 0 0 0 0 ...
 $ Prior_Conviction_Episodes_GunCharges             : num  0 1 0 0 0 1 0 0 0 0 ...
 $ Prior_Revocations_Parole                         : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Prior_Revocations_Probation                      : num  0 0 0 1 0 0 0 0 0 0 ...
 $ Condition_MH_SA                                  : num  1 0 1 1 0 0 0 1 0 1 ...
 $ Condition_Cog_Ed                                 : num  1 0 1 1 0 0 1 1 0 1 ...
 $ Condition_Other                                  : num  0 0 0 0 1 0 0 0 0 1 ...
 $ Violations_ElectronicMonitoring                  : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Violations_Instruction                           : num  0 1 1 0 0 0 0 1 0 0 ...
 $ Violations_FailToReport                          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Violations_MoveWithoutPermission                 : num  0 0 1 0 1 0 0 0 0 0 ...
 $ Delinquency_Reports                              : num  0 4 4 0 0 0 0 0 0 0 ...
 $ Program_Attendances                              : num  6 0 6 6 0 0 0 9 0 6 ...
 $ Program_UnexcusedAbsences                        : num  0 0 0 0 0 0 0 2 0 0 ...
 $ Residence_Changes                                : num  2 2 0 3 3 1 0 2 1 1 ...
 $ Avg_Days_per_DrugTest                            : num  612 35.7 93.7 25.4 474.6 ...
 $ DrugTests_THC_Positive                           : num  0 0 0.333 0 0 ...
 $ DrugTests_Cocaine_Positive                       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ DrugTests_Meth_Positive                          : num  0 0 0.167 0 0 ...
 $ DrugTests_Other_Positive                         : num  0 0 0 0 0 ...
 $ Percent_Days_Employed                            : num  0.489 0.425 0 1 0.674 ...
 $ Jobs_Per_Year                                    : num  0.448 2 0 0.719 0.308 ...
 $ Employment_Exempt                                : num  0 0 0 0 0 0 0 1 0 1 ...
 $ Recidivism_Within_3years                         : num  0 1 1 0 0 1 0 1 0 0 ...
 $ Recidivism_Arrest_Year1                          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Recidivism_Arrest_Year2                          : num  0 0 1 0 0 0 0 1 0 0 ...
 $ Recidivism_Arrest_Year3                          : num  0 1 0 0 0 1 0 0 0 0 ...
 $ Training_Sample                                  : int  1 1 1 1 0 1 0 1 1 0 ...
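The two recoding steps can also be sketched more compactly on a toy data frame. This is an alternative to the explicit recode() calls, not the code that produced the output above. The helper top_code_to_numeric() and the toy data are hypothetical, and the helper assumes that every non-missing value is either a plain number or a number followed by ' or more'.

```r
# Toy data mimicking a few column types from the recidivism dataset
toy <- data.frame(Gang_Affiliated   = c(FALSE, TRUE, NA),
                  Dependents        = c('0', '3 or more', '1'),
                  Residence_Changes = c('2', NA, '3 or more'),
                  Age_at_Release    = c('43-47', '48 or older', '33-37'),
                  stringsAsFactors  = FALSE)

# Step 1: convert logical columns to numeric 0s and 1s
is_lgl <- sapply(toy, is.logical)
toy[is_lgl] <- lapply(toy[is_lgl], as.numeric)

# Step 2: strip a trailing ' or more' and convert to numeric,
# so that 'X or more' becomes X (hypothetical helper)
top_code_to_numeric <- function(x) {
  as.numeric(sub(' or more$', '', x))
}

# Apply only to columns known to be top-coded counts;
# categorical columns such as Age_at_Release are left untouched
count_vars <- c('Dependents', 'Residence_Changes')
toy[count_vars] <- lapply(toy[count_vars], top_code_to_numeric)

str(toy)
```

Listing the count variables by name keeps the helper away from genuinely categorical columns. The explicit recode() calls are longer, but they spell out every level that is expected in the data.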

Now, we can write the final version of the dataset for later use.

# here() comes from the here package and builds the path from the project root

require(here)

write.csv(recidivism2, 
          here('data/recidivism_y1 removed.csv'),
          row.names = FALSE)