Overview

MoffittFunctions is a collection of useful functions designed to assist in analysis and creation of professional reports. The current MoffittFunctions functions can be broken down to the following sections:

Testing Functions
- two_samp_bin_test
- two_samp_cont_test
- cor_test
Fancy Output Functions
- pretty_pvalues
- stat_paste
- paste_tbl_grp
- pretty_model_output
- run_pretty_model_output
- pretty_km_output
- run_pretty_km_output
Utility Functions
- round_away_0
- get_session_info
- get_full_name
Example Dataset
- Bladder_Cancer

Getting Started

# Loading MoffittFunctions and Example Dataset
library(MoffittFunctions)
data("Bladder_Cancer")

# Loading dplyr for this vignette code
library(dplyr)

MoffittTemplates Package

The MoffittTemplates package makes extensive use of the MoffittFunctions package, and is a great way get started making professional statistical reports.

Code to initially download MoffittTemplates package:

cred = git2r::cred_ssh_key(
    publickey = "MYPATH/.ssh/id_rsa.pub", 
    privatekey = "MYPATH/.ssh/id_rsa")

devtools::install_git(
  "git@gitlab.moffitt.usf.edu:ReproducibleResearch/R_Markdown_Templates.git", 
  credentials = cred)

RStudio you can Global Options -> Git/SVN to see SSH path, and to make SSH key if needed

Once installed, in RStudio go to File -> New File -> R Markdown -> From Template -> Moffitt PDF Report to start a new Markdown report using the template. Within the template there is code to load and make use of most of the MoffittFunctions functionality.

Example Dataset

The Bladder_Cancer dataset is a real world example dataset used throughout this vignette and most example in the documentation. The dataset is cleaned, using factor variables for categorical variables, and also using labels for all variables (created by the Hmisc::label() function).

Testing Functions

There are currently three testing functions, performing the appropriate statistical test depending on the data and options, returning a p value.

Comparing Two Groups (Binary Variable) for a Binary Variable

two_samp_bin_test() is used for comparing a binary variable to a binary (two group) variable, with options for Barnard, Fisher’s Exact, Chi-Sq, and McNemar tests.


table(Bladder_Cancer$Vital_Status, Bladder_Cancer$PT0N0)
#>        
#>         No Completed Response Complete Response
#>   Alive                    74                33
#>   Dead                     57                 2

two_samp_bin_test(x = Bladder_Cancer$Vital_Status, y = Bladder_Cancer$PT0N0, 
                  method = 'fisher')
#> [1] 0.00001536125

Comparing Two Groups (Binary Variable) for a Continuous Variable

two_samp_cont_test() is used for comparing a continuous variable to a binary (two group) variable, with parametric (t.test) and non-parametric (Wilcox Rank-Sum) options. Also pair data is allowed, where there are parametric (paired t.test) and non-parametric (Wilcox Signed-Rank) options.


by(Bladder_Cancer$Survival_Days, Bladder_Cancer$PT0N0, summary)
#> Bladder_Cancer$PT0N0: No Completed Response
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     3.0   315.0   593.0   896.6  1292.5  3873.0 
#> -------------------------------------------------------- 
#> Bladder_Cancer$PT0N0: Complete Response
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>      34     538     817    1096    1581    3137

two_samp_cont_test(x = Bladder_Cancer$Survival_Days, y = Bladder_Cancer$PT0N0, 
                   method = 'wilcox')
#> [1] 0.06303205

Comparing Two Continuous Variables (Correlation)

cor_test() is used for comparing two continuous variables, with Pearson, Kendall, and Spearman methods. If Spearman method is chosen and either variable has a tie the approximate distribution is use in the coin::spreaman_test() function. This is usually the preferred method over the asymptotic approximation, which is the method stats:cor.test() uses in cases of ties.


cor(Bladder_Cancer$Age_At_Diagnosis, Bladder_Cancer$Survival_Days, 
    method = 'spearman')
#> [1] -0.13349

cor_test(x = Bladder_Cancer$Age_At_Diagnosis, y = Bladder_Cancer$Survival_Days,
         method = 'spearman')
#> [1] 0.0852

Fancy Output Functions

There are currently seven functions designed to produce professional output that can easily printed in reports.

P Values

pretty_pvalues() can be used on p values, rounding them to a specified digit amount and using < for low p values, as opposed to scientific notation (i.e. “p < 0.0001” if rounding to 4 digits), allows options for emphasizing p-values and specific characters for missing.

pvalue_example = c(1, 0.06753, 0.004435, NA, 1e-16, 0.563533)
# For simple p value display
pretty_pvalues(pvalue_example, digits = 3)
#> [1] "1.000"  "0.068"  "0.004"  "---"    "<0.001" "0.564"

# For display in report
table_p_Values <- pretty_pvalues(pvalue_example, digits = 3, background = "yellow")
kableExtra::kable(table_p_Values, format = 'latex', escape = FALSE, 
                  col.names = c("P-values"), caption = 'Fancy P Values')

You can also specify if you want p= pasted on the front of the p values.

Basic Combining of Variables

stat_paste() is used to combine two or three statistics together, allowing for different rounding and bound character specifications. Common uses for this function are for:

Mean (sd)
Median [min, max]
Estimate (SE of Estimate)
Estimate (95% CI Lower Bound, Upper Bound)
Estimate/Statistic (p value)

# Simple Examples
stat_paste(stat1 = 2.45, stat2 = 0.214, stat3 = 55.3, 
           digits = 2, bound_char = '[')
#> [1] "2.45 [0.21, 55.30]"
stat_paste(stat1 = 6.4864, stat2 = pretty_pvalues(0.0004, digits = 3), 
           digits = 3, bound_char = '(')
#> [1] "6.486 (<0.001)"


Bladder_Cancer %>% 
  group_by(Gender) %>% 
  summarise(`Survival Info (Median [Range])` = 
              stat_paste(stat1 = median(Survival_Months), 
                         stat2 = min(Survival_Months),
                         stat3 = max(Survival_Months),
                         digits = 2, bound_char = '['))
#> # A tibble: 2 x 2
#>   Gender `Survival Info (Median [Range])`
#>   <fct>  <chr>                           
#> 1 Female 30.21 [1.90, 109.00]            
#> 2 Male   20.40 [0.10, 127.19]

Bladder_Cancer %>% 
  summarise(
    `Age vs. Survival Months Cor(p value)` = 
      stat_paste(stat1 = cor(Age_At_Diagnosis, Survival_Months, method = 'spearman'), 
                 stat2 = 
                   pretty_pvalues(
                     cor_test(Age_At_Diagnosis, Survival_Months, method = 'spearman'), 
                     digits = 3, include_p = TRUE),
                 digits = 2, bound_char = '('))
#> # A tibble: 1 x 1
#>   `Age vs. Survival Months Cor(p value)`
#>   <chr>                                 
#> 1 -0.13 (p=0.085)

Advanced Combining of Variables

paste_tbl_grp() paste together information, often statistics, from two groups. There are two predefined combinations: mean(sd) and median[min,max], but user may also paste any single measure together.


summary_info <- Bladder_Cancer %>%
 group_by(Gender, Any_Downstaging) %>%
 summarise_at("Survival_Months", funs(n = length, mean, sd, median, min, max)) %>%
 tidyr::gather(variable, value, -Any_Downstaging, -Gender) %>%
 tidyr::unite(var, Any_Downstaging, variable) %>% 
 tidyr::spread(var, value) %>%
 mutate(`No Downstaging` = "No Downstaging", Downstaging = "Downstaging") %>% 
 paste_tbl_grp(vars_to_paste = c('n', 'mean_sd', 'median_min_max'), 
               first_name = 'No Downstaging', second_name = 'Downstaging')

kableExtra::kable(summary_info, format = 'latex', escape = TRUE, booktabs = TRUE, 
                  caption = 'Summary Information Comparison') %>% 
  kableExtra::kable_styling(font_size = 6.5) %>% 
  kableExtra::footnote(
    'Summary Information for Downstaging vs. No-Downstaging, by Gender')

Model Output Functions

pretty_model_output() and run_pretty_model_output() are used to produce professional tables for single or multiple Linear, Logistic, or Cox Proportional-Hazards Regression Models, calculating estimates, odds ratios, or hazard ratios, respectively, with confidence intervals. P values are also produced. For categorical variables with 3+ levels overall Type 3 p values are calculated (matches SAS’s default overall p values), in addition to p values comparing to the first level (reference).

pretty_model_output() uses the model fits, while run_pretty_model_output() uses the variables and dataset, running the desired model. The run_pretty_model_output() will use variable labels if they exist (created by the Hmisc::label() function). Many details can be adjusted, such as overall test method (“Wald” or “LR”), title (will be added as column), confidence level, estimate and p value rounded digits, significant alpha level for highlighting along with color, italic, and bolding p value options, and latex or non-latex desired output.

In run_pretty_model_output(), y_in, event_in, and event_level are used defined differently, depending on the type of model. For Linear Regression y_in is the dependent variable, and event_in and event_level are left NULL. For Logistic Regression y_in is the dependent variable, event_level is the event level of the variable (i.e. “1” or “Response”), and event_in is left NULL. For Cox Regression y_in is the time component, event_in is the event status variable, and event_level is the event level of the event_in variable (i.e. “1” or “Dead”).

Linear Regression Example

# Using pretty_model_output() for a single multivariable model

my_fit <- lm(
  Surgery_Year ~ Age_At_Diagnosis + Gender + Clinical_Stage_Grouped + 
    Histology_Grouped, data = Bladder_Cancer)
pretty_model_output(fit = my_fit, model_data = Bladder_Cancer)
#> # A tibble: 9 x 5
#>   Variable      Level          `Est (95% CI)`    `P Value` `Overall P Valu…
#>   <chr>         <chr>          <chr>             <chr>     <chr>           
#> 1 Age           ""             0.034 (-0.014, 0… 0.1617    ""              
#> 2 Gender        Female         1.0 (Reference)   -         ""              
#> 3 Gender        Male           -0.414 (-1.379, … 0.3983    ""              
#> 4 Clinical AJC… Stage I/II (<… 1.0 (Reference)   -         0.0004          
#> 5 Clinical AJC… Stage III (T3… -2.190 (-3.259, … <0.0001   ""              
#> 6 Clinical AJC… Stage IV (T4N… -1.007 (-2.490, … 0.1817    ""              
#> 7 Histology     Pure Urotheli… 1.0 (Reference)   -         0.0038          
#> 8 Histology     Mixed Tumors   1.109 (0.093, 2.… 0.0326    ""              
#> 9 Histology     Variant Histo… 1.661 (0.605, 2.… 0.0022    ""


vars_to_run = c('Age_At_Diagnosis', 'Gender', 'Clinical_Stage_Grouped', 'Histology_Grouped')

# Using run_pretty_model_output() for multiple univariate linear regression models
# Use purrr::map_dfr function to run the run_pretty_model_output() multiple times
linear_univariate_output <- purrr::map_dfr(
  vars_to_run, run_pretty_model_output, model_data = Bladder_Cancer, 
  y_in = 'Surgery_Year', event_in = NULL, event_level = NULL, 
  output_type = 'latex') 

#Use kableExtra to make fancy output
kableExtra::kable(
  linear_univariate_output, 'latex', escape = F, booktabs = TRUE, 
  linesep = '', caption = 'Linear Regression Univariate Models') %>%
  kableExtra::collapse_rows(c(1:2), row_group_label_position = 'stack',
                            headers_to_remove = 1:2, latex_hline = 'major') %>% 
  kableExtra::kable_styling(font_size = 10)


# Using run_pretty_model_output() for a multivariable linear regression model
linear_multivariable_output <- run_pretty_model_output(
  x_in = vars_to_run, model_data = Bladder_Cancer, y_in = 'Surgery_Year', 
  event_in = NULL, event_level = NULL, output_type = 'latex')

#Use kableExtra to make fancy output
kableExtra::kable(
  linear_multivariable_output %>% dplyr::select(-n), 'latex', 
  escape = F, booktabs = TRUE, linesep = '', 
  caption = 'Linear Regression Multivariable Model') %>%
  kableExtra::collapse_rows(c(1:2), row_group_label_position = 'stack',
                            headers_to_remove = 1:2, latex_hline = 'major') %>% 
  kableExtra::kable_styling(font_size = 10) %>% 
  kableExtra::footnote(
    paste0('Model sample size is ', 
           linear_multivariable_output %>% select(n) %>% slice(1)))

Logistic Regression Example

# Using pretty_model_output() for a single multivariable model

my_fit <- glm(
  Any_Downstaging == 'Downstaging' ~ Age_At_Diagnosis + Gender + Clinical_Stage_Grouped + 
    Histology_Grouped, data = Bladder_Cancer, family = binomial(link = "logit"))
pretty_model_output(fit = my_fit, model_data = Bladder_Cancer)
#> # A tibble: 9 x 5
#>   Variable       Level          `OR (95% CI)`    `P Value` `Overall P Valu…
#>   <chr>          <chr>          <chr>            <chr>     <chr>           
#> 1 Age            ""             1.027 (0.989, 1… 0.1737    ""              
#> 2 Gender         Female         1.0 (Reference)  -         ""              
#> 3 Gender         Male           0.612 (0.281, 1… 0.2111    ""              
#> 4 Clinical AJCC… Stage I/II (<… 1.0 (Reference)  -         0.0168          
#> 5 Clinical AJCC… Stage III (T3… 0.295 (0.111, 0… 0.0095    ""              
#> 6 Clinical AJCC… Stage IV (T4N… 0.356 (0.099, 1… 0.0912    ""              
#> 7 Histology      Pure Urotheli… 1.0 (Reference)  -         0.1587          
#> 8 Histology      Mixed Tumors   1.102 (0.495, 2… 0.8114    ""              
#> 9 Histology      Variant Histo… 0.457 (0.188, 1… 0.0740    ""


vars_to_run = c('Age_At_Diagnosis', 'Gender', 'Clinical_Stage_Grouped', 'Histology_Grouped')

# Using run_pretty_model_output() for multiple univariate logistic regression models
# Use purrr::map_dfr function to run the run_pretty_model_output() multiple times
logistic_univariate_output <- purrr::map_dfr(
  vars_to_run, run_pretty_model_output, model_data = Bladder_Cancer, 
  y_in = 'Any_Downstaging', event_in = NULL, event_level = 'Downstaging', 
  output_type = 'latex') 

#Use kableExtra to make fancy output
kableExtra::kable(
  logistic_univariate_output, 'latex', escape = F, booktabs = TRUE, 
  linesep = '', caption = 'Logistic Regression Univariate Models') %>%
  kableExtra::collapse_rows(c(1:2), row_group_label_position = 'stack',
                            headers_to_remove = 1:2, latex_hline = 'major') %>% 
  kableExtra::kable_styling(font_size = 10)


# Using run_pretty_model_output() for a multivariable logistic regression model
logistic_multivariable_output <- run_pretty_model_output(
  x_in = vars_to_run, model_data = Bladder_Cancer, y_in = 'Any_Downstaging', 
  event_in = NULL, event_level = 'Downstaging', output_type = 'latex')

#Use kableExtra to make fancy output
kableExtra::kable(
  logistic_multivariable_output %>% dplyr::select(-n), 'latex', 
  escape = F, booktabs = TRUE, linesep = '', 
  caption = 'Logistic Regression Multivariable Model') %>%
  kableExtra::collapse_rows(c(1:2), row_group_label_position = 'stack',
                            headers_to_remove = 1:2, latex_hline = 'major') %>% 
  kableExtra::kable_styling(font_size = 10) %>% 
  kableExtra::footnote(
    paste0('Model sample size is ', 
           logistic_multivariable_output %>% select(n) %>% slice(1)))

Cox Proportional-Hazards Regression Example

# Using pretty_model_output() for a single multivariable model

surv_obj <- survival::Surv(Bladder_Cancer$Survival_Months, Bladder_Cancer$Vital_Status == 'Dead')   
my_fit <- survival::coxph(
  surv_obj ~ Age_At_Diagnosis + Gender + Clinical_Stage_Grouped + 
    Histology_Grouped, data = Bladder_Cancer)
pretty_model_output(fit = my_fit, model_data = Bladder_Cancer)
#> # A tibble: 9 x 5
#>   Variable       Level          `HR (95% CI)`    `P Value` `Overall P Valu…
#>   <chr>          <chr>          <chr>            <chr>     <chr>           
#> 1 Age            ""             0.997 (0.967, 1… 0.8384    ""              
#> 2 Gender         Female         1.0 (Reference)  -         ""              
#> 3 Gender         Male           1.677 (0.863, 3… 0.1269    ""              
#> 4 Clinical AJCC… Stage I/II (<… 1.0 (Reference)  -         0.0349          
#> 5 Clinical AJCC… Stage III (T3… 2.217 (1.209, 4… 0.0101    ""              
#> 6 Clinical AJCC… Stage IV (T4N… 1.683 (0.662, 4… 0.2740    ""              
#> 7 Histology      Pure Urotheli… 1.0 (Reference)  -         0.5314          
#> 8 Histology      Mixed Tumors   1.276 (0.678, 2… 0.4501    ""              
#> 9 Histology      Variant Histo… 1.419 (0.715, 2… 0.3168    ""


vars_to_run = c('Age_At_Diagnosis', 'Gender', 'Clinical_Stage_Grouped', 'Histology_Grouped')

# Using run_pretty_model_output() for multiple univariate Cox regression models
# Use purrr::map_dfr function to run the run_pretty_model_output() multiple times
cox_univariate_output <- purrr::map_dfr(
  vars_to_run, run_pretty_model_output, model_data = Bladder_Cancer, 
  y_in = 'Survival_Months', event_in = 'Vital_Status', event_level = 'Dead', 
  output_type = 'latex') 

#Use kableExtra to make fancy output
kableExtra::kable(
  cox_univariate_output, 'latex', escape = F, booktabs = TRUE, 
  linesep = '', caption = 'Cox Proportional-Hazards Regression Univariate Models') %>%
  kableExtra::collapse_rows(c(1:2), row_group_label_position = 'stack',
                            headers_to_remove = 1:2, latex_hline = 'major') %>% 
  kableExtra::kable_styling(font_size = 10)


# Using run_pretty_model_output() for a multivariable Cox regression model
cox_multivariable_output <- run_pretty_model_output(
  x_in = vars_to_run, model_data = Bladder_Cancer, y_in = 'Survival_Months', 
  event_in = 'Vital_Status', event_level = 'Dead', output_type = 'latex')

#Use kableExtra to make fancy output
kableExtra::kable(
  cox_multivariable_output %>% dplyr::select(-`n (events)`), 'latex', 
  escape = F, booktabs = TRUE, linesep = '', 
  caption = 'Cox Proportional-Hazards Regression Multivariable Model') %>%
  kableExtra::collapse_rows(c(1:2), row_group_label_position = 'stack',
                            headers_to_remove = 1:2, latex_hline = 'major') %>% 
  kableExtra::kable_styling(font_size = 10) %>% 
  kableExtra::footnote(
    paste0('Model sample size (events) is ', 
           cox_multivariable_output %>% select(`n (events)`) %>% slice(1)))

Kaplan–Meier Output Functions

pretty_km_output() and run_pretty_km_output() are used to produce professional tables with Kaplan–Meier median survival estimates, and the estimates at given time points, if listed. pretty_km_output() uses a survfit object, while run_pretty_km_output() uses the variables, and strata if applicable, and runs creates the survfit objects, also calculating the log-rank p value, if applicable.

Many details can be adjusted, such as title (will be added as column), strata name, confidence level, survival estimate prefix (default is “Time”), survival estimate, median estimate, and p value rounded digits, significant alpha level for highlighting along with color, italic, and bolding p value options, and latex or non-latex desired output.


# Using pretty_km_output() for a single comparisons
surv_obj <- survival::Surv(Bladder_Cancer$Survival_Months, 
                           Bladder_Cancer$Vital_Status == 'Dead')   
downstage_fit <- survival::survfit(surv_obj ~ PT0N0, data = Bladder_Cancer)

pretty_km_output(fit = downstage_fit, time_est = 60, 
                 surv_est_prefix = 'Month', surv_est_digits = 3)
#> # A tibble: 2 x 6
#>   Group Level                N `N Events` `Median Estimat… `Month:60`      
#>   <chr> <chr>            <int>      <dbl> <chr>            <chr>           
#> 1 PT0N0 No Completed Re…   131         57 48.7 (30.3, N.E… 0.493 (0.399, 0…
#> 2 PT0N0 Complete Respon…    35          2 N.E.             0.940 (0.863, 1…

# Using run_pretty_km_output() for multiple comparisons

# First create vector of strata to compare (NA for no strate). 
vars_to_run = c(NA, 'Gender', 'Clinical_Stage_Grouped', 'PT0N0', 'Any_Downstaging')

# Next use purrr::map_dfr function to run the run_pretty_km_output() multiple times
km_output <- purrr::map_dfr(
  vars_to_run, run_pretty_km_output, model_data = Bladder_Cancer, 
  time_in = 'Survival_Months', event_in = 'Vital_Status', event_level = 'Dead', 
  time_est = c(24,60), surv_est_prefix = 'Month', p_digits = 5, 
  output_type = 'latex') %>% 
  select(Group, Level, everything())

#Finally use kableExtra to make fancy output
kableExtra::kable(km_output, 'latex', escape = F, booktabs = TRUE, linesep = '', 
     caption = 'Kaplan–Meier Output (Multiple Comparisons)') %>%
  kableExtra::collapse_rows(c(1:2), row_group_label_position = 'stack',
                            headers_to_remove = 1:2, latex_hline = 'major') %>% 
  kableExtra::kable_styling(font_size = 7.5) %>% 
  kableExtra::footnote('Survival Percentage Estimates at 24 and 60 Months')

Utility Functions

round_away_0() is a function to properly perform mathematical rounding (i.e. rounding away from 0 when tied), as opposed to the round() function, which rounds to the nearest even number when tied. Also round_away_0() allows for trailing zeros (i.e. 0.100 if rounding to 3 digits).

vals_to_round = c(NA,-3.5:3.5)
vals_to_round
#> [1]   NA -3.5 -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5
round(vals_to_round)
#> [1] NA -4 -2 -2  0  0  2  2  4
round_away_0(vals_to_round)
#> [1] NA -4 -3 -2 -1  1  2  3  4
round_away_0(vals_to_round, digits = 2, trailing_zeros = TRUE)
#> [1] NA      "-3.50" "-2.50" "-1.50" "-0.50" "0.50"  "1.50"  "2.50"  "3.50"

get_session_info() produces reproducible tables, which are great to add to the end of reports. The first table gives Software Session Information and the second table gives Software Package Version Information get_full_name() is a function used by get_session_info() to get the user’s name, based on user’s ID.

my_session_info <- get_session_info()

kableExtra::kable(my_session_info$platform_table, 'latex', booktabs = TRUE, 
      linesep = '', caption = "Reproducibility Software Session Information") %>% 
      kableExtra::kable_styling(font_size = 8)


kableExtra::kable(my_session_info$packages_table, 'latex', booktabs = TRUE, 
      linesep = '', caption = "Reproducibility Software Package Version Information") %>% 
      kableExtra::kable_styling(font_size = 8)

Introduction to MoffittFunctions

William (Jimmy) Fulp

2019-03-15