Overview

MoffittFunctions is a collection of useful functions designed to assist in the analysis and creation of professional reports. The current MoffittFunctions functions can be broken down into the following sections:

  • Testing Functions
    • two_samp_bin_test
    • two_samp_cont_test
    • cor_test
  • Fancy Output Functions
    • pretty_pvalues
    • stat_paste
    • paste_tbl_grp
    • pretty_model_output
    • run_pretty_model_output
    • pretty_km_output
    • run_pretty_km_output
  • Utility Functions
    • round_away_0
    • get_session_info
    • get_full_name
  • Example Dataset
    • Bladder_Cancer

MoffittTemplates Package

The MoffittTemplates package makes extensive use of the MoffittFunctions package, and is a great way to get started making professional statistical reports.

Code to initially download MoffittTemplates package:

In RStudio, you can go to Global Options -> Git/SVN to see the SSH key path, and to create an SSH key if needed.

Once installed, in RStudio go to File -> New File -> R Markdown -> From Template -> Moffitt PDF Report to start a new Markdown report using the template. Within the template there is code to load and make use of most of the MoffittFunctions functionality.

Example Dataset

The Bladder_Cancer dataset is a real-world example dataset used throughout this vignette and in most examples in the documentation. The dataset is cleaned: categorical variables are stored as factors, and every variable has a label (created by the Hmisc::label() function).

Testing Functions

There are currently three testing functions. Each performs the appropriate statistical test, depending on the data and options, and returns a p value.

Comparing Two Groups (Binary Variable) for a Binary Variable

two_samp_bin_test() is used for comparing a binary variable to a binary (two group) variable, with options for Barnard, Fisher’s Exact, Chi-Sq, and McNemar tests.
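As a rough illustration of the kinds of tests named above, the sketch below runs Fisher's Exact, Chi-Sq, and McNemar tests directly with base R on made-up data. It does not call two_samp_bin_test() and is not the package's implementation; the data and variable names are hypothetical, and Barnard's test is omitted because base R does not provide it.

```r
# Hypothetical data (not from Bladder_Cancer): a binary response in two groups
group    <- factor(rep(c("A", "B"), times = c(20, 20)))
response <- c(rep(1, 16), rep(0, 4),    # group A: 16 of 20 respond
              rep(1, 6),  rep(0, 14))   # group B: 6 of 20 respond
tab <- table(group, response)

# Unpaired comparisons of the two groups
p_fisher <- fisher.test(tab)$p.value
p_chisq  <- chisq.test(tab, correct = FALSE)$p.value

# Paired comparison: the same 10 subjects measured before and after
before <- c(1, 1, 1, 0, 0, 1, 1, 0, 1, 1)
after  <- c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0)
p_mcnemar <- mcnemar.test(table(before, after))$p.value

c(fisher = p_fisher, chisq = p_chisq, mcnemar = p_mcnemar)
```

McNemar's test is the paired option: it only uses the discordant pairs, so it answers a different question than the unpaired tests.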

Comparing Two Groups (Binary Variable) for a Continuous Variable

two_samp_cont_test() is used for comparing a continuous variable to a binary (two group) variable, with parametric (t.test) and non-parametric (Wilcoxon Rank-Sum) options. Paired data is also allowed, with parametric (paired t.test) and non-parametric (Wilcoxon Signed-Rank) options.
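The four options described above correspond to standard base R tests. The sketch below shows them on small made-up vectors; it does not call two_samp_cont_test(), and all data and names are hypothetical.

```r
# Hypothetical unpaired data: a continuous measure in two independent groups
x_a <- c(12, 15, 11, 18, 14, 13, 17, 16)
x_b <- c(21, 19, 25, 22, 20, 24, 23, 26)

p_ttest   <- t.test(x_a, x_b)$p.value        # parametric (Welch t test)
p_ranksum <- wilcox.test(x_a, x_b)$p.value   # non-parametric (Wilcoxon Rank-Sum)

# Hypothetical paired data: the same 8 subjects measured twice
before <- c(120, 132, 128, 141, 135, 126, 138, 130)
after  <- c(115, 130, 120, 135, 136, 119, 127, 121)

p_paired_t    <- t.test(before, after, paired = TRUE)$p.value       # paired t test
p_signed_rank <- wilcox.test(before, after, paired = TRUE)$p.value  # Wilcoxon Signed-Rank

c(t = p_ttest, rank_sum = p_ranksum,
  paired_t = p_paired_t, signed_rank = p_signed_rank)
```

The paired tests work on the within-subject differences, so the two vectors must be the same length and in the same subject order.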

Comparing Two Continuous Variables (Correlation)

cor_test() is used for comparing two continuous variables, with Pearson, Kendall, and Spearman methods. If the Spearman method is chosen and either variable has a tie, the approximate distribution is used in the coin::spearman_test() function. This is usually preferred over the asymptotic approximation, which is the method stats::cor.test() uses in cases of ties.
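To make the distinction between the asymptotic and approximate (resampling-based) p values concrete, here is a hand-rolled Monte Carlo permutation sketch in base R. It is only meant to convey the idea; it is not how coin::spearman_test() or cor_test() is implemented, and the data, seed, and number of permutations are arbitrary assumptions.

```r
set.seed(2024)  # arbitrary seed so the resampling is reproducible

# Hypothetical data; x contains ties
x <- c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8)
y <- c(2, 1, 3, 3, 5, 4, 6, 8, 7, 9)

# Asymptotic p value: with ties, stats::cor.test() cannot compute an exact
# p value, warns, and falls back to an approximation
p_asymptotic <- suppressWarnings(cor.test(x, y, method = "spearman")$p.value)

# Approximate p value: compare the observed correlation with correlations
# obtained after randomly shuffling y (which breaks any real association)
rho_obs  <- cor(x, y, method = "spearman")
rho_perm <- replicate(10000, cor(x, sample(y), method = "spearman"))
p_approx <- mean(abs(rho_perm) >= abs(rho_obs) - 1e-12)  # small tolerance for floating point

c(rho = rho_obs, asymptotic = p_asymptotic, approximate = p_approx)
```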

Fancy Output Functions

There are currently seven functions designed to produce professional output that can easily be printed in reports.

P Values

pretty_pvalues() can be used on p values, rounding them to a specified number of digits and using < for low p values instead of scientific notation (e.g. “p < 0.0001” if rounding to 4 digits). It also has options for emphasizing p values and for the characters used for missing values.

You can also specify if you want p= pasted on the front of the p values.
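The formatting rule described above can be sketched in a few lines of base R. The helper below, format_p(), is hypothetical and written only for illustration; its name, arguments, and defaults are assumptions and should not be read as the pretty_pvalues() interface.

```r
# Hypothetical helper illustrating the rule: round to `digits`, show
# "<threshold" instead of scientific notation, and mark missing values
format_p <- function(p, digits = 3, include_p = FALSE, missing_char = "---") {
  threshold <- 10^(-digits)
  out <- ifelse(p < threshold,
                paste0("<", formatC(threshold, format = "f", digits = digits)),
                formatC(p, format = "f", digits = digits))
  if (include_p) {
    # "p<0.001" for small values, "p=0.031" otherwise
    out <- ifelse(p < threshold, paste0("p", out), paste0("p=", out))
  }
  out[is.na(p)] <- missing_char
  out
}

format_p(c(0.00001, 0.0312, NA), digits = 4)
format_p(0.2, digits = 2, include_p = TRUE)
```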

Basic Combining of Variables

stat_paste() is used to combine two or three statistics together, allowing for different rounding and bound character specifications. Common uses for this function are for:

  • Mean (sd)
  • Median [min, max]
  • Estimate (SE of Estimate)
  • Estimate (95% CI Lower Bound, Upper Bound)
  • Estimate/Statistic (p value)
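The combinations listed above amount to rounding each statistic and wrapping the second (and third) in a bound character. A minimal base R sketch of that idea follows; paste_stats() is a hypothetical helper, not the stat_paste() interface, and the data are made up.

```r
# Hypothetical helper: "stat1 (stat2)" or "stat1 [stat2, stat3]"
paste_stats <- function(stat1, stat2, stat3 = NULL, digits = 1,
                        bound = c("(", ")")) {
  fmt <- function(v) formatC(v, format = "f", digits = digits)  # keeps trailing zeros
  inner <- if (is.null(stat3)) fmt(stat2) else paste0(fmt(stat2), ", ", fmt(stat3))
  paste0(fmt(stat1), " ", bound[1], inner, bound[2])
}

x <- c(4, 8, 15, 16, 23)  # hypothetical measurements

paste_stats(mean(x), sd(x))                                # Mean (sd)
paste_stats(median(x), min(x), max(x), digits = 0,
            bound = c("[", "]"))                           # Median [min, max]
```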

Advanced Combining of Variables

paste_tbl_grp() pastes together information, often statistics, from two groups. There are two predefined combinations, mean(sd) and median[min,max], but the user may also paste any single measure together.


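library(dplyr)             # %>%, group_by(), summarise_at(), mutate()
library(MoffittFunctions)  # paste_tbl_grp() and the Bladder_Cancer dataset

# Summary statistics per Gender and downstaging group, reshaped to one row per Gender
summary_info <- Bladder_Cancer %>%
 group_by(Gender, Any_Downstaging) %>%
 # a named list replaces the deprecated funs()
 summarise_at("Survival_Months", list(n = length, mean = mean, sd = sd,
                                      median = median, min = min, max = max)) %>%
 tidyr::gather(variable, value, -Any_Downstaging, -Gender) %>%
 tidyr::unite(var, Any_Downstaging, variable) %>%
 tidyr::spread(var, value) %>%
 # columns holding the group names, as used by first_name and second_name below
 mutate(`No Downstaging` = "No Downstaging", Downstaging = "Downstaging") %>%
 paste_tbl_grp(vars_to_paste = c('n', 'mean_sd', 'median_min_max'),
               first_name = 'No Downstaging', second_name = 'Downstaging')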

kableExtra::kable(summary_info, format = 'latex', escape = TRUE, booktabs = TRUE, 
                  caption = 'Summary Information Comparison') %>% 
  kableExtra::kable_styling(font_size = 6.5) %>% 
  kableExtra::footnote(
    'Summary Information for Downstaging vs. No-Downstaging, by Gender')

Model Output Functions

pretty_model_output() and run_pretty_model_output() are used to produce professional tables for single or multiple Linear, Logistic, or Cox Proportional-Hazards Regression Models, calculating estimates, odds ratios, or hazard ratios, respectively, with confidence intervals. P values are also produced. For categorical variables with 3+ levels, overall Type 3 p values are calculated (matching SAS’s default overall p values), in addition to p values comparing each level to the first (reference) level.

pretty_model_output() uses the model fits, while run_pretty_model_output() uses the variables and dataset, running the desired model. The run_pretty_model_output() will use variable labels if they exist (created by the Hmisc::label() function). Many details can be adjusted, such as overall test method (“Wald” or “LR”), title (will be added as column), confidence level, estimate and p value rounded digits, significant alpha level for highlighting along with color, italic, and bolding p value options, and latex or non-latex desired output.

In run_pretty_model_output(), y_in, event_in, and event_level are defined differently depending on the type of model. For Linear Regression, y_in is the dependent variable, and event_in and event_level are left NULL. For Logistic Regression, y_in is the dependent variable, event_level is the event level of the variable (e.g. “1” or “Response”), and event_in is left NULL. For Cox Regression, y_in is the time component, event_in is the event status variable, and event_level is the event level of the event_in variable (e.g. “1” or “Dead”).
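One way to read the three roles is to see how they would map onto ordinary model fits. The sketch below uses the lung dataset from the survival package rather than Bladder_Cancer, and fits the models directly with lm(), glm(), and coxph(); it does not call run_pretty_model_output(), and the mapping shown in the comments is an illustration of the description above, not the package's internals.

```r
library(survival)  # ships with R; provides coxph(), Surv(), and the lung data

# Linear regression: y_in = "wt.loss"; event_in and event_level are not needed
fit_lm <- lm(wt.loss ~ age, data = lung)

# Logistic regression: y_in = "status", event_level = "2" (death in lung)
fit_glm <- glm(I(status == 2) ~ age, family = binomial, data = lung)

# Cox regression: y_in = "time", event_in = "status", event_level = "2"
fit_cox <- coxph(Surv(time, status == 2) ~ age, data = lung)

coef(fit_lm)["age"]        # estimate (change in weight loss per year of age)
exp(coef(fit_glm))["age"]  # odds ratio per year of age
exp(coef(fit_cox))["age"]  # hazard ratio per year of age
```

In each case the quantity reported for the covariate matches the list above: a raw estimate for the linear model, and an exponentiated coefficient (odds ratio or hazard ratio) for the other two.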

Linear Regression Example

Logistic Regression Example

Cox Proportional-Hazards Regression Example

Kaplan–Meier Output Functions

pretty_km_output() and run_pretty_km_output() are used to produce professional tables with Kaplan–Meier median survival estimates, and the estimates at given time points, if listed. pretty_km_output() uses a survfit object, while run_pretty_km_output() uses the variables, and strata if applicable, and creates the survfit objects, also calculating the log-rank p value, if applicable.

Many details can be adjusted, such as title (will be added as column), strata name, confidence level, survival estimate prefix (default is “Time”), survival estimate, median estimate, and p value rounded digits, significant alpha level for highlighting along with color, italic, and bolding p value options, and latex or non-latex desired output.
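For orientation, the raw ingredients of such a table (median survival, estimates at chosen time points, and a log-rank p value) can be obtained from the survival package directly. The sketch below uses the lung dataset and arbitrary time points as assumptions; it shows only the underlying quantities and does not call pretty_km_output() or run_pretty_km_output().

```r
library(survival)  # ships with R; provides survfit(), survdiff(), and lung

# Kaplan-Meier fit stratified by sex (status == 2 marks death in lung)
km_fit <- survfit(Surv(time, status == 2) ~ sex, data = lung)

# Median survival (days) with 95% confidence limits, one row per stratum
km_medians <- summary(km_fit)$table[, c("median", "0.95LCL", "0.95UCL")]
km_medians

# Survival estimates at selected time points (180 and 365 days, chosen arbitrarily)
summary(km_fit, times = c(180, 365))

# Log-rank test comparing the strata
lr <- survdiff(Surv(time, status == 2) ~ sex, data = lung)
p_logrank <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_logrank
```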

Utility Functions

round_away_0() is a function to properly perform mathematical rounding (i.e. rounding away from 0 when tied), as opposed to the round() function, which rounds to the nearest even number when tied. Also round_away_0() allows for trailing zeros (i.e. 0.100 if rounding to 3 digits).
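The difference between the two tie rules is easy to see in base R. The helper round_half_away() below is a hypothetical, simplified stand-in written for this illustration; it is not the round_away_0() source, and it ignores floating-point edge cases (values such as 2.675 that cannot be represented exactly).

```r
# Base R rounds ties to the nearest even number
round(c(0.5, 1.5, 2.5, -2.5))
#> [1]  0  2  2 -2

# Hypothetical helper: round ties away from zero
round_half_away <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}

round_half_away(c(0.5, 1.5, 2.5, -2.5))
#> [1]  1  2  3 -3

# Trailing zeros require a character result
formatC(0.1, format = "f", digits = 3)
#> [1] "0.100"
```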

get_session_info() produces reproducibility tables, which are great to add to the end of reports. The first table gives Software Session Information and the second table gives Software Package Version Information.

get_full_name() is a function used by get_session_info() to get the user’s name, based on the user’s ID.