MS Thesis: Sleep and alertness in medical interns

This post is an abridged version of a consulting project that culminated in my master’s thesis. Most of the verbiage is cut out to highlight the pieces I found most interesting.

TL;DR: Sleep and alertness were statistically comparable between the two shifts, and we could not statistically say that sleep mediated the effect of shift type on alertness. Nonetheless, the analysis was very thorough and a model example of extracting actionable insight from a variety of data streams, and for that alone it is worth skimming.


Our primary objective is to identify whether any difference exists in sleep duration, response speed, or number of attentional lapses between medical interns on day or night shifts. An ancillary objective is to determine whether sleep mediates the effect of shift-type on alertness.


Fatigue contributes to accidents and errors in the workplace, which can be lethal in a medical setting. Two primary exposures to fatigue in the medical setting are sleep deprivation and shift-type (as in day-shift vs night-shift). To combat the increased risk of accidents in night-shift workers found in many recent studies, a regulation was passed in 2011 to limit the length of any shift to 16 hours. Even with this new regulation, there are still serious concerns with night-shift work. Therefore my thesis sought to further study sleep and alertness in medical interns, comparing day-shifters against night-shifters.


Sleep-wake activity of 49 medical interns on 2-week oncology and pulmonary rotations was measured continuously through wrist-worn actigraphy, and supplemented by daily diary entries. Alertness was measured daily through a brief psychomotor vigilance test (PVT). Generalized linear mixed models (GLMM) fit with inverse probability weights (IPW) were used to evaluate sleep and alertness between these two groups in the presence of missing data. Mediation analyses evaluated whether any existing differences in alertness could be attributed to sleep duration. Sensitivity analyses were used to gauge the influence of the inverse probability weights and generally the influence of missing data on our inference of shift-type.

Study population

Our study population consists of medical interns primarily on rotation in the oncology or pulmonary department of the Hospital of the University of Pennsylvania (HUP). Each intern was studied for two 2-week periods. One 2-week period was a day shift from 7am-7pm, and the other 2-week period was a night float shift from 7pm-7am. Our three study outcomes were sleep duration, mean reciprocal reaction time, and the number of attentional lapses, where the first measures sleep and the latter two measure alertness.


Sleep is measured continuously throughout the length of each rotation by an actigraph (a wrist-worn, watch-like accelerometer that uses activity counts and ambient light to classify 1-minute epochs into sleep, wake, or missing periods). We expected missing data because participants were instructed to take off the actiwatch during impact sports, swimming or bathing, and relevant medical procedures. Sleep periods from sleep logs were used to impute missing actigraphy data at the minute-to-minute level.

Alertness outcomes were measured through a validated Psychomotor Vigilance Test (PVT). Mean reciprocal reaction time (MRRT) captures response speed, and is based on the measured reaction time to stimuli presented at random inter-stimulus intervals (2-5 seconds). Response speed is a reciprocal transform of reaction time, where higher values indicate better performance. Attentional lapses are defined as the number of reaction times greater than or equal to 355ms in a 3-min PVT assessment.
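To make the two alertness outcomes concrete, here is a minimal sketch of how they could be computed from one PVT session. The function name `pvt_summary` and the input format (a list of reaction times in milliseconds) are my own assumptions for illustration; the actual PVT software also handles false starts and other edge cases that are ignored here.

```python
def pvt_summary(reaction_times_ms, lapse_threshold_ms=355):
    """Summarize one PVT session (hypothetical helper, not the study's software).

    Returns (mean reciprocal reaction time in 1/sec, number of lapses).
    """
    # Response speed is the reciprocal of reaction time; 1000/ms gives 1/sec.
    speeds = [1000.0 / rt for rt in reaction_times_ms]
    mrrt = sum(speeds) / len(speeds)
    # A lapse is any reaction time at or above the threshold.
    lapses = sum(1 for rt in reaction_times_ms if rt >= lapse_threshold_ms)
    return mrrt, lapses


mrrt, lapses = pvt_summary([250, 500])
print(mrrt, lapses)
```

Because MRRT averages the reciprocals rather than the raw times, a single very slow response pulls the summary down less than it would pull a mean reaction time up.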

Our hypothesized relationship was that night-shifts may have an effect independent of sleep, but at least partly mediated by sleep. All other variables we adjusted for had no bearing on current shift but could have some relation to both of our outcomes.

Statistical choices

Our primary endpoint was studied through a generalized linear mixed model to account for the repeated measurement structure of the data across all outcomes. Assuming the errors of our continuous outcome variables follow a \(N(0,\sigma^2)\) distribution, we model the mean of these outcomes through the following model: \[ g(E(Y_i|b))=g(\mu_i^b)=X_i^T\beta+Z_i^Tb \] where we model the subject-specific mean conditional on random effects \(b\), \(g()\) is the link function, \(X_i\) is the subject-specific design matrix for the fixed effects, and \(Z_i\) is the subject-specific design matrix for the random effects \(b\), where \(b_i\sim N(0,D(\theta))\).

Exact maximum likelihood inference is feasible when \(g()\) is the identity link (as for sleep duration and response speed) but not when \(g()\) is not the identity (i.e. for attentional lapses). We assume this count has data-generating distribution \(Poisson(\lambda)\), where \(g()\) is the log link function, so the integral over the random effects in its likelihood must be approximated. For this outcome, parameters \((\beta,\theta)\) are estimated through Gauss-Hermite quadrature, where we numerically evaluate the integral in \(L(\beta,\theta)\) using a weighted sum over a set of predefined quadrature points: \[ L(\beta,\theta)=e^{l(\beta,\theta)}\propto|D|^{-\frac12}\int_{-\infty}^\infty \exp \bigg\{\sum_{i=1}^ml_i(Y_i|b,\beta)-\frac12b^TD(\theta)^{-1}b\bigg\}db \]

All analyses utilized shift type (day or night) as the primary exposure. Our sleep analysis included precision variables age, sex, and an indicator of whether the intern was on their day-off, while our alertness analyses further included indicators of whether interns consumed caffeine in the last 24 hours, felt fatigued, or were distracted during their PVT.
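For intuition on the quadrature step, here is a small sketch of Gauss-Hermite integration for the simplest case: one subject, a Poisson outcome with log link, and a single random intercept with standard deviation `sigma`. The function names, the fixed 5-point rule, and the scalar `sigma` (standing in for \(D(\theta)\)) are assumptions for illustration only. The thesis models were fit with standard mixed-model software, not this code.

```python
import math

# 5-point Gauss-Hermite rule (physicists' convention, weight exp(-x^2)).
GH_NODES = [-2.0201828705, -0.9585724646, 0.0, 0.9585724646, 2.0201828705]
GH_WEIGHTS = [0.0199532421, 0.3936193232, 0.9453087205,
              0.3936193232, 0.0199532421]


def poisson_logpmf(y, log_mu):
    """Log of the Poisson pmf, parameterized by the log of the mean."""
    return y * log_mu - math.exp(log_mu) - math.lgamma(y + 1)


def subject_marginal_likelihood(counts, linear_predictors, sigma):
    """Approximate the integral of prod_j Poisson(y_j | exp(eta_j + b))
    against b ~ N(0, sigma^2) with a weighted sum over quadrature points."""
    total = 0.0
    for x, w in zip(GH_NODES, GH_WEIGHTS):
        b = math.sqrt(2.0) * sigma * x  # change of variables to N(0, sigma^2)
        loglik = sum(poisson_logpmf(y, eta + b)
                     for y, eta in zip(counts, linear_predictors))
        total += w * math.exp(loglik)
    return total / math.sqrt(math.pi)


# One hypothetical intern with three daily lapse counts and eta = X^T beta = 0.5.
print(subject_marginal_likelihood([2, 0, 3], [0.5, 0.5, 0.5], sigma=0.4))
```

The full likelihood would multiply these subject-level terms together and be maximized over \((\beta,\theta)\). More quadrature points (or an adaptive rule) give a better approximation at higher cost.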


Interns slept a statistically insignificant 2.4 minutes less (95% CI: 2.4±16.8min) on night shifts relative to day shifts. In contrast, a statistically significant change in attention-related performance was observed, with a decreased response speed of 0.13 1/sec (95% CI: -0.13±0.10 1/sec) and a multiplicative increase in the number of attentional lapses of 1.7 (95% CI: 0.49±0.21).

Study population

A total of 62 medical interns were contacted, of whom 13 declined, leaving 49 interns. Defining noncompliance as a shift where less than 80% of Actiwatch data was collected, we further removed three individuals and one shift from a fourth individual, leaving us with 46 interns who completed 68 total shifts for an aggregate 848 days of observation. 22 interns completed day and night shifts, 17 only day shifts, and 7 only night shifts.

All groups were relatively homogeneous in their distribution of age, sex, and length of rotation, barring some small differences in sleep length during work days and off-days. One notable difference was that those who completed only night shifts had more than half of available sleep data missing or coming from a sleep log, suggesting a relatively poorer measure of sleep duration for this group.


Cut-points for analyses involving sleep were identified by determining the exact points where the proportion of missing actigraphy data dropped below and rose above 20% in the minute-level plots. With a brief exception on day 8, missingness remained below 20% between cut-point 1 (7:26pm on day 1) and cut-point 2 (2:24pm on day 12). Coincidentally, these are around the times devices were issued and collected. All analyses were restricted to days 2-11 to permit all subjects the opportunity to collect sleep duration through both actigraphy and sleep logs.
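The cut-point search can be sketched in a few lines. This assumes the minute-level missingness has already been aggregated into a list of proportions (one entry per study minute, across interns). The function name and the choice to ignore brief excursions above the threshold, like the one on day 8, are my own simplifications.

```python
def missingness_cut_points(missing_proportion, threshold=0.20):
    """Return (first, last) indices where missingness is below the threshold.

    Hypothetical helper: brief excursions above the threshold between the
    two cut-points are deliberately ignored.
    """
    below = [i for i, p in enumerate(missing_proportion) if p < threshold]
    if not below:
        return None  # missingness never drops below the threshold
    return below[0], below[-1]


# Toy series: devices handed out, worn, briefly removed, then collected.
toy = [0.9, 0.5, 0.1, 0.05, 0.25, 0.1, 0.3, 0.8]
print(missingness_cut_points(toy))
```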

Aggregating the sleep-wake data at the minute-level revealed a cyclical pattern that was consistent throughout the length of the rotation, where nearly all subjects were awake at 7am and 7pm (the times at which shifts are relieved), nearly all day-shifters were sleeping slightly after 12am midnight, and nearly all night-shifters slightly after 12pm noon.

Taking some representative interns who did both shifts, we observed little difference between day-shifts and night-shifts in their subject-specific sleep trajectories. Apart from the occasional spike in sleep duration near the middle of a rotation, sleep duration was relatively constant across day-shifts. One distinction between the two shift types was that night-shifters exhibited greater variability in their sleep duration.


Our alertness analysis was similarly restricted to days 2-11, with the additional caveat that subjects could not perform a PVT on a day-off since these measurements must be obtained in the hospital. Although the average response speed of night-shifters was slightly worse (-0.5 1/sec) than that of day-shifters, there existed a handful of observations on day-shifts with very poor response speed (0-3 1/sec). Similarly, night-shifters experienced an additional lapse on average relative to day-shifters, but both shifts contained a number of observations where the number of attentional lapses exceeded ten.

Final results

An IPW-weighted random intercept model, adjusted for age, gender, and whether the intern was on their day-off, determined that interns on night shifts slept a statistically insignificant 2.4 minutes less (95% CI: -19.2m to 14.4m). On average, a male intern of age 28 slept 6 hours, 47 minutes during a work-day and a statistically different duration of 8 hours, 38 minutes on a day-off (95% CI: 8h,14m to 9h,2m). Apart from day-off, all other variables had small effect sizes which were not significant.

Two separate IPW-weighted random intercept models for MRRT and lapses were fit, and adjusted for age, gender, and whether the intern consumed caffeinated beverages or reported feeling sleepy or distracted. We observed a decrease in attention-related performance measured through response speed of 1.38 1/sec (95% CI: 0.43 1/sec to 2.32 1/sec) and a multiplicative increase in the number of attentional lapses of 1.7 (95% CI: 1.4 to 2.0). Statistical significance aside, the magnitude of the effect of night shift on attentional performance and lapses is modest.


Relative to day shifts, night shifts are significantly associated with reduced alertness. Evidence failed to suggest that sleep duration mediated this effect, or that sleep duration is associated with shift-type. Further sleep research may shed light on qualities of sleep not captured by duration, but instrumental in the effect of shift-type on alertness. We recommend increased vigilance of interns on night-shifts, especially during tasks which may compromise patient safety.

Our collection of objective measures for sleep and alertness and rigorous statistical analysis is motivated by the need for more detailed scientific research as to how sleep affects physical and mental performance. In general these findings agree with similar, previous studies in that unorthodox shifts are related to decreases in performance of operational tasks.

Our failure to find evidence that sleep mediates the effect of night-shift work on alertness leaves room for alternative explanations. One such thought is that it may not just be how much, but additionally how sleep is obtained on night-shifts which may influence alertness. Night-shift work enforces mandatory naps to recoup lost sleep, and so the average daily sleep cycle of night-shifters should look different than that of day-shifters. Additionally, our lack of safety outcomes leaves some unresolved ambiguity on whether decreased alertness might also endanger physician and patient safety, or whether this risk may be mitigated by supervision and increased continuity of care.

Under the 2011 ACGME rules and on the basis of our evidence, sleep may be comparable between day-shift and night-shift work, and practically speaking alertness is as well.


Mediation analysis

To determine whether sleep mediates any differences in alertness, we first estimate the effect of our exposure (shift-type) on our outcome (alertness), then we estimate the effect of our exposure on our mediator (sleep duration), then we re-estimate the effect of our exposure on our outcome, adjusted for our mediator. We conclude that sleep duration mediates this effect if the coefficient of our exposure is no longer statistically significant after adjustment. As before, all of these models are GLMMs.
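The three regressions in this procedure can be illustrated on toy data. In the sketch below, plain least squares is swapped in for the GLMMs, the data are made up, and no standard errors or significance tests are computed, so it only shows how the exposure coefficient moves once the mediator is added. All names (`ols`, `shift`, `sleep`, `speed`) are hypothetical.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical toy data: 1 = night shift, 0 = day shift.
shift = [0, 0, 0, 0, 1, 1, 1, 1]
sleep = [6.5, 7.0, 7.5, 7.0, 6.0, 7.5, 7.0, 7.3]  # hours
speed = [4.0 - 0.5 * s + 0.1 * m for s, m in zip(shift, sleep)]

# Step 1: exposure -> outcome (total effect of shift).
total_effect = ols([[1, s] for s in shift], speed)[1]
# Step 2: exposure -> mediator.
shift_on_sleep = ols([[1, s] for s in shift], sleep)[1]
# Step 3: exposure -> outcome, adjusted for the mediator (direct effect).
_, direct_effect, sleep_effect = ols(
    [[1, s, m] for s, m in zip(shift, sleep)], speed)

# Relative change in the shift coefficient after adjusting for sleep.
pct_change = 100 * (direct_effect - total_effect) / total_effect
print(total_effect, shift_on_sleep, direct_effect, pct_change)
```

In this toy example the shift coefficient barely moves after adjustment, which is the pattern one would expect when the mediator carries little of the effect.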

Our previous pair of models have already demonstrated an association between our exposure (shift-type) and outcome (alertness). Our next model in this procedure associates sleep duration (potential mediator) to alertness in the absence of shift-type. In this second step, results between the two outcomes were discordant, with sleep duration having a statistically insignificant effect of small magnitude on response speed (-30 1/sec, 95% CI: -10 1/sec to 75 1/sec), but a statistically significant effect on the count of lapses (1.2, 95% CI: 1.1 to 1.3). Our final model relating shift type to lapses in the presence of sleep duration retains all relevant and previous significance levels, although a small increase of 3.9% was noted in the parameter for shift-type. This would imply that in the presence of our mediator, the effect of shift-type on alertness is even more pronounced, which counters our mediation hypothesis.

Considering the small effect size and discordant analyses in the second step, and the change in our exposure coefficient in the opposite direction in the final step, it is doubtful that sleep duration mediates the effect under study. We conclude our exploration of whether sleep may mediate the relation between shift-type and alertness on the basis of insufficient evidence.

Missing data

Inverse probability weighting (IPW) is commonly used to correct for the bias induced by complete case analysis in the presence of informative missing data. Essentially, the idea is to weight each observed individual by the inverse of their probability of being observed, so that individuals who are unlikely to be observed are upweighted by a larger magnitude, and those who are more common by a smaller magnitude.

Weighting each observation by the inverse probability of being a complete case requires a missingness model. Let \(Y\) represent the complete data matrix, \(Y_{obs}\) and \(Y_{mis}\) the observed and missing cases of \(Y\) respectively, \(M\) our missing data indicator, and \(\phi\) the collection of unknown parameters completely specifying the missingness model. Sufficient predictors are included in the missingness model's covariates \(Z\) such that our assumption that the data is missing at random (MAR) is feasible: \[ f(M|Y,\phi)=f(M|Y_{obs},\phi) \ \ \forall \ Y_{mis},\phi \]

In the longitudinal setting, provided that the conditional mean model and the missingness model are correctly specified, we are guaranteed that the estimate \(\hat\beta\) is consistent and asymptotically normal. A random-intercept, generalized linear mixed model with a logistic link was used for the missingness model, where the predictors were age, shift type, sex, and indicators of whether the intern was on a day-off, or on their first or last study day.
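A toy example shows why the weighting helps. Suppose, hypothetically, that ten intern-days exist in two equally sized groups with outcomes 10 and 20, but the second group is observed only 40% of the time versus 80% for the first. The numbers and function names below are invented, and the probabilities are taken as known rather than estimated from a missingness model.

```python
def ipw_weights(p_observed):
    """Inverse probability weights from predicted probabilities of being observed."""
    return [1.0 / p for p in p_observed]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Observed cases only: 4 of 5 from the first group, 2 of 5 from the second.
values = [10, 10, 10, 10, 20, 20]
p_obs = [0.8, 0.8, 0.8, 0.8, 0.4, 0.4]

complete_case_mean = sum(values) / len(values)
ipw_mean = weighted_mean(values, ipw_weights(p_obs))
print(complete_case_mean, ipw_mean)
```

The complete-case mean is pulled toward the over-represented group, while the weighted mean recovers the full-data mean of 15 in this idealized setting.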

IPW is limited by one key drawback: large weights (i.e. small predicted probabilities) may negatively affect inference, potentially yielding parameter estimates with inordinately large variances. One adjustment shown to improve the accuracy and precision of parameter estimates under a misspecified logistic regression model is weight trimming, which we apply. To assess the impact of large weights, we vary our truncation quantiles and assess their effect on the weights and final parameter estimates.
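Weight trimming itself is simple to sketch: cap every weight at a chosen upper quantile of the weight distribution. The function name, the nearest-rank quantile definition, and the default of 0.95 are my own choices for illustration, not necessarily what was used in the thesis.

```python
import math


def trim_weights(weights, upper_quantile=0.95):
    """Cap weights at the given upper quantile (nearest-rank definition)."""
    ranked = sorted(weights)
    rank = max(1, math.ceil(upper_quantile * len(ranked)))
    cap = ranked[rank - 1]
    # Weights above the cap are truncated; order of observations is preserved.
    return [min(w, cap) for w in weights]


print(trim_weights([6.8, 1.0, 3.0, 1.2], upper_quantile=0.75))
```

A sensitivity analysis then amounts to refitting the weighted model for several values of `upper_quantile` and comparing the resulting estimates and standard errors.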

Missingness occurred in both outcomes and covariates. Of the days considered in our analysis, 60 of 678 (8.8%) observations were missing for sleep duration, 205 of 602 (34.0%) for response speed and lapses, 206 of 602 (34.2%) for the caffeine indicator, and 392 of 602 (65%) for the fatigued and distracted indicators. Inverse probability weights were obtained through two separate logistic random-intercept models regressed on the sleep and alertness outcomes respectively (see table 3.3). Both models adjusted for age, gender, shift-type, and the sleep model additionally adjusted for whether the subject was on their day off.

Diagnostics revealed that our estimated weights were concentrated around 1 for our sleep outcome, with a maximum weight of 5.2, whereas our alertness outcome had 71 weights larger than 3, the largest being 6.8. Although a weight of 3 is not terribly influential, we still assessed how the magnitude of these weights affected our final inference. To examine the sensitivity of our outcomes to the magnitude of our weights, we performed weight trimming at various truncation percentiles. We found that estimates and standard errors were insensitive to the presence of large weights, most likely because in including the fatigued and distracted indicators, some of the larger weights were excluded from our final analysis.
