Experiment 1 Results

Data

We start by loading participant-level data containing estimates of psychometric function (PF) parameters for each participant and visualization condition. These parameter estimates were computed in Matlab using a combination of custom analysis scripts and a library of PF fitting functions from Geoffrey Boynton. PF fitting code is available in our repo. Custom scripts are available upon request but are not included in supplemental materials because they contain non-anonymized MTurk WorkerIDs.

A power analysis based on pilot data suggested we would need 50 participants to detect within-subjects differences in just-noticeable differences (JNDs) for HOPs and error bars with 80% power. Following our preregistered analysis plan, we iteratively collected data and excluded PF fits based on poor fit quality and poor performance. Overall, we recruited 62 participants. Six of these participants were excluded per our preregistered exclusion criteria, but because we accidentally over-recruited, this still left us with 56 participants, six more than our intended sample size. In order to use our intended sample size for statistical inference, we eliminated six participants at random while maintaining counterbalancing of the starting visualization condition within our final sample of 50 participants. Data for the sample of 50 participants used for the statistical inferences presented in the paper are in the files “E1-AnonymousStats-InferenceSample.csv” and “E1-AnonymousRawData-InferenceSample.csv”. Data for all 56 recruited participants passing our preregistered exclusion criteria are provided in the files “E1-AnonymousStats-FullSample.csv” and “E1-AnonymousRawData-FullSample.csv”.

We’ll focus mostly on the estimates of PF parameters in the file “E1-AnonymousStats-InferenceSample.csv” in order to reproduce the analyses presented in the paper.

statsDf = read.csv("E1-AnonymousStats-InferenceSample.csv")

The variables in this dataset are as follows.

  1. Subject: MTurk workerIDs
  • These are anonymized identifiers (not actual worker IDs) in order to maintain privacy.
  • Each participant has two rows in the data frame; there are 50 participants.
  2. Visualization: the visualization condition under which data were collected
  • Coding: c = error bars; h = HOPs
  • Each participant completed two blocks of trials, one for each visualization condition (within-subjects).
  3. StartCond: the visualization condition on which a worker started
  • Coding of conditions is identical to the Visualization variable.
  • Starting condition was counterbalanced across participants (between-subjects).
  4. Threshold: the JND fit to each observer’s data under each visualization condition
  • JNDs are in units of the absolute value of the log likelihood ratio that a stimulus was produced by the no growth vs the growth trend.
  • The JND measures the level of evidence at which the participant is expected to answer with their mean accuracy.
  • The JND is the point on the x-axis which corresponds to the mean value of the psychometric function (PF) on the y-axis.
  5. Spread: the standard deviation of the psychometric function (PF) fit to an observer’s data under each visualization condition
  • The Spread parameter of the PF shares the same units as the JND.
  • This is a measure of the width of the PF.
  • This parameter estimate is inversely proportional to the slope of the PF at its inflection point (aka the JND).
  • PF spread represents the noise in the observer’s perception of the evidence presented in a stimulus.
  6. ConfidenceFitness: a mixing parameter describing the degree to which reported confidence values are predicted by a statistical formulation of confidence vs randomly sampled confidence values
  • Units range from 0 (totally random confidence reporting) to 1 (confidence reporting is in sync with statistical confidence).
  7. CompletionTime: the number of milliseconds the participant spent completing the trials used to fit each psychometric function
  • This is the entire time participants had the webpage open between the beginning of the task and their answer on the last trial, so this should not be considered a controlled measure of time spent attending to the task. This time does not include time spent reading the instructions.
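As a quick sanity check (not part of the original analysis code), we can confirm that the data frame matches this description, i.e., two rows per participant across 50 participants with counterbalanced starting conditions.

# confirm the structure described above
str(statsDf)
table(table(statsDf$Subject))                    # expect: 50 participants, each with 2 rows
table(statsDf$Visualization, statsDf$StartCond)  # cross-tab of rows by condition (expect 25 per cell)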

We also load in the raw trial-level response data for reference in our analysis.

rawDf = read.csv("E1-AnonymousRawData-InferenceSample.csv")

Linear Models

We use mixed effects linear models for statistical inference. In specifying the linear model for each parameter estimate (columns 4-6 in our stats data frame), we use ANOVA to test whether including an interaction between visualization condition and starting condition significantly improves model fit. Thus, we use ANOVA to select the most parsimonious linear model supported by our data for each outcome measure. These details can be found in our preregistered analysis plan.

These are linear models for each of our three parameter estimates: thresholds (aka JNDs), spreads, and confidence fitness.

# load lmerTest, which provides lmer() with Satterthwaite approximations for p-values
library(lmerTest)

# linear models for each outcome variable
tMdl1 <- lmer(Threshold ~ Visualization + StartCond + (1|Subject), data = statsDf)
tMdl2 <- lmer(Threshold ~ Visualization + StartCond + (1|Subject) + Visualization:StartCond, data = statsDf)
sMdl1 <- lmer(Spread ~ Visualization + StartCond + (1|Subject), data = statsDf)
sMdl2 <- lmer(Spread ~ Visualization + StartCond + (1|Subject) + Visualization:StartCond, data = statsDf)
cMdl1 <- lmer(ConfidenceFitness ~ Visualization + StartCond + (1|Subject), data = statsDf)
cMdl2 <- lmer(ConfidenceFitness ~ Visualization + StartCond + (1|Subject) + Visualization:StartCond, data = statsDf)

Running ANOVA on our models with and without an interaction term for Visualization:StartCond indicates that we should include the interaction term for modeling PF spreads but not for modeling JNDs and confidence fitness. This is how we chose the models we present in the paper.

# find most parsimonious model of each pair
anova(tMdl1, tMdl2) # threshold: use mdl1
anova(sMdl1, sMdl2) # spread: use mdl2
anova(cMdl1, cMdl2) # confidence fitness: use mdl1

Results Per Measure

JNDs

A summary of our linear model on JND estimates.

summary(tMdl1)
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: Threshold ~ Visualization + StartCond + (1 | Subject)
##    Data: statsDf
## 
## REML criterion at convergence: 359.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1831 -0.4666 -0.1446  0.4836  3.5551 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 0.2835   0.5324  
##  Residual             1.8697   1.3674  
## Number of obs: 100, groups:  Subject, 50
## 
## Fixed effects:
##                Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)      3.2394     0.2597 80.3200  12.475   <2e-16 ***
## Visualizationh  -0.6810     0.2735 49.0000  -2.490   0.0162 *  
## StartCondh      -0.3924     0.3122 48.0000  -1.257   0.2149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Vslztn
## Visualiztnh -0.527       
## StartCondh  -0.601  0.000
confint(tMdl1)
## Computing profile confidence intervals ...
## Warning in optwrap(optimizer, par = thopt, fn = mkdevfun(rho, 0L), lower
## = fitted@lower): convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q
##                     2.5 %     97.5 %
## .sig01          0.0000000  0.9634991
## .sigma          1.1261515  1.6445279
## (Intercept)     2.7334309  3.7452754
## Visualizationh -1.1910196 -0.1400816
## StartCondh     -0.9032509  0.1192558

Let’s take a look at our regression coefficients for JND estimates.

We see that there is a reliable effect of visualization condition on JNDs, such that JNDs are lower on average when participants use HOPs. This suggests that users can correctly judge more ambiguous stimuli when using HOPs than when using error bars.

Just out of curiosity, what happens to this effect if we include the interaction between visualization condition and starting condition in the model? Here, we check the robustness of the effect of visualization on JNDs to decisions about model specification.
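As a minimal sketch of this comparison (the code used to generate it in the original document is not shown here), we can compare the fixed effect of visualization from the two threshold models fit above.

# estimated effect of visualization on JNDs, without vs. with the interaction term
summary(tMdl1)$coefficients["Visualizationh", ]
summary(tMdl2)$coefficients["Visualizationh", ]
# Wald intervals for the interaction model's fixed effects
confint(tMdl2, method = "Wald")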

We can see that the point estimate of the effect of visualization on JNDs remains relatively stable, but adding the interaction term to the model increases the variability in the estimate. We report the model without the interaction term in the paper not because it yields a statistically significant effect for visualization but because we selected this model by following the procedure in our preregistration. Our analysis plan and model selection procedures were submitted to OSF prior to data collection, so this modeling decision was made a priori. We acknowledge that the same decision would rightly be considered p-hacking if we had chosen the model after seeing the results of the inference.

In order to better understand the impact of visualization condition on JNDs, it is helpful to see the raw data.
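The paper’s figure is produced by code not included here; a minimal ggplot2 sketch of the within-subject JNDs (assuming ggplot2 is installed) might look like this.

library(ggplot2)
# one line per participant connecting their JND under error bars (c) and HOPs (h)
ggplot(statsDf, aes(x = Visualization, y = Threshold, group = Subject)) +
  geom_line(alpha = 0.4) +
  geom_point(alpha = 0.6) +
  labs(x = "Visualization condition (c = error bars, h = HOPs)", y = "JND")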

It is noteworthy that this effect is driven by 12% of observers (6 of 50) who performed much worse in the error bars condition than in the HOPs condition. In light of these data, it seems that our effect of visualization on JNDs is best characterized as a difference in the consistency with which observers can use these uncertainty visualizations to do the task rather than a difference in performance among all observers.

Reviewers asked whether this subgroup of participants with poor performance (larger JNDs) is accounted for by the time spent completing the task. To check this, we compare the model of JNDs presented in the paper to a similar model including the time spent to complete the trials used to fit each PF as a predictor.

# convert completion time from milliseconds to minutes
statsDf$CompletionTime <- statsDf$CompletionTime / 1000 / 60
# specify the model with completion time as a predictor
tMdl3 <- lmer(Threshold ~ Visualization + StartCond + CompletionTime + (1|Subject), data = statsDf)
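A quick way to inspect this (a sketch, not the original comparison code) is to print the fixed effects from both models side by side.

# fixed effects without vs. with completion time as a predictor
summary(tMdl1)$coefficients
summary(tMdl3)$coefficients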

We can see that adding completion time to the model doesn’t impact the model coefficients.

Spreads

A summary of our linear model on PF spread estimates.

summary(sMdl2)
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: 
## Spread ~ Visualization + StartCond + (1 | Subject) + Visualization:StartCond
##    Data: statsDf
## 
## REML criterion at convergence: 379.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.2554 -0.5170 -0.1146  0.2844  4.6464 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 0.7009   0.8372  
##  Residual             2.0528   1.4327  
## Number of obs: 100, groups:  Subject, 50
## 
## Fixed effects:
##                           Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)                 2.2859     0.3319 90.1600   6.887  7.4e-10 ***
## Visualizationh             -0.4075     0.4052 48.0000  -1.006   0.3197    
## StartCondh                 -0.9970     0.4694 90.1600  -2.124   0.0364 *  
## Visualizationh:StartCondh   1.4581     0.5731 48.0000   2.544   0.0142 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Vslztn StrtCn
## Visualiztnh -0.611              
## StartCondh  -0.707  0.432       
## Vslztnh:StC  0.432 -0.707 -0.611
confint(sMdl2)
## Computing profile confidence intervals ...
## Warning in optwrap(optimizer, par = start, fn = function(x)
## dd(mkpar(npar1, : convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q
##                                2.5 %      97.5 %
## .sig01                     0.0000000  1.26206734
## .sigma                     1.1678878  1.73121023
## (Intercept)                1.6419189  2.92978395
## Visualizationh            -1.2008845  0.38591589
## StartCondh                -1.9076172 -0.08630107
## Visualizationh:StartCondh  0.3360298  2.58010444

Again, we look at regression coefficients for our mixed effects linear model. This time we are modeling PF spread, which measures the noise in the participant’s perception of evidence in the task.

When we look at the model coefficients for PF spreads, we can see a couple of noteworthy effects. First, starting in the HOPs condition seems to make users more sensitive to evidence in the task. Perhaps participants are learning a mental representation for the task in the first block and paying less attention to the uncertainty visualizations thereafter. On this interpretation, HOPs may help participants learn what to expect from sampling error in individual samples more than error bars do.

Next, we take a closer look at the interaction between visualization condition and starting condition for the spread parameter estimates. In the plot below, each subplot contains within-subjects shifts in PF spread between the two blocks of the experiment, where participants are grouped based on the visualization condition in which they started the task.
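A minimal ggplot2 sketch approximating that plot (the original figure code is not included here) might be:

library(ggplot2)
# within-subject PF spreads under each visualization condition, faceted by starting condition
ggplot(statsDf, aes(x = Visualization, y = Spread, group = Subject)) +
  geom_line(alpha = 0.4) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ StartCond) +
  labs(x = "Visualization condition (c = error bars, h = HOPs)", y = "PF spread")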

Here, we can see that on average the condition that a participant starts in is the one in which their spread parameter estimate is larger. Thus, we might cautiously conclude that the noise in the perception of evidence decreases as participants become more practiced at the task. However, we cannot resolve from our data whether this interaction effect is due to practice or learning.

Confidence Fitness

A summary of our linear model on confidence fitness estimates.

summary(cMdl1)
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: ConfidenceFitness ~ Visualization + StartCond + (1 | Subject)
##    Data: statsDf
## 
## REML criterion at convergence: 71.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.1317 -0.8698 -0.6272  0.8505  2.0567 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. 
##  Subject  (Intercept) 4.258e-18 2.063e-09
##  Residual             1.094e-01 3.308e-01
## Number of obs: 100, groups:  Subject, 50
## 
## Fixed effects:
##                Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)     0.31739    0.05730 97.00000   5.539  2.6e-07 ***
## Visualizationh -0.08571    0.06617 97.00000  -1.295    0.198    
## StartCondh      0.05701    0.06617 97.00000   0.862    0.391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Vslztn
## Visualiztnh -0.577       
## StartCondh  -0.577  0.000
confint(cMdl1)
## Computing profile confidence intervals ...
## Warning in zetafun(np, ns): slightly lower deviances (diff=-7.10543e-15)
## detected
## Warning in nextpar(mat, cc, i, delta, lowcut, upcut): unexpected decrease
## in profile: using minstep
## Warning in zetafun(np, ns): slightly lower deviances (diff=-2.13163e-14)
## detected
## Warning in nextpar(mat, cc, i, delta, lowcut, upcut): Last two rows have
## identical or NA .zeta values: using minstep
## Warning in zetafun(np, ns): slightly lower deviances (diff=-7.10543e-15)
## detected
## Warning in FUN(X[[i]], ...): non-monotonic profile for .sig01
## Warning in optwrap(optimizer, par = start, fn = function(x)
## dd(mkpar(npar1, : convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q

## Warning in optwrap(optimizer, par = start, fn = function(x)
## dd(mkpar(npar1, : convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q

## Warning in optwrap(optimizer, par = start, fn = function(x)
## dd(mkpar(npar1, : convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q

## Warning in optwrap(optimizer, par = start, fn = function(x)
## dd(mkpar(npar1, : convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q

## Warning in optwrap(optimizer, par = start, fn = function(x)
## dd(mkpar(npar1, : convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q

## Warning in optwrap(optimizer, par = start, fn = function(x)
## dd(mkpar(npar1, : convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q
## Warning in optwrap(optimizer, par = thopt, fn = mkdevfun(rho, 0L), lower
## = fitted@lower): convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q

## Warning in optwrap(optimizer, par = thopt, fn = mkdevfun(rho, 0L), lower
## = fitted@lower): convergence code 3 from bobyqa: bobyqa -- a trust region
## step failed to reduce q
## Warning in confint.thpr(pp, level = level, zeta = zeta): bad spline fit
## for .sig01: falling back to linear interpolation
##                      2.5 %     97.5 %
## .sig01          0.00000000 0.13159734
## .sigma          0.28540097 0.37678220
## (Intercept)     0.20571214 0.42907476
## Visualizationh -0.21466448 0.04325245
## StartCondh     -0.07194954 0.18596739

Again, we visualize the linear model coefficients, this time for confidence fitness estimates. Confidence fitness is a mixing parameter from 0 to 1 describing the degree to which reported confidence corresponds to a statistical formulation of confidence.

On average, neither visualization condition nor starting condition seems to reliably impact confidence fitness.

Just out of curiosity, what happens to the results of this analysis if we include the interaction between visualization condition and starting condition in the model? Here’s a comparison of the two possible models to show the impact of our decision procedure for model specification.
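Since both models were already fit above, a minimal way to make this comparison (a sketch; the original comparison code is not shown) is:

# fixed effects without vs. with the Visualization:StartCond interaction
summary(cMdl1)$coefficients
summary(cMdl2)$coefficients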

Adding the interaction term strengthens the effect of visualization condition. This appears to be evidence of a masking relationship, whereby visualization and the interaction between visualization and starting condition are both associated with confidence fitness but in opposite directions. These opposite effects cancel each other out when only visualization is included in the model. If we were to interpret the result of the model with the interaction term, we would say that on average participants were more random in their reporting of confidence when using HOPs, and their confidence reporting was closer to a statistical formulation of confidence when using error bars. However, as with our modeling of JNDs, our preregistered procedure for choosing model specifications selected the model without the interaction term. Thus, we do not report on the model with the interaction term in the paper.

In order to better understand our confidence data, we conducted an exploratory data analysis on reported confidence. We used a mixed effects linear model on trial-level response data to estimate reported confidence as a function of the fixed effects of visualization, starting condition, and their interaction as well as fixed effects of stimulus intensity, whether or not a participant’s answer was correct, and their interaction. We also model a random effect of participant.

# Log likelihood ratio (Ratio) is stored in the raw data with signs (negative vs positive) indicating the data-generating model for the stimulus, where positive log ratios indicate no growth and negative log ratios indicate a growth trend.
# We need to take the absolute value of this log likelihood ratio in order to model confidence as a function of evidence regardless of the data generating model, as we do in the paper.
rawDf$Evidence <- abs(rawDf$Ratio)
# create the linear model and print a summary
rawConfMdl <- lmer(Confidence ~ Evidence * Correct + Visualization * StartCond + (1|WorkerID), data = rawDf)
summary(rawConfMdl)
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: Confidence ~ Evidence * Correct + Visualization * StartCond +  
##     (1 | WorkerID)
##    Data: rawDf
## 
## REML criterion at convergence: 46333.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.6254 -0.6153  0.1295  0.7062  3.2356 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  WorkerID (Intercept)  66.86    8.177  
##  Residual             127.88   11.308  
## Number of obs: 6000, groups:  WorkerID, 50
## 
## Fixed effects:
##                            Estimate Std. Error        df t value Pr(>|t|)
## (Intercept)                 69.5573     1.8149   70.0000  38.325  < 2e-16
## Evidence                     0.1001     0.1783 5953.0000   0.562   0.5743
## CorrectTRUE                  1.9252     0.8136 5946.0000   2.366   0.0180
## Visualizationh               1.6167     0.4151 5945.0000   3.895 9.92e-05
## StartCondh                   0.7909     2.3497   50.0000   0.337   0.7379
## Evidence:CorrectTRUE         1.7266     0.1866 5948.0000   9.254  < 2e-16
## Visualizationh:StartCondh   -1.3314     0.5850 5945.0000  -2.276   0.0229
##                              
## (Intercept)               ***
## Evidence                     
## CorrectTRUE               *  
## Visualizationh            ***
## StartCondh                   
## Evidence:CorrectTRUE      ***
## Visualizationh:StartCondh *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Evidnc CrTRUE Vslztn StrtCn E:CTRU
## Evidence    -0.347                                   
## CorrectTRUE -0.355  0.718                            
## Visualiztnh -0.134  0.071  0.009                     
## StartCondh  -0.651  0.015  0.002  0.089              
## Evdnc:CTRUE  0.321 -0.937 -0.828 -0.042 -0.009       
## Vslztnh:StC  0.095 -0.048 -0.014 -0.708 -0.125  0.034
confint(rawConfMdl)
## Computing profile confidence intervals ...
##                                2.5 %     97.5 %
## .sig01                     6.6370976  9.9055379
## .sigma                    11.1034222 11.5096835
## (Intercept)               66.0121753 73.1001646
## Evidence                  -0.2498475  0.4488439
## CorrectTRUE                0.3311432  3.5196559
## Visualizationh             0.8032014  2.4297645
## StartCondh                -3.8078443  5.3900402
## Evidence:CorrectTRUE       1.3613090  2.0925013
## Visualizationh:StartCondh -2.4774246 -0.1849921

We see main effects of correctness and visualization condition as well as significant interactions between evidence and correctness and between visualization and starting condition. The boost in reported confidence on correct trials and the interaction between stimulus intensity and correctness were expected based on the findings of Sanders et al. (2016), who created the confidence fitness model. Confidence increases with stimulus intensity on trials where the participant was correct, but decreases with increasing stimulus intensity on trials where the participant was wrong. This can be appreciated by looking at our raw confidence data based on stimulus intensity and correctness, although the visualization is crowded.
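A minimal ggplot2 sketch of that raw confidence plot (using the Evidence column computed above; not the original figure code) could be:

library(ggplot2)
# reported confidence by level of evidence, split by whether the response was correct
ggplot(rawDf, aes(x = Evidence, y = Confidence, color = Correct)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Evidence (|log likelihood ratio|)", y = "Reported confidence")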

Sanders et al. (2016) found that the expected confidence generated by their model predicted this interaction. In other words, this behavior actually comports with the statistical formulation of confidence used in our model. Let’s check for this predictive behavior in the expected confidence estimations from our implementation of the confidence fitness model.

Note that reported confidence (dots) often covers a wider range of the y-axis than the expected confidence (lines) predicted by our model. We’ve traced the origin of this difference in variability to the Monte Carlo simulation. For subjects with narrow PFs, the amount of noise added to evidence on each trial to generate simulated percepts is small. This means that the simulated observer only gets trials wrong where the evidence is really close to 0 (indicating that the stimulus conveys minimal information to disambiguate the underlying trend). A consequence of the simulated observer perceiving most stimuli correctly is that the model predicts values of confidence \[Pr(correct \mid perceivedEvidence)\] which are constant and high across most values of perceived evidence. In other words, low noise in simulated percepts leads to low variability in predicted confidence. This might explain the lack of good predictive behavior for subjects with small PF spreads.
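The following toy simulation (our illustration here, not the actual confidence fitness implementation) shows why small perceptual noise yields nearly constant, high predicted confidence.

# toy illustration: with a small noise sd, the simulated observer errs only when evidence is near 0,
# so the empirical Pr(correct | perceived evidence) is high and nearly flat across percepts
set.seed(1)
sigma <- 0.25                                 # hypothetical small PF spread
evidence <- runif(1e5, -5, 5)                 # signed evidence on each simulated trial
percept <- evidence + rnorm(1e5, 0, sigma)    # noisy simulated percept
correct <- sign(percept) == sign(evidence)    # decision based on the sign of the percept
tapply(correct, cut(abs(percept), c(0, 0.5, 1, 2, 5.5)), mean)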

See the file “JobsReport_ConfidenceFitness_Supplement.Rmd” for a detailed explanation of the confidence fitness algorithm and additional remarks on the model’s strengths and limitations.

Next, we examine the main effect of visualization on confidence reporting and the interaction between visualization and start condition. It is important to acknowledge that these are small effects on average, no greater than 2 units on our confidence scale, which ranges from 50 to 100. Nonetheless, let’s visualize our confidence reporting data and try to see this effect.
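As a simple numeric check alongside that visualization (not the original figure code), mean reported confidence by condition can be tabulated:

# mean reported confidence by visualization condition and starting condition
aggregate(Confidence ~ Visualization + StartCond, data = rawDf, FUN = mean)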

Participants are more confident on average in the HOPs condition regardless of starting condition. However, as the interaction effect indicates, participants are most confident on average when they start in the error bars condition and move to the HOPs condition in the second block of trials. It seems that the small average shifts in confidence reporting we observe in our exploratory analysis are probably not practically significant from a visualization design perspective.

Our confidence fitness analysis shows that the quality of confidence reporting is not very consistent within observers or between visualization conditions. Below is a plot of estimated confidence fitness within observers, across visualization conditions. Each point represents confidence fitness estimates in a single observer on each visualization condition. The distance from y = x indicates the inconsistency of confidence reporting within each individual. The wide distribution among points represents the inconsistency across individuals.
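A minimal sketch of such a plot (reshaping to one row per observer; the original figure code is not included) might be:

library(ggplot2)
# one point per observer: confidence fitness under error bars (x) vs. HOPs (y)
cfWide <- reshape(statsDf[, c("Subject", "Visualization", "ConfidenceFitness")],
                  idvar = "Subject", timevar = "Visualization", direction = "wide")
ggplot(cfWide, aes(x = ConfidenceFitness.c, y = ConfidenceFitness.h)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # y = x reference line
  labs(x = "Confidence fitness (error bars)", y = "Confidence fitness (HOPs)")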

It seems that the impact of visualization condition on confidence reporting is difficult to interpret on its own. Hypothetically, if reported confidence is high but confidence fitness is low, this suggests that high confidence is not warranted based on the evidence presented and the observer’s PF. Thus, effects of visualization on reported confidence might be more meaningfully interpreted in reference to a ground truth analysis of confidence reporting such as the confidence fitness model. However, we’ve shown here that the confidence fitness model does not predict reported confidence very accurately. Future work should search for better-fitting models to establish a ground truth for confidence.

Additional Analysis of Response Bias

In our task, participants must interpret which of two data generating scenarios (‘growth’ or ‘no growth’ in the job market) is more likely to have produced a given sample of jobs numbers. Overall, are participants more likely to answer ‘growth’ than ‘no growth’, or vice versa? We can see that the frequency of each answer is approximately equal as a proportion of the number of trials, which is what we would expect given that the experiment is designed to have an equal number of trials where ‘growth’ and ‘no growth’ are the correct answer.

# proportion of 'no growth' responses in the raw data
sum(rawDf$Response=="steady") / length(rawDf$Response)
## [1] 0.5015
# proportion of 'growth' responses in the raw data
sum(rawDf$Response=="increase") / length(rawDf$Response)
## [1] 0.4985

We also want to know what makes our participants more likely to answer ‘growth’ vs ‘no growth’. To address this, we conduct an exploratory analysis of responses on each trial. We use logistic mixed effects regression to estimate response (‘growth’ or ‘no growth’) as a function of the fixed effects of visualization, starting condition, and their interaction, as well as fixed effects of log likelihood ratio (signed evidence) and whether or not a participant’s answer was correct. We also model a random effect of participant. We do not model an interaction between log likelihood ratio and correctness because the estimation procedure does not converge; some combinations of these predictors have very few observations.

# logistic regression of response bias
bMdl <- glmer( Response ~ Ratio + Correct + Visualization * StartCond + (1 | WorkerID), data = rawDf, family = binomial)
summary(bMdl)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Response ~ Ratio + Correct + Visualization * StartCond + (1 |  
##     WorkerID)
##    Data: rawDf
## 
##      AIC      BIC   logLik deviance df.resid 
##   4090.5   4137.4  -2038.3   4076.5     5993 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -14.2127  -0.3316   0.0622   0.3425  12.8859 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  WorkerID (Intercept) 0.46     0.6782  
## Number of obs: 6000, groups:  WorkerID, 50
## 
## Fixed effects:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                0.17727    0.17781    1.00   0.3188    
## Ratio                      0.46147    0.01106   41.71   <2e-16 ***
## CorrectTRUE               -0.10786    0.09973   -1.08   0.2795    
## Visualizationh             0.09860    0.11841    0.83   0.4050    
## StartCondh                 0.06590    0.22554    0.29   0.7702    
## Visualizationh:StartCondh -0.37265    0.16377   -2.28   0.0229 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Ratio  CrTRUE Vslztn StrtCn
## Ratio        0.022                            
## CorrectTRUE -0.424 -0.016                     
## Visualiztnh -0.342  0.024 -0.029              
## StartCondh  -0.642  0.005 -0.012  0.280       
## Vslztnh:StC  0.251 -0.064  0.010 -0.724 -0.378
confint(bMdl)
## Computing profile confidence intervals ...
##                                2.5 %      97.5 %
## .sig01                     0.5327622  0.87144386
## (Intercept)               -0.1742870  0.53000910
## Ratio                      0.4401726  0.48354315
## CorrectTRUE               -0.3034494  0.08758121
## Visualizationh            -0.1333946  0.33086313
## StartCondh                -0.3833541  0.51502589
## Visualizationh:StartCondh -0.6941651 -0.05203869

In order to understand this model, we need to know how variables are coded in the model:

  • Responses of ‘growth’ are coded as 0 and responses of ‘no growth’ are coded as 1. This means that positive coefficients indicate that a predictor makes a response of ‘no growth’ more likely, and negative coefficients indicate that a predictor makes a response of ‘growth’ more likely.
  • Ratio in this model is signed so that negative values indicate evidence for the ‘growth’ scenario and positive values indicate evidence for the ‘no growth’ scenario. Throughout the study, we use the absolute value of this log likelihood ratio as our metric of evidence for perceptual decision making, but here we do not take the absolute value.
  • The variables Correct, Visualization, and StartCond are coded as they are in the other models presented here.

Let’s plot the coefficients for this logistic mixed effects regression to see how these different predictors and their interactions bias responses.

We can see that there are reliable effects of log likelihood ratio (level of evidence for the data generating scenario) and the interaction between visualization condition and starting condition. Let’s try to understand these effects one at a time.

How people respond depending on the log likelihood ratio (level of evidence in the stimulus) is perhaps the easiest to understand. Here we plot response frequency against log likelihood ratio.
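A minimal ggplot2 sketch of this plot (the original figure code is not included) could be:

library(ggplot2)
# counts of each response across binned log likelihood ratios
ggplot(rawDf, aes(x = Ratio, fill = Response)) +
  geom_histogram(position = "dodge", bins = 20) +
  labs(x = "Log likelihood ratio (negative = growth, positive = no growth)", y = "Number of trials")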

We can see that participants’ responses roughly follow the pattern we would expect. Correct responses become more frequent as the log likelihood ratio moves away from 0 (the point of no evidence). As the logistic regression coefficient for log likelihood ratio indicates, participants respond ‘no growth’ more frequently as the log likelihood ratio increases, representing greater evidence that ‘no growth’ is the data-generating scenario.

Next, we try to understand the interaction between visualization condition and starting condition. Here, we plot frequencies for each response under each visualization condition, faceted by starting condition.
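A minimal ggplot2 sketch of this faceted plot (again, not the original figure code) might be:

library(ggplot2)
# response counts per visualization condition, faceted by starting condition
ggplot(rawDf, aes(x = Visualization, fill = Response)) +
  geom_bar(position = "dodge") +
  facet_wrap(~ StartCond) +
  labs(x = "Visualization condition (c = error bars, h = HOPs)", y = "Number of trials")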

We can see a couple of interesting things here. First, participants answer ‘no growth’ slightly more than ‘growth’ in their second block of trials regardless of visualization condition. Second, participants answer ‘growth’ more frequently only when using HOPs in the first block of the experiment. Since these differences are small, we doubt that they impact the findings we present in the paper.