Experiment 2 Results

Data

We start by loading participant-level data containing estimates of psychometric function (PF) parameters per participant. These parameter estimates were computed in Matlab using a combination of custom analysis scripts and a library of PF fitting functions from Geoffrey Boynton. PF fitting code is available in our repo. Custom scripts are available upon request but are not included in supplemental materials because they contain non-anonymized MTurk WorkerIDs.

A power analysis based on the data from experiment 1 suggested we would need 50 participants per visualization condition to detect between-subjects differences in just-noticeable differences (JNDs) for regular HOPs, fast HOPs, and line ensembles with 80% power. This power analysis assumes that we are trying to detect effects of a magnitude similar to the effect in experiment 1. Following our preregistered analysis plan, we iteratively collected data and excluded PF fits based on poor fit quality and poor performance. Overall, we recruited 62 participants. Six of these participants were excluded per our preregistered exclusion criteria. Data for the sample of 150 participants used for the statistical inferences presented in the paper are in the files “E2-AnonymousStats.csv” and “E2-AnonymousRawData.csv”.

We’ll focus mostly on the estimates of PF parameters in the file “E2-AnonymousStats.csv” in order to reproduce the analyses presented in the paper.

statsDf = read.csv("E2-AnonymousStats.csv")

The variables in this data set are as follows.

  1. Subject: MTurk WorkerIDs
  • These are anonymized identifiers (not actual worker IDs) in order to maintain privacy.
  • Each participant has one row in the data frame; there are 150 participants.
  2. Visualization: the visualization condition under which data were collected
  • Coding: c = line ensembles; h = HOPs (2.5 Hz; 400 ms per sample); hf = fast HOPs (10 Hz; 100 ms per sample)
  • Each participant completed two blocks of trials under one of three visualization conditions (between-subjects).
  3. StartCond: the visualization condition on which a worker started
  • Coding of conditions is identical to the Visualization variable.
  • This variable is redundant with the Visualization variable, a leftover in our analysis pipeline from experiment 1.
  4. Threshold: the JND fit to each observer’s data under each visualization condition
  • JNDs are in units of the absolute value of the log likelihood ratio that a stimulus was produced by the no growth vs the growth trend.
  • The JND measures the level of evidence at which the participant is expected to answer with their mean accuracy.
  • The JND is the point on the x-axis which corresponds to the mean value of the psychometric function (PF) on the y-axis.
  5. Spread: the standard deviation of the psychometric function (PF) fit to an observer’s data under each visualization condition
  • The Spread parameter of the PF shares the same units as the JND.
  • This is a measure of the width of the PF.
  • This parameter estimate is inversely proportional to the slope of the PF at its inflection point (i.e., the JND).
  • PF spread represents the noise in the observer’s perception of the evidence presented in a stimulus.
  6. ConfidenceFitness: a mixing parameter describing the degree to which reported confidence values are predicted by a statistical formulation of confidence vs randomly sampled confidence values
  • Units range from 0 (totally random confidence reporting) to 1 (confidence reporting is in sync with statistical confidence).
  7. CompletionTime: the number of milliseconds the participant spent completing the trials used to fit each psychometric function
  • This is the entire time participants had the webpage open between the beginning of the task and their answer on the last trial, so this should not be considered a controlled measure of time spent attending to the task. This time does not include time spent reading the instructions.
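
As a quick sanity check (a minimal sketch, assuming the column names listed above), we can confirm the sample size and the number of participants per visualization condition.

# participants per visualization condition (we expect roughly 50 per condition)
table(statsDf$Visualization)
# one row per participant; there should be 150 rows
nrow(statsDf)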

We also load in the raw trial-level response data for reference in our analysis.

rawDf = read.csv("E2-AnonymousRawData.csv")

Linear Models

We use linear models for statistical inference. Details can be found in our preregistered analysis plan.

# linear models for each outcome variable
tMdl <- lm(Threshold ~ Visualization, data = statsDf)
sMdl <- lm(Spread ~ Visualization, data = statsDf)
cMdl <- lm(ConfidenceFitness ~ Visualization, data = statsDf)
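
Because R’s default treatment coding takes the alphabetically first factor level as the reference category, the line ensembles condition (c) serves as the baseline, and the Visualizationh and Visualizationhf coefficients below represent differences from line ensembles. A quick check of this coding (assuming Visualization is read in as a character or factor column):

# confirm that line ensembles (c) is the reference level used for dummy coding
levels(factor(statsDf$Visualization))
contrasts(factor(statsDf$Visualization))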

Results Per Measure

Thresholds

A summary of our linear model on JND estimates.

summary(tMdl)
## 
## Call:
## lm(formula = Threshold ~ Visualization, data = statsDf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5703 -1.1068 -0.3573  0.2918 11.8948 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.8411     0.3429  11.202   <2e-16 ***
## Visualizationh   -1.2211     0.4849  -2.518   0.0129 *  
## Visualizationhf  -0.7358     0.4849  -1.517   0.1313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.425 on 147 degrees of freedom
## Multiple R-squared:  0.04191,    Adjusted R-squared:  0.02887 
## F-statistic: 3.215 on 2 and 147 DF,  p-value: 0.043
confint(tMdl)
##                     2.5 %     97.5 %
## (Intercept)      3.163410  4.5187047
## Visualizationh  -2.179433 -0.2627567
## Visualizationhf -1.694182  0.2224947

We plot the regression coefficients as dots with error bars. This shows estimated effect sizes (differences between the two HOPs conditions and the line ensembles condition) with 95% CIs.
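
The exact figure code is not reproduced here; the following is a minimal sketch (assuming ggplot2) of how such a coefficient plot could be built from the model estimates and 95% CIs.

library(ggplot2)
# assemble coefficient estimates and 95% CIs for the visualization terms
coefDf <- data.frame(
  term = c("HOPs - line ensembles", "fast HOPs - line ensembles"),
  estimate = coef(tMdl)[c("Visualizationh", "Visualizationhf")],
  lower = confint(tMdl)[c("Visualizationh", "Visualizationhf"), 1],
  upper = confint(tMdl)[c("Visualizationh", "Visualizationhf"), 2]
)
# dots with error bars for the estimated effects on JNDs
ggplot(coefDf, aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = NULL, y = "Difference in JND (95% CI)")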

We see smaller JNDs in the regular HOPs condition relative to the line ensembles (control) condition, but we do not see an effect of the same magnitude for fast HOPs.

Pairwise comparisons adjusted for multiple comparisons (i.e., Tukey’s HSD) confirm our interpretation of the linear model above. The only reliable difference in JNDs is between the regular HOPs and line ensembles visualization conditions. The JNDs for fast HOPs are intermediate between regular HOPs and line ensembles but are not reliably different from either.

tAov <- aov(Threshold ~ Visualization, data = statsDf)
TukeyHSD(tAov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Threshold ~ Visualization, data = statsDf)
## 
## $Visualization
##            diff        lwr         upr     p adj
## h-c  -1.2210949 -2.3692646 -0.07292527 0.0342053
## hf-c -0.7358436 -1.8840133  0.41232607 0.2857174
## hf-h  0.4852513 -0.6629183  1.63342101 0.5776187

It is noteworthy that this effect is driven by a small group of observers who performed much worse in the line ensembles and fast HOPs conditions than observers in the regular HOPs condition. In light of these data, our effect of visualization on JNDs is best characterized as a difference in how consistently observers were able to use these visualizations to do the task, rather than a uniform difference in performance across all observers.

Reviewers asked whether this subgroup of participants with poor performance (larger JNDs) in the line ensembles condition is accounted for by the time spent completing the task. To check this, we compare the model of JNDs presented in the paper to a similar model that adds the time spent completing the trials used to fit each PF as a predictor.

# convert completion time from milliseconds to minutes
statsDf$CompletionTime <- statsDf$CompletionTime / 1000 / 60
# specify the model with completion time as a predictor
tMdl2 <- lm(Threshold ~ Visualization + CompletionTime, data = statsDf)

We can see that adding completion time to the model doesn’t impact the model coefficients.
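
One way to verify this claim (a sketch; the original output is not reproduced here) is to compare the visualization coefficients from the two models directly and test whether completion time adds explanatory value.

# compare the visualization coefficients with and without completion time in the model
cbind(without = coef(tMdl)[c("Visualizationh", "Visualizationhf")],
      with = coef(tMdl2)[c("Visualizationh", "Visualizationhf")])
# nested model comparison: does adding completion time improve fit?
anova(tMdl, tMdl2)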

Spreads

A summary of our linear model on PF spread estimates.

summary(sMdl)
## 
## Call:
## lm(formula = Spread ~ Visualization, data = statsDf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.194 -2.016 -0.977  0.104 49.458 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.1978     0.7730   5.430 2.27e-07 ***
## Visualizationh   -1.7309     1.0932  -1.583    0.115    
## Visualizationhf  -0.9561     1.0932  -0.875    0.383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.466 on 147 degrees of freedom
## Multiple R-squared:  0.01683,    Adjusted R-squared:  0.003452 
## F-statistic: 1.258 on 2 and 147 DF,  p-value: 0.2872
confint(sMdl)
##                     2.5 %    97.5 %
## (Intercept)      2.670109 5.7254039
## Visualizationh  -3.891330 0.4295095
## Visualizationhf -3.116538 1.2043020

Again, we plot estimated effect sizes as dots with error bars representing 95% CIs.

We see no effect of visualization condition on noise in the perception of evidence, as measured by PF spreads.

Confidence Fitness

A summary of our linear model on confidence fitness estimates.

summary(cMdl)
## 
## Call:
## lm(formula = ConfidenceFitness ~ Visualization, data = statsDf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2878 -0.2209 -0.1986  0.2798  0.7848 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.28784    0.04561   6.311 3.09e-09 ***
## Visualizationh  -0.06443    0.06450  -0.999    0.319    
## Visualizationhf -0.08249    0.06450  -1.279    0.203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3225 on 147 degrees of freedom
## Multiple R-squared:  0.01215,    Adjusted R-squared:  -0.001291 
## F-statistic: 0.9039 on 2 and 147 DF,  p-value: 0.4072
confint(cMdl)
##                      2.5 %     97.5 %
## (Intercept)      0.1977043 0.37796658
## Visualizationh  -0.1918985 0.06303086
## Visualizationhf -0.2099499 0.04497940

We visualize estimated effect sizes as dots with error bars representing 95% CIs. We are modeling confidence fitness, a mixing parameter from 0 to 1 describing the degree to which reported confidence corresponds to a statistical formulation of confidence.

We see no effect of visualization condition on confidence fitness, mirroring the null result for confidence fitness in the first experiment.

Now we look at a mixed-effects linear model of the raw confidence data. In the first experiment, this analysis was exploratory, but we preregistered this model as a secondary analysis for experiment 2.

# Log likelihood ratio (Ratio) is stored in the raw data with signs (negative vs positive)
# indicating the data-generating model for the stimulus, where positive log ratios indicate
# no growth and negative log ratios indicate a growth trend.
# We take the absolute value of this log likelihood ratio in order to model confidence as a
# function of evidence regardless of the data-generating model, as we do in the paper.
rawDf$Evidence <- abs(rawDf$Ratio)
# fit the mixed-effects model (lmerTest provides the Satterthwaite approximation for the t-tests)
library(lmerTest)
rawConfMdl <- lmer(Confidence ~ Evidence * Correct + Visualization + (1|WorkerID), data = rawDf)
summary(rawConfMdl)
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: Confidence ~ Evidence * Correct + Visualization + (1 | WorkerID)
##    Data: rawDf
## 
## REML criterion at convergence: 139433.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.6803 -0.4871  0.1744  0.6754  3.6152 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  WorkerID (Intercept)  49.09    7.006  
##  Residual             130.24   11.412  
## Number of obs: 18017, groups:  WorkerID, 150
## 
## Fixed effects:
##                        Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)           7.872e+01  1.081e+00  1.990e+02  72.803  < 2e-16 ***
## Evidence              1.879e-01  9.198e-02  1.795e+04   2.043   0.0410 *  
## CorrectTRUE           1.958e+00  4.689e-01  1.787e+04   4.176 2.99e-05 ***
## Visualizationh       -2.324e+00  1.417e+00  1.470e+02  -1.640   0.1031    
## Visualizationhf      -2.937e+00  1.417e+00  1.470e+02  -2.073   0.0399 *  
## Evidence:CorrectTRUE  1.340e+00  9.735e-02  1.791e+04  13.766  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Evidnc CrTRUE Vislztnh Vslztnhf
## Evidence    -0.323                                
## CorrectTRUE -0.329  0.697                         
## Visualiztnh -0.657  0.007  0.002                  
## Visualztnhf -0.655  0.001  0.001  0.500           
## Evdnc:CTRUE  0.297 -0.914 -0.836 -0.005   -0.002
confint(rawConfMdl)
## Computing profile confidence intervals ...
##                             2.5 %     97.5 %
## .sig01                6.200662242  7.8182546
## .sigma               11.293899945 11.5305453
## (Intercept)          76.610605115 80.8324469
## Evidence              0.007408729  0.3679769
## CorrectTRUE           1.038832027  2.8770126
## Visualizationh       -5.090291031  0.4426345
## Visualizationhf      -5.703302918 -0.1705382
## Evidence:CorrectTRUE  1.149540836  1.5311574

Here, we see main effects for evidence, correctness, and the fast HOPs visualization condition as well as a significant interaction between evidence and correctness. The increase in reported confidence on correct trials and the interaction between stimulus intensity and correctness were expected based on the findings of Sanders et al. (2016), who created the confidence fitness model. Confidence goes up with stimulus intensity for trials where the participant was correct, but confidence goes down with increasing stimulus intensity on trials where participants were wrong. We can see this trend by looking at our raw confidence reports, although the plot is crowded.
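
The crowded figure is not reproduced here, but a minimal sketch along these lines (assuming ggplot2) would plot reported confidence against absolute evidence, split by whether the response was correct.

library(ggplot2)
# raw confidence reports vs evidence, colored by response correctness;
# linear smoothing lines summarize the crossing trends described above
ggplot(rawDf, aes(x = Evidence, y = Confidence, color = Correct)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +
  labs(x = "Evidence (|log likelihood ratio|)", y = "Reported confidence")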

Sanders et al. (2016) found that the expected confidence generated by their model predicted this interaction between correctness and level of evidence in a stimulus. As they interpreted it, this behavior comports with the statistical formulation of confidence used in our model. Let’s check for this predictive behavior in our expected confidence estimates.

For many observers, reported confidence (dots) covers a wider range of the y-axis than expected confidence (lines). We’ve traced the origin of this difference in variability to the Monte Carlo simulation. For subjects with narrow PFs, the amount of noise added to evidence on each trial to generate simulated percepts is small. This means that the simulated observer only gets trials wrong where the evidence is very close to 0 (indicating that the stimulus conveys minimal information to disambiguate the underlying trend). A consequence of the simulated observer perceiving most stimuli correctly is that the model predicts values of confidence \(Pr(correct \mid perceivedEvidence)\) which are constant and high across most values of perceived evidence. In other words, low noise in simulated percepts leads to low variability in predicted confidence. This might explain the lack of good predictive behavior for subjects with small PF spreads.
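
A toy simulation (not the confidence fitness algorithm itself, and using a simple flat-prior stand-in for Pr(correct | perceived evidence)) illustrates the point: when the noise added to simulated percepts is small, predicted confidence is high and nearly constant across percepts.

set.seed(1)
# hypothetical PF spreads (sd of the noise added to evidence to generate simulated percepts)
spreads <- c(0.5, 2, 4)
evidence <- seq(0.1, 5, length.out = 1000)  # absolute evidence values
for (s in spreads) {
  percept <- evidence + rnorm(length(evidence), sd = s)  # simulated percepts
  conf <- pnorm(abs(percept) / s)  # stand-in for Pr(correct | perceived evidence) under a flat prior
  cat(sprintf("spread = %.1f: mean predicted confidence = %.2f, sd = %.2f\n",
              s, mean(conf), sd(conf)))
}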

See the file “JobsReport_ConfidenceFitness_Supplement.Rmd” for a detailed explanation of the confidence fitness algorithm and additional remarks on the model’s strengths and limitations.

Next we consider the main effect of the fast HOPs visualization on confidence reporting. It is important to acknowledge that this is a small effect on average, corresponding to a decrease of no more than about 6 units on our confidence scale. Let’s visualize our confidence reporting data and try to see this effect.
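
As a quick descriptive check (a sketch, not the figure from the paper), we can compare mean reported confidence across visualization conditions.

# mean reported confidence per visualization condition
aggregate(Confidence ~ Visualization, data = rawDf, FUN = mean)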

We can see that participants are more confident on average in the line ensembles condition than in either of the HOPs conditions, although only the difference between fast HOPs and line ensembles reaches statistical significance. Interestingly, this effect is in the opposite direction from experiment 1, where HOPs were associated with higher reported confidence on average than error bars.