Replication of Experiment 1 by Xiang, Vélez & Gershman (2023, JEP:G)

Author: Jacob C. Zimmerman
Affiliation: UC San Diego

Published: December 21, 2025

Introduction

How do individuals perceive, represent, and judge potential collaborators’ affordances for collaboration? Xiang et al. (2023) propose and empirically validate a probabilistic model in which jointly inferred estimates of competence and effort, given observed outcomes (the “belief–desire–competence framework”), predict individuals’ judgments of collaborators across several cognitive tasks that are common in collaboration. In particular, in Experiment 1, their model best predicts participants’ judgments about whether joint activity will succeed, compared to plausible alternative models. I sought to replicate this finding to help establish the robustness of this model of judgments about collaborators, before I might extend it in future work.

Experiment 1 involves observing contestants try to lift a heavy box in a set of contests (six total), each containing three rounds. In each contest, there are two unique avatars playing as the contestants. In rounds 1 and 2, each contestant attempts to lift the box by themselves. In round 3, the contestants attempt to lift the box together. In each round, after observing each contestant succeed or fail to lift the box, the participant judges the strength of each contestant (1–10), and for each contestant that successfully lifted the box, the participant judges that contestant’s allocated effort (0%–100%). At the start of round 2, the participant judges each contestant’s probability of lifting the box successfully (0%–100%), prior to observing the lift attempts. At the start of round 3, the participant judges the probability that the contestants will successfully lift the box together. Further, an incentive to the contestants for lifting the box is specified in each round: in the second and third rounds, the specified reward for lifting the box is double that of the first round. Participants are also told that the heavy box has a constant weight across all contests, which effectively requires at least 5 strength points applied in total (e.g., on average, 100% effort for a contestant with strength 5, or 50% effort for a contestant with strength 10, etc.). Last, participants see a progressively filled-out table of all the lift outcomes during each contest, visible throughout, including when making judgments.
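To make this success rule concrete, here is a minimal sketch in R of my reading of the task description (my own illustration, not the authors' code): a lift succeeds when the total applied strength, i.e., each lifter's strength multiplied by their proportion of effort and summed over the lifters, reaches the box's effective weight of 5.

Code
# Minimal sketch of the stated success rule (my reading of the task
# description, not the authors' code).
box_weight <- 5

lift_succeeds <- function(strength, effort) {
  # strength and effort are vectors over the contestants attempting the lift
  # (length 1 for a solo attempt in rounds 1-2, length 2 for the joint attempt
  # in round 3); effort is a proportion in [0, 1].
  sum(strength * effort) >= box_weight
}

lift_succeeds(strength = 5, effort = 1.0)                 # solo: strength 5 at 100% effort -> TRUE
lift_succeeds(strength = c(4, 4), effort = c(0.6, 0.7))   # joint: 2.4 + 2.8 = 5.2 -> TRUE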

Example illustrations shown during instructions; note that the incentive displayed for round 1 is $10, not $20.

An example table for a contest, as it would appear during the probability judgment at the beginning of round 3.

I sought to replicate that A) when rounds 1 and 2 feature no individual success, participants still judge the probability of joint success in round 3 as non-zero. In addition to this confirmatory analysis, I sought to qualitatively replicate that B) participants judge the probability of joint success to be higher as cases of individual success increase, and that the proposed model is the only model evaluated which predicts both patterns A and B. See Analysis Plan below for more details.

Links

Repository: https://github.com/psyc-201/xiang2023
OSF Component with Preregistration: https://osf.io/89ae4/
Original paper: 2023_xiang_effort.pdf (Retrieved on Sep 30, 2025 from velezlab.org)
Hosted experiment: https://psyc-201.github.io/xiang2023/exp/index1.html
Hosted version of this report: https://psyc-201.github.io/xiang2023/writeup/
Presentation accompanying this report: https://psyc-201.github.io/xiang2023/presentation/

Methods

Power & Precision Analysis

Code
# adapted from ./original_code/competence_effort/Code/main_text/regression.R

suppressPackageStartupMessages(library(tidyverse))

original.dat <- local({
  dat <- read.csv('./original_code/competence_effort/Data/exp1.csv', header = T, stringsAsFactors = T)
  dat$agent <- factor(dat$agent, labels = c('A', 'B'))
  dat$round <- as.factor(dat$round)
  dat$scenario <- factor(dat$scenario, levels = c('F,F;F,F','F,F;F,L','F,L;F,L','F,F;L,L','F,L;L,L','L,L;L,L'), ordered = T)
  dat$prob[dat$round==3 & dat$agent=='B'] <- NaN # so that the probabilities are only counted once for joint lifting
  dat <- arrange(dat,subject,scenario,round,agent)
  dat$subject <- as.factor(dat$subject)
  dat$model <- 'data'
  dat
})

# Test lift probability
original.ttest_result <- local({
  t.test(
    # filter one agent so that the probability is only counted once. Round 3 is a joint lift.
    original.dat$prob[original.dat$round == 3 & original.dat$scenario == 'F,F;F,F' & original.dat$agent == 'A'],
    mu = 0,
    alternative = 'two.sided'
  )
})

original.cohensd <- effectsize::effectsize(original.ttest_result)

The original effect size is \(d = 1.47\). To achieve 80%, 90%, or 95% power to detect that effect size, I needed:

Target Power   Necessary N   Estimated Power
80%            6             81.99%
90%            8             94.43%
95%            9             97.05%
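As a check, these sample sizes can be reproduced from the original effect size; the sketch below shows my approach, assuming a one-sample, two-sided t test at alpha = .05 and using the pwr package.

Code
# Sketch: reproduce the power table from the original effect size
# (one-sample, two-sided t test, alpha = .05), using the pwr package.
library(pwr)

d_original <- 1.47
target_powers <- c(.80, .90, .95)

necessary_n <- vapply(target_powers, function(p) {
  ceiling(pwr.t.test(d = d_original, power = p, sig.level = .05,
                     type = "one.sample", alternative = "two.sided")$n)
}, numeric(1))

estimated_power <- vapply(necessary_n, function(n) {
  pwr.t.test(n = n, d = d_original, sig.level = .05,
             type = "one.sample", alternative = "two.sided")$power
}, numeric(1))

data.frame(target_power = target_powers, necessary_n, estimated_power)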

Collecting 9 participants was completely feasible (with each spending ~16 minutes; see Procedure). To compensate for the original study’s measured effect size being inflated (e.g., due to publication bias), 15 participants would be necessary for sufficient power for the primary effect of interest, and this count was also feasible.

However, I also aimed to qualitatively replicate the overall pattern of estimated probabilities increasing as scenarios feature more successful individual lifts (the general increase from left to right in Figure 2 (a), labeled “B” above), and to replicate that the joint effort model is the only considered model which captures both the key effect and this pattern. I therefore found it prudent to retain the original sample size, so that I could attempt the qualitative replication with comparable precision. (Targeting higher precision could potentially counter the aforementioned effect inflation in the original paper, but since it is unclear how to quantify the necessary precision for these qualitative analyses, targeting the original precision appears to be a sensible option.)

Further, in the original study, given 50 participants, the key effect appeared quite robust, with the CI for the data from scenario “F,F;F,F” lying quite far from the line Lift Probability = 0 (see Figure 2 (a)), suggesting that with 50 participants, I was nearly guaranteed to replicate this effect if it is real.

Given my understanding of course funding available, collecting 50 participants was feasible.

Planned Sample

Based on my power and precision considerations, I planned to collect 50 participants on Prolific (from the U.S., per the original authors’ private note), following the specific exclusion strategy reported below.

I followed precisely the data collection strategy in the original Experiment 1, as described on p. 1569:

“Participants’ demographic information was not collected. Participants completed a comprehension check before they moved on to the experiment. They were not allowed to proceed until they answered all the comprehension check questions correctly. … To ensure data quality, we included two attention-check questions in the experiment. Participants who failed one attention check were warned immediately to pay closer attention. Participants who failed both attention checks were asked to leave the experiment and they were not counted among the 50 participants we recruited. A total of 10 participants failed one attention check, and we did not exclude their data in our analysis.”

In the original study, per the experiment code, participants who fail both attention checks are asked to return their submission (to self-exclude). Per our current IRB approval, I did not ask these participants to do so and would instead manually exclude them from my target sample and analysis.

Materials

I followed precisely the original Experiment 1. The materials for the experiment are available at https://github.com/jczimm/competence_effort. This repository is a clone of the authors’ own code repository for the paper, with the experimental task code copied from the authors’ hosted version linked in their README.md file.

See Procedure below for more details and a figure from the paper visualizing the task.

Procedure

I followed precisely the original Experiment 1, as described on pp. 1569-1570:

“… participants provided informed consent prior to the experiment. … Participants observed six contests between different pairs of contestants (see Table 2 for a description of the contests; the order was randomized). In each contest, the contestants were given three attempts to lift a box, corresponding to three rounds. In the first two rounds, the contestants tried lifting the box themselves. The reward for lifting the box was $10 in Round 1 and $20 in Round 2. In the third round of each contest, the two contestants tried to lift the box together for a reward of $20 each. Participants first saw the lift outcome of Round 1 and made strength judgments (1–10; 1 means extremely weak and 10 means extremely strong) and effort judgments (0%–100%) for each contestant. For Rounds 2 and 3, they predicted the probability of the contestants lifting the box (0%–100%) before seeing the outcome, then observed the actual outcome and made strength and effort judgments. Note that participants made effort judgments only when the outcome was Lift. Participants were informed that the weight of the box was always the same and equivalent to a strength of 5 (i.e., an average contestant with strength 5 exerting all of their efforts would be able to lift the box). Participants also saw a table of all the previous outcomes when making their guesses. Figure 2 shows an illustration of the task.”

To summarize: there are six contests, each with three rounds, and multiple judgments within each round.
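For reference, the design reduces to a small grid; below is a minimal sketch of it (my own summary, assuming the rewards quoted above).

Code
# Sketch of the design grid: six contests x three rounds, with the rewards
# described in the quoted procedure ($10 in round 1, $20 in rounds 2 and 3).
library(tidyverse)

design <- expand_grid(contest = 1:6, round = 1:3) %>%
  mutate(
    reward    = c(10, 20, 20)[round],
    lift_type = if_else(round == 3, "joint", "solo")
  )
design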

According to the consent form in the original experiment, the study is estimated to take 15 minutes. Accordingly, adhering to Prolific’s recommended rate of compensation ($12/hr), base compensation was $3.

For determining bonus compensation, I followed precisely the original scheme as described on p. 1569:

“Participants received … a potential bonus payment of up to $1. The amount of bonus they received was equal to the probability they put on the realized lift outcome on a randomly picked round.”
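Below is a minimal sketch of my reading of this scheme (an assumption on my part: if the realized outcome was a lift, the bonus equals the reported lift probability, otherwise the complementary probability, scaled to a maximum of $1).

Code
# Sketch of the bonus rule as I read it (assumption, not the authors' code):
# bonus (in dollars) = probability placed on the realized outcome of one
# randomly picked round, so at most $1.
bonus_for_round <- function(reported_prob_pct, outcome_was_lift) {
  prob_on_realized_outcome <- ifelse(outcome_was_lift,
                                     reported_prob_pct,
                                     100 - reported_prob_pct)
  prob_on_realized_outcome / 100
}

bonus_for_round(reported_prob_pct = 70, outcome_was_lift = TRUE)   # $0.70
bonus_for_round(reported_prob_pct = 70, outcome_was_lift = FALSE)  # $0.30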

Analysis Plan

I conducted a one-sample \(t\) test on the participants’ reported round-3 lift probability for scenario “F,F;F,F” (i.e., each contestant failed to lift the box in both rounds 1 and 2), with the null hypothesis that the true lift probability is zero. See Planned Sample for the data exclusion rule.

In addition to this confirmatory analysis, I would attempt to replicate the qualitative pattern that participants judge the probability of joint success to be higher as cases of individual success increase. I would also reimplement their proposed model (joint effort) and alternative models (solitary effort and compensatory effort) in memo (Chandra et al., 2025) (see Differences from Original Study) and attempt to replicate that the proposed model is the only model evaluated which predicts both the qualitative pattern and the other result (that participants estimate round-3 lift probability as non-zero after two rounds of failures). See the original paper’s Figure 3C and accompanying explanation, pp. 1570-1571; I would recreate that figure and attempt to replicate that explanation on my own data. Further, I would also estimate and statistically test the difference in the probability judgments plotted in Figure 3C between the original data and my new data, as sketched below.
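A sketch of one way this comparison could be run (my own, assuming original.dat and replication.dat as constructed in the code chunks elsewhere in this report): per-scenario Welch t tests, or a single linear model with a dataset-by-scenario interaction.

Code
# Sketch (not run here): compare round-3 probability judgments between the
# original data and the replication data, per scenario and overall.
library(tidyverse)

fig3c_rows <- function(dat, dataset_label) {
  dat %>%
    filter(round == 3, agent == 'A') %>%
    transmute(dataset = dataset_label, scenario, prob)
}

combined <- bind_rows(fig3c_rows(original.dat,    "original"),
                      fig3c_rows(replication.dat, "replication"))

# per-scenario Welch t tests
combined %>%
  group_by(scenario) %>%
  group_map(~ t.test(prob ~ dataset, data = .x))

# or a single model with a dataset x scenario interaction
summary(lm(prob ~ dataset * scenario, data = combined))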

I must note that while the \(t\) test alone does not justify the paper’s central claim that the joint effort model is qualitatively predictive, in the absence of such a statistical test this \(t\) test is a good alternative: it demonstrates one of the two key behavioral effects (p. 1570) which the model was qualitatively evaluated to predict. (In other words, I should attempt to replicate this behavioral effect before I could try to qualitatively replicate that the joint effort model is the only considered model which can predict it.)

Differences from Original Study

The sample differed in that the original was from Amazon Mechanical Turk, while the new sample is from Prolific. Accordingly, base compensation was $3 instead of $2.

The only difference in setting is that the task was hosted at https://psyc-201.github.io/xiang2023/exp/index1.html rather than https://gershmanlab.com/experiments/yang/toc/Experiment/index1.html. Accordingly, I also replaced the original authors’ data-saving step (a php page) with DataPipe.

The only known visible differences in procedure are:

  1. The consent form is updated according to the details provided by the UCSD course PSYC 201A
  2. In task instructions, “HIT” is renamed “submission”
  3. In task instructions, “Different pairs of contestants will come in to lift the box” becomes “For each contest, a new pair of contestants will come in to lift the box” (to clarify first two questions of comprehension check)
  4. See Differences from pre-data collection methods plan for an additional change

While the confirmatory analysis was rerun using the original R code provided by the authors, for the qualitative replication I used a reimplementation of the joint effort model and the alternative models (solitary effort and compensatory effort models) in memo (Chandra et al., 2025), based on the authors’ original implementations in WebPPL. I expected that reimplementing the continuous probabilistic models in WebPPL as discrete probabilistic models in memo would produce higher estimate precision, as the model predictions would be deterministic rather than stochastic. (I also discretized the distribution at a high resolution such that there was no meaningful systematic difference in estimates.)
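To illustrate the discretization idea in R, the sketch below is a schematic example of my own (not the authors' model nor my memo reimplementation): put a grid over strength, compute an exact posterior on that grid after one observed solo failure, and check that finer grids give stable estimates. Purely for illustration, I assume a uniform prior over strength in [1, 10], effort uniform on [0, 1], and success iff strength times effort is at least 5.

Code
# Schematic illustration of grid discretization (assumptions as stated above;
# not the actual joint effort model).
library(tidyverse)

p_fail_given_strength <- function(s) pmin(1, 5 / s)   # P(solo fail | strength), effort ~ Uniform(0, 1)

posterior_mean_strength <- function(step) {
  grid <- tibble(strength = seq(1, 10, by = step)) %>%
    mutate(
      prior     = 1 / n(),                                  # discretized uniform prior
      posterior = prior * p_fail_given_strength(strength),
      posterior = posterior / sum(posterior)
    )
  sum(grid$strength * grid$posterior)
}

# estimates should agree closely as the grid gets finer
map_dbl(c(0.5, 0.1, 0.03), posterior_mean_strength)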

I did not anticipate that these differences would be meaningful for the original paper’s claims or the analyses of interest. However, it is possible that changes 3 and 4 (improving comprehension-check fairness and reducing testing fatigue) may slightly improve measurement precision and reduce measurement bias, strengthening the claims of my replication.

Reliability and Validity

The key measure is the participant’s report of the probability of the contestants successfully lifting the box in round 3. For this measure, the latent construct of interest is the participant’s internal estimate of lift probability.

The reliability of this measure is unclear, given a lack of referenced evidence of the reliability of an explicit probability judgment. However, under the current model of the participant’s judgment as Bayesian inference, I infer that reliability is negatively correlated with the noisiness of the participant’s internal translation from implicit posterior distribution (over lift probability) to explicit report; in other words, this measure can be no more reliable than participants’ own estimation of the expected value of their internal posterior distribution. As there may also be individual differences in the noisiness of this translation process, reliability could be further reduced.

Likewise, validity is unclear. Following the same logic, I infer that validity may vary across participants, given that there could be individual differences in the accuracy of participants’ estimation of the expected value of their internal posterior distribution.
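A toy simulation of this point (my own illustration, with arbitrary noise levels): if explicit reports are the internal estimate plus report noise, then the correlation between two independent reports of the same internal estimate (one way to operationalize reliability) falls as the report noise grows.

Code
# Toy simulation (my own illustration, arbitrary parameters): report noise
# attenuates the test-retest reliability of an explicit probability judgment.
set.seed(1)
n_participants <- 200
internal_prob  <- runif(n_participants, 0, 100)   # hypothetical internal estimates

report <- function(internal, noise_sd) {
  pmin(pmax(rnorm(length(internal), mean = internal, sd = noise_sd), 0), 100)
}

sapply(c(2, 10, 25), function(noise_sd) {
  cor(report(internal_prob, noise_sd), report(internal_prob, noise_sd))
})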

Methods Addendum (Post Data Collection)

Actual Sample

I collected data from 50 participants.

On Prolific, 51 participants were collected rather than 50, because one participant returned the study but was also approved (returning the study reopened their slot). However, I only have 50 participants’ data, since one participant’s data (prolificId starting with 5c6) was not recorded; I suspect they did not press “Next” at the very end of the study, which is necessary to save the data. (In any future iterations, saving should happen before the completion code is given.)

Demographic information was not collected. By design of the experiment, participants who failed the comprehension check twice were excluded (see below). Following the planned data exclusion rule, 0 participants were excluded for failing both attention checks. A total of 10 participants failed one attention check, and I did not exclude their data in my analysis.

Differences from pre-data collection methods plan

After piloting, to align with Prolific’s guidelines, I modified the task such that no more than two attempts of the comprehension check were allowed. Participants who failed the comprehension check twice were asked to return the submission. Accordingly, before the comprehension check, I added: “You will have two chances to complete the comprehension questions correctly before we ask you to return your submission, per Prolific guidelines.”; and upon comprehension check failure, I added: “Please note that you have only X attempt(s) left.”

Unless marked otherwise, all results used only the sample collected after this change was made.

Results

Data preparation

The data collected in this experiment included task-level data:

  • subjectId, encryptedProlificId
  • pre-set round outcomes (round1_o, round2_o, round3_o)
  • pre-set round incentives (round1_reward, round2_reward, round3_reward)
  • number of attention checks passed, out of two (attention_sum)

They also included round-level data:

  • contest index (contest)
  • round index (round)
  • strength (r1_strength_a/r2_strength_a/r3_strength_a, and r1_strength_b/r2_strength_b/r3_strength_b)
  • effort (r1_effort_a/r2_effort_a/r3_effort_a, and r1_effort_b/r2_effort_b/r3_effort_b)
  • outcomes (r1_outcome_a/r2_outcome_a/r3_outcome_a, and r1_outcome_b/r2_outcome_b/r3_outcome_b)
  • probability (r1_prob_a/r2_prob_a/r3_prob_a, and r1_prob_b/r2_prob_b/r3_prob_b)

I manually excluded any participants who failed both attention checks.

Code
### Data Preparation

#### Load Relevant Libraries and Functions
suppressPackageStartupMessages(library(tidyverse))
library(lmerTest)

# Helper
Dt <- function(data, options=list(autoWidth=TRUE)) DT::datatable(data, extensions = c('Responsive'), rownames = FALSE, options = options) 

#### Import data

filename_to_subjectId <- function(filename) str_extract(filename, '(?<=index1_).*(?=_output(-anon)?\\.csv)')

if (str_equal(Sys.getenv('OSF_PAT'), '')) {
  warning("OSF_PAT not found in .Renviron; cannot download data from OSF. Continuing, assuming that anonymized data is already accessible")
} else {
  message("Retrieving data on OSF...")
  # (data downloaded from OSF, to which it was saved by DataPipe)
  # OSF is authenticated automatically using OSF_PAT in .Renviron
  OSF_NODE_REPLICATION_PROJECT <- '89ae4'
  osf_data_files <- osfr::osf_retrieve_node(OSF_NODE_REPLICATION_PROJECT) %>%
    osfr::osf_ls_nodes(pattern = "Prolific") %>%
    osfr::osf_ls_files(n_max=Inf)

  message("Downloading any new data...")
  osf_data_files %>%
    osfr::osf_download(path="./data", conflicts="skip", progress=interactive()) %>%
    invisible() # if sanity checks point to any data being missing (possibly due to the csvs being malformed), use conflicts="overwrite"

  message("Extracting metadata...")
  dat_filemeta <- osf_data_files %>%
    rename(datapipe_meta = meta) %>%
    unnest_wider(datapipe_meta, names_sep=".") %>%
    unnest_wider(datapipe_meta.attributes, names_sep=".") %>%
    select(name, datapipe_meta.attributes.date_modified) %>%
    mutate(subjectId = filename_to_subjectId(name)) %>%
    select(-name) %>%
    distinct() # in case there's any duplicates for any reason

  message("Saving anonymous data...")
  warning("**Assuming data contents are already anonymous! (i.e., all prolificIDs are still encrypted and the written responses contain no identifying information). If they're not anonymous, you should anonymize them now.")
  local({
    # Read the data now, add in the used metadata, and resave it with new filenames tagging them as anonymous
    pattern_match_original_raw_data <- 'index1_(.*)_output.csv'
    original_filenames <- list.files(path = './data', pattern = pattern_match_original_raw_data, full.names = T)
    dat_raw_anonymized <- original_filenames %>%
      map(~ {
        read.csv(.x, header = T, stringsAsFactors = F) %>%
          mutate(subjectId = filename_to_subjectId(basename(.x))) %>%
          left_join(dat_filemeta, by="subjectId")
      })
    
    pattern_name_anonymized_raw_data <- 'index1_\\1_output-anon.csv'
    new_filenames <- stringr::str_replace(original_filenames, pattern_match_original_raw_data, pattern_name_anonymized_raw_data)
    walk2(dat_raw_anonymized, new_filenames, ~ {
      write.csv(.x %>% select(-subjectId), file = .y, row.names = F)
    })
  })
  message("Saved anonymous data!")
}

# read all files, save subjectId to a column, and bind rows
pattern_match_anonymized_raw_data <- 'index1_.*_output-anon.csv'

dat_raw <- list.files(path = './data', pattern = pattern_match_anonymized_raw_data, full.names = T) %>%
  map_dfr(~ {
    read.csv(.x, header = T, stringsAsFactors = F) %>%
      mutate(subjectId = filename_to_subjectId(basename(.x)))
  })

dat_raw_annotated <- dat_raw %>%
  group_by(subjectId) %>%
  mutate(passed_an_attention_check = all(attention_sum >= 1, na.rm=T)) %>%
  ungroup() %>%
  mutate(
    collection_phase = ifelse(datapipe_meta.attributes.date_modified > lubridate::ymd("2025-11-20", tz="PST"), "main", "pilot"),
    collection_phase = ifelse(is.na(collection_phase), "pilot", collection_phase),
    collection_phase = factor(collection_phase, levels = c("pilot", "main"))
  )

#### Data exclusion / filtering
# based on attention checks criterion
# (also only include participants who have a session ID, as those who don't were not real Prolific participants; and only include participants who have done a trial, otherwise they have failed the comprehension check the maximum number of times)
# and discard pilot data
dat_raw_filtered <- dat_raw_annotated %>%
  filter(passed_an_attention_check) %>%
  group_by(subjectId) %>%
  filter(any(!is.na(sessionId))) %>%
  filter(any(trial_type == "survey-html-form")) %>%
  ungroup() %>%

  filter(collection_phase == "main")

#### Prepare data for analysis - create columns etc.
# Munge to generate the rows/columns I need for the analysis, which are per-round, per-agent data: 
# - prob
# - round
# - scenario
# - agent
# - subject
# For reference (from original_code/competence_effort/Data/README.txt):
# agent column: 1 refers to the weaker contestant, 2 refers to the stronger contestant (if applicable) based on individual lifting outcomes.
# prob column: Participants' lift probability judgments.

dat_clean <- dat_raw_filtered %>%
  filter(trial_type == "survey-html-form") %>%
  select(subject=subjectId, contest, round, outcome_a, outcome_b, probability_a, probability_b, strength_a, strength_b, effort_a, effort_b) %>%
  arrange(subject, contest, round) %>%
  
  # collapse across round Stage, like in the open data (exp1.csv)
  group_by(subject, contest, round) %>%
  summarise(
    across(everything(), ~{
      ifelse(any(!is.na(.x)), .x[!is.na(.x)], NA)
    })
  ) %>%

  # prepare to generate scenario column, which uses letters to represent the outcomes
  ungroup() %>%
  mutate(
    outcome_a = ifelse(outcome_a == 1, "L", "F"),
    outcome_b = ifelse(outcome_b == 1, "L", "F")
  ) %>%

  # relabel agents where needed: for any contest, if the two agents don't have the same outcome in any round, the agent that lifted should be called agent b for the whole contest
  group_by(subject, contest) %>%
  pivot_longer(cols = c(ends_with("_a"), ends_with("_b")),
               names_to = c(".value", "agent"),
               names_sep = "_") %>%
  group_by(subject, contest, round) %>%
  mutate(
    needs_switch =
      length(unique(outcome)) > 1 & 
      ((outcome == "L" & agent == "a") | (outcome == "F" & agent == "b"))
  ) %>%
  group_by(subject, contest) %>%
  mutate(needs_switch = any(needs_switch)) %>%
  ungroup() %>%
  mutate(
    agent = ifelse(needs_switch, ifelse(agent == "a", "b", "a"), agent)
  ) %>%
  select(-needs_switch) %>%
  pivot_wider(values_from = c(outcome, probability, strength, effort), 
              names_from = agent, 
              names_sep = "_") %>%

  # now in wide format, with the agents labeled as intended, define scenario
  group_by(subject, contest, round) %>%
  mutate(outcome = paste(outcome_a, outcome_b, sep=","), .after="round") %>%
  group_by(subject, contest) %>%
  mutate(scenario = paste(outcome[round == 1 | round == 2], collapse=";"), .after="round") %>%
  select(-starts_with("outcome")) %>%

  # get into format used in the open data (exp1.csv)
  pivot_longer(cols = c(ends_with("_a"), ends_with("_b")),
               names_to = c(".value", "agent"),
               names_sep = "_") %>%
  rename(prob = probability) %>%
  mutate(agent = ifelse(agent == 'a', 1, 2))
Code
# Decrypt prolific IDs to determine bonuses

if (interactive()) { # safeguard to help ensure this is not output by quarto
  private_key_path <- './new_code/competence_effort/Experiment/private.pem'
  if (!file.exists(private_key_path)) {
    warning("Private key file not found at specified path. Cannot decrypt Prolific IDs without it.")
  } else {
    decrypt_rsa <- function(col, private_key_path) {
      private_key <- openssl::read_key(private_key_path)
      decrypt_one <- function(enc) {
        if (is.na(enc) || enc == "") return(NA_character_)
        tryCatch({
          raw_enc <- base64enc::base64decode(enc)
          raw_dec <- openssl::rsa_decrypt(raw_enc, private_key)
          rawToChar(raw_dec)
        }, error = function(e) NA_character_)
      }
      vapply(col, decrypt_one, FUN.VALUE = character(1))
    }

    dat_raw_annotated %>%
      select(subjectId, encryptedProlificId, bonus, passed_an_attention_check, collection_phase, datapipe_meta.attributes.date_modified) %>% # keep collection_phase for the filter below
      group_by(subjectId) %>%
      filter(!is.na(bonus)) %>%
      ungroup() %>%
      # restore any subjects in case we lost them due to the filtering
      complete(subjectId = dat_raw_annotated$subjectId) %>%
      complete(encryptedProlificId = dat_raw_annotated$encryptedProlificId) %>%
      mutate(
        prolificId = decrypt_rsa(
          encryptedProlificId,
          private_key_path=private_key_path
        )
      ) %>%
      # filter to show only post-pilot participants
      filter(collection_phase == "main") %>%  
      arrange(desc(prolificId)) %>%
      select(prolificId, bonus)
      
      # filter(!is.na(prolificId), bonus > 0) %>% format_csv(col_names=F)
    
    # NOTE: only do one bulk-bonus per batch of data collection, since might accumulate in Prolific if bulk-bonus multiple times!
  }
}
Code
# Sanity checks

message("count all participants with a sessionId")
count all participants with a sessionId
Code
(real_participants <- dat_raw_annotated %>%
  group_by(subjectId) %>% filter(any(!is.na(sessionId))) %>%
  ungroup() %>% distinct(subjectId) %>%
  nrow())
[1] 97
Code
message("count participants included")
count participants included
Code
(filtered_participants <- dat_raw_filtered %>%
  ungroup() %>% distinct(subjectId) %>%
  nrow())
[1] 50
Code
(clean_participants <- dat_clean %>%
  ungroup() %>% distinct(subject) %>%
  nrow())
[1] 50
Code
stopifnot(clean_participants == filtered_participants)

message("count participants excluded due to failing both attention checks")
count participants excluded due to failing both attention checks
Code
(excluded_for_2inattention <- dat_raw_annotated %>%
  filter(!passed_an_attention_check) %>%
  group_by(subjectId) %>%
  filter(any(!is.na(sessionId))) %>%
  filter(any(trial_type == "survey-html-form")) %>%
  ungroup() %>% distinct(subjectId) %>%
  nrow())
[1] 0
Code
message("count participants kept because only failed one attention check")
count participants kept because only failed one attention check
Code
(kept_despite_1inattention <- dat_raw_filtered %>%
  group_by(subjectId) %>%
  filter(all(attention_sum < 2, na.rm=T)) %>%
  ungroup() %>% distinct(subjectId) %>%
  nrow())
[1] 10
Code
message("check that contest vs scenario is uniformly distributed, and that each participant did each contest and each scenario once")
check that contest vs scenario is uniformly distributed, and that each participant did each contest and each scenario once
Code
set.seed(1)
dat_clean %>%
  filter(agent == 1, round == 3) %>%
  ggplot(aes(x = scenario, y = contest)) +
  geom_jitter(aes(color = subject), width=.1, height=.1) +
  theme_minimal() + theme(legend.position = "none")

Code
message("clearer way to check that each subject did each scenario once")
clearer way to check that each subject did each scenario once
Code
dat_clean %>%
  filter(agent == 1, round == 3) %>%
  ggplot(aes(x = scenario)) +
  geom_bar() + # one bar per scenario, counting rows
  theme_minimal() + theme(legend.position = "none")

Code
message("clearer way to check that each subject did each contest once")
clearer way to check that each subject did each contest once
Code
dat_clean %>%
  filter(agent == 1, round == 3) %>%
  mutate(contest = factor(contest)) %>%
  ggplot(aes(x = contest)) +
  geom_bar() + # one bar per contest, counting rows
  theme_minimal() + theme(legend.position = "none")

Code
message("check that effort and strength plots have missing data in the right places")
check that effort and strength plots have missing data in the right places
Code
dat_clean %>%
  mutate(round = factor(round), agent = factor(agent, labels=c("Agent A", "Agent B"))) %>%
  ggplot(aes(x = round, y=effort, color=subject, group=1)) + # set group since needed for geom_smooth to connect across the factor-type x=round
  geom_point() + geom_line(aes(group=subject)) +
  geom_smooth(method=lm, se=F, color="black") +
  facet_wrap(scenario~agent) +
  theme_minimal() + theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 800 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 800 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 500 rows containing missing values or values outside the scale range
(`geom_line()`).

Code
dat_clean %>%
  mutate(round = factor(round), agent = factor(agent, labels=c("Agent A", "Agent B"))) %>%
  ggplot(aes(x = round, y=strength, color=subject, group=1)) +
  geom_point() + geom_line(aes(group=subject)) +
  geom_smooth(method=lm, se=F, color="black") +
  facet_wrap(scenario~agent) +
  theme_minimal() + theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

Code
dat_clean %>%
  mutate(round = factor(round), agent = factor(agent, labels=c("Agent A", "Agent B"))) %>%
  ggplot(aes(x = round, y=prob, color=subject, group=1)) +
  geom_point() + geom_line(aes(group=subject)) +
  geom_smooth(method=lm, se=F, color="black") +
  facet_wrap(scenario~agent) +
  theme_minimal() + theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 600 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 600 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_line()`).

Code
message("review written responses")
review written responses
Code
dat_raw_annotated |>
  filter(trial_type == "survey-text") |>
  select(subjectId, internal_node_id, responses, collection_phase) |>
  rowwise(subjectId) |>
  mutate(responses = jsonlite::fromJSON(responses)$Q0) |>
  group_by(subjectId) |>
  mutate(question = ifelse(internal_node_id == min(internal_node_id), "DescribeTask", "Feedback")) |>
  select(-internal_node_id) |>
  pivot_wider(names_from = "question", values_from = "responses") |>
  Dt()
Code
message("review comprehension checks")
review comprehension checks
Code
dat_raw_annotated |>
  filter(trial_type == "survey-multi-choice") |>
  group_by(collection_phase, subjectId) |>
  summarise(checks = n()) |>
  group_by(collection_phase, checks) |>
  count()
`summarise()` has grouped output by 'collection_phase'. You can override using
the `.groups` argument.
# A tibble: 7 × 3
# Groups:   collection_phase, checks [7]
  collection_phase checks     n
  <fct>             <int> <int>
1 pilot                 1     8
2 pilot                 2    10
3 pilot                 3     2
4 pilot                 5     1
5 pilot                 9     2
6 main                  1    23
7 main                  2    58
Code
dat_raw_annotated |>
  filter(trial_type == "survey-multi-choice") |>
  group_by(subjectId) |> mutate(attempts = n()) |>
  rowwise() |> mutate(responses = list(jsonlite::fromJSON(responses))) |>
  select(subjectId, responses, attempts, collection_phase) |> unnest_longer(responses) |>
  group_by(subjectId, responses, attempts) |> mutate(times_answered = n()) |>
  ungroup() |>
  ggplot(aes(x=responses, group=paste(times_answered, subjectId), fill=as.factor(times_answered))) +
  geom_bar() +
  geom_line(aes(y=after_stat(count)), stat="count", position=position_stack(vjust=.5), alpha=.5) +
  facet_grid(collection_phase~responses_id, scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = -20, hjust=0)) +
  scale_fill_viridis_d() +
  guides(color = "none") +
  labs(
    title = "Comprehension check question responses (lines connect different attempts by same participant)",
    x = "Response",
    y = "Count",
    fill = "Times answered"
  )
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Confirmatory analysis

Code
# (Adapted from original_code/competence_effort/Code/main_text/regression.R)
## Experiment 1
replication.dat <- local({
  dat <- dat_clean # read.csv('./data/exp1.csv', header = T, stringsAsFactors = T)
  dat$agent <- factor(dat$agent, labels = c('A', 'B'))
  dat$round <- as.factor(dat$round)
  dat$scenario <- factor(dat$scenario, levels = c('F,F;F,F','F,F;F,L','F,L;F,L','F,F;L,L','F,L;L,L','L,L;L,L'), ordered = T)
  dat$prob[dat$round==3 & dat$agent=='B'] <- NaN # so that the probabilities are only counted once for joint lifting
  dat <- arrange(dat,subject,scenario,round,agent)
  dat$model <- 'data'
  dat
})

# Lift probability
replication.ttest_result <- local({
  t.test(
    # filter one agent so that the probability is only counted once. Round 3 is a joint lift.
    replication.dat$prob[replication.dat$round == 3 & replication.dat$scenario == 'F,F;F,F' & replication.dat$agent == 'A'],
    mu = 0,
    alternative = 'two.sided'
  )
})

replication.cohensd <- effectsize::effectsize(replication.ttest_result)

I found that A) participants rated the probability of joint success in round 3 as non-zero (\(M = 43.22\), 95% CI \([33.84, 52.60]\), \(t(49) = 9.26\), \(p < .001\)). See the black point range (data) for scenario “F,F;F,F” under Exploratory analyses.

The effect size found is \(d = 1.31\).

Code
# dat$prob[dat$round == 3 & dat$scenario == 'F,F;F,F' & dat$agent == 'A']

alpha <- .05
critical_tvalue <- qt(1 - (alpha/2), df=replication.ttest_result$parameter) # dividing alpha by 2 since this is a two-tailed test

dat <- replication.dat
dat %>%
  filter(round == 3, scenario == 'F,F;F,F', agent == 'A') %>%
  {ggplot(., aes(x=prob)) +
    geom_histogram(bins=floor(sqrt(nrow(.)))) +
    coord_cartesian(xlim = c(0, 100)) +
    geom_vline(xintercept=critical_tvalue, color='red') +
    theme_minimal() + labs(y='Count', x='Lift probability in Round 3 (%)')}

Figure 1: The red line shows the critical \(t\) value for detecting a statistically significant effect with \(\alpha = 0.05\).

Exploratory analyses

I found that B) the probability of joint success increased as individual success increased. Further, the joint effort model was the only model which predicted both pattern A (that joint success has non-zero probability; see Confirmatory analysis) and pattern B (joint success probability increases as individual success increases).

Code
# adapted from original_code/competence_effort/Code/main_text/plt.R and original_code/competence_effort/Code/main_text/plot_figures.R
simulation <- read_csv("new_code/memo-sandbox/webppl vs memo/xiang2023-exp1-round3-model_fits_results.csv") |>
  select(model, scenario, P) |>
  mutate(prob = 100*P)
simulation$scenario <- factor(simulation$scenario, levels = c('F,F;F,F','F,F;F,L','F,L;F,L','F,F;L,L','F,L;L,L','L,L;L,L'), ordered = T)

set.seed(1)
fig3c <- dat %>%
  ggplot(aes(scenario, prob, group = model, color = model)) +

  geom_line(data = simulation %>% filter(model=='compensatory')) +
  geom_line(data = simulation %>% filter(model=='solitary')) +
  geom_line(data = simulation %>% filter(model=='joint')) +
  geom_point(data = simulation %>% filter(model=='compensatory'), size=.8) +
  geom_point(data = simulation %>% filter(model=='solitary'), size=.8) +
  geom_point(data = simulation %>% filter(model=='joint'), size=.8) +

  stat_summary(data = dat %>% filter(round == 3 & agent == 'A'), fun.data = 'mean_cl_boot', geom = 'errorbar', width = .1) +
  stat_summary(data = dat %>% filter(round == 3 & agent == 'A'), fun = 'mean', geom = 'point', size = 1.5) +

  scale_color_manual(
    name = NULL,
    labels = c('Data','Joint effort model','Solitary effort model','Compensatory effort model'),
    values = c('#000000','#e35d5e','#004385','#05b2dc'),
    limits = c('data','joint','solitary','compensatory'),
    guide = ggh4x::guide_stringlegend(position = "top")
  ) +
  theme_minimal() + 
  theme(legend.text = element_text(face = "bold.italic", size = 18 / .pt)) +

  coord_cartesian(ylim = c(0, 100)) +
  labs(x='Round 1 and Round 2 outcome', y='Lift probability in Round 3 (%)', color=NULL)

# write temporarily to file to render in qmd
ggsave('./writeup/replication-fig3c.png', plot = fig3c, width=4, height=3)

original.fig3c <- local({
  # adapted from original_code/competence_effort/Code/main_text/plot_figures.R

  inference <- read.csv('./original_code/competence_effort/Code/main_text/exp1_simulation.csv', header = T, stringsAsFactors = T)
  inference$round <- as.factor(inference$round)
  inference$model <- factor(inference$model, levels = c('joint','solitary','compensatory','maximum'), ordered = T)
  inference$scenario <- factor(inference$scenario, levels = c('F,F;F,F','F,F;F,L','F,L;F,L','F,F;L,L','F,L;L,L','L,L;L,L'), ordered = T)
  inference$effort[inference$outcome==0] <- NaN
  inference$prob[inference$round==3 & inference$agent=='B'] <- NaN

  inference_subset <- inference %>% filter(round == 3 & agent == 'A')

  set.seed(1)
  original.dat %>%
    ggplot(aes(scenario, prob, group = model, color = model)) +

    geom_line(data = inference_subset %>% filter(model=='compensatory')) +
    geom_line(data = inference_subset %>% filter(model=='solitary')) +
    geom_line(data = inference_subset %>% filter(model=='joint')) +
    geom_point(data = inference_subset %>% filter(model=='compensatory'), size=.8) +
    geom_point(data = inference_subset %>% filter(model=='solitary'), size=.8) +
    geom_point(data = inference_subset %>% filter(model=='joint'), size=.8) +

    stat_summary(data = original.dat %>% filter(round == 3 & agent == 'A'), fun.data = 'mean_cl_boot', geom = 'errorbar', width = .1) +
    stat_summary(data = original.dat %>% filter(round == 3 & agent == 'A'), fun = 'mean', geom = 'point', size = 1.5) +

    scale_color_manual(
      name = NULL,
      labels = c('Data','Joint effort model','Solitary effort model','Compensatory effort model'),
      values = c('#000000','#e35d5e','#004385','#05b2dc'),
      limits = c('data','joint','solitary','compensatory'),
      guide = ggh4x::guide_stringlegend(position = "top")
    ) +
    theme_minimal() + 
    theme(legend.text = element_text(face = "bold.italic", size = 18 / .pt)) +
    

    coord_cartesian(ylim = c(0, 100)) +
    labs(x='Round 1 and Round 2 outcome', y='Lift probability in Round 3 (%)', color=NULL)
})

# write temporarily to file to render in qmd
ggsave('./writeup/original-fig3c.png', plot = original.fig3c, width=4, height=3)

set.seed(1)
fig3c_w_raw_data <- fig3c +
  geom_line(aes(group=subject), linetype="solid", linewidth=.25, alpha=.2, data = dat %>% filter(round == 3 & agent == 'A'), position=position_jitter(width=.05, seed=0)) +
  geom_point(aes(group=subject), alpha=.2, data = dat %>% filter(round == 3 & agent == 'A'), size=.8, position=position_jitter(width=.05, seed=0))
  
# write temporarily to file to render in qmd
ggsave('./writeup/replication-fig3c-withrawdata.png', plot = fig3c_w_raw_data, width=6, height=4.5)

fig3c_w_safejoint <- fig3c +
  geom_line(data = simulation %>% filter(model=='safe_joint_w_gini')) +
  geom_point(data = simulation %>% filter(model=='safe_joint_w_gini'), size=.8) +
  scale_color_manual(
    name = NULL,
    labels = c('Data','Joint effort','Safe joint effort','Solitary effort','Compensatory effort'),
    values = c('#000000','#e35d5e','#b60cc5','#004385','#05b2dc'),
    limits = c('data','joint','safe_joint_w_gini','solitary','compensatory'),
    guide = ggh4x::guide_stringlegend(position = "top")
  )
  
# write temporarily to file to render in qmd
ggsave('./writeup/replication-fig3c-withsafejoint.png', plot = fig3c_w_safejoint, width=6, height=4.5)
(a) Original Figure 3C. “Model simulations averaged over 10 runs. Error bars indicate bootstrapped 95% confidence intervals.” (p. 1571) Regenerated error bars using a fixed seed.
(b) Figure 3C Replication. Model simulations consider strengths from 1 to 10 (inclusive) in steps of 0.03 (determined by machine memory constraints). Error bars indicate bootstrapped 95% confidence intervals.
Figure 2: Comparison of Figure 3C from Xiang et al. (2023) and my replication.
Figure 3: Figure 3C Replication with Raw Data
Figure 4: Figure 3C Replication with Safe Joint Model

Discussion

Summary of Replication Attempt

The primary, confirmatory result fully replicated: A) when rounds 1 and 2 feature no individual success, participants still judge the probability of joint success in round 3 as non-zero. The exploratory (qualitative) result also replicated: B) participants judge the probability of joint success to be higher as cases of individual success increase. Further, I replicated that the authors’ proposed model was the only model evaluated which predicts both patterns A and B, compared against plausible alternative models. Taken together, my replication supports the robustness of this model of reasoning about collaborators.

Commentary

Replication success was expected due to the large effect size and clear qualitative pattern. However, the replication success of the confirmatory result nevertheless reinforces my confidence in the reliability of the task paradigm, and the replication of the qualitative pattern using a model reimplementation reinforces my confidence in both that pattern and the modeling approach.

In my replication, I observed slightly larger CIs and an effect size lower than the original; both may be expected given publication bias. Further, I observed a slope inversion in the data means (i.e., the estimated probability of round-3 success is lower for F,L;F,L than for F,F;F,L) which was not present in the data from the original study. Speculatively, this might be explainable by the safe joint effort model (see Figure 4), a modified version of the authors’ proposed joint effort model available in their supplement. In this case, the replication provides some support for that modified model.

References

Chandra, K., Chen, T., Tenenbaum, J. B., & Ragan-Kelley, J. (2025). A domain-specific probabilistic programming language for reasoning about reasoning (or: A memo on memo). Proc. ACM Program. Lang., 9(OOPSLA2). https://doi.org/10.1145/3763078
Xiang, Y., Vélez, N., & Gershman, S. J. (2023). Collaborative decision making is grounded in representations of other people’s competence and effort. J. Exp. Psychol. Gen., 152(6), 1565–1579.