Pilot 1, cleaning the data

Authors
Affiliations

Elias Bouacida

University Paris 8

Renaud Foucart

Lancaster University

Published

October 7, 2024

Clean the raw data file, and lays out the assumption behind the cleaning.

import CSV
using DataFrames
# Functions
include("PilotFunctions.jl")
transform_labels

Constants

const n_pages = 7;

Reading the Data

Time Spent

Read the table where the time spent on each page is kept. Then apply the function treatment to each subject.

timespent = CSV.read(joinpath("..", "Data", "OriginalData", "TimeSpent (accessed 2020-04-30).csv"), DataFrame, normalizenames = true, stringtype=String)
timedata = combine(groupby(timespent, :participant_code), x->treatment(x, n_pages - 1));

# timedata = by(timespent, :participant_code, x -> treatment(x, n_pages - 1));

Raw Data

Read the raw data (after the anonymization)

rawdata = CSV.read(joinpath("..", "Data", "OriginalData", "Pilot1RawData.csv"), DataFrame, normalizenames=true, stringtype=String);

Aggregate Time and Raw Data

data = innerjoin(rawdata, timedata, on = :participant_code);

Cleaning the Data

Finishing the Experiments

Removed subjects who have not finished the experiment

Check that the raw data and time date coincide on who finished the experiment.

Then keep only subjects who have finished the experiment, and remove the finished column as a consequence.

@assert (data[!, :participant_index_in_pages] .== n_pages) == data[!, :finished] "The method used to determine who has finished is not correct"
data = data[data[!, :finished], :]
select!(data, Not([:finished])); 

Selecting a subset of columns

We keep on the columns from the following relevantcolumns variable.

const relevantcolumns = [
    :participant_code, :time_in_experiment, :participant_payoff, :participant_payoff_plus_participation_fee, :session_code, 
    :survey_1_player_age, :survey_1_player_gender_rev, :survey_1_player_gender, :survey_1_player_employment,
    :survey_1_player_region, :survey_1_player_urn_rev, 
    :survey_1_player_urn, :survey_1_player_colour_rev, :survey_1_player_colour, :survey_1_player_colour_drawn, 
    :survey_1_player_urn_winner, :survey_1_player_round1, :survey_1_player_round2, :survey_1_player_round3, 
    :survey_1_player_round4, :survey_1_player_round5, :survey_1_player_total_time, :survey_1_player_probability, 
    :survey_1_player_mechanism_winner, :survey_1_player_belief_winner, :survey_1_player_mechanism_rev,
    :survey_1_player_mechanism, :survey_1_player_ability, :survey_1_player_knowledge, :survey_1_player_belief, 
    :survey_1_player_reasons, :time_page_1, :time_page_2, :time_page_3, :time_page_4, :time_page_5, :time_page_6,
];
irrelevantcolumns = [col for col = names(data) if !in(col, relevantcolumns)]
select!(data, relevantcolumns)
irrelevantcolumns
57-element Vector{String}:
 "participant_id_in_session"
 "participant_code"
 "participant_label"
 "participant_is_bot"
 "participant_index_in_pages"
 "participant_max_page_index"
 "participant_current_app_name"
 "participant_current_page_name"
 "participant_time_started"
 "participant_visited"
 "participant_payoff"
 "participant_payoff_plus_participation_fee"
 "session_code"
 ⋮
 "survey_1_player_belief"
 "survey_1_player_reasons"
 "survey_1_player_payoff"
 "survey_1_group_id_in_subsession"
 "survey_1_subsession_round_number"
 "time_in_experiment"
 "time_page_1"
 "time_page_2"
 "time_page_3"
 "time_page_4"
 "time_page_5"
 "time_page_6"

Striping the column names from survey_1_player_

for col = names(data)
    m = match(r"survey_1_player_(?<column>\w+)", string(col))
    if !(m === nothing)
        rename!(data, col => Symbol(m[:column]))
    end
end

Transform Columns Types

Transform some string columns with only two possibilities into boolean columns.

booleancolumns = [:ability, :knowledge, :gender_rev, :urn_rev, :colour_rev, :urn_winner, :mechanism_winner, :belief_winner, :mechanism_rev]
for col = booleancolumns
    data[!, col] = (data[!, col] .== 1)
end

Transform the mechanism column into a boolean. Value of 1 if RPS, 0 if Coin Toss.

data[!, :mechanism] = (data[!, :mechanism] .== "Rock, Paper, Scissors");

Transform the colours of the balls in the ambiguity choice into boolean values. The colour black is transformed in 1, on red into 0.

data[!, :colour] = (data[!, :colour] .== "Black");

Create a boolean telling whether RPS was better than Coin Toss or not.

data[!, :female] = (data[!, :gender] .== "Female")
data[!, :male] = (data[!, :gender] .== "Male");
data[!, :other] = (data[!, :gender] .== "Other")
select!(data, Not(:gender));
data[!, :RPSbetter] = (data[!, :probability] .> 0.5);

Consider that everyone that chose the left urn is ambiguity averse. This is a debatable assumption, but the best we can do with the available data. The difference in the number of ambiguity averse / ambiguity loving subjects is correct if ambiguity neutral subjects randonly choose between the right and left urns.

Remove then the :urn column that encodes exactly the same information.

data[!, :ambiguity_averse] = (data[!, :urn] .== "Urn Left");
select!(data, Not(:urn));

Create a variable characterizing the four different combination of treatments that are possible.

data[!, :treatment] .= "A, P"
data[.!data[!, :knowledge] .& data[!, :ability], :treatment] .= "A, NP"
data[.!data[!, :knowledge] .& .!data[!, :ability], :treatment] .= "C, NP"
data[data[!, :knowledge] .& .!data[!, :ability], :treatment] .= "C, P";
# Compute the observed probability of winning the mechanism of each subject.
data[!, :probability_winning] .= 0.5
data[data[!, :mechanism], :probability_winning] .= data[data[!, :mechanism], :probability];
# Transform the belief of number of win into a probability
data[!, :belief] = data[!, :belief] ./ n_profiles;
sum(x ->  ismissing(x) ? false : x, data[!, :belief] .<= 5 / n_profiles) / sum(.!data[!, :knowledge])
0.14583333333333334

A substantial number of subjects (15%) do not seem to have understood our belief question, as they report belief between 0 and 5, which seems clearly wrong.

Creating a measure of overconfidence: when people don’t know the probability (which is the only case where we have the belief in Experiment 1), it is the difference between their probability of winning the RPS and their belief. If positive, subjects are overconfident, if negative, they are underconfident.

data[!, :overconfidence] .= 0.
data[.!data[!, :knowledge], :overconfidence] = data[.!data[!, :knowledge], :belief] .- data[.!data[!, :knowledge], :probability];
data[!, :overconfident] .= (data[!, :overconfidence] .> 0);
data[!, :rps_believed_really_better] = Bool.(data[!, :RPSbetter]);
data[.!data[!, :knowledge], :rps_believed_really_better] .= (data[.!data[!, :knowledge], :belief] .>= 0.5);
# Unify the definition of a country and remove the previous column `:region` that encoded the same data.
data[!, :country] .= "USA";
for (i, region) = enumerate(data[!, :region])
    m = match(r"(?<country>US|States|california|georgia|arizona|america)"i, region)
    if m === nothing
        m2 = match(r"(?<country>india)"i, region)
        if m2 === nothing
            data[i, :country] = titlecase(data[i, :region], strict = true)
        else
            data[i, :country] = "India"
        end
    end
end
select!(data, Not(:region));

Transform the different RPS rounds choice into numbers. The translation is Rock = 1, Paper = 2, Scissors = 3

for i = 1:5
    data[!, Symbol("round$(i)")] = translaterpstoint.(data[!, Symbol("round$i")]);
end
## Hourly Wage
data[!, :wage] = data[!, :participant_payoff_plus_participation_fee] ./ data[!, :time_in_experiment] .* 3600;

Comments Encoding

We have encoded the comments according to three dummy variables probability_driving_choice, preference_driving_choice, doubt_interpretation and a string of comment additional_notes.

  • probability_driving_choices means that we read in the comment that higher probabilities of winning are what drives the choices of a mechanism over another (even if the belief/understanding and subsequent choice may not reflect that).
  • preference_driving_choice means that we read in the comment a intrinsic preference for one or the other mechanism.
  • doubt_interpretation means that we are not sure of the interpretation of the comment, so that our interpretation should be taken with a grain of salt. In the data, it is originally encoded as ?, we replace that by a dummy of value 1 when we have a doubt.
  • additional_notes are additional remarks we have on the comments. In particular, if a preference is expressed for one or the other of RPS / CT, we mention it.
comments = CSV.read(joinpath("..", "Data", "OriginalData", "Pilot1Comments.csv"), DataFrame, normalizenames = true, stringtype=String);
comments[!, :doubt_interpretation] = (comments[!, :doubt_interpretation] .== "?")
comments[ismissing.(comments[!, :doubt_interpretation]), :doubt_interpretation] .= false;
comments[!, :probability_driving_choice] .= (comments[!, :probability_driving_choice] .== 1)
comments[!, :preference_driving_choice] .= (comments[!, :preference_driving_choice] .== 1);

Join together the data and our comments.

data = innerjoin(data, comments, on = :participant_code);

Take the reasons to transform anything that looks like an absence of comment into an empty comment. Remove the previous column :reasons that contained the comments. This filter is very imperfect.

data[!, :comments] .= ""
for (i, comment) = enumerate(data[!, :reasons])
    m = match(r"^(?<comment>no comment|na|none|nothing)"i, comment)
    if m === nothing
        data[i, :comments] = data[i, :reasons]
    end
end
select!(data, Not(:reasons));
# Saving the Cleaned Data
data |> CSV.write(joinpath("..", "Data", "Input", "Pilot1CleanedData.csv"), delim = ',')
"../Data/Input/Pilot1CleanedData.csv"