Missing Draws

Published

May 5, 2026

Abstract

Documenting draw data, in particular for the unpredictable households

This document explores missing draw data and, more broadly, inconsistencies in the pickup and dropoff data. To do so, I first look at discrepancies between the pickup and dropoff data. I then use survey data to investigate these cases and as an additional check on the pickup/dropoff data: the survey includes self-reported study payments, which can be used to verify whether pickups occurred (and, implicitly, dropoffs). Finally, we have info dissemination data, recorded when participants were told how the study would work.

Below, I explain some of the nuances of the datasets in case they come up in the future.

The document refers to raw (input) data, which is what IPA shared with us, and generated (output) data, which is the result of our cleaning and processing.

All the data lives in this folder: Dropbox/consumption smoothing/09. Main study/09. Data. The generated data referred to here is found in Dropbox/consumption smoothing/09. Main study/09. Data/20. Generated/tidy (which contains the first pass of data cleaning) and Dropbox/consumption smoothing/09. Main study/09. Data/20. Generated/simon_analysis (which contains the analysis-ready datasets). See the Data Catalogue for more info if needed.

Pickup and Dropoff Data

The raw pickup and dropoff data is found in the 07. Drops_and_pickups raw data folder. These datasets are pre-cleaned by IPA before they are shared with us. Due to the inconsistencies described below, I (Simon) requested that IPA share the raw, uncleaned data as well. That can be found in the 16. Raw folder. In the end, I was not able to recover new observations, so I did not integrate the uncleaned raw data into the data cleaning used to generate the final analysis datasets.

I did, however, write the code to do so; it is found in 00_1_prep_dropoff_pickup.do. Additionally, there is a README in that folder with a bit more detail.

Info Dissemination Data

One source of issues in the data is that in period 1, dropoffs were conducted during info dissemination. Info dissemination refers to the initial process in which the study work and payment were explained to participants and they gave consent. After this training, the first dropoff was conducted. For a few households, the information about their period 1 draws is only available in the raw info dissemination data and not in the dropoff data. As part of generating the analysis-ready datasets, I integrate dropoff information from the info dissemination data for period 1.
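The period 1 integration described above can be pictured as a backfill: keep the draw from the dropoff data where it exists, and fall back to the info dissemination record otherwise. The following is only a sketch on toy data. The column names `draw`, `draw_info`, and `draw_source` are hypothetical and are not the names used in the actual cleaning code.

```python
import pandas as pd

# Toy household-period rows from the (hypothetical) cleaned dropoff data.
dropoff = pd.DataFrame(
    {
        "hh_id": ["A", "A", "B", "C"],
        "period": [1, 2, 1, 1],
        "draw": [3.0, 4.0, None, None],
    }
)

# Toy info dissemination records; these only ever describe period 1.
info_dissem = pd.DataFrame({"hh_id": ["B", "C"], "draw_info": [5.0, None]})
info_dissem["period"] = 1

merged = dropoff.merge(info_dissem, on=["hh_id", "period"], how="left")

# Rows where the dropoff data is silent but info dissemination has a value.
fill = merged["draw"].isna() & merged["draw_info"].notna()

# Record where each draw came from before overwriting anything.
merged["draw_source"] = pd.NA
merged.loc[merged["draw"].notna(), "draw_source"] = "dropoff"
merged.loc[fill, "draw_source"] = "info_dissemination"

merged["draw"] = merged["draw"].fillna(merged["draw_info"])
merged = merged.drop(columns="draw_info")
print(merged)
```

Keeping a source column makes it easy to count how many period 1 draws depend on the info dissemination data rather than on the dropoff records.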

Household Survey Data

The household survey data include self-reported study bag payments, payment counts (i.e. number of pickup visits), and hours spent making bags. These variables are not direct records of field pickup or dropoff activity, but they can help catch inconsistencies.

Inconsistencies

Below, I look at three main cases: a dropoff but no pickup, a pickup but no dropoff, and cases where the pickup or dropoff data does not agree with the self-reported survey data.
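The code chunks in this document call two formatting helpers, `fmt_int` and `fmt_pct_1dp`, and also use `np`, `pd`, and `Markdown` (presumably `numpy`, `pandas`, and `IPython.display.Markdown`), none of which are defined in the chunks shown here. Below is a minimal sketch of what the helpers might look like, inferred from the rendered tables (thousands separators, blank cells for missing values, and a percent sign appended to values already on a 0 to 100 scale). The actual definitions may differ.

```python
import math


def _is_missing(x):
    """Return True for None or a float NaN (assumed missing-value markers)."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fmt_int(x):
    """Format a number as an integer with thousands separators; blank if missing."""
    if _is_missing(x):
        return ""
    return f"{int(round(x)):,}"


def fmt_pct_1dp(x):
    """Format a value already on a 0-100 scale as a percentage with one decimal."""
    if _is_missing(x):
        return ""
    return f"{x:.1f}%"


print(fmt_int(2311), fmt_pct_1dp(95.8))
```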

Overall Counts

```{python}
panel["dropoff_no_pickup"] = (
    (panel["dropoff_completed"] == 1) & (panel["pickup_completed"] == 0)
).astype(int)
panel["pickup_no_dropoff"] = (
    (panel["pickup_completed"] == 1) & (panel["dropoff_completed"] == 0)
).astype(int)


panel["no_admin_record"] = (panel["pickup_completed"] == 0) & (
    panel["dropoff_completed"] == 0
)

panel["no_survey_payment"] = (panel["study_bag_pay_total"].fillna(0) == 0) | (
    panel["study_bag_pay_count"].fillna(0) == 0
)

panel["survey_admin_issue"] = (
    (panel["no_survey_payment"] != panel["no_admin_record"])
    & (panel["survey_completed"] == 1)
    & (panel["period"] > 0)
    & (panel["period"] < 6)
)

inconsistency_by_arm = (
    panel.groupby("arm", observed=False)
    .agg(
        hh_period_obs=("hh_id", "size"),
        dropoff_no_pickup=("dropoff_no_pickup", "sum"),
        pickup_no_dropoff=("pickup_no_dropoff", "sum"),
        survey_admin_issue=("survey_admin_issue", "sum"),
    )
    .reset_index()
)

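# Note: this adds the three flag counts, so a household-period flagged under
# more than one type is counted more than once.
inconsistency_by_arm["Any Inconsistency"] = (
    inconsistency_by_arm["dropoff_no_pickup"]
    + inconsistency_by_arm["pickup_no_dropoff"]
    + inconsistency_by_arm["survey_admin_issue"]
)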
inconsistency_by_arm["Inconsistency Rate (%)"] = (
    100
    * inconsistency_by_arm["Any Inconsistency"]
    / inconsistency_by_arm["hh_period_obs"]
).round(2)

inconsistency_by_arm = inconsistency_by_arm.rename(
    columns={
        "arm": "Arm",
        "dropoff_no_pickup": "Dropoff but No Pickup",
        "pickup_no_dropoff": "Pickup but No Dropoff",
        "survey_admin_issue": "Survey/Admin Discrepancy",
    }
)

inconsistency_by_arm[
    [
        "Arm",
        "Dropoff but No Pickup",
        "Pickup but No Dropoff",
        "Survey/Admin Discrepancy",
        "Any Inconsistency",
        "Inconsistency Rate (%)",
    ]
].style.hide(axis="index").format(
    {
        "Dropoff but No Pickup": fmt_int,
        "Pickup but No Dropoff": fmt_int,
        "Survey/Admin Discrepancy": fmt_int,
        "Any Inconsistency": fmt_int,
        "Inconsistency Rate (%)": fmt_pct_1dp,
    }
)
```
| Arm | Dropoff but No Pickup | Pickup but No Dropoff | Survey/Admin Discrepancy | Any Inconsistency | Inconsistency Rate (%) |
|---|---|---|---|---|---|
| Control | 0 | 0 | 0 | 0 | 0.0% |
| Stable | 5 | 6 | 45 | 56 | 2.3% |
| Predictable | 9 | 1 | 61 | 71 | 2.9% |
| Risky | 28 | 4 | 154 | 186 | 2.9% |
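The "Any Inconsistency" column above is the sum of the three preceding columns. A household-period can carry more than one flag (for example, a dropoff with no pickup where the completed survey also records no payment), so the sum can exceed the number of distinct household-periods affected. The sketch below uses toy data, not the study panel, to show the difference between summing the flags and counting rows with at least one flag. The names `flag_sum` and `any_inconsistency` are illustrative.

```python
import pandas as pd

# Toy household-period rows; the first row carries two flags at once.
toy = pd.DataFrame(
    {
        "arm": ["Risky", "Risky", "Stable", "Stable"],
        "dropoff_no_pickup": [1, 0, 0, 0],
        "pickup_no_dropoff": [0, 0, 1, 0],
        "survey_admin_issue": [True, True, False, False],
    }
)

flags = ["dropoff_no_pickup", "pickup_no_dropoff", "survey_admin_issue"]

# Sum of flags: counts a row once per flag it carries.
toy["flag_sum"] = toy[flags].astype(int).sum(axis=1)

# Union of flags: counts a row once if it carries any flag.
toy["any_inconsistency"] = toy[flags].astype(bool).any(axis=1)

by_arm = toy.groupby("arm").agg(
    summed_flags=("flag_sum", "sum"),
    flagged_hh_periods=("any_inconsistency", "sum"),
)
print(by_arm)
```

If the rate is meant to be the share of household-periods with at least one problem, the union-based count is the one to divide by the number of observations.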
```{python}
inconsistency_by_period = (
    panel.groupby("period")
    .agg(
        hh_period_obs=("hh_id", "size"),
        dropoff_no_pickup=("dropoff_no_pickup", "sum"),
        pickup_no_dropoff=("pickup_no_dropoff", "sum"),
        survey_admin_issue=("survey_admin_issue", "sum"),
    )
    .reset_index()
)

inconsistency_by_period["any_inconsistency"] = (
    inconsistency_by_period["dropoff_no_pickup"]
    + inconsistency_by_period["pickup_no_dropoff"]
    + inconsistency_by_period["survey_admin_issue"]
)
inconsistency_by_period["inconsistency_rate_pct"] = (
    100
    * inconsistency_by_period["any_inconsistency"]
    / inconsistency_by_period["hh_period_obs"]
).round(2)

inconsistency_by_period = inconsistency_by_period.rename(
    columns={
        "period": "Period",
        "dropoff_no_pickup": "Dropoff but No Pickup",
        "pickup_no_dropoff": "Pickup but No Dropoff",
        "survey_admin_issue": "Survey/Admin Discrepancy",
        "any_inconsistency": "Any Inconsistency",
        "inconsistency_rate_pct": "Inconsistency Rate (%)",
    }
)

inconsistency_by_period[
    [
        "Period",
        "Dropoff but No Pickup",
        "Pickup but No Dropoff",
        "Survey/Admin Discrepancy",
        "Any Inconsistency",
        "Inconsistency Rate (%)",
    ]
].style.hide(axis="index").format(
    {
        "Period": fmt_int,
        "Dropoff but No Pickup": fmt_int,
        "Pickup but No Dropoff": fmt_int,
        "Survey/Admin Discrepancy": fmt_int,
        "Any Inconsistency": fmt_int,
        "Inconsistency Rate (%)": fmt_pct_1dp,
    }
)
```
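| Period | Dropoff but No Pickup | Pickup but No Dropoff | Survey/Admin Discrepancy | Any Inconsistency | Inconsistency Rate (%) |
|---|---|---|---|---|---|
| 1 | 28 | 1 | 49 | 78 | 3.4% |
| 2 | 2 | 1 | 22 | 25 | 1.1% |
| 3 | 1 | 3 | 52 | 56 | 2.5% |
| 4 | 8 | 2 | 100 | 110 | 4.8% |
| 5 | 1 | 3 | 37 | 41 | 1.8% |
| 6 | 2 | 1 | 0 | 3 | 0.1% |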
```{python}
inconsistent_hh_periods = panel[
    (panel["dropoff_no_pickup"] == 1) | (panel["pickup_no_dropoff"] == 1)
].copy()

# The two flags are mutually exclusive and every row kept above has exactly
# one of them set, so a two-way label is sufficient.
inconsistent_hh_periods["issue_type"] = np.where(
    inconsistent_hh_periods["dropoff_no_pickup"] == 1,
    "Dropoff but No Pickup",
    "Pickup but No Dropoff",
)
```

Type 1: Dropoff completed but no pickup completed

These could represent cases where the household simply decided not to make bags after receiving materials. To see how common this is, we use the self-reported survey data.

```{python}
drop_no_pickup_cases = inconsistent_hh_periods[
    inconsistent_hh_periods["dropoff_no_pickup"] == 1
].copy()
drop_no_pickup_cases["hh_id"] = drop_no_pickup_cases["hh_id"].astype(str)

drop_no_pickup_cases["no_payment"] = (
    drop_no_pickup_cases["study_bag_pay_total"] == 0
) | (drop_no_pickup_cases["study_bag_pay_total"].isna())

# Caution: for a float column, astype(bool) maps NaN to True, so this assumes
# survey_completed has no missing values in these rows.
drop_no_pickup_cases["no_survey"] = ~(
    drop_no_pickup_cases["survey_completed"].astype(bool)
)

# A case is unproblematic if the survey records no payment or was not completed.
drop_no_pickup_cases["no_issue"] = (
    (drop_no_pickup_cases["no_payment"]) | (drop_no_pickup_cases["no_survey"])
)
drop_no_pickup_cases["has_issue"] = ~drop_no_pickup_cases["no_issue"]

drop_no_pickup_cases = drop_no_pickup_cases.sort_values(
    ["has_issue", "arm", "period", "hh_id"]
)

bad_cases_drop_no_pick = drop_no_pickup_cases[
    drop_no_pickup_cases["has_issue"]
].copy()

bad_cases_drop_no_pick.to_stata(
    PROJ_ROOT / "dropoff_no_pickup_cases_with_issues.dta", write_index=False
)
```
```{python}
drop_no_pickup_total = len(drop_no_pickup_cases)
drop_no_pickup_no_payment = int(drop_no_pickup_cases["no_payment"].sum())
drop_no_pickup_no_survey = int(drop_no_pickup_cases["no_survey"].sum())
drop_no_pickup_has_issue = int(drop_no_pickup_cases["has_issue"].sum())

Markdown(
    f"There are {drop_no_pickup_total:,} cases with a completed dropoff but no "
    f"completed pickup. In {drop_no_pickup_no_payment:,} of these, the survey "
    "records no study payment, which is consistent with not making any bags. "
    f"These include {drop_no_pickup_no_survey:,} cases where the participant "
    "did not complete the survey, so the survey cannot confirm whether bag "
    f"work happened. The remaining {drop_no_pickup_has_issue:,} cases have "
    "reported income from making bags but no entry in the pickup data."
)
```

There are 42 cases with a completed dropoff but no completed pickup. In 30 of these, the survey records no study payment, which is consistent with not making any bags. These include 7 cases where the participant did not complete the survey, so the survey cannot confirm whether bag work happened. The remaining 12 cases have reported income from making bags but no entry in the pickup data.

```{python}
# Rows keep the earlier sort order: has_issue, arm, period, hh_id
drop_no_pickup_cases_display = (
    drop_no_pickup_cases[
        ["hh_id", "arm", "period", "no_payment", "no_survey", "has_issue"]
    ]
    .copy()
    .rename(
        columns={
            "hh_id": "HH ID",
            "arm": "Arm",
            "period": "Period",
            "no_survey": "No Survey",
            "no_payment": "No Payment",
            "has_issue": "Should have pickup",
        }
    )
)

# Format HH ID as string
drop_no_pickup_cases_display.style.hide(axis="index").format(
    {"Period": fmt_int, "HH ID": lambda x: "" if pd.isna(x) else str(x)}
)
```
| HH ID | Arm | Period | No Payment | No Survey | Should have pickup |
|---|---|---|---|---|---|
| 2054576447 | Stable | 1 | True | True | False |
| 3069749544 | Stable | 1 | True | False | False |
| 3071732445 | Stable | 1 | True | True | False |
| 3080258847 | Stable | 1 | True | True | False |
| 1007994076 | Predictable | 1 | True | False | False |
| 1009141269 | Predictable | 1 | True | False | False |
| 1011611433 | Predictable | 1 | True | True | False |
| 1014599930 | Predictable | 1 | True | False | False |
| 2041595061 | Predictable | 1 | True | False | False |
| 3068666023 | Predictable | 1 | True | False | False |
| 4091425970 | Predictable | 1 | True | False | False |
| 4091956538 | Predictable | 1 | True | False | False |
| 1009160224 | Risky | 1 | True | False | False |
| 1010895220 | Risky | 1 | True | False | False |
| 1011693364 | Risky | 1 | True | False | False |
| 2036100744 | Risky | 1 | True | False | False |
| 2044369144 | Risky | 1 | True | True | False |
| 3081390327 | Risky | 1 | True | False | False |
| 3083205127 | Risky | 1 | True | False | False |
| 3083437374 | Risky | 1 | True | False | False |
| 3083571578 | Risky | 1 | True | False | False |
| 4091210810 | Risky | 1 | True | False | False |
| 4091281998 | Risky | 1 | True | False | False |
| 4110309159 | Risky | 1 | True | False | False |
| 2042136116 | Risky | 4 | True | True | False |
| 4107720453 | Risky | 4 | True | True | False |
| 4109481438 | Risky | 4 | True | False | False |
| 4113394600 | Risky | 5 | True | False | False |
| 3076152427 | Risky | 6 | True | False | False |
| 4118188574 | Risky | 6 | True | False | False |
| 3062140657 | Stable | 1 | False | False | True |
| 2047288837 | Predictable | 3 | False | False | True |
| 3062868753 | Risky | 1 | False | False | True |
| 3069697866 | Risky | 1 | False | False | True |
| 4117959465 | Risky | 1 | False | False | True |
| 1008160831 | Risky | 2 | False | False | True |
| 4109481438 | Risky | 2 | False | False | True |
| 3073779373 | Risky | 4 | False | False | True |
| 4109173646 | Risky | 4 | False | False | True |
| 4109181488 | Risky | 4 | False | False | True |
| 4109530874 | Risky | 4 | False | False | True |
| 4109891684 | Risky | 4 | False | False | True |

Type 2: Pickup completed but no dropoff completed

These cases are more concerning, particularly in the risky arm. In such cases, we are truly missing data on the draw, although self-reported survey payment data can often help infer what happened.

```{python}
pickup_no_dropoff_cases = inconsistent_hh_periods[
    inconsistent_hh_periods["pickup_no_dropoff"] == 1
].copy()

pickup_no_dropoff_cases["has_survey_payment"] = (
    (pickup_no_dropoff_cases["study_bag_pay_total"].fillna(0) > 0)
    & (pickup_no_dropoff_cases["study_bag_pay_count"].fillna(0) > 0)
)

bad_cases_pick_no_drop = pickup_no_dropoff_cases.copy()

pickup_no_dropoff_total = len(pickup_no_dropoff_cases)
pickup_no_dropoff_period_1 = int((pickup_no_dropoff_cases["period"] == 1).sum())
pickup_no_dropoff_risky = int((pickup_no_dropoff_cases["arm"] == "Risky").sum())
pickup_no_dropoff_with_payment = int(
    pickup_no_dropoff_cases["has_survey_payment"].sum()
)

Markdown(
    f"There are {pickup_no_dropoff_total:,} cases with a completed pickup but no "
    f"completed dropoff. Of these, {pickup_no_dropoff_period_1:,} occurred in "
    "period 1, where some issues may be related to the info dissemination "
    f"process, and {pickup_no_dropoff_risky:,} are in the risky arm. In "
    f"{pickup_no_dropoff_with_payment:,} cases, the household has positive "
    "self-reported study payment data, which can be used alongside pickup data "
    "to investigate the missing draw."
)
```

There are 11 cases with a completed pickup but no completed dropoff. Of these, 1 occurred in period 1, where some issues may be related to the info dissemination process, and 4 are in the risky arm. In 8 cases, the household has positive self-reported study payment data, which can be used alongside pickup data to investigate the missing draw.

```{python}
pickup_no_dropoff_cases_display = (
    pickup_no_dropoff_cases.copy()
    .sort_values(["arm", "period", "hh_id"])
    .rename(
        columns={
            "hh_id": "HH ID",
            "arm": "Arm",
            "period": "Period",
            "study_bag_pay_total": "Study Payment",
            "study_bag_pay_count": "Payment Count",
            "study_bag_hours_spent": "Hours Spent",
            "number_bags": "Number of Bags in pickup",
        }
    )
)

pickup_no_dropoff_cases_display[
    [
        "HH ID",
        "Arm",
        "Period",
        "Study Payment",
        "Payment Count",
        "Number of Bags in pickup",
    ]
].style.hide(axis="index").format(
    {
        "HH ID": lambda x: "" if pd.isna(x) else str(x),
        "Period": fmt_int,
        "Study Payment": fmt_int,
        "Payment Count": fmt_int,
        "Number of Bags in pickup": fmt_int,
    }
)
```
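| HH ID | Arm | Period | Study Payment | Payment Count | Number of Bags in pickup |
|---|---|---|---|---|---|
| 4116180988 | Stable | 2 | 108 | 2 | 9 |
| 1009110519 | Stable | 3 | 108 | 3 | 9 |
| 2039868375 | Stable | 4 | 0 | 0 | 0 |
| 2057820428 | Stable | 5 | 36 | 3 | 9 |
| 3069749544 | Stable | 5 | 36 | 3 | 6 |
| 4115783203 | Stable | 5 | 108 | 3 | 9 |
| 2024102217 | Predictable | 3 | 180 | 3 | 5 |
| 3062277575 | Risky | 1 | 60 | 1 | 0 |
| 2055429495 | Risky | 3 | 1 | 2 | 1 |
| 4100157199 | Risky | 4 | 0 | 0 | 2 |
| 3068204507 | Risky | 6 | | | 0 |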

Type 3: Survey/Admin Discrepancy

These are cases where the cleaned pickup/dropoff records and the household survey payment reports disagree. In particular, we look at cases where people report a study payment but we don’t have a pickup / dropoff on record or vice versa.

```{python}
type3_cases = panel[panel["survey_admin_issue"]].copy()

type3_cases["mismatch_direction"] = np.select(
    [
        (~type3_cases["no_admin_record"]) & (type3_cases["no_survey_payment"]),
        (type3_cases["no_admin_record"]) & (~type3_cases["no_survey_payment"]),
    ],
    [
        "Admin record but no survey payment",
        "Survey payment but no admin record",
    ],
    default="Other",
)

type3_total = len(type3_cases)
type3_admin_no_payment = int(
    (
        (~type3_cases["no_admin_record"])
        & (type3_cases["no_survey_payment"])
    ).sum()
)
type3_payment_no_admin = int(
    (
        (type3_cases["no_admin_record"])
        & (~type3_cases["no_survey_payment"])
    ).sum()
)

Markdown(
    f"There are {type3_total:,} household-periods where the survey and "
    "pickup/dropoff records disagree. In "
    f"{type3_admin_no_payment:,} cases, the administrative data show a pickup "
    "or dropoff record but the survey reports no study-bag payment. In "
    f"{type3_payment_no_admin:,} cases, the survey reports a study-bag payment "
    "but the administrative data have neither a pickup nor a dropoff record."
)
```

There are 260 household-periods where the survey and pickup/dropoff records disagree. In 204 cases, the administrative data show a pickup or dropoff record but the survey reports no study-bag payment. In 56 cases, the survey reports a study-bag payment but the administrative data have neither a pickup nor a dropoff record.

```{python}
type3_by_arm = (
    type3_cases.groupby("arm", observed=False)
    .agg(
        survey_admin_discrepancies=("hh_id", "size"),
        unique_hh=("hh_id", "nunique"),
    )
    .reset_index()
    .rename(
        columns={
            "arm": "Arm",
            "survey_admin_discrepancies": "Survey/Admin Discrepancies",
            "unique_hh": "Unique HH",
        }
    )
)

type3_by_arm.style.hide(axis="index").format(
    {
        "Survey/Admin Discrepancies": fmt_int,
        "Unique HH": fmt_int,
    }
)
```
| Arm | Survey/Admin Discrepancies | Unique HH |
|---|---|---|
| Control | 0 | 0 |
| Stable | 45 | 37 |
| Predictable | 61 | 53 |
| Risky | 154 | 136 |
```{python}
type3_by_period = (
    type3_cases.groupby("period")
    .agg(
        survey_admin_discrepancies=("hh_id", "size"),
        unique_hh=("hh_id", "nunique"),
    )
    .reset_index()
    .rename(
        columns={
            "period": "Period",
            "survey_admin_discrepancies": "Survey/Admin Discrepancies",
            "unique_hh": "Unique HH",
        }
    )
)

type3_by_period.style.hide(axis="index").format(
    {
        "Period": fmt_int,
        "Survey/Admin Discrepancies": fmt_int,
        "Unique HH": fmt_int,
    }
)
```
| Period | Survey/Admin Discrepancies | Unique HH |
|---|---|---|
| 1 | 49 | 49 |
| 2 | 22 | 22 |
| 3 | 52 | 52 |
| 4 | 100 | 100 |
| 5 | 37 | 37 |
```{python}
type3_by_direction = (
    type3_cases.groupby("mismatch_direction")
    .agg(
        survey_admin_discrepancies=("hh_id", "size"),
        unique_hh=("hh_id", "nunique"),
    )
    .reset_index()
    .rename(
        columns={
            "mismatch_direction": "Mismatch Direction",
            "survey_admin_discrepancies": "Survey/Admin Discrepancies",
            "unique_hh": "Unique HH",
        }
    )
)

type3_by_direction.style.hide(axis="index").format(
    {
        "Survey/Admin Discrepancies": fmt_int,
        "Unique HH": fmt_int,
    }
)
```
| Mismatch Direction | Survey/Admin Discrepancies | Unique HH |
|---|---|---|
| Admin record but no survey payment | 204 | 185 |
| Survey payment but no admin record | 56 | 44 |

Further Investigation

As of May 5, 2026, Simon is still investigating these cases, and any further updates will be added here. In all likelihood, we will not be able to resolve them perfectly. Most of them occurred in periods 3 and 4, where the study pause likely caused issues.
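To put a number on the concentration in periods 3 and 4, the short calculation below reuses the counts from the by-period tables earlier in this document. It is plain arithmetic on the published counts, not a rerun of the panel code, and the helper name `share_in_periods` is illustrative. Note that the "Any Inconsistency" counts sum the three flags and may count some household-periods more than once.

```python
# Counts copied from the by-period tables in this document.
type3_by_period = {1: 49, 2: 22, 3: 52, 4: 100, 5: 37}
any_by_period = {1: 78, 2: 25, 3: 56, 4: 110, 5: 41, 6: 3}


def share_in_periods(counts, periods):
    """Share of all counted cases that fall in the given periods."""
    return sum(counts[p] for p in periods) / sum(counts.values())


type3_share = share_in_periods(type3_by_period, (3, 4))
any_share = share_in_periods(any_by_period, (3, 4))

print(f"Survey/admin discrepancies in periods 3-4: {type3_share:.1%}")  # 58.5%
print(f"All flagged inconsistencies in periods 3-4: {any_share:.1%}")  # 53.0%
```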

Appendix

Completion by Arm

```{python}
completion_by_arm = (
    panel.groupby("arm", observed=False)
    .agg(
        hh_period_obs=("hh_id", "size"),
        unique_hh=("hh_id", "nunique"),
        pickups_completed=("pickup_completed", "sum"),
        dropoffs_completed=("dropoff_completed", "sum"),
    )
    .reset_index()
)

completion_by_arm["pickup_completion_rate"] = (
    completion_by_arm["pickups_completed"] / completion_by_arm["hh_period_obs"]
)
completion_by_arm["dropoff_completion_rate"] = (
    completion_by_arm["dropoffs_completed"] / completion_by_arm["hh_period_obs"]
)

completion_by_arm = completion_by_arm.assign(
    pickup_completion_rate=lambda d: (100 * d["pickup_completion_rate"]).round(1),
    dropoff_completion_rate=lambda d: (100 * d["dropoff_completion_rate"]).round(1),
).rename(
    columns={
        "arm": "Arm",
        "unique_hh": "Unique HH",
        "pickups_completed": "Pickups Completed",
        "dropoffs_completed": "Dropoffs Completed",
        "pickup_completion_rate": "Pickup Completion Rate (%)",
        "dropoff_completion_rate": "Dropoff Completion Rate (%)",
    }
)

completion_by_arm[
    [
        "Arm",
        "Unique HH",
        "Pickups Completed",
        "Pickup Completion Rate (%)",
        "Dropoffs Completed",
        "Dropoff Completion Rate (%)",
    ]
].style.hide(axis="index").format(
    {
        "Unique HH": fmt_int,
        "Pickups Completed": fmt_int,
        "Pickup Completion Rate (%)": fmt_pct_1dp,
        "Dropoffs Completed": fmt_int,
        "Dropoff Completion Rate (%)": fmt_pct_1dp,
    }
)
```
| Arm | Unique HH | Pickups Completed | Pickup Completion Rate (%) | Dropoffs Completed | Dropoff Completion Rate (%) |
|---|---|---|---|---|---|
| Control | 401 | 0 | 0.0% | 0 | 0.0% |
| Stable | 402 | 2,311 | 95.8% | 2,310 | 95.8% |
| Predictable | 402 | 2,288 | 94.9% | 2,296 | 95.2% |
| Risky | 1,067 | 5,828 | 91.0% | 5,852 | 91.4% |

Raw Data

I was not able to recover any new observations from the uncleaned raw data, and in particular none that resolve the inconsistencies above. See the section on pickup and dropoff data above for more discussion and for where to investigate further if desired.