Missing Draws

Published

May 5, 2026

Abstract

Documenting draw data, in particular for the unpredictable households

This document explores missing draw data and, more broadly, inconsistencies in the pickup and dropoff data. To do so, I look at discrepancies between the pickup and dropoff data. I then use survey data to investigate these cases and as an additional check on the pickup/dropoff data. The survey data contains self-reported study payments, which can be used to verify whether pickups occurred (and, implicitly, dropoffs). Finally, we have info dissemination data, which covers the initial session where participants were told how the study would work.

Below, I explain some of the nuances of the datasets in case they come up in the future.

Data Paths

The document refers to raw (input) data, which is what IPA shared with us, and generated (output) data, which is the result of our cleaning and processing.

All the data lives in this folder: Dropbox/consumption smoothing/09. Main study/09. Data. The generated data referred to here is found in Dropbox/consumption smoothing/09. Main study/09. Data/20. Generated/tidy (which contains the first pass of data cleaning) and Dropbox/consumption smoothing/09. Main study/09. Data/20. Generated/simon_analysis (which contains the analysis-ready datasets). See the Data Catalogue for more info if needed.

Pickup and Dropoff Data

The raw pickup and dropoff data is found in the 07. Drops_and_pickups raw data folder. These datasets are pre-cleaned by IPA before they are shared with us. Because of the inconsistencies described below, I (Simon) requested that IPA share the raw, uncleaned data as well. That can be found in the 16. Raw folder. In the end, I was not able to recover new observations, and thus I did not integrate the uncleaned raw data into the data cleaning used to generate the final analysis datasets.

However, I wrote the code to do so; it is in 00_1_prep_dropoff_pickup.do. There is also a README in that folder with a bit more detail.

Info Dissemination Data

One source of issues in the data is that in period 1, dropoffs were conducted during info dissemination. Info dissemination refers to the initial process where the study work and payment were explained to participants and they gave consent. After this training, the first dropoff was conducted. For a few households, the information about their period 1 draws is only available in the raw info dissemination data and not in the dropoff data. As part of generating the analysis-ready datasets, I integrate dropoff information from the info dissemination data for period 1.
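The period 1 integration can be pictured with a small sketch. It is not the actual cleaning code: the frames `dropoff` and `info_dissem`, the `draw` column, and the `draw_source` label are hypothetical names used only for illustration. It assumes both sources are keyed by household and period, and that the dropoff record is preferred whenever it exists.

```python
import pandas as pd

# Hypothetical cleaned dropoff records (one row per household-period).
dropoff = pd.DataFrame(
    {
        "hh_id": [101, 102, 101],
        "period": [1, 1, 2],
        "draw": [20.0, None, 30.0],  # household 102 has no period 1 draw here
    }
)

# Hypothetical info dissemination records, which only cover period 1.
info_dissem = pd.DataFrame(
    {"hh_id": [102, 103], "period": [1, 1], "draw": [25.0, 15.0]}
)

keys = ["hh_id", "period"]

# An outer merge keeps household-periods that appear in only one source.
merged = dropoff.merge(info_dissem, on=keys, how="outer", suffixes=("", "_info"))

# Record where each draw came from before filling, so the fallback is auditable.
from_info = merged["draw"].isna() & merged["draw_info"].notna()
merged["draw_source"] = "dropoff"
merged.loc[from_info, "draw_source"] = "info_dissemination"

# Prefer the dropoff value; fall back to info dissemination when it is missing.
merged["draw"] = merged["draw"].fillna(merged["draw_info"])
merged = (
    merged.drop(columns="draw_info").sort_values(keys).reset_index(drop=True)
)
print(merged)
```

Keeping a source column makes it easy to list afterwards which period 1 draws were only available in the info dissemination data.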

Household Survey Data

The household survey data include self-reported study bag payments, payment counts (i.e. number of pickup visits), and hours spent making bags. These variables are not direct records of field pickup or dropoff activity, but they can help catch inconsistencies.

Inconsistencies

In the document below, I look at three main cases: dropoffs but no pickup, pickups but no dropoffs, and cases where pickup or dropoff data does not agree with self-reported survey data.
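The code chunks in this section call `fmt_int` and `fmt_pct_1dp` without defining them here; they presumably come from a setup chunk. The sketch below is a guess at what they do, inferred only from the rendered tables (thousands separators, blanks for missing values, and percentages computed on a 0 to 100 scale). The actual definitions may differ.

```python
import pandas as pd


def fmt_int(x):
    """Format a number as an integer with thousands separators; blank if missing."""
    return "" if pd.isna(x) else f"{int(round(x)):,}"


def fmt_pct_1dp(x):
    """Format a value already on the 0-100 scale as a percentage with one decimal."""
    return "" if pd.isna(x) else f"{x:.1f}%"


print(fmt_int(2311), fmt_pct_1dp(95.8))
```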

```{python}
panel["dropoff_no_pickup"] = (
    (panel["dropoff_completed"] == 1) & (panel["pickup_completed"] == 0)
).astype(int)
panel["pickup_no_dropoff"] = (
    (panel["pickup_completed"] == 1) & (panel["dropoff_completed"] == 0)
).astype(int)


panel["no_admin_record"] = (panel["pickup_completed"] == 0) & (
    panel["dropoff_completed"] == 0
)

panel["no_survey_payment"] = (panel["study_bag_pay_total"].fillna(0) == 0) | (
    panel["study_bag_pay_count"].fillna(0) == 0
)

panel["survey_admin_issue"] = (
    (panel["no_survey_payment"] != panel["no_admin_record"])
    & (panel["survey_completed"] == 1)
    & (panel["period"] > 0)
    & (panel["period"] < 6)
)

inconsistency_by_arm = (
    panel.groupby("arm", observed=False)
    .agg(
        hh_period_obs=("hh_id", "size"),
        dropoff_no_pickup=("dropoff_no_pickup", "sum"),
        pickup_no_dropoff=("pickup_no_dropoff", "sum"),
        survey_admin_issue=("survey_admin_issue", "sum"),
    )
    .reset_index()
)

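# Sum of the three flag counts. A household-period flagged under more than one
# type (e.g. a dropoff without a pickup that also has no survey payment) is
# counted once per flag, so this can exceed the number of distinct cases.
inconsistency_by_arm["Any Inconsistency"] = (
    inconsistency_by_arm["dropoff_no_pickup"]
    + inconsistency_by_arm["pickup_no_dropoff"]
    + inconsistency_by_arm["survey_admin_issue"]
)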
inconsistency_by_arm["Inconsistency Rate (%)"] = (
    100
    * inconsistency_by_arm["Any Inconsistency"]
    / inconsistency_by_arm["hh_period_obs"]
).round(2)

inconsistency_by_arm = inconsistency_by_arm.rename(
    columns={
        "arm": "Arm",
        "dropoff_no_pickup": "Dropoff but No Pickup",
        "pickup_no_dropoff": "Pickup but No Dropoff",
        "survey_admin_issue": "Survey/Admin Discrepancy",
    }
)

inconsistency_by_arm[
    [
        "Arm",
        "Dropoff but No Pickup",
        "Pickup but No Dropoff",
        "Survey/Admin Discrepancy",
        "Any Inconsistency",
        "Inconsistency Rate (%)",
    ]
].style.hide(axis="index").format(
    {
        "Dropoff but No Pickup": fmt_int,
        "Pickup but No Dropoff": fmt_int,
        "Survey/Admin Discrepancy": fmt_int,
        "Any Inconsistency": fmt_int,
        "Inconsistency Rate (%)": fmt_pct_1dp,
    }
)
```

Arm	Dropoff but No Pickup	Pickup but No Dropoff	Survey/Admin Discrepancy	Any Inconsistency	Inconsistency Rate (%)
Control	0	0	0	0	0.0%
Stable	5	6	45	56	2.3%
Predictable	9	1	61	71	2.9%
Risky	28	4	154	186	2.9%
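The "Any Inconsistency" column adds the three flag counts, so a household-period that triggers more than one flag is counted more than once. The sketch below uses a made-up four-row panel (the `toy` frame and its values are hypothetical) with the same flag definitions as the code above, and compares the summed count with the number of distinct flagged household-periods.

```python
import pandas as pd

# Hypothetical toy panel; column names follow the analysis panel.
toy = pd.DataFrame(
    {
        "hh_id": [1, 2, 3, 4],
        "period": [1, 1, 2, 2],
        "dropoff_completed": [1, 1, 0, 1],
        "pickup_completed": [0, 1, 0, 1],
        "survey_completed": [1, 1, 1, 1],
        "study_bag_pay_total": [0.0, 50.0, 40.0, 0.0],
        "study_bag_pay_count": [0.0, 2.0, 1.0, 0.0],
    }
)

dropoff_no_pickup = (toy["dropoff_completed"] == 1) & (toy["pickup_completed"] == 0)
pickup_no_dropoff = (toy["pickup_completed"] == 1) & (toy["dropoff_completed"] == 0)
no_admin_record = (toy["pickup_completed"] == 0) & (toy["dropoff_completed"] == 0)
no_survey_payment = (toy["study_bag_pay_total"].fillna(0) == 0) | (
    toy["study_bag_pay_count"].fillna(0) == 0
)
survey_admin_issue = (
    (no_survey_payment != no_admin_record)
    & (toy["survey_completed"] == 1)
    & toy["period"].between(1, 5)  # same as period > 0 and period < 6
)

# Summing the flags counts household 1 twice: it is both a dropoff without a
# pickup and a survey/admin discrepancy.
summed = int(
    dropoff_no_pickup.sum() + pickup_no_dropoff.sum() + survey_admin_issue.sum()
)

# Taking the union counts each household-period at most once.
distinct = int((dropoff_no_pickup | pickup_no_dropoff | survey_admin_issue).sum())

print(f"summed flags: {summed}, distinct household-periods: {distinct}")
```

If distinct cases are the quantity of interest, the union is the safer count; the summed version is still useful as a tally of individual flags.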

```{python}
inconsistency_by_period = (
    panel.groupby("period")
    .agg(
        hh_period_obs=("hh_id", "size"),
        dropoff_no_pickup=("dropoff_no_pickup", "sum"),
        pickup_no_dropoff=("pickup_no_dropoff", "sum"),
        survey_admin_issue=("survey_admin_issue", "sum"),
    )
    .reset_index()
)

inconsistency_by_period["any_inconsistency"] = (
    inconsistency_by_period["dropoff_no_pickup"]
    + inconsistency_by_period["pickup_no_dropoff"]
    + inconsistency_by_period["survey_admin_issue"]
)
inconsistency_by_period["inconsistency_rate_pct"] = (
    100
    * inconsistency_by_period["any_inconsistency"]
    / inconsistency_by_period["hh_period_obs"]
).round(2)

inconsistency_by_period = inconsistency_by_period.rename(
    columns={
        "period": "Period",
        "dropoff_no_pickup": "Dropoff but No Pickup",
        "pickup_no_dropoff": "Pickup but No Dropoff",
        "survey_admin_issue": "Survey/Admin Discrepancy",
        "any_inconsistency": "Any Inconsistency",
        "inconsistency_rate_pct": "Inconsistency Rate (%)",
    }
)

inconsistency_by_period[
    [
        "Period",
        "Dropoff but No Pickup",
        "Pickup but No Dropoff",
        "Survey/Admin Discrepancy",
        "Any Inconsistency",
        "Inconsistency Rate (%)",
    ]
].style.hide(axis="index").format(
    {
        "Period": fmt_int,
        "Dropoff but No Pickup": fmt_int,
        "Pickup but No Dropoff": fmt_int,
        "Survey/Admin Discrepancy": fmt_int,
        "Any Inconsistency": fmt_int,
        "Inconsistency Rate (%)": fmt_pct_1dp,
    }
)
```

Period	Dropoff but No Pickup	Pickup but No Dropoff	Survey/Admin Discrepancy	Any Inconsistency	Inconsistency Rate (%)
1	28	1	49	78	3.4%
2	2	1	22	25	1.1%
3	1	3	52	56	2.5%
4	8	2	100	110	4.8%
5	1	3	37	41	1.8%
6	2	1	0	3	0.1%

```{python}
inconsistent_hh_periods = panel[
    (panel["dropoff_no_pickup"] == 1) | (panel["pickup_no_dropoff"] == 1)
].copy()

# The filter above keeps only Type 1 and Type 2 rows, and those two flags are
# mutually exclusive, so in practice only the first two labels are assigned.
inconsistent_hh_periods["issue_type"] = np.select(
    [
        (inconsistent_hh_periods["dropoff_no_pickup"] == 1)
        & (inconsistent_hh_periods["pickup_no_dropoff"] == 0),
        (inconsistent_hh_periods["pickup_no_dropoff"] == 1)
        & (inconsistent_hh_periods["dropoff_no_pickup"] == 0),
        (inconsistent_hh_periods["survey_admin_issue"] == 1),
    ],
    ["Dropoff but No Pickup", "Pickup but No Dropoff", "Survey/Admin Discrepancy"],
    default="Both",
)
```

Type 1: Dropoff completed but no pickup completed

These could represent cases where the household simply decided not to make bags after receiving materials. To see how common this is, we use self-reported survey data.

```{python}
drop_no_pickup_cases = inconsistent_hh_periods[
    inconsistent_hh_periods["dropoff_no_pickup"] == 1
].copy()
drop_no_pickup_cases["hh_id"] = drop_no_pickup_cases["hh_id"].astype(str)

drop_no_pickup_cases["no_payment"] = (
    drop_no_pickup_cases["study_bag_pay_total"] == 0
) | (drop_no_pickup_cases["study_bag_pay_total"].isna())

drop_no_pickup_cases["no_survey"] = ~(
    drop_no_pickup_cases["survey_completed"].astype(bool)
)

drop_no_pickup_cases["no_issue"] = (
    (drop_no_pickup_cases["no_payment"]) | (drop_no_pickup_cases["no_survey"])
)
drop_no_pickup_cases["has_issue"] = ~drop_no_pickup_cases["no_issue"]

drop_no_pickup_cases = drop_no_pickup_cases.sort_values(
    ["has_issue", "arm", "period", "hh_id"]
)

# Cases where the survey reports payment but the pickup data has no entry.
bad_cases_drop_no_pick = drop_no_pickup_cases[
    drop_no_pickup_cases["has_issue"]
].copy()

bad_cases_drop_no_pick.to_stata(
    PROJ_ROOT / "dropoff_no_pickup_cases_with_issues.dta", write_index=False
)
```

```{python}
drop_no_pickup_total = len(drop_no_pickup_cases)
drop_no_pickup_no_payment = int(drop_no_pickup_cases["no_payment"].sum())
drop_no_pickup_no_survey = int(drop_no_pickup_cases["no_survey"].sum())
drop_no_pickup_has_issue = int(drop_no_pickup_cases["has_issue"].sum())

# no_payment treats missing payment data as no payment, so the no-payment and
# no-survey counts overlap and should not be added together.
Markdown(
    f"There are {drop_no_pickup_total:,} cases with a completed dropoff but no "
    f"completed pickup. In {drop_no_pickup_no_payment:,} of these, the survey "
    "records no payment, which is consistent with not making any bags. Missing "
    "payment data counts as no payment, so this group overlaps with the "
    f"{drop_no_pickup_no_survey:,} cases in which the participant did not "
    "complete the survey and the survey therefore cannot confirm whether bag "
    f"work happened. The remaining {drop_no_pickup_has_issue:,} cases have "
    "reported income from making bags but no entry in the pickup data."
)
```

There are 42 cases with a completed dropoff but no completed pickup. In 30 of these, the survey records no payment, which is consistent with not making any bags. Missing payment data counts as no payment, so this group overlaps with the 7 cases in which the participant did not complete the survey and the survey therefore cannot confirm whether bag work happened. The remaining 12 cases have reported income from making bags but no entry in the pickup data.
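Because a missing survey also means no recorded payment, the no-payment and no-survey counts are not mutually exclusive (in the case list, every "No Survey" row is also a "No Payment" row). One way to get counts that add up to the total is to assign each case to exactly one category, checking survey completion first. The sketch below does this on a made-up frame; the `cases` data and the category labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dropoff-without-pickup cases.
cases = pd.DataFrame(
    {
        "hh_id": ["a", "b", "c", "d"],
        "survey_completed": [0, 1, 1, 1],
        "study_bag_pay_total": [np.nan, 0.0, np.nan, 60.0],
    }
)

no_survey = ~cases["survey_completed"].astype(bool)
no_payment = cases["study_bag_pay_total"].fillna(0) == 0

# np.select takes the first condition that holds, so a case without a survey
# is labelled "No survey" even though its payment is also missing.
cases["category"] = np.select(
    [no_survey, no_payment],
    ["No survey", "Surveyed, no payment reported"],
    default="Payment reported, no pickup record",
)

counts = cases["category"].value_counts()
print(counts)
```

With this ordering, the three category counts sum to the number of cases by construction.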

List of Cases

```{python}
# Rows keep the earlier sort order: has_issue, arm, period, hh_id
drop_no_pickup_cases_display = (
    drop_no_pickup_cases[
        ["hh_id", "arm", "period", "no_payment", "no_survey", "has_issue"]
    ]
    .copy()
    .rename(
        columns={
            "hh_id": "HH ID",
            "arm": "Arm",
            "period": "Period",
            "no_survey": "No Survey",
            "no_payment": "No Payment",
            "has_issue": "Should have pickup",
        }
    )
)

# Format HH ID as string
drop_no_pickup_cases_display.style.hide(axis="index").format(
    {"Period": fmt_int, "HH ID": lambda x: "" if pd.isna(x) else str(x)}
)
```

HH ID	Arm	Period	No Payment	No Survey	Should have pickup
2054576447	Stable	1	True	True	False
3069749544	Stable	1	True	False	False
3071732445	Stable	1	True	True	False
3080258847	Stable	1	True	True	False
1007994076	Predictable	1	True	False	False
1009141269	Predictable	1	True	False	False
1011611433	Predictable	1	True	True	False
1014599930	Predictable	1	True	False	False
2041595061	Predictable	1	True	False	False
3068666023	Predictable	1	True	False	False
4091425970	Predictable	1	True	False	False
4091956538	Predictable	1	True	False	False
1009160224	Risky	1	True	False	False
1010895220	Risky	1	True	False	False
1011693364	Risky	1	True	False	False
2036100744	Risky	1	True	False	False
2044369144	Risky	1	True	True	False
3081390327	Risky	1	True	False	False
3083205127	Risky	1	True	False	False
3083437374	Risky	1	True	False	False
3083571578	Risky	1	True	False	False
4091210810	Risky	1	True	False	False
4091281998	Risky	1	True	False	False
4110309159	Risky	1	True	False	False
2042136116	Risky	4	True	True	False
4107720453	Risky	4	True	True	False
4109481438	Risky	4	True	False	False
4113394600	Risky	5	True	False	False
3076152427	Risky	6	True	False	False
4118188574	Risky	6	True	False	False
3062140657	Stable	1	False	False	True
2047288837	Predictable	3	False	False	True
3062868753	Risky	1	False	False	True
3069697866	Risky	1	False	False	True
4117959465	Risky	1	False	False	True
1008160831	Risky	2	False	False	True
4109481438	Risky	2	False	False	True
3073779373	Risky	4	False	False	True
4109173646	Risky	4	False	False	True
4109181488	Risky	4	False	False	True
4109530874	Risky	4	False	False	True
4109891684	Risky	4	False	False	True

Type 2: Pickup completed but no dropoff completed

These cases are more concerning, particularly in the risky arm. In such cases, we are truly missing data on the draw, although self-reported survey payment data can often help infer what happened.

```{python}
pickup_no_dropoff_cases = inconsistent_hh_periods[
    inconsistent_hh_periods["pickup_no_dropoff"] == 1
].copy()

pickup_no_dropoff_cases["has_survey_payment"] = (
    (pickup_no_dropoff_cases["study_bag_pay_total"].fillna(0) > 0)
    & (pickup_no_dropoff_cases["study_bag_pay_count"].fillna(0) > 0)
)

bad_cases_pick_no_drop = pickup_no_dropoff_cases.copy()

pickup_no_dropoff_total = len(pickup_no_dropoff_cases)
pickup_no_dropoff_period_1 = int((pickup_no_dropoff_cases["period"] == 1).sum())
pickup_no_dropoff_risky = int((pickup_no_dropoff_cases["arm"] == "Risky").sum())
pickup_no_dropoff_with_payment = int(
    pickup_no_dropoff_cases["has_survey_payment"].sum()
)

Markdown(
    f"There are {pickup_no_dropoff_total:,} cases with a completed pickup but no "
    f"completed dropoff. Of these, {pickup_no_dropoff_period_1:,} occurred in "
    "period 1, where some issues may be related to the info dissemination "
    f"process, and {pickup_no_dropoff_risky:,} are in the risky arm. In "
    f"{pickup_no_dropoff_with_payment:,} cases, the household has positive "
    "self-reported study payment data, which can be used alongside pickup data "
    "to investigate the missing draw."
)
```

There are 11 cases with a completed pickup but no completed dropoff. Of these, 1 occurred in period 1, where some issues may be related to the info dissemination process, and 4 are in the risky arm. In 8 cases, the household has positive self-reported study payment data, which can be used alongside pickup data to investigate the missing draw.

List of Cases

```{python}
pickup_no_dropoff_cases_display = (
    pickup_no_dropoff_cases.copy()
    .sort_values(["arm", "period", "hh_id"])
    .rename(
        columns={
            "hh_id": "HH ID",
            "arm": "Arm",
            "period": "Period",
            "study_bag_pay_total": "Study Payment",
            "study_bag_pay_count": "Payment Count",
            "study_bag_hours_spent": "Hours Spent",
            "number_bags": "Number of Bags in pickup",
        }
    )
)

pickup_no_dropoff_cases_display[
    [
        "HH ID",
        "Arm",
        "Period",
        "Study Payment",
        "Payment Count",
        "Number of Bags in pickup",
    ]
].style.hide(axis="index").format(
    {
        "HH ID": lambda x: "" if pd.isna(x) else str(x),
        "Period": fmt_int,
        "Study Payment": fmt_int,
        "Payment Count": fmt_int,
        "Number of Bags in pickup": fmt_int,
    }
)
```

HH ID	Arm	Period	Study Payment	Payment Count	Number of Bags in pickup
4116180988	Stable	2	108	2	9
1009110519	Stable	3	108	3	9
2039868375	Stable	4	0	0	0
2057820428	Stable	5	36	3	9
3069749544	Stable	5	36	3	6
4115783203	Stable	5	108	3	9
2024102217	Predictable	3	180	3	5
3062277575	Risky	1	60	1	0
2055429495	Risky	3	1	2	1
4100157199	Risky	4	0	0	2
3068204507	Risky	6			0
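One rough way to use these numbers is to divide the self-reported payment by the number of bags in the pickup record. This is only a sketch under strong assumptions that the source does not confirm: that the reported payment covers exactly the bags in the recorded pickup, and that payment is proportional to the number of bags. The values are copied from the table above; the column name `implied_pay_per_bag` is my own.

```python
import numpy as np
import pandas as pd

# Values copied from the case list above; NaN marks the blank survey cell.
cases = pd.DataFrame(
    {
        "hh_id": [
            "4116180988", "1009110519", "2039868375", "2057820428",
            "3069749544", "4115783203", "2024102217", "3062277575",
            "2055429495", "4100157199", "3068204507",
        ],
        "arm": ["Stable"] * 6 + ["Predictable"] + ["Risky"] * 4,
        "study_payment": [108, 108, 0, 36, 36, 108, 180, 60, 1, 0, np.nan],
        "bags_in_pickup": [9, 9, 0, 9, 6, 9, 5, 0, 1, 2, 0],
    }
)

# Only use rows with a positive payment and a positive bag count; everything
# else becomes NaN rather than a zero or an infinite ratio.
payment = cases["study_payment"].where(cases["study_payment"] > 0)
bags = cases["bags_in_pickup"].where(cases["bags_in_pickup"] > 0)
cases["implied_pay_per_bag"] = payment / bags

print(cases[["hh_id", "arm", "implied_pay_per_bag"]])
```

The ratio varies even within the stable arm, which suggests the assumptions above do not hold for every case, so these values are a prompt for follow-up rather than a recovered draw.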

Type 3: Survey/Admin Discrepancy

These are cases where the cleaned pickup/dropoff records and the household survey payment reports disagree. In particular, we look at cases where people report a study payment but we don’t have a pickup / dropoff on record or vice versa.

```{python}
type3_cases = panel[panel["survey_admin_issue"]].copy()

type3_cases["mismatch_direction"] = np.select(
    [
        (~type3_cases["no_admin_record"]) & (type3_cases["no_survey_payment"]),
        (type3_cases["no_admin_record"]) & (~type3_cases["no_survey_payment"]),
    ],
    [
        "Admin record but no survey payment",
        "Survey payment but no admin record",
    ],
    default="Other",
)

type3_total = len(type3_cases)
type3_admin_no_payment = int(
    (
        (~type3_cases["no_admin_record"])
        & (type3_cases["no_survey_payment"])
    ).sum()
)
type3_payment_no_admin = int(
    (
        (type3_cases["no_admin_record"])
        & (~type3_cases["no_survey_payment"])
    ).sum()
)

Markdown(
    f"There are {type3_total:,} household-periods where the survey and "
    "pickup/dropoff records disagree. In "
    f"{type3_admin_no_payment:,} cases, the administrative data show a pickup "
    "or dropoff record but the survey reports no study-bag payment. In "
    f"{type3_payment_no_admin:,} cases, the survey reports a study-bag payment "
    "but the administrative data have neither a pickup nor a dropoff record."
)
```

There are 260 household-periods where the survey and pickup/dropoff records disagree. In 204 cases, the administrative data show a pickup or dropoff record but the survey reports no study-bag payment. In 56 cases, the survey reports a study-bag payment but the administrative data have neither a pickup nor a dropoff record.

```{python}
type3_by_arm = (
    type3_cases.groupby("arm", observed=False)
    .agg(
        survey_admin_discrepancies=("hh_id", "size"),
        unique_hh=("hh_id", "nunique"),
    )
    .reset_index()
    .rename(
        columns={
            "arm": "Arm",
            "survey_admin_discrepancies": "Survey/Admin Discrepancies",
            "unique_hh": "Unique HH",
        }
    )
)

type3_by_arm.style.hide(axis="index").format(
    {
        "Survey/Admin Discrepancies": fmt_int,
        "Unique HH": fmt_int,
    }
)
```

Arm	Survey/Admin Discrepancies	Unique HH
Control	0	0
Stable	45	37
Predictable	61	53
Risky	154	136

```{python}
type3_by_period = (
    type3_cases.groupby("period")
    .agg(
        survey_admin_discrepancies=("hh_id", "size"),
        unique_hh=("hh_id", "nunique"),
    )
    .reset_index()
    .rename(
        columns={
            "period": "Period",
            "survey_admin_discrepancies": "Survey/Admin Discrepancies",
            "unique_hh": "Unique HH",
        }
    )
)

type3_by_period.style.hide(axis="index").format(
    {
        "Period": fmt_int,
        "Survey/Admin Discrepancies": fmt_int,
        "Unique HH": fmt_int,
    }
)
```

Period	Survey/Admin Discrepancies	Unique HH
1	49	49
2	22	22
3	52	52
4	100	100
5	37	37

```{python}
type3_by_direction = (
    type3_cases.groupby("mismatch_direction")
    .agg(
        survey_admin_discrepancies=("hh_id", "size"),
        unique_hh=("hh_id", "nunique"),
    )
    .reset_index()
    .rename(
        columns={
            "mismatch_direction": "Mismatch Direction",
            "survey_admin_discrepancies": "Survey/Admin Discrepancies",
            "unique_hh": "Unique HH",
        }
    )
)

type3_by_direction.style.hide(axis="index").format(
    {
        "Survey/Admin Discrepancies": fmt_int,
        "Unique HH": fmt_int,
    }
)
```

Mismatch Direction	Survey/Admin Discrepancies	Unique HH
Admin record but no survey payment	204	185
Survey payment but no admin record	56	44

Further Investigation

As of May 5, 2026, Simon is still investigating these cases. Any further updates will be provided here. In all likelihood, we will not be able to resolve them perfectly. The majority occurred in Periods 3 and 4, where the study pause likely caused issues.
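As a quick check on the share falling in Periods 3 and 4, the sketch below recomputes it from the "Any Inconsistency" counts in the by-period table above. The numbers are copied from that table; the variable names are my own.

```python
import pandas as pd

# "Any Inconsistency" counts copied from the by-period table.
by_period = pd.DataFrame(
    {
        "period": [1, 2, 3, 4, 5, 6],
        "any_inconsistency": [78, 25, 56, 110, 41, 3],
    }
)

total = int(by_period["any_inconsistency"].sum())
in_periods_3_4 = int(
    by_period.loc[by_period["period"].isin([3, 4]), "any_inconsistency"].sum()
)
share_pct = 100 * in_periods_3_4 / total

print(f"{in_periods_3_4} of {total} flags ({share_pct:.1f}%) are in Periods 3 and 4")
```

On these counts, Periods 3 and 4 account for a little over half of the flags.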

Appendix

Completion by Arm

```{python}
completion_by_arm = (
    panel.groupby("arm", observed=False)
    .agg(
        hh_period_obs=("hh_id", "size"),
        unique_hh=("hh_id", "nunique"),
        pickups_completed=("pickup_completed", "sum"),
        dropoffs_completed=("dropoff_completed", "sum"),
    )
    .reset_index()
)

completion_by_arm["pickup_completion_rate"] = (
    completion_by_arm["pickups_completed"] / completion_by_arm["hh_period_obs"]
)
completion_by_arm["dropoff_completion_rate"] = (
    completion_by_arm["dropoffs_completed"] / completion_by_arm["hh_period_obs"]
)

completion_by_arm = completion_by_arm.assign(
    pickup_completion_rate=lambda d: (100 * d["pickup_completion_rate"]).round(1),
    dropoff_completion_rate=lambda d: (100 * d["dropoff_completion_rate"]).round(1),
).rename(
    columns={
        "arm": "Arm",
        "unique_hh": "Unique HH",
        "pickups_completed": "Pickups Completed",
        "dropoffs_completed": "Dropoffs Completed",
        "pickup_completion_rate": "Pickup Completion Rate (%)",
        "dropoff_completion_rate": "Dropoff Completion Rate (%)",
    }
)

completion_by_arm[
    [
        "Arm",
        "Unique HH",
        "Pickups Completed",
        "Pickup Completion Rate (%)",
        "Dropoffs Completed",
        "Dropoff Completion Rate (%)",
    ]
].style.hide(axis="index").format(
    {
        "Unique HH": fmt_int,
        "Pickups Completed": fmt_int,
        "Pickup Completion Rate (%)": fmt_pct_1dp,
        "Dropoffs Completed": fmt_int,
        "Dropoff Completion Rate (%)": fmt_pct_1dp,
    }
)
```

Arm	Unique HH	Pickups Completed	Pickup Completion Rate (%)	Dropoffs Completed	Dropoff Completion Rate (%)
Control	401	0	0.0%	0	0.0%
Stable	402	2,311	95.8%	2,310	95.8%
Predictable	402	2,288	94.9%	2,296	95.2%
Risky	1,067	5,828	91.0%	5,852	91.4%

Raw Data

I was not able to recover any new observations from the raw data, and in particular none that would address the inconsistencies. See the Pickup and Dropoff Data section above for more discussion and for where to investigate further if desired.