Disentangling Double Reported Income

Published

March 4, 2026

Abstract

Some households likely reported study income as work income in early periods. This document explores the issue and potential fixes.

We noticed in our pilot study that participants tend to report income earned from participating in the study as work income. We do not want them to do so, so we added warnings to the survey that they should not include study income when telling us how much money they earned from work. However, due to implementation issues, the warnings only worked correctly starting in Period 4. The goal of this analysis is to document the fix and its justification.

```{python}
import pandas as pd
from plotnine import (
    ggplot, aes, geom_col, facet_wrap, scale_y_continuous,
    scale_fill_manual, labs, theme_minimal, theme, element_text
)

PROJ_ROOT = "/Users/st2246/Work/Pilot3"
DOUBLE_DATA = f"{PROJ_ROOT}/data/generated/main/transform/02_flag_double_reports-hh_id-period.dta"
EMPLOYMENT = f"{PROJ_ROOT}/data/generated/tidy/main/05_Employment-hh_id-period-member_id.dta"

PANEL = f"{PROJ_ROOT}/data/generated/main/transform/30_merged_panel-hh_id-period.dta"

panel = pd.read_stata(PANEL, convert_categoricals=False)

df = pd.read_stata(DOUBLE_DATA, convert_categoricals=False)

df["period"] = df["period"].astype(int)

# Derive mult_12_corrected for illustrative purposes (not part of the saved dataset)
df["mult_12_corrected"] = (
    (df["money_earn_p_corrected"] % 12 == 0)
    & (df["money_earn_p_corrected"] != 0)
    & (df["money_earn_p_corrected"] != 60)
    & (df["money_earn_p_corrected"] != 120)
).astype(int)
```
```{python}
panel_completed = panel[panel["survey_completed"] == 1].copy()
panel_completed["nonzero"] = (panel_completed["money_work_participant"] > 0).astype(int)

pct_by_group_uc = (
    panel_completed.groupby(["period", "treated"])["nonzero"]
    .agg("mean")
    .mul(100)
    .reset_index()
)
pct_by_group_uc["group"] = pct_by_group_uc["treated"].map({1: "Treated", 0: "Control"})

unfixed = (
    ggplot(pct_by_group_uc, aes(x="factor(period)", y="nonzero", group = "group", fill="group"))
    + geom_col(width=0.6, position="dodge")
    + scale_y_continuous(limits=(0, 50), labels=lambda l: [f"{v:.0f}%" for v in l])
    + labs(
        x="Survey Period",
        y="% Reporting Non-Zero Participant Earnings",
    )
    + theme_minimal()
)

unfixed
```

Fix

Although we did not warn participants in periods 1–3 to exclude study income from work income, we did ask them how much income they received from the study in each period. We can use this to identify likely double reporters by checking whether their self-reported work income is identical to their self-reported study income.
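
A minimal sketch of this exact-match heuristic on toy data. The real flag (`flag_same_income_self`) is computed upstream of this notebook, so the column names `money_work` and `money_study_self` and the zeroing-out correction below are assumptions for illustration, not the pipeline's actual code.

```python
import pandas as pd

# Toy data; column names are hypothetical stand-ins for the survey variables.
toy = pd.DataFrame({
    "hh_id": [1, 2, 3, 4],
    "treated": [1, 1, 1, 0],
    "money_work": [24.0, 50.0, 0.0, 24.0],        # self-reported work income
    "money_study_self": [24.0, 24.0, 24.0, 0.0],  # self-reported study income
})

# Flag treated households whose non-zero work income exactly equals study income.
toy["flag_same_income_self"] = (
    (toy["treated"] == 1)
    & (toy["money_work"] > 0)
    & (toy["money_work"] == toy["money_study_self"])
).astype(int)

# Assumed correction: treat flagged work income as zero, leave the rest unchanged.
toy["money_work_corrected"] = toy["money_work"].where(
    toy["flag_same_income_self"] == 0, 0.0
)
```

Household 2, whose work income exceeds its study income, is not flagged. That case is discussed under Caveats.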

```{python}
treated_df = df[(df["treated"] == 1) & (df["survey_completed"] == 1)].copy()

ct = pd.crosstab(
    treated_df["period"],
    treated_df["flag_same_income_self"].astype(int),
    margins=True,
    margins_name="Total",
)
ct.index = ct.index.map(lambda x: str(x) if x != "Total" else x)
ct.columns = ["Not flagged (0)", "Flagged (1)", "Total (survey completed)"]
ct.index.name = "Period"

ct.style.format("{:,}").set_caption(
    "Work earnings exactly match self-reported study income (treated households only)"
)
```
Table 1: Work earnings exactly match self-reported study income (treated households only)

| Period | Not flagged (0) | Flagged (1) | Total (survey completed) |
|:---|---:|---:|---:|
| 0 | 1,871 | 0 | 1,871 |
| 1 | 904 | 654 | 1,558 |
| 2 | 980 | 735 | 1,715 |
| 3 | 1,003 | 773 | 1,776 |
| 4 | 1,756 | 37 | 1,793 |
| 5 | 1,754 | 1 | 1,755 |
| 6 | 1,848 | 0 | 1,848 |
| Total | 10,116 | 2,200 | 12,316 |

Effectiveness

Below is a count of participants who report non-zero income, after correcting for double reporting.

```{python}
nonzero = df[df["money_earn_p_corrected"] > 0]
treated_counts = nonzero[nonzero["treated"] == 1]["period"].value_counts().sort_index()
control_counts = nonzero[nonzero["treated"] == 0]["period"].value_counts().sort_index()

# Denominators: survey-completed households per period × treatment group
completed = df[df["survey_completed"] == 1]
treated_denom = completed[completed["treated"] == 1]["period"].value_counts().sort_index()
control_denom = completed[completed["treated"] == 0]["period"].value_counts().sort_index()

periods = treated_counts.index.astype(int)

def fmt(n, denom):
    rate = n / denom * 100 if denom > 0 else float("nan")
    return f"{n:,} ({rate:.1f}%)"

counts = pd.DataFrame({
    "Period": periods,
    "Treated": [fmt(treated_counts[p], treated_denom.get(p, 0)) for p in periods],
    "Control": [fmt(control_counts.get(p, 0), control_denom.get(p, 0)) for p in periods],
})

total_treated_n = treated_counts.sum()
total_control_n = control_counts.sum()
total_treated_d = treated_denom.sum()
total_control_d = control_denom.sum()

total_row = pd.DataFrame([{
    "Period": "Total",
    "Treated": fmt(total_treated_n, total_treated_d),
    "Control": fmt(total_control_n, total_control_d),
}])
counts = pd.concat([counts, total_row], ignore_index=True)
counts.style.hide(axis="index").set_caption(
    "Households reporting non-zero corrected earnings, by survey period and treatment group (rate among survey completers)"
)
```
Table 2: Households reporting non-zero corrected earnings, by survey period and treatment group (rate among survey completers)

| Period | Treated | Control |
|:---|---:|---:|
| 0 | 208 (11.1%) | 37 (9.2%) |
| 1 | 58 (3.7%) | 21 (7.1%) |
| 2 | 65 (3.8%) | 20 (5.7%) |
| 3 | 54 (3.0%) | 19 (5.2%) |
| 4 | 99 (5.5%) | 18 (4.9%) |
| 5 | 79 (4.5%) | 14 (3.9%) |
| 6 | 244 (13.2%) | 58 (14.5%) |
| Total | 807 (6.6%) | 187 (7.3%) |

There does seem to be some overcrowding, as control participants report some work income more often than treated participants in periods 1–3 (5.2–7.1% versus 3.0–3.8%) and overall (7.3% versus 6.6%).
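
To make the comparison concrete, the treated-minus-control gap can be computed from the rates in Table 2. The numbers below are copied from that table by hand; nothing is recomputed from the underlying data.

```python
import pandas as pd

# Rates (%) copied from Table 2.
rates = pd.DataFrame({
    "period": [0, 1, 2, 3, 4, 5, 6],
    "treated": [11.1, 3.7, 3.8, 3.0, 5.5, 4.5, 13.2],
    "control": [9.2, 7.1, 5.7, 5.2, 4.9, 3.9, 14.5],
})

# Treated-minus-control gap in percentage points.
rates["gap_pp"] = (rates["treated"] - rates["control"]).round(1)

# Periods where the control rate exceeds the treated rate.
control_higher = rates.loc[rates["gap_pp"] < 0, "period"].tolist()
```

On these figures the control rate is higher in periods 1, 2, 3 and 6, with the largest gap (3.4 percentage points) in period 1.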

```{python}
completed = df[df["survey_completed"] == 1].copy()
completed["nonzero_corrected"] = (completed["money_work_p_corrected"] > 0).astype(int)

pct_by_group = (
    completed.groupby(["period", "treated"])["nonzero_corrected"]
    .mean()
    .mul(100)
    .reset_index()
)
pct_by_group["group"] = pct_by_group["treated"].map({1: "Treated", 0: "Control"})

(
    ggplot(pct_by_group, aes(x="factor(period)", y="nonzero_corrected", group = "group", fill="group"))
    + geom_col(width=0.6, position="dodge")
    + scale_y_continuous(limits=(0, 50), labels=lambda l: [f"{v:.0f}%" for v in l])
    + labs(
        x="Survey Period",
        y="% Reporting Non-Zero Corrected Earnings",
    )
    + theme_minimal()
)
```

```{python}
unfixed
```

Caveats

False Negatives

The current heuristic will not detect people whose reported work income includes study income alongside other income. I am not too worried, given that the corrected reporting pattern is similar to the control group's and that few participants report any income at all.

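Below is a count of treated households in periods 1–5 that report non-zero work income and were not flagged as double reporters. It serves as a proxy for households whose reported work income exceeds their self-reported study income, and so as an upper bound on missed double reporting.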

```{python}
# The flag file does not retain the original money_earn_participant, so work
# income cannot be compared directly against self-reported study income here.
# Proxy for the upper bound: treated households that were not flagged as double
# reporters but still have money_earn_p_corrected > 0.
# Restricted to periods 1-5 because endline has no study income question.
partial = df[
    (df["treated"] == 1) &
    (df["flag_double_report"] == 0) &
    (df["money_earn_p_corrected"] > 0) &
    (df["period"].isin([1, 2, 3, 4, 5]))
]

partial_counts = partial["period"].value_counts().sort_index().reset_index()
partial_counts.columns = ["Period", "Freq."]
partial_counts["Period"] = partial_counts["Period"].astype(int)


partial_counts.style.hide(axis="index").set_caption(
    "Treated households with non-zero corrected income not flagged as double-reporters, periods 1–5"
)
```
Table 3: Treated households with non-zero corrected income not flagged as double-reporters, periods 1–5

| Period | Freq. |
|:---|---:|
| 1 | 58 |
| 2 | 65 |
| 3 | 54 |
| 4 | 99 |
| 5 | 79 |
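
The table above is a proxy, since the flag file does not retain the original work-income column. If the original work income and the self-reported study income were both available, the comparison could be made directly. A sketch on toy data, with hypothetical column names `money_work` and `money_study_self`:

```python
import pandas as pd

# Toy data; column names are hypothetical, not the pipeline's.
toy = pd.DataFrame({
    "period": [1, 1, 2, 2, 3],
    "treated": [1, 1, 1, 1, 1],
    "money_work": [24.0, 60.0, 36.0, 10.0, 0.0],
    "money_study_self": [24.0, 24.0, 24.0, 24.0, 24.0],
})

# Candidate partial double reporters: work income strictly above study income,
# so the report could contain study income plus genuine earnings.
possible_partial = toy[
    (toy["treated"] == 1)
    & (toy["money_study_self"] > 0)
    & (toy["money_work"] > toy["money_study_self"])
]

# Count of candidates per period.
upper_bound = possible_partial["period"].value_counts().sort_index()
```

Exact matches (the first row) and amounts below study income (the fourth row) are excluded, so only reports that could hide study income inside a larger total are counted.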

Endline Double Reporting

We do not ask people at endline to report income from the study: the question appears in the PDF version of the survey but was missed in the programming. However, we do remind people not to report study income as work income, and that reminder largely solved the problem in periods 4 and 5 (37 and 1 flagged households, respectively, in Table 1). There is no reason to assume it would not have done so at endline as well. Only 23 treated participants report work income that is a multiple of 12 at endline, so double reporting is unlikely.
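
A count of this kind can be produced along the following lines. The sketch uses toy data and assumes endline is period 6. It mirrors the `mult_12_corrected` rule defined at the top of this document, which excludes 0, 60 and 120. Whether the figure of 23 was computed with exactly this rule is not stated.

```python
import pandas as pd

# Toy data only; these are not the study's values.
toy = pd.DataFrame({
    "period": [6, 6, 6, 6, 5],
    "treated": [1, 1, 1, 0, 1],
    "money_earn_p_corrected": [24.0, 60.0, 50.0, 36.0, 12.0],
})

# Multiple of 12, excluding 0, 60 and 120 as in the rule defined earlier.
amount = toy["money_earn_p_corrected"]
toy["mult_12_corrected"] = (
    (amount % 12 == 0) & ~amount.isin([0, 60, 120])
).astype(int)

# Assumption: endline corresponds to period 6.
endline_treated = toy[(toy["period"] == 6) & (toy["treated"] == 1)]
n_mult_12 = int(endline_treated["mult_12_corrected"].sum())
```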