A set of regressions run on outputs requested by Nick.
1 U-Shaped Graphs
Code
```{python}
import pandas as pd
from plotnine import (
    ggplot, aes, geom_line, labs, scale_color_manual, theme_minimal
)

merged = pd.read_stata(
    "/Users/st2246/Work/Pilot3/data/generated/main/transform/30_merged_panel-hh_id-period.dta",
    convert_categoricals=False,
)
merged_grouped = (
    merged.groupby(["period", "treatment"])["money_work_p_corrected_99"]
    .mean()
    .reset_index()
)
merged_grouped["treatment"] = merged_grouped["treatment"].replace(
    {0: "Control", 1: "Stable", 2: "Predictable", 3: "Risky"}
)

# Plot mean participant work income by period and treatment
(
    ggplot(
        merged_grouped,
        aes(
            x="period",
            y="money_work_p_corrected_99",
            color="treatment",
            group="treatment",
        ),
    )
    + geom_line()
    + labs(title="Work Income (Participant)", x="Period", y="Cedis")
    + scale_color_manual(
        name="Treatment",
        values={
            "Control": "#F8766D",
            "Stable": "#00BA38",
            "Predictable": "#619CFF",
            "Risky": "#FF61C3",
        },
    )
    + theme_minimal()
)
```
The goal of this analysis is to investigate whether the u-shaped curve in employment-related outcomes is driven by survey modality (phone vs. in-person) or by seasonality. The latter is a viable explanation: the baseline was conducted during the planting season while the endline falls close to the harvest season, which would explain the spikes in labor at both ends.
2 Planting and Harvest Dates
Below, we use participants’ self-reported planting and harvest dates to confirm that the planting and harvest seasons overlap with the baseline and endline, respectively.
2.1 Planting
Important
The dates are winsorized. The earliest date in the raw data is in 2006 (likely because “months” was accidentally selected instead of “days”).
Figure 2: Earliest and latest planting date per household
2.2 Harvest
Important
Harvest dates combine two sources: endline (actual, back-calculated) and phone surveys (expected, forward-calculated). Dates are winsorized to Jun 2025 – Feb 2026.
Figure 4: Earliest and latest harvest date per household
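The winsorization applied to these dates can be sketched with pandas `Series.clip`; the dates below are made up for illustration, and only the Jun 2025 – Feb 2026 window comes from the text above.

```python
import pandas as pd

# Hypothetical harvest dates, including one implausibly early
# and one implausibly late entry
harvest = pd.Series(pd.to_datetime(["2025-04-20", "2025-11-03", "2026-03-15"]))

# Winsorize to the Jun 2025 - Feb 2026 window used above
lower = pd.Timestamp("2025-06-01")
upper = pd.Timestamp("2026-02-28")
winsorized = harvest.clip(lower=lower, upper=upper)
```

Out-of-range dates are pulled to the nearest bound; in-range dates pass through unchanged.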
3 Date and Impact on Labor
Given that the overlap is confirmed, we want to disentangle the impact of survey modality from that of seasonality in explaining the u-shaped curve. To do so, we exploit variation in survey dates within each period.
Code
```{python}
period_dta = pd.read_stata(f"{TIDY}/16_period_hh_level-hh_id-period.dta")

# Survey date is in the startdate variable; generate a days-since-start variable
period_dta["days_since_start"] = (
    period_dta["startdate"] - period_dta["startdate"].min()
).dt.days

participant_employment = pd.read_stata(
    "/Users/st2246/Work/Pilot3/data/generated/main/transform/02_flag_double_reports-hh_id-period.dta"
)
employment = pd.read_stata(
    f"{TIDY}/05_Employment-hh_id-period-member_id.dta", convert_categoricals=False
)
adults = pd.read_stata(
    f"{TIDY}/08_inperson_demographics-hh_id-period-member_id.dta",
    convert_categoricals=False,
)
adults = adults[adults["age"] >= 18]
adults = adults[["hh_id", "member_id", "participant_id"]].drop_duplicates()

# Expand each adult to one row per period (7 periods)
adults = adults.loc[adults.index.repeat(7)].copy()
adults["period"] = adults.groupby(["hh_id", "member_id"]).cumcount()

emp = adults.merge(
    employment[
        [
            "hh_id",
            "period",
            "member_id",
            "money_work_99",
            "days_worked_",
            "work_engaged_",
        ]
    ],
    on=["hh_id", "period", "member_id"],
    how="left",
)

# Fill in 0s for money_work_99, days_worked_, and work_engaged_ when missing
emp["money_work_99"] = emp["money_work_99"].fillna(0)
emp["days_worked_"] = emp["days_worked_"].fillna(0)
emp["work_engaged_"] = emp["work_engaged_"].fillna(0)

# For rows where member_id == participant_id, replace employment vars with corrected values
emp = emp.merge(
    participant_employment[
        [
            "hh_id",
            "period",
            "money_work_p_corrected_99",
            "days_worked_p_corrected",
            "work_engaged_p_corrected",
        ]
    ],
    on=["hh_id", "period"],
    how="left",
)
is_participant = emp["member_id"] == emp["participant_id"]
emp.loc[is_participant, "money_work_99"] = emp.loc[
    is_participant, "money_work_p_corrected_99"
].values
emp.loc[is_participant, "days_worked_"] = emp.loc[
    is_participant, "days_worked_p_corrected"
].values
# No int cast here: casting missing corrected values to int raises a
# RuntimeWarning and silently produces garbage values
emp.loc[is_participant, "work_engaged_"] = emp.loc[
    is_participant, "work_engaged_p_corrected"
].values

# df_allmembers: all members (with participant corrections) merged with period data
df_allmembers = emp.merge(
    period_dta[["hh_id", "period", "startdate", "days_since_start", "enum_id"]],
    on=["hh_id", "period"],
)
df_allmembers["inperson_survey"] = (
    df_allmembers["period"].apply(lambda p: 0 if 1 <= p <= 5 else 1).astype(int)
)

# df_noparticipant: exclude the participant row
df_noparticipant = df_allmembers[
    df_allmembers["member_id"] != df_allmembers["participant_id"]
].copy()

# df: participant-only rows for participant-level regressions
df = df_allmembers[df_allmembers["member_id"] == df_allmembers["participant_id"]].copy()
```
3.1 Piecewise Linear Regression to Find Optimal Breakpoint
Under the seasonality hypothesis, we would expect the relationship between days since start and employment outcomes to be non-linear. At the beginning, it should be negative as we move further away from the planting season. However, as we approach the harvest season, the relationship should become positive.
Below, we run a series of piecewise linear regressions to find the optimal breakpoint. That is, we look for the date at which the relationship between days since start and employment outcomes changes by finding the date that minimizes the residual sum of squares.
Code
```{python}
import numpy as np
import statsmodels.api as sm


def find_best_spline(column: str) -> tuple:
    y = df_allmembers[column].values
    x = df_allmembers["days_since_start"].values

    # Search over candidate breakpoints (exclude extremes)
    candidates = np.arange(x.min(), x.max())
    rss_values = []
    for bp in candidates:
        # Piecewise linear basis: x and the hinge term (x - bp)_+
        x_spline = np.maximum(x - bp, 0)
        X = sm.add_constant(np.column_stack([x, x_spline]))
        model = sm.OLS(y, X).fit()
        rss_values.append(model.ssr)
    candidate_dates = [
        df["startdate"].min() + pd.Timedelta(days=int(bp)) for bp in candidates
    ]
    return candidates, rss_values, candidate_dates
```
It seems early August is the ideal splitting point. Given the smoothness of the curves, I am not too concerned with picking the exact date. For the purposes of the analysis below, I will use August 1st.
Code
```{python}
# Best candidate date is Aug 9; subtract 8 days to land on Aug 1
best_day = candidates[np.argmin(rss_values)] - 8
df_allmembers["days_after_start"] = df_allmembers["days_since_start"].apply(
    lambda x: min(x, best_day)
)
df_allmembers["days_since_aug1"] = df_allmembers["days_since_start"].apply(
    lambda x: max(0, x - best_day)
)
df_participant = df_allmembers[
    df_allmembers["member_id"] == df_allmembers["participant_id"]
].copy()
df_noparticipant = df_allmembers[
    df_allmembers["member_id"] != df_allmembers["participant_id"]
].copy()
```
3.2 Regressions
The regressions below suggest that the u-shaped curve is mostly driven by seasonality.
Code
```{python}
from stargazer.stargazer import Stargazer


def run_regs(data: pd.DataFrame, dv: str) -> list:
    sub = data[
        [
            dv,
            "days_since_start",
            "days_after_start",
            "days_since_aug1",
            "inperson_survey",
            "enum_id",
        ]
    ].dropna()
    y = sub[dv]
    # TODO: add a spec that interacts days_since_aug1 and days_after_start
    # with inperson_survey
    specs = [
        ["days_since_start"],
        ["days_after_start", "days_since_aug1"],
        ["days_since_start", "inperson_survey"],
        ["days_after_start", "days_since_aug1", "inperson_survey"],
        ["days_since_start", "inperson_survey", "enum_id"],
        ["days_after_start", "days_since_aug1", "inperson_survey", "enum_id"],
    ]
    enum_dummies = pd.get_dummies(
        sub["enum_id"], prefix="enum", drop_first=True, dtype=float
    )
    models = []
    for spec in specs:
        if "enum_id" in spec:
            cols = [c for c in spec if c != "enum_id"]
            X = sm.add_constant(pd.concat([sub[cols], enum_dummies], axis=1))
        else:
            X = sm.add_constant(sub[spec])
        models.append(sm.OLS(y, X).fit())
    return models


models_all = run_regs(df_allmembers, "days_worked_")
models_p = run_regs(df_participant, "days_worked_p_corrected")
```
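The interacted specification flagged in the TODO could be sketched as follows, on synthetic data (all names and values hypothetical). With the `statsmodels` formula API, `a * b` expands to both main effects plus the `a:b` interaction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "days_after_start": rng.uniform(0, 90, n),
    "days_since_aug1": rng.uniform(0, 60, n),
    "inperson_survey": rng.integers(0, 2, n),
})
# Outcome built from made-up coefficients roughly in the range seen above
demo["days_worked"] = (
    -0.006 * demo["days_after_start"]
    + 0.019 * demo["days_since_aug1"]
    + 0.4 * demo["inperson_survey"]
    + rng.normal(0, 1, n)
)

# Interact both slope terms with survey modality
model = smf.ols(
    "days_worked ~ days_after_start * inperson_survey"
    " + days_since_aug1 * inperson_survey",
    data=demo,
).fit()
```

The interaction coefficients then test whether the pre- and post-August slopes differ between phone and in-person rounds.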
```{python}
#| output: asis
def render_table(models: list, title: str) -> str:
    sg = Stargazer(models)
    sg.title(title)
    sg.custom_columns(
        ["Days Since Start", "w/ split", "Control for Modality", "",
         "Control for Enum", "(6)"],
        [1, 1, 1, 1, 1, 1],
    )
    # Collect all enum dummy names across all models and hide them
    enum_cols = []
    for m in models:
        enum_cols += [p for p in m.params.index if p.startswith("enum_")]
    if enum_cols:
        all_params = []
        for m in models:
            for p in m.params.index:
                if not p.startswith("enum_") and p not in all_params:
                    all_params.append(p)
        sg.covariate_order(all_params)
    sg.add_line("Enum FE", ["No", "No", "No", "No", "Yes", "Yes"])
    return sg.render_html()


print(render_table(models_all, "Days Worked — All Household Adults"))
```
**Days Worked — All Household Adults**

*Dependent variable:* `days_worked_`

|                     | Days Since Start | w/ split | Control for Modality |          | Control for Enum | (6)      |
|---------------------|------------------|----------|----------------------|----------|------------------|----------|
|                     | (1)              | (2)      | (3)                  | (4)      | (5)              | (6)      |
| const               | 0.528***         | 0.760*** | 0.326***             | 0.603*** | 0.743***         | 0.969*** |
|                     | (0.016)          | (0.018)  | (0.018)              | (0.036)  | (0.131)          | (0.132)  |
| days_since_start    | -0.002***        |          | -0.001***            |          | -0.002***        |          |
|                     | (0.000)          |          | (0.000)              |          | (0.000)          |          |
| days_after_start    |                  | -0.006*** |                     | -0.004*** |                 | -0.006*** |
|                     |                  | (0.000)  |                      | (0.000)  |                  | (0.000)  |
| days_since_aug1     |                  | 0.019*** |                      | 0.013*** |                  | 0.013*** |
|                     |                  | (0.001)  |                      | (0.002)  |                  | (0.001)  |
| inperson_survey     |                  |          | 0.438***             | 0.172*** | 0.447***         | 0.160*** |
|                     |                  |          | (0.016)              | (0.034)  | (0.023)          | (0.036)  |
| Enum FE             | No               | No       | No                   | No       | Yes              | Yes      |
| Observations        | 54873            | 54873    | 54873                | 54873    | 54873            | 54873    |
| R²                  | 0.001            | 0.016    | 0.015                | 0.016    | 0.145            | 0.146    |
| Adjusted R²         | 0.001            | 0.016    | 0.015                | 0.016    | 0.143            | 0.145    |
| Residual Std. Error | 1.706 (df=54871) | 1.694 (df=54870) | 1.695 (df=54870) | 1.694 (df=54869) | 1.581 (df=54767) | 1.579 (df=54766) |
| F Statistic         | 69.691*** (df=1; 54871) | 438.338*** (df=2; 54870) | 411.538*** (df=2; 54870) | 300.860*** (df=3; 54869) | 88.209*** (df=105; 54767) | 88.525*** (df=106; 54766) |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01
Code
```{python}
#| output: asis
print(render_table(models_p, "Days Worked — Participant Only"))
```