Below is a tool to help explore the data. For documentation on how the code is structured, read the coding guide.
Variables
The table below contains a list of all the tidy variables. These correspond directly to what we asked in the survey or simple transformations of them.
Variations
Note that many variables have variations within the dataset that are omitted from the list below for brevity. Variations of a variable are denoted by a suffix appended to the original name. For example, for a hypothetical variable income, income_z would denote the z-score of income.
Common suffixes and their definitions:
_z: z-score of the variable based on the baseline distribution
_99 / _95 : variable winsorized at the 99th (or 95th) percentile
_pc: per capita version of the variable (divided by household size)
_ae: adult equivalent version of the variable (divided by adult equivalent household size)
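To make the suffix conventions concrete, here is a sketch of how the variants could be derived from a base variable. This is illustrative only: the function names (zScore, winsorize, perCapita) are hypothetical and the real definitions live in the Stata do-files, not in this code.

```javascript
// _z: z-score relative to a reference (e.g. baseline) distribution.
// The reference defaults to the values themselves; in the actual data
// it would be the baseline distribution.
function zScore(values, reference = values) {
  const mean = reference.reduce((a, b) => a + b, 0) / reference.length;
  const sd = Math.sqrt(
    reference.reduce((a, b) => a + (b - mean) ** 2, 0) / (reference.length - 1)
  );
  return values.map(v => (v - mean) / sd);
}

// _99 / _95: cap values at the given percentile (simple sorted-index
// approximation; the do-files may use a different percentile rule).
function winsorize(values, pct = 0.99) {
  const sorted = [...values].sort((a, b) => a - b);
  const cap = sorted[Math.min(sorted.length - 1, Math.floor(pct * sorted.length))];
  return values.map(v => Math.min(v, cap));
}

// _pc: per capita, i.e. divided by household size
// (_ae is analogous, dividing by adult-equivalent household size).
const perCapita = (value, hhSize) => value / hhSize;
```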
Some index variables also exist. Their descriptions clarify how they are constructed; alternatively, their definitions can be found in the do-file that generated the relevant dataset.
The table below contains a list of all the tidy datasets available in the data repository along with their descriptions and paths. By tidy data, I mean data that corresponds very closely to what we asked directly, grouped by the survey module it comes from.
Exceptions
While most of the datasets follow the above principle, there are a few exceptions.
First, some datasets are aggregated versions of the raw data. These exist for convenience and to ensure that aggregation is done close to where we clean the raw data, which helps avoid mistakes. Additionally, for some modules we collect only aggregated data in the phone survey; in these cases the in-person data is aggregated and joined with the phone survey data.
The Survey column denotes which survey the data comes from, as well as whether it was aggregated.
Another exception concerns variable grouping. Some sets of variables have no obvious grouping. To avoid creating a separate do-file and dataset for one or two questions, I grouped those variables into the two datasets below:
07_InPerson_hh_level-hh_id-period -> questions asked to the household at the in-person surveys (baseline and endline)
16_period_hh_level-hh_id-period -> questions asked across all periods (some might be phone survey only)
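The dataset names above encode their observation level after the first hyphen. A small, hypothetical helper (parseDatasetId is not part of the actual codebase) shows how the naming convention could be split apart, assuming the hyphen-separator pattern shown above:

```javascript
// Split a dataset_id like "07_InPerson_hh_level-hh_id-period" into its
// name and the key variables that define its observation level.
function parseDatasetId(datasetId) {
  const [name, ...keys] = datasetId.split("-");
  return { name, keys };
}
```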
Code
```javascript
dataset_list = FileAttachment("library/data_dictionary/output/dataset_statistics.csv").csv({ typed: true })

viewof ds_search = Inputs.search(dataset_list, {
  placeholder: "Search datasets",
  label: "Search"
})

viewof visibleDatasetColumns = Inputs.checkbox(
  ["dataset_id", "description", "variable_count", "row_count", "observation_level", "sources"],
  {
    label: "Show Columns",
    value: ["dataset_id", "description", "variable_count", "row_count", "observation_level", "sources"],
    format: x =>
      x === "dataset_id" ? "Dataset"
      : x === "variable_count" ? "Variable Count"
      : x === "row_count" ? "Row Count"
      : x === "observation_level" ? "Observation Level"
      : x === "sources" ? "Survey"
      : "Description"
  }
)

/* The following dummies indicate the source of the data:
   phone, baseline, endline, fcm, timeuse, pickup, dropoff, beliefs, census.
   The checkbox below lets users filter by these sources. */
viewof sourceFilter = Inputs.checkbox(
  ["Phone", "Baseline", "Endline", "FCM", "Time Use", "Pickup", "Dropoff", "Beliefs", "Census"],
  {
    label: "From Survey",
    value: ["Phone", "Baseline", "Endline", "FCM", "Time Use", "Pickup", "Dropoff", "Beliefs", "Census"]
  }
)

filteredDatasetList = {
  let result = ds_search;
  if (sourceFilter.length > 0) {
    result = result.filter(d => {
      for (let source of sourceFilter) {
        // Remove spaces and lowercase to match the dummy column names
        let key = source.toLowerCase().replace(" ", "");
        if (d[key] === 1) return true;
      }
      return false;
    });
  }
  return result;
}
```
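The filtering logic in the last cell can also be expressed outside Observable. The sketch below assumes, as the cell's comment states, that each row carries 0/1 dummy columns named after the checkbox labels lowercased with spaces removed (e.g. "Time Use" matches a timeuse column); filterBySources is a hypothetical name, not part of the notebook.

```javascript
// Keep rows whose source dummy (phone, baseline, timeuse, ...) equals 1
// for at least one selected label. With no labels selected, return all
// rows, mirroring the sourceFilter.length > 0 guard above.
function filterBySources(rows, selectedSources) {
  if (selectedSources.length === 0) return rows;
  return rows.filter(row =>
    selectedSources.some(source => {
      const key = source.toLowerCase().replace(/ /g, "");
      return row[key] === 1;
    })
  );
}
```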