Below is step-by-step guidance on how to quickly generate a fully reproducible analysis.
Getting data from server: how to retrieve data from the server within R?
Data Analysis Plan within your xlsform: how to extend the xlsform to include your analysis plan?
Using console script: how to use the package without using the GUI in Shiny?
Sampling: how to generate a sample from a registry or to post-stratify data when having a low response rate?
Data Anonymisation and disclosure risk measurement: how to create anonymised data?
Data Cleaning: how to use the package for reproducible and documented data cleaning?
Predicting and scoring: how to use survey data in conjunction with registration data to build risk prediction and vulnerability scoring?
Disseminating: how to disseminate both survey microdata using DDI and variable crosstabulation on CKAN?
From RStudio, create a new project, then make sure to install the necessary packages:
hcrdata, to connect to both the Kobo and RIDL APIs
## Use UNHCR graphical template
## Perform High Frequency Check
## Process data crunching for survey dataset
You can now prepare your project:
library(koboloadeR) # This loads the koboloadeR package
kobo_projectinit() # Creates the necessary folders and transfers the files needed
This last function creates a structure of folders that is consistent with the regular R package structure:
R: where processing scripts are stored
data-raw: where raw data are stored
data: where processed data are kept
vignettes: where generated Rmarkdown files are stored
out: where generated reports (knitted markdown) in Word, PowerPoint or HTML are pushed
The initial step to start your project is to get your data.
The package uses only csv files. This avoids the limit on the number of columns that some versions of Excel can handle.
One important point to note is related to the limitation in terms of variable names in R: A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” or “2.way” are not valid, and neither are the reserved words.
If the original variable names in your xlsform start with a number, you will need to manually rename them in both your xlsform and the data you downloaded.
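One way to spot problematic names before renaming them by hand is to compare each name with what base R's make.names() would turn it into. This is only a sketch on made-up variable names; is_valid_name is a hypothetical helper, not a koboloadeR function.

```r
# Hypothetical variable names, as they might appear in an xlsform
var_names <- c("2.way", ".2way", "age", "if", "household_size")

# A name is syntactically valid when make.names() leaves it unchanged
is_valid_name <- function(x) x == make.names(x)

valid <- is_valid_name(var_names)

# Names that need to be renamed in both the xlsform and the data
to_rename <- var_names[!valid]

# make.names() proposes a valid replacement for each invalid name
suggested <- make.names(to_rename, unique = TRUE)
print(data.frame(original = to_rename, suggested = suggested))
```

The suggested names are only a starting point: whatever name you choose has to be applied consistently in the xlsform and in the downloaded data.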
To complete this step, you can either:
use the web interface and put the files into the data-raw folder
pull from the API with hcrdata
Open a new R script within a new RStudio project.
You should then be able to launch the “data browser” from the RStudio Addins menu or with the following command in your console:
From there you will need to:
1. select the source
2. go to the dataset tab and select the project you want to pull data from
3. go to the files tab and select the specific file you want to retrieve from the project
4. press the load data button; the R statement to pull this file from your project will be automatically inserted in your blank R script tab
Alternatively, if you have the unique ID of your Kobo project (dataset) and the name of the form file in your project (form), you can use the code below directly.
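## pulling data from Kobo
library(magrittr) # provides the %>% pipe used below
dataset <- "dataset-title-in-kobo"
form <- "name-of-the-form.xlsx"
## create the data-raw folder if it does not exist yet
if (!dir.exists("data-raw")) {
  dir.create("data-raw")
}
## fetch the form
hcrdata::hcrfetch(src = "kobo",
                  dataset = dataset,
                  file = form)
## fetch the data as json, convert it to a data frame
## and replace "/" in variable names with "."
data <-
  hcrdata::hcrfetch(src = "kobo",
                    dataset = dataset,
                    file = "data.json") %>%
  jsonlite::fromJSON() %>%
  purrr::pluck("results") %>%
  tibble::as_tibble() %>%
  purrr::set_names(~stringr::str_replace_all(., "(\\/)", "."))
write.csv(data, "data-raw/MainDataFrame.csv", row.names = FALSE)
## copy the form under the conventional name form.xlsx
file.copy(from = paste0("data-raw/", dataset, "/", form),
          to = "data-raw/form.xlsx")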
Note that for the rest of the process, it is convenient to name your form form.xlsx and your downloaded data frame MainDataFrame.csv.
You first need to make sure that the form is in the xlsx format so that it can be used by the package. If it is not, open your xls file in LibreOffice or Excel and save it in the right format.
The next step is to extend your xlsform:
## Change here the precise name of the form if required
form <- "form.xlsx"
## Extend xlsform with required column if necessary - done only once!
Once the xlsform has been extended, re-open it in your favorite spreadsheet software and review the labels for both the questions in the survey worksheet and the question modalities in the choice worksheet. Note that labels for questions should be less than 80 characters long and labels for modalities should be less than 40 characters.
The columns disaggregation, correlate and variable are used to document the analysis plan:
disaggregation: used to flag variables used to facet
correlate: used to flag variables used for a statistical test of independence (for categorical variables) or for correlation (for numeric variables)
variable: used to flag ordinal variables so that graphs are not ordered by frequency
clean: used to reference the csv file to use for cleaning
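The label length limits mentioned above can be checked programmatically before running the analysis. The sketch below works on small made-up data frames standing in for the two worksheets; the column names and the helper too_long are assumptions for illustration, and in practice the worksheets would first be read from form.xlsx.

```r
# Hypothetical extracts of the two worksheets of an xlsform
survey <- data.frame(
  name  = c("hh_size", "shelter_type"),
  label = c("How many people live in your household?",
            paste(rep("A very long question label", 5), collapse = " ")),
  stringsAsFactors = FALSE
)
choices <- data.frame(
  list_name = c("shelter", "shelter"),
  name      = c("tent", "house"),
  label     = c("Tent", "Permanent house built with bricks or concrete blocks"),
  stringsAsFactors = FALSE
)

# Return the rows whose label is not strictly shorter than max_chars
too_long <- function(sheet, max_chars) {
  sheet[nchar(sheet$label) >= max_chars, c("name", "label")]
}

long_questions  <- too_long(survey, 80)   # question labels: less than 80 characters
long_modalities <- too_long(choices, 40)  # modality labels: less than 40 characters

print(long_questions$name)
print(long_modalities$name)
```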
A well-designed and tested survey should minimise data cleaning issues. In particular, inconsistent answers can be anticipated and avoided through a series of well set-up constraints.
However, even with the best-designed questionnaires, there will still be some issues to fix.
Survey data cleaning may involve different steps:
Identifying and removing responses from individuals who either do not match the target audience criteria or did not answer your questions thoughtfully. With self-administered online questionnaires, there may also be “speeders” and “flat-liners” (respondents rushing through the questionnaire); in such situations, date/time stamps on questions or groups of questions can help identify the records to be removed.
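As an illustration of how time stamps can help identify speeders, the sketch below computes the duration of each interview and flags the shortest ones. The data, the column names start and end, and the 5-minute threshold are all assumptions; this is not a koboloadeR function.

```r
# Hypothetical records with start and end time stamps for each interview
responses <- data.frame(
  id    = 1:4,
  start = as.POSIXct(c("2021-03-01 09:00:00", "2021-03-01 09:10:00",
                       "2021-03-01 10:00:00", "2021-03-01 11:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2021-03-01 09:25:00", "2021-03-01 09:13:00",
                       "2021-03-01 10:30:00", "2021-03-01 11:04:00"), tz = "UTC")
)

# Interview duration in minutes
responses$duration_min <- as.numeric(
  difftime(responses$end, responses$start, units = "mins")
)

# Assumed threshold: interviews shorter than 5 minutes are flagged for review
min_duration <- 5
responses$speeder <- responses$duration_min < min_duration

# Identifiers of the records to review before any removal
flagged <- responses[responses$speeder, "id"]
print(flagged)
```

Flagged records are candidates for review rather than automatic deletion; the appropriate threshold depends on the length of the questionnaire.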
Respondents often use the final “other” option to enter free text. The result is an open-ended question that is very difficult to analyse. Re-encoding certain select_one list_name or_other variables is therefore quite often a necessary step.
koboloadeR has some functions to handle this situation.
Insert a column named clean and reference the csv file to use for cleaning.
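To make the idea concrete, here is a minimal sketch of re-encoding free-text “other” answers with a lookup table. It does not reproduce how koboloadeR applies the clean column internally; the data, the column names and the structure of the cleaning table are assumptions.

```r
# Hypothetical data: a select_one question with an "other" free-text field
survey_data <- data.frame(
  shelter       = c("tent", "other", "other", "house"),
  shelter_other = c(NA, "Tente", "caravan", NA),
  stringsAsFactors = FALSE
)

# Hypothetical cleaning table; in practice it could be read with read.csv()
# from the csv file referenced in the clean column
recode <- data.frame(
  raw     = c("Tente", "caravan"),
  cleaned = c("tent", "vehicle"),
  stringsAsFactors = FALSE
)

# Look up each free-text answer in the cleaning table (NA when no match)
idx <- match(survey_data$shelter_other, recode$raw)

# Replace "other" by the cleaned category where a match exists
survey_data$shelter_clean <- ifelse(
  survey_data$shelter == "other" & !is.na(idx),
  recode$cleaned[idx],
  survey_data$shelter
)
print(survey_data)
```

Keeping the lookup in a csv file rather than editing the data by hand is what makes the cleaning step documented and reproducible.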
You have now done the biggest part of the work. You can already push some of those documents to the data repository. The standard for this is the UNHCR Raw Internal Data Library (RIDL), which is based on CKAN servers, the same software used for HDX, the Humanitarian Data Exchange.
Once you have generated all potential markdown files, you will end up with a lot of visuals. It is therefore key to carefully select the most relevant visuals to present for interpretation. To keep participants focused, a typical joint data interpretation session should last no more than 2 hours and include no more than 60 visuals/slides.
You can create an empty markdown file using the unhcRstyle::unhcr_templ_ppt PowerPoint template and copy/paste the most relevant charts into this new file.
In order to guide this selection phase, the data crunching expert and report designer, in collaboration with the data analysis group, can use the following elements:
For numeric values, check the frequency distribution of each variable: average, deviation, outliers and oddities
For categorical variables, check for unexpected values: any weird results based on common sense expectations
Use correlation analysis to check for potential contradictions in respondents' answers to different questions for identified associations (chi-square)
Always check for missing data (NA) or a “% of respondents who answered” that you cannot confidently explain
Check unanswered questions that correspond to unused skip logic in the questionnaire: for instance, did a person who was never displaced answer displacement-related questions? Were employment-related answers provided for a toddler?
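Some of the checks listed above can be sketched in a few lines of base R. The example below is an illustration on made-up data: the variable names, the helper check_numeric and the 1.5 x IQR outlier rule are assumptions, not part of koboloadeR.

```r
# Hypothetical survey extract
df <- data.frame(
  hh_size           = c(1, 2, 3, 4, 5, 100, NA),
  ever_displaced    = c("yes", "no", "no", "yes", "no", "yes", "no"),
  displacement_year = c(2015, NA, 2018, 2016, NA, 2019, NA),
  stringsAsFactors = FALSE
)

# Summarise a numeric variable: average, deviation, share of missing values
# and outliers (assumed rule: beyond 1.5 x IQR from the quartiles)
check_numeric <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  is_outlier <- !is.na(x) & (x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr)
  list(
    mean          = mean(x, na.rm = TRUE),
    sd            = sd(x, na.rm = TRUE),
    share_missing = mean(is.na(x)),
    outliers      = x[is_outlier]
  )
}

hh_size_check <- check_numeric(df$hh_size)
print(hh_size_check)

# Skip-logic check: rows where a person who was never displaced
# nevertheless answered a displacement-related question
inconsistent <- which(df$ever_displaced == "no" & !is.na(df$displacement_year))
print(inconsistent)
```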
Before the session, agree in advance on who will take notes. That person can write the notes directly in the markdown file.
When analyzing those representations in a collective setting during data interpretation sessions, you may:
Reflect: question data quality and/or make suggestions to adjust questions, identify additional cleaning steps;
Interpret: develop qualitative interpretations of data patterns;
Recommend: suggest recommendations in terms of programmatic adjustment;
Classify: define level of sensitivity for certain topics if required.
Peer review is essential to produce good analysis. It is performed by submitting your Rmd files to your data analysis focal point in the Regional Bureau.
Before submitting your markdown files, plug them directly to the correct RIDL container with the following code chunk:
## pulling data from RIDL
dataset <- "dataset-title-in-ridl"
## create the project folders if they do not exist yet
for (folder in c("data-raw", "data", "R")) {
  if (!dir.exists(folder)) {
    dir.create(folder)
  }
}
## fetch the form
hcrdata::hcrfetch(src = "ridl",
                  dataset = dataset,
                  file = "form.xls",
                  #path= here::here("data-raw", file),
                  cache = TRUE)
## fetch the data
hcrdata::hcrfetch(src = "ridl",
                  dataset = dataset,
                  file = "maindataframe.csv",
                  #path= here::here("data-raw", file),
                  cache = TRUE)
## copy both files under their conventional names
file.copy(from = paste0("data-raw/", dataset, "/form.xls"),
          to = "data-raw/form.xls")
file.copy(from = paste0("data-raw/", dataset, "/maindataframe.csv"),
          to = "data-raw/MainDataFrame.csv")
form <- "form.xls"