Reading the Demographic and Health Survey (DHS) data

USAID’s Demographic and Health Survey Data

The data-sets we want are for Egypt, and two in particular – (1) the 2014 DHS-VI (Standard DHS), and (2) the 2015 DHS-VII (Special). These can be downloaded in two ways, either by logging in and manually selecting and downloading the data-set you want and in the format you want (flat/SPSS/SAS/Stata), or then by using the lodown package. The lodown route is the more efficient way to go, and the one I take below.

library(lodown)

dhs_cat <- get_catalog("dhs",
                       output_dir = "data/DHS", 
                       your_email = "ruhil@ohio.edu", 
                       your_password = "eMg-H5G-VAr-d7H", 
                       your_project = "Fullbright Junior Scholar Training"
                       )

dhs_cat14 <- subset(dhs_cat, country == 'Egypt' & year == 2014)

lodown("dhs", dhs_cat14, output_dir = "data/DHS", 
    your_email = "ruhil@ohio.edu", 
    your_password = "eMg-H5G-VAr-d7H", 
    your_project = "Fullbright Junior Scholar Training")

dhs_cat15 <- subset(dhs_cat, country == 'Egypt' & year == 2015)

lodown("dhs", dhs_cat15, output_dir = "data/DHS", 
    your_email = "ruhil@ohio.edu", 
    your_password = "eMg-H5G-VAr-d7H", 
    your_project = "Fullbright Junior Scholar Training")

If you look inside the data/DHS folder you should see a few data-sets with the *.rds extension. You will have to choose which one of these you want to work with, for example, the household file for 2014 is called EGHR61FL.rds.

All of this assumes, of course, that you have gone over the recode manuals and maps for each of these survey years. If you have not done that prepwork, then you should do so now. The links to the recode manuals and recode maps are given below:

You will also want to have handy a copy of the Final Report for the country. The 2014 Final Report (English) is available here and the 2015 Final Report (English) is available here

Let us read in a specific data-set, the 2014 DHS file for women.

library(survey)

dhs.df14 <- readRDS("data/DHS/Egypt/Standard DHS 2014/EGIR61FL.rds")

dim(dhs.df14)
[1] 21762  4095

We have 21,762 observations (rows) and some 4095 variables (columns). All the more reason to really know what variables are of interest to us.

Labeling Values and Recoding Variables

Say I am interested in the age distribution. This happens to be variable v013 but the responses have to be labeled with the age-groups each represents. This can be done as shown below, and note that I am creating a new variable called agecat rather than overwriting the original variable.

library(sjmisc)

dhs.df14$agecat <- to_label(dhs.df14$v013) # Labeling the values

table(dhs.df14$agecat) # Check the frequencies 

15-19 20-24 25-29 30-34 35-39 40-44 45-49 
  738  3051  4718  4133  3473  2902  2747 

Let us also create a labeled variable that flags the woman’s educatonal attainment.

dhs.df14$educat <- to_label(dhs.df14$v149) # Labeling the values

table(dhs.df14$educat) # Check the frequencies 

        no education   incomplete primary     complete primary 
                4861                 1239                  940 
incomplete secondary   complete secondary               higher 
                2935                 8595                 3192 

How does educational attainment vary by age?

table(dhs.df14$agecat, dhs.df14$educat)
       
        no education incomplete primary complete primary
  15-19           70                 65               46
  20-24          330                 99              110
  25-29          705                149              125
  30-34          806                234              229
  35-39          841                213              209
  40-44          958                219               98
  45-49         1151                260              123
       
        incomplete secondary complete secondary higher
  15-19                  364                185      8
  20-24                  606               1515    391
  25-29                  590               2186    963
  30-34                  422               1710    732
  35-39                  422               1273    515
  40-44                  356                965    306
  45-49                  175                761    277

Maybe you don’t want to use these expansive educational attainment categories and instead rely on fewer categories. One such recoding is shown below.

library(tidyverse)

dhs.df14 <- dhs.df14 %>%
  mutate(
    educat2 = case_when(
      v149 == 0 ~ "No education",
      v149 == 1 ~ "Some primary",
      v149 == 2 ~ "Primary Complete/Some secondary",
      v149 == 3 ~ "Primary Complete/Some secondary",      
      v149 == 4 ~ "Secondary complete/higher",
      v149 == 5 ~ "Secondary complete/higher"),
    educat2 = ordered(educat2, levels = c("No education", "Some primary", "Primary Complete/Some secondary", "Secondary complete/higher"))
    )

Let us generate a few more variables, and clean them up a bit.

dhs.df14$urban.rural <- to_label(dhs.df14$v025) # urban-vs-rural residence

dhs.df14$wealthq  <- to_label(dhs.df14$v190) # wealth quintiles 

dhs.df14$governorates <- to_label(dhs.df14$sgov) # one measure of governorates
table(dhs.df14$governorates)

   urban governorates                 cairo            alexandria 
                    0                  1189                   737 
            port said                  suez           lower egypt 
                  800                   941                     0 
             damietta              dakahlia               sharkia 
                  986                   955                  1011 
             kalyubia        kafr el-sheikh               gharbia 
                  850                   945                   835 
             menoufia                behera              ismailia 
                  855                  1088                   859 
          upper egypt                  giza             beni suef 
                    0                  1076                   875 
               fayoum                 menya                assuit 
                  843                   858                   965 
               souhag                  qena                 aswan 
                  913                  1055                   886 
                luxor frontier governorates               red sea 
                  905                     0                   387 
           new valley                matroh 
                  443                   505 
dhs.df14$governorates <- stringr::str_to_title(dhs.df14$governorates) # changing labels from lowercase to titlecase

dhs.df14$governorates <- factor(dhs.df14$governorates) # Making sure this variable has a unique numeric code for each unique value 

dhs.df14$governorates <- droplevels(dhs.df14$governorates) # Drop levels with 0 frequencies 

table(dhs.df14$governorates) # Checking frequencies 

    Alexandria         Assuit          Aswan         Behera 
           737            965            886           1088 
     Beni Suef          Cairo       Dakahlia       Damietta 
           875           1189            955            986 
        Fayoum        Gharbia           Giza       Ismailia 
           843            835           1076            859 
Kafr El-Sheikh       Kalyubia          Luxor         Matroh 
           945            850            905            505 
      Menoufia          Menya     New Valley      Port Said 
           855            858            443            800 
          Qena        Red Sea        Sharkia         Souhag 
          1055            387           1011            913 
          Suez 
           941 
dhs.df14$nchild <- dhs.df14$v201 # Total number of children ever born. If there are fewer than twenty births then this is the same as V224 (Number of entries in the birth history), but if there are more than twenty births then this gives the full number, while V224 will be 20

dhs.df14$currently.working <- to_label(dhs.df14$v014) # Whether the respondent is currently working

dhs.df14$region <- factor(
  dhs.df14$sgov2, 
  levels = c(0, 1, 4, 7), 
  labels = c("Urban Governorates", "Lower Egypt", "Upper Egypt", "Frontier Governorates")
  )

Survey Weights

The DHS surveys have a complex survey design that is necessary to allow your survey estimates to be generalized to specific entities. This is what the documentation says:

All ever-married women age 15-49 who were usual members of the selected households and those who spent the night before the survey in the selected households were eligible to be interviewed in the survey. The sample for the 2014 EDHS was designed to provide estimates of population and health indicators including fertility and mortality rates for the country as a whole and for six major subdivisions (Urban Governorates, urban Lower Egypt, rural Lower Egypt, urban Upper Egypt, rural Upper Egypt, and the Frontier Governorates).

… Due to the non-proportional allocation of the 2014 EDHS sample to different governorates and to urban and rural areas as well as to the differences in response rates, sampling weights are required for any analysis using the survey data to ensure the actual representativeness of the survey results at national level as well as at the domain level.

Because we have to use sampling weights, we will first have to identify and define the elements that will go into the weighting. These elements will be the primary sampling unit (psu) which is measured by v021, the strata indicator, measured by v022, and the weight assigned to each woman respondent measured by v005.

dhs.df14$weight <- dhs.df14$v005

dhs.df14$psu <- dhs.df14$v021

dhs.df14$strata <- dhs.df14$v022

Now we tell R how to use these elements to weight the data. This is done with the srvyr package. Note that before I specify the survey design I am retaining only a few variables for weighting.

library(survey)
library(srvyr)

dhs_egypt14 <- dhs.df14 %>%
  select(
    c("psu", "strata", "weight", "educat", "educat2",
      "wealthq", "agecat", "urban.rural", "nchild",
      "governorates", "region", "currently.working")
    )

dhs14_design <- svydesign(~psu,
                          strata = ~strata,
                          weights = ~weight,
                          data = dhs_egypt14)

Now R knows how to weight any analysis we carry out. Let us start with a simple example. I want some weighted estimates, just so I can be confident that our code is working. One way to check this would be to look at the estimates in Table 3.1 (page 28) of the DHS 2014 Final Report.

tab.1 <- svymean(~ agecat, dhs14_design, na.rm = TRUE)
tab.1 <- as.data.frame(tab.1) * 100
tab.1 <- round(tab.1, digits = 1)
tab.1
            mean  SE
agecat15-19  3.5 0.2
agecat20-24 14.0 0.3
agecat25-29 21.8 0.3
agecat30-34 19.0 0.3
agecat35-39 16.1 0.3
agecat40-44 13.2 0.3
agecat45-49 12.4 0.3
tab.2 <- svymean(~ urban.rural, dhs14_design, na.rm = TRUE)
tab.2 <- as.data.frame(tab.2) * 100
tab.2 <- round(tab.2, digits = 1)
tab.2
                 mean  SE
urban.ruralurban   35 0.8
urban.ruralrural   65 0.8
tab.3 <- svymean(~ region, dhs14_design, na.rm = TRUE)
tab.3 <- as.data.frame(tab.3) * 100
tab.3 <- round(tab.3, digits = 1)
tab.3
                            mean  SE
regionUrban Governorates    12.7 0.7
regionLower Egypt           49.0 0.8
regionUpper Egypt           37.4 0.8
regionFrontier Governorates  0.9 0.1
tab.4 <- svymean(~ educat2, dhs14_design, na.rm = TRUE)
tab.4 <- as.data.frame(tab.4) * 100
tab.4 <- round(tab.4, digits = 1)
tab.4
                                       mean  SE
educat2No education                    24.0 0.6
educat2Some primary                     6.1 0.2
educat2Primary Complete/Some secondary 17.4 0.4
educat2Secondary complete/higher       52.4 0.8
tab.5 <- svymean(~ wealthq, dhs14_design, na.rm = TRUE)
tab.5 <- as.data.frame(tab.5) * 100
tab.5 <- round(tab.5, digits = 1)
tab.5
               mean  SE
wealthqpoorest 17.9 0.7
wealthqpoorer  19.7 0.5
wealthqmiddle  22.2 0.6
wealthqricher  20.9 0.8
wealthqrichest 19.4 0.7

These were weighted relative frequency distributions for a single variable. How about checking a few against Table 3.2? Here we have the fuller educational attainment variable (educat) broken down by age, urban-rural, and so on. Let us do it by agecat and wealthq.

tab.6 <- svyby(~ educat, ~urban.rural, dhs14_design, svymean, na.rm = TRUE)
tab.6 <- tab.6[, 2:7] * 100 
tab.6 <- round(tab.6, digits = 1) 
tab.6
      educatno education educatincomplete primary
urban               13.8                      4.6
rural               29.6                      7.0
      educatcomplete primary educatincomplete secondary
urban                    4.6                       12.2
rural                    3.9                       13.9
      educatcomplete secondary educathigher
urban                     42.1         22.7
rural                     36.6          9.1
tab.7 <- svyby(~ educat, ~wealthq, dhs14_design, svymean, na.rm = TRUE)
tab.7 <- tab.7[, 2:7] * 100 
tab.7 <- round(tab.7, digits = 1) 
tab.7
        educatno education educatincomplete primary
poorest               47.7                      8.6
poorer                37.4                      8.9
middle                18.9                      6.1
richer                14.0                      5.1
richest                5.4                      2.3
        educatcomplete primary educatincomplete secondary
poorest                    4.3                       13.8
poorer                     4.9                       14.8
middle                     3.7                       14.6
richer                     5.0                       13.8
richest                    2.8                        9.4
        educatcomplete secondary educathigher
poorest                     22.7          3.0
poorer                      30.4          3.6
middle                      46.5         10.2
richer                      44.8         17.3
richest                     45.3         34.8

Cleaner Output

You notice how the row names and column names are, aesthetically speaking, less than desirable. For example, I would rather not see “wealthqpoorest” but rather see “Poorest”. Ditto for “Percent” and “Std. Error” in place of “mean” and “SE”. How can we do that?

tab.5 <- tab.5 %>%
  mutate(
    names = rownames(.),
    Quintile = gsub("wealthq", "", names)
    ) 

Here we are mutating tab.5 by first making a column with the name of each row with names = rownames(.). We are then creating a new column called Quintile after replacing the string “wealthq” with nothing with Quintile = gsub("wealthq", "", names). But the entries in Quintile are all lowercase, so we end up making them titlecase with the code below.

tab.5$Quintile <- stringr::str_to_title(tab.5$Quintile)

We are almost done. What remains is to (a) make the Quintile column the first column, and then to change the names of the “mean” and “SE” columns. We do this as shown below:

tab.5 <- tab.5 %>%
  select("Quintile", "mean", "SE") %>% # Choosing and ordering the columns we want to keep
  rename("Percent" = mean,
         "Std. Error" = SE)

tab.5
  Quintile Percent Std. Error
1  Poorest    17.9        0.7
2   Poorer    19.7        0.5
3   Middle    22.2        0.6
4   Richer    20.9        0.8
5  Richest    19.4        0.7

Let us do another one, just for practice – tab.7

colnames(tab.7) <- c("No Education", "Incomplete Primary", "Completed Primary", "Incomplete Secondary", "Complete Secondary", "Higher") 

tab.7 <- tab.7 %>%
  mutate(
    Quintile = row.names(.)
  )
         
tab.7$Quintile <- stringr::str_to_title(tab.7$Quintile)

tab.7 <- tab.7 %>%
  select("Quintile", "No Education", "Incomplete Primary", "Completed Primary", "Incomplete Secondary", "Complete Secondary", "Higher")

tab.7
  Quintile No Education Incomplete Primary Completed Primary
1  Poorest         47.7                8.6               4.3
2   Poorer         37.4                8.9               4.9
3   Middle         18.9                6.1               3.7
4   Richer         14.0                5.1               5.0
5  Richest          5.4                2.3               2.8
  Incomplete Secondary Complete Secondary Higher
1                 13.8               22.7    3.0
2                 14.8               30.4    3.6
3                 14.6               46.5   10.2
4                 13.8               44.8   17.3
5                  9.4               45.3   34.8

Before we go any farther let us save our survey design data.

saveRDS(dhs14_design, file = "data/dhs14_design.RDS")
saveRDS(dhs.df14, "data/dhs.df14.RDS")