Analyzing Hospital Data: Associations Between Birth Location and Categorical Variables

Source code

Chi-Square Test of Independence

TL;DR: We explored the relationship between birth location and various categorical variables using the Chi-Square Test of Independence. The analysis reveals significant associations between race, ethnicity, and primary language with the choice of birth location. However, no significant association is found between pregnancy complications and birth location.

The Chi-Square Test of Independence determines whether there is a statistically significant association between two categorical variables. In our case, we build an R×2 contingency table in which one axis represents the birth location (in hospital or out of hospital) and the other represents the categories of the variable under study, for example whether there was a pregnancy complication (yes or no). We then apply the test to determine whether that variable is independent of the birth location.
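The procedure described here can be sketched with pandas and SciPy. This is a minimal illustration, not the article's source code: the data frame is randomly generated toy data, and the column names are only assumed to mirror the variables discussed in this post.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (made up): 138 rows to mirror the sample size mentioned in the post.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "OutOfHospital": rng.choice(["yes", "no"], size=138),
    "PregnancyComplications": rng.choice(["yes", "no"], size=138),
})

# Contingency table: rows are the categories of the variable under study,
# columns are the two birth locations.
table = pd.crosstab(df["PregnancyComplications"], df["OutOfHospital"])

# Returns the statistic, the p-value, the degrees of freedom and the expected
# counts under independence. For 2x2 tables SciPy applies Yates' continuity
# correction by default.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

For a variable with R categories, the same call works unchanged on the resulting R×2 table.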

| Variable | Chi-Square Value | P-Value | Degrees of Freedom |
|---|---|---|---|
| Race | 23.775378787878786 | 0.0002397868866226024 | 5 |
| Age | 29.67097681359045 | 0.32913837848747485 | 27 |
| PrimaryLanguage | 21.2409516346847 | 0.011622476813629868 | 9 |
| Ethnicity | 16.046235992025856 | 6.181438337887989e-05 | 1 |
| Education | 3.341938364665638 | 0.5023150401311642 | 4 |
| Employment | 1.3853856749311297 | 0.5002272231619186 | 2 |
| PregnancyComplications | 0.06609748803827747 | 0.7971059642091333 | 1 |
| Parity | 4.21471394984326 | 0.7547411063634557 | 7 |

Interpretation:

  • Chi-Square Value: A larger value indicates a greater difference between the observed counts and what would be expected if the variables were independent.
  • P-Value: The probability of observing a chi-square value at least this large if the two variables were truly independent. Typically, if the p-value is less than 0.05, we conclude that there is a statistically significant association between the variables.
  • Degrees of Freedom: This is equal to (number of rows - 1) x (number of columns - 1) in the contingency table. It helps in determining the critical value for the chi-square distribution.

For interpretation, we would usually focus on the P-Value. If it’s below a significance level (commonly 0.05), it suggests that the observed distribution is significantly different from what you’d expect if the variables were independent, indicating an association.
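As a quick check of how the three columns relate, the p-value can be recomputed from the chi-square value and the degrees of freedom using the survival function of the chi-square distribution. The sketch below copies three rows from the results table above; the 0.05 threshold is the conventional level mentioned in the text.

```python
from scipy.stats import chi2

# (chi-square value, degrees of freedom) copied from the results table.
results = {
    "Race": (23.775378787878786, 5),
    "Ethnicity": (16.046235992025856, 1),
    "PregnancyComplications": (0.06609748803827747, 1),
}

alpha = 0.05  # conventional significance level
p_values = {}
for name, (statistic, dof) in results.items():
    # sf(x, dof) = P(X >= x) for a chi-square variable with `dof` degrees of freedom
    p_values[name] = chi2.sf(statistic, dof)
    verdict = "significant" if p_values[name] < alpha else "not significant"
    print(f"{name}: p={p_values[name]:.3g} ({verdict})")
```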

For the variables Race, Ethnicity and PrimaryLanguage, the p-values are all below 0.05 (very low for Race and Ethnicity, and about 0.012 for PrimaryLanguage), indicating a statistically significant association between each of these variables and whether the birth occurred in or out of the hospital. This suggests that race, ethnicity and language might play a role in the choice or circumstances of the birth location.

For the variable PregnancyComplications, the p-value is much greater than 0.05, suggesting that there is no statistically significant association between pregnancy complications and birth location. Pregnancy complications do not appear to be a determining factor in whether a birth occurs in or out of the hospital.

The figure below shows bar plots of the various variables against birth location. We highlight the association of OutOfHospital with Race, Ethnicity and PrimaryLanguage.

Note that we explored using the number of complications as the outcome variable instead of the binary variable. However, the histogram below shows that 111 of the 138 data points have either 0 or 1 complications, while the remaining 27 have 2, 3 or 4 complications. We did not proceed with the count variable, as its benefit remains to be proven with so few points.

Logistic Regression

Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable. In the context of predicting whether a baby is born in or out of a hospital, it provides a method to estimate the probability of an event (birth location) given a set of related variables, such as race, age, primary language, ethnicity, and education of the parent. Features are assigned coefficients, which reflect the strength and direction of their relationship with the outcome. A positive coefficient suggests that as the feature value increases, the likelihood of the event (in this case, a baby being born out of hospital) also increases, while a negative coefficient suggests the opposite. Thus, logistic regression allows us to identify and quantify the influence of various factors on the outcome, providing insights into relevant and correlated features.

Using statsmodels

Statsmodels is primarily designed for statistical modeling and is more suited for traditional, interpretable statistical analyses. It provides extensive options for model statistics, like R-squared, p-values, confidence intervals, and so on. In the case of logistic regression, Statsmodels gives a detailed summary table, which includes various diagnostic measures and statistics about the coefficients.

| Variable | Coefficient | Std Err | z-score | P>\|z\| |
|---|---|---|---|---|
| const | 0.9407 | 0.971 | 0.969 | 0.333 |
| PregnancyComplications | -0.3991 | 0.433 | -0.921 | 0.357 |
| Parity | -0.1405 | 0.121 | -1.158 | 0.247 |
| Education_high school graduate | 1.1348 | 0.942 | 1.205 | 0.228 |
| Education_less than high school | 0.8113 | 1.082 | 0.749 | 0.454 |
| Education_none listed | -0.0934 | 0.855 | -0.109 | 0.913 |
| Education_some college | 0.4408 | 1.649 | 0.267 | 0.789 |
| Ethnicity_not hispanic/latino | -1.5900 | 0.396 | -4.011 | 0.000 |
| Employment_not listed | 0.3878 | 0.609 | 0.637 | 0.524 |
| Employment_unemployed | 0.6188 | 0.498 | 1.242 | 0.214 |

Here are what some of the key columns mean:

  1. Coefficient (coef): The coefficient represents the expected change in the log-odds of the outcome for a one-unit increase in the corresponding predictor, assuming all other predictors are held constant. For example, the coefficient for “PregnancyComplications” is -0.3991, which suggests that for each additional unit increase in “PregnancyComplications,” the log-odds of the outcome (e.g., OutofHospital) decrease by 0.3991, all else being equal.
  2. Standard Error (std err): The standard error measures the variability in the coefficient estimate that would be expected if the study were repeated. Smaller standard errors mean the estimate is more precise. For example, the standard error for the coefficient of “PregnancyComplications” is 0.433, which is higher than that of “Parity” (std err = 0.121). This implies that the estimate for “PregnancyComplications” is less precise.
  3. p-value: The p-value is a measure of the statistical significance of the predictor. A smaller p-value (usually <0.05) suggests that the predictor is statistically significant. For example, “Ethnicity_not hispanic/latino” has a p-value reported as 0.000 (that is, below 0.0005 after rounding), which means it is statistically significant at the conventional level and provides strong evidence against the null hypothesis of no effect. On the other hand, “PregnancyComplications” has a p-value of 0.357, which is greater than 0.05, suggesting that its effect is not statistically significant at the conventional level.
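Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below uses two rows copied from the statsmodels table above; the 95% Wald interval is a standard approximation added here for illustration and is not reported in the article.

```python
import math

# (coefficient, standard error) copied from the statsmodels table.
estimates = {
    "PregnancyComplications": (-0.3991, 0.433),
    "Ethnicity_not hispanic/latino": (-1.5900, 0.396),
}

odds_ratios = {}
for name, (coef, std_err) in estimates.items():
    z_score = coef / std_err        # matches the z-score column up to rounding
    odds_ratio = math.exp(coef)     # multiplicative change in the odds
    # Approximate 95% Wald interval, transformed to the odds-ratio scale.
    low = math.exp(coef - 1.96 * std_err)
    high = math.exp(coef + 1.96 * std_err)
    odds_ratios[name] = odds_ratio
    print(f"{name}: z={z_score:.2f}, OR={odds_ratio:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

An odds ratio below 1 corresponds to a negative coefficient, and an interval that contains 1 corresponds to a predictor that is not significant at the 0.05 level.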

Using scikit-learn

Scikit-learn is designed for machine learning and focuses on predictive accuracy. It provides a wide range of tools for preprocessing data, cross-validation, tuning model hyperparameters, and evaluating prediction performance, which are essential for building and fine-tuning machine learning models. However, Scikit-learn does not provide p-values or similar traditional statistical measures out of the box, as these are less relevant in a machine learning context where the focus is on prediction rather than interpretation.
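A typical scikit-learn workflow for this kind of problem can be sketched as follows. This is an assumption-laden illustration, not the article's pipeline: the data and labels are randomly generated, the category values are hypothetical, and the choice of one-hot encoding with a 75/25 split is ours.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Toy features (made up); column names only mirror variables from the post.
X = pd.DataFrame({
    "PrimaryLanguage": rng.choice(["English", "Spanish", "Bengali"], size=n),
    "Employment": rng.choice(["employed", "unemployed", "not listed"], size=n),
    "Parity": rng.integers(0, 6, size=n),
})
y = rng.integers(0, 2, size=n)  # made-up labels: 1 = out of hospital

# One-hot encode the categorical columns and pass numeric columns through.
categorical = ["PrimaryLanguage", "Employment"]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"accuracy={accuracy:.2f}, F1={f1:.2f}")

# Fitted coefficients (log-odds scale), one per encoded feature.
coefficients = model.named_steps["logisticregression"].coef_[0]
```

With random labels the scores are near chance; the point is only the shape of the workflow.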

The logistic regression model gives several insights into what influences whether a baby is born inside or outside of a hospital. The most influential feature is whether the primary language is English, with a coefficient of -0.716497. This suggests that when English is the primary language, the baby is less likely to be born outside of a hospital. Similar trends are seen for Caucasian/White race and non-Hispanic/Latino ethnicity, which also have negative coefficients, indicating that these groups are less likely to have out-of-hospital births.

Several features are associated with a higher probability of a baby being born outside of a hospital. Individuals who are unemployed (coefficient of 0.472605) have an increased likelihood of giving birth outside a hospital. People whose primary language is Bengali, Urdu, or Spanish also show a higher tendency to have out-of-hospital births, as indicated by the positive coefficients of these variables. Similarly, an education level of high school graduate or less than high school increases the chances of out-of-hospital births. These features suggest that socio-economic factors, language, and education level influence the location of birth, with lower socio-economic status and non-English primary languages increasing the likelihood of out-of-hospital births.

The model's accuracy of 0.68 indicates a moderate ability to correctly classify whether a birth takes place in or out of the hospital based on the available features. The F1 score, the harmonic mean of precision and recall, is 0.71, a similarly moderate result.

Data analysis for the babies

The analysis is technically identical. Each data point now refers to a baby, and the predicted variable of interest, “OutofHospital”, indicates whether the baby was born out of hospital or not. You can find the source code for this part.

Chi-Square Test of Independence

| Variable | Chi-Square Value | P-Value | Degrees of Freedom |
|---|---|---|---|
| Neonatal Survival | 0.03417771510185239 | 0.8533293086053286 | 1 |
| Perinatal Survival | 0.06751054852320676 | 0.7949965125940222 | 1 |
| Polycythemia | 3.698000683526999 | 0.05447770916113348 | 1 |
| Hypoglycemia | 24.555285586124405 | 7.220799970750491e-07 | 1 |
| Convulsions | 0.0 | 1.0 | 1 |
| Hypothermic | 23.05227272727273 | 1.5765603373793762e-06 | 1 |
| Bradycardia | 4.831597222222223 | 0.027942732316554454 | 1 |
| Respiratory distress | 0.8262186004784692 | 0.3633682441225963 | 1 |
| PPHN | 1.8562758264462815 | 0.1730552393175499 | 1 |

For the variables Hypoglycemia, Hypothermic and Bradycardia, the p-values are below 0.05, indicating a statistically significant association between each of these conditions and whether the birth occurred in or out of the hospital. Polycythemia (p ≈ 0.054) falls just above the 0.05 threshold, so its association is borderline rather than significant at the conventional level. This suggests that these neonatal conditions occur at different rates for in-hospital and out-of-hospital births.

For the variable Neonatal Survival, the p-value is much greater than 0.05, suggesting that there is no statistically significant association between neonatal survival and birth location. The same holds for Perinatal Survival, Convulsions, Respiratory distress and PPHN.

Logistic Regression

This study employs a logistic regression model to identify factors that can predict whether a baby will be born in or out of hospital. The results are represented by the coefficients assigned to each feature in the model. Each coefficient is the change in the log-odds of the outcome for a one-unit increase in the respective feature, while all other features are held constant. A positive coefficient indicates that the feature is associated with an increased likelihood of the outcome (i.e., a baby being born in hospital), while a negative coefficient suggests a reduced likelihood.

The following factors are associated with higher odds of a baby being born in hospital:

  1. Hypothermia (coefficient 1.804): This is the most influential factor, implying that the odds of a baby being born in hospital increase substantially if the baby experiences hypothermia.

  2. Hypoglycemia (coefficient 1.671): Similarly, this is a strong predictor, indicating that hypoglycemia also considerably increases the likelihood of in-hospital birth.

  3. Unavailable data on neonatal survival (coefficient 0.686) and Persistent Pulmonary Hypertension of the Newborn (PPHN) (coefficient 0.651) also have a positive influence, albeit less strong.

Conversely, some factors decrease the odds of a baby being born in hospital:

  1. Birth weight (coefficient -0.540): This result suggests that an increase in birth weight is associated with a decrease in the odds of an in-hospital birth.

  2. Convulsions (coefficient -0.360): The presence of convulsions also reduces the likelihood of in-hospital birth.

Of note, the model accuracy, which represents the proportion of correct predictions made out of all predictions, is 79%, suggesting a reasonable fit to the data. Precision is at 1.00, meaning that all positive predictions made by the model (i.e., all predictions of in-hospital births) were correct. However, the model’s recall (also known as sensitivity) is 0.57, indicating that it identified just over half of the actual in-hospital births correctly. Therefore, while the model is highly reliable when it predicts an in-hospital birth, it is relatively less successful in capturing all such cases. This is further reflected in the F1 score, which is a harmonic mean of precision and recall, and is 0.73.
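The reported F1 score can be verified directly from the precision and recall quoted above, since F1 is their harmonic mean. This is a small arithmetic check using only the numbers stated in the text.

```python
precision = 1.00  # all predicted in-hospital births were correct
recall = 0.57     # share of actual in-hospital births that were identified

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 2))  # 0.73
```

The harmonic mean is pulled toward the lower of the two values, which is why a perfect precision still yields an F1 well below 1 when recall is 0.57.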

In conclusion, hypothermia and hypoglycemia are significant predictors of a baby being born in hospital. An increase in birth weight and the presence of convulsions reduce the odds of an in-hospital birth. The model is fairly accurate and highly precise, but its sensitivity is moderate. Further research is needed to improve the model’s ability to correctly identify all cases of in-hospital births.

| Rank | Feature | Coefficient |
|---|---|---|
| 1 | Hypothermic_yes | 1.804393 |
| 2 | Hypoglycemia_yes | 1.671158 |
| 3 | Neonatal Survival_nan | 0.686115 |
| 4 | PPHN_yes | 0.651477 |
| 5 | Birth weight | -0.540175 |
| 6 | Convulsions_yes | -0.360674 |
| 7 | Perinatal Survival_nan | 0.354218 |
| 8 | Bradycardia_yes | 0.247487 |
| 9 | Polycythemia_yes | 0.204884 |
| 10 | Respiratory distress_yes | -0.048227 |
| 11 | Neonatal Survival_deceased | 0.000000 |
| 12 | Perinatal Survival_deceased | 0.000000 |

| Metric | Value |
|---|---|
| Accuracy | 0.79 |
| Precision | 1.00 |
| Recall | 0.57 |
| F1 Score | 0.73 |
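The ranking in the coefficient table appears to order features by the absolute value of their coefficient, so that strong negative effects rank alongside strong positive ones. A minimal pandas sketch of that ordering, using the non-zero coefficients copied from the table (entered here in arbitrary order), could look like this:

```python
import pandas as pd

# Coefficients copied from the table, deliberately not in ranked order.
coefficients = pd.Series({
    "Birth weight": -0.540175,
    "Bradycardia_yes": 0.247487,
    "Convulsions_yes": -0.360674,
    "Hypoglycemia_yes": 1.671158,
    "Hypothermic_yes": 1.804393,
    "Neonatal Survival_nan": 0.686115,
    "PPHN_yes": 0.651477,
    "Perinatal Survival_nan": 0.354218,
    "Polycythemia_yes": 0.204884,
    "Respiratory distress_yes": -0.048227,
})

# Order by magnitude while keeping the sign of each coefficient.
order = coefficients.abs().sort_values(ascending=False).index
ranked = coefficients.reindex(order)

for rank, (feature, coef) in enumerate(ranked.items(), start=1):
    print(f"{rank:>2} {feature:<26} {coef: .6f}")
```

Comparing magnitudes this way is only meaningful when the features are on comparable scales, for example binary indicators or standardized numeric columns.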