Analyzing Hospital Data: Associations Between Birth Location and Categorical Variables
Published:
Chi-Square Test of Independence
TL;DR: We explored the relationship between birth location and various categorical variables using the Chi-Square Test of Independence. The analysis reveals significant associations between race, ethnicity, and primary language with the choice of birth location. However, no significant association is found between pregnancy complications and birth location.
This test is used to determine if there is a significant association between two categorical variables. In our case, we can create a Rx2 contingency table where one axis represents the location of birth (in hospital or out of hospital) and the other axis represents whether there was a birth complication (yes or no). Then, we can apply the Chi-Square Test to determine if the occurrence of birth complications is independent of the location of birth.
Variable | Chi-Square Value | P-Value | Degrees of Freedom |
---|---|---|---|
Race | 23.775378787878786 | 0.0002397868866226024 | 5 |
Age | 29.67097681359045 | 0.32913837848747485 | 27 |
PrimaryLanguage | 21.2409516346847 | 0.011622476813629868 | 9 |
Ethnicity | 16.046235992025856 | 6.181438337887989e-05 | 1 |
Education | 3.341938364665638 | 0.5023150401311642 | 4 |
Employment | 1.3853856749311297 | 0.5002272231619186 | 2 |
PregnancyComplications | 0.06609748803827747 | 0.7971059642091333 | 1 |
Parity | 4.21471394984326 | 0.7547411063634557 | 7 |
Interpretation:
- Chi-Square Value: A larger value indicates a greater difference between the observed counts and what would be expected if the variables were independent.
- P-Value: Tells you whether or not the association is statistically significant. Typically, if the p-value is less than 0.05, we might conclude that there is a significant association between the variables.
- Degrees of Freedom: This is equal to (number of rows - 1) x (number of columns - 1) in the contingency table. It helps in determining the critical value for the chi-square distribution.
For interpretation, we would usually focus on the P-Value. If it’s below a significance level (commonly 0.05), it suggests that the observed distribution is significantly different from what you’d expect if the variables were independent, indicating an association.
For the variable Race
, Ethnicity
and PrimaryLanguage
, the very low p-value indicates that there is a statistically significant association between race and whether the birth occurred in or out of the hospital. This suggests that race and the language might play a role in the choice or circumstances of the birth location.
For the variable PregnancyComplications
, the p-value is much greater than 0.05, suggesting that there is no statistically significant association between pregnancy complications and birth location. Pregnancy complications do not appear to be a determining factor in whether a birth occurs in or out of the hospital.
On the figure bellow you can see the bar plots of the various variables against the birth locations. We highlight the correlation between OutOfHospital
with Race
, Ethnicity
and PrimaryLanguage
.
We will note that we explored the idea of using the number of complications as the outcome variable instead of the binary variable. However, we can see in the histogram plot bellow that the 111 of the 138 data points have either 0 or 1 complications, while the 27 remaining having 2, 3 or 4 complications. We did not proceed with using a discrete variable as the benefit remain to be proven with so few points.
Logistic Regression
Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable. In the context of predicting whether a baby is born in or out of a hospital, it provides a method to estimate the probability of an event (birth location) given a set of related variables, such as race, age, primary language, ethnicity, and education of the parent. Features are assigned coefficients, which reflect the strength and direction of their relationship with the outcome. A positive coefficient suggests that as the feature value increases, the likelihood of the event (in this case, a baby being born out of hospital) also increases, while a negative coefficient suggests the opposite. Thus, logistic regression allows us to identify and quantify the influence of various factors on the outcome, providing insights into relevant and correlated features.
Using statsmodels
Statsmodels is primarily designed for statistical modeling and is more suited for traditional, interpretable statistical analyses. It provides extensive options for model statistics, like R-squared, p-values, confidence intervals, and so on. In the case of logistic regression, Statsmodels gives a detailed summary table, which includes various diagnostic measures and statistics about the coefficients.
Variable | Coefficient | Std Err | z-score | P>z |
---|---|---|---|---|
const | 0.9407 | 0.971 | 0.969 | 0.333 |
PregnancyComplications | -0.3991 | 0.433 | -0.921 | 0.357 |
Parity | -0.1405 | 0.121 | -1.158 | 0.247 |
Education_high school graduate | 1.1348 | 0.942 | 1.205 | 0.228 |
Education_less than high school | 0.8113 | 1.082 | 0.749 | 0.454 |
Education_none listed | -0.0934 | 0.855 | -0.109 | 0.913 |
Education_some college | 0.4408 | 1.649 | 0.267 | 0.789 |
Ethnicity_not hispanic/latino | -1.5900 | 0.396 | -4.011 | 0.000 |
Employment_not listed | 0.3878 | 0.609 | 0.637 | 0.524 |
Employment_unemployed | 0.6188 | 0.498 | 1.242 | 0.214 |
Here are what some of the key columns mean:
- Coefficient (coef): The coefficient represents the expected change in the log-odds of the outcome for a one-unit increase in the corresponding predictor, assuming all other predictors are held constant. For example, the coefficient for “PregnancyComplications” is -0.3991, which suggests that for each additional unit increase in “PregnancyComplications,” the log-odds of the outcome (e.g., OutofHospital) decrease by 0.3991, all else being equal.
- Standard Error (std err): The standard error measures the variability in the coefficient estimate that would be expected if the study were repeated. Smaller standard errors mean the estimate is more precise. For example, the standard error for the coefficient of “PregnancyComplications” is 0.433, which is relatively high compared to some other variables like “Parity” (std err = 0.121). This implies that the estimate for “PregnancyComplications” might be less reliable.
- p-value: The p-value is a measure of the statistical significance of the predictor. A smaller p-value (usually <0.05) suggests that the predictor is statistically significant. For example, “Ethnicity_not hispanic/latino” has a p-value of 0.000, which means it is statistically significant at a conventional level, indicating a strong evidence against the null hypothesis of no effect. On the other hand, “PregnancyComplications” has a p-value of 0.357, which is greater than 0.05, suggesting that the effect of “PregnancyComplications” is not statistically significant at a conventional level.
Using scikit-learn
Scikit-learn, is designed for machine learning and is more focused on prediction accuracy. It provides a wide range of tools for preprocessing data, cross-validation, tuning model hyperparameters, and evaluating prediction performance, which are essential for building and fine-tuning machine learning models. However, Scikit-learn does not provide p-values or similar traditional statistical measures out-of-the-box, as these are less relevant in a machine learning context where the focus is on prediction rather than interpretation.
The logistic regression model reveals a variety of key insights about the influences on whether a baby is born inside or outside of a hospital. The most influential feature is whether the primary language is English, with a negative coefficient of -0.716497. This suggests that when English is the primary language, it is less likely for the baby to be born outside of a hospital. Similar trends are seen with the race of being Caucasian/White and the ethnicity of not being Hispanic/Latino, which also have negative coefficients, indicating these groups are less likely to have out-of-hospital births. On the other hand, certain features increase the likelihood of out-of-hospital birth. For example, if the individual is unemployed, the probability of the birth taking place outside of a hospital increases.
From the model, it can be observed that certain features increase the probability of a baby being born outside of a hospital. Individuals who are unemployed (coefficient of 0.472605) have an increased likelihood of giving birth outside a hospital. Also, people whose primary language is either Bengali, Urdu, or Spanish also show a higher tendency to have out-of-hospital births, as indicated by the positive coefficients of these variables. Similarly, having an education level of a high school graduate or less than high school, also increases the chances of out-of-hospital births. These features suggest that socio-economic factors, language, and education level significantly influence the location of birth, with lower socio-economic status and non-English primary languages increasing the likelihood of out-of-hospital births.
The accuracy of the model, at 0.68, indicates a moderate ability to correctly classify whether a birth takes place in or out of the hospital based on the available features. The F1 score, a balance of precision and recall, is 0.71, which suggests a reasonably balanced model.
Data analysis for the babies
The analysis is technically identical. Each data point refers now to a baby, with the predicted variable of interest “OutofHospital” which refers to a baby being born out of hospital or no. You can find the source code for this part.
Chi-Square Test of Independence
Variable | Chi-Square Value | P-Value | Degrees of Freedom |
---|---|---|---|
Neonatal Survival | 0.03417771510185239 | 0.8533293086053286 | 1 |
Perinatal Survival | 0.06751054852320676 | 0.7949965125940222 | 1 |
Polycythemia | 3.698000683526999 | 0.05447770916113348 | 1 |
Hypoglycemia | 24.555285586124405 | 7.220799970750491e-07 | 1 |
Convulsions | 0.0 | 1.0 | 1 |
Hypothermic | 23.05227272727273 | 1.5765603373793762e-06 | 1 |
Bradycardia | 4.831597222222223 | 0.027942732316554454 | 1 |
Respiratory distress | 0.8262186004784692 | 0.3633682441225963 | 1 |
PPHN | 1.8562758264462815 | 0.1730552393175499 | 1 |
For the variable Hypoglycemia
, Hypothermic
, Polycythemia
and Bradycardia
, the very low p-value indicates that there is a statistically significant association between race and whether the birth occurred in or out of the hospital. This suggests that race and the language might play a role in the choice or circumstances of the birth location.
For the variable Neonatal Survival
, the p-value is much greater than 0.05, suggesting that there is no statistically significant association between pregnancy complications and birth location. Pregnancy complications do not appear to be a determining factor in whether a birth occurs in or out of the hospital.
Logistic Regression
This study employs a logistic regression model to identify factors that can predict whether a baby will be born in or out of hospital. The results are represented by the coefficients assigned to each feature or variable in the model, which represent the log odds of a unit change in the respective feature, while all other features are held constant. A positive coefficient indicates that the feature is associated with an increased likelihood of the outcome (i.e., a baby being born in hospital), while a negative coefficient suggests a reduced likelihood.
The following factors significantly increase the odds of a baby being born in hospital:
Hypothermia (coefficient 1.804): This is the most influential factor, implying that the odds of a baby being born in hospital increases substantially if the baby experiences hypothermia.
Hypoglycemia (coefficient 1.671): Similarly, this is a significant predictor, indicating that hypoglycemia also considerably increases the likelihood of in-hospital birth.
Unavailable data on neonatal survival (coefficient 0.686) and Persistent Pulmonary Hypertension of the Newborn (PPHN) (coefficient 0.651) also have a positive influence, albeit less strong.
On the contrary, some factors decrease the odds of a baby being born in hospital:
Birth weight (coefficient -0.540): This result suggests that an increase in birth weight is associated with a decrease in the odds of an in-hospital birth.
Convulsions (coefficient -0.360): The presence of convulsions also reduces the likelihood of in-hospital birth.
Of note, the model accuracy, which represents the proportion of correct predictions made out of all predictions, is 79%, suggesting a reasonable fit to the data. Precision is at 1.00, meaning that all positive predictions made by the model (i.e., all predictions of in-hospital births) were correct. However, the model’s recall (also known as sensitivity) is 0.57, indicating that it identified just over half of the actual in-hospital births correctly. Therefore, while the model is highly reliable when it predicts an in-hospital birth, it is relatively less successful in capturing all such cases. This is further reflected in the F1 score, which is a harmonic mean of precision and recall, and is 0.73.
In conclusion, hypothermia and hypoglycemia are significant predictors of a baby being born in hospital. An increase in birth weight and the presence of convulsions reduce the odds of an in-hospital birth. The model is fairly accurate and highly precise, but its sensitivity is moderate. Further research is needed to improve the model’s ability to correctly identify all cases of in-hospital births.
Rank | Feature | Coefficient |
---|---|---|
1 | Hypothermic_yes | 1.804393 |
2 | Hypoglycemia_yes | 1.671158 |
3 | Neonatal Survival_nan | 0.686115 |
4 | PPHN_yes | 0.651477 |
5 | Birth weight | -0.540175 |
6 | Convulsions_yes | -0.360674 |
7 | Perinatal Survival_nan | 0.354218 |
8 | Bradycardia_yes | 0.247487 |
9 | Polycythemia_yes | 0.204884 |
10 | Respiratory distress_yes | -0.048227 |
11 | Neonatal Survival_deceased | 0.000000 |
12 | Perinatal Survival_deceased | 0.000000 |
Metric | Value |
---|---|
Accuracy | 0.79 |
Precision | 1.00 |
Recall | 0.57 |
F1 Score | 0.73 |