Introduction to Logistic Regression

Author Pujitha Gangarapu
For Dataset HCUP


Introduction

Regression is the process of estimating the relationships among variables. It has been widely used with health data sets. For example, "Firearm related injuries amongst children: Estimates from the nationwide emergency department sample" and "Factors Associated With In-Hospital Mortality After Administration of Thrombolysis in Acute Ischemic Stroke Patients" are two popular articles that used regression analysis on the HCUP data set.

There are different types of regression, including multinomial regression, logistic regression, Cox regression, and multivariate regression.

Data

Load the data set obtained from HCUP. The data set is delivered as an ASCII file (.asc). The detailed procedure for converting the ASCII format to CSV has been discussed in earlier blogs.

In this discussion, the Florida 2006 core file is used. The file is named FL_SID_2006_CORE. The ASCII file itself is about 2 GB in size, and after conversion to a SAS table it is even larger. R loads a data set into memory, so a file this large is difficult to work with on a typical machine. I therefore took a sample from the data set.

/* keep only the first 1000 observations */
data RA;
  set RA.Fl_SID_2006_CORE (obs=1000);
run;


RA here is the library name where the SAS table was generated, and 1000 is the number of observations that we chose to work with. We can also subset the data on specific conditions. For example,

data RA;
set RA.table1;
where DIED=1;
run;


libname RA "F:\HCUP\FL_SIDC_2006_CORE";
run;

data ra.kidney_donors;
	set ra.Fl_sidc_2006_core;
	where DX1 = 'V594' AND DIED= 1;
run;

proc export data=ra.Kidney_donors
outfile='F:\HCUP\FL_SIDC_2006_CORE\Kidney_Donors.csv'
dbms=csv
replace;
run;


One can combine multiple conditions in the where statement. Once the SAS table is generated, we can export it to a CSV file.

The CSV file generated from the SAS table contains numerical codes, and the data description explains what each code means. For example, if the value of "died during hospitalization" (DIED) is 0, the patient did not die, whereas a value of 1 indicates that the patient died.

Similarly, every other variable in the data set is numerically coded, and each code has its own meaning.
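As a small illustration of working with such codes in R, the sketch below attaches readable labels to a 0/1 variable. The vector is a made-up stand-in for the DIED column, and the label text ("Survived", "Died") is my own choice rather than anything defined by HCUP.

```r
# Toy vector standing in for the DIED column (values are made up for illustration)
died <- c(0, 1, 0, 0, 1)

# Attach readable labels to the numeric codes; the label text is our own choice
died_f <- factor(died, levels = c(0, 1), labels = c("Survived", "Died"))

# Count how many records fall under each label
print(table(died_f))
```

Keeping the original numeric column alongside the labelled factor is usually convenient, since glm() accepts either form for a binary response.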

Logistic Regression

There are many situations where we have binary outcomes (it snows in Charlotte on a given day, or it doesn't; this person dies, or not; this loan will be paid back, or it won't; this person will get heart disease in the next five years, or they won't). In addition to the binary outcome, we have some input variables, which may or may not be continuous. How can we model and analyze such data?

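The answer to such problems is logistic regression. It can be implemented in R with the following procedure. For this particular problem, I subset the SID Florida 2006 data for kidney donors. Note that the response must take both values (0 and 1) in the data being modeled; an extract restricted to DIED = 1 has no variation in the outcome and cannot be used to fit the model.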


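data <- read.csv("Kidney_Donors.csv")  # load the CSV file exported from SAS
head(data)     # first few rows of the data set
summary(data)  # distribution of each variable in the data
names(data)    # names of the variables
data1 <- data[, c(1:22, 51:53, 87)]  # keep only the variables that are required
summary(data1)
# pick the model variables by column position
died <- data1[, 9]
age <- data1[, 2]
gender <- data1[, 26]
No_days <- data[, 102]
Los <- data[, 93]
# fit the logistic regression (binomial family uses the logit link by default)
logistic_regression <- glm(died ~ age + gender + No_days + Los, family = 'binomial')
summary(logistic_regression)  # coefficient estimates on the log-odds scale
confint(logistic_regression)  # 95% confidence intervals for the coefficients
predict(logistic_regression, type = 'response')  # predicted probabilities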
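Since the HCUP extract cannot be shared, the following sketch runs the same steps on simulated data so that it can be executed as is. Everything in it is an assumption made for illustration: the column names (age, female, los, died), the sample size, and the coefficients used to generate the outcome are hypothetical and are not taken from the Florida file.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in for the HCUP extract; all names and values are hypothetical
n <- 500
sim <- data.frame(
  age    = round(runif(n, 18, 90)),
  female = rbinom(n, 1, 0.5),
  los    = rpois(n, 5)
)

# True log odds used to generate the outcome (arbitrary coefficients)
log_odds <- -4 + 0.04 * sim$age - 0.5 * sim$female + 0.1 * sim$los
sim$died <- rbinom(n, 1, plogis(log_odds))

# Fit the model by column name instead of column position
fit <- glm(died ~ age + female + los, data = sim, family = binomial)

# Coefficient table: estimates are on the log-odds scale
print(coef(summary(fit)))

# Predicted probability of death for one hypothetical patient
new_patient <- data.frame(age = 70, female = 1, los = 10)
p_hat <- predict(fit, newdata = new_patient, type = "response")
print(p_hat)
```

Referring to variables by name through the data argument is less fragile than selecting columns by position, because the model keeps working if the column order of the CSV file changes.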


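Remember that in the logit model the response is modeled on the log-odds scale:

ln(odds) = ln(p/(1-p)) = c + a*x_1 + b*x_2 + ... + z*x_n,

where p is the probability of the outcome, c is the intercept (which glm() fits by default), and a, b, ..., z are the coefficients of the input variables x_1, ..., x_n.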
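To make the log-odds scale more concrete, here is a minimal sketch of how to move between log odds, odds, and probabilities in R. The two coefficients are illustrative values only, not estimates from the HCUP data.

```r
# Hypothetical coefficients on the log-odds scale (illustrative values only)
b_male <- -2.75
b_age  <- -0.037

# exp() turns a change in log odds into a multiplicative change in the odds
odds_ratio_male <- exp(b_male)
odds_ratio_age  <- exp(b_age)
print(c(male = odds_ratio_male, age = odds_ratio_age))

# plogis() is the inverse logit: it maps log odds back to a probability
log_odds <- c(-2, 0, 2)
p <- plogis(log_odds)
print(p)

# qlogis() is the logit itself, so it recovers the original log odds
print(qlogis(p))
```

An odds ratio below 1 means the odds go down. With these values, each additional year of age multiplies the odds by roughly 0.96, and log odds of 0 correspond to a probability of 0.5.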
Since gender is a dummy variable, being male reduces the log odds by 2.75 (say), while a one-unit increase in age reduces the log odds by 0.037 (say), with the other variables held fixed. Now we can run the anova() function on the model to examine the analysis of deviance table:

anova(logistic_regression,test='Chisq')
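The deviance table can be read as a sequence of chi-squared tests on the drop in deviance as each term is added. The sketch below reproduces one such test by hand on simulated data and compares it with the value reported by anova(). The data, the variable names x and y, and the generating coefficients are all hypothetical.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not HCUP records)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

null_fit <- glm(y ~ 1, family = binomial)  # intercept only
full_fit <- glm(y ~ x, family = binomial)  # adds the predictor x

# Drop in deviance when x is added, and the degrees of freedom it costs
dev_drop <- deviance(null_fit) - deviance(full_fit)
df_drop  <- df.residual(null_fit) - df.residual(full_fit)

# Chi-squared p-value for the drop in deviance
p_manual <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

# The same p-value from the sequential analysis of deviance table
tab <- anova(full_fit, test = "Chisq")
print(tab)
print(c(manual = p_manual, anova = tab[["Pr(>Chi)"]][2]))
```

A small p-value for a term suggests that adding it reduces the deviance by more than chance alone would explain. Because the table is sequential, the result for each term depends on the order in which the terms appear in the model formula.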