Changes in Health Care Annual Enrollment using Cohort Analysis

From DIR
Jump to: navigation, search
Author Madhusudhan Vaddi
For Dataset MarketScan
All Blogs of MarketScan Changes in Health Care Annual Enrollment using Cohort Analysis


Cohort analysis is a subset of behavioral analytics that takes the data from a given dataset (e.g. an eCommerce platform, web application, or online game) and rather than looking at all users as one unit, it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time-span. Cohort analysis allows a company to “see patterns clearly across the life-cycle of a customer (or user), rather than slicing across all customers blindly without accounting for the natural cycle that a customer undergoes.” By seeing these patterns of time, a company can adapt and tailor its service to those specific cohorts. While cohort analysis is sometimes associated with a cohort study, they are different and should not be viewed as one and the same. Cohort analysis has come to describe specifically the analysis of cohorts in regards to big data and business analytics, while a cohort study is a more general umbrella term that describes a type of study in which data is broken down into similar groups.

1. We can vary our cohort definitions, depending on what we want to test. (For example, if we wanted to see if customers acquired from particular marketing channels were more valuable over their lifetimes, we’d define our cohorts based on the customer acquisition channel. On the other hand, if we wanted to see if we’ve got better at converting freemium users to paid users over time, we’d compare cohorts of users who started with the service a long time ago, with those that started more recently: defining our cohorts by month joined.)

2. We can vary the metric we are comparing between are cohorts. (For example, comparing the average lifetime value of customers in each cohort, or retention levels by cohort)

Data Source

The MarketScan Medicare Supplemental and Coordination of Benefits (COB) Database is created for Medicare-eligible retirees with employer-sponsored Medicare Supplemental plans. This database contains predominantly fee-for-service plan data.

The Medicare Supplemental Database table structure is identical to the Commercial Claims and Encounters table structure.

Both the Medicare-paid amounts and the employer-paid supplemental insurance amounts are included in this database. Only plans where both the Medicare-paid amounts and the employer-paid amounts were available and evident on the claims were selected for this database.

Data used for the analysis were derived from the Truven Health MarketScan® 2014 to 2014 Commercial Claims and Encounters and Medicare Supplemental and Coordination of Benefits Databases. The Medicare Supplemental and COB database contains claims-level information on outpatient prescription drugs for retirees with Medicare supplemental insurance paid for by employers. Pharmacy claims report enrollee, plan, and total payment amounts; dispensing fee, ingredient cost, and average wholesale price; number of days’ supply; an indicator for whether the drug is generic, sole source brand name, or brand name with a generic substitute; and an indicator for whether the drug is predominantly used for the long-term treatment of chronic conditions.

Performing cohort analysis

In order to perform a proper cohort analysis, there are four main stages:

  • Determine what question you want to answer. The point of the analysis is to come up with actionable information on which to act in order to improve business, product, user experience, turnover, etc. To ensure that happens, it is important that the right question is asked. In the gaming example above, the company was unsure why they were losing revenue as lag time increased, despite the fact that users were still signing up and playing games.
  • Define the metrics that will be able to help you answer the question. A proper cohort analysis requires the identification of an event, such as a user checking out, and specific properties, like how much the user paid. The gaming example measured a customer's willingness to buy gaming credits based on how much lag time there was on the site.
  • Define the specific cohorts that are relevant. In creating a cohort, one must either analyze all the users and target them or perform attribute contribution in order to find the relevant differences between each of them, ultimately to discover and explain their behavior as a specific cohort. The above example splits users into "basic" and "advanced" users as each group differs in actions, pricing structure sensitivities, and usage levels.
  • Perform the cohort analysis. The analysis above was done using data visualization which allowed the gaming company to realize that their revenues were falling because their higher-paying advanced users were not using the system as the lag time increased. Since the advanced users were such a large portion of the company's revenue, the additional basic user signups were not covering the financial losses from losing the advanced users. In order to fix this, the company improved their lag times and began catering more to their advanced users.

Cohort Analysis using R for sample dataset of annual enrollment



rowct = nrow(data)index<-sample(1:rowct, rowct/10, replace = FALSE, prob = NULL)
sampledata <- data[index,]

#Making regions table 
make_region = table(sampledata$REGION)
regiondf =

regiondf$name[regiondf$Var1==1] = "Northeast"
regiondf$name[regiondf$Var1==2] = "North Central"
regiondf$name[regiondf$Var1==3] = "South"
regiondf$name[regiondf$Var1==4] = "West"
regiondf$name[regiondf$Var1==5] = "Unknown"

# Making Pivot table - How many enrollments happen based on Enrollment Months(Total number of months during the year in which an individual was enrolled)
pivot_table =, REGION ~ ENRMON))

pivot_table = pivot_table[,-which(colnames(pivot_table) =="NA")]
pivot_table = pivot_table[-which(rownames(pivot_table) =="NA"),]
rownames(pivot_table) = regiondf$name
pivot_table =, keep.rownames = TRUE)[])

Changes in Health Care Annual Enrollment 1.png

#Normalizing data with maximum value in the first row
base_index = which(pivot_table[1,-1] == max(as.numeric(pivot_table[1,-1]))) + 1
x <- pivot_table[,c(2:ncol(pivot_table))]
y <- pivot_table[,base_index]
reten.r <- apply(x, 2, function(x) x/y )
reten.r <- data.frame(cohort=(pivot_table$rn), reten.r)

Changes in Health Care Annual Enrollment 2.png

reten.r <- reten.r[,-base_index] #remove base column, which is at index 2 data because it is always 100%

Changes in Health Care Annual Enrollment 3.png

cohort.chart1 <- melt(reten.r, id.vars = 'cohort')
colnames(cohort.chart1) <- c('cohort', 'monthcount', 'retention')

Changes in Health Care Annual Enrollment 4.png

p <- ggplot(cohort.chart1, aes(x=monthcount, y=retention, group=cohort, colour=cohort))
p + geom_line(size=2, alpha=1/2) +
  geom_point(size=3, alpha=1) +
  geom_smooth(aes(group=1), method = 'loess', size=2, colour='red', se=FALSE) +
  labs(title="Cohorts Retention ratio dynamics")

Changes in Health Care Annual Enrollment 5.png


If we consider west region, for every 6 months enrollments count is increasing enormously and reaching high among all the regions. In other way, we can say that, at the month of June and December, people are showing more interest in claiming health insurance, which means people are more tend to register for every 6 months. In the same way, people in South region are enrolling for health insurance annually.


From above result, we can easily say that how many individuals are enrolling for different periods of time. We can also analyze that which region is mostly looking for what type of plans. In the analysis, we have normalized using 12 months plan, since it has more number of individuals’ enrollment. So, every region has more enrollments happen for 12 month plan of particular year.