# Retention analysis

Last Edited By Krjb Donovan
Last Updated: Mar 05, 2014 09:53 PM GMT

## Question

You have been a great help in answering a previous question that I had, and now I am hoping that you can help me out again.

I am trying to conduct some retention analysis and produce a survival curve based on the number (or percentage) of students that have been retained since starting in Term 1. The survival curve will consist of points showing how many students are enrolled at Term 1 (which will be 100%), then the percentage of the SAME students that are enrolled at Term 2, the percentage of the SAME students that are enrolled at Term 3, and the percentage of the SAME students that are enrolled at Term 4.

I have data that consists of over 150,000 records; however, there may be duplicate records because some students may be enrolled in more than one subject in the same term. I would assume that I need to remove the duplicate records; i.e., if Student1 is enrolled in Maths and Chem in Term 1, then I would assume that I will need to treat it as one record only?

Secondly, there may be other students that have started in Term 2, 3, and 4 - therefore I would assume that I would need to remove these since I will not be able to compare them with the information from Term 1 because they will not be enrolled then. I am not sure how to do this step, or whether this can be incorporated into the formula.

Thirdly, I would like to produce a formula to track the retention rates. My example data are below; assuming that the first student ID starts at A2 and the first term is B2. I have access to2003, 2007, and 2010, but would prefer to know how to do it in 2007.

What I aim to produce is a table such as:

T1 T2 T3 T4 100% ?% ?% ?%

I really hope you can help with this one! If it is too difficult in Excel, would it be easier using some other program such as SPSS or Access?

Kind Nicole

Student ID Term 2423412 T1 2423412 T2 2423412 T3 2423423 T1 2432545 T1 2432545 T2 3453425 T1 4534534 T1 4535324 T2 5345243 T1 5345243 T2 5345243 T3 5465475 T1 5465475 T1 5465475 T2 5465475 T3 5465475 T4 5546345 T1 5569786 T1 5569786 T2 6454534 T1 6454534 T2 6787687 T1 6787687 T1 6787687 T1 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T4 6787687 T4 6787687 T4 7869786 T1 7869786 T1 7869786 T2 7869786 T2 7869786 T3 7869786 T4 7987067 T2 7987067 T3 7987068 T1 7987068 T2 7987068 T3 7987068 T4 7987068 T4 7987098 T1 7987098 T2 7987098 T3 8970987 T2 8970987 T2 8970987 T3

In c2 enter

=A2&B2

In d2 enter

=IF(AND(COUNTIF(\$C:\$C,LEFT(\$C2,7)&"T1")>0,RIGHT(\$C2,1)+0=D\$1,COUNTIF(\$C\$1:\$C2,\$C2)=1),1,0)

In D1, E1, F1, enter 1, 2, 3

copy c2:F2 down the columns.

Col d (starting d2) is the count of unique students in term 1. col e is the count of unique students in both term 1 and 2. col f is the count of unique students in both term 1 and 3.

In the future, when you ask a question, there's no need to give an example with dozens of rows. An example of just a few rows is enough to illustrate the principle, so don't make it more complicated than necessary. Also, always give a CONCRETE example. That is, with the example you've given, what results PRECISELY are you trying to get?

## Question

I am trying to conduct some retention analysis and produce a survival curve based on the number (or percentage) of students that have been retained since starting in Term 1. The survival curve will consist of points showing how many students are enrolled at Term 1 (which will be 100%), then the percentage of the SAME students that are enrolled at Term 2, the percentage of the SAME students that are enrolled at Term 3, and the percentage of the SAME students that are enrolled at Term 4.

I have data that consists of over 150,000 records; however, there may be duplicate records because some students may be enrolled in more than one subject in the same term. I would assume that I need to remove the duplicate records; i.e., if Student1 is enrolled in Maths and Chem in Term 1, then I would assume that I will need to treat it as one record only?

Secondly, there may be other students that have started in Term 2, 3, and 4 - therefore I would assume that I would need to remove these since I will not be able to compare them with the information from Term 1 because they will not be enrolled then. I am not sure how to do this step, or whether this can be incorporated into the formula.

Thirdly, I would like to produce a formula to track the retention rates. My example data are below; assuming that the first student ID starts at A2 and the first term is B2. I have access to2003, 2007, and 2010, but would prefer to know how to do it in 2007.

What I aim to produce is a table such as:

T1 T2 T3 T4 100% ?% ?% ?%

I really hope you can help with this one! If it is too difficult in Excel, would it be easier using some other program such as SPSS or Access?

Kind Nicole

Student ID Term 2423412 T1 2423412 T2 2423412 T3 2423423 T1 2432545 T1 2432545 T2 3453425 T1 4534534 T1 4535324 T2 5345243 T1 5345243 T2 5345243 T3 5465475 T1 5465475 T1 5465475 T2 5465475 T3 5465475 T4 5546345 T1 5569786 T1 5569786 T2 6454534 T1 6454534 T2 6787687 T1 6787687 T1 6787687 T1 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T4 6787687 T4 6787687 T4 7869786 T1 7869786 T1 7869786 T2 7869786 T2 7869786 T3 7869786 T4 7987067 T2 7987067 T3 7987068 T1 7987068 T2 7987068 T3 7987068 T4 7987068 T4 7987098 T1 7987098 T2 7987098 T3 8970987 T2 8970987 T2 8970987 T3

Nicole

Knowing as little about it as you have told me, this is what I would do.

1. Get rid of duplicates

2. Get rid of all students who were not present in T1 This is a little trickier. With the student number in column A and T* in column B, I used this formula:

=SUMPRODUCT(--(\$A\$2:\$A\$41=A2),--(\$B\$2:\$B\$41="T1"))

This returns a zero if the number in column A never appears with a T1 in column B. Delete the rows with zero.

3. Insert a pivot table (your two columns will need to have headers, mine were student and term)

4. Put term as the column label and count of either term or student as the data. That gave me 14, 10, 7, 4 as the survival numbers, which you could easily turn into the desired percentages.

Instead of a pivot table you could use something like

=COUNTIF(\$G\$2:\$G\$36,H2)

where the T1, T2, ... are in column G (the 35 unique Student-Term combinations) and H2 is T1. Copy that down 4 rows, with "T2", "T3", and "T4" in H3, H4, and H5 and you should get the same 14, 10, 7, 4 as you got with the pivot table. Again you will have to use another formula to get your percentages. I used

=I2/\$I\$2

where the 14, 10, 7, 4 where in I2, I3, I4, and I5.

## Question

You have been great in helping me out before and I am hoping you can do it again...

I am trying to conduct some retention analysis and produce a survival curve based on the number (or percentage) of students that have been retained since starting in Term 1. The survival curve will consist of points showing how many students are enrolled at Term 1 (which will be 100%), then the percentage of the SAME students that are enrolled at Term 2, the percentage of the SAME students that are enrolled at Term 3, and the percentage of the SAME students that are enrolled at Term 4.

I have data that consists of over 150,000 records; however, there may be duplicate records because some students may be enrolled in more than one subject in the same term. I would assume that I need to remove the duplicate records; i.e., if Student1 is enrolled in Maths and Chem in Term 1, then I would assume that I will need to treat it as one record only?

Secondly, there may be other students that have started in Term 2, 3, and 4 - therefore I would assume that I would need to remove these since I will not be able to compare them with the information from Term 1 because they will not be enrolled then. I am not sure how to do this step, or whether this can be incorporated into the formula.

Thirdly, I would like to produce a formula to track the retention rates. My example data are below; assuming that the first student ID starts at A2 and the first term is B2. I have access to2003, 2007, and 2010, but would prefer to know how to do it in 2007.

What I aim to produce is a table such as:

T1 T2 T3 T4 100% ?% ?% ?%

I really hope you can help with this one! If it is too difficult in Excel, would it be easier using some other program such as SPSS or Access?

Kind Nicole

Student ID Term 2423412 T1 2423412 T2 2423412 T3 2423423 T1 2432545 T1 2432545 T2 3453425 T1 4534534 T1 4535324 T2 5345243 T1 5345243 T2 5345243 T3 5465475 T1 5465475 T1 5465475 T2 5465475 T3 5465475 T4 5546345 T1 5569786 T1 5569786 T2 6454534 T1 6454534 T2 6787687 T1 6787687 T1 6787687 T1 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T4 6787687 T4 6787687 T4 7869786 T1 7869786 T1 7869786 T2 7869786 T2 7869786 T3 7869786 T4 7987067 T2 7987067 T3 7987068 T1 7987068 T2 7987068 T3 7987068 T4 7987068 T4 7987098 T1 7987098 T2 7987098 T3 8970987 T2 8970987 T2 8970987 T3

Nicole,

<revised>---------

in the last formulas, I had T1 by itself rather than in quotes - this was a typo. See the revised formulas below. I have tested the revisions with your data it it worked for me (I hadn't tested those very last formulas previously because I was running late for work - thus the typo)

This works for2007 and 2010 but not in Excel 2003 sincedoesn't have the COUNTIFS worksheet function

in C2: =IF(AND(COUNTIFS(\$A\$2:\$A2,\$A2,\$B\$2:\$B2,"T1")=1,\$B2="T1"),"T1","")

in D2: =IF(AND(COUNTIFS(\$A\$2:\$A2,\$A2,\$B\$2:\$B2,"T2")=1,COUNTIFS(\$A:\$A,\$A2,\$C:\$C,"T1")=1,\$B2="T2"),"T2","")

in E2:

=IF(AND(COUNTIFS(\$A\$2:\$A2,\$A2,\$B\$2:\$B2,"T3")=1,COUNTIFS(\$A:\$A,\$A2,\$C:\$C,"T1")=1,\$B2="T3"),"T3","")

in F2:

=IF(AND(COUNTIFS(\$A\$2:\$A2,\$A2,\$B\$2:\$B2,"T4")=1,COUNTIFS(\$A:\$A,\$A2,\$C:\$C,"T1")=1,\$B2="T4"),"T4","")

then drag fill these down the columns. Note I am using COUNTIFS with an S on the end and not the old COUNTIF function.

Now you can get your percentages with (now I am using the Old countif without an S)

1 ' T1 %

=countif(D:D,"T2")/Countif(C:C,"T1") T2% <== revised

=countif(E:E,"T3")/Countif(C:C,"T1") T3% <== revised

=countif(F:F,"T4").countif(C:C,"T1") T4% <== revised

## Question

I am trying to conduct some retention analysis and produce a survival curve based on the number (or percentage) of students that have been retained since starting in Term 1. The survival curve will consist of points showing how many students are enrolled at Term 1 (which will be 100%), then the percentage of the SAME students that are enrolled at Term 2, the percentage of the SAME students that are enrolled at Term 3, and the percentage of the SAME students that are enrolled at Term 4.

I have data that consists of over 150,000 records; however, there may be duplicate records because some students may be enrolled in more than one subject in the same term. I would assume that I need to remove the duplicate records; i.e., if Student1 is enrolled in Maths and Chem in Term 1, then I would assume that I will need to treat it as one record only?

Secondly, there may be other students that have started in Term 2, 3, and 4 - therefore I would assume that I would need to remove these since I will not be able to compare them with the information from Term 1 because they will not be enrolled then. I am not sure how to do this step, or whether this can be incorporated into the formula.

Thirdly, I would like to produce a formula to track the retention rates. My example data are below; assuming that the first student ID starts at A2 and the first term is B2. I have access to2003, 2007, and 2010, but would prefer to know how to do it in 2007.

What I aim to produce is a table such as:

T1 T2 T3 T4 100% ?% ?% ?%

I really hope you can help with this one! If it is too difficult in Excel, would it be easier using some other program such as SPSS or Access?

Kind Nicole

Student ID Term 2423412 T1 2423412 T2 2423412 T3 2423423 T1 2432545 T1 2432545 T2 3453425 T1 4534534 T1 4535324 T2 5345243 T1 5345243 T2 5345243 T3 5465475 T1 5465475 T1 5465475 T2 5465475 T3 5465475 T4 5546345 T1 5569786 T1 5569786 T2 6454534 T1 6454534 T2 6787687 T1 6787687 T1 6787687 T1 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T4 6787687 T4 6787687 T4 7869786 T1 7869786 T1 7869786 T2 7869786 T2 7869786 T3 7869786 T4 7987067 T2 7987067 T3 7987068 T1 7987068 T2 7987068 T3 7987068 T4 7987068 T4 7987098 T1 7987098 T2 7987098 T3 8970987 T2 8970987 T2 8970987 T3

I need more info as I don't understand how you want to calculate the percentages for T2, T3, or T4. It'd help if you use the exact date from the example you gave and how you come up with the answer. And is this supposed to be one line for each student?

## Question

I am trying to conduct some retention analysis and produce a survival curve based on the number (or percentage) of students that have been retained since starting in Term 1. The survival curve will consist of points showing how many students are enrolled at Term 1 (which will be 100%), then the percentage of the SAME students that are enrolled at Term 2, the percentage of the SAME students that are enrolled at Term 3, and the percentage of the SAME students that are enrolled at Term 4.

I have data that consists of over 150,000 records; however, there may be duplicate records because some students may be enrolled in more than one subject in the same term. I would assume that I need to remove the duplicate records; i.e., if Student1 is enrolled in Maths and Chem in Term 1, then I would assume that I will need to treat it as one record only?

Secondly, there may be other students that have started in Term 2, 3, and 4 - therefore I would assume that I would need to remove these since I will not be able to compare them with the information from Term 1 because they will not be enrolled then. I am not sure how to do this step, or whether this can be incorporated into the formula.

Thirdly, I would like to produce a formula to track the retention rates. My example data are below; assuming that the first student ID starts at A2 and the first term is B2. I have access to2003, 2007, and 2010, but would prefer to know how to do it in 2007.

What I aim to produce is a table such as:

T1 T2 T3 T4 100% ?% ?% ?%

I really hope you can help with this one! If it is too difficult in Excel, would it be easier using some other program such as SPSS or Access?

Kind Nicole

Student ID Term 2423412 T1 2423412 T2 2423412 T3 2423423 T1 2432545 T1 2432545 T2 3453425 T1 4534534 T1 4535324 T2 5345243 T1 5345243 T2 5345243 T3 5465475 T1 5465475 T1 5465475 T2 5465475 T3 5465475 T4 5546345 T1 5569786 T1 5569786 T2 6454534 T1 6454534 T2 6787687 T1 6787687 T1 6787687 T1 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T4 6787687 T4 6787687 T4 7869786 T1 7869786 T1 7869786 T2 7869786 T2 7869786 T3 7869786 T4 7987067 T2 7987067 T3 7987068 T1 7987068 T2 7987068 T3 7987068 T4 7987068 T4 7987098 T1 7987098 T2 7987098 T3 8970987 T2 8970987 T2 8970987 T3

Nicole

First I suggest you do some manipulation of the data to give you the information needed to do your analysis. I assume your data is column A=student number, column B =term and the data starts with row 75 First--I would sort the data by student number.

Second--I would put a formulas in as follows

Column C +B75 Column D +=IF(A75=A76,B76,0) Column E +=IF(A75=A77,B77,0) Column F +=IF(A75=A78,B78,0) These formulas can be copied down to each row. What these formulas do is, for each student, put the terms that each attended.

Third--copy the data to another sheet and use the paste special. Copy only the values not the formulas.

Fourth--Sort the data by column B, the term column, this will segregate the data putting those that began in the first term at the top of the list. These are the only ones you want to work with. Eliminate those that have other than T1 in column B

Fifth--Sort the emaining data by column D eliminate those that have have other than T2 or 0 in column D.

Sixth--Sort the remaining data by column E eliminate those that have other than T3 or zero in column E.

Seventh-Sort the remaining data by column F, eliminate those that have other than T4 or zero in column F.

After all of those sorts you now have the data you want to work with.

You can now use a formula to count the values in each column. I.E. in column C we want to count the values that are equal to T1, so the formula might be (Depending on the rows you are using) +=COUNTIF(C100:C1000,"T1") The formul for column D would be +=COUNTIF(D100:D1000,"T2") The formula for column E would be +=COUNTIF(E100:E1000,"T3") The formula for column F would be3 +=COUNTIF(F100:F1000,"T4")

Now you can divide the total calculated for each column by the total calculated for column C this will give you the percentage of those who started in term 1 that continued for each following term.

## Question

I am trying to conduct some retention analysis and produce a survival curve based on the number (or percentage) of students that have been retained since starting in Term 1. The survival curve will consist of points showing how many students are enrolled at Term 1 (which will be 100%), then the percentage of the SAME students that are enrolled at Term 2, the percentage of the SAME students that are enrolled at Term 3, and the percentage of the SAME students that are enrolled at Term 4.

I have data that consists of over 150,000 records; however, there may be duplicate records because some students may be enrolled in more than one subject in the same term. I would assume that I need to remove the duplicate records; i.e., if Student1 is enrolled in Maths and Chem in Term 1, then I would assume that I will need to treat it as one record only?

Secondly, there may be other students that have started in Term 2, 3, and 4 - therefore I would assume that I would need to remove these since I will not be able to compare them with the information from Term 1 because they will not be enrolled then. I am not sure how to do this step, or whether this can be incorporated into the formula.

Thirdly, I would like to produce a formula to track the retention rates. My example data are below; assuming that the first student ID starts at A2 and the first term is B2. I have access to2003, 2007, and 2010, but would prefer to know how to do it in 2007.

What I aim to produce is a table such as:

T1 T2 T3 T4 100% ?% ?% ?%

I really hope you can help with this one! If it is too difficult in Excel, would it be easier using some other program such as SPSS or Access?

Kind Nicole

Student ID Term 2423412 T1 2423412 T2 2423412 T3 2423423 T1 2432545 T1 2432545 T2 3453425 T1 4534534 T1 4535324 T2 5345243 T1 5345243 T2 5345243 T3 5465475 T1 5465475 T1 5465475 T2 5465475 T3 5465475 T4 5546345 T1 5569786 T1 5569786 T2 6454534 T1 6454534 T2 6787687 T1 6787687 T1 6787687 T1 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T2 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T3 6787687 T4 6787687 T4 6787687 T4 7869786 T1 7869786 T1 7869786 T2 7869786 T2 7869786 T3 7869786 T4 7987067 T2 7987067 T3 7987068 T1 7987068 T2 7987068 T3 7987068 T4 7987068 T4 7987098 T1 7987098 T2 7987098 T3 8970987 T2 8970987 T2 8970987 T3

I'm sorry, I had a long response ready and then allexperts said I wasn't logged in and I lost it.

You are asking a bunch of questions all in one question. I can't design your project but I can suggest some things. yes you can probably use Excel, but I can't make sense out of how your data is structured in terms of columns. Is all you have, 150000 rows of data with a student id and a term (1, 2, 3 or 4) next to it? What is that meaning? a past term, present term, term completed? I can't really suggest what formulas you might use long term because I can't really tell how all of your data is and how it will be added-to, going forward.

But it sounds like your first step might be to remove duplicates (people who show multi-records, for the same term, because of several classes). But this is your decision not mine-do you want to count duplicate classes or not? If not, run this code after reviewing it and changing as necessary to meet your worksheet names, ranges, etc.

Sub RemoveDuplicates()

Dim myrange As Range For Each myrange In ThisWorkbook.Worksheets("Sheet1").Range("C2:C58") 'change sheet name and ranges as necessary

myrange.Value = myrange.Offset(0, -2).Value & myrange.Offset(0, -1).Value 'this temporarily concatenates A and B values and places the single combined 'value into a "helper" column we will use, in this case, col C. ' 'if you have data in col C, change this accordingly...but I recommend 'leaving it as is, and just inserting a column to the left of where C is now, (which will become c) 'if necessary, so that we can leave it be C, or else other changes may 'be required as well and get confusing.

Next myrange

'now we will start back again at the top, and loop through them again 'this time, deleting rows where the value in c (a+b combined) occurs more than once For Each myrange In ThisWorkbook.Worksheets("Sheet1").Range("C2:C58") If Application.WorksheetFunction.CountIf(ThisWorkbook.Worksheets("Sheet1").Range("c2:c58"), myrange.Value) _ > 1 Then

```   'the above line says, (in each row the code checks),
'if the value in C occurs more than once, delete the entire row.
```

myrange.EntireRow.Delete End If Next myrange

End Sub