Saturday, December 9, 2017

Customer Analytics using Base SAS




MIS 6334 Advanced BA
Project 2: Customer Analytics using Base SAS







GROUP 11
Chaithanya Sai Srinivas Sabnaveesu – cxs163030
Ritesh Kunisseri Puliyakote – rxk166230
Sindhuja Masiragani – sxm164031
Tejasvi Ramadas Sagar – txs161330

Vinod Premkumar– vmp160030




Table of Contents
Exploring the dataset
Part I. Modeling count data
1. Creation of Count Dataset
2. NBD Model
3. Reach, Average Frequency and GRPs
4. Poisson Regression Model
5. LL Formula - NBD Regression Model
6. NBD Regression Model using customer characteristics
7. Managerial takeaways between Poisson Regression and NBD Regression
8. Comparing the models - LR Test 
Part II. Improving the Model
9. Variable Selection
10. Constructing New Variables 
11. Interaction Effects 
Part III. Why Certain Customers Prefer Amazon Over BN? 
12. Logistic Regression 
Part IV. Summary
13. Insights and Learnings
Appendix










Exploring the dataset:
Data:
The provided data consists of customer purchase behavior from Amazon and BARNESandNOBLE. The data is provided with customer demographics along with some other variables like quantity, price, product, date and domain. As with every other data we have missing values and, so we had to preprocess even before analyzing the data.
Preprocessing:
While observing the data characteristics, we observed that the Education has more than 73% missing values, Age has 3 missing values and region has 46 missing values. So, to handle these variables, we removed education variable and imputed the age and region variables with its respective mode values.
Dataset:
Total records: 40,945
We can classify the variables as below:
Demographic variables: Education, Region, Household size(hhsz), Age, Income, Child, Race, Country
ID or Index variable: UserID
Purchase variables: Quantity(qty), Price, Product, Domain
Analyzing the variables for and finding missing values:
PROC CONTENTS: The PROC CONTENTS procedure generates summary information about the contents of a dataset. This procedure is useful if you want to know basic information about the dataset itself, such as the variables' names, types, and attributes.
PROC CONTENTS DATA=P2.ABN;
RUN;
Below is the screenshot showing a part of contents obtained for PROC CONTENTS.


PROC UNIVARIATE: This is a procedure within BASE SAS used primarily for examining the distribution of data, including an assessment of normality and discovery of outliers.

PROC UNIVARIATE DATA = P2.ABN;
RUN;


PROC FREQ: The PROC FREQ statement is the only required statement for the FREQ procedure. If you specify the following statements, PROC FREQ produces a one-way frequency table for all the variables in our data.
PROC FREQ DATA = P2.ABN;
RUN;

Education: Age consists of more than 50% missing values which are represented as 99 in the original dataset.  So, we must decide on this variable whether to include in the rest of the data or impute these missing values
PROC FREQ DATA = P2.ABN;
TABLE EDUCATION;
RUN;


Age:  Age has 3 missing values which are represented by value 99.
PROC FREQ DATA = P2.ABN;
TABLE age;
RUN;


Region: Region consists of 46 missing values which are represented by value a ‘*’.
PROC FREQ DATA = P2.ABN;
TABLE region;
RUN;

Modifying original data:
Based on the above observations,
LIBNAME proj 'C:\Users\Vinod Prem Kumar\Documents\My SAS Files\project 2';
data assign1.newfile;
set assign1.data_books;
if education = "99" then education = '.';
if age = "99" then age = ".";
if region = "*" then region = ".";
run;
Part I: Modeling Count Data
1.     Count number of books purchased from BN in 2007
proc sql;
create table assign1.count as
select distinct userid, region, hhsz, age, income, child, race, country, domain,
sum(qty) as total_qty, sum(price) as total_sum
from assign1.newfile
where domain = "barnesandnoble.com"
group by userid, region, hhsz, age, income, child, race, country;
quit;
Proc Print data=assign1.count (obs=10);
Run;


1.     NBD Model
LIBNAME assign1 'C:\Users\Vinod Prem Kumar\Downloads';
Proc Print Data=assign1.data_books;
Run;
LIBNAME assign1 'C:\Users\Vinod Prem Kumar\Documents\My SAS Files\project 2';
Proc Print Data=assign1.newfile;
Run;
Proc sql;
Select count(unique(userid)) from assign1.newfile;
quit; /*Unique User IDs that made purchase from B&N or both*/
Proc sql;
Select count(unique(userid)) from assign1.count;
quit/*Creating table with total books and number of people*/
Proc sql;
Create table assign1.nbdModel as Select total_qty, count(userid) as num_people from assign1.count group by total_qty;
quit; /*Inserting People with 0 number of purchases from B&N*/
Proc sql;
Insert into assign1.nbdModel (total_qty, num_people) values (0,7639);
quit; /*Sorting the data and printing the first 10 observations for reference*/
Proc sort data=assign1.nbdModel; by total_qty;
run;
Proc print data=assign1.nbdModel (obs=10);
Run;


1.     Report Reach, Average Frequency and GRPs
p(X(t)=0)= ( a/a+t ) r = ( 0.1299/0.1299+1 ) 0.09723 = 0.810324813
E(X(t)) = rt/a = 0.09723*1/0.1299 = 0.748498845
Reach = 1-p(X(t)=0) = 1 – 0.8103 = 18.97%
Avg. Frequency = E(X(t))/Reach*100 = 3.946213827
GRPs = 100*E(X(t)) = 100*0.748498845 = 74.8498845

2.     Poisson Regression Model
LIBNAME assign1 'C:\Users\Vinod Prem Kumar\Documents\My SAS Files\project 2';
Proc Print Data=assign1.newfile;
Run;
proc sql;
Create Table assign1.table1 as
Select userid, region, hhsz, age, income, child, race, country, domain, sum(qty) as total_books
from assign1.newfile
group by userid, region, hhsz, age, income, child, race, country, domain;
quit;
/*---Users that bought from Barsne&Noble---*/
proc sql;
Create Table assign1.barn as
Select userid from assign1.table1
where domain = "barnesandnoble.com";
quit;
/*---Users that bought from Amazon---*/
proc sql;
Create Table assign1.amazon as
Select userid from assign1.table1
where domain = "amazon.com";
quit;
/*---Users that bought from both Amazon and BarnesAndNoble---*/
proc sql;
Create Table assign1.both as
Select * from assign1.amazon intersect
Select * from assign1.barn;
quit;
/*---Users that bought only from Amazon and not from BarnesAndNoble---*/
proc sql;
Create table assign1.justamazon as
Select * from assign1.amazon
except Select * from assign1.both;
quit;
/*---Adding Demographic details to AmazonOnly Table---*/
proc sql;
Create Table assign1.justamazondemo as
Select all.userid, all.region, all.hhsz, all.age, all.income, all.child, all.race, all.country
from assign1.table1 as all, assign1.justamazon as amazon1
where all.userid = amazon1.userid;
quit;
/*---For the above data, since no one bought from BarnesAndNoble, total_books=0---*/
data assign1.justamazondemo;
set assign1.justamazondemo;
domain = "amazon.com";
total_books = 0;
run;
/*---Creating Table for people who bought from BarnesAndNoble along with demographic details---*/
Proc sql;
Create Table assign1.Barndemo as
Select * from assign1.table1
where domain = "barnesandnoble.com";
quit;

/*Combining the Bndemographic table with amazononlydemographic table to obtain the final table required for regression*/
Proc sql;
Create Table assign1.final as
Select * from assign1.justamazondemo
union Select* from assign1.Barndemo;
quit;

/*First 10 observations for the resulting data obtained for reference*/
Proc print data=assign1.final (obs=10);
run;

/*--------------Poisson Regression for the data abi.final-------------------- --*/
Proc NLMixed Data=assign1.Final;
Parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0;
m = m0*exp(b1*region+b2*hhsz+b3*age+b4*income+b5*child+b6*race+b7*country);
ll=log((m**total_books)*(exp(-m))/fact(total_books));
Model total_books ~ general(ll);
Run;





Maximum LL = 18819
Value of λ0 = 0.9497
Managerial Implications:
As per the above results, it is observed that region (b1), age (b3), income (b4), child (b5), race (b6) and country (b7) are significant parameters considering p value at 5%. We can also observe that the p value for hhsz is more than 0.05.

5. LL Formula - NBD Regression Model
NBD Regression Model is defined given below:

LL (Log Likelihood) is summation of (Total number of books purchased * P(Yi=y)) where exp(β’xi) = exp(b1* education + b2*region +b3 *hhsz + b4*age + b5*income + b6*child + b7*race + b8*country)

6. NBD Regression Model using customer characteristics
NBD Regression Model be represented using SAS code as follows:
Proc NLMixed Data=assign1.Final;
Parms alpha=1 r=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0;
expB= exp(b1*region+b2*hhsz+b3*age+b4*income+b5*child+b6*race+b7*country);
ll= log((GAMMA(r+total_books)/(GAMMA(r)*fact(total_books)))*((alpha/(alpha+expB)) **r)*((expB/(alpha+expB))**total_books));
Model total_books ~ general(ll);
Run;



Maximum LL = 8371.0292
α = 0.1074
r = 0.09814
As per the above observation, we can say that region (b1), race(b6) are significant. Hhsz (b2), age
(b3), income (b4), child (b5) and country (b7) are not significant.

7. Managerial takeaways between Poisson Regression and NBD Regression
/*---Predicting the values using Poisson Regression Results---*/
LIBNAME assign1 'C:\Users\Vinod Prem Kumar\Documents\My SAS Files\project 2';
%let m0 = 0.9497;
%let b1 = -0.1022;
%let b2 = -0.01499;
%let b3 = 0.02511;
%let b4 = 0.01499;
%let b5 = 0.07290;
%let b6 = -0.2083;
%let b7 = -0.1179;
Data PoiFinal (drop = y);
set assign1.Final;
m = &m0*exp(&b1*region+&b2*hhsz+&b3*age+&b4*income+&b5*child+&b6*race+&b7*country );
array prob(11) prob0 - prob10;
prob0 = poisson(m,0);
prob10 = 1-prob0;
do y=1 to 9;
prob(y+1)=poisson(m,y)-poisson(m,y-1);
prob10=prob10-prob(y+1);
end;
Run;

Proc means data=PoiFinal mean;
var prob0-prob10;
 Run;



Books
Actual
Predicted
0
7639
4501.02
1
753
3307.826
2
362
1246.635
3
175
320.576
4
126
63.18372
5
82
10.16739
6
74
1.389722
7
30
0.165752
8
48
0.017586
9
31
0.001684
10
131
0.00016
Total
9451























Books
Actual
Predicted
0
7639
7658.841
1
753
661.8573
2
362
320.4371
3
175
197.7017
4
126
135.1417
5
82
97.79706
6
74
73.39552
7
30
56.49713
8
48
44.30912
9
31
35.25412
10
131
169.7693
Total
9451

  


/*---Predicting the values using NBD Regression Results---*/
LIBNAME assign1 'C:\Users\Vinod Prem Kumar\Documents\My SAS Files\project 2';
%let alpha = 0.1078;
%let r = 0.09804;
%let b1 = -0.1035;
%let b2 = -0.00850;
%let b3 = 0.02854;
%let b4 = 0.01730;
%let b5 = 0.05816;
%let b6 = -0.2099;
%let b7 = -0.1015;
Data NBDProb (drop=y);
set assign1.Final;
expB = exp(&b1*region+&b2*hhsz+&b3*age+&b4*income+&b5*child+&b6*race+&b7*country);
array prob(11) prob0-prob10;
prob0 = (GAMMA(&r+0)/(GAMMA(&r)*fact(0)))*((&alpha/(&alpha+expB))**&r)*((expB/(&alpha +expB))**0);
prob10 = 1 - prob0;
do y=1 to 9;
prob(y+1)= (GAMMA(&r+y)/(GAMMA(&r)*fact(y)))*((&alpha/(&alpha+expB))**&r)*((expB/(&alpha +expB))**y);
prob10=prob10-prob(y+1);
end;
run;
Proc means data=NBDprob mean;
Var prob0-prob10;
run;

Considering the log-likelihood value of Poisson Regression (~ -18819) and NBD Regression (~ -8359) we observe that there is a large difference between the two results. From this, we can infer that NBD Regression is a good model as it has better maximum log likelihood value. As is evident from the graph of the predictions through Poisson Regression and NBD Regression, NBD regression performs much better in predicting total when compared to actual values.
Also, we can observe that only region(b1), race(b6) are significant in NBD regression model whereas region (b1), age (b3), income (b4), child (b5), race(b6) and country (b7) are significant in the Poisson regression model.
Possible explanation:-In the Poisson regression model, we considered only how people differ on a set of available explanatory variables (We keep the lambda constant), while in NBD regression model, we try to capture the unobserved variables which might influence the target variables along with the available explanatory variables. (i.e we varied the 0 across the customers according to a gamma distribution with parameters r and a)
Only Region was a significant variable in NBD Regression and it performed better as compared to Race, Age and Income being a significant variable in Poisson Regression. The explanatory variables are not enough to capture the actual difference among individuals which we used in Poisson Regression.
Further explanation: - As we observed, only two variables are significant in the NBD regression model (region(b1), race(b6)). This change in output of NBD regression might be a phenomenon of omitted variables’ bias (unobserved variables). These omitted variables might be correlated with the included independent variables and the dependent variable. In other words, in the regression the error term might be correlated with the independent variables and hence the difference.
To capture the unobserved component of differences among individuals, we must consider variation of lambda across the population which we used in NBD Regression. Finally, the LL value decreased to 8358.7726 in NBD from 18819.0476 in Poisson Regression.

8. Comparing the models - LR Test
Consider two models A and B, taking thoughts from the concept that we can arrive at model B by placing k constraints on the parameters of model A.
Now considering λ0 constant as a constraint for Poisson regression, let our models be as follows:
Model A: NBD regression
Model B: Poisson regression
Therefore, we use the likelihood ratio test to determine whether model A, which has more parameters (r and α) fits the data better than model B.
The null hypothesis is that model A is not different from model B
Computing test statistic
 LR = −2(LLB − LLA)
     = -2(-18819.0476 -(-8358.7726))
LR = 20920.55
χ2 (.05, k) = χ2 (.05,1)
χ2 =3.84146
Since, LR > χ2 (.05, k), we reject null hypothesis. Hence, we state that Model A is better than model B.
Therefore, NBD model is better than Poisson regression model.


PART 2: Improving the model
9. Variable selection
Explored the variables in SAS Enterprise Miner and found out that Education had more than 75 percent missing values. So, we did not try to impute it as it would skew the data. Hence, we rejected the Education variable.
We then ran the PCA on SAS EM without assigning the data types to the different variables and got the following result:


We ran NBD regression model to find how each variables were significant
LIBNAME assign1 'C:\Users\Vinod Prem Kumar\Documents\My SAS Files\project 2';
data assign1.newfile;
set assign1.data_books;
if age = 99 then age = 6;
if country = 1 then region = 5;
if region = ' ' and country = 0 then region = 3;
DROP education count;
run;

proc genmod data = assign1.data_books;
model domain= region hhsz age income child race;
run;


We found that Region, income and race were the most influential variables for domain.
Variable Country wasn’t adding much value to explain the variation in the dependent variable. Hence, we tried to analyze if the Country variable could be eliminated. 
Region Variable has 4 bins which represent the US countries. Country has only US and non-US, we can derive country from Region. So, we added an extra bin to the Region variable to represent all the non-US countries and rejected the Country variable. 
Now we assigned the data types to all the variables and rejected the ones which weren’t needed for our analysis. Then ran PCA



 We ran NBD regression model to check for significance values


LIBNAME proj 'C:\Users\Vinod Prem Kumar\Documents\My SAS Files\project 2';
data proj.newfile1;
set proj.data_books;
if education = "99" then education = ".";
if age = "99" then age = "6";
if country = 1 then region = 5;
if region = ' ' and country = 0 then region = 3;
run;


proc genmod data = assign1.newfile1;
model domain= region hhsz age income child race;
run;

10.
We identified DATE variable our analysis, date can be broken down into Weekdays and Weekends to give more comprehensive view to our analysis.
To convert date from calculate day of the week taking wd=1 as Sunday and wd=7 as Saturday numeric yyyymmdd format to ddMMMyyyy format.

LIBNAME assign1 'C:\Users\Vinod Prem Kumar\Documents\My SAS Files\project 2';

From the question 9, as we know the variable Country is redundant and the data can be retained by adding a new category to the region which is 5th region for Non-US (Country = 1) and imputing the region values with the mode value for missing.

Week day (wd) Variable from Date variable:

DATA assign1.ABNWeek;
SET assign1.newfile;
IF COUNTRY = 1  THEN REGION = 5;
IF COUNTRY = 0  AND REGION = ' ' THEN REGION = 3 ;
ORDERDATE = input(put(date, 8.), yymmdd8.);
format ORDERDATE date9.;
wd = WEEKDAY(ORDERDATE);
DROP COUNTRY DATE;
RUN;

Created a dataset BNWeekData containing the Quantity classifying Weekend and Weekday quantity for Barnes and Noble domain.

PROC SQL;
CREATE TABLE assign1.BNWeekData AS
SELECT userid,
CASE
WHEN Wd=1 THEN qty
WHEN Wd=7 THEN qty
ELSE 0
END
AS weekendQty,
CASE
WHEN Wd=2 THEN qty
WHEN Wd=3 THEN qty
WHEN Wd=4 THEN qty
WHEN Wd=5 THEN qty
WHEN Wd=6 THEN qty
ELSE 0
END as weekdayQty,
ORDERDATE
FROM assign1.ABNWeek
WHERE domain="barnesandnoble.com" ;
QUIT;


Dataset SUMBNWeekData is: Summarizing the data of total quantity purchased on Weekend and Weekday per User.

PROC SQL;
CREATE TABLE assign1.SUMBNWeekData as
SELECT DISTINCT userid,sum(weekdayQty) as TotalWeekendQty,
sum(weekendQty) as TotalWeekdayQty FROM assign1.BNWeekData GROUP BY userid;
QUIT;

proc sql;
create table assign1.countDate as
select userid,sum(qty) as total_qty, education, region, hhsz, age, income, child, race, country, domain,
sum(price) as total_sum
from assign1.newfile
where domain = "barnesandnoble.com"
group by userid, region, hhsz, age, income, child, race, country;
quit;

We have quantity purchases completed on weekdays and weekends. From this data we can assess what product are purchased on which days of the week. This is also helpful to run different models again to achieve the finest results to benefit information that impacts.

PROC SQL;
CREATE TABLE assign1.BNWeeKQty AS
SELECT DISTINCT *,
CASE
WHEN TotalWeekendQty IS NULL THEN 0
ELSE TotalWeekendQty
END
AS TotalWeekendQty,
CASE
WHEN TotalWeekdayQty IS NULL THEN 0
ELSE TotalWeekdayQty
END
AS TotalWeekdayQty
FROM assign1.countDate as CBN
LEFT JOIN
assign1.SUMBNWeekData as SBN
ON SBN.USERID = CBN.USERID;
QUIT;




Poisson Model


proc nlmixed data= assign1.BNWeeKQty;
parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0 b9=0 b10=0 b11=0;
m=m0*exp(b1*education + b2*region+b3*hhsz+b4*age+b5*income+b6*child+b7*race+b8*country + b9*total_sum+b10*TotalWeekendQty+b11*TotalWeekdayQty);
ll = total_qty*log(m)-m-log(fact(total_qty));
model total_qty~ general(ll);
run;

Results



Managerial Takeaway


We can infer from the parameter estimates that the newly constructed variables TotalWeekendQty(b10) which is Total quantity purchased on weekend and TotalWeekdayQty(b11) which is Total quantity purchased on weekday are having significant p-value less than 0.05. This indicates that these variables will impact our analysis.


NBD Regression

PROC NLMIXED DATA= assign1.BNWeeKQty;
PARMS r=1 alpha=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0 b8=0 b9=0 b10=0 b11=0;
Expn=exp(b1*education + b2*region+b3*hhsz+b4*age+b5*income+b6*child+b7*race+ b8*country + b9*total_sum+b10*TotalWeekendQty+b11*TotalWeekdayQty);
Prob=(gamma(r+total_qty)/(gamma(r)*fact(total_qty)))*((alpha/(alpha+Expn))**r)*((Expn/(alpha+Expn))**total_qty);
ll=Log(Prob);
MODEL total_qty~ GENERAL(ll);
RUN;


Managerial Takeaway

With the NBD Regression Model Maximum LL was improved. We can infer the significant variables in the above model are Income(b5), total_sum(b9), TotalWeekendQty(b10) which is Total quantity purchased on weekend and TotalWeekdayQty(b11) which is Total quantity purchased on weekday are having significant p-value less than 0.05.

Season(season) variable from the DATE variable

proc sql;
create table assign1.countSeasonBN as
select userid, region, hhsz, age, income, child, race, country, domain, date, education,
sum(qty) as total_qty, sum(price) as total_sum
from assign1.ABN
where domain = "barnesandnoble.com"
group by userid, region, hhsz, age, income, child, race, country;
quit;


DATA assign1.SeasonBN;
SET assign1.countSeasonBN;
length month $ 2 day $ 2;
date_conv = put(date,8.);
month = substr(date_conv,5,2);
day = substr(date_conv,7,2);
DROP DATE domain;
RUN;

DATA assign1.SeasonBN;
SET assign1.SeasonBN;
IF month = 01 or month = 02 or month = 03 or month = 04  THEN season = 1;
ELSE IF month = 05 or month = 06 or month = 07 THEN season=2;
ELSE season=3;
RUN;

proc GENMOD data= assign1.SeasonBN;
CLASS season region;
MODEL total_qty = education region hhsz age income child race country total_sum season / dist = Poisson link = log;
run;


Managerial Takeaway
For the class variables region and season, with the NBD distribution the significant variables in the above model are Age, Income, Race, Country, Total_sum, Season.  Purchases from the region 1 are more likely tending towards the BN than from region 2 and 3. In Season 2, people are buying more from BN than n Season 1 or 2.

Calculating the Loyalty of BN Customers
LIBNAME P2 'C:\Users\chait\Desktop\Sem3\ABA SAS\Project2\ds';
PROC sql;
Create table p2.loyalBN as
SELECT DISTINCT userid, count(qty) as CountOfQtyBN
from p2.newfile
Where domain= "barnesandnoble.com"
GROUP BY userid;
QUIT;

PROC sql;
Create table p2.loyalTotal as
SELECT DISTINCT userid, count(qty) as CountOfTotalQty
from p2.newfile
GROUP BY userid;
QUIT;

PROC sql;
Create table p2.loyalUsers as
SELECT Distinct c.userid, c.region, c.hhsz, c.age, c.income, c.child, c.race, lbn.CountOfQtyBN,lt.CountOfTotalQty
from p2.lbn as lbn inner join p2.lt as lt ON lbn.userid = lt.userid
inner join p2.ABNWeek as c ON c.userid = lbn.userid
GROUP BY lt.userid ;
QUIT;

DATA p2.loyalty;
SET p2.loyalUsers;
IF CountOfQtyBN > (CountOfTotalQty – CountOfQtyBN)
THEN loyalBN = 1;
ELSE loyalBN=0;
PercentageOfLoyalty = (CountOfQtyBN/CountOfTotalQty)*100;
RUN;

proc genmod data = p2.loyalty;
class age income child race region;
model CountOfQtyBN = age income child race region hhsz /dist = NBD
link = log;
run;







Managerial Takeaway
For the class variables age region race income with the NBD distribution the significant variables in the above model are Age, Income and Region.  Customers from the region 1 are more loyal compared to region 2, 3 and 4 towards BN. Customers falling under the band of Age 4 and 10 or Income level 5 are more loyal towards BN.
11. Interactions
a. Child and Income
proc genmod data = assign1.newfile1;
model domain= region age hhsz income*child race;
run;



As observed from Analysis of Maximum Likelihood Parameter Estimates table, the interaction of child and income variables does not yield any significant values as the p value > 0.05.

b. Age and Child
proc genmod data = assign1.newfile1;
model domain= region age*child hhsz income race;
run;

As observed from Analysis of Maximum Likelihood Parameter Estimates table, the interaction of age and Child variables does not yield any significant values as the p value > 0.05.

c. hhsz and Income
proc genmod data = assign1.newfile1;
model domain= region age hhsz*income child race;
run;


As observed from Analysis of Maximum Likelihood Parameter Estimates table, the interaction of hhsz and Income variables does not yield any significant values as the p value > 0.05.



PART 3: Certain Customers prefer Amazon over BN. Why?

12. Logistic Regression
LIBNAME assign1 'F:\6334\aba_project2_data_books';
data assign1.newfile;
set assign1.aba_project2_data_books;
if education = "99" then education = '.';
if age = "99" then age = '.';
if country = 1 then region = 5;
if region = ' ' and country = 0 then region = 3;
run;

proc sql;
create table assign1.barntable as
select distinct userid, region, hhsz, age, income, child, race, country, domain,
sum(qty) as total_barn
from assign1.newfile
where domain = "barnesandnoble.com"
group by userid;

proc sql;
create table assign1.amzntable as
select distinct userid, region, hhsz, age, income, child, race, country, domain,
sum(qty) as total_amzn
from assign1.newfile
where domain = "amazon.com"
group by userid;

proc sql;
create table assign1.final as select * from proj.barntable
 union select * from assign1.amzntable;

data assign1.final2;
set assign1.final;
if domain = "amazon.com" then do
total_amzn = total_barn;
total_barn = 0;
end;
if domain = "barnesandnoble.com" then total_amzn = 0;
run;

proc sql;
create table assign1.comb as select distinct userid, region, hhsz, age, income, child, race, country, sum(total_barn) as totbarn, sum(total_amzn) as totamzn from assign1.final2
group by userid;
run;

data assign1.finreg;
set assign1.comb;
if totbarn > 0 then flag_barn = 1;
else flag_barn = 0;
if totamzn > 0 then flag_amzn = 1;
else flag_amzn = 0;
run;
/* logistic regression for BarnesandNobles*/
proc logistic data = assign1.finreg;
model flag_barn= region hhsz age income child race;
run;
/* logistic regression for Amazon*/
proc logistic data = assign1.finreg;
model flag_amzn= region hhsz age income child race;
run;

Barnes and Nobles


Amazon


proc logisitic data = proj.finreg;
class region age income;
model flag_amzn = region hhsz age income child race/ selection = forward expb;
run;


Managerial Takeaway:
·       People from region 4 tends to purchase more from BarnesandNobles than Amazon.
·       People from U.S. leans towards Amazon over BarnesandNobles
·       Younger age people prefer Amazon over BarnesandNobles




PART 4: Summary
13. Insights and Learnings
·       Date: In the dataset the date variable is in the numeric data type. We converted the date variable to DATE 9. Format to get insights based on Weekday, Weekend, Seasons and Loyalty
·       All the regions are having the Non-Us (Country = 1) values which is incorrect, we added a new category to the region as 5 contains Non-Us value.
·       We observed that region 2 has not much sales for BN.
·       Learned how to build customized analytic model using Base SAS
·       Understood the difference between Poisson Regression and NBD Regression Model and how to compare the models using LR test
·       Dataset has 5:1 ratio of amazon is to BN. Hence, it is biased towards amazon which leads to small contribution for findings on BN
·       Although there are high influential variables, combining with non-influential variables will not yield any significant results
·       Data preprocessing played a key role in analyzing the MLE for the given data
·       Working on this project and creating customized BA model using Base SAS has equipped us with the understanding of analysis of raw data and coming up with actionable recommendations to make business more efficient