2月 20, 2020
 

Image by 동철 이 from Pixabay

SAS® Scripting Wrapper for Analytics Transfer (SWAT), a powerful Python interface, enables you to integrate your Python code with SAS® Cloud Analytic Services (CAS). Using SWAT, you can execute CAS analytic actions, including feature engineering, machine learning modeling, and model testing, and then analyze the results locally.

This article demonstrates how you can predict the survival rates of Titanic passengers with a combination of both Python and CAS using SWAT. You can then see how well the models performed with some visual statistics.

 

Prerequisites


To get started, you will need the following:

  1. 64-bit Python 2.7 or Python 3.4+
  2. SAS® Viya®
  3. Jupyter Notebook
  4. SWAT (if needed, see the Installation page in the documentation)

After you install and configure these resources, start a Jupyter Notebook session to get started!

 

Step 1. Initialize the Python packages

Before you can build models and test how they perform, you need to initialize the different Python libraries that you will use throughout this demonstration.

Submit the following code and insert the specific values for your environment where needed:

# Import SAS SWAT Library
import swat

# Import OS for Local File Paths
import os
for dirname, _, filenames in os.walk('Desktop/'):
     for filename in filenames:
          print(os.path.join(dirname, filename))

# Import Numpy Library for Linear Algebra
import numpy as np
from numpy import trapz

# Import Pandas Library for Pandas DataFrames
import pandas as pd

# Import Seaborn & Matplotlib for Data Visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

Step 2. Create the connection between Python and CAS

After you import the libraries, connect to CAS by using the SWAT package. In this demonstration, SSL is enabled in the SAS Viya environment and the SSL certificate is stored locally. If you run into any connection errors, see Encryption (SSL).
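For example, if the certificate is stored locally, a common approach is to point SWAT at it through the CAS_CLIENT_SSL_CA_LIST environment variable before connecting (the path below is a hypothetical placeholder):

# Point SWAT at the Locally Stored SSL Certificate (Hypothetical Path)
import os
os.environ['CAS_CLIENT_SSL_CA_LIST'] = r"C:\certs\viya-ca.pem"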

Use the command below to create a connection to CAS:

# Create a Connection to CAS
conn = swat.CAS("ServerName.sas.com", PORT, "UserID", "Password")

 

This command confirms the connection status:

# Verify the Connection to CAS
connection_status = conn.serverstatus()
connection_status

 

If the connection to CAS is working, the serverstatus call returns details about the server and its nodes. Once you see that status, you can begin to import and explore the data.

Step 3. Import data into CAS

To gather the data needed for this analysis, obtain the Titanic data set (titanic.csv) and save it locally.

This example saves the data in the Documents folder. With a CAS action, you can then import the data onto the Viya CAS server.

# Import Titanic Data on to Viya CAS Server
titanic_cas = conn.read_csv(r"C:\Users\krstob\Documents\Titanic\titanic.csv", 
casout = dict(name="titanic", replace=True))

Step 4. Explore the data loaded in CAS

Now that the data is loaded into CAS memory, use the SWAT interface to interact with the data set. Using CAS actions, you can look at the shape, column information, records, and descriptions of the data. A machine learning engineer should review the data this way before bringing it down locally to dive deeper into certain features.

If any of the SWAT syntax looks familiar to you, it is because SWAT is integrated with pandas. Here is a high-level look at the data (the sketch after this list shows the corresponding calls):

  • The shape (rows, columns)
  • The column information
  • The first three records
  • An in-depth feature description
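Here is a sketch of those calls through SWAT's pandas-style interface, using the titanic_cas table created in Step 3:

# The Shape (Rows, Columns) of the CAS Table
titanic_cas.shape

# The Column Information
titanic_cas.columninfo()

# The First Three Records
titanic_cas.head(3)

# An In-Depth Feature Description
titanic_cas.describe()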

Take a moment to think about this data set. If you were just using a combination of pandas and scikit-learn, you would need a good amount of data preprocessing. There are missing entries, character features that need to be converted to numeric, and data that needs to be distributed correctly for efficient and accurate processing.

Luckily, because SWAT is integrated with CAS, CAS does a lot of this work for you. CAS machine learning modeling can handle character variables, impute missing values, normalize the data, partition the data, and much more.

The next step is to take a closer look at some of the data features.

 

Step 5. Explore the data locally

CAS provides great information about the 14 features: the data types, means, unique values, standard deviations, and more. Now, bring the data back locally into what is essentially a pandas data frame and graph the variables that you believe might be useful predictors.

# Use the CAS To_Frame Action to Bring the CAS Table Locally into a Data Frame
titanic_pandas_df = titanic_cas.to_frame()

With data loaded locally, examine the numerical distributions:
# How Is the Data Distributed? (Numerical)
distribution_plot = titanic_pandas_df.drop('survived', axis=1).hist(bins = 15, figsize = (12,12), alpha = 0.75)

The pclass variable represents the passenger class (first class, second class, third class). Does the passenger class have any effect on the survival rate? To look more into this, you can plot histograms that compare pclass, age, and the number who survived.

# For Seaborn Facet Grids, Create an Empty 3-by-2 Grid to Place Data On
pclass_survived = sns.FacetGrid(titanic_pandas_df, col='survived', row = 'pclass', height = 2.5, aspect = 3)

# Overlay a Histogram of Y(Age) = Survived
pclass_survived.map(plt.hist, 'age', alpha = 0.75, bins = 25)

# Add a Legend for Readability
pclass_survived.add_legend()

Note: 1 = survived, 0 = did not survive

 

As this graph suggests, the higher the class, the better the chance that someone survived. There is also a low survival rate for the approximately 18–35 age range for the lower class. This information is great, because you can build new features later that focus on pclass.

Another common predictor is a passenger's sex. Were women and children saved first when the Titanic sank? The following code creates graphs to help answer this question:

# Create a Graph Canvas - One for Female Survival Rate - One for Male
Survived = 'Survived'
Not_Survived = 'Not Survived'
fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,4))

# Split the Data Set into Women and Men
Women = titanic_pandas_df[titanic_pandas_df['sex'] == 'female']
Male = titanic_pandas_df[titanic_pandas_df['sex'] == 'male']

# For the First Graph, Plot the Number of Women Who Survived, by Age
Female_vs_Male = sns.distplot(Women[Women['survived']==1].age.dropna(), 
               bins=25, label = Survived, ax = axes[0], kde = False)

# For the First Graph, Layer the Number of Women Who Did Not Survive, by Age
Female_vs_Male = sns.distplot(Women[Women['survived']==0].age.dropna(), 
               bins=25, label = Not_Survived, ax = axes[0], kde = False)

# Display a Legend for the First Graph
Female_vs_Male.legend()
Female_vs_Male.set_title('Females')

# For the Second Graph, Plot the Number of Men Who Survived, by Age
Female_vs_Male = sns.distplot(Male[Male['survived']==1].age.dropna(), 
               bins=25, label = Survived, ax = axes[1], kde = False)

# For the Second Graph, Layer the Number of Men Who Did Not Survive, by Age
Female_vs_Male = sns.distplot(Male[Male['survived']==0].age.dropna(), 
                bins=25, label = Not_Survived, ax = axes[1], kde = False)

# Display a Legend for the Second Graph
Female_vs_Male.legend()
Female_vs_Male.set_title('Males')

This graph confirms that both women and children had a better chance at survival.

These feature graphs are nice visuals, but a correlation matrix is another great way to compare numerical features and the correlation to survival rate:

# Heatmap Correlation Matrix That Compares Numerical Values and Survived
Heatmap_Matrix = sns.heatmap(
              titanic_pandas_df[["survived","sibsp","parch","age","fare"]].corr(),
              annot = True,
              fmt = ".3f",
              cmap = "coolwarm",
              center = 0,
              linewidths = 0.1
)

The heat map shows that the fare, surprisingly, has a significant correlation with survival rate (or seems to, at least). Keep this in mind when you build the models.

You now have great knowledge of the features in this data set. Age, fare, and sex do affect someone’s survival chances. For a general machine learning problem, you would typically explore each feature in more detail, but, for now, it is time to move on to some feature engineering in CAS.

 

Step 6. Check for missing values

Now that you have a general idea about which features are important, you can quickly clean up any missing values by using CAS. By default, CAS replaces missing values with the mean, but there are many other imputation methods to choose from.

For this test case, you can keep the default because the data set is quite small. Use the CAS impute action to perform a data matrix (variable) imputation that fills in missing values.

First, check to see how many missing values are in the data set:

# Check for Missing Values
titanic_cas.distinct()

Both fare and age are important numeric variables that have missing values, so run the impute action with SWAT to fill in any missing values:

# Impute Missing Values (Replace with Substituted Values [By Default w/ the Mean])
conn.dataPreprocess.impute(
     table = 'titanic',
     inputs = ['age', 'fare'],
     copyAllVars = True,
     casOut = dict(name = 'titanic', replace = True)
)

And, just like that, CAS takes care of the missing numeric values and creates new variables prefixed with IMP_ (for example, IMP_age and IMP_fare). This step would have taken more time with pandas or scikit-learn, so CAS is a nice time saver.
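As a quick sanity check, you can rerun the distinct action against the updated table and confirm that the new IMP_ columns report no missing values:

# Re-Check Missing Values After Imputation
conn.CASTable('titanic').distinct()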

Step 7. Load the data locally to create new features

You now have great, clean data that you can model in CAS. Sometimes, though, for a machine learning problem, you want to create your own custom features.

It is easy to do that by using a data frame locally, so bring the data back to your local machine to create some custom features.

Using the SWAT to_frame() method, convert the CAS table into a local data frame. Keep only the variables needed for modeling:

# Use the SWAT to_frame Method to Bring the CAS Table Locally into a Data Frame
titanic_pandas_df = titanic_cas.to_frame()

# Remove Some Features That Are Not Needed for Predictive Modeling
titanic_pandas_df = titanic_pandas_df[['embarked',
     'parch',
     'sex',
     'pclass',
     'sibsp',
     'survived',
     'IMP_fare',
     'IMP_age']
]

After the predictive features are available locally, confirm that the CAS statistical imputation worked:
# Check How Many Values Are Null by Using the isnull() Function
total_missing = titanic_pandas_df.isnull().sum().sort_values(ascending=False)
total_missing.head(5)

# Find the Total Values
total = titanic_pandas_df.notnull().sum().sort_values(ascending=False)
total.head(5)

# Find the Percentage of Missing Values per Variable
Percent = titanic_pandas_df.isnull().sum()/titanic_pandas_df.isnull().count()*100
Percent.sort_values(ascending=False).head(5)

# Round to One Decimal Place for Less Storage
Percent_Rounded = (round(Percent,1)).sort_values(ascending=False)

# Plot the Missing Data [Total Missing, Percentage Missing] with a Concatenation of Two Columns
Missing_Data = pd.concat([total, total_missing, Percent_Rounded], axis = 1,
                         keys=['Non Missing Values', 'Total Missing Values', '% Missing'], sort=True)
Missing_Data

As you can see from the output above, all features are clean, and no values are missing! Now you can create some new features.

Step 8. Create new features

With machine learning, there are times when you want to create your own features that combine useful information into a more accurate model. Engineered features can help reduce overfitting, lower memory usage, and more.

This demo shows how to build four new features:

  • Relatives
  • Alone_On_Ship
  • Age_Times_Class
  • Fare_Per_Person

 

Relatives and Alone_On_Ship

The sibsp feature is the number of siblings and spouses, and the parch variable is the number of parents and children. You can combine these two into a Relatives feature that indicates how many relatives a passenger had on the ship in total. If a passenger traveled completely alone, you can flag that by creating the categorical variable Alone_On_Ship:

# Create a Relatives Variable / Alone on Ship
data = [titanic_pandas_df]
for dataset in data:
     dataset['Relatives'] = dataset['sibsp'] + dataset['parch']
     dataset.loc[dataset['Relatives'] > 0, 'Alone_On_Ship'] = 0
     dataset.loc[dataset['Relatives'] == 0, 'Alone_On_Ship'] = 1
     dataset['Alone_On_Ship'] = dataset['Alone_On_Ship'].astype(int)

Age_Times_Class

As discussed earlier in this demo, both age and class had an effect on survivability. So, create a new Age_Times_Class feature that combines a person’s age and class:

data = [titanic_pandas_df]

# For Loop That Creates a New Variable, Age_Times_Class
for dataset in data:
     dataset['Age_Times_Class']= dataset['IMP_age'] * dataset['pclass']

Fare_Per_Person

The Fare_Per_Person variable is created by dividing the IMP_fare variable (imputed by CAS) by the number of relatives plus 1, where the plus 1 accounts for the passenger:

# Wrap the Data Frame in a List for the Loop
data = [titanic_pandas_df]

# Loop That Creates a New Variable, Fare_Per_Person
for dataset in data:
     dataset['Fare_Per_Person'] = dataset['IMP_fare']/(dataset['Relatives']+1)
     dataset['Sib_Div_Spouse'] = dataset['sibsp']
     dataset['Parents_Div_Children'] = dataset['parch']

# Drop the Parent Variable
titanic_pandas_df = titanic_pandas_df.drop(['parch'], axis=1)

# Drop the Siblings Variable
titanic_pandas_df = titanic_pandas_df.drop(['sibsp'], axis=1)

With these new features that you created with Python, here is how the data set looks:
# Look at How the Data Is Distributed
titanic_pandas_df.head(5)

Step 9. Load the data back into CAS for model training

Now you can make some models! Load the clean data back into CAS and start training:

# Upload the Data Set
titanic_cas = conn.upload_frame(titanic_pandas_df, casout = dict(name='titanic', replace=True))

These code examples display the types of the two tables:
# The Data Frame Type of the CAS Data
type(titanic_cas)

# The Data Frame Type of the Local Data
type(titanic_pandas_df)

As you can see, the Titanic CAS table is a CASTable object, and the local table is a SASDataFrame, a pandas DataFrame subclass.

Here is a final look at the data that you are going to train:
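One quick way to do that, sketched here with the pandas-style head method on the CAS table:

# A Final Look at the Data Before Training
titanic_cas.head(5)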

If you were training a model using scikit-learn, it would not really achieve great results without more preparation. Most of the values are in float format, and there are a few categorical variables. Luckily, CAS can handle these issues when it builds the models.

 

Step 10. Create a testing and training set

One of the awesome things about CAS machine learning is that you do not have to manually separate the data set. Instead, you can run a partitioning function and then model and test based on these parameters. To do this, you need to load the Sampling action set. Then, you can call the srs action, which can quickly partition a data table.

# Partitioning the Data
conn.loadActionSet('sampling')
conn.sampling.srs(
     table = 'titanic',
     samppct = 80,
     partind = True,
     output = dict(casout = dict(name = 'titanic', replace = True), copyVars = 'ALL')
)

This code partitions the data and adds an indicator variable, _PartInd_, to each row. When _PartInd_ equals 0, the row belongs to the testing set; when it equals 1, the row belongs to the training set.
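To verify the roughly 80/20 split, one option is SWAT's pandas-style value_counts on the indicator column (a sketch; confirm the method is available for CAS table columns in your SWAT version):

# Count the Training (1) and Testing (0) Rows
conn.CASTable('titanic')['_PartInd_'].value_counts()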

 

Step 11. Build various models with CAS

One of my favorite parts about machine learning with CAS is how simple building a model is. With looping, you can dynamically change the targets, inputs, and nominal variables, which makes CAS a great fit when you are iterating toward a highly accurate model.

Building a model with CAS requires you to do a few things:

  • Load the action set (in this demo, a forest model, decision tree, and gradient boosting model)
  • Set your model variables (targets, inputs, and nominals)
  • Train the model

 

Forest Model

What does the data look like with a forest model? Here is the relevant code:

# Load the decisionTree CAS Action Set
conn.loadActionSet('decisionTree')

# Set Our Target for Predictive Modeling
target = 'survived'

# Set Inputs to Use to Predict Survived
inputs = ['sex', 'pclass', 'Alone_On_Ship',
          'Age_Times_Class', 'Relatives', 'IMP_age',
          'IMP_fare', 'Fare_Per_Person', 'embarked']

# Set Nominal Variables to Use in the Model (Categorical Variable Inputs)
nominals = ['sex', 'pclass', 'Alone_On_Ship', 'embarked', 'survived']

# Train the Forest Model
conn.decisionTree.forestTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_forest_model', replace = True)
)

Why does the input table have a WHERE clause? Because you want to train only on rows where the partition indicator equals 1, the training flag that was created with the srs action.

After running that block of code, you also get a response from CAS detailing how the model was trained, including parameters like Number of Trees, Confidence Level for Pruning, and Max Number of Tree Nodes. If you wanted to do hyperparameter tuning, this output shows you what the model currently looks like before you adjust how it executes.
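If you then want to set some of those parameters yourself, here is a sketch: nTree and maxLevel are forestTrain parameters, and the values shown are purely illustrative, not tuned recommendations.

# Train a Forest with Explicit Parameters (Illustrative Values)
conn.decisionTree.forestTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     nTree = 100,
     maxLevel = 8,
     casOut = dict(name = 'titanic_forest_model', replace = True)
)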

Decision Tree Model

The code to train a decision tree model is similar to the forest model example:

# Train the Decision Tree Model
conn.decisionTree.dtreeTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_decisiontree_model', replace = True)
)

Gradient Boosting Model

Lastly, you can build a gradient boosting model in this way:

# Train the Gradient Boosting Model
conn.decisionTree.gbtreeTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_gradient_model', replace = True)
)

Step 12. Score the models

How do you score the models with CAS? CAS has a score action for each type of model that is built. This action generates a new table that shows how the model performed on each row of input data.

Here is how to score the three models:

titanic_forest_score = conn.decisionTree.forestScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_forest_model',
     casout = dict(name='titanic_forest_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

titanic_decisiontree_score = conn.decisionTree.dtreeScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_decisiontree_model',
     casout = dict(name='titanic_decisiontree_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

titanic_gradient_score = conn.decisionTree.gbtreeScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_gradient_model',
     casout = dict(name='titanic_gradient_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

When the scoring function runs, it creates the new variables P_survived1, the predicted probability that a passenger survived, and P_survived0, the predicted probability that a passenger did not survive. With this scoring output, you can see how accurately a model classifies the passengers in the testing set.
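To peek at the scored rows, a quick sketch (the exact columns depend on the scoring options used above):

# Look at the First Few Scored Rows
conn.CASTable('titanic_forest_score').head(5)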

Because each scoring call is assigned to a Python variable, you can inspect the returned results object and see the misclassification rate!

For example, examine how the forest model did by running this code:

titanic_forest_score

The scoring function read the entire testing set and reported the misclassification error. By calculating 1 - misclassification error, you can see that the model was approximately 85% accurate. For barely exploring the data and training on a small data set, this score is good. These scores can be misleading, though, because they do not tell the entire story. Do you have more false positives or false negatives? When it comes to predicting human survival, those details are important to investigate.
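If you want to dig into that result programmatically, note that each scoring call returns a CASResults object, which behaves like a Python dictionary of result tables. A sketch (the exact table names can vary by release, so list the keys first):

# See Which Result Tables the Scoring Action Returned
list(titanic_forest_score.keys())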

To analyze the false positives and false negatives, load the percentile CAS action set. This action set provides actions for calculating percentiles and box plot values, and it also includes the assess action, which produces a final assessment of how each model did.

conn.loadActionSet('percentile')
prediction = 'P_survived1'

titanic_forest_assessed = conn.percentile.assess(
     table = 'titanic_forest_score',
     inputs = prediction,
     casout = dict(name = 'titanic_forest_assessed', replace = True),
     response = target,
     event = '1'
)

titanic_decisiontree_assessed = conn.percentile.assess(
     table = 'titanic_decisiontree_score',
     inputs = prediction,
     casout = dict(name = 'titanic_decisiontree_assessed', replace = True),
     response = target,
     event = '1'
)

titanic_gradient_assessed = conn.percentile.assess(
     table = 'titanic_gradient_score',
     inputs = prediction,
     casout = dict(name = 'titanic_gradient_assessed', replace = True),
     response = target,
     event = '1'
)

This CAS action returns three types of assessments: lift-related assessments, ROC-related assessments, and concordance statistics. Python is great at graphing data, so now you can move the data locally and visualize these new assessments.

 

Step 13. Analyze the results locally

You can plot the receiver operating characteristic (ROC) curve and the cumulative lift to determine how the models performed. Using the ROC curve, you can then calculate the area under the ROC curve (AUC) to see overall how well the models predicted the survival rate.

What exactly is a ROC curve or lift?

  • A ROC curve is determined by plotting the true positive rate (TPR) against the false positive rate (FPR). The true positive rate is the proportion of positive observations that were correctly predicted to be positive. The false positive rate is the proportion of negative observations that were incorrectly predicted to be positive.
  • A lift chart is derived from a gains chart. The X axis acts as a percentile (depth), and the Y axis is the ratio of the gains value of your model to the gains value of a model that chooses passengers at random. That is, it details how many times better the model is than random selection.

Before you can plot these tables locally, you need to create a connection to them. CAS created some new assessed tables, so create a connection to these CAS tables for analysis:

# Assess Forest
titanic_assess_ROC_Forest = conn.CASTable('titanic_forest_assessed_ROC')
titanic_assess_Lift_Forest = conn.CASTable('titanic_forest_assessed')

titanic_ROC_pandas_Forest = titanic_assess_ROC_Forest.to_frame()
titanic_Lift_pandas_Forest = titanic_assess_Lift_Forest.to_frame()

# Assess Decision Tree
titanic_assess_ROC_DT = conn.CASTable('titanic_decisiontree_assessed_ROC')
titanic_assess_Lift_DT = conn.CASTable('titanic_decisiontree_assessed')

titanic_ROC_pandas_DT = titanic_assess_ROC_DT.to_frame()
titanic_Lift_pandas_DT = titanic_assess_Lift_DT.to_frame()

# Assess GB
titanic_assess_ROC_gb = conn.CASTable('titanic_gradient_assessed_ROC')
titanic_assess_Lift_gb = conn.CASTable('titanic_gradient_assessed')

titanic_ROC_pandas_gb = titanic_assess_ROC_gb.to_frame()
titanic_Lift_pandas_gb = titanic_assess_Lift_gb.to_frame()

Now that there is a connection to these tables, you can use the Matplotlib library to plot the ROC curve. Plot each model on this graph to see which model performed the best:
# Plot ROC Locally
plt.figure(figsize = (10,10))
plt.plot(1-titanic_ROC_pandas_Forest['_Specificity_'],
         titanic_ROC_pandas_Forest['_Sensitivity_'], 'bo-', linewidth = 3)
plt.plot(1-titanic_ROC_pandas_DT['_Specificity_'],
         titanic_ROC_pandas_DT['_Sensitivity_'], 'ro-', linewidth = 3)
plt.plot(1-titanic_ROC_pandas_gb['_Specificity_'],
         titanic_ROC_pandas_gb['_Sensitivity_'], 'go-', linewidth = 3)
plt.plot(pd.Series(range(0,11,1))/10, pd.Series(range(0,11,1))/10, 'k--')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(['Forest', 'Decision Tree', 'Gradient Boosting'])
plt.show()

You can also take the depth and cumulative lift scores from the assessed data set and plot that information:

# Plot Lift Locally
plt.figure(figsize = (10,10))
plt.plot(titanic_Lift_pandas_Forest['_Depth_'], titanic_Lift_pandas_Forest['_CumLift_'], 'bo-', linewidth = 3)
plt.plot(titanic_Lift_pandas_DT['_Depth_'], titanic_Lift_pandas_DT['_CumLift_'], 'ro-', linewidth = 3)
plt.plot(titanic_Lift_pandas_gb['_Depth_'], titanic_Lift_pandas_gb['_CumLift_'], 'go-', linewidth = 3)
plt.xlabel('Depth')
plt.ylabel('Cumulative Lift')
plt.title('Cumulative Lift Curve')
plt.legend(['Forest', 'Decision Tree', 'Gradient Boosting'])
plt.show()

Although these curves work well for exploring model performance and spotting where you might tune further, most people just want a single number that summarizes how well the model performed. To get one, you can integrate the ROC curve. The resulting area is equal to the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance.

# Forest Scores
x_forest = np.array(titanic_ROC_pandas_Forest['_Specificity_'])
y_forest = np.array(titanic_ROC_pandas_Forest['_Sensitivity_'])

# Decision Tree Scores
x_dt = np.array(titanic_ROC_pandas_DT['_Specificity_'])
y_dt = np.array(titanic_ROC_pandas_DT['_Sensitivity_'])

# GB Scores
x_gb = np.array(titanic_ROC_pandas_gb['_Specificity_'])
y_gb = np.array(titanic_ROC_pandas_gb['_Sensitivity_'])

# Calculate Area Under the Curve (Integrate)
area_forest = trapz(y_forest, x_forest)
area_dt = trapz(y_dt, x_dt)
area_gb = trapz(y_gb, x_gb)

# Table for Model Scores
Model_Results = pd.DataFrame({
    'Model': ['Forest', 'Decision Tree', 'Gradient Boosting'],
    'Score': [area_forest, area_dt, area_gb]})

Model_Results

With the AUC ROC score, you can now see how well the model performs at distinguishing between positive and negative outcomes.
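If you have scikit-learn installed locally, you can cross-check the trapezoidal integration with its auc helper. This sketch uses the forest's ROC table from above; because auc() requires monotonic x values, the points are sorted by false positive rate first:

# Optional Cross-Check of the Forest AUC with scikit-learn
from sklearn.metrics import auc

fpr = 1 - np.array(titanic_ROC_pandas_Forest['_Specificity_'])
tpr = np.array(titanic_ROC_pandas_Forest['_Sensitivity_'])
order = np.argsort(fpr)                # auc() requires monotonic x values
print(auc(fpr[order], tpr[order]))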

 

Conclusion

Any machine learning engineer should take time to investigate integrating SAS Viya with their normal programming environment. Working through the SWAT Python interface, CAS excels at quickly building and scoring models, even with large data sets, because the data is stored in CAS memory. If you want to go further in depth with ensemble methods, I would recommend SAS® Model Studio on SAS Viya, or perhaps one of the many great open-source libraries, like scikit-learn for Python.

The ability of CAS to quickly clean, model, and score a prediction is quite impressive. If you would like a further look at what SWAT and CAS can do, check out the available action sets.

If you would like some more information about SWAT and SAS® Viya®, see these resources:

Would you like to see SWAT machine learning work with a larger data set, or perhaps use SWAT to build a neural network? Please leave a comment!

Building Machine Learning Models by Integrating Python and SAS® Viya® was published on SAS Users.


2月 19, 2020
 

Are you a statistical programmer whose company has adopted SAS Viya? If so, you probably know that the DATA step can run in parallel in SAS Cloud Analytic Services (CAS). As Secosky (2017) says, "running in a single thread in SAS is different from running in many threads in CAS." He goes on to state, "you cannot take any DATA step, change the librefs used, and have it run correctly in parallel. You ... have to know what your program is doing to make sure you know what it does when it runs in parallel."

This article discusses one aspect of "know what your program is doing." Specifically, to run in parallel, the DATA step must use only functions and statements that are "CAS-enabled." Most DATA step functions run in CAS. However, there is a set of "SAS-only" functions that do not run in CAS. This article discusses these functions and provides a link to a list of the SAS-only functions. It also shows how you can get SAS to tell you whether your DATA step contains a SAS-only function.

DATA steps that run in CAS

By default, the DATA step will attempt to run in parallel when the step satisfies three conditions (Bultman and Secosky, 2018, p. 2):

  1. All librefs in the step are CAS engine librefs to the same CAS session.
  2. All statements in the step are supported by the CAS DATA step.
  3. All functions, CALL routines, formats, and informats in the step are available in CAS.

The present article is concerned with the third condition. How can you know in advance whether all functions and CALL routines are available in CAS?

A list of DATA step functions that do not run in CAS

For every DATA step function, the SAS Viya documentation indicates whether the function is supported on CAS or not. For example, the following screenshots of the Viya 3.5 documentation show the documentation for the RANUNI function and the RAND function:

Notice that the documentation for the RANUNI function says, "not supported in a DATA step that runs in CAS," whereas the RAND function is in the "CAS category," which means that it is supported in CAS. This means that if you use the RANUNI function in a DATA step, the DATA step will not run in CAS. (The same is true for the other old-style random number functions, such as RANNOR.) Instead, the step will try to run in SAS. This could result in copying the input data out of CAS, running the program in a single thread, and then copying the final data set back into a CAS table. Copying all that data is not efficient.

Fortunately, you do not need to look up every function to determine if it is a CAS-enabled or SAS-only function. The documentation now includes a list, by category, of the Base SAS functions (and CALL routines) that do not run in CAS. The following screenshot shows the top of the documentation page.

Can you always rewrite a DATA step to run in CAS?

For the example in the previous section (the RANUNI function), there is a newer function (the RAND function) that has the same functionality and is supported in CAS. Thus, if you have an old SAS program that uses the RANUNI function, you can replace that call with RAND("UNIFORM") and the modified DATA step will run in CAS. Unfortunately, not all functions have a CAS-enabled replacement. There are about 200 Base SAS functions that do not run in CAS, and most of them cannot be replaced by an equivalent function that runs in CAS.
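For example, here is a minimal sketch of such a rewrite (the libref MyCASLib and the data set name Want are placeholders):

/* CAS-enabled replacement for x = ranuni(1234) */
data MyCASLib.Want;
   call streaminit(1234);      /* seed the RAND family of functions */
   do i = 1 to 10;
      x = rand("Uniform");     /* runs in CAS, unlike RANUNI */
      output;
   end;
run;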

The curious reader might wonder why certain classes of functions cannot run in CAS. Here are a few sets of functions that do not run in CAS, along with a few reasons:

  • Functions specific to the English language, such as the SOUNDEX and SPEDIS functions. Also, functions that are specific to single-byte character sets (especially I18N Level 0). Most of these functions are not applicable to an international audience that uses UTF-8 encoding.
  • Functions and statements for reading and writing text files. For example, INFILE, INPUT, and FOPEN/FCLOSE. There are other ways to import text files into CAS.
  • Macro-related functions such as SYMPUT and SYMGET. Remember: There are no macro variables in CAS! The macro pre-processor is a SAS-specific feature, and one of the principles of SAS Viya is that programmers who use other languages (such as Python or Lua) should have equal access to the functionality in Viya.
  • Old-style functions for generating random numbers from probability distributions. Use the RAND function instead.
  • Functions that rely on a single-threaded execution on a data set that has ordered rows. Examples include DIF, LAG, and some permutation/combination functions such as ALLCOMB. Remember: A CAS data table does not have an inherent order.
  • The US-centric state and ZIP code functions.
  • Functions for working with Git repositories.

How to force the DATA step to stop if it is not able to run in CAS

When a DATA step runs in CAS, you will see a note in the log that says:
NOTE: Running DATA step in Cloud Analytic Services.

If the DATA step runs in SAS, no note is displayed. Suppose that you intend for a DATA step to run in CAS but you make a mistake and include a SAS-only function. What happens? The default behavior is to (silently) run the DATA step in SAS and then copy the (possibly huge) data into a CAS table. As discussed previously, this is not efficient.

You might want to tell the DATA step that it should run in CAS or else report an error. You can use the SESSREF= option to specify that the DATA step must run in a CAS session. For example, if MySess is the name of your CAS session, you can submit the following DATA step:

/* use the SESSREF= option to force the DATA step to report an ERROR if it cannot run in CAS */
data MyCASLib.Want / sessref=MySess;
   x = ranuni(1234);    /* this function is not supported in the CAS DATA step */
run;
NOTE: Running DATA step in Cloud Analytic Services.
ERROR: The function RANUNI is unknown, or cannot be accessed.
ERROR: The action stopped due to errors.

The log is shown. The NOTE says that the step was submitted to a CAS session. The first ERROR message reports that your program contains a function that is not available. The second ERROR message reports that the DATA step stopped running. A DATA step that runs on CAS calls the dataStep.runCode action, which is why the message says "the action stopped."

This is a useful trick. The SESSREF= option forces the DATA step to run in CAS. If it cannot, it tells you which function is not CAS-enabled.

Other ways to monitor where the DATA steps runs

The DATA step documentation in SAS Viya contains more information about how to control and monitor where the DATA step runs. In particular, it discusses how to use the MSGLEVEL= system option to get detailed information about where a DATA step ran and in how many threads. The documentation also includes additional examples and best practices for running the DATA step in CAS. I recommend the SAS Cloud Analytic Services: DATA Step Programming documentation as the first step towards learning the advantages and potential pitfalls of running a DATA step in CAS.
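As a minimal sketch (reusing the hypothetical MyCASLib libref and MySess session from the previous example):

/* MSGLEVEL=I requests additional notes, including where the DATA step ran
   and how many threads it used */
options msglevel=i;

data MyCASLib.Copy / sessref=MySess;
   set MyCASLib.Want;
run;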

Summary

The main purpose of this article is to provide a list of Base SAS functions (and CALL routines) that do not run in CAS. If you include one of these functions in a DATA step, the DATA step cannot run in CAS. This can be inefficient. You can use the SESSREF= option on the DATA statement to force the DATA step to run in CAS. If it cannot run in CAS, it stops with an error and informs you which function is not supported in CAS.

The post A list of SAS DATA step functions that do not run in CAS appeared first on The DO Loop.

2月 182020
 

In case you missed the news, there is a new edition of The Little SAS Book! Last fall, we completed the sixth edition of our book, and even though it is actually a few pages shorter than the fifth edition, we managed to add many more topics to the book. See if you can answer this question.

The answer is D – all of the above! We also added new sections on subsetting, summarizing, and creating macro variables using PROC SQL, new sections on the XLSX LIBNAME engine and ODS EXCEL, more on iterative DO statements, a new section on %DO, and more. For a summary of all the changes, see our blog post “The Little SAS Book 6.0: The best-selling SAS book gets even better."

Updating The Little SAS Book meant updating its companion book, Exercises and Projects for The Little SAS Book, as well. The exercises and projects book contains multiple-choice and short-answer questions, as well as programming exercises, that cover the same topics as The Little SAS Book. It can be used in a classroom setting or by anyone wanting to test their SAS knowledge and practice what they have learned.

Here are examples of the types of questions you might find in the exercises and projects book.

Multiple Choice

Short Answer

Programming Exercise

Solutions

In the book, we provide solutions for odd-numbered multiple choice and short answer questions and hints for the programming exercises.

  1. B
  2. Hint: New variables (columns) can be specified in the SELECT clause. Also, see our blog post “Expand your SAS Knowledge by Learning PROC SQL.”

While we don’t provide solutions for even-numbered questions, we can tell you that the iterative DO statement is covered in Section 3.12 of The Little SAS Book, Sixth Edition, “Using Iterative DO, DO WHILE, and DO UNTIL Statements.” The %DO statement is covered in Section 7.7, “Using %DO Loops in Macros.”

For more information about these books, explore the following links to the SAS website:

The Little SAS Book, Sixth Edition

Exercises and Projects for The Little SAS Book, Sixth Edition

Test your SAS skills with the newest edition of Exercises and Projects for The Little SAS Book was published on SAS Users.

2月 182020
 

Maybe you are new to AI and analytics. Or maybe you have been working with data and analytics for decades, even before we called this work data science or decision science. As the industry has broadened from statistics and analytics to big data and artificial intelligence, some things have remained [...]

4 principles of analytics you cannot ignore was published on SAS Voices by Oliver Schabenberger

2月 172020
 

Administrators like to be able to keep track of resource usage and who is using what in a system. When an administrator has this capability, they can look out for issues of high resource usage that may have an impact on overall system performance. In Viya, data is accessed in caslibs. In this post, I will show you how an administrator can track and control resource usage of personal caslibs.

A caslib is an in-memory space that holds tables, access control lists, and data source information. As we have discussed CAS and caslibs in GEL enablement classes, one of the big asks has been about personal caslibs: how can an administrator keep track of the resources they use? Until recently we didn’t have a great answer. The good news is that now we do.

Personal caslibs are, by default, the active caslib when a user starts a CAS session. This gives each user a default location to access and load data (sort of like a home directory). As the name suggests, they are personal and only the user who starts the session can access the data. In early releases, administrators saw this as a big problem because personal caslibs were basically invisible to them. There was no way to monitor what was going on with a personal caslib, leaving the system open to issues where one user could consume a lot of resources and have an adverse impact.

Introduced in Viya 3.4, the accessControl.accessPersonalCaslibs action brings all existing personal caslibs into a session in which an administrator has assumed the data administrator or superuser role.

Running the accessControl.accessPersonalCaslibs action has the following effects. The administrator can:

  • See all personal caslibs that existed at the time the action was run
  • View the characteristics of promoted tables within the personal caslibs
  • Drop promoted tables within other users’ personal caslibs

This elevation of access remains in effect for the duration of the session. The action does not give an administrator full access to personal caslibs: an administrator can never fetch data from other users’ personal caslibs, drop any personal caslib, set access controls on any personal caslib, or set access controls on any table in any personal caslib. What it does give the administrator is a view into personal caslibs so that they can monitor and troubleshoot resource usage.

Let’s look at how it works. Logged in to Viya as a Viya administrator (by default, also a CAS administrator), I can use the table.caslibinfo action to see all the caslibs that I have permission to view. In the output below, I see my own personal caslib and all the other caslibs that the administrator has permission (set by the CAS authorization system) to view.

cas mysess;
proc cas;
   table.caslibinfo;
quit;
cas mysess terminate;

In the code below, the super user role is assumed for this session (SAS Administrators by default can also assume the super user role in CAS). With the role assumed, the administrator can execute the accessControl.accessPersonalCaslibs action and the subsequent table.caslibinfo action returns all caslibs including any personal caslibs that existed when the session started.

cas adminsession;
proc cas;
   /* need to be a super user or data administrator */
   accessControl.assumeRole / adminRole="superuser";
   accessControl.accessPersonalCaslibs;
   table.caslibinfo;
quit;
cas adminsession terminate;

That helps, but what about the details? How can the administrator see how many resources the tables in a personal caslib are using? To get the details, we can access each caslib and its tables and, for each table, execute the table.tabledetails action. The program below loops over all the personal caslibs and, for each caslib, loops over the in-memory tables and executes the table.tabledetails action. The output of tabledetails gives an idea of how many resources (memory, disk, and so on) a table is using.

cas adminsession;
proc cas;
   /* need to be a super user or data administrator */
   accessControl.assumeRole / adminRole="superuser";
   accessControl.accessPersonalCaslibs;
   table.caslibinfo result=fileresult;
   casliblist = findtable(fileresult);

   /* loop over all caslibs */
   do cvalue over casliblist;
      /* only look at personal (CASUSER) caslibs */
      if cvalue.name ==: 'CASUSER' then do;
         table.tableinfo result=tabresult / caslib=cvalue.name;
         tablelist = findtable(tabresult);
         x = dim(tablelist);
         /* there are tables available */
         if x > 1 then do;
            /* loop over all tables in the caslib */
            do tvalue over tablelist;
               table.tabledetails / caslib=cvalue.name name=tvalue.name;
               table.tableinfo / caslib=cvalue.name name=tvalue.name;
            end;
         end;
      end;
   end;

   accessControl.dropRole / adminRole="superuser";
quit;
cas adminsession terminate;

Two fields that can give an administrator an idea of how big a table is are:

  • Data size: the size of the SAS data set in memory
  • Memory mapped: the portion of the data that has been memory-mapped and is backed by memory-mapped files in the CAS Disk Cache

The table below shows the output for one user’s personal caslib.

If one table in particular is causing problems, it is possible for the administrator to drop the table from memory.

cas adminsession;
proc cas;
   accessControl.assumeRole / adminRole="superuser";
   accessControl.accessPersonalCaslibs;
   sessionProp.setSessOpt / caslib="CASUSER(gatedemo499)";
   table.droptable / name="TRAIN";
quit;
cas adminsession terminate;

Visibility into personal caslibs will be a big help to Viya administrators monitoring CAS resource usage. Check out the following for more details:

Viya administrators can now get personal with users' Caslibs was published on SAS Users.

2月 172020
 

A previous article shows how to interpret the collinearity diagnostics that are produced by PROC REG in SAS. The process involves scanning down numbers in a table in order to find extreme values, which can be a tedious and error-prone task. Friendly and Kwan (2009) compare it to a popular picture book called Where's Waldo?, in which children try to find one particular individual (Waldo) in a crowded scene that involves hundreds of people. The game is fun for children, but less fun for a practicing analyst who is trying to discover whether a regression model suffers from severe collinearities in the data.

Friendly and Kwan suggest using visualization to turn a dense table of numbers into an easy-to-read graph that clearly displays the collinearities, if they exist. Friendly and Kwan (henceforth, F&K) suggest several different useful graphs. I decided to implement a simple graph (a discrete heat map) that is easy to create and enables the analyst to determine whether there are collinearities in the data. One version of the collinearity diagnostic heat map is shown below. For comparison, the table from my previous article is shown below it. The highlighted cells in the table were added by me; they are not part of the output from PROC REG.


Visualization principles

There are two important sets of elements in a collinearity diagnostics table. The first is the set of condition indices, which are displayed in the leftmost column of the heat map. The second is the set of cells that show the proportion of variance explained by each row. (However, only the rows that have a large condition index are important.) F&K make several excellent points about the collinearity diagnostic table:

  • Display order: In a table, the important information is in the bottom rows. It is better to reverse-sort the table so that the largest condition indices (the important ones) are at the top.
  • Condition indices: A condition number between 20 and 30 is starting to get large (F&K use 10-30). An index over 30 is generally considered large and an index that exceeds 100 is "a sign of potential disaster in estimation" (p. 58). F&K suggest using "traffic lighting" (green, yellow, and red) to color the condition indices by the severity of the collinearity. I modified their suggestion to include an orange category.
  • Proportion of variance: F&K note that "the variance proportions corresponding to small condition numbers are completely irrelevant" (p. 58) and also that tables print too many decimals. "Do we really care that [a] variance proportion is 0.00006088?" Of course not! Therefore we should only display the large proportions. F&K also suggest displaying a percentage (instead of a proportion) and rounding the percentage to the nearest integer.

A discrete heat map to visualize collinearity diagnostics

There are many ways to visualize the Collinearity Diagnostics table. F&K use traffic lighting for the condition numbers and a bubble plot for the proportion of variance entries. Another choice would be to use a panel of bar charts for the proportion of variance. However, I decided to use a simple discrete heat map. The following list describes the main steps to create the plot. You can download the complete SAS program that creates the plot and modify it (if desired) to use with your own data. For each step, I link to a previous article that describes more details about how to perform the step. A brief sketch of the first and last steps follows the list.

  1. Use the ODS OUTPUT statement to save the Collinearity Diagnostics table to a data set.
  2. Use PROC FORMAT to define a format. The format converts the table values into discrete values. The condition indices are in the range [1, ∞) whereas the values for the proportion of variance are in the range [0, 1). Therefore you can use a single format that maps these values into 'low', 'medium', and 'high' values.
  3. The HEATMAPPARM statement in PROC SGPLOT is designed to work with data in "long format." Therefore convert the Collinearity Diagnostics data set from wide form to long form.
  4. Create a discrete attribute map that maps categories to colors.
  5. Use the HEATMAPPARM statement in PROC SGPLOT to create a discrete heat map that visualizes the collinearity diagnostics. Overlay (rounded) values for the condition indices and the important (relatively large) values of the proportion of variance.
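
Here is a minimal sketch of steps 1 and 5, using the fitness data from my previous article. The names of the long-form data set (CollinLong), its columns, and the attribute map (AttrMap) are placeholders for what the full program builds in steps 2-4:

/* step 1: save the collinearity diagnostics to a data set */
proc reg data=sashelp.fitness plots=none;
   model Oxygen_Consumption = RunTime Age Weight RunPulse MaxPulse RestPulse / collin;
   ods output CollinDiag=CollinDiag;
run;

/* steps 2-4 (not shown): apply the format, convert CollinDiag to long form,
   and create a discrete attribute map named AttrMap */

/* step 5: display the discrete heat map */
proc sgplot data=CollinLong dattrmap=AttrMap noautolegend;
   heatmapparm x=Variable y=CondIndex colorgroup=Category / attrid=collin;
run;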

The discrete heat map enables you to draw the same conclusions as the original collinearity diagnostics table. However, whereas using the table is akin to playing "Where's Waldo," the heat map makes it apparent that the most severe collinearity (top row; red condition index) is between the RunPulse and MaxPulse variables. The second most severe collinearity (second row from top; orange condition index) is between the Intercept and the Age variable. None of the remaining rows have two or more large cells for the proportion of variance.

You can download the SAS program that creates the collinearity plot. It would not be hard to turn it into a SAS macro, if you intend to use it regularly.

References

Friendly, M., & Kwan, E. (2009). "Where's Waldo? Visualizing collinearity diagnostics." The American Statistician, 63(1), 56-65. https://doi.org/10.1198/tast.2009.0012

The post Visualize collinearity diagnostics appeared first on The DO Loop.

2月 142020
 

In honor of Valentine’s day, we thought it would be fitting to present an excerpt from a paper about the LIKE operator because when you like something a lot, it may lead to love! If you want more, you can read the full paper “Like, Learn to Love SAS® Like” by Louise Hadden, which won best paper at WUSS 2019.

Introduction

SAS provides numerous time- and angst-saving techniques to make the SAS programmer’s life easier. Among those techniques are the ability to search and select data using SAS functions and operators in the data step and PROC SQL, as well as the ability to join data sets based on matches at various levels. This paper explores how LIKE is featured in each one of these techniques and is suitable for all SAS practitioners. I hope that LIKE will become part of your SAS toolbox, too.

Smooth Operators

SAS operators are used to perform a number of functions: arithmetic calculations, comparing or selecting variable values, or logical operations. Operators are loosely grouped as “prefix” (for example, a sign before a variable) or “infix,” which generally perform an operation BETWEEN two variables. Arithmetic operations using SAS operators may include exponentiation (**), multiplication (*), and addition (+), among others. Comparison operators may include greater than (>, GT) and equals (=, EQ), among others. Logical, or Boolean, operators include AND, OR, and NOT, and serve the purpose of combining SAS expressions. Some operations that are performed by SAS operators have been formalized in functions. A good example of this is the concatenation operators (|| and !!) and the more powerful CAT functions, which perform similar, but not identical, operations. LIKE operators are most frequently utilized in the DATA step and in PROC SQL.
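To illustrate that last point about similar-but-not-identical behavior, here is a minimal sketch comparing a concatenation operator with the CATS function (the variables are hypothetical):

/* || keeps the padding of the first operand; CATS strips leading and trailing blanks */
data _null_;
   length first $ 6;
   first = 'SAS';                /* stored as 'SAS   ' (padded to length 6) */
   last  = 'user';
   full1 = first || last;        /* 'SAS   user' */
   full2 = cats(first, last);    /* 'SASuser' */
   put full1= / full2=;
run;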

There is a category of SAS operators that act as comparison operators under special circumstances, generally in WHERE statements in PROC SQL, the DATA step, and DS2, and in subsetting IF statements in the DATA step. These operators include the LIKE and SOUNDS LIKE operators, as well as the CONTAINS and SAME-AND operators. It is beyond the scope of this short paper to discuss all the smooth operators, but they are definitely worth a look.

LIKE Operator

Character operators are frequently used for “pattern matching,” that is, evaluating whether a variable value equals, does not equal, or sounds like a specified value or pattern. The LIKE operator is a case-sensitive character operator that employs two special “wildcard” characters to specify a pattern: the percent sign (%) indicates any number of characters in a pattern, while the underscore (_) indicates the presence of a single character per underscore in a pattern. The LIKE operator is akin to the GREP utility available on Unix/Linux systems in terms of its ability to search strings.

The LIKE operator also includes an escape routine in case you need to search for a string that contains one of the pattern characters themselves, such as the underscore or the percent sign. An example of the escape syntax, when looking for a string containing a literal percent sign (here using the exclamation point as the escape character), is:

where yourvar like '100!%' escape '!';

Additionally, SAS practitioners can use the NOT LIKE operator to select variables WITHOUT a given pattern. Please note that the LIKE operator is case-sensitive; you can use the UPCASE, LOWCASE, or PROPCASE functions to adjust input strings prior to the comparison. You may string multiple LIKE conditions together with the AND or OR operators.
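Here is a minimal sketch of both wildcards, using the SASHELP.CLASS data set that ships with SAS:

proc sql;
   select name
   from sashelp.class
   where name like 'J%'               /* any name that begins with J */
      or trim(name) like '_a___';     /* five-letter names with 'a' as the second letter */
quit;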

SOUNDS LIKE Operator

The LIKE operator, described above, searches the actual spelling of operands to make a comparison. The SOUNDS LIKE operator uses phonetic values to determine whether character strings match a given pattern. As with the LIKE operator, the SOUNDS LIKE operator is useful when there are misspellings and similar-sounding names in strings to be compared. The SOUNDS LIKE operator is denoted with the shortcut ‘=*’. SOUNDS LIKE is based on SAS’s SOUNDEX algorithm. Strings are encoded by retaining the original first letter, stripping all letters that are or act as vowels (A, E, H, I, O, U, W, Y), and then assigning numbers to groups: 1 includes B, F, P, and V; 2 includes C, G, J, K, Q, S, X, Z; 3 includes D and T; 4 includes L; 5 includes M and N; and 6 includes R. “Tristn” therefore becomes T6235, as does Tristan, Tristen, Tristian, and Tristin.

For more on the SOUNDS LIKE operator, please read the documentation.
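Here is a minimal sketch, again using SASHELP.CLASS (the misspelled name is deliberate):

data sounds;
   set sashelp.class;
   where name =* 'Carrol';   /* the phonetic match selects Carol */
run;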

Joins with the LIKE Operator

It is possible to select records with the LIKE operator in PROC SQL with a WHERE statement, including with joins. For example, the code below selects records from the SASHELP.ZIPCODE file that are in the state of Massachusetts and are for a city that begins with “SPR”.

proc sql;
   CREATE TABLE TEMP1 AS
   select a.city, a.countynm, a.city2,
          a.statename, a.statename2
   from sashelp.zipcode as a
   where upcase(a.city) like 'SPR%'
     and upcase(a.statename) = 'MASSACHUSETTS';
quit;

The test print of table TEMP1 shows only cases for Springfield, Massachusetts.

The code below joins SASHELP.ZIPCODE and a copy of the same file with a renamed key column (city --> geocity), again selecting records for the join that are in the state of Massachusetts and are for a city that begins with “SPR”.

proc sql;
   CREATE TABLE TEMP2 AS
   select a.city, b.geocity,
          a.countynm,
          a.statename, b.statecode,
          a.x, a.y
   from sashelp.zipcode as a, zipcode2 as b
   where a.city = b.geocity
     and upcase(a.city) like 'SPR%'
     and b.statecode = 'MA';
quit;

The test print of table TEMP2 shows only cases for Springfield, Massachusetts with additional variables from the joined file.

The LIKE “Condition”

The LIKE operator is sometimes referred to as a “condition,” generally in reference to character comparisons where the prefix of a string is specified in a search. LIKE “conditions” are restricted to the DATA step because the colon modifier is not supported in PROC SQL. The syntax for the LIKE “condition” is:

where firstname=: 'Tr';

This statement would select all the first names in Table 2 above. To accomplish the same goal in PROC SQL, use the LIKE operator with a trailing % in a WHERE clause, as sketched below.
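For example (TEMP3 is illustrative, and TABLE2 stands in for the paper's hypothetical data):

proc sql;
   CREATE TABLE TEMP3 AS
   select firstname
   from table2
   where firstname like 'Tr%';   /* equivalent to firstname=: 'Tr' in the DATA step */
quit;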

Conclusion

SAS provides practitioners with several useful techniques using LIKE statements including the smooth LIKE operator/condition in both the DATA step and PROC SQL. There’s definitely reason to like LIKE in SAS programming.

To learn more about SAS Press, check out our up-and-coming titles, and to receive exclusive discounts make sure to subscribe to our newsletter.


Learn to Love SAS LIKE was published on SAS Users.

2月 132020
 

If you’re a SAS user, you know that SAS Global Forum is where you want to be! It’s our premier, can’t-miss event for SAS professionals—that includes thousands of users, executives, partners and academics.

SAS Global Forum 2018 first-time attendee, Rachel Nemecek

It’s also our largest users’ event, organized by users, for users, and this year’s event marks our 44th annual! Of course, back when SAS got its start, our inaugural event was then called SAS.ONE. And while it has grown from a couple dozen sessions to hundreds, it has been a major keystone for SAS users around the world who look forward to connecting and building knowledge.

Considering attending this year? We sat down with Analytics Manager, Rachel Nemecek, who attended for the first time in 2018, to share her SAS Global Forum experience with us!

Rachel’s favorite part of SAS Global Forum? “The access to so many different SAS subject matter experts in one place.  I could just wander around the Attendee Hub [The Quad] and check out different technologies, meet product experts and get all my questions answered. It’s also a great opportunity to set up meetings with experts on particular topics relevant to my business. And, of course, swag.  I won some SAS socks that I still wear and all my analyst friends are jealous.”

Keep reading to hear more about all the great things to expect at SAS Global Forum, and Rachel’s first-time experience, tips and insights!

Psst: Looking for conference proceedings? Our very own Lex Jansen has documented SAS conference proceedings from 1976-present!

Learn: Enhance your analytics skills

At the forefront of advanced analytics, digital transformation and innovation, you’ll hear the latest SAS advancements and announcements from SAS executives during the event. You’ll also hear from a great lineup of partners, customers (like Kellogg Company!) and industry experts on topics like artificial intelligence, IoT, cloud and more. With separate tracks for Users, Executives, Partners and Academics, the event features customized content for each audience.

What was Rachel’s favorite session of SAS Global Forum 2018? “I particularly enjoyed hearing from Reshma Saujani, founder of Girls Who Code.  Kept thinking of my nieces and how great it was that opportunities like that to experiment with technical ideas could be available to them at a young age.”

Covering all learning preferences and formats, the event offers a variety of session types. In addition to a plethora of SAS topics, you’ll enjoy over 600 sessions, workshops, demos and presentations. And The Quad is not to be missed! With a 20% larger space than last year, you’ll learn more about SAS and our partners, and take part in hands-on activities, experiences and games like chess, smart darts and more!

Lastly, don’t forget that event attendees can also take advantage of discounted SAS training and certification right on-site! To secure your seat, register for pre- and post-conference training and exams during your event registration. During the event, check out the Learning Lab, where you can take free SAS e-learning classes and access our free SAS Viya for Learners tool.

Rachel’s advice? “Make sure to explore all the activities that are available outside just the conference sessions – hands-on activities, demos, and of course, the networking events.”

Tip:  Don’t miss Opening Session for insights and entertainment! Also, don’t forget to schedule in time to visit the custom t-shirt press in the Cherry Blossom Lounge for your SAS Global Forum 2020 t-shirt!

Once available, use the SAS Global Forum app to build, map, and plan out your agenda in advance—not only will it help you keep track of the sessions you want to attend, but you’ll have a handy guide for navigating around the conference venue.

You’re sure to takeaway a lot of great SAS nuggets!

Network: Build your analytics circle

Attendees can also enjoy organized receptions for networking. Taking advantage of this year’s location, the 2020 event social will take place across two (Smithsonian) museums in just one night!

And, as Rachel points out, networking isn’t limited to who you meet outside your organization: “There were probably a couple dozen folks from various parts of my company that attended, which ended up being an extra benefit – to network with internal analytics colleagues outside my immediate organization.”

Be sure to sign up for the free networking events during registration, take advantage of the free lunch on-site (and lunchtime chats), and utilize the SAS Global Forum app! This year, all content will be housed in the app as part of our “go green” initiative. These are great opportunities to mingle with fellow SAS users and build your SAS network. Also, keep an eye out for SAS user group gatherings and special events!

Tip: Connect online via the SAS Global Forum Community and follow along on Twitter, Facebook, Instagram and LinkedIn. #SASGF

Join us there!

Will Rachel be attending this year’s event? “Yes, I’ll be there and will be bringing a few folks from my team.”

Don’t miss out! Registration is open now!

We’ll see you in DC!

Tip: Not able to attend? You can still join in online. Several sessions will be live streamed via the SAS Users YouTube channel.

Have you attended #SASGF in the past? Share your favorite highlights from the event with us below!

First-timer to the event this year? We’re excited to have you join us and we’re looking forward to helping you create your SAS Global Forum story!

Experience SAS Global Forum: A First-Timer’s View & Tip Sheet was published on SAS Users.