CAS

November 21, 2017
 

SAS Viya provides a robust, scalable, cloud-ready, distributed runtime engine. This engine is driven by CAS (Cloud Analytic Services), which provides fast processing for many data management techniques that run distributed, i.e. using all threads on all defined compute nodes.

Why

PROC APPEND is a common technique used in SAS processes to concatenate two data sets. However, PROC APPEND will produce an ERROR if the target CAS table exists prior to the PROC APPEND.

Simulating PROC APPEND

Figure 1. SAS Log with the PROC APPEND ERROR message

Now what?

How

To explain how to simulate PROC APPEND we first need to create two CAS tables. The first CAS table is named CASUSER.APPEND_TARGET. Notice the variables table, row and variable in figure 2.

Figure 2. Creating the CAS table we need to append rows to

The second CAS table is called CASUSER.TABLE_TWO and in figure 3 we can review the variables table, row, and variable.

Figure 3. Creating the table with the rows that we need to append to the existing CAS table

To simulate PROC APPEND we will use a DATA Step. Notice on line 77 in figure 4 we will overwrite an existing CAS table, i.e. CASUSER.APPEND_TARGET. On line 78, we see the first table on the SET statement is CASUSER.APPEND_TARGET, followed by CASUSER.TABLE_TWO. When this DATA Step runs, all of the rows in CASUSER.APPEND_TARGET will be processed first, followed by the rows in CASUSER.TABLE_TWO. Also, note we are not limited to two tables on the SET statement with DATA Step; we can use as many as we need to solve our business problems.
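For reference, here is a minimal sketch of that DATA Step (it assumes the CASUSER caslib is assigned as a libref, for example via CASLIB _ALL_ ASSIGN, so the step runs in CAS; the table names are the ones used in the figures):

data casuser.append_target;        /* overwrite the existing target CAS table            */
   set casuser.append_target       /* rows from the original target are processed first  */
       casuser.table_two;          /* rows to append are processed second                */
run;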

Figure 4. SAS log validating the DATA Step ran in CAS i.e. distributed

The result table created by the DATA Step is shown in figure 5.

Figure 5. Result table from simulating PROC APPEND using a DATA Step

Conclusion

SAS Viya’s CAS processing allows us to stage data for downstream consumption by leveraging robust SAS programming techniques that run distributed, i.e. fast. PROC APPEND is a common procedure used in SAS processes. To simulate PROC APPEND when both the source and target tables are CAS tables, use a DATA Step.

How to simulate PROC APPEND in CAS was published on SAS Users.

November 20, 2017
 

Many SAS users have inquired about SAS Cloud Analytic Services’ (CAS) Distributed Network File System (DNFS) capability. (Learn more about CAS.)

The “NFS” in “DNFS”

Let’s start at the beginning. The “NFS” in DNFS stands for “Network File System” and refers to the ability to share files across a network. As the picture below illustrates, a network file system lets numerous remote hosts access another host’s files.

Understanding DNFS

NFS

There are numerous network file system protocols that can be used for file sharing – e.g. CIFS, Ceph, Lustre – but the most common on Linux is NFS. While NFS is the dominant file-sharing protocol, the “NFS” part of DNFS does not correspond to the NFS protocol. Currently, all of the file systems DNFS supports are based on NFS, but DNFS may support file systems based on other protocols in the future. So, it’s best to think of the “NFS” part of “DNFS” as a generic “network file system” (clustered file system) and not the specific NFS protocol.

The “D” in “DNFS”

The “D” in DNFS stands for “Distributed,” but it does not refer to the network file system. By definition, that is already distributed since the file system is external to the machines accessing it. The “Distributed” in DNFS refers to CAS’ ability to use a network file system in a massively parallel way. With a supported file system mounted identically on each CAS node, CAS can access (both read and write) the file system’s CSV and SASHDAT files from every worker node in parallel.
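As a rough sketch of what this looks like in code (the caslib name, session name, mount point, and file name below are hypothetical), a DNFS caslib simply points at the shared mount that every node sees at the same path:

cas mysess;                                        /* hypothetical CAS session              */
caslib dnfsdata datasource=(srctype="dnfs")        /* DNFS (network file system) source     */
   path="/mnt/shared/casdata" sessref=mysess;      /* identical mount point on every node   */

proc casutil incaslib="dnfsdata" outcaslib="dnfsdata";
   load casdata="sales.sashdat" casout="sales";    /* workers read the file in parallel     */
quit;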

This parallel file access is not an attribute of the file system, it is a capability of CAS. By definition, network file systems facilitate access at the file level, not the block level. With DNFS, CAS actively manages parallel block level I/O to the network file system, making sure file seek locations and block I/O operations do not overlap, etc.

DNFS

 

DNFS as CAS Backing Store

Not only can CAS perform multi-machine parallel I/O from network file systems, it can also memory-map NFS SASHDAT files directly into CAS. Thus, SASHDAT files on DNFS act as both the CASlib data source as well as the virtual memory “backing store,” often negating the need for CAS to utilize CAS_DISK_CACHE.

Note 1: Data transformations on load, such as row filtering and field selection, as well as encryption can trigger CAS_DISK_CACHE usage. Since the data must be transformed (subset and/or decrypted), CAS copies the transformed data into CAS_DISK_CACHE to support CAS processing.

Note 2: It is possible to use DNFS atop an encrypting file system or hardware storage device. Here, the HDAT blocks are stored encrypted but transmitted to the CAS I/O layer decrypted. Assuming no other transformations, CAS_DISK_CACHE will not be used in this scenario.

DNFS Memory Mapping

Performance Considerations

DNFS-based CAS loading will only be as fast as the slowest component involved. The chosen NFS architecture (hardware and CAS connectivity) should support I/O throughput commensurate with the CAS installation and in line with the implementation’s service level agreements. The illustration below shows this conceptually with a NetApp ONTAP clustered architecture; a different file system technology might look a little different, but the same basic ideas will apply.

DNFS w/ Multi Machine File System

As described earlier, CAS manages the parallel I/O operations. Requests from CAS are sent to the appliance and handled by the NFS metadata server. The storage device implementing the NFS protocol points CAS DNFS to the proper file and block locations on the NFS data servers which pass the data to the CAS worker nodes directly.

Understanding DNFS was published on SAS Users.

November 16, 2017
 

As a SAS Viya user, you may be wondering whether it is possible to execute data append and data update concurrently to a global Cloud Analytic Services (CAS) table from two or more CAS sessions. (Learn more about CAS.) How would this impact the report view while data append or data update is running on a global CAS table? These questions are even more important for those using the programming interface to load and update data in CAS. This post discusses data append, data update, and concurrency in CAS.

Two or more CAS sessions can simultaneously submit a data append and data update process to a CAS table, but only one process at a time can run against the same CAS table. The multiple append and update processes execute in serial, one after another, never running in a concurrent fashion. Whichever CAS session is first to acquire the write lock on a global CAS table prevails, appending or updating the data first. The other append and update processes must wait in a queue to acquire the write lock.

During the data append process, the appended data is not available to end users or reports until all rows are inserted and committed into the CAS table. While data append is running, users can still render reports against the CAS table using the original data, but excluding the appended rows.

Similarly, during the data update process, the updated data is not available to users or reports until the update process is complete. However, CAS lets you render reports using the original (non-updated) data, as the CAS table is always available for the read process. During the data update process, CAS makes additional in-memory copies of the blocks containing the rows to be updated in order to perform the update statement. Once the update process is complete, the additional, now-obsolete copies of those blocks are removed from CAS. A data update to a global CAS table is an expensive operation in terms of CPU and memory usage. You have to factor in the additional overhead memory or CAS_CACHE space to support the updates. The space requirement depends on the number of rows affected by the update process.

At any given time, there could be only one active write process (append/update) against a global CAS table. However, there could be many concurrent active read processes against a global CAS table. A global CAS table is always available for read processes, even when an append or update process is running on the same CAS table.

The following log example describes two simultaneous CAS sessions executing data appends to a CAS table. Both append processes were submitted to CAS with a gap of a few seconds. Notice the execution time for the second CAS session MYSESSION1 is double the time that it took the first CAS session to append the same size of data to the CAS table. This shows that both appends were executing one after another. The amount of memory used and the CAS_CACHE location also shows that both processes were running one after another in a serial fashion.
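For reference, two independently named CAS sessions like the ones in the logs can be created with CAS statements such as the following (a sketch; to make the submissions actually overlap, each append would typically be run from its own SAS client connection):

cas mysession;     /* creates the CAS session named MYSESSION  */
cas mysession1;    /* creates the CAS session named MYSESSION1 */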

Log from simultaneous CAS session MYSESSION submitting APPEND

58 proc casutil ;
NOTE: The UUID '1411b6f2-e678-f546-b284-42b6907260e9' is connected using session MYSESSION.
59 load data=mydata.big_prdsale
60 outcaslib="caspath" casout="big_PRDSALE" append ;
NOTE: MYDATA.BIG_PRDSALE was successfully added to the "caspath" caslib as "big_PRDSALE".
61 quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 49.58 seconds
cpu time 5.05 seconds

Log from simultaneous CAS session MYSESSION1 submitting APPEND

58 proc casutil ;
NOTE: The UUID 'a20a246e-e0cc-da4d-8691-c0ff2a222dfd' is connected using session MYSESSION1.
59 load data=mydata.big_prdsale1
60 outcaslib="caspath" casout="big_PRDSALE" append ;
NOTE: MYDATA.BIG_PRDSALE1 was successfully added to the "caspath" caslib as "big_PRDSALE".
61 quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 1:30.33
cpu time 4.91 seconds

 

When the data append process from MYSESSION1 was submitted alone (no simultaneous process), the execution time is around the same as for the first session MYSESSION. This also shows that when two simultaneous append processes were submitted against the CAS table, one was waiting for the other to finish. At one time, only one process was running the data APPEND action to the CAS table (no concurrent append).

Log from a lone CAS session MYSESSION1 submitting APPEND

58 proc casutil ;
NOTE: The UUID 'a20a246e-e0cc-da4d-8691-c0ff2a222dfd' is connected using session MYSESSION1.
59 load data=mydata.big_prdsale1
60 outcaslib="caspath" casout="big_PRDSALE" append ;
NOTE: MYDATA.BIG_PRDSALE1 was successfully added to the "caspath" caslib as "big_PRDSALE".
61 quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 47.63 seconds
cpu time 4.94 seconds

 

The following log example describes two simultaneous CAS sessions submitting data updates on a CAS table. Both update processes were submitted to CAS in a span of a few seconds. Notice the execution time for the second CAS session MYSESSION1 is double the time it took the first session to update the same number of rows. The amount of memory used and the CAS_CACHE location also shows that both processes were running one after another in a serial fashion. While the update process was running, memory and CAS_CACHE space increased, which suggests that the update process makes copies of to-be-updated data rows/blocks. Once the update process is complete, the space usage in memory/CAS_CACHE returned to normal.

When the data UPDATE action from MYSESSION1 was submitted alone (no simultaneous process), the execution time is around the same as for the first CAS session.

Log from a simultaneous CAS session MYSESSION submitting UPDATE

58 proc cas ;
59 table.update /
60 set={
61 {var="actual",value="22222"},
62 {var="country",value="'FRANCE'"}
63 },
64 table={
65 caslib="caspath",
66 name="big_prdsale",
67 where="index in(10,20,30,40,50,60,70,80,90,100 )"
68 }
69 ;
70 quit ;
NOTE: Active Session now MYSESSION.
{tableName=BIG_PRDSALE,rowsUpdated=86400}
NOTE: PROCEDURE CAS used (Total process time):
real time 4:37.68
cpu time 0.05 seconds

 

Log from a simultaneous CAS session MYSESSION1 submitting UPDATE

57 proc cas ;
58 table.update /
59 set={
60 {var="actual",value="22222"},
61 {var="country",value="'FRANCE'"}
62 },
63 table={
64 caslib="caspath",
65 name="big_prdsale",
66 where="index in(110,120,130,140,150,160,170,180,190,1100 )"
67 }
68 ;
69 quit ;
NOTE: Active Session now MYSESSION1.
{tableName=BIG_PRDSALE,rowsUpdated=86400}
NOTE: PROCEDURE CAS used (Total process time):
real time 8:56.38
cpu time 0.09 seconds

 

The following memory usage snapshot from one of the CAS nodes describes the usage of memory before and during the CAS table update. Notice the values for “used” and “buff/cache” columns before and during the CAS table update.

Memory usage on a CAS node before starting a CAS table UPDATE

Memory usage on a CAS node during CAS table UPDATE

Summary

When simultaneous data append and data update requests are submitted against a global CAS table from two or more CAS sessions, they execute in a serial fashion (no concurrent process execution). To execute data updates on a CAS table, you need additional overhead memory or CAS_CACHE space. While the CAS table is going through the data append or data update process, it remains accessible for rendering reports.

Concurrent data append and update to a global CAS table was published on SAS Users.

October 27, 2017
 

When loading data into CAS using PROC CASUTIL, you have two choices on how the table can be loaded:  session-scope or global-scope.  This is controlled by the PROMOTE option in the PROC CASUTIL statement.

Session-scope loaded

proc casutil;
   load casdata="model_table.sas7bdat" incaslib="ryloll"
        outcaslib="otcaslib" casout="model_table";
run;

 

Global-scope loaded

proc casutil;
   load casdata="model_table.sas7bdat" incaslib="ryloll"
        outcaslib="otcaslib" casout="model_table" promote;
run;

 

Remember session-scope tables can only be seen by a single CAS session and are dropped from CAS when that session is terminated, while global-scope tables can be seen publicly and will not be dropped when the CAS session is terminated.

But what happens if I want to create a new table for modeling by partitioning an existing table and adding a new partition column? Will the new table be session-scoped or global-scoped? To find out, I have a global-scoped table called MODEL_TABLE that I want to partition based on my response variable Event. I will use PROC PARTITION and call my new table MODEL_TABLE_PARTITIONED.

proc partition data=OTCASLIB.MODEL_TABLE partind samppct=30;
	by Event;
	output out=OTCASLIB.model_table_partitioned;
run;

 

After I created my new table, I executed the following code to determine its scope. Notice that the Promoted Table value is set to No on my new table MODEL_TABLE_PARTITIONED which means it’s session-scoped.

proc casutil;
     list tables incaslib="otcaslib";
run;

 

promote CAS tables from session-scope to global-scope

How can I promote my table to global-scope? Because PROC PARTITION doesn’t provide me with an option to promote my table to global-scope, I need to execute the following PROC CASUTIL code to promote it.

proc casutil;
     promote casdata="MODEL_TABLE_PARTITIONED"
     Incaslib="OTCASLIB" Outcaslib="OTCASLIB" CASOUT="MODEL_TABLE_PARTITIONED";
run;

 

I know what you’re thinking.  Why do I have to execute a PROC CASUTIL every time I need my output to be seen publicly in CAS?  That’s not efficient.  There has to be a better way!

Well there is, by using CAS Actions.  Remember, when working with CAS in SAS Viya, SAS PROCs are converted to CAS Actions and CAS Actions are at a more granular level, providing more options and parameters to work with.

How do I figure out what CAS Action syntax was used when I execute a SAS PROC?  Using the PROC PARTITION example from earlier, I can execute the following code after my PROC PARTITION completes to see the CAS Action syntax that was previously executed.

proc cas;
     history;
run;

 

This command will return a lot of output, but if I look for lines that start with the word “action,” I can find the CAS Actions that were executed.  In the output, I can see the following CAS action was executed for PROC PARTITION:

action sampling.stratified / table={name='MODEL_TABLE', caslib='OTCASLIB', groupBy={{name='Event'}}}, samppct=30, partind=true, output={casOut={name='MODEL_TABLE_PARTITIONED', caslib='OTCASLIB', replace=true}, copyVars='ALL'};

 

To partition my MODEL_TABLE using a CAS Action, I would execute the following code.

proc cas;
  sampling.stratified / 
    table={name='MODEL_TABLE', caslib='OTCASLIB', groupBy={name='Event'}}, 
    samppct=30, 
    partind=true, 
    output={casOut={name='embu_partitioned', caslib='OTCASLIB'}, copyVars='ALL'};
run;

 

If I look up the sampling.stratified syntax in the CAS Action documentation, I can see that the casOut parameter accepts a promote option. By adding promote=true to the casOut specification, the output table is created with global-scope directly:

proc cas;
  sampling.stratified / 
    table={name='MODEL_TABLE', caslib='OTCASLIB', groupBy={name='Event'}}, 
    samppct=30, 
    partind=true, 
    output={casOut={name='embu_partitioned', caslib='OTCASLIB', promote=true}, copyVars='ALL'};
run;

 

So, what did we learn from this exercise?  We learned that when we create a table in CAS from a SAS PROC, the default scope will be session and to change the scope to global we would need to promote it through a PROC CASUTIL statement.  We also learned how to see the CAS Actions that were executed by SAS PROCs and how we can write code in CAS Action form to give us more control.

I hope this exercise helps you when working with CAS.

Thanks.

Tip and tricks to promote CAS tables from session-scope to global-scope was published on SAS Users.

August 18, 2017
 

SAS Viya deployments use credentials for accessing databases and other third-party products that require authentication. In this blog post, I will look at how this sharing of credentials is implemented in SAS Environment Manager.

In SAS Viya, domains are used to store the:

  • Credentials required to access external data sources.
  • Identities that are allowed to use those credentials.

There are three types of domains:

  • Authentication: stores credentials that are used to access an external data source; the domain can then be associated with a caslib.
  • Connection: used when the external database has been set up to require a user ID but no password.
  • Encryption: stores an encryption key required to read data at rest in a path assigned to a caslib.

In this blog post we will focus on authentication domains which are typically used to provide access to data in a database management system. It is a pretty simple concept; an authentication domain makes a set of credentials available to a set of users. This allows SAS Viya to seamlessly access a resource. The diagram below shows a logical view of a domain. In this example, the domain PGAuth stores the credentials for a Postgres database, and makes those credentials available to two groups (and their members) and three users.

How does this work when a user accesses data in a database caslib? The following steps are performed:

1.     Log on to SAS Viya using personal credentials: the user’s identity is established including group memberships.

2.     Access a CASLIB for a database: using the user’s identity and the authentication domain of the CASLIB, Viya will look up the credentials associated with that identity in the domain.

3.     Two results are possible. A credential match is:

  • Found: the credentials are passed to the database authentication provider to determine access to the data.
  • Not found: no access to the data is provided.

To manage domains in SAS Environment Manager you must be an administrator. In SAS Environment Manager select Security > Domains. There are two views available:  Domains and Credentials. The Domains view lists all defined domains. You can access the credentials for a domain by right-clicking on the domain and selecting Credentials.

The Credentials view lists all credentials defined and the domains for which they are associated.

Whatever way you get to a credential, you can edit it by right-clicking and selecting Edit. In the edit dialog, you can specify the Identities (users and groups) that can use the credential, and the User ID and Password of the credential.  Note that only users who are already listed in the Identities field will be able to edit this field, so make sure you are in this field (directly or through group membership) prior to saving.

To use an authentication domain, you reference it in the CASLIB definition. When defining a non-path based CASLIB you must select a domain to provide user credentials to connect to the database server. This can be done when creating a new CASLIB in SAS Environment Manager in the Data > Libraries area.

If you use code to create or access your caslib, use the authenticationdomain option. In this example, we specify authenticationdomain in the table.addcaslib action.
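As a hedged sketch of what that could look like for a PostgreSQL caslib (the caslib name, server, database, and schema below are made up, and exact option placement may vary by release and database type), the authenticationdomain option is supplied as part of the data source definition:

proc cas;
   table.addCaslib /
      name="pgdata"                                /* hypothetical caslib name            */
      dataSource={srcType="postgres",
                  authenticationDomain="PGAuth",   /* domain that stores the credentials  */
                  server="pgsrv.example.com",      /* hypothetical database server        */
                  database="sales",
                  schema="public"};
run;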

If a user is not attached to the authentication domain directly, or through a group membership, they will not be able to access the credentials. An error will occur when they attempt to access the data.

This has been a brief look at storing and using credentials to access databases from SAS Viya. You can find more detail in the SAS Viya Administration Guide.

SAS Viya sharing credentials for database access was published on SAS Users.

August 17, 2017
 

In this blog post I am going to cover importing data into SAS Viya using Cloud Analytic Services (CAS) actions via the REST API. This lets you import data into a CASLib outside of the SAS Self-Service Import user interface. Once the data is loaded into CAS, it is available for use in applications such as SAS Visual Analytics and SAS Visual Data Builder.

Introduction

To import data into SAS Viya via REST API, you need to make a series of REST API calls:

1.     Start CAS Session
2.     Load Data into a CASLib
3.     End CAS Session

I will walk through these various REST API calls in the sections below using the REST API testing application HTTPRequestor, which is a free add-on to the Mozilla Firefox browser.

Before I perform any of my REST API calls, I need to Base-64 encode my credentials. The input for encoding is the string userID:password. I used the site https://www.base64encode.org/ to encode my credentials. Note: You can use other methods (e.g., Python) to encode your credentials. Use the method preferred by your organization to ensure you are meeting its security protocols.
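If you would rather generate the encoded string with SAS itself, the $BASE64X format can do it. Here is a small sketch with placeholder credentials:

data _null_;
   cred = "myuserid:mypassword";        /* placeholder userID:password         */
   enc  = put(cred, $base64x32.);       /* Base-64 encode the credential pair  */
   put "Authorization: Basic " enc;     /* header value to send with requests  */
run;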

Below is the header Authorization information I will be sending with each of my requests.

Authorization Header

1.     Start CAS Session

First, I need to start a CAS Session. Below is an example request for starting a CAS Session:

POST https://<YourCASServer:Port>/cas/sessions

Authorization: Basic <Base-64EncodedCredentials>
 Content-Type: application/json

{}

This request returns the CASSessionUUID needed in the next step.

I construct my request in HTTPRequestor as follows and submit the request:

Start CAS Session Request/Response

Here is a screenshot of the raw transaction information.

Start CAS Session Raw Transaction

I need to copy the CAS Session UUID information that was returned for use in the subsequent REST API calls since their CAS Actions must be performed within a CAS Session.

2.     Load Data into a CASLib

Now that I have started my CAS session and have its UUID, I can load the table to CAS. Below is an example request for the table.loadTable CAS Action:

POST https://<YourCASServer:Port>/cas/sessions/<CASSessionUUID>/actions/table.loadTable

Authorization: Basic <Base-64EncodedCredentials>
 Content-Type: application/json

{"casLib":"<InputCASLib>","importOptions":{"fileType":"<FileType>"},"path":"<InputFilePathAndName>",
 "casout":{"caslib":"<OutputCASLib>","name":"<OutputTableName>","promote":true}}

 

This request returns a log message: “NOTE: Cloud Analytic Services made the file <InputFilePathAndName> available as table <OutputTableName> in caslib <OutputCASLib>.”

For my example, I will load the SAS data set BASEBALL located in the helpdata CASLib to the Public CASLib and call the CAS Table SAS_BASEBALL.  I am copying the data to the Public CASLib to make it more readily available to all CAS users. Let’s first confirm that the SAS_BASEBALL table does not currently exist in the Public CASLib.

Public CASLib Before LoadTable CAS Action Called
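For this specific example, the request would look roughly like the following (the fileType value and the exact file name of the BASEBALL data set are assumptions on my part):

POST https://<YourCASServer:Port>/cas/sessions/<CASSessionUUID>/actions/table.loadTable

Authorization: Basic <Base-64EncodedCredentials>
 Content-Type: application/json

{"casLib":"helpdata","importOptions":{"fileType":"basesas"},"path":"baseball.sas7bdat",
 "casout":{"caslib":"Public","name":"SAS_BASEBALL","promote":true}}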

I construct my request in HTTPRequestor as follows and submit the request:

Load Table Request/Response

Here is a screenshot of the raw transaction information.

Load Table Raw Transaction

Next, I will confirm that the SAS_BASEBALL data set is now loaded in the Public CASLib.

Public CASLib After LoadTable CAS Action Called

The SAS_BASEBALL data set is now available for use in applications such as SAS Visual Analytics and SAS Visual Data Builder.

3.     End CAS Session

Finally, I need to terminate my CAS Session. Below is an example request for the session.endSession CAS Action:

POST https://<YourCASServer:Port>/cas/sessions/<CASSessionUUID>/actions/session.endSession

Authorization: Basic <Base-64EncodedCredentials>
 Content-Type: application/json

{}

 

This request returns a status of 0 indicating there was no error and the CASSessionUUID specified in the request has ended.

I construct my request in HTTPRequestor as follows and submit the request:

End CAS Session Request/Response

Here is a screenshot of the raw transaction information.

End CAS Session Raw Transaction

Conclusion

These calls can be strung together so you could schedule their execution. For more information on SAS Viya and REST APIs, refer to the SAS Cloud Analytic Services REST API documentation.

Load Data into SAS Viya via REST API was published on SAS Users.

June 29, 2017
 

One of the big benefits of the SAS Viya platform is how approachable it is for programmers of other languages. You don't have to learn SAS in order to become productive quickly. We've seen a lot of interest from people who code in Python, maybe because that language has become known for its application in machine learning. SAS has a new product called SAS Visual Data Mining and Machine Learning. And these days, you can't offer such a product without also offering something special to those Python enthusiasts.

Introducing Python SWAT

And so, SAS has published the Python SWAT project (where "SWAT" stands for SAS Scripting Wrapper for Analytics Transfer). The project is a Python code library that SAS released using an open source model. That means that you can download it for free, make changes locally, and even contribute those changes back to the community (as some developers have already done!). You'll find it at github.com/sassoftware/python-swat.

SAS developer Kevin Smith is the main contributor on Python SWAT, and he's a big fan of Python. He's also an expert in SAS and in many programming languages. If you're a SAS user, you probably run Kevin's code every day; he was an original developer on the SAS Output Delivery System (ODS). Now he's a member of the cloud analytics team in SAS R&D. (He's also the author of more than a few conference papers and SAS books.)

Kevin enjoys the dynamic, fluid style that a scripting language like Python affords - versus the more formal "code-compile-build-execute" model of a compiled language. Watch this video (about 14 minutes) in which Kevin talks about what he likes in Python, and shows off how Python SWAT can drive SAS' machine learning capabilities.

New -- but familiar -- syntax for Python coders

The analytics engine behind the SAS Viya platform is called CAS, or SAS Cloud Analytic Services. You'll want to learn that term, because "CAS" is used throughout the SAS documentation and APIs. And while CAS might be new to you, the Python approach to CAS should feel very familiar for users of Python libraries, especially users of pandas, the Python Data Analysis Library.

CAS and SAS' Python SWAT extend these concepts to provide intuitive, high-performance analytics from SAS Viya in your favorite Python environment, whether that's a Jupyter notebook or a simple console. Watch the video to see Kevin's demo and discussion about how to get started. You'll learn:

  • How to connect your Python session to the CAS server
  • How to upload data from your client to the CAS server
  • How SWAT extends the concept of the DataFrame API in pandas to leverage CAS capabilities
  • How to coax CAS to provide descriptive statistics about your data, and then go beyond what's built into the traditional DataFrame methods.

Learn more about SAS Viya and Python

There are plenty of helpful resources to help you learn about using Python with SAS Viya:

And finally, what if you don't have SAS Viya yet, but you're interested in using Python with SAS 9.4? Check out the SASPy project, which allows you to access your traditional SAS features from a Jupyter notebook or Python console. It's another popular open source project from SAS R&D.

The post Using Python to work with SAS Viya and CAS appeared first on The SAS Dummy.

May 6, 2017
 

As SAS Viya has been gaining awareness over the past year among SAS users, there has been a lot of discussion about how SAS’ Cloud Analytic Services (CAS) handles memory vs. SAS’ previous technologies such as LASR and HPA. Recently, while I was involved in delivering several SAS Viya enablement sessions, I realised that many, including myself, held an incorrect understanding of how this works, mainly around one particular CAS option called maxTableMem.

The maxTableMem option determines the memory block size that is used per table, per CAS Worker, before converting data to memory-mapped memory.  It is not intended to directly control how much data is put into memory vs how much is put into CAS_DISK_CACHE, but rather it indirectly influences this.

Let’s unpack that a bit and try to understand what it really means.

The CAS Controller doesn’t care what the value of maxTableMem is. In a serial load example, the CAS Controller distributes the data evenly across the CAS Workers[1], which then fill up maxTableMem-sized buckets (memory blocks), emptying them (converting them to memory-mapped memory) as they fill up, leaving only non-full buckets of table data. You should almost never change the default setting of this option (16MB), except perhaps in cases of extremely large tables, in order to reduce the number of file handles (up to 256MB is probably sufficient in these cases).
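If you do decide to raise it (again, rarely necessary), maxTableMem is a CAS session option. The following is a hedged sketch of one way to set it; whether your release exposes it through sessionProp.setSessOpt, and the exact units, should be verified against the documentation:

proc cas;
   sessionProp.setSessOpt / maxTableMem=268435456;   /* 256MB expressed in bytes (an assumption) */
run;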

CAS takes advantage of standard memory mapping techniques for the CAS_DISK_CACHE, and leaves the optimisation of it up to the OS. With SASHDAT files and LASR in SAS 9.4, the SASHDAT file essentially acts as a pre-paged file, written in a memory-mapped format, so the table data in memory doesn’t need to be written to disk when it is paged out. Should a table need to be dropped from memory to make room for other data, and subsequently need to be read back into memory, it would be paged in from the SASHDAT file.

With CAS, the CAS_DISK_CACHE allows us to extend this pre-paged file approach to all data sources, not just SASHDAT. Traditional OS swap files are written to each time memory is paged out; however, with CAS, regardless of the data source (SASHDAT, database, client-uploaded file, etc.) most table memory will never need to be written to disk, as it will already exist in the backing store (this could be CAS_DISK_CACHE, HDFS or NFS). Although data will be continually paged in and out of memory, the amount of writing to disk, which is typically slower than reading from disk, will be minimised.

Another advantage of the CAS_DISK_CACHE is that when data does need to be written to disk it can happen upfront when the server is less busy, rather than at the last moment when the system detects it is out of memory (pre-paging rather than demand-paging).  Once it is written, it can be paged back into memory multiple times, by multiple concurrent processes.  The CAS_DISK_CACHE also spreads the I/O across multiple devices and servers as opposed to a typical OS swap file that may only write to a single file on a single server.

While CAS supports exceeding memory capacity by using CAS_DISK_CACHE as a backing store, read/write disk operations do have a performance cost.  Therefore, for best performance, we recommend you have enough memory capacity to hold your  most commonly used tables, meaning  most of the time the entire table will be both in memory and the backing store.

If you expect to regularly exceed memory capacity, and therefore are frequently paging data in from CAS_DISK_CACHE, consider spreading the CAS_DISK_CACHE location across multiple devices and using newer solid state storage technologies in order to improve performance.[2]

Additionally, when you need CAS to peacefully co-exist with other applications that are sharing resources on the same nodes, standard Linux cgroup settings along with Hadoop YARN configuration can be utilised to control the resources that CAS sessions can exploit.

References

Paging

Notes

[1] There are exceptions to data being evenly distributed across the CAS Workers.  The main one is if the data is partitioned and the partitions are of different sizes – all the data of a partition must be on the same node therefore resulting in an uneven distribution.  Also, if a table is very small, it may end up on only a single node, and when CAS is co-located with Hadoop the data is loaded locally from each node, so CAS receives whatever the distribution of data is that Hadoop provides.

[2] A comprehensive analysis of all possible storage combinations and the impact on performance has not yet been completed by SAS.

Dr. StrangeRAM or: How I learned to stop worrying and love CAS was published on SAS Users.