Grid

January 19, 2016
 

In Part 1 of this series, Cheryl Doninger described how SAS Grid Manager can extend your investment in the Hadoop infrastructure. In this post, we’ll take a look at how Cloudera Manager helps Hadoop administrators meet competing service level agreements (SLAs). Cloudera Manager lets Hadoop admins set up queues to […]

The post SAS Grid Manager for Hadoop nicely tied into YARN (Part 2) appeared first on The Data Roundtable.

January 13, 2016
 

If you'd like to extend your investment in the Hadoop infrastructure, SAS Grid Manager for Hadoop can help by enabling you to colocate SAS Grid jobs on your Hadoop data nodes. It works because SAS Grid Manager for Hadoop – which is Cloudera certified – is integrated with the native […]

The post SAS Grid Manager for Hadoop nicely tied into YARN (Part 1) appeared first on The Data Roundtable.

October 26, 2015
 

SAS Grid Manager for Hadoop is a brand new product released with SAS 9.4M3 this summer. It gives you the ability to co-locate your SAS Grid jobs on your Hadoop data nodes to let you further leverage your investment in your Hadoop infrastructure. This is possible because SAS Grid Manager for Hadoop is integrated with the native components, specifically YARN and Oozie, of your Hadoop ecosystem. Let's review the architecture of this new offering.

First of all, the official name, SAS Grid Manager for Hadoop, shows that this is a brand new product, not just an add-on or a different configuration of the “classic” SAS Grid Manager, which I will refer to from here on as SAS Grid Manager “for Platform” to distinguish the two.

For an end user, grid usage and functionality remains the same, but an architect will notice that many components of the offering have changed. Describing these components will be the focus of the remainder of this post.

Let me start by showing a picture of a sample software architecture, so that it will be easier to recognize all the pieces with a visual schema in front of us. The following is one possible deployment architecture; there are other deployment choices.

Figure: Sample SAS Grid Manager for Hadoop 9.4M3 deployment architecture

Third-party components

Just as SAS Grid Manager for Platform builds on top of third-party software from Platform Computing (part of IBM), SAS Grid Manager for Hadoop requires Hadoop to function. There is a big difference, though.

SAS Grid Manager for Platform includes all of the required Platform Computing components, which are delivered, installed, and supported by SAS.

SAS Grid Manager for Hadoop, on the other hand, treats all of the Hadoop components (highlighted in yellow in the above diagram) as prerequisites. As such, customers are required to procure, install, and support Hadoop before SAS is installed.

Hadoop, as you know, includes many different components. The diagram lists the ones that are needed for SAS Grid Manager for Hadoop:

  • HDFS provides cluster-wide file system storage.
  • YARN is used for resource management.
  • Oozie is the scheduling service.
  • Hue is required only if the Oozie web GUI is surfaced through Hue.
  • Hive is required at install time so that the SAS Deployment Wizard can access the required Hadoop configuration and JAR files.
  • Hadoop JAR and configuration files need to be on every machine, including clients (see the example below).
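
As a minimal illustration of that last point, SAS components that talk to Hadoop typically locate the client JAR files and configuration files through two environment variables, which can be set in the SAS configuration file or directly in a program with OPTIONS SET. The paths below are hypothetical and will differ at your site:

   /* Hypothetical paths: point these at the directories holding the Hadoop   */
   /* client JAR files and the cluster configuration XML files, respectively. */
   options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
   options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";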

YARN Resource Manager, HDFS NameNode, Hive, and Oozie are not necessarily on the same machine. By default, however, the SAS grid control server needs to be on the same machine as the YARN Resource Manager.

SAS Components

SAS programming interfaces to the grid have not changed, apart from the lower-level libraries that connect to the third-party software. As such, SAS will deploy the traditional SAS grid control server, SAS grid nodes, the SAS thin client (aka SASGSUB), or the full SAS client (SAS Display Manager).

In a typical SAS Grid deployment, a shared directory is used to share the installation and configuration directories between machines in the grid. With SAS Grid Manager for Hadoop, you can either use NFS to mount a shared directory on all cluster hosts or use the SAS Deployment Manager (SDM) to work with the cluster manager to distribute the deployment to the cluster hosts. The SDM has the ability to create Cloudera parcels and Ambari packages to enable the distribution of the installation and configuration directories from the grid control server to the grid nodes.

One notable missing component is the SAS Grid Manager plug-in for SAS Management Console. This management interface is tightly coupled with Platform Computing GMS, and cannot be used with Hadoop.

The Middle Tier

You will notice in the above diagram that the middle tier is faded; in fact, no middle-tier components are included in SAS Grid Manager for Hadoop. However, a middle tier is generally included and deployed as part of other solutions licensed on top of SAS Grid Manager, so you will still be able to program using SAS Studio and monitor the SAS infrastructure using SAS Environment Manager.

Please note that I say “monitor the SAS infrastructure,” not “monitor the SAS grid.” There are no plug-ins or modules within SAS Environment Manager that are specific to SAS Grid Manager for Hadoop. This is by design: because SAS is part of your overall Hadoop environment, the SAS Grid workload can be monitored using your favorite Hadoop management tools.

Hadoop provides plenty of web interfaces to monitor, manage, and configure its environment. As such, you will be able to use the YARN web UI to monitor and manage submitted SAS jobs, as well as the Hue web UI to review scheduled workflows.

The Storage

Discussing grid storage is never a quick task and could require a full blog post of its own, but it is worth noting some architecture peculiarities of SAS Grid Manager for Hadoop. HDFS can be used to store shared data, and it is used to store scheduled jobs, workflows, and logs. But we still require a traditional, POSIX-compliant file system for items such as SAS WORK, SASGSUB output, solution-specific projects, and so on.
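
To make the first point concrete, here is one possible way to keep shared SAS data in HDFS while temporary data stays in SAS WORK on local, POSIX-compliant storage. This is only a sketch: it assumes the SPD Engine's HDFS support in SAS 9.4 is configured, and the paths and librefs are hypothetical.

   /* Hypothetical: an SPD Engine library stored in HDFS */
   libname shared spde '/user/sasdemo/shared' hdfshost=default;

   /* temporary data lands in SAS WORK, on fast local POSIX storage */
   data work.staging;
      do id = 1 to 100;
         amount = ranuni(1) * 100;
         output;
      end;
   run;

   /* shared results are written to the HDFS-backed library */
   data shared.results;
      set work.staging;
   run;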

Conclusion

SAS Grid Manager for Hadoop enables customers to co-locate their SAS Grid and all of the associated SAS workload on their existing Hadoop cluster. We have briefly discussed the key components that are included in – or are missing from – this new offering. I hope you found this post helpful. As always, any comments are welcome.

tags: grid, Hadoop, SAS architecture, SAS Grid Manager, SAS Grid Manager for Hadoop, SAS Professional Services

SAS Grid Manager for Hadoop architecture was published on SAS Users.

April 27, 2015
 

SAS recently performed testing using the Intel Cloud Edition for Lustre* Software - Global Support (HVM) available on AWS marketplace to determine how well a standard workload mix using SAS Grid Manager performs on AWS.  Our testing demonstrates that with the right design choices you can run demanding compute and I/O applications on AWS. You can find the detailed results in the technical paper, SAS® Grid Manager 9.4 Testing on AWS using Intel® Lustre.

In addition to the paper, Amazon will be publishing a post on the AWS Big Data Blog that will take a look at the approach to scaling the underlying AWS infrastructure to run SAS Grid Manager to meet the demands of SAS applications with demanding I/O requirements.  We will add the exact URL to the blog as a comment once it is published.

System design overview – network, instance sizes, topology, performance

For our testing, we set up the following AWS infrastructure to support the compute and IO needs for these two components of the system:

  • the SAS workload that was submitted using SAS Grid Manager
  • the underlying Lustre file system required to meet the clustered file system requirement of SAS Grid Manager.

Figure: SAS Grid Manager and Lustre shared file system configuration on the AWS cloud

The SAS Grid nodes in the cluster are i2.8xlarge instances. The 8xlarge instance size provides proportionally the best network performance to shared storage of any instance size, assuming minimal EBS traffic. The i2 instance family also provides high-performance local storage, which is covered in more detail in the following section.

The use of an 8xlarge size for the Lustre cluster is less impactful, since there is significant traffic to both EBS and the file system clients, although an 8xlarge is still preferable. The Lustre file system has a caching strategy, and you will see higher throughput to clients when cache hits are frequent, which effectively reduces the network traffic to EBS.

Steps to maximize storage I/O performance

The storage for SAS applications needs to be high speed, and temporary storage typically has the most demanding load. The high I/O instance family, i2, and the recently released dense-storage instance family, d2, provide high aggregate throughput to ephemeral (local) storage. For the SAS workload tested, the i2.8xlarge has 6.4 TB of local SSD storage, while the d2.8xlarge has 48 TB of HDD.

Throughput testing and results

We wanted to achieve a throughput of at least 100 MB/sec per core to temporary storage, and 50-75 MB/sec per core to shared storage. The i2.8xlarge has 16 cores (32 virtual CPUs; each virtual CPU is a hyperthread, and each core has two hyperthreads). Testing done with lower-level tools (fio and a SAS-provided script, iotest.sh) showed a throughput of about 3 GB/sec to ephemeral (temporary) storage and about 1.5 GB/sec to shared storage. The shared storage figure does not take into account file system caching, which Lustre does well.
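
To put those targets in context, here is the simple arithmetic behind them, using the figures above:

   16 cores x 100 MB/sec per core   = 1.6 GB/sec target to temporary storage (measured: ~3 GB/sec)
   16 cores x 50-75 MB/sec per core = 0.8-1.2 GB/sec target to shared storage (measured: ~1.5 GB/sec)

Both measured numbers comfortably exceed the per-core targets.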

This testing demonstrates that with the right design choices you can run demanding compute and I/O applications on AWS. For full details of the testing configuration and results, please see the SAS® Grid Manager 9.4 Testing on AWS using Intel® Lustre technical white paper.

 

tags: cloud computing, configuration, grid, SAS Administrators

The post Can I run SAS Grid Manager in the AWS cloud? appeared first on SAS Users.

January 14, 2015
 

When designing a SAS Grid Manager architecture, there is a requirement that has always been a critical component: a clustered file system. Over the years, vendors have released versions of these systems that are more robust and SAS has increased the minimum IO requirements, but the basic design has never changed—until now.

Any guess who the driver of this change could be? Did I hear a yellow elephant somewhere? Yes, Hadoop, but not only Hadoop! File systems are now available that support SAS Grid computing on other shared-nothing storage architectures.

Let’s take a step back to understand how new file system options can facilitate your SAS Grid deployment. In this post, I’ll start with a quick review of storage architectures for SAS Grid Manager and what other vendors are doing. In a subsequent blog post, I’ll dive more specifically into the interaction of Hadoop and SAS Grid Manager.

Why a clustered file system

Why do we need a clustered file system? Why is a traditional file share not enough? And what does clustered mean?

In short, the clustered file system is one of the most critical components of a SAS Grid deployment because it is where all the SAS data, configuration and sometimes the binaries are centrally located and shared across the machines. It feeds all the servers with the data that has to be processed, so the system you choose should meet these requirements:

  • Be reliable
  • Be fast
  • Handle concurrency
  • Look the same from all machines

A good description of the requirements and possible candidates can be found in this paper: A Survey of Shared File Systems: Determining the Best Choice for your Distributed Applications.

One of the conclusions from that paper is that clustered file systems provide the best performance. However, today, many customers are looking for options to move away from expensive SAN-based storage or at least to reduce the need and continued growth of their SAN infrastructure.

Benefits of new logical shared storage systems

The latest release of the paper includes one new IBM offering: File Placement Optimizer (GPFS FPO). As the paper says, “FPO is not a product distinct from Elastic Storage/GPFS, rather as of version 3.5 is an included feature that supports the ability to aggregate storage across multiple servers into one or more logical file systems across a distributed shared-nothing architecture.”

Aggregate storage—distributed—shared nothing. Sound familiar? Maybe HDFS?

GPFS/FPO can benefit SAS Grid deployments even without Hadoop. Here are some of the benefits you’ll see with GPFS/FPO and shared-nothing storage architectures:

  • Cost savings, thanks to a favorable licensing model.
  • The ability to deploy SAS Grid Manager in a shared-nothing architecture, reducing the need for an expensive enterprise-class SAN infrastructure.
  • Block-level replication at the file system level, allowing continuity of operations in the event of a failed disk drive or a failed node.
  • Support for full POSIX file system semantics.

If we want to visualize the storage architecture at a very high level, we could say that we are moving from environments built as in Figure 1 toward environments built as in Figure 2.

Figure 1. Shared storage

Figure 2. Logical shared storage

SAS Grid Manager and logical shared file storage

IBM supports this new kind of storage, but does it work as we require for SAS Grid Manager? SAS R&D tested it and the results are in the following white paper: SAS Grid Manager – A “Building Block” approach.

It is important to consider all the caveats that the paper describes. The most important points, in my opinion, are the following two, which I tried to depict in the above image:

  • Ensure each node has a sufficient number of disks to meet SAS's recommended I/O throughput requirements, based on the number of cores plus the overhead needed to write replica blocks from other nodes in the environment (see the example after this list).
  • Ensure that sufficient network connectivity exists so that replica blocks can be written to other nodes without throttling I/O throughput.
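
As a rough, illustrative calculation only (the per-core figure below is the same ballpark target used in the AWS testing described elsewhere on this page; your own sizing guidance may differ): a node with 16 cores at roughly 100 MB/sec per core needs about 1.6 GB/sec of local disk throughput for its own workload, plus additional disk and network headroom to absorb the replica blocks written by the other nodes.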

In conclusion, with the correct configuration, this new kind of storage can provide cost savings and simplify your SAS Grid Manager architecture, but still requires proprietary software.

Is it possible to achieve the same benefits with Hadoop? If yes, are there any differences or further considerations? I will try to provide the answers in my next blog, so stay tuned!

Edoardo

tags: clustered file systems, grid, SAS Administrators, SAS Professional Services, shared-nothing storage
November 5, 2014
 

For those of you who have followed my SAS administration blogs, you will know that setting up your I/O subsystem (the entire infrastructure from the network/fibre channels in your physical server, across your connections, to the fibre adapters into the storage array, and finally to the physical disk drives in the storage array) is very near and dear to my heart. As my team learns better ways to configure the I/O infrastructure, we like to document them in new white papers or by updating existing ones.

Over the past few months, my team has been working with several SSD and flash storage vendors to stress their storage by running a mixed analytics workload, and with their engineers to tune the storage to run as optimally as possible. Many of these papers are already available for your review in SAS Usage Note 53874, entitled Troubleshooting system performance problems: I/O subsystem and storage papers. A new paper, Performance and Tuning Considerations for SAS on the EMC XtremIO All-Flash Array, was added this week, and similar papers for the EMC Isilon arrays and Intel P3700 (Fultondale) SSD cards will be added in the next week.

In addition to the above new papers, there are several papers that have been updated in the past few months.

A Survey of Shared File Systems (updated October 2014) was recently updated to add some tuning tips for the Veritas Clustered File System that we learned while working with a SAS and Veritas CFS customer.

We also clarified our position regarding the use of NFS and SAS. NFS can be a reasonable choice for small SAS Grid Manager implementations and/or when performance is less of a concern for the customer. In general, it is not a good idea to place SASWORK on an NFS file system where performance is a concern.

Best Practices for Configuring your IO Subsystem for SAS® 9 Applications (revised May 2014) has been updated to reflect the recent influence of SSD and flash storage. It will be revised again for SAS Global Forum 2015, where it will be presented.

How to Maintain Happy SAS®9 Users (revised June 2014) has been updated to reflect the popularity of SAS Grid. It will also be revised for SAS Global Forum 2015, where it will be presented.

Please continue to send us your questions and concerns around configuring your IO subsystems properly for SAS applications.

tags: clustered file systems, flash storage, grid, performance, SAS Administrators
September 17, 2014
 

Scalability is the key objective of high-performance software solutions. “Scaling out” is accomplished by throwing more server machines at a solution so that multiple processes can run concurrently in dedicated environments. This blog post will briefly touch on several scalability concepts that affect SAS.

Functional roles

At SAS, we have a number of different approaches to tackle the ability to scale our software across multiple machines. As we often see with our SAS Enterprise Business Intelligence solution components, we’ll split up the various functional roles of SAS software to run on specific hosts. In one of the most common examples, we’ll set aside one machine for the metadata services, another for the analytic computing workload, and a third for web services.

While this is more complicated than deploying everything to a single machine, it allows for a lot of flexibility in providing responsive resources which are optimized for each role. Now, we’re not limited to just three machines, of course.

Read more:
SAS® 9.4 Intelligence Platform: Overview

Clusters

For each of these functional roles – Meta, Compute, and Web – we can scale them out independently of the others. Depending on the technology involved, different techniques must be employed. The Meta and Web functional roles, in particular, are well-equipped to function as clusters.

Generally speaking, a software cluster is composed of services that present as peers to the outside world. They offer scalability and improved availability: any node of the cluster can perform the requested work, and the cluster continues to offer service in the face of failure of one or more nodes (depending on configuration), among other features.

Read more:

Grids

The Compute functional role has some built-in ability to act as a cluster if the necessary SAS software is licensed and properly configured – which is pretty great already – but this ability can be extended even further to act as a grid. A grid is a distributed collection of machines that process many concurrent jobs by coordinating the efficient utilization of resources, which may vary from host to host.

With proper implementation and administration, grids are very tolerant of diverse workloads and a mix of resources. For example, it’s possible to inform your grid that certain machines have certain resources available and others do not. Then, when you submit a job to the grid, you can declare parameters on the job that dictate the use of those resources. The grid will then ensure that only machines with those resources are utilized for the job. This simple illustration can be implemented in different ways depending on the kind of resources and with a high degree of flexibility and control.

Another common component of clusters and grids is the use of a clustered file system. A clustered file system is visible to and accessed by each machine in the grid (or cluster) – typically at the exact same physical path. This is primarily used to ensure that all nodes are able to work with the same set of physical files. Those files might range from shared work product to software configuration and backups, even to shared executable binaries. The exact use of the clustered file system can of course vary from site to site.

Read more:

Massively Parallel Processing

Extending grid computing even further is the concept of massively parallel processing (or MPP). As we see with Hadoop technology and the SAS In-Memory solutions, a number of benefits can be realized through the use of carefully planned MPP clusters.

One common assumption behind MPP (especially in the implementation of the SAS In-Memory solutions) has historically been that all participating machines are as identical as possible. They have the same physical attributes (RAM, CPU, disk, network) as well as the same software components.

The premise of working in an MPP environment is that any given job (that is, something like a statistical computation or data to store for later) is simply broken into equal-size chunks that are evenly distributed to all nodes. Each node works on the problem individually, sharing none of its own CPU, RAM, etc. with the others. Since the ideal is for all nodes to be identical and for each to get the same amount of work without competing for any resources, complex workload management capabilities (such as those described for grid above) are not as crucial. This assumption keeps the required administrative overhead for workload management to a minimum.

Read more:

Hadoop and YARN

Looking forward, one of the challenges of assuming dedicated, identical nodes and equal-size chunks of work in MPP has been that it’s actually quite difficult to keep everything equal on all nodes all of the time. For one thing, this often assumes that all of the hardware is reserved exclusively for MPP use all of the time, which might not be desirable for systems that sit idle overnight, on weekends, etc. Further, while breaking workload up into equal-size bits is possible, it’s sometimes tough to keep the workload perfectly equal and evenly distributed when there is competition for finite resources.

For these and many other reasons, Hadoop 2.0 introduces an improvement to the workload management of a Hadoop cluster called YARN (Yet Another Resource Negotiator).

The promise of YARN is to better manage resources in a way that is accessible to Hadoop as well as to various other consumers (like SAS). This will help mature the MPP platform, evolving it from the old MapReduce framework into a more flexible platform that can handle a wider variety of workload and resource management challenges.

And of course, SAS solutions are already integrating with YARN to take advantage of the capabilities it offers.

Read more:

 

tags: grid, Hadoop, parallel processing, performance, SAS Administrators, SAS Professional Services, scalability, YARN
July 17, 2014
 

Most organizations enjoy a plethora of SAS user types: batch programmers and interactive users, power users and casual users, and all variations in between. Each type of SAS user has its own needs and expectations, and it’s important that your SAS Grid Manager environment meets them all.

One common solution to this dilemma is to set up separate configurations based on a mix of requirements for departments, client applications and user roles. The grid options set feature in SAS 9.4 makes this task much easier. A grid options set is a convenient way to name a collection of SAS system options, grid options and required grid resources that are stored in metadata.

Why it’s important to tune SAS Grid Manager for interactive users

SAS Enterprise Guide users running interactive programs typically expect results to be returned almost immediately. The out-of-the-box grid options, however, are set for long-running batch jobs. These options include a latency of 20 seconds on the start of every server session, so SAS Enterprise Guide users may experience unhappy delays.

The good news is that SAS Enterprise Guide and other SAS software products are grid-aware. Once the optimum grid options set is defined and named, it is applied automatically whenever a user accesses the application and submits a job.

In this post, I’ll use Platform RTM for SAS to walk you through a few simple steps and provide a set of options that you can use as a baseline for tuning your SAS grid for your SAS Enterprise Guide users.

1) Reduce grid services sleep times.

The first tuning to perform is usually at the cluster level, to reduce grid services sleep times so that the interactive session starts faster. In Platform RTM, select Config►LSF►Batch Parameters and edit these settings:

  • MBD_SLEEP_TIME
  • SBD_SLEEP_TIME
  • MBD_REFRESH_TIME
  • JOB_SCHEDULING_INTERVAL

Never set these values to 0. You should tailor the actual values to your grid, considering factors such as number of nodes, number of concurrent users, patterns of utilization and so forth. You may need multiple iterations to tune performance to suit the needs of your SAS user type. Figure 1 shows a recommended starting point.

Figure 1. Reduce grid services sleep time
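
These settings correspond to parameters in the LSF configuration file lsb.params, which RTM manages for you. A hypothetical excerpt is shown below; the values are illustrative only and, as noted above, should be tailored to your own grid rather than copied:

   Begin Parameters
   MBD_SLEEP_TIME = 10            # seconds
   SBD_SLEEP_TIME = 10            # seconds
   MBD_REFRESH_TIME = 10          # seconds
   JOB_SCHEDULING_INTERVAL = 1    # seconds
   End Parameters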

2) Increase the number of job slots.

SAS Enterprise Guide and SAS Add-In for Microsoft Office are designed to keep the server session open for the full duration of the client session unless a user explicitly chooses to disconnect from the server. For SAS Grid Manager, this open session means that one job slot on that server is taken.

Therefore, for SAS Enterprise Guide use, you have to increase the number of job slots for each machine (use the MXJ parameter) from a default of 1 per core up to 5 or even 10 per core, depending on volume of usage. This step will increase the number of simultaneous SAS sessions on each grid node.

Interactive workloads are usually sporadic and intermittent, with short CPU bursts followed by periods of inactivity while the user reviews results or explores the data. Because these jobs are not I/O- or compute-intensive like large batch jobs, more of them can safely run on each machine.

3) Implement CPU utilization thresholds for each machine.

Next, it is advisable to implement CPU utilization thresholds for each machine to prevent servers from being overloaded. With this limit in place, even if many users submit CPU-intensive work at the same time, SAS Grid Manager can manage the workload by suspending some jobs and resuming them when resources are available.

Changes in Step 2 and Step 3 are made at the host level. In RTM, select Config►LSF►Batch Hosts►default, edit the Max Job Slots value, and add the Advanced Attribute ut. See Figure 2.

Figure 2. Increase the number of job slots and set CPU utilization thresholds.
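
These host-level settings live in the Host section of the LSF lsb.hosts file, which RTM also manages for you. A hypothetical excerpt might look like the following, where MXJ is the job slot limit and ut is the CPU utilization threshold (the host name and values are illustrative only):

   Begin Host
   HOST_NAME    MXJ    ut      # ut is on a 0-1 scale
   gridnode1    80     0.8     # e.g. 16 cores x 5 slots per core
   End Host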

 

4) Create dedicated queues.

Even with this tuning, one user can easily use up all of the slots of a grid by starting many SAS Enterprise Guide sessions or by writing code that uses all the available slots for a single SAS session. When a machine runs out of slots, it is closed for use and work is routed to the next available slot. If all machines are closed and no machine has a free slot, no user can get another workspace. It doesn’t matter that the user with many open sessions is not actually using the resources. He or she might go to lunch, leaving a session open on a results page with no CPU, no I/O, nothing used on the server.

The best way to prevent this is to create a dedicated queue, called EGDefault here, with its UJOB_LIMIT parameter set low enough (for example, 3 slots, as shown in Figure 3). After that, each user will be limited to 3 concurrent server sessions, whether started from the same client or from different SAS Enterprise Guide instances. When using the SAS Enterprise Guide parallel features, the value of UJOB_LIMIT should be higher, provided that proper server sizing has been performed to accommodate the additional resources required.

In RTM, you can create this queue by selecting Config►LSF►Queues►Add. To make this the default queue for SAS Enterprise Guide users, all you have to do is create a grid options set in SAS Management Console and add this EGDefault queue as a grid option to it.

Figure 3. Set job limits in an EGDefault queue.

5) Create other grid options sets as needed.

There will always be ad hoc users or projects that do not fit into the default categories (for example, they might be running jobs that have a high priority or jobs that require a large amount of computing resources). For users who require higher priority or more computing resources for their jobs, it is just a case of defining a new queue such as EGPower. To prevent misuse, it's common to limit access to this special queue to selected users.

In previous releases, additional queues would have been created by defining a special user group and then adding it to the USERS parameter in the queue definition. While effective, this has the disadvantage of duplicating user-related management both in metadata and in grid configuration files. With SAS 9.4, it is possible to apply metadata security to grid options sets to keep everything in one place: in metadata.

6)  Set options for other interactive and batch queues.

Finally, if you have other queues, for example, ones dedicated to SAS® Data Integration Studio users or to batch processing, put job slot limits there, too, to compensate for the large increase to the Max Job Slots parameter we made for the default hosts. Figure 4 shows the Advanced Attribute PJOB_LIMIT added to a batch queue to enforce a limit of one batch job per physical core on every host.

Figure 4. Set job slots parameter for batch queue.
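
Under the covers, queue definitions like these live in the LSF lsb.queues file. A hypothetical excerpt for the two kinds of queues discussed above might look like this (the queue names, limits, and descriptions are illustrative only):

   Begin Queue
   QUEUE_NAME  = EGDefault
   UJOB_LIMIT  = 3                # max concurrent job slots per user in this queue
   DESCRIPTION = Interactive SAS Enterprise Guide sessions
   End Queue

   Begin Queue
   QUEUE_NAME  = BatchDefault
   PJOB_LIMIT  = 1                # max job slots per processor on each host for this queue
   DESCRIPTION = Long-running batch SAS jobs
   End Queue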

When you have all queues defined, your final configuration may look like the following:

Figure 5. All queues at a glance in RTM.

For more details about using SAS Enterprise Guide in a SAS Grid Manager environment, you can refer to my SAS Global Forum 2014 presentation: Effective Use of SAS® Enterprise Guide® in a SAS® 9.4 Grid Manager Environment.

Refer to Working with Grid Options Sets in the Grid Computing in SAS® 9.4 documentation for more information on creating grid options sets.

 

tags: grid, performance, SAS Administrators
July 2, 2014
 

This blog is a continuation of an earlier post entitled “To grid or not to grid?” In that post, one of the reasons to say “yes to SAS Grid” was to see whether you can gain performance improvements by converting your existing SAS processes to a distributed processing format. If improving the performance of individual SAS applications is one of your reasons to implement SAS Grid Manager, please read on.

To start with, all of your existing SAS jobs can run on your new SAS grid, but not all of them can be turned into distributed processing applications that run simultaneously across multiple nodes of the grid. For example, processes that rely heavily on OLAP processing do not lend themselves to parallelization. Statistical analysis tasks that need to create a matrix in memory cannot be parallelized either.

When identifying SAS applications to modify for distribution across a SAS Grid, the first thing to look for is jobs that take many hours and even days to complete. In addition to long execution times, there are several profiles of SAS applications or jobs that are good candidates for running in parallel across a SAS Grid.

Long-running jobs

One profile is a SAS job (or a combination of multiple SAS jobs) that takes an extraordinary amount of time to execute because it is processing a large amount of data (hundreds of gigabytes, approaching terabytes). This type of application requires running the same SAS task or tasks, over and over, on all of the data, on different subsets of the data, or both.

There are many examples of SAS tasks that fit this profile. Here are a few:

  • Large steps within a SAS job that process hundreds of millions of rows, for example, large DATA steps, statistical simulations such as Monte Carlo methods, or mining massive data files.
  • Any SAS task that does BY-group processing, in particular forecasting steps, frequency counts, and many statistical tasks.

In this profile, you would break up the large SAS task into multiple subtasks, start a distributed SAS process for each subtask (see the diagram below), and run them simultaneously. As you can see, there is a join DATA step at the end, before the final results. We need to mention that you will need to experiment with how many subtasks to run so that the time required for the join step does not negate the performance gains of the simultaneous subtasks.

Figure: Grid-enabling a long-running job by splitting it into parallel subtasks that are joined at the end
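
To make the pattern concrete, here is a minimal sketch using grid-enabled SAS/CONNECT sessions. It assumes a metadata connection is already configured, a logical grid server named SASApp, and a SALES library that the grid sessions can reach (for example, on the clustered file system, pre-assigned in an autoexec); all of these names are hypothetical and will differ at your site.

   /* launch SAS/CONNECT sessions on the grid (server name is site-specific) */
   %let rc = %sysfunc(grdsvc_enable(_all_, resource=SASApp));

   signon task1;
   signon task2;

   /* each subtask summarizes its own subset of the data, in parallel */
   rsubmit task1 wait=no;
      proc summary data=sales.q1 nway;
         class product; var amount;
         output out=sum1 sum=total;
      run;
   endrsubmit;

   rsubmit task2 wait=no;
      proc summary data=sales.q2 nway;
         class product; var amount;
         output out=sum2 sum=total;
      run;
   endrsubmit;

   waitfor _all_ task1 task2;

   /* bring the partial results back to the client session */
   rsubmit task1; proc download data=sum1 out=sum1; run; endrsubmit;
   rsubmit task2; proc download data=sum2 out=sum2; run; endrsubmit;

   /* the final step that joins the partial results */
   data all_totals;
      set sum1 sum2;
   run;

   signoff _all_;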

Jobs with independent tasks

Another profile is a SAS job that has lots of independent SAS tasks that run by default in sequential fashion but that could easily be run in parallel.

  • A good example of this is all of the modeling done with a SAS Enterprise Miner project. SAS Enterprise Miner is SAS Grid aware and will generate the code to do your model training in parallel using SAS Grid.
  • SAS Risk Dimensions can be deployed to run independent tasks in parallel.
  • SAS Data Integration Studio is also SAS Grid aware and can therefore generate the SAS code to automatically process your data transformations in parallel using SAS Grid.

SAS Code Analyzer

Both of the above profiles can use the SAS Code Analyzer, which was released with SAS 9.2, to help determine whether the SAS code you are running is a candidate for distributed processing. The SCAPROC procedure assists with the difficult and tedious task of determining which steps can be run in parallel. This is especially helpful with legacy SAS jobs. Details on how to use this procedure can be found in this SAS Global Forum paper: Introducing the SAS® Code Analyzer.
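
As a minimal sketch of how the procedure is used (the output path is hypothetical), you bracket the existing program with two PROC SCAPROC steps:

   proc scaproc;
      record '/tmp/job_analysis.txt';   /* start recording information about each step */
   run;

   /* ... the existing SAS job runs here, unchanged ... */

   proc scaproc;
      write;                            /* write the recorded analysis to the file */
   run;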

In summary, all existing SAS jobs can run on your SAS Grid. However, not all SAS processes are good candidates for distributed processing due to the inherent sequential nature of the analytics or flow of the program logic. Please work with your SAS account team to have a technical review of your planned SAS applications, especially if improved performance of individual SAS applications is your primary reason to go to a SAS Grid.

tags: grid, parallelization of SAS jobs, SAS Administrators, SCAPROC
June 13, 2014
 

Let’s be honest.  When well planned, a SAS Grid Computing platform as the basis for a shared, highly available, high-performance analytics environment can pay for itself many times over. However, it is critical that your overall objectives and computing environment be well understood for you to achieve success with your SAS Grid implementation and to get the maximum benefit.

This post is the first in a series that will explore some of the best practices in setting up a high-performance, high-availability SAS analytics environment, but first let’s take time to understand what you can expect from a grid implementation:

When to say yes to the grid

  • Your SAS applications are mission critical, and you need to set up a highly available infrastructure. 
  • You have lots of SAS users running lots of SAS applications, and you want to implement a shared SAS Analytic environment that allocates resources as needed.  
  • You have end-of-month or end-of-quarter SAS processing that has very tight SLAs, and you need to be guaranteed that your compute resources meet these SLAs. 
  • You would like to establish a hardware and software infrastructure that can be scaled out to meet your ever-growing SAS user base and the ever-growing data being analyzed by these SAS users.  
  • You would like to see if you can gain some performance improvements by utilizing the new SAS High-Performance Analytics processes, by converting your existing SAS processes to a distributed processing format, or both. Please note that not all SAS processes are good candidates for distributed processing. For example, processes that rely heavily on OLAP processing do not lend themselves to parallelization.

Learning more

Here are some papers that you can read to learn more on the above: 

SAS GRID 101: How It Can Modernize Your Existing SAS Environment

SAS Goes Grid – Managing the Workload across Your Enterprise

High Availability Services with SAS Grid Manager

High Availability with SAS Grid Manager

The Top Four User-Requested Grid Features Delivered with SAS® Grid Manager 9.4

tags: grid, SAS Administrators