performance

October 10, 2018
 

Deep learning (DL) is a subset of neural networks, which have been around since the 1960s. Limited computing resources and the need for a lot of data during training were the crippling factors for neural networks. But with the growing availability of computing resources such as multi-core machines, graphics processing unit (GPU) accelerators and specialized hardware, DL is becoming much more practical for business problems.

Financial institutions use a large number of computations to evaluate portfolios and to price securities and financial derivatives. For example, every cell in a spreadsheet potentially implements a different formula. Time is also usually of the essence, so having the fastest possible technology to perform financial calculations with acceptable accuracy is paramount.

In this blog, we talk to Henry Bequet, Director of High-Performance Computing and Machine Learning in the Finance Risk division of SAS, about how he uses DL as a technology to maximize performance.

Henry discusses how the performance of numerical applications can be greatly improved by using DL. Once a DL network is trained to compute analytics, using that DL network becomes drastically faster than more classic methodologies like Monte Carlo simulations.

We asked him to explain deep learning for numerical analysis (DL4NA) and the most common questions he gets asked.

Can you describe the deep learning methodology proposed in DL4NA?

Yes, it starts with writing your analytics in a transparent and scalable way. All content that is released as a solution by the SAS financial risk division uses the "many task computing" (MTC) paradigm. Simply put, when writing your analytics using the many task computing paradigm, you organize code in SAS programs that define task inputs and outputs. A job flow is a set of tasks that will run in parallel, and the job flow will also handle synchronization.

Fig 1.1 A Sequential Job Flow

The job flow in Figure 1.1 visually gives you a hint that the two tasks can be executed in parallel. The addition of the task into the job flow is what defines the potential parallelism, not the task itself. The task designer or implementer doesn’t need to know that the task is being executed at the same time as other tasks. It is not uncommon to have hundreds of tasks in a job flow.

Fig 1.2 A Complex Job Flow

Using that information, the SAS platform and the Infrastructure for Risk Management (IRM) are able to automatically infer the parallelization in your analytics. This allows your analytics to run on tens or hundreds of cores. (Most SAS customers run out of cores before they run out of tasks to run in parallel.) By running SAS code in parallel, on a single machine or on a grid, you gain orders of magnitude of performance improvements.
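To make the idea of inferring parallelism from task inputs and outputs concrete, here is a minimal Python sketch. It is not how IRM is implemented; the function name, task names and data set names are all hypothetical. It simply groups tasks into stages, where every task in a stage has all of its inputs available and can therefore run at the same time as the others in that stage.

```python
def parallel_stages(tasks):
    """Group tasks into stages; tasks within one stage can run in parallel.

    `tasks` maps a task name to a pair (inputs, outputs), each a set of
    data set names. A task is ready once every one of its inputs that is
    produced by some task in the flow has actually been produced.
    """
    produced_by_some_task = set()
    for _, outputs in tasks.values():
        produced_by_some_task |= outputs

    available = set()        # outputs produced by the stages built so far
    remaining = dict(tasks)
    stages = []
    while remaining:
        ready = sorted(
            name
            for name, (inputs, _) in remaining.items()
            if (inputs & produced_by_some_task) <= available
        )
        if not ready:
            raise ValueError("circular dependency between tasks")
        for name in ready:
            available |= remaining.pop(name)[1]
        stages.append(ready)
    return stages


# Hypothetical job flow: two independent tasks feed a third one.
flow = {
    "filter_portfolio": ({"portfolio"}, {"portfolio_filtered"}),
    "load_market_data": ({"market_raw"}, {"market_clean"}),
    "price_positions": ({"portfolio_filtered", "market_clean"}, {"prices"}),
}
stages = parallel_stages(flow)
print(stages)
```

In this sketch the first two tasks land in the same stage because neither consumes the other's output, which mirrors the point that the job flow, not the task, defines the potential parallelism.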

This methodology also has the benefit of expressing your analytics in the form of Y = f(x), which is precisely what you feed a deep neural network (DNN) to learn. That organization of your analytics allows you to train a DNN to reproduce the results of your analytics originally written in SAS. Once you have the trained DNN, you can use it to score tremendously faster than the original SAS code. You can also use your DNN to push your analytics to the edge. I believe that this is a powerful methodology that offers a wide spectrum of applicability. It is also a good example of deep learning helping data scientists build better and faster models.

Fig 1.3 Example of a DNN with four layers: two visible layers and two hidden layers.

The number of neurons of the input layer is driven by the number of features. The number of neurons of the output layer is driven by the number of classes that we want to recognize, in this case, three. The number of neurons in the hidden layers as well as the number of hidden layers is up to us: those two parameters are model hyper-parameters.
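As a small illustration of how those choices determine the size of the network, the following Python sketch counts the weights and biases of a fully connected network. The feature count and the hidden layer sizes are assumptions made up for the example; only the three output classes come from the description above.

```python
def dense_parameter_count(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per connection plus one bias per receiving neuron.
        total += n_in * n_out + n_out
    return total


n_features = 4       # assumed number of input features
n_classes = 3        # three classes to recognize
hidden = [8, 8]      # hypothetical hyper-parameters: two hidden layers of 8 neurons

layer_sizes = [n_features, *hidden, n_classes]
parameter_count = dense_parameter_count(layer_sizes)
print(layer_sizes, parameter_count)
```

Changing the `hidden` list is the only thing that is "up to us" here; the first and last entries of `layer_sizes` are fixed by the data and by the problem.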

How do I run my SAS program faster using deep learning?

In the financial risk division, I work with banks and insurance companies all over the world that are faced with increasing regulatory requirements like CCAR and IFRS17. Those problems are particularly challenging because they involve big data and big compute.

The good news is that new hardware architectures are emerging with the rise of hybrid computing. Computers are increasingly built as a combination of traditional CPUs and innovative devices like GPUs, TPUs, FPGAs and ASICs. Those hybrid machines can run significantly faster than legacy computers.

The bad news is that hybrid computers are hard to program and each of them is specific: code that you write for a GPU won’t run on an FPGA, and it won’t even run on different generations of the same device. Consequently, software developers and software vendors are reluctant to jump into the fray, and data scientists and statisticians are left out of the performance gains. So there is a gap, a big gap in fact.

To fill that gap is the raison d’être of my new book, Deep Learning for Numerical Applications with SAS. Check it out and visit the SAS Risk Management Community to share your thoughts and concerns on this cross-industry topic.

Deep learning for numerical analysis explained was published on SAS Users.

August 4, 2016
 

In my current role I have the privilege of managing the Performance Lab in SAS R&D. Helping users work through performance challenges is a critical part of the Lab’s mission. This spring, my team has been actively testing new and enhanced storage arrays from EMC along with the Veritas clustered file system. We have documented our findings in SAS Usage Note 42197, “List of Useful Papers.”

The two flash-based storage products we tested from EMC are the new DSSD D5 appliance and the XtremIO array. The bottom line: both performed very well with a mixed analytics workload. For more details on the results of the testing, along with the tuning guidelines for using SAS with this storage, please review these papers:

As with all storage, please validate that the storage can deliver all the “bells and whistles” you will need to support the failover and high-availability needs of your SAS applications.

In addition to the storage testing, we tested with the latest version of Veritas InfoScale clustered file system.  We had great results in a distributed SAS environment with several SAS compute nodes all accessing data in this clustered file system.  A lot of information was learned in this testing and captured in the following paper:

My team plans to continue testing new storage and file system technologies throughout the remainder of 2016. If there is a storage array or technology you would like to have tested, please let us know by sharing it in the comments section below or contacting me directly.

 

tags: performance, storage

Testing EMC Storage and Veritas shared file systems was published on SAS Users.

April 22, 2016
 

Our society lives under a collective delusion that burning out is a requirement to success. We’ve been conditioned to believe that sacrificing family, relationships and what’s personally important opens the door to achievement. But how can you be an effective leader, run a successful company or properly manage employees when […]

Sleep: Your legal performance enhancer was published on SAS Voices.

November 18, 2015
 

From time to time we’ll hear from customers who are encountering performance issues. SAS has a sound methodology for resolving these issues and we are always here to keep your SAS system humming. However, many problems can be resolved with some simple suggestions. This blog will discuss different types of performance issues you might encounter, with some suggestions on how to effectively resolve them.

Situation: You are a new SAS customer or are simply running a new SAS application on new hardware
Suggestion: Be sure you’ve read and applied all the guidelines in the various tuning papers that have been written:

Making sure you understand the performance issues will help us determine what the next steps are. It’s worth noting that 90% of performance issues occur because your hardware, operating system and/or storage has not been configured based on the tuning guidelines listed above. In a recent case we were able to get a 20% performance gain from a long-running ETL process by adjusting two RHEL kernel parameters that have been documented for many years in our tuning paper.

Situation: Your SAS application has been running and over time gets slower
Suggestion: Determine whether the number of concurrent SAS sessions/users and/or the volume of data (both input and lookup tables) has increased. This is the top reason for a gradual slowdown.

Situation: Your SAS application took a significant performance hit overnight or in a short time frame.
Suggestion: The first thing you want to do is see whether any maintenance (tweaking of your system, hotfix, patch, …) has been applied to your operating system, VMware, and/or storage arrays. A lot of customers have applied maintenance (not to SAS), and all of a sudden SAS is running 2-5 times longer. You’ll want to check that all the operating system settings, mount options, and VMware settings are the same after the maintenance as they were before maintenance.

In conclusion, if you are having performance issues, check the suggested tuning guidelines. Also, be sure to keep track of all the settings for the hardware and storage infrastructure when applying maintenance to make sure these settings are the same afterwards as they were before.
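One simple way to keep track of settings around a maintenance window is to record them before and after and compare the two snapshots. The Python sketch below assumes you have already captured the settings as name/value pairs by whatever means suits your platform; the setting names and values are placeholders, not tuning recommendations.

```python
def changed_settings(before, after):
    """Return {name: (old, new)} for every setting that differs between snapshots.

    A setting that exists in only one snapshot is reported with None on the
    side where it is missing.
    """
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


# Hypothetical snapshots taken before and after maintenance.
before = {
    "mount_options:/saswork": "noatime",
    "kernel_parameter_x": "10",
}
after = {
    "mount_options:/saswork": "defaults",
    "kernel_parameter_x": "10",
}

diff = changed_settings(before, after)
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

An empty result means the recorded settings survived the maintenance unchanged; anything else is a good first place to look when SAS suddenly runs longer.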

Of course, if you have followed the guidelines and maintenance is not the reason for your performance issues, please contact us. We are here to help.

tags: performance, SAS Administrators, tuning

Tips to keep your SAS system humming was published on SAS Users.

March 11, 2015
 

SAS FULLSTIMER is a SAS system option that takes operating system information that is collected while SAS processes run and writes that information to the SAS log. Using it can add up to 10 additional lines to your SAS log for each SAS step, so why would I recommend turning it on?

This additional information includes memory utilization, a date/time stamp for when each step finished, and context switch information, along with some other operating-system-specific information regarding the SAS step that just finished. Why would you need this much information?

This data is very useful in helping your SAS administrator and SAS support personnel determine why a SAS process may be running slower than expected. Having this information collected every time a SAS job is run means that data can be used to help determine which SAS step ran slower and at what time and under what circumstances.

Since the IT staff for most organizations are collecting hardware monitor data on a daily basis, they can then use the information from the SAS log to pinpoint what time of day the performance issue occurred, on what system and using what file systems.

Again, this is just one way SAS users can be proactive in trying to solve any future performance issues. And all you need to do is add -FULLSTIMER to your SAS configuration file or to the SAS command line that you use to invoke SAS.

If you have any questions on the above, please let us know.  Here are additional resources if you want to learn more about SAS FULLSTIMER and its use:

SAS timer - the key to writing efficient SAS code

Improving performance: Determine the cause

Tune your SAS system for max performance

Troubleshoot Your Performance Issues: SAS® Technical Support Shows You How

Increasing IT’s awareness of SAS: A few good practices

tags: performance, SAS Administrators
January 21, 2015
 

New Year to me is always a stark reminder of the inexorability of Time. In a day-to-day life, time is measured in small denominations - minutes, hours, days… But come New Year, and this inescapable creature – Time – makes its decisive leap – and in a single instant, we become officially older and wiser by the entire year’s worth.

What’s a better time to re-assess ourselves, personally and professionally! What’s a better time to Resolve to improve your SAS programming skills, as skillfully crafted by Michael A. Raithel in his latest blog post.

I thought I could write a post showing how to be efficient and kill two birds with one stone. The birds here are two of Raithel’s proposed New Year’s resolutions:

#2 Volunteer to help junior SAS programmers.

#12 Reduce processing time by writing more efficient programs.

To combine the two, I could have titled this post “Helping junior SAS programmers to reduce processing time by writing more efficient programs”. However, I am not going to “teach” you efficient coding techniques which are a subject deserving of a multi-volume treatise. I will just give you a simple tool that is a must-have for any SAS programmer (not just junior) who considers writing efficient SAS code important. This simple tool has been the ultimate judge of any code’s efficiency and it is called timer.

What is efficient?

Setting aside hardware constraints and limitations (which are increasingly diminishing nowadays), efficient means fast or at least fast enough not to exceed ever-shrinking user tolerance of wait time.

Of course, if you are developing a one-time run code to generate some ad-hoc report or produce results for uniquely custom computations, your efficiency criteria might be different, such as “as long as it ends before the deadline” or at least “does not run forever”.

However, in most cases, SAS code is developed for some applications, in many cases interactive applications, where many users run the code over and over again. It may run behind the scenes of a web application with a user waiting (or rather not wanting to wait) for results. In these cases, SAS code must be really fast, and any improvement in its efficiency is multiplied by the number of times it is run.

What is out there?

SAS provides the following SAS system options to measure the efficiency of SAS code:

STIMER. You may not realize that you use this option every time you run a SAS program. This option is turned on by default (NOSTIMER to turn it off) and controls information written to the SAS Log by each SAS step. Each step of a SAS program by default generates the following sample NOTE in SAS Log:

NOTE: DATA statement used (Total process time):
      real time           1.31 seconds
      cpu time            1.10 seconds

FULLSTIMER. This option (NOFULLSTIMER to turn it off) provides much more information on used resources for each step. A sample Log output of a FULLSTIMER option for a SAS Data Step is listed below:

NOTE: DATA statement used:
real time                   0.06 seconds
user cpu time               0.02 seconds
system cpu time             0.00 seconds
Memory                      88k
Page Faults                  10
Page Reclaims                 0
Page Swaps                    0
Voluntary Context Switches   22
Involuntary Context Switches  0
Block Input Operations       10
Block Output Operations      12

While the FULLSTIMER option provides plenty of information for SAS code optimization, in many cases it is more than you really need. On the other hand, STIMER may provide quite valuable information about each step, thus identifying the most critical steps of your SAS program.
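If you want to find the most critical steps without reading the whole log by eye, the real time values can be pulled out of the log text. The Python sketch below is only an illustration: the function name is made up, and it assumes the notes look like the two samples above, with real time reported in plain seconds. Logs that report longer times in another layout would need a different pattern.

```python
import re

# "NOTE: DATA statement used ..." starts the timing note for a step.
NOTE_RE = re.compile(r"^NOTE: (?P<step>.+?) used")
# "real time   1.31 seconds" carries the elapsed time of that step.
REAL_RE = re.compile(r"^\s*real time\s+(?P<value>[\d.]+) seconds")


def real_times(log_text):
    """Return a list of (step, real_seconds) pairs found in SAS log text."""
    results = []
    step = None
    for line in log_text.splitlines():
        note = NOTE_RE.match(line)
        if note:
            step = note.group("step")
            continue
        real = REAL_RE.match(line)
        if real and step is not None:
            results.append((step, float(real.group("value"))))
            step = None
    return results


# Sample text modeled on the STIMER and FULLSTIMER notes shown above.
sample_log = """\
NOTE: DATA statement used (Total process time):
      real time           1.31 seconds
      cpu time            1.10 seconds
NOTE: DATA statement used:
real time                   0.06 seconds
user cpu time               0.02 seconds
"""

times = real_times(sample_log)
slowest = max(times, key=lambda pair: pair[1])
print(times)
print("slowest step:", slowest)
```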

Get your own SAS timer

If your efficiency criterion is how fast your SAS program runs as a whole, then you need an old-fashioned timer, with start and stop events and time elapsed between them. To achieve this in SAS programs, I use the following technique.

  1. At the very beginning of your SAS program, place the following line of code that effectively starts the timer and remembers the start time:

    /* Start timer */
    %let _timer_start = %sysfunc(datetime());


  2. At the end of your SAS program place the following code snippet that captures the end time, calculates duration and outputs it to the SAS Log:

    /* Stop timer */
    data _null_;
      dur = datetime() - &_timer_start;
      put 30*'-' / ' TOTAL DURATION:' dur time13.2 / 30*'-';
    run;
    

    The resulting output in the SAS log will look like this:

    ------------------------------
     TOTAL DURATION:   0:01:31.02
    ------------------------------
    

    Despite its utter simplicity, this little timer is a very convenient little tool to improve your SAS code efficiency. You can use it to compare or benchmark your SAS programs in their entirety.

    Warning. In the above timer, I used the datetime() function, and I insist on using it instead of the time() function that I have seen in many online resources. Keep in mind that the time() function resets to 0 at midnight. While time() will work just as well when start and stop times are within the same date, it will produce completely meaningless results when the start time falls within one date and the stop time falls within another date. You can easily trap yourself if you submit your SAS program right before midnight and it ends after midnight, which will result in an incorrect, even negative, duration.
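    The midnight trap is easy to reproduce outside of SAS. The following Python sketch is an analogy only, not SAS code: a helper that counts seconds since midnight stands in for a time-of-day clock like time(), and full timestamps stand in for datetime(). The start time is a made-up example.

```python
from datetime import datetime, timedelta


def seconds_since_midnight(moment):
    """Mimic a time-of-day clock that resets to 0 at midnight."""
    return moment.hour * 3600 + moment.minute * 60 + moment.second


start = datetime(2015, 1, 21, 23, 59, 0)   # hypothetical start, one minute before midnight
stop = start + timedelta(seconds=91)       # the program runs for 0:01:31

# Time-of-day arithmetic breaks when the run crosses midnight.
time_of_day_duration = seconds_since_midnight(stop) - seconds_since_midnight(start)

# Full date-time arithmetic gives the true elapsed time.
full_duration = (stop - start).total_seconds()

print(time_of_day_duration)   # -86309
print(full_duration)          # 91.0
```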

    I hope using this SAS timer will help you write more efficient SAS programs.

tags: efficient SAS code, performance, SAS Professional Services, SAS Programmers, SAS system options
November 5, 2014
 

For those of you who have followed my SAS Administration blogs, you will know that setting up your IO subsystem (the entire infrastructure from the network/fibre channels in your physical server, across your connections, to the fibre adapters into the storage array, and finally to the physical disk drives in the storage array) is very near and dear to my heart. Based on this desire, as my team learns information on better ways to configure the IO system infrastructure, we like to document it in new white papers, or by updating existing white papers.

Over the past few months, my team has been working with several SSD and FLASH storage vendors to stress their storage by running a mixed analytics workload and working with engineers from the storage vendors to tune the storage to run as optimally as possible. Many of these papers are already available for your review on the SAS Usage Note 53874 entitled Troubleshooting system performance problems: I/O subsystem and storage papers. A new paper Performance and Tuning Considerations for SAS on the EMC XtremIO All-Flash Array was added this week and similar papers for the EMC Isilon arrays and Intel P3700 (Fultondale) SSD cards will be added in the next week.

In addition to the above new papers, there are several papers that have been updated in the past few months.

A Survey of Shared File Systems (updated October 2014) was recently updated to add information on some tuning tips for the Veritas Clustered File System we recently learned while working with a SAS and Veritas CFS customer. 

We also clarified our position regarding the use of NFS and SAS. NFS can be a reasonable choice for small SAS Grid Manager implementations and/or when performance is less of a concern for the customer. In general, it is not a good idea to place SASWORK on an NFS file system where performance is a concern.

Best Practices for Configuring your IO Subsystem for SAS® 9 Applications (Revised May 2014) has been updated to reflect the recent influence of SSD and FLASH storage. It will be revised again for SAS Global Forum 2015, where it will be presented again.

How to Maintain Happy SAS®9 Users (revised June 2014) has been updated to reflect the popularity of SAS Grid. It will also be revised for SAS Global Forum 2015, where it will be presented again.

Please continue to send us your questions and concerns around configuring your IO subsystems properly for SAS applications.

tags: clustered file systems, flash storage, grid, performance, SAS Administrators
October 22, 2014
 

If you’ve used SAS Environment Manager, you know what kind of information it’s capable of providing – the metrics that show you how the resources in your SAS environment are performing. But what if it could do more? What if it could automatically collect and standardize metric data from SAS logs for your SAS applications? What if it could automatically collect and standardize metric data about the computing resources that make up your SAS system? And what if it could store all of that data in a single location, where it could be used to generate detailed predefined reports or to perform your own analysis?

All of this is possible with the SAS Environment Manager Service Management Architecture, which is a new feature in SAS Environment Manager 2.4. These new functions enable SAS Environment Manager to be a part of a service-oriented architecture (SOA) and can help your organization meet your IT Infrastructure Library (ITIL) reporting and measurement requirements.

You can think of the core of this new feature as being made up of three broad functional areas: data collection, data storage and data reporting.

Data collection

Data collection is handled by extract, transform and load (ETL) processes. These processes obtain metric data from your system, convert the data to a standard format and load the data into storage.  Three ETL packages are included with SAS Environment Manager:

The Audit, Performance, and Measurement (APM) ETL is used to collect, process and store information from SAS logs. The APM metric data can show you things such as:

  • which SAS procedures are used most often and how much time each one is used
  • the top ten users of your SAS Workspace Server and the details of each user’s usage
  • the average response time of stored processes

The Agent-Collected Metric (ACM) ETL is used to collect, process, and store information about the computing resources in your SAS system. It handles metric data from resources such as servers and disk storage. The ACM metric data can show you things such as:

  • the disk service time of each file mount
  • the number of calls to each IOM server
  • the memory usage for your Web Application Server

The Solution Kits ETL is used to collect, process and store information from specific SAS solutions and applications as those ETL processes are developed and delivered by SAS.

Data storage

The SAS Environment Manager Data Mart handles the data storage function. The Data Mart is made up of a set of predefined SAS data sets that store the metric data that is collected by the ETL processes. Once in the Data Mart, the data is used by the built-in reporting feature of SAS Environment Manager Service Management Architecture. However, the data is also available for you to perform your own analysis using SAS programs and applications or third-party monitoring tools.

Data reporting

The data reporting function uses the Report Center, which provides a wide variety of predefined reports that are tailored to the metric data collected by the ETL processes. These reports enable you to visualize the metric data and more easily identify potential problems.  You can also create your own custom reports.


Other capabilities

In addition to these core functions, the SAS Environment Manager Service Management Architecture includes these capabilities:

  • Setup based on best practices – Alerts, resource definitions, resource groups and metric collection changes (based on service monitoring best practices) are all automatically applied to SAS Environment Manager. This feature doesn’t require you to use the ETL processes, so you can easily optimize your SAS Environment Manager configuration even if you don’t use any other part of the Service Management Architecture.
  • Event exporting – You can export events from SAS Environment Manager for use by third-party monitoring tools.
  • Event importing – Your SAS solutions and external applications can generate events that you can import into SAS Environment Manager.
  • SAS Visual Analytics autofeed – You can specify that the metric data from the SAS Environment Manager Data Mart is automatically copied to a specified location where SAS Visual Analytics can then access it and load it into the application. This feature enables you to use the powerful analysis and reporting capabilities of SAS Visual Analytics with the metric data from SAS Environment Manager.

SAS Environment Manager Service Management Architecture is provided as part of SAS Environment Manager 2.4, but it’s not active by default. To use these new features, you have to follow an initialization procedure for each of the components that you want to use.

For complete information about the SAS Environment Manager Service Management Architecture, including the initialization procedure, see the SAS Environment Manager 2.4 User’s Guide.

 

tags: ITIL reporting, monitoring jobs and processes, performance, SAS Administrators, SAS Environment Manager
October 8, 2014
 

When SAS is used for analysis on large volumes of data (in the gigabytes), SAS reads and writes the data using large block sequential IO. To gain the optimal performance from the hardware when doing these IOs, we strongly suggest that you review the information below to ensure that the infrastructure (CPUs, memory, IO subsystem) is configured as optimally as possible.

Operating-system tuning. Tuning guidelines for working with SAS on various operating systems can be found in SAS Usage Note 53873.

CPU. SAS recommends the use of current generation processors whenever possible for all systems.

Memory. For each tier of the environment, SAS recommends the following minimum memory guidelines:

  • SAS Compute tier: A minimum of 8GB of RAM per core
  • SAS Middle tier: A minimum of 24GB, or 8GB of RAM per core, whichever is larger
  • SAS Metadata tier:  A minimum of 8GB of RAM per core

It is also important to understand the amount of virtual memory that is required in the system. SAS recommends that virtual memory be 1.5 to 2 times the amount of physical RAM. If, in monitoring your system, it is evident that the machine is paging a lot, then SAS recommends either adding more memory or moving the paging file to a drive with a more robust I/O throughput rate compared to the default drive. In some cases, both of these steps may be necessary.
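The memory guidelines above reduce to a little arithmetic. This Python sketch works through it for a couple of server sizes; the function names and the core counts are made up for illustration, and the results are minimums taken from the guidelines above, not sizing advice for any particular workload.

```python
def minimum_ram_gb(tier, cores):
    """Minimum RAM in GB for a tier, following the per-core guidelines above."""
    per_core_total = 8 * cores
    if tier in ("compute", "metadata"):
        return per_core_total
    if tier == "middle":
        # 24GB or 8GB per core, whichever is larger.
        return max(24, per_core_total)
    raise ValueError(f"unknown tier: {tier!r}")


def virtual_memory_range_gb(physical_ram_gb):
    """Virtual memory of 1.5 to 2 times the physical RAM, as (low, high)."""
    return (1.5 * physical_ram_gb, 2.0 * physical_ram_gb)


compute_ram = minimum_ram_gb("compute", 16)   # hypothetical 16-core compute server
middle_ram = minimum_ram_gb("middle", 2)      # hypothetical 2-core middle-tier server
virtual_low, virtual_high = virtual_memory_range_gb(compute_ram)

print(compute_ram, middle_ram)     # 128 24
print(virtual_low, virtual_high)   # 192.0 256.0
```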

IO configuration. Configuring the IO subsystem (disks within the storage, adaptors coming out of the storage, interconnect between the storage and processors, input into the processors) to be able to deliver the IO throughput recommended by SAS will keep the processor busy, allow the workloads to execute without delays and make the SAS users happy. Here are the recommended IO throughput rates for the typical file systems required by the SAS Compute tier:

  • Overall IO throughput needs to be a minimum of 100-125 MB/sec/core.
  • For SAS WORK, a minimum of 100 MB/sec/core
  • For permanent SAS data files, a minimum of 50-75 MB/sec/core
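Because the throughput targets are expressed per core, the totals for a given server follow directly from its core count. The Python sketch below does that multiplication and compares a measured rate against a target; the function names, the 16-core server and the measured value are hypothetical.

```python
def io_throughput_targets(cores):
    """Minimum IO throughput targets in MB/sec for a compute tier.

    Each value is a (low, high) pair covering the stated per-core minimum.
    """
    return {
        "overall": (100 * cores, 125 * cores),
        "sas_work": (100 * cores, 100 * cores),
        "permanent_data": (50 * cores, 75 * cores),
    }


def meets_target(measured_mb_per_sec, target):
    """True when a measured rate reaches at least the low end of the target."""
    return measured_mb_per_sec >= target[0]


targets = io_throughput_targets(16)   # hypothetical 16-core compute server
print(targets)

# A hypothetical measurement of 1400 MB/sec on the SAS WORK file system
# falls short of the 1600 MB/sec minimum for 16 cores.
work_ok = meets_target(1400, targets["sas_work"])
print(work_ok)   # False
```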

For more information regarding how SAS does IO, please review the Best Practices for Configuring your IO Subsystem for SAS® 9 Applications (Revised May 2014) paper.

IO throughput. Additionally, it is a good idea to establish baseline IO capabilities before end users begin placing demands on the system, and to support monitoring the IO if end users begin reporting changes in performance. To test the IO throughput, platform-specific scripts are available:

File system. The Best Practices for Configuring IO paper above lists the preferred local file systems for SAS (i.e., JFS2 for AIX, XFS for RHEL, NTFS for Windows). Specific tuning for these file systems can be found in the operating system tuning papers above.

For SAS Grid Computing implementations, a clustered file system is required. SAS has tested SAS Grid Manager with many file systems, and the results of that testing along with any available tuning guidelines can be found in the A Survey of Shared File Systems (updated August 2013) paper. In addition to this overall paper, there are more detailed papers on Red Hat’s GFS2 and IBM’s GPFS clustered file systems in SAS Usage Note 53875.

Due to the nature of SAS WORK (the temporary file system for SAS applications), which does large sequential reads and writes and then destroys these files at the termination of the SAS session, SAS does not recommend NFS-mounted file systems for it. NFS systems have a history of file-locking issues, and the network can negatively influence the performance of SAS when accessing files across it, especially when doing writes.

Storage array. Storage arrays play an important part in the IO subsystem infrastructure. SAS has several papers on tuning guidelines for various storage arrays, available through SAS Usage Note 53874.

Miscellaneous. In addition to the above information, there are some general papers on how to set up the infrastructure to best support SAS. These are available for your review:

Finally, SAS recommends regular monitoring of the environment to ensure ample compute resources for SAS. Additional papers are available that provide guidelines for appropriate monitoring. These can be found in SAS Usage Note 53877.

tags: configuration, deployment, performance, SAS Administrators
September 17, 2014
 

Scalability is the key objective of high-performance software solutions. “Scaling out” is a concept which is accomplished by throwing more server machines at a solution so that multiple processes can run in dedicated environments concurrently. This blog post will briefly touch on several scalability concepts that affect SAS.

Functional roles

At SAS, we have a number of different approaches to tackle the ability to scale our software across multiple machines. As we often see with our SAS Enterprise Business Intelligence solution components, we’ll split up the various functional roles of SAS software to run on specific hosts. In one of the most common examples, we’ll set aside one machine for the metadata services, another for the analytic computing workload, and a third for web services.

While this is more complicated than deploying everything to a single machine, it allows for a lot of flexibility in providing responsive resources which are optimized for each role. Now, we’re not limited to just three machines, of course.

Read more:
SAS® 9.4 Intelligence Platform: Overview

Clusters

For each of these functional roles – Meta, Compute, and Web – we can scale them out independently of the others. Depending on the technology involved, different techniques must be employed. The Meta and Web functional roles, in particular, are well-equipped to function as clusters.

Generally speaking, a software cluster is composed of services that present as peers to the outside world. Clusters offer scalability and improved availability: any node of the cluster can perform the requested work, and the cluster can continue to offer service in the face of failure of one or more nodes (depending on configuration), among other features.


Grids

The Compute functional role has some built-in ability to act as a cluster if the necessary SAS software is licensed and properly configured – which is pretty great already – but this ability can be extended even further to act as a grid. A grid is a distributed collection of machines that process many concurrent jobs by coordinating the efficient utilization of resources which may vary from host to host.

With proper implementation and administration, grids are very tolerant of diverse workloads and a mix of resources. For example, it’s possible to inform your grid that certain machines have certain resources available and others do not. Then, when you submit a job to the grid, you can declare parameters on the job that dictate the use of those resources. The grid will then ensure that only machines with those resources are utilized for the job. This simple illustration can be implemented in different ways depending on the kind of resources and with a high-degree of flexibility and control.
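The resource-matching idea can be sketched in a few lines of Python. This is a conceptual illustration only and has nothing to do with the actual grid software interfaces; the host names, resource names and function name are invented for the example.

```python
def eligible_hosts(hosts, required):
    """Return the names of hosts that advertise every required resource.

    `hosts` maps a host name to the set of resources declared for it;
    `required` is the set of resources declared on the job.
    """
    return sorted(
        name for name, resources in hosts.items() if required <= resources
    )


# Hypothetical grid: the administrator declares which hosts have which resources.
hosts = {
    "node1": {"gpu", "bigmem"},
    "node2": {"bigmem"},
    "node3": set(),
}

print(eligible_hosts(hosts, {"bigmem"}))          # ['node1', 'node2']
print(eligible_hosts(hosts, {"gpu", "bigmem"}))   # ['node1']
print(eligible_hosts(hosts, set()))               # ['node1', 'node2', 'node3']
```

A job with no declared requirements can land on any host, while each added requirement narrows the set of machines the grid will consider.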

Another common component of clusters and grids is the use of a clustered file system. A clustered file system is visible to and accessed by each machine in the grid (or cluster) – typically at the exact same physical path. This is primarily used to ensure that all nodes are able to work with the same set of physical files. Those files might range from shared work product to software configuration and backups, even to shared executable binaries. The exact use of the clustered file system can of course vary from site to site.


Massively Parallel Processing

Extending grid computing even further is the concept of massively parallel processing (or MPP). As we see with Hadoop technology and the SAS In-Memory solutions, a number of benefits can be realized through the use of carefully planned MPP clusters.

One common assumption behind MPP (especially in the implementation of the SAS In-Memory solutions) has historically been that all participating machines are as identical as possible. They have the same physical attributes (RAM, CPU, disk, network) as well as the same software components.

The premise of working in an MPP environment is that any given job (that is, something like a statistical computation or data to store for later) is simply broken into equal-size chunks that are evenly distributed to all nodes. Each node works on the problem individually, sharing none of its own CPU, RAM, etc. with the others. Since the ideal is for all nodes to be identical and for each to get the same amount of work without competing for any resources, complex workload management capabilities (such as described for grid above) are not as crucial. This assumption keeps the required administrative overhead for workload management to a minimum.
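Here is a toy Python sketch of that premise, assuming a job that simply sums a list of values. The function names are made up, and the "nodes" are just slices of a list processed one after another; a real MPP system distributes the chunks to separate machines.

```python
def split_evenly(n_rows, n_nodes):
    """Return per-node row counts that differ by at most one row."""
    base, extra = divmod(n_rows, n_nodes)
    return [base + (1 if i < extra else 0) for i in range(n_nodes)]


def distributed_sum(values, n_nodes):
    """Sum `values` by giving each node its own chunk, then combining results."""
    sizes = split_evenly(len(values), n_nodes)
    partial_sums = []
    start = 0
    for size in sizes:
        # Each node works only on its own chunk and shares nothing.
        partial_sums.append(sum(values[start:start + size]))
        start += size
    return sum(partial_sums), sizes


values = list(range(1, 11))               # ten rows of made-up data
total, sizes = distributed_sum(values, 4)
print(sizes)   # [3, 3, 2, 2]
print(total)   # 55
```

Because every chunk is close to the same size, no node needs to be treated differently from the others, which is why little workload management is required as long as the nodes really are identical.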


Hadoop and YARN

Looking forward, one of the challenges of assuming dedicated, identical nodes and equal-size chunks of work in MPP has been that it’s actually quite difficult to keep everything equal on all nodes all of the time. For one thing, this often assumes that all of the hardware is exclusive for MPP use all of the time – which might not be desirable for systems which sit idle overnight, on weekends, etc. Further, while breaking workload up into equal-size bits is possible, it’s sometimes tough to keep the workload perfectly equal and distributed when there exists competition for finite resources.

For these and many other reasons, Hadoop 2.0 introduces an improvement to the workload management of a Hadoop cluster called YARN (Yet Another Resource Negotiator).

The promise of YARN is to better manage resources in a way accessible to Hadoop as well as various other consumers (like SAS). This will help mature the MPP platform, evolving it from the old Map-Reduce framework to a more flexible platform to handle a wider variety of different workload and resource management challenges.

And of course, SAS solutions are already integrating with YARN to take advantage of the capabilities it offers.


 

tags: grid, Hadoop, parallel processing, performance, SAS Administrators, SAS Professional Services, scalability, YARN