
April 13, 2020

Editor’s note: This is the third article in a series by Conor Hogan, a Solutions Architect at SAS, on SAS and database and storage options on cloud technologies. This article covers the SAS offerings available to connect to and interact with the various storage options available in Microsoft Azure. Access all the articles in the series here.

In this edition of the series on SAS and cloud integration, I cover the various storage options available on Microsoft Azure and how to connect to and interact with them. I focus on three key storage services: object storage, block storage, and file storage. In my previous articles, I covered database as a service (DBaaS) and storage offerings from Amazon Web Services (AWS), as well as DBaaS on Azure.

Object Storage

Azure Blob Storage is a low-cost, scalable cloud object storage service for any type of data. Objects are a great way to store large amounts of unstructured data in their native formats. Individual Azure Blob objects can be up to 4.75 terabytes (TB) in size. Azure organizes these objects into storage accounts. A storage account supplies a globally unique namespace for your data, so no two storage accounts can have the same name, and it is accessible from anywhere in the world over HTTP or HTTPS.

A container organizes a set of blobs, similar to a directory in a traditional file system. You access Azure blobs directly through an API from anywhere in the world. For security reasons, it is vital to grant least-privilege access to a blob.

Be intentional about opening objects up so you do not expose any sensitive data. Security controls are available on individual blobs and on the containers that organize them. By default, blobs are created with no public read access; from there, you can grant permissions to individual users and groups.

The total cost of blob storage depends on the volume of data stored, the type of operations performed, data transfer costs, and data redundancy choices. You can reduce the number of replicas or use one of the various tiers of archive services to reduce the cost of your object storage. Cost is calculated from the terabytes of storage used per month, with added charges for data requests and transfers over the network. Data movement is an unpredictable expense for many users.

Azure Blob Storage Tiers
  Hot – Frequently accessed data
  Cool – Infrequently accessed data, stored for at least 30 days
  Archive – Rarely accessed data, stored for at least 180 days

 
In SAS Viya 3.5, direct support is available for objects stored in Azure Data Lake Storage Gen2. Azure Data Lake Storage Gen2 extends Azure Blob Storage capabilities and optimizes it for analytics workloads. You can read SAS data sets, CSV files, and ORC files from Azure Blob Storage directly by using a CASLIB statement that points to Azure Data Lake Storage (ADLS). If you have files in a different format, you can always copy them to a local file system accessible to the CAS controller, then use CAS actions to load the tables into memory. For automation, you can handle the download step by making HTTP requests directly from your SAS code with PROC HTTP. Remember, object storage places no restrictions on file types, so reading some file types from the local file system may require a SAS Data Connector.
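As a minimal sketch of the direct route, assuming SAS Viya 3.5 and placeholder values for the storage account, file system, application ID, and tenant ID, a CASLIB to ADLS and a load might look like this:

   cas mysess;

   /* Point a caslib at Azure Data Lake Storage Gen2. All quoted     */
   /* values below are placeholders for your own Azure environment.  */
   caslib adls datasource=(
      srctype="adls"
      accountName="myStorageAccount"
      fileSystem="myFileSystem"
      applicationId="my-application-id"
      tenantId="my-tenant-id"
   );

   /* Load a CSV file stored in ADLS into CAS memory. */
   proc casutil incaslib="adls" outcaslib="casuser";
      load casdata="sales.csv" casout="sales";
   run;

For the copy-then-load route, a hedged PROC HTTP sketch (the account, container, and file names are again hypothetical) downloads a blob to the local file system first:

   filename dl "/tmp/sales.csv";

   /* GET a publicly readable blob; a secured blob would also need   */
   /* authentication headers or a shared access signature on the URL.*/
   proc http
      url="https://myaccount.blob.core.windows.net/mycontainer/sales.csv"
      method="GET"
      out=dl;
   run;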

Block Storage

Azure Disks is the block storage service designed for use with Azure Virtual Machines. Block storage is accessible only when attached to an operating system. When thinking about Azure Disks, treat the storage volumes as independent disk drives controlled by the server operating system. You mount an Azure Disk to an operating system as if it were a physical disk. Azure Disks are valuable because they are the storage that persists when you terminate your compute instance. You can choose from four different volume types that supply performance levels at corresponding costs.

Azure offers a choice of HDD or three performance classes of SSD: Standard, Premium, and Ultra. Use Ultra Disk if you need the lowest latency and scalable performance. Standard SSD is the most cost effective, while Premium SSD is the high-performance disk offering. The table below summarizes the four offerings.

Azure Disk Storage Types
  Standard HDD – Backups and non-critical workloads
  Standard SSD – Development or test environments and lightly used workloads
  Premium SSD – Production environments and time-sensitive workloads
  Ultra SSD – High throughput and IOPS; transaction-heavy workloads

 
Azure Disks are the permanent SAS data storage, persisting through a restart of your SAS environment. The Azure Disk type you select has a direct impact on the performance you get from SAS. A best practice is to use compute instances with enhanced Azure Disks performance or dedicated solid-state drive instance storage.
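As a simple illustration, with /sasdata as a hypothetical mount point where an attached Azure Disk has been formatted and mounted, permanent SAS libraries are just directories on that persistent volume:

   /* /sasdata is a hypothetical mount point for an attached Azure   */
   /* Disk. Data written here survives a restart of the SAS          */
   /* environment, unlike scratch space on ephemeral local storage.  */
   libname perm "/sasdata/projects";

   data perm.class;
      set sashelp.class;   /* any table written to PERM persists */
   run;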

File Storage

Azure Files provides access to data through a shared file system. The elastic network file system grows and shrinks as you add or remove files, so you only pay for the storage you consume. Users create, delete, modify, read, and write files organized logically in a directory structure for intuitive access. The service allows multiple users simultaneous access to a common set of file data, managed by user and group permissions.

Azure Files is a powerful tool, especially if you are utilizing a SAS Grid architecture. If your SAS architecture requires a shared location that any node in a group can access and write to, then Azure Files could meet that requirement. To access the data stored in your network file system, you must mount the file system to your operating system. You can mount Azure Files to any Azure Virtual Machine, or even to an on-premises server within your Azure Virtual Network. Azure Files is a fantastic way to set up a shared file system, not only for your data but also to share projects and code between users.
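For example, assuming the share has already been mounted at the hypothetical path /mnt/azurefiles on every node, each node can reference the same libraries and programs:

   /* /mnt/azurefiles is a hypothetical mount point for an Azure     */
   /* Files share mounted identically on every node in the group.    */
   libname shared "/mnt/azurefiles/sasdata";

   /* Shared code works the same way: any node can run a program     */
   /* maintained on the share.                                       */
   filename prog "/mnt/azurefiles/programs/report.sas";
   %include prog;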

Finally

Storage is a key component of cloud computing because it enables users to stop their compute instances while their most important data remains in place. Storage services make it much easier to manage and scale your data. For example, Blob storage is a great place to store files that you want to make available to anyone, anywhere.

Block storage drives the performance of your environment. Abundant and performant block storage is essential to making your application run against the massive scale of data that SAS is eager to consume. Block storage is where your operating system and software ultimately are installed.

File storage is a great service to attach shared file systems to your compute instances. This is a great place to collaborate or migrate file system data from one compute instance to another. SAS is a compute engine running on data.

Without a robust set of storage tools to persist that data, you may not get the performance you desire, or the progress you make will be lost when you shut down your compute instances.


Storage in the Cloud – SAS and Azure was published on SAS Users.

September 4, 2019

Editor’s note: This article is a continuation of the series by Conor Hogan, a Solutions Architect at SAS, on SAS and database and storage options on cloud technologies. Access all the articles in the series here.

In a previous article in this series, Accessing Databases in the Cloud – SAS Data Connectors and Amazon Web Services, I covered SAS and database as a service (DBaaS) and storage offerings from Amazon Web Services (AWS). Today, I cover the various storage options available on AWS and how to connect to and interact with them from SAS.

Object Storage

Amazon Simple Storage Service (S3) is a low-cost, scalable cloud object storage service for any type of data in its native format. Individual Amazon S3 objects can range in size from 1 byte all the way to 5 terabytes (TB). Amazon S3 organizes these objects into buckets. A bucket name is globally unique. You access the bucket directly through an API from anywhere in the world, if granted permissions. The default granted to the bucket is least access. Amazon advertises 11 9's, or 99.999999999%, of durability, meaning it is extremely unlikely you will ever lose your data. Data replicates automatically across availability zones to meet this durability. You can reduce the number of replicas or use one of the various tiers of archive services to reduce your object storage cost. Costs are calculated based on terabytes of storage per month, with added costs for requests and transfers of data.

SAS and S3

Support for Amazon Web Services S3 as a Caslib data source for SAS Cloud Analytic Services (CAS) was added in SAS Viya 3.4. This data source enables you to access SASHDAT files and CSV files in S3. You can use the CASLIB statement or the table.addCaslib action to add a caslib for S3. SAS is currently exploring native object storage integration with AWS S3 for more file types. For other file types, you can copy the data from S3 and then use a SAS Data Connector to load the data into memory. For example, if I had Excel data in S3, I could use PROC S3 to copy the data locally and then load it into CAS using the SAS Data Connector to PC Files.
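As a hedged sketch of both routes, with the access keys, region, bucket, and file names below all placeholders for your own environment, an S3 caslib and a PROC S3 copy might look like this:

   /* Direct route: point a caslib at an S3 bucket and load a CSV or */
   /* SASHDAT file into CAS memory. Set region to match your bucket. */
   caslib s3data datasource=(
      srctype="s3"
      accessKeyId="your-access-key-id"
      secretAccessKey="your-secret-access-key"
      region="US_East"
      bucket="mybucket"
   );

   proc casutil incaslib="s3data" outcaslib="casuser";
      load casdata="sales.csv" casout="sales";
   run;

   /* Copy-then-load route for other file types: pull an Excel file  */
   /* down with PROC S3 before reading it locally.                   */
   proc s3 keyid="your-access-key-id" secret="your-secret-access-key";
      get "/mybucket/data/sales.xlsx" "/tmp/sales.xlsx";
   run;

From there, the local copy can be loaded into CAS through the SAS Data Connector to PC Files, if it is licensed in your environment.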

Block Storage

Amazon Elastic Block Store (EBS) is the block storage service designed for use with Amazon Elastic Compute Cloud (EC2). The storage is accessible only when attached to an operating system. Storage volumes can be treated as independent disk drives controlled by a server operating system. You mount an EBS volume to an operating system as if it were a physical disk. EBS volumes are valuable because they are the storage that persists when you terminate your compute instance. You can choose from four different volume types that supply performance levels at corresponding costs.

SAS and EBS

EBS is used as the permanent SAS data storage and persists through a restart of your SAS environment. The choices made when selecting from the different EBS volume types have a direct impact on the performance that you get from SAS. One thing to consider is using compute instances that have enhanced EBS performance or dedicated solid-state drive instance storage. For example, the SAS Viya on AWS QuickStart uses Storage Optimized and Memory Optimized compute instances with local NVMe-based SSDs that are physically connected to the host server, with a lifetime coupled to that of the instance. This is beneficial for performance.

SAS Cloud Analytic Services (CAS) is an in-memory server that relies on the CAS Disk Cache as its virtual memory storage backend, especially when you are reading data from a database. In that case, make sure you have enough block storage, in the form of EBS volumes, for use as the CAS Disk Cache.

File Storage

Amazon Elastic File System (EFS) provides access to data through a shared file system. EFS is an elastic network file system that grows and shrinks as you add or remove files, so you only pay for the storage you consume. Users create, delete, modify, read, and write files organized logically in a directory structure for intuitive access. This allows simultaneous access for multiple users to a common set of file data managed with user and group permissions. Amazon FSx for Lustre is the high-performance file system service.

SAS and EFS

EFS shared file system storage can be a powerful tool if you are utilizing a SAS Grid architecture. If your SAS architecture requires a shared location that any node in a group can access and write to, then EFS could meet your requirement. To access the data stored in your network file system, you must mount the EFS file system. You can mount your Amazon EFS file systems to any EC2 instance, or to any on-premises server connected to your Amazon VPC.

BONUS: Serverless

Amazon Athena is a query service for Amazon S3. This service makes it easy to submit queries against the objects stored in S3, and you can analyze this data using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. Amazon Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet.

SAS and Athena

Amazon Athena is ODBC/JDBC compliant, which means I can use SAS/ACCESS Interface to ODBC or SAS/ACCESS Interface to JDBC to connect from SAS. Download an Amazon Athena ODBC driver and submit code from SAS just like you would against any other ODBC data source. Athena is a great tool if you want to use the serverless computing power of Amazon to query data in S3.
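As a minimal sketch, assuming the Amazon Athena ODBC driver is installed and registered under the hypothetical data source name AthenaDSN, the connection looks like any other ODBC source:

   /* AthenaDSN is a hypothetical DSN configured for the Amazon      */
   /* Athena ODBC driver on the SAS host.                            */
   libname athena odbc datasrc="AthenaDSN";

   /* Query an S3-backed table; sales_summary is a hypothetical      */
   /* table defined in the Athena catalog.                           */
   proc sql;
      select * from athena.sales_summary;
   quit;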

Finally

Many times, we do not have a choice in the technologies we use or the infrastructures on which they sit. Luckily, if you use AWS, integration with SAS is not a concern. I've now covered databases and storage for AWS. In future articles, I'll cover the same topics for Microsoft Azure and Google Cloud Platform.


Storage in the Cloud – SAS and Amazon Web Services was published on SAS Users.

August 4, 2016

In my current role I have the privilege of managing the Performance Lab in SAS R&D. Helping users work through performance challenges is a critical part of the Lab's mission. This spring, my team has been actively testing new and enhanced storage arrays from EMC along with the Veritas clustered file system. We have documented our findings in SAS Usage Note 42197, "List of Useful Papers."

The two flash-based storage arrays we tested from EMC are the new DSSD D5 appliance and the XtremIO array. The bottom line: both performed very nicely with a mixed analytics workload. For more details on the results of the testing, along with tuning guidelines for using SAS with this storage, please review the papers listed in the usage note above.

As with all storage, please validate that the storage can deliver all the "bells and whistles" you will need to support the failover and high availability needs of your SAS applications.

In addition to the storage testing, we tested the latest version of the Veritas InfoScale clustered file system. We had great results in a distributed SAS environment with several SAS compute nodes all accessing data in this clustered file system. A lot was learned in this testing and captured in a paper that is also listed in the usage note.

My team plans to continue testing new storage and file system technologies throughout the remainder of 2016. If there is a storage array or technology you would like us to test, please let us know by sharing it in the comments section below or by contacting me directly.

 


Testing EMC Storage and Veritas shared file systems was published on SAS Users.

June 10, 2015

Recently my wife and I took our annual anniversary trip – this time we went to the Grand Canyon, staying in Las Vegas. In researching our options to fly from Raleigh-Durham (RDU) to Las Vegas (LAS), we had several different selection criteria:

  • what time we wanted to leave
  • what time we wanted to arrive
  • price
  • number of stops
  • layover time
  • airline – loyalty program
  • type of aircraft – seating, amenities, food, wifi

All of the flights we looked at would get us from RDU to LAS and back, so the destination wasn't the issue – it was how much value we placed on each of the attributes: arriving in the afternoon (hotel check-in is 3:00pm) versus spending more money for a nonstop flight, for example. We made our decisions based on our specific needs at that time. We also had different opinions of what was important (I'm basically cheap, and my wife refuses to take the red-eye flight).

The evaluation of storage for a SAS solution can be viewed in a similar fashion. There are tradeoffs to be made, or certainly criteria which will be evaluated and prioritized. This blog posting will briefly examine three such attributes and how they may impact storage in a SAS environment.

Who says you can’t have it all?

[Figure: storage tradeoffs]

This diagram highlights three of the more common attributes that are considered when evaluating storage. While there are certainly other considerations (capacity, interfaces, architecture), these three are usually involved in most storage decisions. The diagram also suggests that there are tradeoffs to be considered: for example, between price and performance (higher performance may require a higher price). Let's briefly examine each of these, and where we may see tradeoffs in a SAS environment.

Price

Price is usually among the first attributes that come up in any discussion of storage. Everyone is looking to save money, and unfortunately storage often gets compromised. Consider this scenario: our SAS deployment will need about 5 terabytes (TB) of storage. In terms of raw capacity, a new 5TB disk drive can be bought from a number of online vendors for around $150.00 USD. While this drive may meet the capacity requirement, it is most likely not the best selection for a SAS deployment, especially if there are performance or availability considerations. Typical enterprise-class SAS storage may involve configurations with multiple disks and controllers, and perhaps shared storage such as Network Attached Storage (NAS) or a Storage Area Network (SAN). Factoring in these, and possibly other, considerations would most likely (significantly!) increase the price of our storage.

Performance

SAS applications are consumers of storage and have significant performance expectations for I/O throughput. Many SAS field consultants can share stories of under-performing storage leading to failed deployments and unhappy customers. SAS publishes minimum recommended I/O throughput rates for file systems used in a SAS environment, and the Performance Evaluation team within SAS R&D has written several papers that document best practices and tuning guidelines. There is even a usage note about testing throughput for your SAS 9 file systems. Multiple configuration options are reviewed and discussed, ranging from shared file systems to external SAN or NAS arrays.

Availability

Deploying SAS applications into a business-critical environment, or one with availability requirements such as a Service Level Agreement, requires careful attention to the type and configuration of storage used. Since SAS is implemented on the host OS file systems, commonly used high availability strategies can be applied effectively. From simple strategies, such as configuring local storage with RAID mirroring, to more complex enterprise-class solutions, such as redundancy through a SAN, the appropriate level of high availability can be designed and deployed to ensure that the storage meets the needs of the business.

So how does all this fit together?

[Figure: storage tradeoffs]

As you can see, none of these criteria should be considered independent of the others when designing and evaluating storage solutions for SAS environments. There will be tradeoffs made in the evaluation process, and priorities will be established. For some areas, such as performance, there are guidelines established by SAS R&D. In other areas, specific needs of the customer (a specific SLA, for example) may dictate design decisions. In addition, there's some flexibility in certain areas – file systems containing permanent SAS data should be allocated to a more available, more protected storage area than the temporary file system of SASWORK. A detailed analysis of the storage needs of the SAS deployment, as part of the overall architecture design, will consider these three criteria in addition to others.

 

In case you were wondering, we didn’t take the red-eye.


The post Evaluating tradeoffs when designing storage for SAS applications appeared first on SAS Users.