Open Source

Jun 17, 2017
 

It has been almost a year since then-U.S. CIO Tony Scott introduced the federal open source policy that called for agencies to share federally-developed software source code. The policy, more than anything, aimed to make agencies more agile. Instead of redeveloping the same programs, the open source policy would allow [...]

How analytics and open source can improve government agility was published on SAS Voices by Trent Smith

Jun 7, 2017
 

Innovation used to happen in structured cycles. The new invention was often a planned event and the domain of a select few departments within an organisation. But in today's always-on economy, enterprises need to innovate on a continuous basis to keep up with new players that base their entire businesses [...]

Don’t trip up on the edge of innovation was published on SAS Voices by Peter Pugh-Jones

Apr 9, 2017
 

Thanks to a new open source project from SAS, Python coders can now bring the power of SAS into their Python scripts. The project is SASPy, and it's available on the SAS Software GitHub. It works with SAS 9.4 and higher, and requires Python 3.x.

I spoke with Jared Dean about the SASPy project. Jared is a Principal Data Scientist at SAS and one of the lead developers on SASPy and a related project called Pipefitter. Here's a video of our conversation, which includes an interactive demo. Jared is obviously pretty excited about the whole thing.

Use SAS like a Python coder

SASPy brings a "Python-ic" sensibility to this approach for using SAS. That means that all of your access to SAS data and methods is surfaced using objects and syntax that are familiar to Python users. This includes the ability to exchange data via pandas, the ubiquitous Python data analysis framework. And even the native SAS objects are accessed in a very "pandas-like" way.

import saspy
import pandas as pd

# connect to SAS using the 'winlocal' configuration defined in sascfg.py
sas = saspy.SASsession(cfgname='winlocal')

# reference SASHELP.CARS and compute summary statistics
cars = sas.sasdata("CARS","SASHELP")
cars.describe()

The output is what you expect from pandas...but with statistics that SAS users are accustomed to. PROC MEANS anyone?

In[3]: cars.describe()
Out[3]: 
       Variable Label    N  NMiss   Median          Mean        StdDev  
0         MSRP     .   428      0  27635.0  32774.855140  19431.716674   
1      Invoice     .   428      0  25294.5  30014.700935  17642.117750   
2   EngineSize     .   428      0      3.0      3.196729      1.108595   
3    Cylinders     .   426      2      6.0      5.807512      1.558443   
4   Horsepower     .   428      0    210.0    215.885514     71.836032   
5     MPG_City     .   428      0     19.0     20.060748      5.238218   
6  MPG_Highway     .   428      0     26.0     26.843458      5.741201   
7       Weight     .   428      0   3474.5   3577.953271    758.983215   
8    Wheelbase     .   428      0    107.0    108.154206      8.311813   
9       Length     .   428      0    187.0    186.362150     14.357991   

       Min       P25      P50      P75       Max  
0  10280.0  20329.50  27635.0  39215.0  192465.0  
1   9875.0  18851.00  25294.5  35732.5  173560.0  
2      1.3      2.35      3.0      3.9       8.3  
3      3.0      4.00      6.0      6.0      12.0  
4     73.0    165.00    210.0    255.0     500.0  
5     10.0     17.00     19.0     21.5      60.0  
6     12.0     24.00     26.0     29.0      66.0  
7   1850.0   3103.00   3474.5   3978.5    7190.0  
8     89.0    103.00    107.0    112.0     144.0  
9    143.0    178.00    187.0    194.0     238.0  
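The data exchange with pandas works in both directions. Here's a small sketch that continues the session above: to_df() pulls the SAS data set into a pandas DataFrame, and df2sd() sends a DataFrame back to SAS. The "cheap_cars" table name and the MSRP filter are just examples for illustration.

# pull the SAS data set into a pandas DataFrame
cars_df = cars.to_df()
print(cars_df.head())

# do some pandas work, then push the result back to SAS as a data set
cheap = cars_df[cars_df['MSRP'] < 20000]
cheap_sd = sas.df2sd(cheap, table='cheap_cars')
cheap_sd.describe()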

SASPy also provides high-level Python objects for the most popular and powerful SAS procedures. These are organized by SAS product, such as SAS/STAT, SAS/ETS and so on. To explore, issue a dir() command on your SAS session object. In this example, I've created a sasstat object and I used dot<TAB> to list the available SAS analyses:

SAS/STAT object in SASPy
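For example, here's a rough sketch of exploring the SAS/STAT object and running one of its analyses against SASHELP.CLASS. The reg() call mirrors the SASPy examples, though the exact keyword arguments can vary by release.

# high-level objects for SAS/STAT and SAS/ETS procedures
stat = sas.sasstat()
ets = sas.sasets()
print(dir(stat))     # list the available analyses (glm, reg, hpsplit, ...)

# fit a simple regression; results come back as an object of ODS tables
hr = sas.sasdata('CLASS', 'SASHELP')
results = stat.reg(model='weight = height', data=hr)
print(dir(results))  # parameter estimates, fit statistics, plots, ...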

The SAS Pipefitter project extends the SASPy project by providing access to advanced analytics and machine learning algorithms. In our video interview, Jared presents a cool example of a decision tree applied to the passenger survival factors on the Titanic. It's powered by PROC HPSPLIT behind the scenes, but Python users don't need to know all of that "inside baseball."
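As a sketch of the idea (the Titanic table and its column names below are placeholders; see the Pipefitter documentation for the exact API):

from pipefitter.estimator import DecisionTree

# titanic is a SASdata object already loaded in the SAS session (placeholder)
titanic = sas.sasdata('TITANIC', 'WORK')

dtree = DecisionTree(target='Survived',
                     inputs=['Sex', 'Age', 'Fare', 'Pclass'],
                     nominals=['Sex', 'Pclass', 'Survived'])
model = dtree.fit(titanic)       # PROC HPSPLIT does the work behind the scenes
print(model.score(titanic))      # assess the fitted tree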

Installing SASPy and getting started

Like most things Python, installing the SASPy package is simple. You can use the pip installation manager to fetch the latest version:

pip install saspy

However, since you need to connect to a SAS session to get to the SAS goodness, you will need some additional files to broker that connection. Most notably, you need a few Java jar files that SAS provides. You can find these in the SAS Deployment Manager folder for your SAS installation:
../deploywiz/sas.svc.connection.jar
../deploywiz/log4j.jar
../deploywiz/sas.security.sspi.jar
../deploywiz/sas.core.jar

The jar files are compatible between Windows and Unix, so if you find them in a Unix SAS install you can still copy them to your Python client on Windows. You'll need to modify the sascfg.py file (installed with the SASPy package) to point to where you've stashed these. If using local SAS on Windows, you also need to make sure that sspiauth.dll is in your Windows system PATH. The easiest method is to add SASHOME\SASFoundation\9.4\core\sasexe to your system PATH variable.
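As a sketch, the relevant entries in sascfg.py for local SAS on Windows look something like this. The folder where you stashed the jar files is a placeholder; saspyiom.jar ships with the SASPy package itself.

SAS_config_names = ['winlocal']

# full paths to the jar files, separated by ';' on Windows (placeholder folders)
cp  = "C:\\saspy_jars\\sas.svc.connection.jar"
cp += ";C:\\saspy_jars\\log4j.jar"
cp += ";C:\\saspy_jars\\sas.security.sspi.jar"
cp += ";C:\\saspy_jars\\sas.core.jar"
cp += ";C:\\Python35\\Lib\\site-packages\\saspy\\java\\saspyiom.jar"

winlocal = {'java'      : 'java',
            'encoding'  : 'windows-1252',
            'classpath' : cp
            }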

All of this is documented in the "Installation and Configuration" section of the project documentation. The connectivity options support an impressively diverse set of SAS configs: Windows, Unix, SAS Grid Computing, and even SAS on the mainframe!

Download, comment, contribute

SASPy is an open source project, and all of the Python code is available for your inspection and improvement. The developers at SAS welcome you to give it a try and enter issues when you see something that needs to be improved. And if you're a hotshot Python coder, feel free to fork the project and issue a pull request with your suggested changes!

The post Introducing SASPy: Use Python code to access SAS appeared first on The SAS Dummy.

Mar 8, 2017
 

Digitalisation is blasting the cobwebs out from enterprises and organisations of all kinds – freeing them to innovate and take advantage of the always-on economy. But it’s also helping new disruptive players to gain an unexpectedly strong foothold in many markets. One of the key advantages these new players have [...]

Is governance getting in the way of innovation? was published on SAS Voices by Peter Pugh-Jones

Jan 6, 2017
 

At SAS, we've published more repositories on GitHub as a way to share our open source projects and examples. These "repos" (that's Git lingo) are created and maintained by experts in R&D, professional services (consulting), and SAS training. Some recent examples include:

With dozens of repositories under the sassoftware account, it becomes a challenge to keep track of them all. So, I've built a process that uses SAS and the GitHub APIs to create reports for my colleagues.

Using the GitHub API

GitHub APIs are robust and well-documented. Like most APIs these days, you access them using HTTP and REST. Most of the API output is returned as JSON. With PROC HTTP and the JSON libname engine (new in SAS 9.4 Maint 4), using these APIs from SAS is a cinch.

The two API calls that we'll use for this basic report are:

  • GET https://api.github.com/orgs/sassoftware – the organization metadata, including the count of public repositories
  • GET https://api.github.com/orgs/sassoftware/repos – the details for the organization's repositories, one page at a time

Fetching the GitHub account metadata

The following SAS program calls the first API to gather some account metadata. Then, it stores a selection of those values in macro variables for later use.

/* Establish temp file for HTTP response */
filename resp temp;
 
/* Get Org metadata, including repo count */
proc http
 url="https://api.github.com/orgs/sassoftware"  
 method="GET"
 out=resp
;
run;
 
/* Read response as JSON data, extract select fields */
/* It's in the ROOT data set, found via experiment   */
libname ss json fileref=resp;
 
data meta; 
  set ss.root; 
  call symputx('repocount',public_repos);
  call symputx('acctname',name);
  call symputx('accturl',html_url);
run;
 
/* log results */
%put &=repocount;
%put &=acctname;
%put &=accturl;

Here is the output of this program (as of today):

REPOCOUNT=66
ACCTNAME=SAS Software
ACCTURL=https://github.com/sassoftware

The important piece of this output is the count of repositories. We'll need that number in order to complete the next step.

Fetching the repositories and stats

It turns out that the /repos API call returns the details for 30 repositories at a time. For accounts with more than 30 repos, we need to call the API multiple times with a &page= index value to iterate through each batch. I've wrapped this process in a short macro function that repeats the calls as many times as needed to gather all of the data. This snippet calculates the upper bound of my loop index:

/* Number of repos / 30, rounded up to next integer     */
%let pages=%sysfunc(ceil(%sysevalf(&repocount / 30)));

Given the 66 repositories on the SAS Software account right now, that results in 3 API calls.

Each API call creates verbose JSON output with dozens of fields, only a few of which we care about for this report. To simplify things, I've created a JSON map that defines just the fields that I want to capture. I came up with this map by first allowing the JSON libname engine to "autocreate" a map file with the full response. I edited that file and whittled the result down to just 12 fields. (Read my previous blog post about the JSON engine to learn more about JSON maps.)

The multiple API calls create multiple data sets, which I then concatenate into a single output data set for reporting. To clean up, I use PROC DATASETS to delete the intermediate data sets.

First, here's the output data:

[Screenshot: the SASSOFTWARE_ALLREPOS output data set]
Here's the code segment, which is rather long because I included the JSON map inline.

/* This trimmed JSON map defines just the fields we want */
/* Created by using AUTOMAP=CREATE on JSON libname       */
/* then editing the generated map file to reduce to      */
/* minimum number of fields of interest                  */
filename repomap temp;
data _null_;
 infile datalines;
 file repomap;
 input;
 put _infile_;
 datalines;
{
  "DATASETS": [
 {
   "DSNAME": "root",
   "TABLEPATH": "/root",
   "VARIABLES": [
  {
    "NAME": "id",
    "TYPE": "NUMERIC",
    "PATH": "/root/id"
  },
  {
    "NAME": "name",
    "TYPE": "CHARACTER",
    "PATH": "/root/name",
    "CURRENT_LENGTH": 50,
    "LENGTH": 50
  },
  {
    "NAME": "html_url",
    "TYPE": "CHARACTER",
    "PATH": "/root/html_url",
    "CURRENT_LENGTH": 100,
    "LENGTH": 100
  },
  {
    "NAME": "language",
    "TYPE": "CHARACTER",
    "PATH": "/root/language",
    "CURRENT_LENGTH": 20,
    "LENGTH": 20
  },
  {
    "NAME": "description",
    "TYPE": "CHARACTER",
    "PATH": "/root/description",
    "CURRENT_LENGTH": 300,
    "LENGTH": 500
  },
  {
    "NAME": "created_at",
    "TYPE": "NUMERIC",
    "INFORMAT": [ "IS8601DT", 19, 0 ],
    "FORMAT": ["DATETIME", 20],
    "PATH": "/root/created_at",
    "CURRENT_LENGTH": 20
  },
  {
    "NAME": "updated_at",
    "TYPE": "NUMERIC",
    "INFORMAT": [ "IS8601DT", 19, 0 ],
    "FORMAT": ["DATETIME", 20],
    "PATH": "/root/updated_at",
    "CURRENT_LENGTH": 20
  },
  {
    "NAME": "pushed_at",
    "TYPE": "NUMERIC",
    "INFORMAT": [ "IS8601DT", 19, 0 ],
    "FORMAT": ["DATETIME", 20],
    "PATH": "/root/pushed_at",
    "CURRENT_LENGTH": 20
  },
  {
    "NAME": "size",
    "TYPE": "NUMERIC",
    "PATH": "/root/size"
  },
  {
    "NAME": "stars",
    "TYPE": "NUMERIC",
    "PATH": "/root/stargazers_count"
  },
  {
    "NAME": "forks",
    "TYPE": "NUMERIC",
    "PATH": "/root/forks"
  },
  {
    "NAME": "open_issues",
    "TYPE": "NUMERIC",
    "PATH": "/root/open_issues"
  }
   ]
 }
  ]
}
;
run;
 
/* GETREPOS: iterate through each "page" of repositories */
/* and collect the GitHub data                           */
/* Output: <account>_REPOS, a data set with all basic data  */
/*  about an account's public repositories          */
%macro getrepos;
 %do i = 1 %to &pages;
  proc http
   url="https://api.github.com/orgs/sassoftware/repos?page=&i."  
   method="GET"
   out=resp
  ;
  run;
 
  /* Use JSON engine with defined map to capture data */
  libname repos json map=repomap fileref=resp;
  data _repos&i.;
   set repos.root;
  run;
 %end;
 
 /* Concatenate all pages of data */
 data sassoftware_allrepos;
  set _repos:;
 run;
 
 /* delete intermediate repository data */
 proc datasets nolist nodetails;
  delete _repos:;
 quit;
%mend;
 
/* Run the macro */
%getrepos;

Creating a simple report

Finally, I want to create simple report listing of all of the repositories and their top-level stats. I'm using PROC SQL without a CREATE TABLE statement, which will create a simple ODS listing report for me. I use this approach instead of PROC PRINT because I transformed a couple of the columns in the same step. For example, I created a new variable with a fully formed HTML link, which ODS HTML will render as an active link in the browser. Here's a snapshot of the output, followed by the code.

[Screenshot: repository report rendered with ODS HTML]

/* Best with ODS HTML output */
title "github.com/sassoftware (&acctname.): Repositories and stats";
title2 "ALL &repocount. repos, Data pulled with GitHub API as of &SYSDATE.";
title3 height=1 link="&accturl." "See &acctname. on GitHub";
proc sql;
 select 
  catt('<a href="',t1.html_url,'">',t1.name,"</a>") as Repository, 
 case 
  when length(t1.description)>50 then cat(substr(t1.description,1,49),'...')
  else t1.description
 end 
as Description,
 t1.language as Language,
 t1.created_at format=dtdate9. as Created, 
 t1.pushed_at format=dtdate9. as Last_Update, 
 t1.stars as Stars, 
 t1.forks as Forks, 
 t1.open_issues as Open_Issues
from sassoftware_allrepos t1
 order by t1.pushed_at desc;
quit;

Get the entire example

Not wanting to get too meta on you here, but I've placed the entire program on my own GitHub account. The program I've shared has a few modifications that make it easier to adapt for any organization or user on GitHub. As you play with this, keep in mind that the GitHub API is "rate limited" -- they allow only so many API calls from a single IP address in a certain period of time. That's to ensure that the APIs perform well for all users. You can use authenticated API calls to increase the rate-limit threshold for yourself, and I do that for my own production reporting process. But...that's a blog post for a different day.

tags: github, JSON, open source, PROC HTTP

The post Reporting on GitHub accounts with SAS appeared first on The SAS Dummy.

Dec 23, 2016
 

It's that time of year again. Holidays, parties, gifts, cooking, closing annual business, hitting targets and preparing for 2017. Looking back on the year for the communication and media industries, it has been a year of transition for the industry and for many of the customers I work with in my role […]

Closing out a year of transformation in the communication and media industries was published on SAS Voices.

Dec 22, 2016
 

TL;DR

Free training from SAS: "SAS Programming for R Users." The schedule of Live Web offerings is here. If you prefer self-study, the complete course materials are on the SAS Software GitHub space and you can practice with the free SAS University Edition software.


The details: how R programmers can learn SAS for free

As much as I would love for SAS customers to use SAS to the exclusion of everything else, that rarely happens. Every time I visit a SAS customer, I hear about the other non-SAS tools that they use alongside SAS and their integration points. The most popular of these include desktop tools such as Microsoft Excel, or enterprise databases from other vendors. But increasingly, I hear from users who dabble in open source tools such as Python and R, or who work with other teams that use those tools.

Programmers tend to favor the programming languages that they know. When you learn a new programming language, your experience is colored by inevitable comparisons with the languages you've already mastered. If you work with R coders who want to learn SAS, you should consider that they probably won't learn SAS the same way that you did.

A SAS programming course for experienced programmers

The traditional way to learn SAS begins with the DATA step, where you learn how to read files, how to write files, about the program data vector, and basically how the DATA step "thinks". Then you move on to the various procedures for descriptive stats, reporting, and maybe even some graphing. While this approach can make you productive with simple tasks quickly, to an R coder this might feel too much like "starting over." That's why R programmers (or even MATLAB or Stata users) need an approach that leverages what they already know to hit the ground running.

That's the thinking behind the new SAS Programming for R Users course. This course does not start with the basics about statistics or the importance of data prep -- the assumption is that you already know that. Instead, you'll get hands-on experience with SAS/IML -- a statistical matrix language that will certainly feel familiar to R users. You'll eventually get to the DATA step and other procedures, of course -- and these will open new worlds for you -- but you'll learn to be productive quickly using the skills you already have. (You can read more about the genesis of the course from its creator and main instructor, Jordan Bakerman.)

The course centers around classic and real statistical problems, from Bayesian logistic regression to the Monty Hall problem. If you don't know your statistics, you might feel that you're swimming in waters over your head. But if you're comfortable with the concepts, you should feel right at home. (If you're just beginning with statistics, SAS offers this different free e-learning course.)

The classic game show proof (the Monty Hall problem)

"SAS Programming for R Users" also shows you how to use SAS and R together, submitting R code from within your SAS program. That's made possible by a special connection between SAS/IML and R -- something that SAS has supported for years.

This is a free instructor-led course that's offered in Live Web format. "Live Web" means that you connect from your desk at home or work, tune into the lecture and demos, and then practice your skills on a hosted classroom environment. And this course is free -- costing you only your time (5 half-day sessions). Check out the SAS Training site to see when the next offering might meet your schedule.

Find the course materials on GitHub, right now

What if you can't find a Live Web offering that meets your schedule? In the spirit of openness, the SAS Training team has published the complete course materials on GitHub. You'll find the course notes (over 600 pages), data sets, and over 80 SAS programs to support the course exercises. You can use the free SAS University Edition to try the course exercises yourself and practice with the software. (The only part that you can't practice is the "submit to R" lessons, because the SAS University Edition doesn't support the connection to R.)

tags: open source, SAS programming, sas training

The post Learning SAS programming for R users appeared first on The SAS Dummy.

Nov 2, 2016
 

With DataFlux Data Management 2.7, the major component of SAS Data Quality and other SAS Data Management solutions, every job has a REST API automatically created once moved to the Data Management Server. This is a great feature and enables us to easily call Data Management jobs from programming languages like Python. We can then involve the Quality Knowledge Base (QKB), a pre-built set of data quality rules, and do other Data Quality work that is impossible or challenging to do when using only Python.

In order to make a RESTful call from Python, we first need to get the REST API information for our Data Management job. The best way to get this information is to go to the Data Management Server in your browser, where you'll find links for:

  • Batch Jobs
  • Real-Time Data Jobs
  • Real-Time Process Jobs.

From here you can drill through to your job's REST API.

Alternatively, you can use a “shortcut” to get the information by calling the job’s REST API metadata URL directly. The URL looks like this:

http://<DM Server>:<port>/<job type>/rest/jobFlowDefns/<job id>/metadata


The <job id> is simply the job name (with subdirectory and extension) Base64 encoded. This is a common method to avoid issues with illegal URL characters like: # % & * { } : < > ? / + or space. You can go to this website to Base64 encode your job name.
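If you'd rather not rely on a website, Python's standard base64 module produces the same encoding; here it is applied to the job used in the example below:

import base64

job_name = "Demo/ParseAddress.ddf"
job_id = base64.b64encode(job_name.encode("utf-8")).decode("ascii")
print(job_id)   # RGVtby9QYXJzZUFkZHJlc3MuZGRm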

If you have many jobs on the Data Management Server it might be quicker to use the “shortcut” instead of drilling through from the top.

Here is an example of getting the REST API information for the Data Management job "ParseAddress.ddf", which is in the Demo subdirectory of Real-Time Data Services on the DM Server:

We Base64 encode the job name “Demo/ParseAddress.ddf” using the website mentioned above…

…and call the URL for the job’s REST API metadata:

http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/RGVtby9QYXJzZUFkZHJlc3MuZGRm/metadata

From here we collect three pieces of information: the REST API URL and Content-Type, the JSON structure for the input data, and the JSON structure for the data returned by the Data Management job. The input data needs to be in this format when calling the Data Management job from Python:

{"inputs" : {"dataTable" : {"data" : [[ "sample string" ],[ "another string" ]], "metadata" : [{"maxChars" : 255, "name" : "Address", "type" : "string"}]}}}

When you have this information, you can call the Data Management job from Python with an HTTP POST, as sketched below.
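A minimal sketch using the requests package; the execution URL and header values here are assumptions based on the metadata pattern shown above, and the input rows reuse the sample strings from the JSON structure.

import requests

# REST API URL and Content-Type collected from the job's metadata (placeholders)
url = ("http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/"
       "RGVtby9QYXJzZUFkZHJlc3MuZGRm")
headers = {"Content-Type": "application/json", "Accept": "application/json"}

# input data in the JSON structure expected by the job
payload = {
    "inputs": {
        "dataTable": {
            "data": [["sample string"], ["another string"]],
            "metadata": [{"maxChars": 255, "name": "Address", "type": "string"}]
        }
    }
}

# call the job and decode the JSON response into a dictionary
data_raw = requests.post(url, json=payload, headers=headers)
data_out = data_raw.json()

# the relevant output rows are in the dataTable of the outputs section
for row in data_out["outputs"]["dataTable"]["data"]:
    print(row)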

The output data from the Data Management job will be in data_raw. We call the built-in JSON decoder from the "requests" module to move the output into a dictionary (data_out) from which we can access the data. The structure of the dictionary follows the REST metadata. We can access the relevant output data via data_out['outputs']['dataTable']['data'].


You can find more information about the DataFlux Data Management REST API here.

Calling Data Management jobs from Python is straightforward and is a convenient way to augment your Python code with the more robust set of Data Quality rules and capabilities found in the SAS Data Quality solution.

Learn more about SAS Data Quality.

tags: data management, DataFlux Data Management Studio, open source, REST API

Calling SAS Data Quality jobs from Python was published on SAS Users.

Oct 14, 2016
 

Open. The very word evokes a sense of happiness and possibility. When you’re hungry at an odd hour and everything around you seems to be closed, that lone neon sign glowing in a restaurant window is a most welcome relief. When a shop or service you’ve longed for finally builds […]

Always Open was published on SAS Voices.