Tech

June 19, 2019
 

As a data scientist, have you ever felt the need for an evolved analytics platform, one that brings together the disparate skills of open-source and commercial software and enables advanced analytic capabilities? This is now possible and easy to implement. With many deployment possibilities, SAS Viya lets you choose where your data is stored, where compute happens, and how models are deployed.

Let’s say you want to expand your model development process with SAS Viya analytical capabilities, but you don’t want to wait to get such an environment up and running. Unfortunately, you have neither the infrastructure nor the experience to install SAS Viya. Taking the traditional route, you would face:

  • Protracted hardware procurement and provisioning
  • Deployment planning and coordination with IT
  • Effort and time required for software installation/configuration

This path may be the right one for many organizations, but I think we all recognize this: the traditional approach can take days, weeks and, yes, sometimes months.

What if you could get up and running with a full SAS Viya platform in two hours? If you have some affinity for cloud-based solutions, SAS offers you the AWS SAS Viya Cloud Rapid Deployment tool. SAS released this AWS Quick Start as a rapid deployment architecture for SAS Viya on AWS. Deployable products include SAS Visual Data Mining and Machine Learning, SAS Visual Statistics and SAS Visual Analytics.

The goal of this article is to brief you on how I launched such an AWS SAS Viya Quick Start. I strongly advise you to watch the related video by my colleague Erwan Granger, since much of what is covered here appears there. The recording predates the SAS Viya 3.4 release, but the main concepts are still the same.

What you will need

The following is a list of items you need to complete this task.

  • AWS Account with appropriate creation privileges
  • A valid SAS Viya License; this means you will need a SAS Software Order Confirmation e-mail
  • Optional: deploying with your own DNS name and SSL certificate. In that case, you need to register a domain managed by Amazon Route 53. For instructions on registering the domain, see the Route 53 documentation. You can then request and register a certificate with AWS Certificate Manager.

Furthermore, it’s good to know this Quick Start provides two deployment options. You can deploy SAS Viya into a new Virtual Private Cloud (VPC) or into an existing VPC. The first option builds a new AWS environment consisting of the VPC, private and public subnets, NAT gateways, security groups, Ansible controllers, and other infrastructure components, and then deploys SAS Viya into this new VPC. The second option provisions SAS Viya in your existing AWS infrastructure. I decided to go for the first option.

What you will build

Here's an architectural overview of what we will build:

SAS Viya architecture on the AWS Cloud

You can find exactly the same architecture on the SAS Viya AWS Quick Start landing page.

Configure the build

We’ll be following the build process outlined in the Quick Start guide. On the landing page, next to the "What you’ll build" tab, click on "How to deploy". From there, launch the "Deploy into a new VPC" wizard.

Deploy into a new VPC wizard

Prerequisite prep

Make sure you sign in to your AWS account and have chosen the region where you want to deploy. On the first screen, you can leave the Amazon S3 template URL at its default. That template is the basis for the AWS CloudFormation stack we are launching. CloudFormation is an AWS service that spins up resources in the right order, and the template is the blueprint document for your stack. By keeping the default template, we will build exactly the architecture displayed above.

Pre-req prep template
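For reference, the same template can also be launched from the AWS CLI instead of the console. Here is a minimal sketch, where the stack name is hypothetical and the template URL and remaining parameters come from the wizard:

# Launch the Quick Start template outside the console (illustrative values)
aws cloudformation create-stack \
  --stack-name my-sas-viya \
  --template-url <the S3 template URL shown in the wizard> \
  --capabilities CAPABILITY_IAM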

Now click "Next" and move to the page where we specify the details and required parameters of the CloudFormation stack.

CloudFormation parameters

The first parameter is the SAS Viya Software Order file, which is the Amazon S3 location of the Software Order e-mail attachment.

SAS Viya install package location

In the Administration section, you provide parameters to configure your AWS architecture: access control, instance types, and whether you will use a SAS Viya mirror repository.

CloudFormation administration parameters

Administration parameter definitions:

  • The name of an Amazon EC2 key pair, so you can access the Ansible controller
  • The Amazon Availability zone for the public and private subnet
  • Allowable IP range for HTTP traffic; must be a valid IP CIDR range
  • Allowable IP Range for SSH traffic to the Ansible controller; must be a valid IP CIDR range
  • SAS Administrator password
  • Password for Default (sasuser) user
  • Amazon EC2 Instance type for CAS Compute VM
  • Amazon EC2 Instance type for SAS Viya Services VM
  • (Optional) Location of SAS Viya Deployment Repository data
  • (Optional) Operator Email

If you want to work with custom DNS names and SSL, you will need to provide the next three parameters as well.

DNS and SSL configuration (optional)

DNS and SSL parameters:

You may accept the defaults on the remaining parameters.

Optional parameters

After clicking "Next" another set of optional parameters are available. I mostly go with accepting the default parameters provided. The lone exception is the Rollback on failure.

Optional administration parameters

Based on what I’ve learned from Erwan's video, the safer choice is "No" on the Rollback option. This way, if the deployment process encounters issues, the log will identify the step in which the error occurred. Of course, this means you are responsible for manually deleting AWS-created resources that are no longer necessary. The easiest way to do this is by deleting the CloudFormation stacks afterward.
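If you prefer the command line for that cleanup, here is a minimal sketch; the stack name is hypothetical:

# List stacks that failed to create, then delete the leftovers
aws cloudformation list-stacks --stack-status-filter CREATE_FAILED
aws cloudformation delete-stack --stack-name my-sas-viya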

Kick off the build

To conclude the deployment wizard, click "Next" once more and acknowledge the AWS resources to be created. Clicking "Create stack" starts the deployment process.

Start the build process

You can monitor the deployment log using AWS CloudWatch. In his video, Erwan demonstrates this at around minute 23.

After a successful deployment you will find two AWS CloudFormation stacks created. The Outputs section gives you the direct links to SASStudio and SASDrive.

SAS Studio and SAS Drive stacks

That’s it. You are deployed and ready to begin using your SAS Viya environment!

Additional Reference

Alexander Koller writes about SAS on AWS and takeaways for preparing for the AWS associate solution architect exam.

Your experiences and opinion matter

New forces are shaping the analytics ecosystem. Increased competition, rising customer expectations, and new, emerging technologies such as AI and machine learning are challenging IT departments to evolve their analytics ecosystems to meet the demands of their business partners.

How is your organization doing this? How does your Analytics Cloud strategy compare to the market? And what do your peers think about migrating Analytics to the cloud? We can give you some insights and an industry benchmark on the topic.

Tell us about your experience in this 5-minute survey and we will be happy to share a detailed industry insight report with you that answers these questions.

Deploy SAS Viya on AWS - Quick Start was published on SAS Users.

June 18, 2019
 

What is Item Response Theory?

Item Response Theory (IRT) is a way to analyze responses to tests or questionnaires with the goal of improving measurement accuracy and reliability.

A common application is in testing a student’s ability or knowledge. Today, all major psychological and educational tests are built using IRT. The methodology can significantly improve measurement accuracy and reliability while providing potentially significant reductions in assessment time and effort, especially via computerized adaptive testing. For example, the SAT and GRE both use Item Response Theory for their tests. IRT takes into account both the number of questions answered correctly and the difficulty of each question.

In recent years, IRT models have also become increasingly popular in health behavior, quality of life, and clinical research. There are many different models for IRT. Three of the most popular are:

  • The Rasch model
  • The two-parameter model
  • The graded response model

Early IRT models (such as the Rasch model and two-parameter model) concentrate mainly on dichotomous responses. These models were later extended to incorporate other formats, such as ordinal responses, rating scales, partial credit scoring, and multiple category scoring.
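To give a flavor of the syntax, here is a minimal sketch of fitting a Rasch model with PROC IRT; the data set and item names are hypothetical:

/* Fit a Rasch model to ten hypothetical dichotomous items */
proc irt data=work.responses;
   var q1-q10;
   model q1-q10 / resfunc=rasch;
run;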

Item Response Theory Models Using SAS

Ron Cody and Jeffrey K. Smith’s book, Test Scoring and Analysis Using SAS, uses SAS PROC IRT to show how to develop your own multiple-choice tests, score students, produce student rosters (in print form or Excel), and explore item response theory (IRT).

Aimed at non-statisticians working in education or training, the book describes item analysis and test reliability in easy-to-understand terms and teaches SAS programming to score tests, perform item analysis, and estimate reliability.

For those with a more statistical background, Bayesian Analysis of Item Response Theory Models Using SAS describes how to estimate and check IRT models using the SAS MCMC procedure. Written especially for psychometricians, scale developers, and practitioners, the book provides numerous annotated programs that you can easily modify for your applications.

Assessment has played, and continues to play, an integral part in our work and educational settings. IRT models continue to be increasingly popular in many other fields, such as medical research, health sciences, quality-of-life research, and even marketing research. With the use of IRT models, you can not only improve scoring accuracy but also economize test administration by adaptively using only the discriminative items.

Interested in learning more? Check out our chapter previews available for free. Want to learn more about SAS Press? Explore our online bookstore and subscribe to our newsletter to get all the latest discounts, news, and more.

Further resources

SAS Blogs:
New at SAS: Psychometric Testing by Charu Shankar
SAS author’s tip: Bayesian analysis of item response theory models

SAS Communities:
SAS Communities: Custom Task Tuesday: SAS Global Forum/PROC IRT Edition!

SAS Global Forum Paper:
Item Response Theory: What It Is and How You Can Use the IRT Procedure to Apply It by Xinming An and Yiu-Fai Yung

SAS Documentation:
The IRT Procedure
SAS/STAT 14.1 User Guide: The IRT Procedure
SAS/STAT 14.2 User Guide: Help Center

Understanding Item Response Theory with SAS was published on SAS Users.

June 11, 2019
 

Sometimes the questions are complicated, and the answers are simple. - Dr. Seuss

It seems appropriate that this blog include a Dr. Seuss quote about learning when discussing a new simple, yet deceptively powerful initiative to help SAS customers with their journey to discover as much as possible about our products. If you haven’t heard of our SAS Help Center, it serves as your gateway to SAS documentation on SAS products and services. It is organized in topic-focused collections, making it easy to navigate.

For some time now, the leadership here in SAS Publications has heard what our customers have been saying about the need for supplemental materials to enrich the value of SAS documentation in the Help Center.

Since we value the customers’ opinion above all, SAS Publications has recently added a new feature to the SAS Help Center that allows customers to learn about related resources associated with the documentation they are reading. With this new feature, we have made it easier for customers to discover and explore the wealth of helpful materials SAS already provides. These SAS-created resources could be videos, blogs, white papers, sample code, examples, and more. The first iteration of this feature quietly went live in the Help Center during September of 2018. And since then, our technical writers have added SAS-authored resources to their Help Center pages.

To find this bounty of useful information, be on the lookout for a sticky “RESOURCES” button on the far-right side of the page when exploring documentation in the SAS Help Center:

Clicking the “RESOURCES” button opens a list of Related Resources, from which the user can select a menu item.

The chosen resource is then displayed in a new tab.

We here in SAS Publications think it is important to highlight work like this because we value a quality learning experience, and take seriously efforts to shape our relationship with our customers. A quality customer experience is one where it is easy to find the resources that directly answer the customer’s questions: the customer shouldn’t have to hunt high and low, and then wonder whether they’ve found “everything” that answers their questions. We will work to ensure customers have a one-stop location tied to our Documentation’s Help Center that will allow them to get the most value out of their SAS investment.

But Publications wants this tool to be used and grow – and we can’t do this alone. We need customers to discover and use this tool, and give us feedback on how we can improve this unassuming, yet important, initiative in future phases.

To that point, please feel free to contact me if you have questions or ideas about the related resources initiative. We look forward to hearing from you, working with you, and most importantly, providing a quality-focused customer experience.

New related resource initiative in the SAS Help Center was published on SAS Users.

June 11, 2019
 

This article is a follow-on to a recent post from Jeff Owens, Getting started with SAS Containers. In that post, Jeff discussed building and running a single container for a SAS Viya runtime/IDE. Today we will go through how to build and run the full SAS Viya stack - visual components and all - in Kubernetes. Step 1 is building the container images and Step 2 is running the containers. For both steps, you can go to the sas-container-recipes GitHub repo for more detail and to obtain the tools needed to accomplish this task. An in-depth guide and more information is located on the wiki page in the repository.

The project development team at SAS has done an incredible job of making this new and intuitive way to dynamically create large collections of containers easy and foolproof, despite my long-winded explanation...

Building the Container Images

Keeping with the recipes theme, we are going to need to prepare a few ingredients to make this work. Of course, you will need a valid SAS_Viya_deployment_data.zip file containing your ordered products.

Build Machine

First, you need a build machine. This can be a lightweight server, but it needs to be running Linux. The build machine in this example is 2 CPU x 8 GB RAM, running RHEL 7.6. Hint – 2 cores is the minimum, but the more you use for the build, the better (faster). I have installed Docker version 18.09.5 here, and I have a 100 GB volume attached to my Docker root (by default this is /var/lib/docker, but you can easily change the location in your /etc/docker/daemon.json file).
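For example, a minimal sketch of /etc/docker/daemon.json that points the Docker root at a larger volume; the path is illustrative, and the Docker daemon must be restarted after the change:

{
    "data-root": "/docker/storage"
}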

You can review full system requirements in the GitHub repository here. This article covers the "multiple" or "full" deployment types so focus on that column in the table.
This build machine is going to execute the build script, which builds each one of your containers, pushes them to your Docker registry, and creates the corresponding Kubernetes manifest files needed to launch your deployment.

Make sure you have cloned the sas-container-recipes repository to this machine.

Docker Registry

You will need access to a Docker registry. Your build machine must be able to push images into it, and your Kubernetes machines must be able to pull images from it. Prior to building, make sure you run the docker login myregistry.com command as the build uid. This docker login ensures a file is present at ~/.docker/config.json. This is a requirement whether or not you secure the registry with a form of authentication. Note, if your registry does not respond to pings, you will need to add the --skip-docker-url-validation parameter to the build command.

Mirror Repo (Optional)

Similar to the single-container build, it is a good idea to create a mirror repository to host your SAS RPMs. A local mirror gives you consistent performance during installation and a consistent build. However, if your containers are able to connect to ses.sas.download, then you can skip the mirror step. Beware of the network implications and the fluid nature of these repos.

LDAP

Just like any other SAS Viya environment, all users/groups/authentication/authorization are managed by connecting to an external LDAP. This could be a quick-and-dirty OpenLDAP server we stand up ourselves, or a corporate Active Directory server. Regardless, we will have to be able to make this connection if we want to use SAS Viya's visual interfaces. The easiest and best way to handle this connection is with a sitedefault.yml file. Below is a sample sitedefault.yml that would hypothetically connect to host.com's corporate LDAP. You need to construct your own sitedefault file using values for your LDAP. Consult SAS documentation (linked above) for further information.

config:
    application:
        sas.logon.initial.password: sasboot
        sas.identities.providers.ldap.connection:
            host: myldap.host.com
            port: 389
            userDN: 'CN=ldapadmin,DC=host,DC=com'
            password: ldappassword
        sas.identities.providers.ldap.group:
            baseDN: OU=Groups,DC=host,DC=com
        sas.identities.providers.ldap.user:
            baseDN: DC=host,DC=com
        sas.identities:
            administrator: youruserid

Additionally, we will need to make sure a few of our containers have "host integration" with this same LDAP (specifically, the CAS container and the programming container). The way we do that is with a standard sssd.conf file. You should be able to track down a valid sssd.conf file for your site from an administrator. Hint – it may be necessary to add home directory (/home/%u) and default shell (/bin/bash) overrides to this file, depending on your LDAP configuration.
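For reference, a minimal sketch of those two overrides inside the domain section of sssd.conf; the domain name is hypothetical:

[domain/host.com]
# Force a local home directory and shell for LDAP users
override_homedir = /home/%u
default_shell = /bin/bash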

The way one would apply these two files here is:

  1. place sssd.conf in the add-ons/auth-sssd directory and include the --addons/auth-sssd option when you run build.sh, as we do in the example later.
  2. place sitedefault.yml in the top level of sas-container-recipes. If the recipe sees a sitedefault.yml file here, it will base64-encode it and embed it as a value in the consul.yml config map. If you did not include sitedefault.yml pre-build, you can still add it with the following optional, post-build step:
    cat sitedefault.yml | base64 --wrap=0

    Next, copy and paste the output into your consul.yml configmap (by default you can find this in builds/full/manifests/kubernetes/configmaps/consul.yml). You want to add a new key/value similar to the following:

    consul_key_value_data_enc: Y29uZmlnOgogICAgYXBwbGlj......XNvZW1zaXRlLERDPWNvbQo=

Ingress

Ingress is a crucial component in making this all come together, because the only way to access your SAS Viya environment is through your Ingress. The recipe gives us an Ingress resource (one of the generated Kubernetes manifest files); however, an Ingress resource is simply an internal HTTP routing rule. We will need to make sure we have manually installed a valid Ingress controller inside our Kubernetes environment, which can be a little tricky if you are new to Kubernetes. The Ingress controller reads and applies routing rules (Ingress resources), such as the ones created by the recipes.

Traefik and NGINX are the two most popular industry options. Or you might use the native Ingress offerings from AWS, Azure, or GCP if you are running your Kubernetes cluster in the cloud. But to reiterate, you will need an Ingress controller up and running.

Once your Ingress controller is up, you need to edit the provided manifests_usermods.yml. You should set SAS_K8S_INGRESS_DOMAIN to the DNS name that resolves to a Kubernetes node that can reach your Ingress controller. While you have this file open, you can also set a unique name for the Kubernetes namespace you want these resources to deploy into (the default is "sas-viya"). This manifests_usermods.yml file is available in the util/ directory, so if you are going to use it, first make a copy of that file in the top-level sas-container-recipes directory and edit it there.

Kubernetes namespace
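A minimal sketch of those two settings in manifests_usermods.yml; the values are illustrative, and the exact variable names are documented in the comments of your copy of the file:

SAS_K8S_NAMESPACE: sas-viya-prod
SAS_K8S_INGRESS_DOMAIN: sas-viya.example.com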

Build.sh

With all this in place, we are ready to build. To summarize, the “pre-build” configuration needed here consists of the files we touched in this sas-container-recipes project:

Relevant pre-build files

So, we can go ahead and launch the build script. I prefer using environment variables for easier readability, and for easier copying and pasting when things change – new registries, mirrors, tags, etc.

# MIRROR_URL and the --mirror-url flag are optional; omit them if you are not using a mirror
SAS_VIYA_DEPLOYMENT_DATA_ZIP=/path/to/SAS_Viya_deployment_data.zip
MIRROR_URL=mymirror.com/myrepo
DOCKER_REGISTRY_URL=myregistry.com
SAS_RECIPE_TYPE=full
DOCKER_REGISTRY_NAMESPACE=viya
SAS_DOCKER_TAG=prod
 
./build.sh --type $SAS_RECIPE_TYPE \
--mirror-url $MIRROR_URL \
--docker-registry-url $DOCKER_REGISTRY_URL \
--docker-registry-namespace $DOCKER_REGISTRY_NAMESPACE \
--zip $SAS_VIYA_DEPLOYMENT_DATA_ZIP \
--tag $SAS_DOCKER_TAG \
--addons "addons/auth-sssd"

Once complete:

  1. We store container images (30-40 of them depending on the software you have ordered) locally in the build host's docker images directory.
  2. All these images also are tagged and pushed to our Docker Registry. For your organizational reference, the naming convention used is:
    $DOCKER_REGISTRY_URL/$DOCKER_REGISTRY_NAMESPACE/<image-name>:$SAS_DOCKER_TAG
  3. All our Kubernetes manifest files are available on the build machine in sas-container-recipes/builds/full/manifests/kubernetes. These fully configured manifest files are ready to use; they reference the images we have built and pushed.
  4. The build log gives us instructions for how to apply these resources to Kubernetes. These are simple commands you should be able to copy and paste to stand up our Viya environment.

Build log instructions

For the curious
The list below is what happened during the build process. Feel free to skip this section, you do not need to know how any of this works to use the recipes:

  1. You, the builder, invoke build.sh. This is a wrapper script around the greater build framework. The script creates a "builder container." Check out the Dockerfile in the top level of the recipes directory. This builder container builds from a golang base image because the build process is written in a few Go files (new as of April 2019). Several files from the sas-container-recipes project are copied into this container, including those Go files.
    • Note, we did not have to install Go on our build machine since Go is running inside a container.
    • If you are interested in seeing what the builder container looks like, you can run this command: docker run -it --rm --entrypoint /bin/bash sas-container-recipes-builder:$SAS_DOCKER_TAG.
    • A 'sas' user is created inside of this container - this user has the same uid as the user who invoked build.sh on the host.
  2. build.sh also creates a new subdirectory on the host called 'builds/<buildtype>-<timestamp>'. This contains the logs, manifests, and various templates used during this specific build.
  3. build.sh then runs that builder container, and the real work gets underway. The entry point for the builder is: go run main.go container.go order.go. All the arguments you specified when invoking build.sh pass right into this Go program. Also, the newly created "builds" directory is mounted into the container at /sas-container-recipes/builds.
    • The host's /var/run/docker.sock file is mounted into this container; this allows the builder container to run Docker (Docker in Docker).
  4. This Go program then:
    • Generates a playbook from your deployment data file (SOE zip) using the sas-orchestration tool (https://support.sas.com/en/documentation/install-center/viya/deployment-tools/34/command-line-interface.html).
    • Creates Kubernetes manifests for the images to be built.
    • Gathers the sets of Ansible roles to install in each container, based on the entitlement of your software order.
    • Generates a Dockerfile for each container, where each applicable Ansible role is installed in a new Docker layer.
    • Creates a "build context" for each container with the generated Dockerfile and the Ansible role files.
    • Starts a docker build process for each container. The Dockerfile installs Ansible and executes the playbook "locally" (inside of each container).
    • Pushes these images into your registry as each build finishes.
    • Note, this happens inside of containers, and the builds execute concurrently. Recall this build machine has 2 cores, so only 2 containers build at a time, and the build took several hours. If we used a 16-core machine, the whole build would go faster. In another terminal, look at docker stats during the build. Another significant "performance" factor is the network bandwidth between your build machine and your registry.

Running the Containers

We are going to run these containers inside of a Kubernetes environment. Here are the finishing touches needed to give us a completely containerized SAS Viya environment running in Kubernetes. Note, that by default this deploys into a new namespace inside of your Kubernetes cluster and isolates the resources from anything else running.

Kubernetes Environment

Since we built the full stack, we'll need to make sure we have sufficient resources to run all of these containers at the same time. We'll need a minimum of 8 cores and 80GB RAM available. Remember CAS is a multithreaded, in-memory runtime, so the more cores and RAM you provide, the more horsepower you'll have for doing actual analytical work with SAS and CAS.

Kubectl

Hopefully, if you've gotten this far, you are familiar with kubectl, the client tool/interface used with a Kubernetes cluster. Consider it a CLI wrapper around the Kubernetes API. For thoroughness: you will launch your SAS Viya deployment from whatever machine you run kubectl on. If this happens to be the same machine you built on, you can stay inside the sas-container-recipes directory you started in and copy and paste those kubectl apply -f ... commands. Or you can copy your manifest files somewhere else and modify those commands accordingly. In either case, once those commands run, your environment is up, and you should be able to access SAS Environment Manager and the other SAS web apps. If you added your userid as an administrator in the sitedefault.yml file, then you can log in as yourself with admin access.
Apply the manifests:

Apply the manifests
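For reference, a hedged sketch of that apply sequence; the exact commands come from your build log, and these subdirectory names are illustrative:

# Apply the generated manifests in order (copy the exact lines from your build log)
kubectl apply -f builds/full/manifests/kubernetes/namespace/
kubectl apply -f builds/full/manifests/kubernetes/configmaps/
kubectl apply -f builds/full/manifests/kubernetes/secrets/
kubectl apply -f builds/full/manifests/kubernetes/deployments/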

And after a few minutes your pods should be up (the first time takes the longest, since images must be pulled). Note that running pods don’t mean all your SAS Viya services are running; it may take up to 30 minutes for all services to come up and stabilize.

Pods list

With your Ingress and DNS rules set up correctly, you should be able to reach your environment:

SAS login screen

Based on properly configured sitedefault.yml and sssd.conf files, you should be able to log in as an LDAP user.

Miscellaneous Notes

Scaling

Once your SAS Viya environment is up and running in Kubernetes, the following kubectl command adds CAS workers to scale out the capacity of our CAS server.

kubectl scale deployment sas-viya-cas-worker --replicas=5 -n sas-viya-prod

Note, there isn’t any value in adding any more workers than you have physical nodes in your cluster.

Performance

SAS is a powerful programming language designed to handle heavy workloads on large data. General hardware performance has historically been a chief concern for customers implementing SAS. Containers bring a whole new wrinkle to the concept of performance, given the general notion of hardware abstraction. One performance-related question is: how can we guarantee the IO provided by the underlying filesystem (SASWORK, CAS_DISK_CACHE)? As with Kubernetes and storage/state in general, no easy answer exists. It falls to the Kubernetes operator to make high-performance filesystems (i.e., local SSD) available on all nodes a SAS programming or CAS container might land on, and to manually edit the corresponding manifest files to leverage those host disks. Alternatively, we can try to limit the burden on these scratch disk spaces. For CAS, this means ensuring we have more RAM available than data in use.
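As an illustration, here is a hedged sketch of what such a manual manifest edit might look like for CAS_DISK_CACHE; the volume name and host path are hypothetical, and the exact structure of your CAS deployment manifest may differ:

# Excerpt of a CAS deployment manifest mounting a local SSD for scratch space
spec:
  containers:
  - name: sas-viya-cas
    volumeMounts:
    - name: cas-disk-cache
      mountPath: /cas/cache
  volumes:
  - name: cas-disk-cache
    hostPath:
      path: /mnt/local-ssd/cas-cache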

Amnesia

See the summary section below for a caveat about this deployment methodology – it is not quite a complete implementation for “production” types of environments, at least not without understanding customer configuration requirements. You should have a discussion with your sales team about some of these details. But please be aware that building/deploying as we did here leaves us with an “Amnesiac Viya” (this useful term coined by an astute SAS employee). That is, there is no state here. If and when you take your environment down, or scale pods to 0 across services, you get a "brand new" or "fresh" environment once it is brought back up. The good news is this also means that if we run into any issues, we can easily delete the whole namespace and restart. If you want to persist any user data, config, reports, or code, you will have to manually attach storage to a few locations.

Full vs Multiple

Note, here we used SAS_RECIPE_TYPE=full. This built the entire Viya stack – visual interfaces, microservices and all. Alternatively, if we set the recipe type to "multiple", we get three container images – programming, httpproxy, and cas. This would be all we need if we wanted to write SAS code, whether in SAS Studio or an external IDE like Jupyter. And we could still scale out our CAS cluster the same way we did in the full environment.

Summary

Like everyone else's, the SAS container strategy is quickly evolving. SAS Viya, as a scalable, highly available, services-oriented architecture, is a perfect fit for running in containers inside the Kubernetes orchestration framework. Kubernetes brings tremendous operational benefits to the table for this type of software: smoother deployments, higher uptime, instant scale, and much more efficient hardware usage, to name a few.

As you will see in the build log when running the recipe, this is an "EXPERIMENTAL" deployment process. The recipes are an excellent way to get your hands on a Kubernetes version of SAS Viya early. Future releases of SAS Viya will be fully "containerized" and "kubernetes-ized" so customers won’t be building their own containers in this manner. Rather, SAS will provide a Helm chart to customers that will pull container images straight from SAS and apply them into their Kubernetes environments appropriately. Further, many aspects of SAS Viya’s infrastructure will be redesigned to be more "Kubernetes native," but the general feel of this model is what sysadmins/operators should see from SAS going forward.

Deploying the Full SAS Viya Stack in Kubernetes was published on SAS Users.

June 6, 2019
 

Want to learn SAS programming but worried about taking the plunge? Over at SAS Press, we are excited about an upcoming publication that introduces newbies to SAS in a peer-review instruction format we have found popular in the classroom. Professors Jim Blum and Jonathan Duggins have written Fundamentals of Programming in SAS using a spiral curriculum that builds upon topics introduced earlier and, at its core, uses large-scale projects presented as case studies. To get ready for release, we interviewed our new authors on how their title will enhance our SAS Books collection and which of the existing SAS titles has had an impact on their lives!

What does your book bring to the SAS collection? Why is it needed?

Blum & Duggins: The book is probably unique in the sense that it is designed to serve as a classroom textbook, though it can also be used as a self-study guide. That also points to why we feel it is needed; there is no book designed for what we (and others) do in the classroom. As SAS programming is a broad topic, the goal of this text is to give a complete introduction of effective programming in Base SAS – covering topics such as working with data in various forms and data manipulation, creating a variety of tabular and visual summaries of data, and data validation and good programming practices.

The book pursues these learning objectives using large-scale projects presented as case studies. The intent of coupling the case-study approach with the introductory programming topics is to create a path for a SAS programming neophyte to evolve into an adept programmer by showing them how programmers use SAS, in practice, in a variety of contexts. The reader will gain the ability to author original code, debug pre-existing code, and evaluate the relative efficiencies of various approaches to the same problem using methods and examples supported by pedagogical theory. This makes the text an excellent companion to any SAS programming course.

What is your intended audience for the book?

Blum & Duggins: This text is intended for use in both undergraduate and graduate courses, without the need for previous exposure to SAS. However, we expect the book to be useful for anyone with an aptitude for programming and a desire to work with data, as a self-study guide to work through on their own. This includes individuals looking to learn SAS from scratch or experienced SAS programmers looking to brush up on some of the fundamental concepts. Very minimal statistical knowledge, such as elementary summary statistics (e.g. means and medians), would be needed. Additional knowledge (e.g. hypothesis testing or confidence intervals) could be beneficial but is not expected.

What SAS book(s) changed your life? How? And why?

Blum: I don’t know if this qualifies, but the SAS Programming I and SAS Programming II course notes fit this best. With those, and the courses, I actually became a SAS programmer instead of someone who just dabbled (and dabbled ineffectively). From there, many doors were opened for me professionally and, more importantly, I was able to start passing that knowledge along to students and open some doors for them. That experience also served as the basis for building future knowledge and passing it along, as well.

Duggins: I think the two SAS books that most changed my outlook on programming (which I guess has become most of my life, for better or worse) would either be The Essential PROC SQL Handbook for SAS Users by Katherine Prairie or Jane Eslinger's The SAS Programmer's PROC REPORT Handbook, because I read them at different times in my SAS programming career. Katherine's SQL book changed my outlook on programming because, until then, I had never done enough SQL to consistently consider it as a viable alternative to the DATA step. I had taken a course that taught a fair amount of SQL, but since I had much more experience with the DATA step and since that is what was emphasized in my doctoral program, I didn't use SQL all that often. However, after working through her book, I definitely added SQL to my programming arsenal. I think learning it, and then having to constantly evaluate whether the DATA step or SQL was better suited to my task, made me a better all-around programmer.

As for Jane's book - I read it much later after having used some PROC REPORT during my time as a biostatistician, but I really wasn't aware of how much could be done with it. I've also had the good fortune to meet Jane, and I think her personality comes through clearly - which makes that book even more enjoyable now than it was during my first read!

Read more

We at SAS Press are really excited to add this new release to our collection and will continue releasing teasers until its publication. For almost 30 years, SAS Press has published books by SAS users for SAS users. Here is a free excerpt, located on Duggins' author page, to whet your appetite. (Please know that this excerpt is an unedited draft and not the final content.) Look out for news on this new publication – you will not want to miss it!

Want to find out more about SAS Press? For more about our books, subscribe to our newsletter. You’ll get all the latest news and exclusive newsletter discounts. Also, check out all our new SAS books at our online bookstore.

Interview with new SAS Press authors: Jim Blum and Jonathan Duggins was published on SAS Users.

June 4, 2019
 


Two sayings I’ve heard countless times throughout my life are “Work smarter, not harder,” and “Use the best tool for the job.” If you need to drive a nail, you pick up a hammer, not a wrench or a screwdriver. In the programming world, this could mean using an existing function library instead of writing your own or using an entirely different language because it’s more applicable to your problem. While that sounds good in practice, in the workplace you don’t always have that freedom.

So, what do you do when you’re given a hammer and told to fasten a screw? Or, like in the title of this article’s case, what do you do when you have Python functions you want to use in SAS?

Recently I was tasked with documenting an exciting new feature for SAS — the ability to call Python functions from within SAS. In this article I will highlight everything I’ve learned along the way to bring you up to speed on this powerful new tool.

PROC FCMP Python Objects

Starting with the May 2019 release of SAS 9.4M6, PROC FCMP added support for submitting and executing functions written in Python from within a SAS session, using the new Python object. If you’re unfamiliar with PROC FCMP, I’d suggest reading the documentation. In short, FCMP, the SAS Function Compiler, enables users to write their own functions and subroutines that can then be called from just about anywhere a SAS function can be used. Users are not restricted to using Python only inside a PROC FCMP statement. You can create an FCMP function that calls Python code, and then call that FCMP function from the DATA step. You can also use one of the products or solutions that support Python objects, including SAS High Performance Risk and SAS Model Implementation Platform.

The Why and How

So, what made SAS want to include this feature in our product? The scenario in mind we imagined when creating this feature was a customer who already had resources invested in Python modeling libraries but now wanted to integrate those libraries into their SAS environment. As much fun as it sounds to convert and validate thousands of lines of Python code into SAS code, wouldn’t it be nice if you could simply call Python functions from SAS? Whether you’re in the scenario above with massive amounts of Python code, or you’re simply more comfortable coding in Python, PROC FCMP is here to help you. Your Python code is submitted to a Python interpreter of your choice. Results are packaged into a Python tuple and brought back inside SAS for you to continue programming.

Programming in Two Languages at Once

So how do you program in SAS and Python at the same time? Depending on your installation of SAS, you may be ready to start, or there could be some additional environment setup you need to complete first. In either case, I recommend pulling up the Using PROC FCMP Python Objects documentation before we continue. The documentation outlines the addition of an output string that must be made to your Python code before it can be submitted from SAS. When you call a Python function from SAS, the return value(s) is stored in a SAS dictionary. If you’re unfamiliar with SAS dictionaries, you can read more about them here Dictionaries: Referencing a New PROC FCMP Data Type.

Getting Started

There are multiple methods to load your Python code into the Python object. In the code example below, I’ll use the SUBMIT INTO statement to create an embedded Python block and show you the basic framework needed to execute Python functions in SAS.

/* A basic example of using PROC FCMP to execute a Python function */
proc fcmp;
 
/* Declare Python object */
declare object py(python);
 
/* Create an embedded Python block to write your Python function */
submit into py;
def MyPythonFunction(arg1, arg2):
	"Output: ResultKey"
	Python_Out = arg1 * arg2
	return Python_Out
endsubmit;
 
/* Publish the code to the Python interpreter */
rc = py.publish();
 
/* Call the Python function from SAS */
rc = py.call("MyPythonFunction", 5, 10);
 
/* Store the result in a SAS variable and examine the value */
SAS_Out = py.results["ResultKey"];
put SAS_Out=;
run;

You can gather from this example that there are essentially five parts to using PROC FCMP Python objects in SAS:

  1. Declaring your Python object.
  2. Loading your Python code.
  3. Publishing your Python code to the interpreter.
  4. Executing your Python Code.
  5. Retrieving your results in SAS.

From the SAS side, those are all the pieces you need to get started importing your Python code. Now what about more complicated functions? What if you have working models made up of thousands of lines and a variety of Python packages? You still use the same program structure as before. This time I’ll be using the INFILE method to import my Python function library by specifying the file path to the library. You can follow along by copying my Python code into a .py file. The file, blackscholes.py, contains this code:

def internal_black_scholes_call(stockPrice, strikePrice, timeRemaining, volatility, rate):
    # Standard Black-Scholes price for a European call option
    import numpy
    from scipy import stats
    import math
    if ((strikePrice != 0) and (volatility != 0)):
        d1 = (math.log(stockPrice/strikePrice) + (rate + (volatility**2)\
                       /  2) * timeRemaining) / (volatility*math.sqrt(timeRemaining))
        d2 = d1 - (volatility * math.sqrt(timeRemaining))
        callPrice = (stockPrice * stats.norm.cdf(d1)) - \
        (strikePrice * math.exp( (-rate) * timeRemaining) * stats.norm.cdf(d2))
    else:
        # Degenerate inputs yield a zero price
        callPrice = 0
    return (callPrice)
 
# Wrapper called from SAS: the docstring names the output key for py.results,
# and the return value must be packaged as a tuple for PROC FCMP
def black_scholes_call(stockPrice, strikePrice, timeRemaining, volatility, rate):
    "Output: optprice"
    import numpy
    from scipy import stats
    import math
    optPrice = internal_black_scholes_call(stockPrice, strikePrice,\
                                           timeRemaining, volatility, rate)
    callPrice = float(optPrice)
    return (callPrice,)

My example isn’t quite 1000 lines, but you can see the potential of having complex functions all callable inside SAS. In the next figure, I’ll call these Python functions from SAS.

/* Using PROC FCMP to execute Python functions from a file */
proc fcmp;
 
/* Declare Python object */
declare object py(python);
 
/* Use the INFILE method to import Python code from a file */
rc = py.infile("C:\Users\PythonFiles\blackscholes.py");
 
/* Publish the code to the Python interpreter */
rc = py.publish();
 
/* Call the Python function from SAS */
rc = py.call("black_scholes_call", 132.58, 137, 0.041095, .2882, .0222);
 
/* Store the result in a SAS variable and examine the value */
SAS_Out = py.results["optprice"];
put SAS_Out=;
run;

Calling Python Functions from the DATA step

You can take this a step further and make it usable in the DATA step, outside of a PROC FCMP statement. We can use the program from the previous example as a starting point. From there, we just need to wrap the inner Python function call in an outer FCMP function. This function-within-a-function design may give you flashbacks of Inception, but I promise this exercise won’t leave you confused and questioning reality. Even if you’ve never used FCMP before, creating the outer function is straightforward.

/* Creating a PROC FCMP function that calls a Python function  */
proc fcmp outlib=work.myfuncs.pyfuncs;
 
/* Create the outer FCMP function */
/* These arguments are passed to the inner Python function */
function FCMP_blackscholescall(stockprice, strikeprice, timeremaining, volatility, rate);
 
/* Create the inner Python function call */
/* Declare Python object */
declare object py(python);
 
/* Use the INFILE method to import Python code from a file */
rc = py.infile("C:\Users\PythonFiles\blackscholes.py");
 
/* Publish the code to the Python interpreter */
rc = py.publish();
 
/* Call the Python function from SAS */
/* Since this is the inner function, instead of passing literal values,   */
/* you pass the outer FCMP function arguments to the Python function      */
rc = py.call("black_scholes_call", stockprice, strikeprice, timeremaining, volatility, rate);
 
/* Store the inner function Python output in a SAS variable                              */
FCMP_out = py.results["optprice"];
 
/* Return the Python output as the output for outer FCMP function                        */
return(FCMP_out);
 
/* End the FCMP function                                                                 */
endsub;
run;
 
/* Specify the function library you want to call from                                    */
options cmplib=work.myfuncs;
 
/*Use the DATA step to call your FCMP function and examine the result                    */
data _null_;
   result = FCMP_blackscholescall(132.58, 137, 0.041095, .2882, .0222);
   put result=;
run;

With your Python function neatly tucked away inside your FCMP function, you can call it from the DATA step. You have also effectively reduced the statements needed for future calls to the Python function from five to one by having an FCMP function ready to call.

Looking Forward

So now that you can use Python functions in SAS just like SAS functions, how are you going to explore using these two languages together? The PROC FCMP Python object expands the capabilities of SAS and, as a result, what you can accomplish as a SAS user. Depending on your experience level, completing a task in Python might be easier for you than completing the same task in SAS. Or you could be in the scenario I mentioned before, where you have a major investment in Python and converting to SAS is non-trivial. In either case, PROC FCMP now has the capability to help you bridge that gap.

SAS or Python? Why not use both? Using Python functions inside SAS programs was published on SAS Users.

May 30, 2019
 

Human behavior is fascinating. We come in so many shapes, sizes and backgrounds. Doesn’t it make sense that any tests we write also accommodate our wonderful differences?

This picture is of Miko, a northern rescue and a recent addition to my family. He’s learning to live in an urban household and doing great with some training. He’s going through so many new tests as he adapts to life in the city, which is quite different from being free in the northern territories. Watch for a later post on his training successes.

I’m so happy to share how SAS has been helping candidates by offering a variety of certification credentials geared toward testing for differences and preferences in thought. If you are wondering – yes, I’ve been addicted to psychometrics for a while now; anything related to human behavior interests me. I thought I would begin by sharing some different types of testing roles that I have held in the past.

1. Psychometric testing

Before I joined SAS, I worked at CSI. To answer that unspoken thought dear reader, CSI has been providing financial training and accreditation since 1964 – way before CSI the TV show became popular.

My role as Test Manager was super exciting for someone with a curiosity for analytics and helping people succeed. In a team of four, we scored over 200 exams to provide credentials. Psychometrics was the most exciting part of my job: analyzing the performance of test takers to constantly innovate our tests. Psychometric tests are used to identify a candidate's skills, knowledge and personality.

2. Multiple-choice testing

While setting multiple-choice exam questions, I learned that it was ideal for the four answer choices to be similar in length and complexity (e.g. if candidates typically chose option A for a question whose right response was B, we would dig deeper to compare the lengths and the language of the options, and then change an option if that was what the review committee agreed upon).

3. Adaptive testing

Prior to CSI, I worked at the test center of the DeVry Institute of Technology. In adaptive testing, the test’s difficulty adapts to candidate performance. A correct response leads to a more complex question; an incorrect response leads to an easier next question. Eventually, we could help candidates decide which engineering program would be the right fit for their skills.

This is where I met the student who asked, “can my boyfriend write my exam?”

4. Performance testing

With SAS at the forefront of analytics, it should come as no surprise that certification exams have evolved to the next level. As a certification candidate you can now try out performance-based testing.

A performance test requires a candidate to actually perform a task, rather than simply answering questions. An example is writing SAS code. Instead of answering a knowledge-level multiple choice exam about SAS code, the candidate is asked to actually write code to arrive at answers.

Certification at SAS

SAS Certified Specialist: Base Programming Using SAS 9.4 is great for those who can demonstrate ease in putting into practice the knowledge learned in the Foundation Programming classes 1 and 2. During this performance-based exam, candidates will access a SAS environment. Coding challenges will be presented, and you will need to write and execute SAS code to determine the correct answers to a series of questions.

The SAS® Certified Base Programmer for SAS®9 credential remains, but the exam will be retired in June 2019.

While writing this post, I came across the passage below on Wikipedia; it shows how the study of adaptive behavior goes back to Darwin’s time. It’s a good read for anyone intrigued by the science and art of testing.

“Charles Darwin was the inspiration behind Sir Francis Galton who led to the creation of psychometrics. In 1859, Darwin published his book The Origin of Species, which pertained to individual differences in animals. This book discussed how individual members in a species differ and how they possess characteristics that are more adaptive and successful or less adaptive and less successful. Those who are adaptive and successful are the ones that survive and give way to the next generation, who would be just as or more adaptive and successful. This idea, studied previously in animals, led to Galton's interest and study of human beings and how they differ one from another, and more importantly, how to measure those differences.”

Are you fascinated by the science and art of human behavior as it relates to testing? Are you as excited as I am about the possibilities of performance-based testing? I would love to hear your comments below.

New at SAS: Psychometric testing was published on SAS Users.

May 29, 2019
 

Interestingly enough, paperclips have their own day of honor. On May 29th, we celebrate #NationalPaperclipDay! That well-known piece of curved wire deserves attention for keeping our papers together and helping us stay organized. Do you remember who else deserved the same attention? Clippit – the infamous Microsoft Office assistant, popularly known as ‘Clippy’.

We saw the last of Clippy in 2004 before it was removed completely from Office 2007 after constant negative criticism. By then, most users considered it useless and decided to turn it off completely, despite the fact that it was supposed to help them perform certain tasks faster. Clippy was a conversational agent, like a chatbot, launched a decade before Apple’s Siri. The cartoonish paperclip-with-eyes resting on a yellow loose-leaf paper bed would pop up to offer assistance every time you opened a Microsoft Office program. Today, people seem to love AI. But a decade ago, why did everyone hate Clippy?

Why was Clippy such a failure?

One of the problems with Clippy was the terrible user experience. Clippy stopped to ask users if they needed help, but in doing so would suspend their work altogether. This was great for first-timers getting to know Clippy, goofing around, and training themselves to use the agent. Later on, however, the mandatory conversations became frustrating. Second, Clippy was designed as a male agent. Clippy was born in a meeting room full of male employees working at Microsoft. The facial features resemble those of a male cartoon, or at least that is what the women in the testing focus group observed. In a farewell note for Clippy, the company addressed Clippy as ‘he’, affirming his gender. Some women even said that Clippy’s gaze created discomfort while they tried to do their tasks.

Rather than using natural language processing (NLP), Clippy relied on Bayesian algorithms that estimated the probability of a user wanting help based on specific actions. Clippy’s answers were rule-based and templated from Microsoft’s knowledge base. Clippy seemed like the agent you would use when you were stuck on a problem, but the fact that he kept appearing on the screen in random situations soured users on his interruptions. In total, Clippy violated design principles, social norms and process workflows.

What lessons did Clippy teach us?

We can learn from Clippy. The renewed interest in conversational agents in the tech industry often overlooks the aspects that affect efficient communication, resulting in failures. It is important to learn lessons from the past and design interfaces and algorithms that tackle the needs of humans rather than hyping the capabilities of artificial intelligence.

At SAS, we are working to deliver a natural language interaction (NLI) service that converts keyed or spoken natural language text into application-specific, executable code, and using apps like Q – a genderless AI voice for virtual assistants. We’re developing different ways to incorporate chatbots into business dashboards or analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new and less technical users.

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer. “Chatbots create a humanlike interaction that makes results accessible to all.”

The use of chatbots is growing exponentially, and all kinds of organizations are starting to see the exciting possibilities of combining chatbots with AI analytics. Clippy was a trailblazer of sorts but has come to represent what to avoid when designing AI. At SAS, we are focused on augmenting the human experience and on keeping the customer, not the technology, at the center.

Interested in seeing what SAS is doing with Natural Language Processing? Check out SAS Visual Text Analytics and try it for free. We also have a brand new SAS Book hot off the press focusing on information extraction models for unstructured text / language data: SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models. We even have a free chapter for you to take a peek!

What can we learn from Clippy about AI? was published on SAS Users.

May 23, 2019
 

Often, the SAS 9.4 administration environment architecture can seem confusing to new administrators. You may be faced with questions like: What is a tier? Why are there so many servers? What is the difference between distributed and non-distributed installations?

Understanding SAS 9.4 architecture is key to tackling the tasks and responsibilities that come with SAS administration and will help you know where to look to make changes or troubleshoot problems. One of the ways I have come to think about SAS 9.4 architecture is to think of it like building a house.

So, what is the first thing you need to build a house? Besides money and a Home Depot rewards credit card, land is the first thing you need to put the house on. For SAS administration the land is your infrastructure and hardware, and the house you want to build on that land is your SAS software. You, the admin, are the architect. Sometimes building a house can be simple, so only one architect is needed. Other times, for more complex buildings, an entire team of architects is needed to keep things running smoothly.

Once the architect decides how the house should look and function, and the plans are signed off, the foundation is laid. In our analogy, this foundation is the SAS metadata server – the rest of the installation sits on top of it.

Next come the walls and ceilings for either a single-story ranch house (a non-distributed SAS environment) or a multi-story house (a distributed SAS environment). Once the walls are painted, the plumbing installed, and the carpets laid, you have a house made up of different rooms. Each room has a task: a kitchen to make food, a child's bedroom to sleep in, and a living room to relax in with family. Each floor and each room serves the same purpose as a SAS server: each server is dedicated to a specific task and has a specific purpose.

Finally, all of the items in each room, such as the bed, toys, and kitchen utensils, can be equated to data sources: a SAS data set, data pulled in from Hadoop, or an Excel spreadsheet. Knowing what is in each room helps you find objects by knowing where they belong.

Once you move into a house, though, the work doesn’t stop there, and the same is true for a SAS installation. Just like the upkeep on a house (painting the exterior, fixing appliances when they break, etc.), SAS administration requires maintenance to keep everything running smoothly.

How this relates to SAS

To pull this analogy back to SAS, let us start with the different install flavors (single house versus townhouse, single story versus multiple stories). SAS can be installed either as a SAS Foundation install or as a metadata-managed install. A SAS Foundation install is the most basic (think Base SAS). A metadata-managed install is the SAS 9 Intelligence Platform, with many more features and much more functionality than Base SAS. With SAS Foundation, your users work on their personal machines or use Remote Desktop or Citrix; a SAS Foundation install does not involve a centrally managed metadata system. In a metadata-managed install, your users work on the dedicated SAS server. Both deployments can be installed on physical or virtual machines, and all SAS solution administration is based on SAS 9.4 platform administration.
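
Maintenance on the metadata-managed "house" often starts with checking which servers are running. Here is a minimal sketch for UNIX or Linux, assuming the default Lev1 configuration directory (adjust the path to your own installation):

cd /opt/sas/config/Lev1    # assumption: your SAS configuration directory may differ
./sas.servers status       # reports whether each SAS server in this level is up or down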

We hope you find this overview of SAS platform administration helpful. For more information check out this list of links to additional admin resources from my new book, SAS® Administration from the Ground Up: Running the SAS®9 Platform in a Metadata Server Environment.

SAS 9.4 architecture – building an installation from the ground up was published on SAS Users.

5月 222019
 

This blog shows how the automatically generated concepts and categories in SAS Visual Text Analytics (VTA) can be refined using LITI and Boolean rules. Because of these capabilities, highly customized models can be developed in VTA. The rules used in this blog are basic; developing linguistic rules and accurately categorizing documents requires subject matter expertise and an understanding of the grammatical structure of the language(s) used.

I will use a data set that contains information on 1527 randomly selected movies: their titles, reviews, MPAA Ratings, Main Genre classifications and Viewer Ratings. Two customized categories will be developed: one for children's movies and the other for sports movies. Because we are familiar with movie classifications and MPAA ratings, it will be relatively easy to understand the rules used in this blog. The blog's overall objective is to show how to formulate basic rules, so that their use can be extended to other fields.

SAS Visual Text Analytics (VTA) is the SAS offering designed to effectively extract insights from unstructured data at large scale. Offered on the SAS Viya architecture, VTA combines the power of Natural Language Processing (NLP), Machine Learning (ML) and Linguistic Rules. Currently, VTA supports 30 languages, and it has an open architecture supporting third-party programming interfaces.

As in all analytical projects, the discovery process in Text Analytics projects requires several iterations in which the insights found in one iteration feed the next. With respect to the linguistic rules, one must determine whether the new rules are an improvement over those used in previous iterations by finding how many true positives and false positives the new rules match. This process should be repeated until one obtains the precision required.
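
For reference, precision here is the share of matched documents that are true positives (TP) rather than false positives (FP). A hypothetical check with invented counts: if a rule matches 90 reviews and 81 of them truly belong to the category, then

precision = TP / (TP + FP) = 81 / (81 + 9) = 0.90

and the iterations stop once this figure reaches your target.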

Initial Text Analysis Using Visual Analytics

Because Visual Analytics (VA) and VTA are highly integrated, the initial Text Analysis can be done in VA.

Every Text Analytics data set must have a unique identifier associated with each document. In my blog, Discover Main Topics on #MLKDayofService Tweets Using SAS Visual Text Analytics, I showed how to set a "Unique Row Identifier" and how to work with the nodes in the Pipeline.

In Visual Analytics (VA), one can do the initial analysis of text data and view the Word Cloud and a list of topics. In the Options menu, I set Maximum of Topics to Generate to 7.

The photo above shows that there are 364 movies with the term "kid" in the Topic "+show,+kid, +rate,+movie". We could build a category that groups movies appropriate for kids.

There are 203 documents in the Topic related to science fiction. If I wanted a category for sports movies, however, I would have to build it myself, because sports terms appear in far fewer documents.

Create a Visual Text Analytics Project

In VTA, a pipeline is a process flow diagram whose nodes represent tasks in the text analysis process. I described in detail how to work with the nodes in my MLKDayofService blog mentioned above. Briefly, from the SAS Home menu, select the action Build Models, which takes you to SAS Model Studio, where you select and create a New Project.

The photo above shows the data role assignments done in the Data tab.

Notice that there is a Unique Row Identifier for each document, the Text Variable to analyze is Review, and two variables are used as Category: MPAARating and mainGenre. Later on, VTA will use these two variables to automatically create categories and their Boolean rules. Title doesn't have a role, but I want it displayed to facilitate the analysis.

Movies are already classified according to their main category (mainGenre). I want to see the Boolean rules that VTA automatically generates for each category, and whether I can create new concepts and categories that improve on the initial categorization. For example, I would like:

  1. to find children's movies that don't contain violence,
  2. to find movies that are related to sports,
  3. to read the reviews of my favorite old movies, and
  4. to find movies whose reviews mention some of my favorite movie directors.

Method

I ran two pipelines. The first one used the default pipeline settings, with the option Include predefined concepts enabled in the Concepts node. The objective was to see the rules associated with the genres "Sports", "Animated" and "Family", the movies matched by these rules, as well as the ones that shouldn't have been matched. In the second pipeline, I developed LITI and Boolean rules with the objective of improving the default categorizations automatically produced in the first pipeline.

In the next sections I will describe how the new categories were built. In real business situations, we will sometimes have pre-defined categories available to us; other times we will arrive at categories that satisfy the business objectives only after analyzing many documents.

Customized Concepts built in the Concept Node

In the second pipeline, I developed three customized concepts. I will use one of them, "MySports", to build a new category later on.

Basic Boolean operators are used to define new concepts and categories. The AND and NOT operators are applied to the whole document. There are other operators that restrict matches to the same sentence (SENT), the same paragraph (PARA), or a given number of terms (DIST); see the short sketch after the comment lines below.

# Any line that starts with "#" is a comment
# Use CLASSIFIER to match a literal sequence
# Use CONCEPT_RULE to use Boolean and proximity operators. The term extracted should use _c{ }
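
To make the operator scopes concrete, here is a minimal sketch; the sports terms below are invented for illustration and are not taken from the movie data:

# SENT: both terms must occur in the same sentence
CONCEPT_RULE:(SENT,"_c{coach}","championship")
# DIST_5: the two terms must occur within 5 terms of each other
CONCEPT_RULE:(DIST_5,"_c{score}","goal")
# AND/NOT: conditions apply anywhere in the whole document
CONCEPT_RULE:(AND,"_c{stadium@}",(NOT,"concert"))
# The @ suffix matches morphological variants, so stadium@ also matches "stadiums"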

MySports Concept

I wrote the following rule, which matches 98 documents, most of them related to sports movies, with few false positives. This rule matches a document if any of the terms sport (or a variant such as sports), baseball, tennis, football, basketball, or racetrack appears anywhere in the document (movie review), but the terms gambling, buddy and sporting do not appear anywhere in the document.

I will use this MySports CONCEPT_RULE to build the new Sports category:

CONCEPT_RULE:(AND,(OR,"_c{sport@}","_c{baseball}","_c{tennis}","_c{football}","_c{basketball}","_c{racetrack}"),(NOT,"sporting"),(NOT,"gambling"),(NOT,"buddy"))

filmmakersInReview Concept

I built this concept just to illustrate how to use a pre-defined concept, in this case nlpPerson:

CONCEPT_RULE:(DIST_10,(OR,"filmmaker","director","film producer","producer","movie maker"),"_c{nlpPerson}")
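
As a hypothetical illustration (the name below is invented, not taken from the data), this rule would match a review snippet such as the one in the comments:

# Matched snippet: "... veteran director Jane Doe returns with a quieter film ..."
# nlpPerson matches the person name "Jane Doe"; DIST_10 requires it within
# 10 terms of "director" (or another listed word), and _c{ } extracts the name.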

favoriteMovies Concept

I built this concept to match my favorite old movies and one of my favorite directors. The first CONCEPT_RULE below will match documents that contain the terms Stanley Kubrick and 2001 in the same sentence. The second CONCEPT_RULE will match documents that contain the two terms anywhere in the document. Both CONCEPT_RULEs extract only the first term, "Stanley Kubrick":

CLASSIFIER:A Space Odyssey
CLASSIFIER:The Sound of Music
CLASSIFIER:Il Postino
CONCEPT_RULE:(SENT,"_c{Stanley Kubrick}","2001")
CONCEPT_RULE:(AND,"_c{Stanley Kubrick}","A Space Odyssey")

New Concepts in the Text Parsing Node

The customized concepts developed in the Concepts node are passed to the Text Parsing node. Notice the terms football, sport, sports and baseball and the new role MySports in the Kept Terms window, as well as the documents matched to the term "football":

Customized Categories in the Category Node

In the second pipeline I developed new categories, using as a starting point the rules associated with the genres "Sports", "Animated" and "Family".

Sports Category

The input data has only 3 movies in the Sports category; it is difficult to generate a meaningful rule from such a small sample. After the first pipeline runs, a total of 8 movies are matched: the 3 original ones plus 5 that are not related to sports. The automatically generated rule is:

(OR,(AND,"crowd-pleaser"),(AND,"conor"),(AND,"_x000d_stupid"))

For the second pipeline, I developed the MySports rule in the Concepts node as described above, and wrote this Boolean rule in the Categories node, where [MySports] references that concept:

(OR,(AND,"crowd-pleaser"),(AND,"conor"),(AND,"_x000d_stupid"),"[MySports]")

The new rule matches 90 movies, most of them related to sports. For the next iteration, one would look at the matched movies that don't relate to sports and the sports movies that were not matched, and refine the rule above.

ChildrenMovies Category

In the second pipeline, I combined the rules for the Family and Animation categories, which were automatically produced in the first pipeline.

For the Family category, 6 movies were matched by this rule:

(OR,(AND,(OR,"adults","adult"),"oz"))

It matched "People vs Larry Flynt", which prompted me to use the terms "murder" and "obscenity" in the concept rule.

The Animation category had 66 matched movies, and the automatically generated rule was:

(OR,(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))

I decided to modify these two rules. In the second pipeline, I used this new rule:

(OR,(AND,(NOT,(OR,"adults","adult","suitable for children","rated R","strip@","suck@","crude humor","gore","horror","murder","obscenity","drug use@")),"Wizard of Oz"),(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))

This rule matched 73 movies, only two of which were rated R. Therefore, I removed both the Animation and the Family categories and created the new category childrenMovies.

Again, to determine whether the new rules are an improvement over the previous ones, we must find how many true positives and false positives the new rules match, and repeat the process until we obtain the precision required.

Conclusion

Because the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules, highly customized models can be developed in VTA.

As in all analytical projects, the discovery process in Text Analytics is iterative: determine whether each new set of rules improves on the previous one by counting the true positives and false positives it matches, and repeat until you reach the precision your application requires.

Many thanks to Teresa Jade and Biljana Belamaric Wilsey for reviewing the linguistic rules used in this blog. For more information about Visual Text Analytics, please check out:

Analysis of Movie Reviews using Visual Text Analytics was published on SAS Users.