If you were to ask Tricia Wang, PhD, about real business growth, she would tell you that it lies outside the boundaries of the known. Not everything valuable is measurable, she would say. And big data is hiding new customers in the market from you. Wang is more than just [...]
SAS Technical Support occasionally receives requests from users who want to insert blank rows into their TABULATE procedure tables. PROC TABULATE syntax does not have any specific options that insert blank rows into the table results.
One way to accomplish this task is explained in SAS Sample 45972, "Add a blank row in PROC TABULATE output in ODS destinations." This sample shows you how to add a blank row between class variables in a PROC TABULATE table. The sample code creates a new data set variable that you can use in a CLASS statement in the PROC TABULATE step.
In addition to the method that is shown in this sample, there are two other methods that you can use to insert a blank row in PROC TABULATE table results. The following sections explain those methods:
Method 1: Adding a blank row between row dimension variables
This section demonstrates how to add a blank row between row dimension variables. This method expands on the approach that is used in Sample 45972.
In the following example, additional data-set variables are added to the data set in a DATA step. The BLANKROW variable is used as a class variable, and the variables DUMMY1 and DUMMY2 are used as analysis variables in the PROC TABULATE step. This sample code also uses the AGE and SEX variables from the SASHELP.CLASS data set as class variables in PROC TABULATE.
/*** Method 1 ***/
data class;
   set sashelp.class;
   blankrow=' ';
   dummy1=1;
   dummy2=.;
run;

proc tabulate data=class missing;
   class age sex blankrow;
   var dummy1 dummy2;
   table age*dummy1=' ' blankrow=' '*dummy2=' '
         sex*dummy1=' ' blankrow=' '*dummy2=' '
         all*dummy1=' ',
         sum*F=8. pctsum / misstext=' ' row=float;
   keylabel sum='# of Students' pctsum='% of Total' all='Total';
   title 'Add a Blank Row by Using New Data Set Variables';
run;
This method produces the result that is shown below. You can use ODS HTML, ODS PDF, ODS RTF, or ODS EXCEL statements to display the table.
Method 2: Adding blank rows with user-defined formats and the PRELOADFMT option in the CLASS statement
The second method creates user-defined formats and uses the PRELOADFMT option in a CLASS statement in PROC TABULATE. The VALUE statements in the PROC FORMAT step use the NOTSORTED option to establish the desired order of the results in the TABULATE output. Using the formats, along with the ORDER=DATA option in the CLASS statement and the PRINTMISS option in the TABLE statement, keeps the order that is requested in the PROC FORMAT VALUE statements and displays the blank rows.
/*** Method 2 ***/
proc format;
   value $sexf (notsorted)
      'F'='F' 'M'='M' ' '=' ';
   value agef (notsorted)
      11='11' 12='12' 13='13' 14='14' 15='15' 16='16' .=' ';
   value $sex2f (notsorted default=8)
      'F'='F' 'M'='M' ' '='Missing' '_'=' ';
   value age2f (notsorted)
      11='11' 12='12' 13='13' 14='14' 15='15' 16='16' .=' .' 99=' ';
   value mymiss
      0=' ' other=[8.2];
run;

proc tabulate data=sashelp.class missing;
   class sex age / preloadfmt order=data;
   table sex age all, N pctn*F=mymiss. / printmiss misstext=' '
         style=[cellwidth=2in];
   format sex $sexf. age agef.;
   /* If there are no missing values for the class */
   /* variables, use the formats $SEXF and AGEF.   */
   /* With missing values for the class variables, */
   /* use the formats $SEX2F and AGE2F.            */
   keylabel N='# of Students' PctN='% of Total' all='Total';
   title 'Add a Blank Row by Using the PRELOADFMT Option';
run;
This method produces the result that is shown below. You can use ODS HTML, ODS PDF, ODS RTF, or ODS EXCEL statements to display the table.
If you have any questions about these methods, contact SAS Technical Support. You can search the Technical Support Knowledge Base, visit SAS Communities to ask for assistance, or contact SAS Technical Support directly.
My first post on customer journey optimization set the scene by introducing the concept of marketing as a game to be won. The rules to this game are complex, some are known, and others can be learned. Above all, the game is built on a shifting landscape of customer, competitor [...]
Referred to as the instigator of innovation, Mick Ebeling works to expand the human possibilities of technology through his company, Not Impossible Labs, whose mission is to change the world through technology and story. Powered by a team of thinkers, doers, creators and hackers, Not Impossible Labs works on inspirational [...]
Every day I’m shufflin'.
-- "Party Rock Anthem," LMFAO
The most popular way to mix a deck of cards is the riffle shuffle, which separates the deck into two pieces and interleaves the cards from each piece. Besides being popular with card players, the riffle shuffle is well-known among statisticians because Bayer and Diaconis (1992) showed that it takes seven riffle shuffles to randomize a 52-card deck. This result annoyed many recreational card players because shuffling seven times "slows down the game".
The second most popular shuffling technique is the overhand shuffle. The overhand shuffle is much less efficient at mixing the cards than a riffle shuffle. A result by Pemantle (1989) indicates that at least 50 overhand shuffles are required to adequately mix the deck, but computer experiments (Pemantle, 1989, p. 49) indicate that in many situations 1,000 or more overhand shuffles are required to randomize a deck. Talk about slowing down the game! This article shows how to simulate an overhand shuffle in SAS and visually compares the mixing properties of the riffle and overhand shuffles.
The overhand shuffle
An overhand shuffle transfers a few cards at a time from the dealer's right hand to his left. The dealer slides a few cards from the top of the deck in his right hand into his left hand. The process is repeated until all cards in his right hand are transferred. Consequently, cards that started near the top of the deck end up near the bottom, and vice versa.
Mathematically, you can model an overhand shuffle by randomly choosing k cut points that separate the deck into k+1 "packets" of contiguous cards. The size of a packet can vary, as can the number of packets. The overhand shuffle reverses the order of the packets while preserving the order of the cards within each packet. When the packet sizes are small, the cards are mixed better than when the packets are bigger.
For example, a deck that contains 10 cards (enumerated 1-10) might be split into five packets such as (1)(234)(5)(678)(9 10), where the parentheses show the cards in each of the packets. In this example, the cut points are the locations after cards 1, 4, 5, and 8. The packets contain 1, 3, 1, 3, and 2 cards. The overhand shuffle reverses the order of the packets, so the deck is reordered as (9 10)(678)(5)(234)(1). This is shown in the following graphic.
In a statistical model of the overhand shuffle, the cut points are assumed to be randomly chosen with probability p from among the N-1 possible locations between adjacent cards. An ideal overhand shuffle uses p = 1/2, which results in packets that typically contain one, two, or three cards (the packet size is geometrically distributed). However, among amateurs, the probability is much lower, and an overhand shuffle reorders only a few large packets, which can lead to poor mixing of the cards.
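If you want to experiment with this model outside of SAS, it is straightforward to sketch in Python. The function names below are my own, not from the original article; the logic follows the description above (Bernoulli cut points, packet order reversed, order within each packet preserved):

```python
import random

def reverse_packets(deck, cuts):
    """Split the deck after the 1-based positions in `cuts`,
    then reverse the order of the packets."""
    bounds = [0] + list(cuts) + [len(deck)]
    packets = [deck[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
    return [card for packet in reversed(packets) for card in packet]

def overhand(deck, p=0.5, rng=random):
    """One overhand shuffle: each of the N-1 gaps between adjacent
    cards becomes a cut point with probability p."""
    cuts = [i for i in range(1, len(deck)) if rng.random() < p]
    return reverse_packets(deck, cuts)

# The worked 10-card example: cut points after cards 1, 4, 5, and 8
print(reverse_packets(list(range(1, 11)), [1, 4, 5, 8]))
# -> [9, 10, 6, 7, 8, 5, 2, 3, 4, 1]
```

Running `reverse_packets` with the cut points from the 10-card example reproduces the reordering (9 10)(678)(5)(234)(1) described above.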
Simulate an overhand shuffle
I previously showed how to simulate a riffle shuffle in the SAS/IML language. The following program shows how to simulate an overhand shuffle. The deck of cards (represented as a vector of integers with the values 1 to N) is passed in, as is the probability of choosing a cut point. To test the program, I pass in a deck that has 10 cards enumerated by the integers 1–10. The new order of the deck was shown in the previous section.
proc iml;
call randseed(1234);

/* Simulate overhand shuffle. Randomly split deck into k packets.
   Reverse the order of the packets. Each position between cards is
   chosen to be a cut point with probability p. (Default: p=0.5)  */
start Overhand(_deck, p=0.5);
   deck = colvec(_deck);              /* make sure input is column vector */
   N = nrow(deck);                    /* number of cards in deck */
   b = randfun(N-1, "Bernoulli", p);  /* randomly choose cut points */
   if all(b=0) then return( _deck );  /* no cut points were chosen */
   cutpt = loc(b=1);                  /* cut points in range 1:N-1 */
   cutpt = 0 || cutpt || N;           /* append 0 and N */
   rev = N - cutpt + 1;               /* reversed positions */
   newDeck = j(N, 1, .);
   do i = 1 to ncol(cutpt)-1;         /* for each packet... */
      R = (1 + cutpt[i]):cutpt[i+1];  /* positions of cards in right hand */
      L = rev[i+1]:(rev[i]-1);        /* new positions in left hand */
      newDeck[L] = deck[R];           /* transfer packet from right to left hand */
   end;
   return( shape(newDeck, nrow(_deck), ncol(_deck)) );  /* return same shape as input */
finish;

deck = 1:10;
OverhandDeck = Overhand(deck, 0.5);
print deck;
print OverhandDeck;
Simulate many overhand shuffles
Let's now consider the standard 52-card deck, represented by the integers 1–52. The following loop performs 10 overhand shuffles (p = 0.5) and uses a heat map to show the order of the cards after each shuffle:
N = 52;                           /* standard 52-card deck */
deck = 1:N;                       /* initial order */
numShuffles = 10;                 /* number of times to shuffle cards */
OverDeck = j(numShuffles+1, N);
OverDeck[1,] = deck;
do i = 2 to numShuffles + 1;
   OverDeck[i,] = Overhand(OverDeck[i-1,]);
end;

ramp = palette("YLORBR", 9);
call heatmapcont(OverDeck) colorramp=ramp yvalues=0:numShuffles
     title="Positions of 52 Cards After Overhand Shuffles";
The cell colors in the heat map correspond to the initial position of the cards: light colors indicate cards that are initially near the top of the deck and dark colors represent cards that are initially near the bottom of the deck. The first row shows the initial positions of the cards in the deck. The second row shows the cards after one overhand shuffle. The third row shows the cards after two overhand shuffles, and so on. It is interesting to see how the pattern changes if you use p=0.25 or p=0.1. Try it!
One feature that is apparent in the image is that the relative positions of cards depend on whether you have performed an even or an odd number of overhand shuffles. After an odd number of shuffles, the top of the deck contains many cards that were initially at the bottom, and the bottom of the deck contains cards that were initially at the top. After an even number of shuffles, the top of the deck contains many cards that were initially at the top. Another apparent feature is that this process does not mix the cards very well. The last row in the heat map indicates that after 10 overhand shuffles, many cards are not very far from their initial positions. Furthermore, many cards that were initially next to each other are still within a few positions of each other.
Maybe additional overhand shuffles will address these deficiencies? The following program simulates 100 overhand shuffles. To aid in the visualization, the heat map shows only the results after 2, 4, 6, ..., and 100 shuffles (the even numbers):
deck = 1:N;                          /* reset initial order */
numShuffles = 100;                   /* number of times to shuffle cards */
numEvenShuffles = numShuffles / 2;
OverDeck = j(numEvenShuffles+1, N);
OverDeck[1,] = deck;
/* save only the EVEN shuffles */
do i = 2 to numEvenShuffles + 1;
   OverDeck[i,] = Overhand(Overhand( OverDeck[i-1,] ));  /* 2 calls */
end;

call heatmapcont(OverDeck) colorramp=ramp yvalues=0:numEvenShuffles
     title="Positions of 52 Cards After EVEN Overhand Shuffles";
The heat map indicates that the deck is not well-mixed even after 100 overhand shuffles. Look at the last row in the heat map, which shows the positions of the cards in the deck colored by their initial position. There are a small number of light-colored cells on the right side of the heat map (cards that were initially on top; now near the bottom), and a small number of dark-colored cells on the left side of the heat map (cards that were initially on the bottom; now near the top). However, in general, the left side of the heat map contains mostly light-colored cells and the right side contains mostly dark-colored cells, which indicates that most cards have not moved far from their initial positions.
If you use smaller values of p (that is, the average packet size is larger), you will observe different patterns. The theory says that large packets cause nearby cards to tend to stay together for many shuffles. Jonasson (2006) was able to prove that if a deck has N cards, then O(N^2 log(N)) overhand shuffles are required to randomize the deck. This compares with (3/2) log2(N) for the riffle shuffle, which is presented in the next section.
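To put those two growth rates in perspective, here is a quick back-of-the-envelope calculation for a standard deck. Keep in mind that the big-O bound hides an unknown constant, so the overhand figure is only an order-of-magnitude indicator, not a literal shuffle count:

```python
import math

N = 52
riffle_shuffles = 1.5 * math.log2(N)   # (3/2) log2(N) growth rate for riffle
overhand_scale  = N**2 * math.log(N)   # N^2 log(N) scale for overhand (up to a constant)

print(round(riffle_shuffles, 1))   # roughly 8.6 -- same ballpark as the famous "seven shuffles"
print(round(overhand_scale))       # on the order of ten thousand
```

The two numbers differ by three orders of magnitude, which matches the qualitative difference visible in the heat maps.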
A comparison with the riffle shuffle
The riffle shuffle randomizes a deck after about seven shuffles. You can simulate a riffle shuffle in SAS and create a similar heat map. The following heat map shows the order of a 52-card deck after each of 10 riffle shuffles:
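For readers who want to compare the two shuffles in the same playground, the standard mathematical description of a riffle shuffle is the Gilbert-Shannon-Reeds (GSR) model: cut the deck at a Binomial(N, 1/2) position, then repeatedly drop the next card from one of the two halves with probability proportional to how many cards that half still holds. A minimal Python sketch of that model (my own code, not the SAS/IML program referenced above):

```python
import random

def riffle(deck, rng=random):
    """One riffle shuffle under the GSR model."""
    n = len(deck)
    k = sum(rng.random() < 0.5 for _ in range(n))   # cut position ~ Binomial(n, 1/2)
    left, right = deck[:k], deck[k:]
    out, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        remaining_left = len(left) - i
        remaining_right = len(right) - j
        # drop from a half with probability proportional to its remaining size
        if rng.random() < remaining_left / (remaining_left + remaining_right):
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out
```

Applying `riffle` seven times to `list(range(52))` and plotting the positions after each pass would reproduce the qualitative behavior of the heat map described below.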
The heat map shows how quickly a riffle shuffle randomizes a deck. The last row (after 10 riffles) shows that light-colored cells are distributed throughout the deck, as are dark-colored cells and medium-colored cells. Furthermore, this behavior seems to occur after six or seven rows (at least to my untrained eye). I cannot see much difference between the "randomness" in row 6 and row 10. However, theoretically, a card in an earlier row (say, row 5) is more likely to be near cards that were initially close to it. For example, a dark-colored cell in row 5 is more likely to be near another dark-colored cell than would be expected in a randomized deck. However, by the time you perform 7 or more riffle shuffles, two cards that started near each other might be located far apart or might be close to each other. You cannot predict which will occur because the cards have been randomized.
In summary, the overhand shuffle does not mix cards very well. The simulation indicates that an overhand shuffle is not effective at separating cards that begin near each other. Even after 100 overhand shuffles (with p=0.5), the positions of many cards are only slightly different from their initial positions. In contrast, a riffle shuffle is an effective way to randomize a deck in approximately seven shuffles.
The post Shuffling smackdown: Overhand shuffle versus riffle shuffle appeared first on The DO Loop.
Like most people, I believed that the process of diagnosing and treating cancer begins with a biopsy. If cancer is suspected, a doctor will extract a small tissue sample -- usually a tiny cylindrical "core sample" -- and examine it for cancer cells. No cancer cells found -- that's good news! But if cancer cells are present, then you have decisions to make about treatment.
A young woman named Richa Sehgal taught me that it's not so simple. There aren't just two types of cells (cancerous and non-cancerous). There are actually several types of cancer cells, and these do not all have the same importance when it comes to effective cancer treatment. I learned this from Richa during her presentation at Analytics Experience 2018 -- a remarkable talk for several reasons, not the least of which is this: Richa Sehgal is a high school student, just 18 years old. I'll have to check the record books, but this might make her the youngest-ever presenter at this premier analytics event.
Last year, Richa served as a student intern at the Canary Center at Stanford for Cancer Early Detection. That's where she learned about the biology of cancer. She was allowed (encouraged!) to attend all lab meetings – and the experience opened her eyes to the challenges of cancer detection.
The importance of cell types and how cancer works
Unlike many technical conference talks that I've attended, Richa did not dive directly into the math or the code that support the techniques she was presenting. Instead, Richa dedicated the first 25 minutes of her talk to teach the audience how cancer works. And that primer was essential to help the (standing-room only!) audience to understand the relevance and value of her analytical solution.
What we call "cancer" is actually a collection of different types of cells. Richa focused on three types: cancer stem cells (CSCs), transient amplifying cells (TACs), and terminally differentiated cells (TDCs). CSCs are the most rare type within a tumor, making up just a few percent of the total mix of cells. But because of their self-renewing qualities and their ability to grow all other types of cancer cells, these are very important to treat. CSCs require targeted therapy -- that is, you can't use the same type of treatment for all cell types. TACs usually require their own treatment, depending on the stage of the disease and the ability of a patient to tolerate the therapy. The presence of TACs can activate CSCs to grow more cancer cells, so if you can't eradicate the CSCs (and that's difficult to manage with 100% certainty, as we'll see) then it is important to treat the TACs. TDCs represent cancer cells that are no longer capable of dividing, and so generally don't require a treatment -- they will die off on their own.
(I know that my explanation here represents a simplistic view of cancer -- but it was enough of a framework to help me to understand the rest of Richa's talk.)
The inexact science of biopsies
Now that we understand that cancer is made up of a variety of cell types, it makes sense to hope that when we extract a biopsy, we get a sample that represents this cell-type variability. Richa used an example of sampling a chocolate chip cookie. If you were to use a needle to extract a core sample from a chocolate chip cookie...but didn't manage to extract any portions of the (disappointingly rare) chocolate chips, you might conclude that the cookie was a simple sugar cookie. And as a result, you might treat that cookie differently. (If you encountered a raisin instead...well..that might require a different treatment altogether. Blech.)
But, as Richa told us, we don't yet know enough about the distribution and proximity of the different cell types for different types of cancers. This makes it difficult to design better biopsies. Richa is optimistic that it's just a matter of time -- medical science will crack this and we'll one day have good models of cancer makeup. And when that day comes, Richa has a statistical method to make biopsies better.
Using SAS and Python to model cancer cell clusters
Most high school students wouldn't think to pick up SAS for use in their science fair projects, but Richa has an edge: her uncle works for SAS as a research statistician. However, you don't need an inside connection to get access to SAS for learning. In Richa's case, she used SAS University Edition hosted on AWS -- nothing to install, easy to access, and free to use for any learner.
Since she didn't have real data that represent the makeup of a tumor, Richa created simulations of the cancer cells, their different types, and their proximity to each other in a 3D model. With this data in hand, she could use cluster analysis (PROC CLUSTER with Ward's method and then PROC TREE) to analyze a distance matrix that she computed. The result shows how closely cancer cells of the same type are positioned to one another. With that information, it's possible to design a biopsy that captures a highly variable collection of cells.
Richa then used the Python package plotly to visualize the 3D model of her cell map. (I didn't have the heart to tell her that she could accomplish this in SAS with PROC SGPLOT -- some things you just have to learn for yourself.)
A bright future -- for all of us
Clearly, Richa is an extremely accomplished young woman. When I asked about her college plans for next year, she told me that she has a long list of "stretch schools" that she's looking at. I'm having a difficult time understanding what constitutes a "stretch" for Richa -- I'm certain that any institution would love to have her.
Richa's accomplishments make me feel optimism for her, but also for the rest of us. As a father of three daughters, I'm encouraged to see young women enter technical fields and be successful. SAS is among the elite technology companies that work to close the analytics skills gap by providing free software, education, and mentoring. Throughout the Analytics Experience 2018 conference, I've heard from many attendees who also saw Richa's talk -- they were similarly impressed and inspired. Presentations like Richa's deliver on the conference tagline: "Analytics redefines innovation. You redefine the future."
"We been through every kind of rain there is. Little bitty stingin' rain, and big ol' fat rain, rain that flew in sideways, and sometimes rain even seemed to come straight up from underneath." Was that a quote from the Forrest Gump movie, or something said regarding Hurricane Florence? Could be either one! Hurricane Florence recently came through [...]
The post Hurricane Florence: rainfall totals in the Carolinas appeared first on SAS Learning Post.
SAS Press author Kirk Paul Lafler's favorite tips using PROC SQL.
The post Three Proc SQL Tips from Bestselling Author Kirk Paul Lafler appeared first on SAS Learning Post.
- Messy data. It might come from a variety of sources, and won't necessarily be organized for analytics or reporting. That's your job.
- Nebulous problem set. Usually the goal of a hackathon is to generate insights, improve a situation, or optimize a process. But you don't know going into it which insights you need, which process is ripe for optimization, or which situations can be improved by using data. Hackathons are as much about discovering opportunities as they are about solving problems.
- Team members with different viewpoints. This is a big strength of hackathons, and it can also present the biggest challenge. Team members bring different skills and ideas. To be successful, you need to be open to those ideas and to allowing team members to contribute in the way that best uses their skills. Think of yourselves as the Ocean's Eleven of data analytics.
In my experience, hackathons are often a great melting pot of different tools and technologies. Whatever tech biases you might have in your day job (Windows versus Linux, SAS versus Python, JSON versus CSV) – these melt away when your teammates show up ready to contribute to a common goal using the tools that they each know best.
My favorite hackathon tools
At the Analytics Experience 2018 Hackathon, attendees have the entire suite of SAS tools available. From Base SAS, to SAS Enterprise Guide, to SAS Studio, to SAS Enterprise Miner and the entire SAS Viya framework -- including SAS Visual Analytics, SAS Visual Text Analytics, SAS Data Mining and Machine Learning. As we say here in San Diego, it's the whole enchilada. As the facilitators were presenting the whirlwind tour of all of these goodies, I could see the attendees salivating. Or maybe that was just me.
When it comes to getting my hands dirty with unknown data, my favorite path begins with SAS Enterprise Guide. If you know me, this won't surprise you. Here's why I like it.
Import Data task: Import any data
Hackathon data almost always comes as CSV or Excel spreadsheets. The Import Data task can ingest CSV, fixed-width text, and Excel spreadsheets of any version. Of course most "hackers" worth their salt can write code to read these file types, but the Import Data task helps you to discover what's in the file almost instantly. You can review all of the field names and types, tweak them as you like, and click Finish to produce a data set. There's no faster method of turning raw data into a SAS data set that feeds the next step.
See Tricks for importing text files and Importing Excel files using SAS Enterprise Guide for more details about the ins-and-outs of this task. If you want to ultimately turn this step into repeatable code (a great idea for hackathons), then it's important to know how this task works.
Note: if your data is coming from a web service or API, then it's probably in JSON format. There's no point-and-click task to read that, but a couple of SAS program lines will do the trick.
Query Builder: Filter, compute, summarize, and join
The Query Builder in SAS Enterprise Guide is a one-stop shop for data management. Use this for quick filtering, data cleansing, simple recoding, and summarizing across groups. Later, when you have multiple data sources, the Query Builder provides simple methods to join these – merge on the fly.
Before heading into your next hackathon, it's worth exploring and practicing your skills with the Query Builder. It can do so much -- but some of the functions are a bit hidden. Limber up before you hack!
Characterize Data: Quick data characteristics, with ability to dive deeper
If you've never seen your data before, you'll appreciate this one-click method to report on variable types, frequencies, distinct values, and distributions. The Describe->Characterize Data task provides a good start.
Using SAS Studio? There's a Characterize Data task in there as well. See Marje Fecht's paper: Easing into Data Exploration, Reporting, and Analytics Using SAS Enterprise Guide for more about this and other tasks.
Data tasks: Advanced data reworking: long to wide, wide to long
"Long" data is typically best for reporting, while "wide" data is more suited for analytics and modeling The process of restructuring data from long to wide (or wide to long) is called Transpose. SAS Enterprise Guide has special tasks called "Split Data" (for making wide tables) and "Stack Data" (for making long data). Each method has some special requirements for a successful transformation, so it's worth your time to practice with these tasks before you need them.
Program Editor: Flexible coding environment
The program editor in SAS Enterprise Guide is my favorite place to write and modify SAS code. Here are my favorite tricks for staying productive in this environment, including code formatting.
Have another favorite editor? You can use SAS Enterprise Guide to open your code in your default Windows editor too. That's a great option when you need to do super-fancy text manipulation. (We won't go into the "best programming editor" debate here, but I've got my defaults set up for Notepad++.)
Export and share with others
The hackathon "units of sharing" are code (of course) and data. SAS Enterprise Guide provides several simple methods to share data in a way that just about any other tool can consume:
- Export data as CSV (CSV is the lingua franca of data sharing)
- Export data as Excel (if that's what your teammates are using)
- Send to Excel -- actually my favorite way to generate ad-hoc Excel data, as it automates Microsoft Excel and pipes the data you're looking at directly into a new sheet.
- Copy / paste with headers -- low-tech, but this gets you exactly the columns and fields that you want to share with another team member.
When it comes to sharing code, you can use File->Export All Code to capture all SAS code from your project or process flow. However, I prefer to assemble my own "standalone" code piecemeal, so that I can make sure it's going to run the same for someone else as it does for me. To accomplish this, I create a new SAS program node and copy the code for each step that I want to share into it...one after another. Then I test by running that code in a new SAS session. Validating your code in this way helps to reduce friction when you're sharing your work with others.
Hacking your own personal growth
The obvious benefit of hackathons is that at the end of a short, intense period of work, you have new insights and solutions that you didn't have before – and that you might never have arrived at on your own. But the personal benefit comes from the people you meet and the techniques that you learn. I find that I'm able to approach my day job with a fresh perspective and ideas – the creativity keeps flowing, and I'm energized to apply what I've learned in my business.
The post Essential SAS tools to bring to your next hackathon appeared first on The SAS Dummy.
Phil Simon shares some lessons from his consulting career.
The post The relationship between data management and analytics: Lessons from the field appeared first on The Data Roundtable.