I attended the Scottish Highland Games this past weekend ... nearby in Scotland County, North Carolina! They put on a great event, with kilt-wearing Scotsmen throwing things, bands playing bagpipes, kids dancing, and clans sharing their family history. And to get into the mood for this event, I decided to [...]
When was the last time you or your colleagues wanted access to data and tools to produce reports and dashboards for a business need? Probably within the last hour. Self-service BI applications – gaining popularity as we speak – make gaining insights and making decisions faster. But they've also generated a greater need for governance.
Part of governance is understanding the data lifecycle, or data lineage. For example, suppose a co-worker performed some modifications to a dataset and used it to produce a report that you would like to use to help solve a business need. How can you be sure that the information in this report is accurate? How did the producer of the report calculate certain measures? On what original data set was the report based?
SAS provides many tools to help govern platforms and solutions. Let’s look at one of those tools to understand the data lifecycle: SAS Lineage Viewer.
Here we have a report created to explore and visualize telecommunications data using SAS Visual Analytics. The report shows our variable of interest, cross-sell and up-sell flag, and its relationship to other variables. This report will be used to target customers for cross-sell or up-sell.
This report is based on an Analytical Base Table (ABT) that was created by joining two data sets:
- Usage information from a subset of customers who have contacted customer care centers.
- Cleansed demographics data.
The name of the joined dataset the report is based on is LG_FINAL_ABT.
To make sure we understand the data behind this report we’ll explore it using a lineage viewer (you will need to be assigned to the “Data Management Business User” or “Data Management: Lineage” group, which an administrator can help you with). From the applications menu, select Explore Lineage.
We’ll click on Search for Subjects and search for the report we were just reviewing: Telecommunications.
I’ll enter Telecommunications in the search field then select the Telecommunications report.
The first thing I see is the LG_Final_ABT CAS table my report is dependent on.
If I click on the + sign on the top right corner of the data, LG_Final_ABT, I can see all the other relationships to that CAS table. There is a Model Studio project, two Visual Analytics Reports (including the report we looked at), and a data view that are all dependent on the LG_FINAL_ABT CAS Table. This diagram also shows us that the LG_FINAL_ABT CAS table is dependent on the Public CAS library. We also see that the LG_FINAL_ABT CAS table was loaded into CAS from the LG_FINAL_ABT.sashdat file.
Let’s explore the LG_FINAL_ABT.sashdat file to see its lineage. Clicking on the + expands the view. In the following diagram, I expanded all the remaining items to see the full data lifecycle.
This image shows us the whole data lifecycle. From LG_FINAL_ABT.sashdat we see that it is dependent on the Create Final LG ABT data preparation plan. That plan is dependent on two CAS tables: LG_CUSTOMER and LG_ORIG_ABT. The data lineage viewer shows us that the LG_CUSTOMER table was loaded in from a CSV file (lg_customer.csv) and the LG_ORIG_ABT CAS table was loaded in from a SAS data set (lg_orig_abt.sas7bdat).
To dive deeper into the mashups and data manipulations that took place to produce LG_FINAL_ABT.sashdat, we can open the data preparation plan. To do this, I’ll right-click on Create Final LG ABT and select Actions, then Prepare Data.
Here is the data preparation plan. At the top you can see that the creator of this data set performed five steps – Gender Analysis, Standardize, Remove, Rename and Join.
To get details into each of these steps, click on the titles at the top. Clicking on Gender Analysis, I see that a gender analysis was performed based on the customer_name field and the results were added to the data set in a variable named customer_name_GND.
Clicking on the Standardize title, I see that there were two standardization tasks performed on the original data set. One for customer state and the other for customer phone number. I can also see that the results were placed in new fields (customer_state_STND and customer_primary_phone_STND).
Clicking on the Remove title, I see that three variables were dropped from the final dataset. These variables were the original ones that the user had “fixed” in the previous steps: customer_gender, customer_state, and customer_primary_phone.
Clicking on the Rename title, I see that the new variables have been renamed.
The last step in the process is a join. Clicking on the Join title I see that LG_CUSTOMER was joined with LG_ORIG_ABT based on an inner join on Customer_ID.
We have just walked through the data lineage or data lifecycle for the dataset LG_FINAL_ABT, using SAS tools. I now understand how the data in the report we were looking at was generated. I am confident that the information that I gain from the report will be accurate.
Since sharing information and data among co-workers has become so common, it's now more crucial than ever to think about the data lifecycle. When you gain access to a report that you did not create, it is always a good idea to check the underlying data to ensure that you understand the data and that any business insights gained are accurate. Also, if you are sharing data with others and you want to make modifications to it, you should always check the lineage to ensure that you won’t be undermining someone else’s work with your changes. Thanks to SAS Visual Analytics, all the tools you need to review data lineage are available within one interface.
Keep track of where data originated with data lineage in SAS was published on SAS Users.
This article compares several ways to find the elements that are common to multiple sets. I test which method is the fastest in the SAS/IML language. However, all algorithms are intrinsically fast, which raises an important question: when is it worth the time and effort to optimize an algorithm?
The idea for this topic came from reading a blog post by 'mikefc' at the "Cool but Useless" blog about the relative performance of various ways to intersect vectors in R. Mikefc used vectors that contained between 10,000 and 100,000 elements and whose intersection was only a few dozen elements. In this article, I increase the sizes of the vectors by an order of magnitude and time the performance of the intersection function in SAS/IML. I also ensure that the sets share a large set of common elements so that the intersection is sizeable.
The intersection of large sets
In general, finding the intersection between sets is a fast operation. In languages such as R, MATLAB, and SAS/IML, you can store the sets in vectors and use built-in functions to form the intersection. Because the intersection of two sets is always smaller than (or equal to) the size of the original sets, computing the "intersection of intersections" is usually faster than computing the intersection of the original (larger) sets.
The following SAS/IML program creates eight SAS/IML vectors that have between 200,000 and 550,000 elements. The elements in the vectors are positive integers, although they could also be character strings or non-integers. The vectors contain a large subset of common elements so that the intersection is sizeable. The vectors A, B, ..., H are created as follows:
proc iml;
call randseed(123);
NIntersect = 5e4;
common = sample(1:NIntersect, NIntersect, "replace"); /* elements common to all sets */
source = 1:1e6;   /* elements are positive integers less than 1 million */
BaseN = 5e5;      /* approx magnitude of vectors (sets) */
A = sample(source, BaseN, "replace") || common;   /* include common elements */
B = sample(source, 0.9*BaseN, "replace") || common;
C = sample(source, 0.8*BaseN, "replace") || common;
D = sample(source, 0.7*BaseN, "replace") || common;
E = sample(source, 0.6*BaseN, "replace") || common;
F = sample(source, 0.5*BaseN, "replace") || common;
G = sample(source, 0.4*BaseN, "replace") || common;
H = sample(source, 0.3*BaseN, "replace") || common;
The intersection of these vectors contains about 31,600 elements. You can implement the following methods that find the intersection of these vectors:
- The XSECT function in SAS/IML accepts up to 15 arguments. You can therefore make a single call to find the intersection.
- If you have dozens or hundreds of sets to intersect, you can loop over the sets and call the XSECT function multiple times. (Recall that the VALUE function enables you to loop over the names of the sets and access the vectors by names.) For example, you can let W1 be the intersection of the first two sets, let W2 be the intersection of W1 with the third set, and so forth. Because the size of the intersection is typically smaller than the size of the sets, later intersections require less time to compute than earlier intersections. (Although I only implement pairwise intersections, you could conceivably intersect more than two sets at a time.)
- As 'mikefc' mentions, it is quicker to intersect small sets than to intersect large sets. Thus you can preprocess the names of the sets and sort them by size. This heuristic might make a difference when the set sizes vary greatly.
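The pairwise-intersection idea (with and without the sort-by-size heuristic) can be sketched in a few lines of Python. This is an illustration only; the article's timings use the SAS/IML XSECT and VALUE functions.

```python
# Sketch of the pairwise-intersection approach, not the SAS/IML implementation.
def intersect_all(sets, presort=False):
    """Intersect a list of sets pairwise; optionally intersect the smallest sets first."""
    if presort:
        sets = sorted(sets, key=len)   # heuristic: small sets first
    result = sets[0]
    for s in sets[1:]:
        result = result & s            # each intersection is no larger than its inputs
    return result

A = {1, 2, 3, 4, 5}
B = {2, 3, 4, 5, 6, 7}
C = {3, 4, 5, 9}
print(sorted(intersect_all([A, B, C])))   # [3, 4, 5]
```

Because each intermediate result can only shrink, later intersections in the loop tend to be cheaper than earlier ones, which is the same argument made for the SAS/IML loop above.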
The following SAS/IML statements implement these three methods and compute the time required to intersect each:
/* 1. one call to the built-in method */
t0 = time();
w = xsect(a,b,c,d,e,f,g,h);
tBuiltin = time() - t0;

/* 2. loop over variable names and form pairwise intersections */
varName = "a":"h";
t0 = time();
w = value(varName[1]);
do i = 2 to ncol(varName);
   w = xsect(w, value(varName[i]));   /* compute pairwise intersection */
end;
tPairwise = time() - t0;

/* 3. sort by size of sets, then loop */
varName = "a":"h";
t0 = time();
len = j(ncol(varName), 1);
do i = 1 to ncol(varName);
   len[i] = ncol(value(varName[i]));  /* number of elements in each set */
end;
call sortndx(idx, len);               /* sort smallest to largest */
sortName = varName[,idx];
w = value(sortName[1]);
do i = 2 to ncol(sortName);
   w = xsect(w, value(sortName[i]));  /* compute pairwise intersection */
end;
tSort = time() - t0;

print tBuiltin tPairwise tSort;
For this example data, each method takes about 0.5 seconds to find the intersection of eight sets that contain a total of 3,000,000 elements. If you rerun the analysis, the times will vary by a few hundredths of a second, so the times are essentially equal. ('mikefc' also discusses a few other methods, including a "tally method." The tally method is fast, but it relies on the elements being positive integers.)
Whether to optimize or not
There is a quote that I like: "In theory, there is no difference between theory and practice. But, in practice, there is." (Published by Walter J. Savitch, who claims to have overheard it at a computer science conference.) For the intersection problem, you can argue theoretically that pre-sorting by length is an inexpensive way to potentially obtain a boost in performance. However, in practice (for this example data), pre-sorting improves the performance by less than 1%. Furthermore, the absolute times for this operation are short, so saving 1% of 0.5 seconds might not be worth the effort.
Even if you never compute the intersection of sets in SAS, there are two take-aways from this exercise:
- When you measure the performance of algorithms, use example data that is similar to the real data that you expect to encounter. The performance of this problem depends on the size of the sets, the number of sets, and the contents of the sets. The results that I show here are based on my choices for these parameters. Different parameters might yield different results.
- Always consider the total time for an algorithm before you start to optimize it. If your analysis takes 30 seconds, there is no need to optimize a step in the analysis that takes 0.5 seconds. No matter how well you optimize that step, it is not going to substantially shrink the total run time of your analysis.
In my earlier blog, I described how to create maps in SAS Visual Analytics 8.2 if you have an ESRI shapefile with granular geographies, such as counties, that you wish to combine into regions. Since posting this blog in January 2018, I received a lot of questions from users on a range of mapping topics, so I thought a more general post on using – and troubleshooting – custom polygons in SAS Visual Analytics on Viya was in order. Since version 8.3 is now generally available, this post is tailored to the 8.3 version of SAS Visual Analytics, but the custom polygon functionality hasn’t really changed between the 8.2 and 8.3 releases.
What are custom polygons?
Custom polygons are geographic boundaries that enable you to visualize data as shaded areas on the map; the resulting visualizations are sometimes referred to as choropleth maps. For example, say you work for a non-profit organization which is trying to decide where to put a new senior center. So you create a map that shows the population of people over 65 years of age by US census tract. The darker polygons suggest a larger number of seniors, and thus a potentially better location to build a senior center:
SAS Visual Analytics 8.3 includes a few predefined polygonal shapes, including countries and states/provinces. But if you need something more granular, you can upload your own polygonal shapes.
How do I create my own polygonal shapes?
To create a polygonal map, you need two components:
- A dataset with a measure variable and a region ID variable. For example, you may have population as a measure, and census tract ID as a region ID. A simple frequency can be used as a measure, too.
- A “polygon provider” dataset, which contains the same region ID as above, plus geographic coordinates of each vertex in each polygon, a segment ID and a sequence number.
So where do I get this mysterious polygon provider? Typically, you will need to search for a shapefile that contains the polygons you need, and do a little bit of data preparation. A shapefile is a geographic data format developed by ESRI. When you download a shapefile and look at it on the file system, you will see that it contains several files. For example, my 2010 Census Tract shapefile includes all these components:
Sometimes you may see other components present as well. Make sure to keep all components together.
To prepare this data for SAS Visual Analytics, you have two options.
Preparing shapefile for SAS Visual Analytics: The long way
One method to prepare the polygon provider is to run PROC MAPIMPORT to convert the shapefile into a SAS dataset, add a sequence ID field and then load into the Cloud Analytic Services (CAS) server in SAS Viya. The sequence ID is mandatory, as it helps SAS Visual Analytics to draw the lines connecting vertices in the correct order.
A colleague recently reached out for help with a map of Census block groups for Chatham County in North Carolina. Let’s look at his example:
The shapefile was downloaded from here. We then ran the following code on my desktop:
libname geo 'C:\...\Data';

proc mapimport datafile="C:\...\Data\Chatham_County__2010_Census_Block_Groups.shp"
   out=work.chatham_cbg;
run;

data geo.chatham_cbg;
   set chatham_cbg;
   seqno=_n_;
run;
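The essential step is attaching a sequence number in vertex order. As a generic illustration (hypothetical rows and column names, not the SAS code), the same idea looks like this in Python:

```python
# Generic sketch of the sequence-ID step; the SAS DATA step uses seqno = _n_.
def add_sequence_id(vertices):
    """Attach a 1-based sequence number to each vertex row, preserving file order."""
    return [dict(row, seqno=i) for i, row in enumerate(vertices, start=1)]

rows = [{"GEOID": "370370201011", "x": -79.2, "y": 35.7},   # hypothetical vertices
        {"GEOID": "370370201011", "x": -79.1, "y": 35.8}]
print([r["seqno"] for r in add_sequence_id(rows)])   # [1, 2]
```

Whatever tool you use, the key point is that the sequence number must reflect the order of the vertices as they appear in the shapefile, before any joins or sorts can disturb it.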
We then manually loaded the geo.chatham_cbg dataset into CAS using self-service import in SAS Visual Analytics.

Preparing shapefile for SAS Visual Analytics: The short way

Another option is the %shpimprt macro. The macro will automatically run PROC MAPIMPORT, create a sequence ID variable and load the table into CAS. Here’s an example:
%shpimprt(shapefilepath=/path/Chatham_County__2010_Census_Block_Groups.shp,
          id=GEOID,
          outtable=Chatham_CBG,
          cashost=my_viya_host.com,
          casport=5570,
          caslib='Public');
For this macro to work, the shapefile must be copied to a location that your SAS Viya server can access, and the code needs to be executed in an environment that has SAS Viya installed. So, it wouldn’t work if I tried to run it on my desktop, which only has SAS 9.4 installed. But it works beautifully if I run it in SAS Studio on my SAS Viya machine.
Configuring the polygon provider
The next step is to configure the polygon provider inside your report. I provided a detailed description of this in my earlier blog, so here I’ll just summarize the steps:
- Add your data to the SAS Visual Analytics report, locate the region ID variable, right-click and select New Geography
- Give it a name and select Custom Polygonal Shapes as geography type
- Click on the Custom Polygon Provider box and select Define New Polygon Provider
- Configure your polygon provider by selecting the library, table and ID column. The values in your ID column must match the values of the region ID variable in the dataset you are visualizing. The ID column, however, does not need to have the same name as in the visualization dataset.
- If necessary, configure advanced options of the polygon provider (more on that in the troubleshooting section of this blog).
If all goes well, you should see a preview of your polygons and a percentage of regions mapped. Click OK to save your geographic item, and feel free to use it in the Geo Map object.
I followed your instructions, but the map is not working. What am I missing?
I observed a few common troubleshooting issues with custom maps, and all are fairly easy to fix. The table below summarizes symptoms and solutions.
Symptom: In the Geographic Item preview, 0% of the regions are mapped.
Solution: Check that the values in the region ID variable match between the main dataset and the polygon provider dataset.

Symptom: I successfully created the map, but the colors of the polygons all look the same. I know I have a range of values, but the map doesn’t convey the differences.
Solution: In your main dataset, you probably have missing region ID values or region IDs that don’t exist in the polygon provider dataset. Add a filter to your Geo Map object to exclude region IDs that can’t be mapped.

Symptom: Only a subset of regions is rendered.
Solution: You may have too many points (vertices) in your polygon provider dataset. SAS Visual Analytics can render up to 250,000 points. If you have a large number of polygons represented in a detailed way, you can easily exceed this limit. You have two options, which you can mix and match: (1) filter the map to show fewer polygons, or (2) reduce the level of detail in the polygon provider dataset using PROC GREDUCE (see example here). Also, if you imported data using the %shpimprt macro, it has an option to reduce the dataset.

Symptom: In the Geographic Item preview, the note shows that 100% of the regions are mapped, but the regions don’t render, or the regions are rendered in the wrong location (e.g., in the middle of the ocean) and/or at an incorrect scale.
Solution: This is probably the trickiest issue, and the most likely culprit is an incorrectly specified coordinate space code (EPSG code). The EPSG code corresponds to the type of projection applied to the latitude and longitude in the polygon provider dataset (and the originating shapefile). Projection is a method of displaying points from a sphere (the Earth) on a two-dimensional plane (flat surface). See this tutorial if you want to know more about projections.

There are several projection types and numerous flavors of each type. The default EPSG code used in SAS Visual Analytics is EPSG:4326, which corresponds to the unprojected coordinate system. If you open the advanced properties of your polygon provider, you can see the current EPSG code.

Finding the correct EPSG code can be tricky, as not all shapefiles have consistent and reliable metadata built in. Here are a couple of things you can try:

(1) Open your shapefile as a layer in a mapping application such as ArcMap (licensed by ESRI) or QGIS (open source) and view the properties of the layer. In many cases the EPSG code will appear in the properties.

(2) Go to the location of your shapefile and open the .prj file in Notepad. It will show the projection information for your shapefile, although it may look a bit cryptic. Take note of the unit of measure (e.g., feet), datum (e.g., NAD 83) and projection type (e.g., Lambert Conformal Conic). Then go to https://epsg.io/ and search for your geography. Going back to the example for Chatham County, I searched for North Carolina. If more than one code is listed, select a few codes that seem to match your .prj information best, then go back to SAS Visual Analytics and change the polygon provider Coordinate Space property. You may have to try a few codes before you find the one that works best.

Symptom: I ruled out a projection issue, and the note in the Geographic Item preview shows that 100% of the regions are mapped, but the regions still don’t render.
Solution: Take a look at your polygon provider preparation code and double-check that the order of observations didn’t accidentally get changed. The order of records may change, for example, if you use a PROC SQL join when you prepare the dataset. If you accidentally changed the order of the records prior to assigning the sequence ID, it can result in an illogical order of points that SAS Visual Analytics will have trouble rendering. Remember, the sequence ID is needed so that SAS Visual Analytics can render the outlines of each polygon correctly. You can validate the order of records by mapping the polygon provider using PROC GMAP. For example, in image #1 below, the records are ordered correctly; in image #2, the order of records is clearly wrong, hence the lines going crisscross.
As you can see, custom regional maps in SAS Visual Analytics 8.3 are pretty straightforward to implement. The few "gotchas" I described will help you troubleshoot some of the common issues you may encounter.
P.S. I would like to thank Falko Schulz for his help in reviewing this blog.
Troubleshooting custom polygon maps in SAS Visual Analytics 8.3 was published on SAS Users.
You can now conduct a live demo of SAS Visual Analytics on your mobile device to participants who are geographically dispersed by using the Present Screen feature, a tucked-away option in the SAS Visual Analytics app for iPad, iPhone, and Android devices. Let’s say I am looking at a report on my mobile device, and I have questions about a couple of items for my colleagues Joe and Anita, both of whom are located in two different cities. The three of us are able to see the report while I demo it, drawing their attention to specific areas of interest in the report.
Sitting in my office, or from any location where I have a Wi-Fi or cellular connection, I can use the Present Screen feature to do a live shared presentation with Joe and Anita. And I don't have to present just one report. During the live presentation, I can close a report, open a different one, perform interactions, and move around in the app between different reports.
And here’s the real beauty of this feature. Neither Joe nor Anita need to have the SAS Visual Analytics app on a mobile device, or SAS Visual Analytics running on a desktop. The only requirement for participants is that they have a mobile device (could be an iOS, Android, or Windows device), a laptop, or a desktop system. Internet access via Wi-Fi (or a cellular connection for mobile devices), an email client for receiving an email notification, and a Web browser are necessary. VPN connectivity might be required if the participants' organization requires VPN.
Plus, you can conduct your live presentation for up to 10 people! Before we move on, note that you can also use the iOS feature, AirDrop, on your iOS device to engage participants with the Present Screen feature. This is useful if you’re in a room with a bunch of folks who have iOS devices, and you want to do live sharing.
Ready to try it? Here’s a short checklist of what you need to do a live presentation of SAS Visual Analytics reports.
Requirements for the presenter
- Use any one of these devices to present your screen for a shared presentation to your participants: iPad, iPhone, Android tablet or smartphone
- SAS Visual Analytics app installed on your device
- Connection via Wi-Fi or cellular to the server where the report(s) reside
- Subscription to the reports that you want to present and share
- Email client on the iPad or phone
- VPN connectivity if your organization requires it
Couple of things to note
Say you're presenting to 10 participants via email or AirDrop. Note that as the presenter, you must have the SAS Visual Analytics app open in your device with the Present Screen feature selected and active for your guests to be able to see your reports. Once you exit the app, the screen presentation session ends and the email or AirDrop invitations are no longer valid.
And a note on participant information. After a participant forwards an email invitation to view the presentation to a colleague, the recipient is required to enter a name and email address before joining your live screen presentation. You, as the presenter, get to see the names of the folks that have joined your screen presentation. Participant names and email addresses simply let the presenter know who has joined the presentation. Neither the participants' names nor their email addresses are validated by the SAS Visual Analytics server.
Supported versions for SAS Visual Analytics reports
You can present and share reports with the Present Screen feature for these versions of SAS Visual Analytics:
7.3, 7.4, 8.1, 8.2, 8.3
How to start the screen presentation
In the Analytics report I'm subscribed to from the SAS Visual Analytics app on my iPad, I choose Present Screen.
I am reminded that I can present my screen to a maximum of 10 participants. I click OK.
The app prompts me to send email or choose AirDrop to present my screen to participants - I choose email.
My email client opens, and the app includes instructions along with the link that takes participants to my live screen presentation. I enter the email addresses for my participants and send the email.
In the email that Anita receives on her desktop PC, she taps on the link which takes her to her Web browser where my screen presentation is set to start soon.
Anita enters her name and email address, and notes that the presentation has not yet started.
Joe, who has logged on to the presentation from his Android phone, is also presented with the same message on his smartphone.
To begin the screen presentation, I tap on the blinking cursor in the app.
The app reminds me that now my participants can see everything on my iPad screen.
Next, a blue bar at the top of the report indicates that my screen presentation is live and can be seen by Anita and Joe. Now I can begin my report presentation, or exit this report and open a different report to share.
Here is my screen on the iPad Mini where I started my screen presentation:
Anita's screen on her Windows desktop monitor:
Joe's screen on the Android smartphone:
When I am finished with the presentation, I tap on the 'stop' button to end it.
A message displays to indicate that the presentation has ended. Here's an example of that message from Joe's Android smartphone.
It is sometimes necessary for researchers to simulate data with thousands of variables. It is easy to simulate thousands of uncorrelated variables, but more difficult to simulate thousands of correlated variables. For that, you can generate a correlation matrix that has special properties, such as a Toeplitz matrix or a first-order autoregressive (AR(1)) correlation matrix. I have previously written about how to generate a large Toeplitz matrix in SAS. This article describes three useful results for an AR(1) correlation matrix:
- How to generate an AR(1) correlation matrix in the SAS/IML language
- How to use a formula to compute the explicit Cholesky root of an AR(1) correlation matrix.
- How to efficiently simulate multivariate normal variables with AR(1) correlation.
Generate an AR(1) correlation matrix in SAS
The AR(1) correlation structure is used in statistics to model observations that have correlated errors. (For example, see the documentation of PROC MIXED in SAS.) If Σ is an AR(1) correlation matrix, then its elements are constant along diagonals. The (i,j)th element of an AR(1) correlation matrix has the form Σ[i,j] = ρ^|i – j|, where ρ is a constant that determines the geometric rate at which correlations decay between successive time intervals. The exponent for each term is the L1 distance between the row number and the column number. As I have shown in a previous article, you can use the DISTANCE function in SAS/IML 14.3 to quickly evaluate functions that depend on the distance between two sets of points. Consequently, the following SAS/IML function computes the correlation matrix for a p x p AR(1) matrix:
proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, p);
   return rho##distance(T(1:p), T(1:p), "L1");
finish;

/* test on 10 x 10 matrix with rho = 0.8 */
rho = 0.8;  p = 10;
Sigma = AR1Corr(rho, p);
print Sigma[format=Best7.];
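As a quick sanity check (my own illustration, not part of the article's program), the same construction in NumPy raises ρ to the |i − j| distance between row and column indices:

```python
import numpy as np

# Sigma[i, j] = rho ** |i - j|, mirroring the DISTANCE-based SAS/IML construction
def ar1_corr(rho, p):
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])   # L1 distance of indices

Sigma = ar1_corr(0.8, 10)
print(round(Sigma[0, 2], 2))   # 0.64
```

The matrix is symmetric with ones on the diagonal, and the correlations decay geometrically as you move away from the diagonal.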
A formula for the Cholesky root of an AR(1) correlation matrix
Every covariance matrix has a Cholesky decomposition, which represents the matrix as the crossproduct of a triangular matrix with itself: Σ = RᵀR, where R is upper triangular. In SAS/IML, you can use the ROOT function to compute the Cholesky root of an arbitrary positive definite matrix. However, the AR(1) correlation matrix has an explicit formula for the Cholesky root in terms of ρ. This explicit formula does not appear to be well known by statisticians, but it is a special case of a general formula developed by V. Madar (Section 5.1, 2016), who presented a poster at a Southeast SAS Users Group (SESUG) meeting a few years ago. An explicit formula means that you can compute the Cholesky root matrix, R, in a direct and efficient manner, as follows:
/* direct computation of Cholesky root for an AR(1) matrix */
start AR1Root(rho, p);
   R = j(p,p,0);                /* allocate p x p matrix */
   R[1,] = rho##(0:p-1);        /* formula for 1st row */
   c = sqrt(1 - rho**2);        /* scaling factor: c^2 + rho^2 = 1 */
   R2 = c * R[1,];              /* formula for 2nd row */
   do j = 2 to p;               /* shift elements in 2nd row for remaining rows */
      R[j, j:p] = R2[,1:p-j+1];
   end;
   return R;
finish;

R = AR1Root(rho, p);   /* compare to R = root(Sigma), which requires forming Sigma */
print R[L="Cholesky Root" format=Best7.];
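You can verify the explicit formula numerically. The following NumPy sketch (my illustration) builds R the same way and checks that RᵀR reproduces the AR(1) matrix and agrees with an ordinary Cholesky factorization:

```python
import numpy as np

def ar1_root(rho, p):
    """Explicit upper-triangular Cholesky root of a p x p AR(1) correlation matrix."""
    R = np.zeros((p, p))
    R[0, :] = rho ** np.arange(p)        # first row: 1, rho, rho^2, ...
    c = np.sqrt(1 - rho ** 2)            # scaling factor: c^2 + rho^2 = 1
    row2 = c * R[0, :]
    for j in range(1, p):                # remaining rows are shifted copies of row 2
        R[j, j:] = row2[: p - j]
    return R

rho, p = 0.8, 10
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) correlation matrix
R = ar1_root(rho, p)
print(np.allclose(R.T @ R, Sigma))       # True
```

Because the Cholesky root with a positive diagonal is unique, this R is exactly the factor that a general-purpose routine would compute, only obtained without any factorization work.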
You can compute an AR(1) covariance matrix from the correlation matrix by multiplying the correlation matrix by a positive scalar, σ2.
Efficient simulation of multivariate normal variables with AR(1) correlation
An efficient way to simulate data from a multivariate normal population with covariance Σ is to use the Cholesky decomposition to induce correlation among a set of uncorrelated normal variates. This is the technique used by the RandNormal function in SAS/IML software. Internally, the RandNormal function calls the ROOT function, which can compute the Cholesky root of an arbitrary positive definite matrix.
When there are thousands of variables, the Cholesky decomposition might take a second or more. If you call the RandNormal function thousands of times during a simulation study, you pay that one-second penalty during each call. For the AR(1) covariance structure, you can use the explicit formula for the Cholesky root to save a considerable amount of time. You also do not need to explicitly form the p x p matrix, Σ, which saves RAM. The following SAS/IML function is an efficient way to simulate N observations from a p-dimensional multivariate normal distribution that has an AR(1) correlation structure with parameter ρ:
/* simulate multivariate normal data from a population with AR(1) correlation */
start RandNormalAR1( N, Mean, rho );
   mMean = rowvec(Mean);
   p = ncol(mMean);
   U = AR1Root(rho, p);       /* use explicit formula instead of ROOT(Sigma) */
   Z = j(N,p);
   call randgen(Z,'NORMAL');
   return (mMean + Z*U);
finish;

call randseed(12345);
p = 1000;                     /* big matrix */
mean = j(1, p, 0);            /* mean of MVN distribution */

/* simulate data from MVN distribs with different values of rho */
v = do(0.01, 0.99, 0.01);     /* choose rho from list 0.01, 0.02, ..., 0.99 */
t0 = time();                  /* time it! */
do i = 1 to ncol(v);
   rho = v[i];
   X = randnormalAR1(500, mean, rho);   /* simulate 500 obs from MVN with p vars */
end;
t_SimMVN = time() - t0;       /* total time to simulate all data */
print t_SimMVN;
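The same shortcut works in any matrix language. A NumPy sketch of the idea (my illustration; names and parameters are arbitrary) forms the explicit root once and multiplies it into a matrix of uncorrelated normals:

```python
import numpy as np

rng = np.random.default_rng(12345)

def rand_normal_ar1(n, mean, rho):
    """Simulate n draws from a multivariate normal distribution with AR(1)
    correlation, using the explicit Cholesky root instead of factoring Sigma."""
    mean = np.atleast_1d(mean)
    p = mean.size
    R = np.zeros((p, p))                 # explicit AR(1) Cholesky root
    R[0, :] = rho ** np.arange(p)
    row2 = np.sqrt(1 - rho ** 2) * R[0, :]
    for j in range(1, p):
        R[j, j:] = row2[: p - j]
    Z = rng.standard_normal((n, p))      # uncorrelated standard normals
    return mean + Z @ R                  # induce the AR(1) correlation

X = rand_normal_ar1(500, np.zeros(1000), 0.6)
print(X.shape)   # (500, 1000)
```

Because each row z of Z has identity covariance, the rows of ZR have covariance RᵀR = Σ, which is exactly the correlation-inducing trick the article describes. Note that Σ itself is never formed.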
The previous loop generates a sample that contains N=500 observations and p=1000 variables. Each sample is from a multivariate normal distribution that has an AR(1) correlation, but each sample is generated for a different value of ρ, where ρ = 0.01, 0.02, ..., 0.99. On my desktop computer, this simulation of 99 correlated samples takes about 4 seconds. This is about 25% of the time for the same simulation that explicitly forms the AR(1) correlation matrix and calls RandNormal.
In summary, the AR(1) correlation matrix is an easy way to generate a symmetric positive definite matrix. You can use the DISTANCE function in SAS/IML 14.3 to create such a matrix, but for some applications you might only require the Cholesky root of the matrix. The AR(1) correlation matrix has an explicit Cholesky root that you can use to speed up simulation studies such as generating samples from a multivariate normal distribution that has an AR(1) correlation.
The post Fast simulation of multivariate normal data with an AR(1) correlation structure appeared first on The DO Loop.
For those new to SAS Customer Intelligence 360, its functionality can seem vast, but I found it easy to jump in and get started. With this cloud-based software you can combine multiple data sources to create a full picture of your site visitors and better know your customers. SAS Customer [...]
Over the years, the US has drilled for crude oil in several locations, such as Pennsylvania, Texas, Alaska, and the Gulf of Mexico. A few years ago, as the US started drilling more in North Dakota, there were forecasts that we would surpass Saudi Arabia in crude oil production. And recently, [...]
Once a disaster is over, and the frenzy of news stories and social media posts has subsided, it can seem like the crisis has passed. However, for those Hurricane Florence survivors left with ruined homes and businesses, damaged schools and buildings, there remains a struggle to return to normalcy. As [...]