252013
 
Perhaps no other sector has more massive volumes of text data than government.  During the SAS Government Leadership Summit earlier this week, different agencies discussed big data topics, trends and techniques.  And while diverse elements of focus were apparent, shared themes emerged that Juan Zarate, former Deputy Assistant to the [...]
252013
 
This week's SAS tip is from Ron Cody and his latest book Cody's Collection of Popular SAS Programming Tasks and How to Tackle Them. Learn more about this esteemed user and his many bestselling SAS books--as well as get additional bonus content on his author page. The following excerpt is from [...]
242013
 

At SAS, one of our core values is to be swift and agile. So it makes sense that our software development be Agile too. The Agile methodology has been around for more than 10 years and was designed with software development in mind. Today, it is still used predominantly for this purpose, but it is gaining momentum in other circles as well. In this Wall Street Journal article, parents even confess to bringing these ideas home and implementing Agile with their children.

Within SAS, divisions and teams use Agile in different ways; its nimbleness allows for varying degrees of adoption or implementation. For example, since 2008, R&D teams have been using Agile for requirements management, product implementation and project tracking. Planning is simplified with Agile. Potential product or solution features, known as stories, are ranked and scheduled in a release plan. Then they are assigned to an iteration or sprint (usually 2-4 weeks in length) in the development life cycle.

One of the main benefits of Agile development is that it fosters more cross-functional activity and collaboration, which in turn brings a higher level of engagement and commitment. For example, for any given iteration, developers, testers and product managers work closely together and share updates several times a week in meetings called scrums. At the end of each iteration, all stakeholders sign off on features, and demos are given to internal or external stakeholders, showing the software in proper working condition.

Agile methodology was highlighted as a key development strategy at the Business Intelligence Development Roundtable at SAS Global Forum 2013. Business Intelligence R&D teams adopted Agile when they set out to create SAS’ data visualization software, SAS® Visual Analytics. Developing a new product like Visual Analytics required a nimble approach. Agile’s short, time-bound iterations allowed SAS product developers to incorporate user feedback more quickly and focus more on what worked.

In 2012, senior project managers at SAS surveyed SAS development teams to determine the effectiveness of Agile approaches. Seventy-eight percent of respondents said they would recommend Agile to another team. The survey reported that stronger adoption of Agile methods drives higher productivity and a deeper sense of engagement among teams.

In April 2013, Tim Arthur, Agile champion within SAS Research & Development, presented the results of that survey at the Stanford Strategic Execution Conference. Arthur heard from several Silicon Valley companies that SAS is ahead of the curve and viewed as a model for Agile adoption and scaling. If you're interested in learning more about SAS' approach, you can download Arthur's white paper Agile Adoption: Measuring its Worth.

Does your organization use Agile?  Leave a comment to tell us about your experience or ask for more information about SAS' implementation of Agile.

tags: SAS Programmers
242013
 

In my article "Simulation in SAS: The slow way or the BY way," I showed how to use BY-group processing rather than a macro loop in order to efficiently analyze simulated data with SAS. In the example, I analyzed the simulated data by using PROC MEANS, and I used the NOPRINT option to suppress the ODS output that the procedure would normally produce.

About 50 SAS/STAT procedures support the NOPRINT option in the PROC statement. When you specify the NOPRINT option, ODS is temporarily disabled while the procedure runs. This prevents SAS from displaying tables and graphs that would otherwise be produced for each BY group. For a simulation that computes statistics for thousands of BY groups, suppressing the display of tables results in a substantial savings of time.

Newer SAS procedures do not always support a NOPRINT option. However, you can still suppress the ODS output. The following macros encapsulate statements that turn the ODS system off and on. I call the %ODSOff macro before I start the BY-group analysis; I call the %ODSOn macro after the analysis completes.

%macro ODSOff(); /* Call prior to BY-group processing */
ods graphics off;
ods exclude all;
ods noresults;
%mend;
 
%macro ODSOn(); /* Call after BY-group processing */
ods graphics on;
ods exclude none;
ods results;
%mend;

For example, if I were using PROC ROBUSTREG to analyze many samples of simulated data, I might use the following pseudo-code:

%ODSOff
proc robustreg data=MySimData;
   BY SampleID;
   model y = x;
   ods output ParameterEstimates = OutputStats;  /* <== insert name of ODS table */
run;
%ODSOn

Even though ODS is suppressed to the display destinations (such as LISTING and HTML), you can capture the statistics that result from each analysis by using an ODS OUTPUT statement, which saves an ODS table to a SAS data set. Other ways to save statistics include using an OUTPUT statement, an OUT= or OUTEST= data set, and so forth.

Be aware that some SAS procedures (such as PROC MIXED) write a NOTE to the SAS log as part of their normal operation. The NOTE might say something like "NOTE: Convergence criteria met." For these procedures, you will also want to turn off notes, lest they fill the SAS log:

%ODSOff
options nonotes;  /* use NONOTES to suppress notes to the log */
proc mixed ...;
model y = ...;
run;
options notes;   /* turn NOTES back on */
%ODSOn

The material in this blog post is taken from my book Simulating Data with SAS, which contains many more tips and techniques for the efficient simulation of data.

tags: Sampling and Simulation, Tips and Techniques
242013
 
You had to be there … well, maybe you didn’t! Major industry events – conventions, trade shows, etc. – are the rock concerts of the corporate world. And similar to rock concerts, attendees get a memorable shared experience while their friends back home get, at best, a decent description or, [...]
232013
 
What do I mean by “analytic workbench?” Basically, the compute-resource environment with which data analysis takes place. How would you describe some of the analytic workbenches in your organization? Not everyone is a power analyst, so not everyone requires power tools. But all of us deal with data at some [...]
232013
 

The algorithm proceeds as follows:

Input:
n d-dimensional patterns;
k - initial number of clusters;
N - number of clusterings;
t - threshold.

Output: Data partitioning.

Initialization: Set co_assoc to a null n × n matrix.

1. Do N times:
1.1. Randomly select k cluster centers.
1.2. Run the K-means algorithm with the above initialization and produce a partition P.
1.3. Update the co-association matrix: for each pattern pair (i, j) in the same cluster in P, set co_assoc(i, j) = co_assoc(i, j) + 1/N.
2. Detect consistent clusters in the co-association matrix using a single-link (SL) technique:
2.1. Find majority voting associations: for each pattern pair (i, j) such that co_assoc(i, j) > t, merge the patterns in the same cluster; if the patterns were in distinct previously formed clusters, join the clusters.
2.2. For each remaining pattern not included in a cluster, form a single-element cluster.
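Steps 1.3 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the original author's implementation: it assumes the N partitions from step 1 are already available (the hard-coded label lists below are hypothetical stand-ins for K-means runs), and the function names `co_association` and `consistent_clusters` are made up for this example.

```python
from itertools import combinations

def co_association(partitions):
    """Step 1.3: fraction of partitions in which each pair shares a cluster."""
    n_runs = len(partitions)
    n = len(partitions[0])
    co_assoc = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                co_assoc[i][j] += 1.0 / n_runs
                co_assoc[j][i] += 1.0 / n_runs
    return co_assoc

def consistent_clusters(co_assoc, t):
    """Step 2: join every pair with co_assoc > t; other patterns stay singletons."""
    n = len(co_assoc)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co_assoc[i][j] > t:
            parent[find(i)] = find(j)

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

# Three hypothetical partitions of five patterns (stand-ins for K-means runs)
partitions = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
co_assoc = co_association(partitions)
labels = consistent_clusters(co_assoc, t=0.5)
print(labels)
```

With t = 0.5, a pair is joined only if it shares a cluster in more than half of the partitions, which is the "majority voting" of step 2.1.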

 

Data collection

http://cs.joensuu.fi/sipu/datasets/

Aggregation http://cs.joensuu.fi/sipu/datasets/Aggregation.txt

Spiral http://cs.joensuu.fi/sipu/datasets/spiral.txt

 

Load packages

library(ggplot2)
library(cluster)
library(reshape)
library(clusterCrit)

 

Data preprocessing and exploration

dataAggregatione <- read.csv("Aggregation.txt", sep = "\t",
                             header = FALSE)
dataAggregationeScaled <- scale(dataAggregatione[, -3])  # standardize the data set
dataAggregatione <- data.frame(dataAggregationeScaled,
                               name = as.character(c(1:nrow(dataAggregationeScaled))))
rownames(dataAggregatione) <- dataAggregatione$name
ggplot(dataAggregatione, aes(V1, V2)) + geom_point()

dataSpiral <- read.csv("spiral.txt", sep = "\t", header = FALSE)
dataSpiralScaled <- scale(dataSpiral[, -3])  # standardize the data set
dataSpiral <- data.frame(dataSpiralScaled,
                         name = as.character(c(1:nrow(dataSpiralScaled))))
rownames(dataSpiral) <- dataSpiral$name
ggplot(dataSpiral, aes(V1, V2)) + geom_point()
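For readers unfamiliar with R's scale(): with its default arguments it centers each column to mean 0 and divides by the sample standard deviation (n - 1 denominator). A minimal Python sketch of the same transformation for one column follows; the helper name `standardize` and the sample values are hypothetical.

```python
from statistics import mean, stdev

def standardize(column):
    """Center to mean 0 and scale to unit sample standard deviation (n - 1)."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation, as used by R's scale()
    return [(x - m) / s for x in column]

v1 = [2.0, 4.0, 6.0, 8.0]  # hypothetical coordinate values
z = standardize(v1)
print([round(x, 3) for x in z])
```

Standardizing both coordinates puts them on a comparable scale, which matters because K-means relies on Euclidean distances.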




 

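Clustering with the k-means algorithm

How should K be chosen? Here K ranges from 2 to 50, and three cluster validity criteria are used: Dunn, Calinski-Harabasz, and Silhouette.

Reference on cluster validity criteria:

https://www.siam.org/proceedings/datamining/2009/dm09_067_vendraminl.pdf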
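To give a feel for what such a criterion measures, here is a small Python sketch of the mean silhouette width under its textbook definition with Euclidean distance. It is an illustration only; the implementation in the clusterCrit package may differ in details, and the function name and toy data are hypothetical.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette width over all points; assumes at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: silhouette is taken as 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated toy groups (hypothetical data)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

A labeling that matches the two groups scores close to 1, while a labeling that mixes them scores much lower; choosing k by this criterion means picking the k whose partition has the highest score.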

 

Select the best k according to each validity criterion

Aggregation data set

set.seed(1234)
# initialize the results matrix: one row per k, one column per criterion
vals <- matrix(rep(NA, 49 * 3), ncol = 3, dimnames = list(c(),
                                                          c("Dunn", "Calinski-Harabasz", "Silhouette")))
# iterate over the candidate values of k
for (k in 2:50) {
  cl <- kmeans(dataAggregatione[, c(1, 2)], k)  # cluster with k centers
  vals[(k - 1), 1] <- as.numeric(intCriteria(
    as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster,
    "Dunn"))
  vals[(k - 1), 2] <- as.numeric(intCriteria(
    as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster,
    "Calinski_Harabasz"))
  vals[(k - 1), 3] <- as.numeric(intCriteria(
    as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster,
    "Silhouette"))
}
vals <- data.frame(K = c(2:50), vals)
# for each criterion, look up the k with the best value
choosen_k <- matrix(c(vals[bestCriterion(vals[, 2], "Dunn"), "K"],
                      vals[bestCriterion(vals[, 3], "Calinski_Harabasz"), "K"],
                      vals[bestCriterion(vals[, 4], "Silhouette"), "K"]),
                    ncol = 3,
                    dimnames = list(c("Aggregation"),
                                    c("Dunn", "Calinski_Harabasz", "Silhouette")))
choosen_k

            Dunn Calinski_Harabasz Silhouette
Aggregation   46                45          4

 

Note that the different criteria select different values of k.

 

Spiral data set

set.seed(1234)
# initialize the results matrix: one row per k, one column per criterion
vals <- matrix(rep(NA, 49 * 3), ncol = 3, dimnames = list(c(),
                                                          c("Dunn", "Calinski-Harabasz", "Silhouette")))
for (k in 2:50) {
  cl <- kmeans(dataSpiral[, c(1, 2)], k)  # cluster with k centers
  vals[(k - 1), 1] <- as.numeric(intCriteria(
    as.matrix(dataSpiral[, c(1, 2)]), cl$cluster,
    "Dunn"))
  vals[(k - 1), 2] <- as.numeric(intCriteria(
    as.matrix(dataSpiral[, c(1, 2)]), cl$cluster,
    "Calinski_Harabasz"))
  vals[(k - 1), 3] <- as.numeric(intCriteria(
    as.matrix(dataSpiral[, c(1, 2)]), cl$cluster,
    "Silhouette"))
}
vals <- data.frame(K = c(2:50), vals)
# for each criterion, look up the k with the best value
choosen_k <- matrix(c(vals[bestCriterion(vals[, 2], "Dunn"), "K"],
                      vals[bestCriterion(vals[, 3], "Calinski_Harabasz"), "K"],
                      vals[bestCriterion(vals[, 4], "Silhouette"), "K"]),
                    ncol = 3,
                    dimnames = list(c("Spiral"),
                                    c("Dunn", "Calinski_Harabasz", "Silhouette")))
choosen_k

       Dunn Calinski_Harabasz Silhouette
Spiral   34                50         37

Visualizing the clustering results

kmeansResultsAggreation <- kmeans(x = dataAggregatione[, c(1, 2)],
                                  centers = 3)$cluster
dataAggregatione$clusterSimpleKmeans <- as.character(kmeansResultsAggreation)
# theme() replaces opts(), which is no longer available in current ggplot2
ggplot(dataAggregatione, aes(V1, V2)) +
  geom_point(aes(colour = clusterSimpleKmeans)) +
  theme(legend.position = "none")

 

The result shows that the clusters are not separated correctly; the problem areas are circled in red in the plot.

 

 

kmeansResultsSpiral <- kmeans(x = dataSpiral[, c(1, 2)],
                              centers = 37)$cluster
dataSpiral$clusterSimpleKmeans <- as.character(kmeansResultsSpiral)
# theme() replaces opts(), which is no longer available in current ggplot2
ggplot(dataSpiral, aes(V1, V2)) +
  geom_point(aes(colour = clusterSimpleKmeans)) +
  theme(legend.position = "none")

 

 

The Evidence Accumulation Clustering algorithm

The ensemble idea

createCoAssocMatrix <- function(Iter, rangeK, dataSet) {
  nV <- dim(dataSet)[1]
  CoAssoc <- matrix(rep(0, nV * nV), nrow = nV)

  for (j in 1:Iter) {
    # draw a random number of clusters from the range and run k-means
    jK <- sample(c(rangeK[1]:rangeK[2]), 1, replace = FALSE)
    jSpecCl <- kmeans(x = dataSet, centers = jK)$cluster
    CoAssoc_j <- matrix(rep(0, nV * nV), nrow = nV)
    # every pair of points in the same cluster gets a vote of 1/Iter
    for (i in unique(jSpecCl)) {
      indVenues <- which(jSpecCl == i)
      CoAssoc_j[indVenues, indVenues] <- CoAssoc_j[indVenues, indVenues] + (1/Iter)
    }
    CoAssoc <- CoAssoc + CoAssoc_j
  }
  return(CoAssoc)
}

 

eac <- function(Iter, rangeK, dataset, hcMethod = "single") {
  CoAssocSim <- createCoAssocMatrix(Iter, rangeK, dataset)

  # transform from similarity into distance matrix
  CoAssocDist <- 1 - CoAssocSim
  hclustM <- hclust(as.dist(CoAssocDist), method = hcMethod)
  # determine the cut: the merge height just below the largest gap
  # between successive merge heights
  cutValue <- hclustM$height[which.max(diff(hclustM$height))]
  return(cutree(hclustM, h = cutValue))
}
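The cut rule used in eac() picks the merge height that sits just below the largest jump between successive merge heights, so the dendrogram is cut where the clusters are most stable. A minimal Python sketch of that rule, using hypothetical merge heights and a made-up helper name `largest_gap_cut`:

```python
def largest_gap_cut(heights):
    """Return the merge height just below the largest jump between
    consecutive heights (heights are assumed sorted in ascending order)."""
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    i = gaps.index(max(gaps))  # first maximum, like R's which.max
    return heights[i]

# Hypothetical merge heights from clustering 6 points (5 merges)
heights = [0.05, 0.10, 0.12, 0.60, 0.65]
cut = largest_gap_cut(heights)

# Every merge at or below the cut is applied; each one removes one cluster
n_clusters = (len(heights) + 1) - sum(h <= cut for h in heights)
print(cut, n_clusters)
```

Here the largest jump is between 0.12 and 0.60, so the cut falls at 0.12 and three of the five merges are applied.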

 

EAC on the Aggregation data set

set.seed(1234)
EACResults_Aggregatione <- eac(Iter = 200, rangeK = c(2, 50),
                               dataset = dataAggregatione[, c(1, 2)], hcMethod = "single")
table(EACResults_Aggregatione)

dataAggregatione$clusterEAC <- as.character(EACResults_Aggregatione)
# theme() replaces opts(), which is no longer available in current ggplot2
ggplot(dataAggregatione, aes(V1, V2)) + geom_point(aes(colour = clusterEAC)) +
  theme(legend.position = "none")



 

set.seed(1234)
EACResults_Spiral <- eac(Iter = 200, rangeK = c(2, 50),
                         dataset = dataSpiral[, c(1, 2)], hcMethod = "single")
table(EACResults_Spiral)

dataSpiral$clusterEAC <- as.character(EACResults_Spiral)
# theme() replaces opts(), which is no longer available in current ggplot2
ggplot(dataSpiral, aes(V1, V2)) + geom_point(aes(colour = clusterEAC)) +
  theme(legend.position = "none")

 

 


As the plots above show, the EAC algorithm performs better than k-means and largely succeeds in separating the data.

 

As the saying goes, three cobblers together can match Zhuge Liang (many heads are better than one), haha.


Source: R-bloggers


 


Posted at 8:09 AM