At SAS, one of our core values is to be swift and agile. So it makes sense that our software development be Agile too. The Agile methodology has been around for more than 10 years and was designed with software development in mind. Today, it is still used predominantly for this purpose, but it is gaining momentum in other circles as well. In a Wall Street Journal article, parents even confess to bringing these ideas home and implementing Agile with their children.
Within SAS, divisions and teams use Agile in different ways; its nimbleness allows for varying degrees of adoption or implementation. For example, since 2008, R&D teams have been using Agile for requirements management, product implementation and project tracking. Planning is simplified with Agile. Potential product or solution features, known as stories, are ranked and scheduled in a release plan. Then they are assigned to an iteration or sprint (usually 2-4 weeks in length) in the development life cycle.
One of the main benefits of Agile development is that it fosters more cross-functional activity and collaboration, which in turn leads to a higher level of engagement and commitment. For example, for any given iteration, developers, testers and product managers work closely together and share updates several times a week in meetings called scrums. At the end of each iteration, all stakeholders sign off on features, and demos are given to internal or external stakeholders, showing the software in proper working condition.
Agile methodology was highlighted as a key development strategy at the Business Intelligence Development Roundtable at SAS Global Forum 2013. Business Intelligence R&D teams adopted Agile when they set out to create SAS’ data visualization software, SAS® Visual Analytics. Developing a new product like Visual Analytics required a nimble approach. Agile’s short, time-bound iterations allowed SAS product developers to incorporate user feedback more quickly and focus more on what worked.
In 2012, senior project managers at SAS surveyed SAS development teams to determine the effectiveness of Agile approaches. Seventy-eight percent of respondents said they would recommend Agile to another team. The survey reported that stronger adoption of Agile methods drives higher productivity and a deeper sense of engagement among teams.
In April 2013, Tim Arthur, Agile champion within SAS Research & Development, presented the results of that survey at the Stanford Strategic Execution Conference. Arthur heard from several Silicon Valley companies that SAS is ahead of the curve and viewed as a model for Agile adoption and scaling. If you're interested in learning more about SAS' approach, you can download Arthur's white paper Agile Adoption: Measuring its Worth.
Does your organization use Agile? Leave a comment to tell us about your experience or ask for more information about SAS' implementation of Agile.
In my article "Simulation in SAS: The slow way or the BY way," I showed how to use BY-group processing rather than a macro loop in order to efficiently analyze simulated data with SAS. In the example, I analyzed the simulated data by using PROC MEANS, and I used the NOPRINT option to suppress the ODS output that the procedure would normally produce.
About 50 SAS/STAT procedures support the NOPRINT option in the PROC statement. When you specify the NOPRINT option, ODS is temporarily disabled while the procedure runs. This prevents SAS from displaying tables and graphs that would otherwise be produced for each BY group. For a simulation that computes statistics for thousands of BY groups, suppressing the display of tables results in a substantial savings of time.
Newer SAS procedures do not always support a NOPRINT option. However, you can still suppress the ODS output. The following macros encapsulate statements that turn the ODS system off and on. I call the %ODSOff macro before I start the BY-group analysis; I call the %ODSOn macro after the analysis completes.
%macro ODSOff(); /* Call prior to BY-group processing */
   ods graphics off;
   ods exclude all;
   ods noresults;
%mend;

%macro ODSOn(); /* Call after BY-group processing */
   ods graphics on;
   ods exclude none;
   ods results;
%mend;
For example, if I were using PROC ROBUSTREG to analyze many samples of simulated data, I might use the following pseudo-code:
%ODSOff
proc robustreg data=MySimData;
   BY SampleID;
   model y = x;
   ods output ParameterEstimates = OutputStats; /* <== insert name of ODS table */
run;
%ODSOn
Even though output to the display destinations (such as LISTING and HTML) is suppressed, you can capture the statistics that result from each analysis by using an ODS OUTPUT statement, which saves an ODS table to a SAS data set. Other ways to save statistics include using an OUTPUT statement, an OUT= or OUTEST= data set, and so forth.
Be aware that some SAS procedures (such as PROC MIXED) write a NOTE to the SAS log as part of their normal operation. The NOTE might say something like "NOTE: Convergence criteria met." For these procedures, you will also want to turn off notes, lest they fill the SAS log:
%ODSOff
options nonotes; /* use NONOTES to suppress notes to the log */
proc mixed ...;
   model y = ...;
run;
options notes; /* turn NOTES back on */
%ODSOn
The material in this blog post is taken from my book Simulating Data with SAS, which contains many more tips and techniques for the efficient simulation of data.
The algorithm proceeds as follows:
Input:
n d-dimensional patterns;
k - initial number of clusters;
N - number of clusterings;
t - threshold.
Output: data partitioning.
Initialization: set co_assoc to a null n × n matrix.
1. Do N times:
1.1. Randomly select k cluster centers.
1.2. Run the K-means algorithm with the above initialization and produce a partition P.
1.3. Update the co-association matrix: for each pattern pair (i, j) in the same cluster in P, set co_assoc(i, j) = co_assoc(i, j) + 1/N.
2. Detect consistent clusters in the co-association matrix using a single-link (SL) technique:
2.1. Find majority voting associations: for each pattern pair (i, j) such that co_assoc(i, j) > t, merge the patterns into the same cluster; if the patterns were in distinct previously formed clusters, join the clusters.
2.2. For each remaining pattern not included in a cluster, form a single-element cluster.
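The steps above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the implementation used later in this post: the helper names (`nearest`, `kmeans`, `co_association`, `consistent_clusters`), the toy data, and the choice of a union-find structure for the merging in step 2 are all assumptions made for the example.

```python
import random


def nearest(p, centers):
    """Index of the center closest to p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))


def kmeans(points, k, rng, max_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)          # step 1.1: random initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:             # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):                   # move each center to its members' mean
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                      # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def co_association(points, k, n_runs, rng):
    """Step 1: fraction of the n_runs partitions in which each pair co-occurs."""
    n = len(points)
    co_assoc = [[0.0] * n for _ in range(n)]
    for _ in range(n_runs):
        labels = kmeans(points, k, rng)      # step 1.2
        for i in range(n):                   # step 1.3
            for j in range(n):
                if labels[i] == labels[j]:
                    co_assoc[i][j] += 1.0 / n_runs
    return co_assoc


def consistent_clusters(co_assoc, t):
    """Step 2: join every pair with co_assoc > t; unjoined patterns stay singletons."""
    n = len(co_assoc)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if co_assoc[i][j] > t:
                parent[find(i)] = find(j)
    ids = {}                                 # relabel roots as 0, 1, 2, ... by first appearance
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Toy data (hypothetical): two tight, well-separated groups of three points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (10.0, 10.0), (10.3, 10.1), (10.1, 10.4)]
co = co_association(points, k=2, n_runs=10, rng=random.Random(1234))
labels = consistent_clusters(co, t=0.5)
print(labels)
```

Joining every pair whose co-association exceeds t is equivalent to cutting a single-link dendrogram built on 1 - co_assoc at height 1 - t, which is why the R code later in this post can use `hclust` with `method = "single"`.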
Data collection
http://cs.joensuu.fi/sipu/datasets/
Aggregation http://cs.joensuu.fi/sipu/datasets/Aggregation.txt
Spiral http://cs.joensuu.fi/sipu/datasets/spiral.txt
Loading packages
library(ggplot2)
library(cluster)
library(reshape)
library(clusterCrit)
Data preprocessing and exploration
dataAggregatione <- read.csv("Aggregation.txt", sep = "\t",
header = FALSE)
dataAggregationeScaled <- scale(dataAggregatione[, -3]) # standardize the data set
dataAggregatione <- data.frame(dataAggregationeScaled,
name = as.character(c(1:nrow(dataAggregationeScaled))))
rownames(dataAggregatione) <- dataAggregatione$name
ggplot(dataAggregatione, aes(V1, V2)) + geom_point()
dataSpiral <- read.csv("spiral.txt", sep = "\t", header = FALSE)
dataSpiralScaled <- scale(dataSpiral[, -3]) # normalize data
dataSpiral <- data.frame(dataSpiralScaled,
name = as.character(c(1:nrow(dataSpiralScaled))))
rownames(dataSpiral) <- dataSpiral$name
ggplot(dataSpiral, aes(V1, V2)) + geom_point()
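For readers unfamiliar with R's `scale()`, its default behavior is to center each column on its mean and divide by the column's sample standard deviation (n - 1 denominator). A minimal Python sketch of that transformation, with a hypothetical `standardize` helper and made-up rows, might look like this:

```python
import statistics


def standardize(rows):
    """Column-wise z-scores: subtract the mean, divide by the sample std. dev."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]   # n - 1 denominator, like R's sd()
    return [tuple((x - m) / s for x, m, s in zip(row, means, sds)) for row in rows]


rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 60.0)]  # made-up data
scaled = standardize(rows)
print(scaled)
```

Standardizing matters here because k-means relies on Euclidean distance, so a column with a larger numeric range would otherwise dominate the clustering.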
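Clustering with the k-means algorithm
How should K be chosen? Here K ranges from 2 to 50, and three cluster validity criteria are used to evaluate each clustering: Dunn, Calinski-Harabasz, and Silhouette.
Reference on cluster validity indices: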
https://www.siam.org/proceedings/datamining/2009/dm09_067_vendraminl.pdf
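To make one of these criteria concrete, here is a small sketch of the silhouette width in standard-library Python. It uses the common definition (the mean over all points of (b - a) / max(a, b)); the `silhouette` function name and the toy data are assumptions for illustration, and the value may not match what `clusterCrit` reports, since packages differ in how they average the per-point widths.

```python
import math


def silhouette(points, labels):
    """Mean silhouette width over all points (higher means better separation).

    Assumes at least two clusters and no cluster made up only of duplicates.
    """
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:              # singleton cluster: width taken as 0
            widths.append(0.0)
            continue
        a = mean_dist[own]                                     # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)   # separation
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]   # two obvious groups
good = silhouette(points, [0, 0, 1, 1])       # groups kept together
bad = silhouette(points, [0, 1, 0, 1])        # groups split across clusters
print(good, bad)
```

A partition that keeps the two groups intact scores close to 1, while one that splits them scores below 0, which is the behavior the criterion is meant to capture.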
Selecting the best k according to each criterion
For the Aggregation data set
set.seed(1234)
# initialize the matrix of criterion values
vals <- matrix(rep(NA, 49 * 3), ncol = 3, dimnames = list(c(),
c("Dunn", "Calinski-Harabasz", "Silhouette")))
# iterate over the candidate values of k
for (k in 2:50) {
cl <- kmeans(dataAggregatione[, c(1, 2)], k) # cluster
vals[(k - 1), 1] <- as.numeric(intCriteria(
as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster,
"Dunn"))
vals[(k - 1), 2] <- as.numeric(intCriteria(
as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster,
"Calinski_Harabasz"))
vals[(k - 1), 3] <- as.numeric(intCriteria(
as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster,
"Silhouette"))
}
vals <- data.frame(K = c(2:50), vals)
choosen_k <- matrix(c(vals[bestCriterion(vals[, 2], "Dunn"), "K"],
vals[bestCriterion(vals[, 3], "Calinski_Harabasz"), "K"],
vals[bestCriterion(vals[, 4], "Silhouette"), "K"]),
ncol = 3,
dimnames = list(c("Aggregation"),
c("Dunn", "Calinski_Harabasz", "Silhouette")))
choosen_k
            Dunn Calinski_Harabasz Silhouette
Aggregation   46                45          4
As you can see, different criteria select different values of k.
For the Spiral data set
set.seed(1234)
vals <- matrix(rep(NA, 49 * 3), ncol = 3, dimnames = list(c(),
c("Dunn", "Calinski-Harabasz", "Silhouette")))
for (k in 2:50) {
cl <- kmeans(dataSpiral[, c(1, 2)], k)
vals[(k - 1), 1] <- as.numeric(intCriteria(
as.matrix(dataSpiral[, c(1, 2)]), cl$cluster,
"Dunn"))
vals[(k - 1), 2] <- as.numeric(intCriteria(
as.matrix(dataSpiral[, c(1, 2)]), cl$cluster,
"Calinski_Harabasz"))
vals[(k - 1), 3] <- as.numeric(intCriteria(
as.matrix(dataSpiral[, c(1, 2)]), cl$cluster,
"Silhouette"))
}
vals <- data.frame(K = c(2:50), vals)
choosen_k <- matrix(c(vals[bestCriterion(vals[, 2], "Dunn"), "K"],
vals[bestCriterion(vals[, 3], "Calinski_Harabasz"), "K"],
vals[bestCriterion(vals[, 4], "Silhouette"), "K"]),
ncol = 3,
dimnames = list(c("Spiral"),
c("Dunn", "Calinski_Harabasz", "Silhouette")))
choosen_k
       Dunn Calinski_Harabasz Silhouette
Spiral   34                50         37
Visualizing the clustering results
kmeansResultsAggreation <- kmeans(x = dataAggregatione[, c(1, 2)],
centers = 3)$cluster
dataAggregatione$clusterSimpleKmeans <- as.character(kmeansResultsAggreation)
ggplot(dataAggregatione, aes(V1, V2)) +
geom_point(aes(colour = clusterSimpleKmeans)) +
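theme(legend.position = "none") # opts() is defunct in current ggplot2; theme() replaces it
As the plot shows, the groups are not separated correctly (see the regions marked with red circles).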
kmeansResultsSpiral <- kmeans(x = dataSpiral[, c(1, 2)],
centers = 37)$cluster
dataSpiral$clusterSimpleKmeans <- as.character(kmeansResultsSpiral)
ggplot(dataSpiral, aes(V1, V2)) +
geom_point(aes(colour = clusterSimpleKmeans)) +
theme(legend.position = "none")
The Evidence Accumulation Clustering (EAC) algorithm
The ensemble idea
createCoAssocMatrix <- function(Iter, rangeK, dataSet) {
nV <- dim(dataSet)[1]
CoAssoc <- matrix(rep(0, nV * nV), nrow = nV)
for (j in 1:Iter) {
jK <- sample(c(rangeK[1]:rangeK[2]), 1, replace = FALSE)
jSpecCl <- kmeans(x = dataSet, centers = jK)$cluster
CoAssoc_j <- matrix(rep(0, nV * nV), nrow = nV)
for (i in unique(jSpecCl)) {
indVenues <- which(jSpecCl == i)
CoAssoc_j[indVenues, indVenues] <- CoAssoc_j[indVenues, indVenues] + (1/Iter)
}
CoAssoc <- CoAssoc + CoAssoc_j
}
return(CoAssoc)
}
eac <- function(Iter, rangeK, dataset, hcMethod = "single") {
CoAssocSim <- createCoAssocMatrix(Iter, rangeK, dataset)
# transform the similarity matrix into a distance matrix
CoAssocDist <- 1 - CoAssocSim
hclustM <- hclust(as.dist(CoAssocDist), method = hcMethod)
# determine the cut
cutValue <- hclustM$height[which.max(diff(hclustM$height))]
return(cutree(hclustM, h = cutValue))
}
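The cut rule in `eac()` is compact, so a small illustration may help. As I read it, the function finds the largest jump between successive merge heights of the dendrogram and cuts at the height just below that jump. The following Python sketch mirrors that logic on a made-up list of heights; `cut_height` and the numbers are hypothetical and are not output from the data sets used here.

```python
def cut_height(heights):
    """Merge height just below the largest gap between successive merges.

    `heights` must be sorted in increasing order, as hclust merge heights are
    for single linkage.
    """
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    largest = max(range(len(gaps)), key=gaps.__getitem__)  # first maximum, like which.max
    return heights[largest]


heights = [0.05, 0.10, 0.12, 0.80, 0.95]   # made-up merge heights for 6 points
cut = cut_height(heights)
# Every merge at or below the cut is kept, and each merge removes one cluster.
n_clusters = (len(heights) + 1) - sum(h <= cut for h in heights)
print(cut, n_clusters)
```

The intuition is that a large gap in merge heights separates merges within natural groups (pairs that co-occur in most k-means runs) from merges between groups (pairs that rarely co-occur).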
EAC on the Aggregation data set
set.seed(1234)
EACResults_Aggregatione <- eac(Iter = 200, rangeK = c(2, 50),
dataset = dataAggregatione[, c(1, 2)], hcMethod = "single")
table(EACResults_Aggregatione)
dataAggregatione$clusterEAC <- as.character(EACResults_Aggregatione)
ggplot(dataAggregatione, aes(V1, V2)) + geom_point(aes(colour = clusterEAC)) +
theme(legend.position = "none")
set.seed(1234)
EACResults_Spiral <- eac(Iter = 200, rangeK = c(2, 50),
dataset = dataSpiral[, c(1, 2)], hcMethod = "single")
table(EACResults_Spiral)
dataSpiral$clusterEAC <- as.character(EACResults_Spiral)
ggplot(dataSpiral, aes(V1, V2)) + geom_point(aes(colour = clusterEAC)) +
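theme(legend.position = "none")
As the plots above show, the EAC algorithm performs better than plain k-means and largely separates the data into the correct groups.
As the Chinese saying goes, three cobblers together can match Zhuge Liang (many heads are better than one), haha.
Source: r-bloggers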
