
August 10, 2014
 
Thanks to Jared Hobbs' sas7bdat package, Python can read SAS data sets quickly and precisely. It also helps to have a few extension functions that connect this package to SQLite and pandas, as sketched below.
The benefits of transferring SAS libraries to SQLite:
  1. Size reduction:
    SAS's sas7bdat format is verbose. So far I have successfully loaded 40 GB of SAS data into SQLite with an 85% reduction in disk usage.
  2. Saving the cost of SAS/ACCESS:
    SAS/ACCESS costs around $8,000 a year for a server, while SQLite is accessible to most common software.
The benefits of transferring SAS data sets to pandas:
  1. Pandas' powerful Excel interface:
    Write very large Excel files quickly, as long as memory can hold the data.
  2. Validation of statistics:
    Pandas works well with statsmodels and scikit-learn, making it easy to validate SAS's outputs.
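A minimal sketch of both transfers, assuming a hypothetical file example.sas7bdat and relying only on sas7bdat's to_data_frame() plus pandas' to_sql() and to_excel():

import sqlite3
import pandas as pd
from sas7bdat import SAS7BDAT

# Read the SAS data set into a pandas DataFrame
with SAS7BDAT('example.sas7bdat') as f:
    df = f.to_data_frame()

# Transfer to SQLite: one table per SAS data set
con = sqlite3.connect('example.db')
df.to_sql('example', con, if_exists='replace', index=False)
con.close()

# Or write a very large Excel file, as long as memory can hold the data
df.to_excel('example.xlsx', index=False)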
May 29, 2014
 
These instructions are from: https://github.com/shk3/edx-downloader

1: Install youtube-dl (http://rg3.github.io/youtube-dl/download.html)

sudo wget https://yt-dl.org/downloads/2014.05.19/youtube-dl -O /usr/local/bin/youtube-dl
sudo chmod a+x /usr/local/bin/youtube-dl


2: Install beautifulsoup4

This step requires easy_install. If it is not installed on Ubuntu, run sudo apt-get install python-setuptools first.

After that, run easy_install beautifulsoup4 to install beautifulsoup4.

3: Download edx-downloader-master.zip and unzip it. Then run:
python edx-dl.py -u your_username -p your_password

And you will get:

Welcome Huiming Song
You can access 4 courses
1 - CS188.1x Artificial Intelligence -> Started
2 - TW3421x Credit Risk Management -> Started
3 - LFS101x Introduction to Linux -> Not yet
4 - MAS.S69x Big Data and Social Physics -> Started

Enter Course Number:

After you enter the number, it will begin to download!

CS188.1x Artificial Intelligence has 11 weeks so far
1 - Download Week 1 videos
2 - Download Week 2 videos
3 - Download Week 3 videos
4 - Download Week 4 videos
5 - Download Week 5 videos
6 - Download Week 6 videos
7 - Download Week 7 videos
8 - Download Week 8 videos
9 - Download Week 9 videos
10 - Download Week 10 videos
11 - Download Week 11 videos
12 - Download them all

 
Enter Your Choice: 12
 

Processing 'https://courses.edx.org/courses/BerkeleyX/CS188.1x/2013_Spring/courseware/Week_1/Welcome_to_CS188/'...
Processing 'https://courses.edx.org/courses/BerkeleyX/CS188.1x/2013_Spring/courseware/Week_1/Lecture_1_Introduction_to_AI/'...
Processing 'https://courses.edx.org/courses/BerkeleyX/CS188.1x/2013_Spring/courseware/Week_1/Python_Refresher/'...
Processing 'https://courses.edx.org/courses/BerkeleyX/CS188.1x/2013_Spring/courseware/Week_2/Lecture_2_Uninformed_Search/'...
Processing 'https://courses.edx.org/courses/BerkeleyX/CS188.1x/2013_Spring/courseware/Week_2/Lecture_2_Uninformed_Search_continued/'...
Processing 'https://courses.edx.org/courses/BerkeleyX/CS188.1x/2013_Spring/courseware/Week_2/Lecture_3_Informed_Search/'...
Processing 'https://courses.edx.org/courses/BerkeleyX/CS188.1x/2013_Spring/courseware/Week_2/Lecture_3_Informed_Search_continued/'...


Great!
May 27, 2014
 

mrjob is a Python library that can run map-reduce jobs on a local machine or a Hadoop cluster. More details are on GitHub. It enables you to write and run map-reduce jobs in Python.

The author gave an example of using mrjob to compute the Pearson correlation between pairs of movies. The original blog post is here.

The raw data is from GroupLens. It provides 100,000 ratings from 1,000 users on 1,700 movies. You can download the zip file; u.data and u.item are the files used here. u.data stores user_id, movie_id, rating, and timestamp. u.item stores movie_id and movie name.

First step: read in the data and reshape it into the following format (user_id|movie_name|rating):

196|Kolya (1996)|3
186|L.A. Confidential (1997)|3
22|Heavyweights (1994)|1
244|Legends of the Fall (1994)|2
166|Jackie Brown (1997)|1
298|Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)|4
115|Hunt for Red October, The (1990)|2
253|Jungle Book, The (1994)|5
305|Grease (1978)|3
6|Remains of the Day, The (1993)|3


The corresponding Python code is:

if __name__ == '__main__':
    user_items = []
    items = []

    with open('u.data') as f:
        for line in f:
            user_items.append(line.split('\t'))

    with open('u.item') as f:
        for line in f:
            items.append(line.split('|'))

    print 'user_items[0] = ', user_items[0]
    print 'items[0] = ', items[0]

    items_hash = {}
    for i in items:
        items_hash[i[0]] = i[1]

    print 'items_hash[1] = ', items_hash['1']

    for ui in user_items:
        ui[1] = items_hash[ui[1]]

    print 'user_items[0] = ', user_items[0]

    with open('rating.csv', 'w') as f:
        for ui in user_items:
            f.write(ui[0] + '|' + ui[1] + '|' + ui[2] + '\n')

A hash table (dictionary) is used to map movie_id to movie name.

Second step: aggregate the rating data by user_id. For each user, list all the movies the user rated and the corresponding ratings. The data format is:
"user_id"  [total_number_of_movies_rated, total_rating_score, [["movie_1", score_1], ["movie_2", score_2]]]

The real data looks like:

"100"   [58, 178.0, [["Air Force One (1997)", 4.0], ["Amistad (1997)", 4.0], ["Anna Karenina (1997)", 3.0], ["Apostle, The (1997)", 4.0], ["Apt Pupil (1998)", 5.0], ["As Good As It Gets (1997)", 5.0], ["Big Bang Theory, The (1994)", 4.0], ["Boogie Nights (1997)", 3.0], ["Career Girls (1997)", 1.0], ["Chairman of the Board (1998)", 1.0], ["Chasing Amy (1997)", 3.0], ["Conspiracy Theory (1997)", 4.0], ["Contact (1997)", 4.0], ["Dante's Peak (1997)", 3.0], ["Dark City (1998)", 4.0], ["Desperate Measures (1998)", 3.0], ["English Patient, The (1996)", 3.0], ["Eve's Bayou (1997)", 2.0], ["Evita (1996)", 3.0], ["Flubber (1997)", 2.0], ["Full Monty, The (1997)", 4.0], ["Full Speed (1996)", 2.0], ["G.I. Jane (1997)", 3.0], ["Game, The (1997)", 3.0], ["Gattaca (1997)", 3.0], ["Good Will Hunting (1997)", 4.0], ["Great Expectations (1998)", 3.0], ["Half Baked (1998)", 1.0], ["Hard Rain (1998)", 3.0], ["Jackal, The (1997)", 3.0], ["Jackie Brown (1997)", 3.0], ["Kull the Conqueror (1997)", 2.0], ["Kundun (1997)", 4.0], ["L.A. Confidential (1997)", 4.0], ["Liar Liar (1997)", 4.0], ["Life Less Ordinary, A (1997)", 3.0], ["Man Who Knew Too Little, The (1997)", 3.0], ["Money Talks (1997)", 1.0], ["Mother (1996)", 1.0], ["Other Voices, Other Rooms (1997)", 3.0], ["Peacemaker, The (1997)", 4.0], ["Phantoms (1998)", 2.0], ["Postman, The (1997)", 4.0], ["Rainmaker, The (1997)", 3.0], ["Replacement Killers, The (1998)", 4.0], ["Rosewood (1997)", 2.0], ["Scream (1996)", 2.0], ["Scream 2 (1997)", 2.0], ["Seven Years in Tibet (1997)", 4.0], ["Soul Food (1997)", 1.0], ["Sphere (1998)", 4.0], ["Starship Troopers (1997)", 3.0], ["Titanic (1997)", 5.0], ["Tomorrow Never Dies (1997)", 4.0], ["Twisted (1996)", 3.0], ["Volcano (1997)", 3.0], ["Wag the Dog (1997)", 4.0], ["Wedding Singer, The (1998)", 2.0]]]

The corresponding Python code is:

from mrjob.job import MRJob

class Step1(MRJob):

    def group_by_user_rating(self, key, line):
        user_id, item_id, rating = line.split('|')
        yield user_id, (item_id, float(rating))

    def count_ratings_users_freq(self, user_id, values):
        item_count = 0
        item_sum = 0
        final = []

        for item_id, rating in values:
            item_count += 1
            item_sum += rating
            final.append((item_id, rating))

        yield user_id, (item_count, item_sum, final)

    def steps(self):
        return [self.mr(mapper=self.group_by_user_rating,
                        reducer=self.count_ratings_users_freq), ]

if __name__ == '__main__':
    Step1.run()
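Assuming the script above is saved as step1.py and the rating.csv file from the first step sits in the same directory, the job can be run locally like this (the output file name is arbitrary):

python step1.py rating.csv > step1_output.txt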

Third step: calculate the correlations. Several similarity measures are listed in the blog post:

1) Pearson correlation:
corr(X, Y) = cov(X, Y) / sqrt( var(X) * var(Y) )

2) Cosine similarity = dotproduct(X, Y) / ( norm(X) * norm(Y) )

3) Jaccard = | intersection(X, Y) | / | union(X, Y) |
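For reference, a minimal plain-Python sketch of the three measures, assuming two equal-length rating vectors x and y (and, for Jaccard, two sets of rater IDs); the Pearson form mirrors the normalized_correlation method in the Step 2 code below:

from math import sqrt

def pearson(x, y):
    # corr(X, Y) = cov(X, Y) / sqrt(var(X) * var(Y)), written in summation form
    n = len(x)
    sum_x, sum_y = float(sum(x)), float(sum(y))
    sum_xx = sum(i * i for i in x)
    sum_yy = sum(i * i for i in y)
    sum_xy = sum(i * j for i, j in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator if denominator else 0.0

def cosine(x, y):
    # dotproduct(X, Y) / (norm(X) * norm(Y))
    dot = sum(i * j for i, j in zip(x, y))
    norm_x = sqrt(sum(i * i for i in x))
    norm_y = sqrt(sum(i * i for i in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def jaccard(x_raters, y_raters):
    # |intersection(X, Y)| / |union(X, Y)| over the sets of users who rated each movie
    x_raters, y_raters = set(x_raters), set(y_raters)
    return len(x_raters & y_raters) / float(len(x_raters | y_raters))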

The movie and rating data from the second step is used to generate rating vectors for each movie pair. For each user, all pairs of rating combinations are listed, so if user_1 rates 50 movies, there will be C(50, 2) pairs of ratings. This output is fairly large because of the combinations. Sample output is:

('101 Dalmatians (1996)', '12 Angry Men (1957)') (2.0, 5.0)
('101 Dalmatians (1996)', '20,000 Leagues Under the Sea (1954)') (2.0, 3.0)
('101 Dalmatians (1996)', '2001: A Space Odyssey (1968)') (2.0, 4.0)
('101 Dalmatians (1996)', 'Abyss, The (1989)') (2.0, 3.0)
('101 Dalmatians (1996)', 'Ace Ventura: Pet Detective (1994)') (2.0, 3.0)
('101 Dalmatians (1996)', 'Air Bud (1997)') (2.0, 1.0)
('101 Dalmatians (1996)', 'Akira (1988)') (2.0, 4.0)
('101 Dalmatians (1996)', 'Aladdin (1992)') (2.0, 4.0)
('101 Dalmatians (1996)', 'Alien (1979)') (2.0, 5.0)
('101 Dalmatians (1996)', 'Aliens (1986)') (2.0, 5.0)
('101 Dalmatians (1996)', 'All Dogs Go to Heaven 2 (1996)') (2.0, 1.0)
('101 Dalmatians (1996)', 'Amadeus (1984)') (2.0, 5.0)


For example, if 20 users rated both '101 Dalmatians (1996)' and '12 Angry Men (1957)', there will be an array of dimension 20*2 that lists those 20 users' ratings for the two movies.

After the rating vectors (arrays) are generated, the Pearson correlation can be calculated with the formula above. The final output looks like:

["Titanic (1997)", "Trial and Error (1997)"]    [0.11502250920209461, 15]
["Titanic (1997)", "Trial by Jury (1994)"]      [0.3273268353539886, 5]
["Titanic (1997)", "Trigger Effect, The (1996)"]        [0.17194746164433358, 17]
["Titanic (1997)", "True Crime (1995)"] [0.6123724356957946, 5]
["Titanic (1997)", "True Lies (1994)"]  [0.43510398213401047, 83]
["Titanic (1997)", "True Romance (1993)"]       [-0.08940478530189423, 43]
["Titanic (1997)", "Truman Show, The (1998)"]   [0.11298653657320641, 8]
["Titanic (1997)", "Trust (1990)"]      [0.1889822365046136, 9]
["Titanic (1997)", "Truth About Cats & Dogs, The (1996)"]       [0.2251764830596235, 102]
["Titanic (1997)", "Truth or Consequences, N.M. (1997)"]        [0, 1]
["Titanic (1997)", "Turbo: A Power Rangers Movie (1997)"]       [0.4999999999999999, 3]
["Titanic (1997)", "Turbulence (1997)"] [0.8157894736842107, 7]


As shown, Titanic has a correlation of 0.435 with True Lies; perhaps the reason is that both were directed by James Cameron. Titanic also has a high correlation with Turbulence, but since only 7 users rated both, the confidence in that correlation is weak.

The last part of the code is:

from math import sqrt
from itertools import combinations

from mrjob.job import MRJob

class Step2(MRJob):

    def pairwise_items(self, user_id, values):
        # The mapper consumes the raw text output of Step 1
        values = eval(values.split('\t')[1])
        item_count, item_sum, ratings = values
        for item1, item2 in combinations(ratings, 2):
            yield (item1[0], item2[0]), (item1[1], item2[1])
            # print (item1[0], item2[0]), (item1[1], item2[1])  # debug only; printing here corrupts mrjob's output stream

    def calculate_similarity(self, pair_key, lines):
        sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
        item_pair, co_ratings = pair_key, lines
        item_xname, item_yname = item_pair
        for item_x, item_y in co_ratings:
            sum_xx += item_x * item_x
            sum_yy += item_y * item_y
            sum_xy += item_x * item_y
            sum_y += item_y
            sum_x += item_x
            n += 1

        similarity = self.normalized_correlation(n, sum_xy, sum_x, sum_y, sum_xx, sum_yy)
        yield (item_xname, item_yname), (similarity, n)

    def steps(self):
        return [self.mr(mapper=self.pairwise_items,
                        reducer=self.calculate_similarity), ]

    def normalized_correlation(self, n, sum_xy, sum_x, sum_y, sum_xx, sum_yy):
        numerator = (n * sum_xy - sum_x * sum_y)
        denominator = sqrt(n * sum_xx - sum_x * sum_x) * sqrt(n * sum_yy - sum_y * sum_y)
        return numerator / denominator if denominator else 0

    def calculate_ranking(self, item_keys, values):
        # Extra mapper for a ranking step; not wired into steps() above
        similarity, n = values
        item_x, item_y = item_keys
        if int(n) > 1:
            yield (item_x, similarity), (item_y, n)

    def top_similar_items(self, key_sim, similar_ns):
        item_x, similarity = key_sim
        for item_y, n in similar_ns:
            print '%s; %s; %f; %d' % (item_x, item_y, similarity, n)

if __name__ == '__main__':
    Step2.run()
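Step 2 consumes the text output of Step 1, so locally the two jobs can be chained like this (again, the file names are arbitrary):

python step2.py step1_output.txt > correlations.txt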
May 26, 2014
 
Flask is a lightweight web framework for Python, well documented and clearly written. Its GitHub repository provides a few examples, including minitwit. The minitwit website offers a few basic social-network features such as following and login/logout. The demo site on GAE is http://minitwit-123.appspot.com. The GitHub repo is https://github.com/dapangmao/minitwit.
Google App Engine, or GAE, is a major public cloud service besides Amazon EC2. Among the four languages it supports (Java/Python/Go/PHP), GAE is friendly to Python users, possibly because Guido van Rossum worked there and personally created the Python Datastore interface. For me, it is a good choice for a Flask app.

Step 1: download the GAE SDK and the GAE Flask skeleton

GAE's Python SDK tests the app locally and eventually pushes it to the cloud.
A Flask skeleton can be downloaded from the Google Developer Console. It contains three files:
  • app.yaml: specifies the runtime entry point
  • appengine_config.py: adds external libraries such as Flask to the system path
  • main.py: the root Python program

Step 2: schema design

The database used by the original minitwit is SQLite. The schema consists of three tables: user, follower, and message, which together make a normalized database. GAE has two Datastore APIs: DB and NDB. Since neither of them supports joins (in this case a one-to-many join from user to follower), I moved the follower table into the user table as a nested repeated property, which eliminates the need for joining.
As a result, main.py has two data models: User and Message. They create and maintain two kinds (Datastore's term for tables) with the same names.
class User(ndb.Model):
    username = ndb.StringProperty(required=True)
    email = ndb.StringProperty(required=True)
    pw_hash = ndb.StringProperty(required=True)
    following = ndb.IntegerProperty(repeated=True)
    start_date = ndb.DateTimeProperty(auto_now_add=True)

class Message(ndb.Model):
    author = ndb.IntegerProperty(required=True)
    text = ndb.TextProperty(required=True)
    pub_date = ndb.DateTimeProperty(auto_now_add=True)
    email = ndb.StringProperty(required=True)
    username = ndb.StringProperty(required=True)

Step 3: replace SQL statements

The next step is to replace the SQL operations in each routing function with NDB's methods. NDB's two fundamental operations are get(), which retrieves an entity from Datastore, and put(), which pushes an entity back to Datastore. In short, data is created and manipulated as individual objects.
For example, to add a follower to a user, I first retrieve the user by its ID, which returns an entity with the fields [username, email, pw_hash, following, start_date], where following itself is a list. Then I append the new follower to the following field and save the entity back again.
u = User.get_by_id(cid)
if u.following is None:
    u.following = [whom_id]
    u.put()
else:
    u.following.append(whom_id)
    u.put()
People with experience in an ORM such as SQLAlchemy will be comfortable implementing these changes.
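Reading goes through the same object interface. For instance, a minimal sketch (using the Message model above; the fetch size is arbitrary) of pulling the most recent messages for a public timeline:

messages = Message.query().order(-Message.pub_date).fetch(30)
for m in messages:
    print m.username, ':', m.text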

Step 4: testing and deployment

Without the schema file, now the minitwit is a real single file web app. It’s time to use GAE SDK to test it locally, or eventually push it to the cloud. On GAE, We can check any error or warning through the Logs tab to find bugs, or view the raw data through the Datastore Viewer tab.
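With the SDK of that era, local testing and deployment typically looked like this, assuming the commands are run from the project directory that contains app.yaml:

dev_appserver.py .
appcfg.py update .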
In conclusion, GAE has a few advantages and disadvantages for a Flask web app.
  • Pros:
    • It allows up to 25 free apps (great for exercises)
    • Use of the database is free
    • Automatic memcache for high IO
  • Cons:
    • The database is NoSQL, which makes data hard to port
    • More expensive for production than EC2
May 22, 2014
 
In his book Machine Learning in Action, Peter Harrington provides a solution for parameter estimation of logistic regression. I use pandas and ggplot to realize a recursive alternative. Compared with the iterative method, the recursion costs more space but may improve performance.
# -*- coding: utf-8 -*-
"""
Use recursion and gradient ascent to solve logistic regression in Python
"""

import pandas as pd
from numpy import *
from ggplot import *

def sigmoid(inX):
    return 1.0/(1 + exp(-inX))

def grad_ascent(dataMatrix, labelMat, cycle):
    """
    A function to use gradient ascent to calculate the coefficients
    """
    if not isinstance(cycle, int) or cycle < 0:
        raise ValueError("Must be a valid value for the number of iterations")
    m, n = shape(dataMatrix)
    alpha = 0.001
    if cycle == 0:
        return ones((n, 1))
    else:
        weights = grad_ascent(dataMatrix, labelMat, cycle-1)
        h = sigmoid(dataMatrix * weights)
        errors = (labelMat - h)
        return weights + alpha * dataMatrix.transpose() * errors

def plot(vector):
    """
    A function to use ggplot to visualize the result
    """
    x = arange(-3, 3, 0.1)
    y = (-vector[0] - vector[1]*x) / vector[2]
    new = pd.DataFrame()
    new['x'] = x
    new['y'] = array(y).flatten()
    infile.classlab = infile.classlab.astype(str)
    p = ggplot(aes(x='x', y='y', colour='classlab'), data=infile) + geom_point()
    # Overlay the decision boundary stored in `new` (assumes the ggplot build supports per-layer data)
    return p + geom_line(aes(x='x', y='y'), data=new)

# Use pandas to manipulate data
if __name__ == '__main__':
    infile = pd.read_csv("https://raw.githubusercontent.com/pbharrin/machinelearninginaction/master/Ch05/testSet.txt",
                         sep='\t', header=None, names=['x', 'y', 'classlab'])
    infile['one'] = 1
    mat1 = mat(infile[['one', 'x', 'y']])
    mat2 = mat(infile['classlab']).transpose()
    result1 = grad_ascent(mat1, mat2, 500)
    print plot(result1)
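One caveat: grad_ascent recurses once per cycle, so the 500 cycles above sit fairly close to CPython's default recursion limit (usually 1000). If you raise the cycle count, raise the limit as well; a minimal sketch:

import sys
sys.setrecursionlimit(5000)  # pick a value comfortably above the cycle count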
May 1, 2014
 
Python's line-by-line processing allows it to count data that is bound by hard disk size. The most frequently used data structures in Python are the list and the dictionary. In many cases the dictionary has the advantage, since it is basically a hash table that realizes O(1) operations.
However, for tasks of counting values, the two options make little difference, and we can choose either for convenience. I list two examples below.

Use a dictionary as a counter

There is a question about counting strings in Excel:
Count the unique values in one column in Excel 2010. The worksheet has 1 million rows and 10 columns of strings or numbers.
For example,
A5389579_10
A1543848_6
A5389579_8
The part after (and including) the underscore needs to be cut off, e.g. from A5389579_10 to A5389579.
Excel on a desktop commonly can't handle this size of data, while Python handles the job easily.
# Load the Excel file with the xlrd package
import xlrd
book = xlrd.open_workbook("test.xlsx")
sh = book.sheet_by_index(0)
print sh.name, sh.nrows, sh.ncols
print "Cell D30 is", sh.cell_value(rowx=29, colx=3)

# Count the unique values in a dictionary
c = {}
for rx in range(sh.nrows):
    # Cut off the part after (and including) the underscore
    word = str(sh.row(rx)[1].value).split('_')[0]
    try:
        c[word] += 1
    except KeyError:
        c[word] = 1

print c
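As an aside, the standard library's collections.Counter (available since Python 2.7) wraps this try/except pattern; a minimal equivalent sketch using the same sheet object:

from collections import Counter

c = Counter(str(sh.row(rx)[1].value).split('_')[0] for rx in range(sh.nrows))
print c.most_common(10)  # the ten most frequent prefixes and their counts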

Use a list as a counter

There is a question about counting emails:
A 3-column data set includes sender, receiver, and timestamp. How do we calculate the time between the sender sending an email and the receiver sending the reply?
The challenge is to scale from the small sample data to a larger size. The solution I have has O(n log n) complexity, limited only by the sorting step.
raw_data = """
SENDER|RECEIVER|TIMESTAMP
A B 56
A A 7
A C 5
C D 9
B B 12
B A 8
F G 12
B A 18
G F 2
A B 20
"""


# Transform the raw data into a nested list
data = raw_data.split()
data.pop(0)  # Remove the header
data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))

# Sort the nested list by the timestamp
from operator import itemgetter
data.sort(key=itemgetter(2))
for r in data:
    print r

# Collect the time differences in a list
c = []
while len(data) != 1:
    y = data.pop(0)
    for x in data:
        if x[0] == y[1] and x[1] == y[0]:
            diff = x[2] - y[2]
            print y, x, '---->', diff
            c.append(diff)
            break  # Only record the quickest reply
print c

P.S.

I came up with the O(n) solution below, which uses two hash tables to decrease the complexity.
__author__ = 'dapangmao'

def find_duration(data):
    # Construct two hash tables; tuple keys avoid the collisions that
    # string concatenation (e.g. 'A' + 'BC' vs 'AB' + 'C') could cause
    h1 = {}
    h2 = {}
    # Find the earliest sending time for each ordered (sender, receiver) pair
    for x in data:
        if x[0] != x[1]:
            key = (x[0], x[1])
            if key not in h1:
                h1[key] = x[2]
            else:
                h1[key] = min(h1[key], x[2])
    # Find the minimum reply duration for each pair
    for x in data:
        key = (x[1], x[0])
        if key in h1:
            duration = x[2] - h1[key]
            if key not in h2:
                h2[key] = duration
            else:
                h2[key] = min(h2[key], duration)
    return h2

if __name__ == "__main__":
    raw_data = """
    SENDER|RECEIVER|TIMESTAMP
    A B 56
    A A 7
    A C 5
    C D 9
    B B 12
    B A 8
    F G 12
    B A 18
    G F 2
    A B 20
    """

    # Transform the raw data into a nested list
    data = raw_data.split()
    data.pop(0)  # Remove the header
    data = zip(data[0::3], data[1::3], map(lambda x: int(x), data[2::3]))
    # Verify the result
    print find_duration(data)
April 2, 2014
 

Last year I gave a talk at SESUG 2013 on list manipulation in SAS using a collection of function-like macros. Today I discovered in my recently upgraded SAS 9.4 that I can play with lists natively, which means I can create a list, slice a list, and do other list operations in DATA steps! This is not documented yet (which means it will not be supported by the software vendor), the Log window shows warnings like "WARNING: List object is preproduction in this release", and it is still somewhat limited, so use it at your own risk (and, of course, for fun). Adding such a versatile list object will definitely make SAS programmers more powerful. I will keep watching its further development.

*************Update********

Some readers emailed me that they can't reproduce the results shown here. It's best to check your own system:

I. Make sure you use the latest SAS software. I only tested on a 64-bit Windows 7 machine with SAS 9.4 TS1M1:

[screenshot: SAS 9.4 TS1M1 version information]

II. Make sure all hotfixes were applied (You can use this SAS Hot Fix Analysis, Download and Deployment Tool).

[screenshot: SAS Hot Fix Analysis, Download and Deployment Tool]

*************Update End********

The following are some quick experiments; I will report more after further research:

1. Create a List

It’s easy to create a list:

data _null_;
    a = ['apple', 'orange', 'banana'];
    put a;
run;

The output in the Log window:

[screenshot: the list printed in the Log window]

You can also convert a string to a list:

data _null_;
    a = 'SAS94';
    b = list(a);
    put b;
run;

[screenshot: the resulting list in the Log window]

2. Slice a List

Slicing a list is also pretty straightforward, like in R and Python:

data _null_;
    a = ['apple', 'orange', 'banana'];
    b = a[0];
    c = a[:-1];
    d = a[1:2];
    put a;
    put b;
    put c;
    put d;
run;

[screenshot: the sliced lists in the Log window]

3. List is Immutable in SAS!?

I was feeling quite comfortable with list operations in SAS, but then a weird thing happened. I tried to change a value in a list:

data _null_;
    a = ['apple', 'orange', 'banana'];
    a[0] = 'Kivi';
    put a;
run;

Unexpectedly, I got an error:

[screenshot: the error message in the Log window]

So I need to create a new list to hold such a modification? That is funny.

Based on my quick exploration, the list object in SAS is pretty intuitive from a programmer's point of view. But since it's undocumented, and I don't know how long it will stay in the "preproduction" phase, be careful about using it in your production work.

Personally I am very excited to "hack" such wonderful list features in SAS 9.4. If well implemented, they will easily beat R and Python (which pride themselves on supporting rich data types and objects) as a scripting language for SAS programmers. I will keep this page updated.

March 28, 2014
 
To perform data analysis efficiently, I need a full-stack programming language rather than frequently switching from one language to another. That means the language can hold a large quantity of data, manipulate data promptly and easily (e.g. if-then-else, iteration), connect to various data sources such as relational databases and Hadoop, apply statistical models, and report results as graphs, tables, or web pages. SAS is famous for its capacity to realize such a data cycle, as long as you are willing to pay the annual license fee.
SAS's long-standing competitor, R, keeps growing. However, in the past few years the Python community has launched a movement to port R's jewels and ideas to Python, which has resulted in a few solid applications such as pandas and ggplot. With the rapid accumulation of data-related tools in Python, I feel more comfortable working with data in Python than in R, because I have a bias that Python's interpreter is more stable than R's when dealing with data, and sometimes I just want to escape from R's idiosyncratic syntax, such as x<-4 or foo.bar.2000=10.

Actually there is no competition between SAS and R at all: the two dwell in parallel universes and rely on distinct ecosystems. SAS, Python, Bash, and Perl process data row-wise, which means they input and output data line by line. R, MATLAB, SAS/IML, Python/pandas, and SQL manipulate data column-wise. The data size for row-wise packages such as SAS is bound by hard disk, at the cost of the low speed of disk access. On the contrary, column-wise packages including R are memory-bound, with the much faster speed that memory brings.
Let's go back to the comparison between SAS and Python. For most of the parts I am familiar with in SAS, I can find equivalent modules in Python. The table below lists the similar components of SAS and Python.
SAS                                    Python
DATA step                              core Python
SAS/STAT                               StatsModels
SAS/Graph                              matplotlib
SAS Statistical Graphics               ggplot
PROC SQL                               sqlite3
SAS/IML                                NumPy
SAS Windowing Environment              Qt Console for IPython
SAS Studio                             IPython notebook
SAS In-Memory Analytics for Hadoop     Spark with Python
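To make one row of the table concrete, here is a minimal sketch, with a hypothetical file example.csv containing columns y and x, of pandas plus StatsModels standing in for a DATA step plus PROC REG:

import pandas as pd
import statsmodels.formula.api as smf

# Read and filter data: the DATA step's role
df = pd.read_csv('example.csv')   # hypothetical input file
df = df[df['x'].notnull()]        # row filtering, like a subsetting IF

# Fit an ordinary least squares model: roughly PROC REG's role
model = smf.ols('y ~ x', data=df).fit()
print model.summary()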
This week SAS announced some promising products. Interestingly, they can be traced to similar implementations in Python. For example, SAS Studio, a fancy web-based IDE with code completion, opens an HTTP server on the local machine and uses a browser for coding, which is amazingly similar to the IPython notebook. Another example is SAS In-Memory Analytics for Hadoop. Given that the old MapReduce path for data analysis is painfully time-consuming and complicated, aggregating memory instead of hard disk across the nodes of a Hadoop cluster is certainly faster and more interactive. Based on the same idea, Apache Spark, which fully supports Python scripting, has just been released in CDH 5.0. It will be interesting to compare Python's and SAS's in-memory abilities for data analysis on Hadoop.
Until there is a new killer app for R, at least for now, Python steals R's thunder as the open-source alternative to SAS.