
December 24, 2014
 
Sample.txt
Requirements:
1. Separate the valid SSNs from the invalid ones.
2. Count the number of valid SSNs.
402-94-7709 
283-90-3049
124-01-2425
1231232
088-57-9593
905-60-3585
44-82-8341
257581087
327-84-0220
402-94-7709

Thoughts

SSN-indexed data is commonly seen and stored in many file systems. The trick to accelerating this on Spark is to build a numerical key and use the sortByKey operator. Besides, an accumulator provides a global variable that exists across the machines in a cluster, which is especially useful for counting data.
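For instance, the numerical key simply packs the three dashed groups into one integer, so sorting the keys sorts the SSNs. A minimal sketch (ssn_key is a hypothetical helper name, not part of the solutions below):

def ssn_key(s):
    # '402-94-7709' -> 402*1000000 + 94*10000 + 7709 = 402947709
    a, b, c = s.split('-')
    return 1000000*int(a) + 10000*int(b) + int(c)

print ssn_key('402-94-7709')   # 402947709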

Single machine solution

#!/usr/bin/env python
# coding=utf-8
htable = {}
valid_cnt = 0
with open('sample.txt', 'rb') as infile, open('sample_bad.txt', 'wb') as outfile:
    for l in infile:
        l = l.strip()
        nums = l.split('-')
        key = -1
        # Case 1: nine digits with no dashes
        if l.isdigit() and len(l) == 9:
            key = int(l)
        # Case 2: three dash-separated groups of 3, 2 and 4 digits
        if len(nums) == 3 and map(len, nums) == [3, 2, 4]:
            key = 1000000*int(nums[0]) + 10000*int(nums[1]) + int(nums[2])
        if key == -1:
            outfile.write(l + '\n')
        else:
            if key not in htable:
                htable[key] = l
                valid_cnt += 1

with open('sample_sorted.txt', 'wb') as outfile:
    for x in sorted(htable):
        outfile.write(htable[x] + '\n')

print valid_cnt

Cluster solution

#!/usr/bin/env python
# coding=utf-8
import pyspark
sc = pyspark.SparkContext()
valid_cnt = sc.accumulator(0)

def is_validSSN(l):
    l = l.strip()
    nums = l.split('-')
    cdn1 = (l.isdigit() and len(l) == 9)
    cdn2 = (len(nums) == 3 and map(len, nums) == [3, 2, 4])
    if cdn1 or cdn2:
        return True
    return False

def set_key(l):
    global valid_cnt
    valid_cnt += 1
    l = l.strip()
    if len(l) == 9:
        return (int(l), l)
    nums = l.split('-')
    return (1000000*int(nums[0]) + 10000*int(nums[1]) + int(nums[2]), l)

rdd = sc.textFile('sample.txt')
rdd1 = rdd.filter(lambda x: not is_validSSN(x))

rdd2 = rdd.filter(is_validSSN).distinct() \
    .map(lambda x: set_key(x)) \
    .sortByKey().map(lambda x: x[1])

for x in rdd1.collect():
    print 'Invalid SSN\t', x

for x in rdd2.collect():
    print 'valid SSN\t', x

# The accumulator has been updated by the action (collect) that evaluated rdd2
print '\nNumber of valid SSN is {}'.format(valid_cnt.value)

# Save RDD to file system
rdd1.saveAsTextFile('sample_bad')
rdd2.saveAsTextFile('sample_sorted')
sc.stop()
December 13, 2014
 
In a class of a few children, use SQL to find those who are male and weigh over 100.
class.txt (including Name Sex Age Height Weight)
Alfred M 14 69.0 112.5 
Alice F 13 56.5 84.0
Barbara F 13 65.3 98.0
Carol F 14 62.8 102.5
Henry M 14 63.5 102.5
James M 12 57.3 83.0
Jane F 12 59.8 84.5
Janet F 15 62.5 112.5
Jeffrey M 13 62.5 84.0
John M 12 59.0 99.5
Joyce F 11 51.3 50.5
Judy F 14 64.3 90.0
Louise F 12 56.3 77.0
Mary F 15 66.5 112.0
Philip M 16 72.0 150.0
Robert M 12 64.8 128.0
Ronald M 15 67.0 133.0
Thomas M 11 57.5 85.0
William M 15 66.5 112.0

Thoughts

The challenge is to transform unstructured data into structured data. In this question, a schema with column names and types has to be applied so that SQL syntax can query the plain text.

Single machine solution

This is straightforward and simple with Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("""CREATE TABLE class
(Name text, Sex text, Age real, Height real, Weight real)"""
)

with open('class.txt') as infile:
for l in infile:
line = l.split()
c.execute('INSERT INTO class VALUES (?,?,?,?,?)', line)
conn.commit()

for x in c.execute("SELECT * FROM class WHERE Sex = 'M' AND Weight > 100"):
print x
conn.close()

Cluster solution

Spark SQL (here through HiveContext) seamlessly queries semi-structured JSON data, so dumping the records as JSON on the file system is the first step.
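Each dumped record is one JSON object per line; for example, the first row of class.txt would come out roughly like this (key order may differ, since it is produced from a Python dict):

import json
print json.dumps({'Name': 'Alfred', 'Sex': 'M', 'Age': 14, 'Height': 69.0, 'Weight': 112.5})
# e.g. {"Age": 14, "Name": "Alfred", ..., "Weight": 112.5}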
import os
import subprocess
import json
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext()
hiveCtx = HiveContext(sc)
def trans(x):
    return {'Name': x[0], 'Sex': x[1], 'Age': int(x[2]),
            'Height': float(x[3]), 'Weight': float(x[4])}

# Remove the output directory for JSON if it exists
if 'class-output' in os.listdir('.'):
    subprocess.call(['rm', '-rf', 'class-output'])

rdd = sc.textFile('class.txt')
rdd1 = rdd.map(lambda x: x.split()).map(lambda x: trans(x))
rdd1.map(lambda x: json.dumps(x)).saveAsTextFile('class-output')

infile = hiveCtx.jsonFile("class-output/part-00000")
infile.registerTempTable("class")

query = hiveCtx.sql("SELECT * FROM class WHERE Sex = 'M' AND Weight > 100")
for x in query.collect():
    print x

sc.stop()
In conclusion, JSON should be considered whenever SQL is desired on Spark.
December 8, 2014
 
Given the friend pairs in the sample text below (each line contains two people who are friends), find the stranger that shares the most friends with me.
sample.txt
me Alice
Henry me
Henry Alice
me Jane
Alice John
Jane John
Judy Alice
me Mary
Mary Joyce
Joyce Henry
Judy me
Judy Jane
John Carol
Carol me
Mary Henry
Louise Ronald
Ronald Thomas
William Thomas

Thoughts

The scenario is common for a social network user. Spark has three ways to query such data:
  • MapReduce
  • GraphX
  • Spark SQL
Starting with the simplest MapReduce approach, I use two hash tables in Python. First I scan all friend pairs and store each person's friends in a hash table. Second, I use another hash table to count my friends' friends and pick out the ones who are strangers to me.

Single machine solution

#!/usr/bin/env python
# coding=utf-8
htable1 = {}
with open('sample.txt', 'rb') as infile:
    for l in infile:
        line = l.split()
        if line[0] not in htable1:
            htable1[line[0]] = [line[1]]
        else:
            htable1[line[0]] += [line[1]]
        if line[1] not in htable1:
            htable1[line[1]] = [line[0]]
        else:
            htable1[line[1]] += [line[0]]

lst = htable1['me']
htable2 = {}
for key, value in htable1.iteritems():
    if key in lst:
        for x in value:
            if x not in lst and x != 'me':  # should only count strangers
                if x not in htable2:
                    htable2[x] = 1
                else:
                    htable2[x] += 1

for x in sorted(htable2, key = htable2.get, reverse = True):
    print "The stranger {} has {} common friends with me".format(x, htable2[x])
The result shows that John shares three friends with me, followed by Joyce with two. Therefore, John is the one most likely to be recommended to me by the social network.

Cluster solution

If the log file of friend pairs is quite big, say several GB, the single-machine solution cannot load the data into memory, and we have to seek help from a cluster.
Spark provides the pair RDD, which is similar to a hash table and is essentially a key-value structure. To translate the single-machine solution into a cluster one, I use operators from Spark's Python API including map, reduceByKey, filter, union and sortByKey.
#!/usr/bin/env python
# coding=utf-8
import pyspark
sc = pyspark.SparkContext()
# Load data from hdfs
rdd = sc.textFile('hdfs://sample.txt')
# Build the first pair RDD
rdd1 = rdd.map(lambda x: x.split()).union(rdd.map(lambda x: x.split()[::-1]))
# Bring my friend list to local
lst = rdd1.filter(lambda x: x[0] == 'me').map(lambda x: x[1]).collect()
# Build the second pair RDD
rdd2 = rdd1.filter(lambda x: x[0] in lst).map(lambda x: x[1]) \
    .filter(lambda x: x != 'me' and x not in lst) \
    .map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b) \
    .map(lambda (x, y): (y, x)).sortByKey(ascending = False)
# Save the result to hdfs
rdd2.saveAsTextFile("hdfs://sample_output")
# Bring the result to local since the sample is small
for x, y in rdd2.collect():
    print "The stranger {} has {} common friends with me".format(y, x)

sc.stop()
The result is the same. In this experiment, most of the time is spent loading the data from HDFS into memory; the subsequent MapReduce operations cost only a small fraction of the overall time. In conclusion, Spark fits well for iterative data analysis against an existing RDD.
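A minimal sketch of that last point (not part of the solution above, reusing the same hdfs://sample.txt path): caching the parsed pairs keeps them in memory after the first action, so later queries skip the HDFS load.

import pyspark
sc = pyspark.SparkContext()
pairs = sc.textFile('hdfs://sample.txt').map(lambda x: x.split()).cache()
print pairs.count()                                  # first action reads HDFS and caches the RDD
print pairs.filter(lambda x: x[0] == 'me').count()   # reuses the cached data, no second scan
sc.stop()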
November 29, 2014
 
There is always a trade-off between time complexity and space complexity in computer programs: decreasing the time cost tends to increase the space cost, and vice versa. The ideal solution is to parallelize the program across multiple cores on a multi-core computer, or even scale it out to multiple machines across a cluster, which eventually reduces both the time cost and the space cost.
Spark is currently the hottest platform for cluster computing on top of Hadoop. Its Python interface provides map, reduce and many other methods, which express a MapReduce job in a straightforward way and therefore make it easy to migrate an algorithm from a single machine to a cluster of many machines.
  • Minimize space complexity
There is a question that asks for the single number in an array where every other element appears twice.
Single Number
  Given an array of integers, every element appears twice except for one.
Find that single one.
Note:
Your algorithm should have a linear runtime complexity.
Could you implement it without using extra memory?
The optimal space complexity for this question is O(1), using the bitwise xor operator. On a cluster, since Spark aggregates memory across machines, each machine only needs to hold about 1/k of the data, where k is the number of machines in the cluster.
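The same xor trick on a single machine is a one-liner; a minimal sketch before the Spark version below:

from operator import xor
nums = [4, 1, 2, 1, 2]
print reduce(xor, nums)   # 4: paired numbers cancel out, only the single number remains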
# Space.py
import pyspark
from random import shuffle
from operator import xor
sc = pyspark.SparkContext()

# Create the test case and the target is 99
testCase = range(0, 99) + range(0, 100)
shuffle(testCase)
# Run the testing with Spark
result = sc.parallelize(testCase).reduce(xor)
# Show the result
print result
sc.stop()
  • Minimize time complexity
There is a question to implement the function (or a method) that returns the square root of an integer.
Sqrt(x)
Implement int sqrt(int x).
Compute and return the square root of x.
The optimal solution achieves a time complexity of O(lg N) by using binary search. If we pass the sqrt function to Spark and map it over many inputs, the overall time cost may drop to roughly 1/k of the single-machine time, where k is the number of machines in the cluster.
# Time.py
import pyspark
sc = pyspark.SparkContext()
# Implement binary search for square root function
def sqrt(x):
    if x < 0 or not isinstance(x, int):
        raise ValueError
    hi, lo = x/2 + 1, 0
    while hi >= lo:
        mid = (hi + lo) / 2
        if mid * mid > x:
            hi = mid - 1
        else:
            lo = mid + 1
    return int(hi)

# Test the square root algorithm
testCase = sc.parallelize(xrange(1, 100))
result = testCase.map(lambda x: sqrt(x))
# Show the result
for x in result.collect():
    print x
sc.stop()
  • Find the worst rating by accounts
There is a question to find the worst among several rating letters for each account number.
Want to find the worst rating for each account number.
sample.txt is below
Account_number    Rating
1 A
1 B
2 A
2 B
2 C
3 A
3 C
the desired result should be like
1 B
2 C
3 C
This is essentially a grouping question. Spark’s pair RDD, which reflects the key-value relationship within groups, supplies a one-line solution for it.
import pyspark
sc = pyspark.SparkContext()

# Squeeze the letters by keys
rdd = sc.textFile('sample.txt')
result = rdd.map(lambda x: x.split()).filter(lambda x: x[0].isdigit()).reduceByKey(max)
# Show the result
for x in result.collect():
    print x
sc.stop()
In conclusion, Spark significantly changes the way we think about data analysis.
October 21, 2014
 
The hardest part of testing is writing test cases, which is time-consuming and error-prone. Fortunately, besides Python's built-in modules such as doctest and unittest, there are quite a few third-party packages that help with automated testing. My favorite is pytest, which has a proven track record and pleasant syntactic sugar.
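As a taste of that syntactic sugar, the smallest possible pytest test is just a plain function with a bare assert (a hypothetical test_sample.py, run with py.test test_sample.py):

def inc(x):
    return x + 1

def test_inc():
    assert inc(3) == 4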

Step 1: test-driven development

For example, there is a coding challenge on Leetcode:
Find Minimum in Rotated Sorted Array
Suppose a sorted array is rotated at some pivot unknown to you beforehand.
(i.e., 0 1 2 4 5 6 7 might become 4 5 6 7 0 1 2).
Find the minimum element.
You may assume no duplicate exists in the array.
The straightforward way to find the minimal element in an array (or list in Python) is sequential search, which goes through every element and has a time complexity of O(N). If the array is sorted, the minimal one is simply the first element, which costs only O(1).
However, this question provides a rotated sorted array, which suggests a binary search and reduces the complexity from O(N) to O(logN).
As usual, write the test cases first. The great thing about pytest is that it significantly simplifies the effort of coding test cases: in this example, I use only 3 lines to generate 101 test cases covering all rotations from 0 to 99, plus a null test.
The next step is to code the function. It is easy to transplant the iterative binary search to this question: if the current window is already sorted, return its leftmost element as the minimum; otherwise, adjust the left or the right boundary.
# test1.py
import pytest

# Prepare 101 test cases
array = list(range(100))
_testdata = [[array[i: ] + array[ :i], 0] for i in range(100)]
_testdata += [pytest.mark.empty(([], None))]

# Code the initial binary search function
def findMinPrev(num):
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid - 1
        else:
            lo = mid + 1

@pytest.mark.parametrize('input, expected', _testdata)
def test_findMinPrev(input, expected):
    assert findMinPrev(input) == expected
After running the py.test -v test1.py command, part of the result is shown below: 65 tests passed and 36 failed. The failed cases return much bigger values, which suggests the search walks out of the right range during the loop, so the choice of boundaries may be too aggressive.
test1.py:20: AssertionError
_________________________ test_findMinPrev[input98-0] _________________________

input = [98, 99, 0, 1, 2, 3, ...], expected = 0

@pytest.mark.parametrize('input, expected', _testdata)
def test_findMinPrev(input, expected):
> assert findMinPrev(input) == expected
E assert 98 == 0
E + where 98 = findMinPrev([98, 99, 0, 1, 2, 3, ...])

test1.py:20: AssertionError
==================== 36 failed, 65 passed in 0.72 seconds =====================
Now I adjust the right boundary slightly and finally come up with a solution that passes all the tests.
def findMin(num):
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid
        else:
            lo = mid + 1

Step 2: performance profiling

Besides getting the right solution, I am also interested in whether the binary search has indeed improved the performance. For this step I choose line_profiler, given its line-by-line profiling ability. I take the most basic approach (sequential search) as the benchmark, and also include a version that uses the built-in min function, since built-ins like it are implemented in C and run much faster than an explicit Python loop. The test case is a rotated sorted array with 10 million elements.
# test2.py
from line_profiler import LineProfiler
from sys import maxint

@profile
def findMinRaw(num):
    """Sequential search"""
    if not num:
        return
    min_val = maxint
    for x in num:
        if x < min_val:
            min_val = x
    return min_val

@profile
def findMinLst(num):
    """Search with the built-in min function"""
    if not num:
        return
    return min(num)

@profile
def findMin(num):
    """Binary search"""
    lo, hi = 0, len(num) - 1
    while lo <= hi:
        if num[lo] <= num[hi]:
            return num[lo]
        mid = (lo + hi) / 2
        if num[mid] < num[hi]:
            hi = mid
        else:
            lo = mid + 1

# Prepare a rotated array
array = list(range(10000000))
_testdata = array[56780: ] + array[ :56780]
# Test the three functions
findMinRaw(_testdata)
findMinLst(_testdata)
findMin(_testdata)
After running kernprof -l -v test2.py, I get the output below. The sequential search hits the loop 10,000,001 times and costs almost 14 seconds. The min function encapsulates all the details and uses 0.5 seconds, which is about 28 times faster. By contrast, the binary search takes only 20 iterations to find the minimal value and spends just 0.0001 seconds. As a result, when dealing with large inputs, a better algorithm really saves time.
Total time: 13.8512 s
File: test2.py
Function: findMinRaw at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def findMinRaw(num):
     6         1           13     13.0      0.0      if not num:
     7                                                   return
     8         1            3      3.0      0.0      min_val = maxint
     9  10000001     16031900      1.6     47.5      for x in num:
    10  10000000     17707821      1.8     52.5          if x < min_val:
    11         2            5      2.5      0.0              min_val = x
    12         1            3      3.0      0.0      return min_val

Total time: 0.510298 s
File: test2.py
Function: findMinLst at line 15

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           @profile
    16                                           def findMinLst(num):
    17         1            4      4.0      0.0      if not num:
    18                                                   return
    19         1      1243016 1243016.0    100.0      return min(num)

Total time: 0.000101812 s
File: test2.py
Function: findMin at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @profile
    23                                           def findMin(num):
    24         1           15     15.0      6.0      lo, hi = 0, len(num) - 1
    25        20           40      2.0     16.1      while lo <= hi:
    26        20           48      2.4     19.4          if num[lo] <= num[hi]:
    27         1            2      2.0      0.8              return num[lo]
    28        19           54      2.8     21.8          mid = (lo + hi) / 2
    29        19           50      2.6     20.2          if num[mid] < num[hi]:
    30         5           10      2.0      4.0              hi = mid
    31                                                   else:
    32        14           29      2.1     11.7              lo = mid + 1
September 25, 2014
 
A function or method is the most basic unit in Python programming. Test-driven development (TDD) is key for a developer to assure the code quality of those units. In his book, Harry Percival illustrates a few great examples of how to use TDD with Python and Django. It seems that for web development, TDD, including unit testing and integration testing, is the cornerstone of every success. For data analysis, coding mostly relies on built-in packages instead of a large framework like Django, which makes TDD easier. In my opinion, TDD in data analysis has three steps.
  • Step 1: requirement analysis
    Before writing any code for data analysis, the programmer should seriously ask the customer, or himself, about the requirements.
    • What are the input parameters?
    • What if the input data doesn't fit the assumptions?
    • What is the purpose of this function or method? What are the desired outputs?
  • For example, there is a recent coding challenge called Maximum Product Subarray.
      > Find the contiguous subarray within an array (containing at least one number) which has the largest product.
    For example, given the array [2,3,-2,4],
    the contiguous subarray [2,3] has the largest product = 6.
  • OK, understanding this question is quite straightforward. Given an array (or a list in Python), you return the integer that is the maximum product of a contiguous subarray of the input array.
      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          pass
    • A production version of the code above would look more like:
      class FunctionInputError(Exception):
          pass

      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list
          """
          if A is None or not isinstance(A, list):
              raise FunctionInputError('must give a list as input')
          if len(A) == 1:
              return A[0]
          pass
  • Step 2: write test cases
    Since not a single line of logic code has been written yet, I call this step black-box testing, which means I want the function to fail every test case. Python has a built-in module, doctest, which allows embedding the test cases within the docstring. I write six test cases, run the pseudo-function below, and arbitrarily specify the result to be -1. As expected, it fails all six test cases with horrible red warnings under the Python shell. That is a good thing: it proves that the testing works.
      import doctest

      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list

          - testcase1
          >>> maxProduct([0, 2, 3,-2,4, 1, -1])
          48

          - testcase2
          >>> maxProduct([2, 3, 0, 2, 4, 0, 3, 4])
          12

          - testcase3
          >>> maxProduct([0, 5, 3, -1, -2, 0, -2, 4, 0, 3, 4])
          30

          - testcase4
          >>> maxProduct([0, 1, -3, -4, -5, -1, 1, 2, 1, -1, -100, 0, -100000])
          12000

          - testcase5
          >>> maxProduct([0, -3000, -2, 0, -100, -100, 0, -9, -8, 1, 1, 2])
          10000

          - testcase6
          >>> maxProduct([0, -2, 0])
          0
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          return -1

      doctest.testmod()
  • Step 3: implement the logic
    It’s time to tackle the most difficult part: writing the real function. Think about the time complexity (ideally a single pass over the input array, i.e. O(n)) and the space complexity (ideally no extra space). Run testmod() again and again to find mistakes and modify the code accordingly. Finally I come up with a solution that uses a helper function _maxProduct, and it passes all six test cases. Although I am not sure this function is completely bug-free, at least it works for now.
      import doctest
      from sys import maxint

      def maxProduct(A):
          """ A function to find the maximum product value for a continuous subarray.
          :param A: an array or list
          :type A: list

          - testcase1
          >>> maxProduct([0, 2, 3,-2,4, 1, -1])
          48

          - testcase2
          >>> maxProduct([2, 3, 0, 2, 4, 0, 3, 4])
          12

          - testcase3
          >>> maxProduct([0, 5, 3, -1, -2, 0, -2, 4, 0, 3, 4])
          30

          - testcase4
          >>> maxProduct([0, 1, -3, -4, -5, -1, 1, 2, 1, -1, -100, 0, -100000])
          12000

          - testcase5
          >>> maxProduct([0, -3000, -2, 0, -100, -100, 0, -9, -8, 1, 1, 2])
          10000

          - testcase6
          >>> maxProduct([0, -2, 0])
          0
          """
          if A is None or not isinstance(A, list):
              return None
          if len(A) == 1:
              return A[0]
          return max(_maxProduct(A), _maxProduct([a for a in reversed(A)]))

      def _maxProduct(A):
          max_val_forward = 1
          rst = -maxint
          for a in A:
              if a != 0:
                  max_val_forward *= a
                  rst = max(rst, max_val_forward)
              else:
                  rst = max(0, rst)
                  max_val_forward = 1
          return rst

      if __name__ == "__main__":
          doctest.testmod()
In conclusion, the most important thing about TDD in data analysis is writing test cases, which really takes a lot of training and practice.
September 1, 2014
 

pandas groupby: split-apply-combine
In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(99)
In [8]:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                  'key2': ['one', 'two', 'one', 'two', 'one'],
                  'data1': np.random.randn(5),
                  'data2': np.random.randn(5)})
df
Out[8]:

      data1     data2 key1 key2
0 -0.142359 -0.069031    a  one
1  2.057222  0.755180    a  two
2  0.283262  0.825647    b  one
3  1.329812 -0.113069    b  two
4 -0.154622 -2.367838    a  one
In [19]:
# groupby will return a series of two elements tuples; differences:
for gb in df.groupby('key1'):
    print gb
    print '-------------------'
print '------------------This is the split line----------------------'
for name, group in df.groupby(['key1']):
    print name
    print '----------------------'
    print group
print '------------------This is the split line----------------------'
for name, group in df.groupby(['key1', 'key2']):
    print name
    print '----------------------'
    print group
    print type(group)
('a',       data1     data2 key1 key2
0 -0.142359 -0.069031 a one
1 2.057222 0.755180 a two
4 -0.154622 -2.367838 a one)
-------------------
('b', data1 data2 key1 key2
2 0.283262 0.825647 b one
3 1.329812 -0.113069 b two)
-------------------
------------------This is the split line----------------------
a
----------------------
data1 data2 key1 key2
0 -0.142359 -0.069031 a one
1 2.057222 0.755180 a two
4 -0.154622 -2.367838 a one
b
----------------------
data1 data2 key1 key2
2 0.283262 0.825647 b one
3 1.329812 -0.113069 b two
------------------This is the split line----------------------
('a', 'one')
----------------------
data1 data2 key1 key2
0 -0.142359 -0.069031 a one
4 -0.154622 -2.367838 a one
<class 'pandas.core.frame.DataFrame'>
('a', 'two')
----------------------
data1 data2 key1 key2
1 2.057222 0.75518 a two
<class 'pandas.core.frame.DataFrame'>
('b', 'one')
----------------------
data1 data2 key1 key2
2 0.283262 0.825647 b one
<class 'pandas.core.frame.DataFrame'>
('b', 'two')
----------------------
data1 data2 key1 key2
3 1.329812 -0.113069 b two
<class 'pandas.core.frame.DataFrame'>
In [18]:
# change groupby output to dict
pieces = dict(list(df.groupby('key1')))
print pieces
{'a':       data1     data2 key1 key2
0 -0.142359 -0.069031 a one
1 2.057222 0.755180 a two
4 -0.154622 -2.367838 a one, 'b': data1 data2 key1 key2
2 0.283262 0.825647 b one
3 1.329812 -0.113069 b two}
In [27]:
# groupby througn dict or series (usually for columns)
np.random.seed(9999)
people = pd.DataFrame(np.random.randn(5, 5),
                      columns = list('abcde'),
                      index = ['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.ix[2:3, ['b', 'c']] = np.nan
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
by_column = people.groupby(mapping, axis = 1)
by_column.sum()
Out[27]:

            blue       red
Joe    -0.440531 -0.278096
Steve   0.854786  1.354338
Wes    -1.646232 -1.021736
Jim    -2.071417 -1.432511
Travis  1.853391 -1.253057
In [28]:
# groupby througn index level(or column level)
np.random.seed(9999)
columns = pd.MultiIndex.from_arrays([['us', 'us', 'us', 'jp', 'jp'], [1, 3, 5, 1, 3]], names = ['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns = columns)
print hier_df
print '-----------------------'
hier_df.groupby(level = 'cty', axis = 1).count()
cty          us                            jp          
tenor 1 3 5 1 3
0 -0.515039 0.591897 0.047873 -0.488405 -0.354953
1 -0.391813 1.864261 -1.371490 2.226275 -0.118110
2 0.119848 0.182599 -0.657739 -1.646232 -1.141584
3 -1.439089 0.744992 -1.739220 -0.332197 -0.738413
-----------------------
Out[28]:
cty  jp  us
0     2   3
1     2   3
2     2   3
3     2   3
In [38]:
# use different function for different columns
def peak_to_peak(arr):
    return arr.max() - arr.min()
tips = pd.read_csv('bookdata/ch08/tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']
grouped = tips.groupby(['sex', 'smoker'])
grouped_pct = grouped['tip_pct']
 
print grouped_pct.agg(['mean', 'std', peak_to_peak])
 
print '------------------- rename columns ----------------------'
print grouped_pct.agg([('foo', 'mean'), ('bar', 'std')])
 
print '------------------- function on all columns ----------------------'
functions = ['count', 'mean', 'max']
print grouped['tip_pct', 'total_bill'].agg(functions)
 
print '------------------- rename columns ----------------------'
ftuples = [('Im Mean', 'mean'), ('Im var', np.var)]
print grouped['tip_pct', 'total_bill'].agg(ftuples)
 
print '------------------- diff funs on diff cols ----------------------'
print grouped.agg({'tip': [np.max, 'mean'], 'size': 'sum'})
                   mean       std  peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707
------------------- rename columns ----------------------
foo bar
sex smoker
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588
------------------- function on all columns ----------------------
tip_pct total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
------------------- rename columns ----------------------
tip_pct total_bill
Im Mean Im var Im Mean Im var
sex smoker
Female No 0.156921 0.001327 18.105185 53.092422
Yes 0.182150 0.005126 17.977879 84.451517
Male No 0.160669 0.001751 19.791237 76.152961
Yes 0.152771 0.008206 22.284500 98.244673
------------------- diff funs on diff cols ----------------------
tip size
amax mean sum
sex smoker
Female No 5.2 2.773519 140
Yes 6.5 2.931515 74
Male No 9.0 3.113402 263
Yes 10.0 3.051167 150
In [42]:
# apply function on each piece of groupby
def top(df, n=5, columns = 'tip_pct'):
    return df.sort_index(by = columns)[-n:]
tips.groupby(['smoker', 'day']).apply(top, n=2, columns = 'total_bill')
Out[42]:



                  total_bill    tip     sex smoker   day    time  size   tip_pct
smoker day
No     Fri  91         22.49   3.50    Male     No   Fri  Dinner     2  0.155625
            94         22.75   3.25  Female     No   Fri  Dinner     2  0.142857
       Sat  59         48.27   6.73    Male     No   Sat  Dinner     4  0.139424
            212        48.33   9.00    Male     No   Sat  Dinner     4  0.186220
       Sun  112        38.07   4.00    Male     No   Sun  Dinner     3  0.105070
            156        48.17   5.00    Male     No   Sun  Dinner     6  0.103799
       Thur 85         34.83   5.17  Female     No  Thur   Lunch     4  0.148435
            142        41.19   5.00    Male     No  Thur   Lunch     5  0.121389
Yes    Fri  90         28.97   3.00    Male    Yes   Fri  Dinner     2  0.103555
            95         40.17   4.73    Male    Yes   Fri  Dinner     4  0.117750
       Sat  102        44.30   2.50  Female    Yes   Sat  Dinner     3  0.056433
            170        50.81  10.00    Male    Yes   Sat  Dinner     3  0.196812
       Sun  184        40.55   3.00    Male    Yes   Sun  Dinner     2  0.073983
            182        45.35   3.50    Male    Yes   Sun  Dinner     3  0.077178
       Thur 83         32.68   5.00    Male    Yes  Thur   Lunch     2  0.152999
            197        43.11   5.00  Female    Yes  Thur   Lunch     4  0.115982

September 1, 2014
 
Python does not copy values on assignment; it only binds references. What you wrote creates a structure that refers to itself, which is why you end up with the infinite loop. To understand the problem, one basic concept needs to be made clear first.

Python has no "variables" in the usual sense; what we normally call a variable is really just a label (a name). When you run
values = [0, 1, 2]
Python first creates the list object [0, 1, 2] and then sticks a label named values on it. If you then run
values = [3, 4, 5]
Python creates another list object [3, 4, 5], peels the label values off the earlier object [0, 1, 2], and re-sticks it onto the new object [3, 4, 5].

From start to finish there is never a list container called values, and Python never copies any object's value into values.
When you run
values[1] = values
Python makes the second element of the list object that values refers to point to that very same list object. Afterwards, the label values still points to the original object; only the object's structure has changed, from [0, 1, 2] to [0, ?, 2], where ? is a reference to the object itself.

To achieve the effect you want, namely the object [0, [0, 1, 2], 2], you must not point values[1] at the object values refers to; instead you need to copy the object [0, 1, 2] to get a new object, and then point values[1] at that copy. How an object is copied in Python depends on its type; for the list values, the copy operation is
values[:]
so what you need to run is
values[1] = values[:]
Here Python first dereferences values to get the object [0, 1, 2], then performs the copy [0, 1, 2][:] to obtain a new object whose content is also [0, 1, 2], and finally points the second element of the list that values refers to at this copied list. In the end, values points to [0, [0, 1, 2], 2].
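A quick check of the copy-then-assign version:

values = [0, 1, 2]
values[1] = values[:]   # copy first, then bind the copy
print values            # [0, [0, 1, 2], 2]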
Going a step further, the values[:] copy is a so-called shallow copy, which can also produce surprising results when the list is nested, for example
a = [0, [1, 2], 3]
b = a[:]
a[0] = 8
a[1][1] = 9
Question: what are a and b now?

The correct answer is that a is [8, [1, 9], 3] and b is [0, [1, 9], 3]. Notice that the second element of b has changed as well. Think about why: a[1] and b[1] are two references to the same inner list, so mutating it through a is visible through b.
The correct way to copy nested elements is to make a deep copy:
import copy

a = [0, [1, 2], 3]
b = copy.deepcopy(a)
a[0] = 8
a[1][1] = 9
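Now a quick check shows the deep copy is fully independent:

print a   # [8, [1, 9], 3]
print b   # [0, [1, 2], 3]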

September 1, 2014
 

df.ix, df.loc, reindex, drop (axis = 1), filtering, df.add (axis = 0), apply (axis = 1), sort_index, rank, idxmin / idxmax, argmin / argmax, unique, value_counts(), isin, hierarchical indexing (multi index), set_index
In [1]:
import pandas as pd
import numpy as np
import matplotlib as plt
In [5]:
# create a DataFrame, with labels and column names
df = pd.DataFrame(np.arange(30).reshape(5, 6), index = ['one', 'two', 'three', 'four', 'five'], columns = list('abcdef'))
df
Out[5]:

        a   b   c   d   e   f
one     0   1   2   3   4   5
two     6   7   8   9  10  11
three  12  13  14  15  16  17
four   18  19  20  21  22  23
five   24  25  26  27  28  29
In [8]:
# slice a subset from the DataFrame with index and column names
df.ix[['one', 'five'], ['a', 'b', 'c', 'd']]
Out[8]:

       a   b   c   d
one    0   1   2   3
five  24  25  26  27
In [27]:
# use integer to select rows or columns does not work (use irow, icol or iloc instead)
df[1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-27-3a7a20322ab2> in <module>()
1 # use integer to select rows or columns does not work (use irow, icol or iloc instead)

----> 2 df[1]

/home/shm/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1682 return
self._getitem_multilevel(key)
1683 else:
-> 1684 return self._getitem_column(key)
1685
1686 def _getitem_column(self, key):

/home/shm/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
1689 # get column

1690 if self.columns.is_unique:
-> 1691 return self._get_item_cache(key)
1692
1693 # duplicate columns & possible reduce dimensionaility

/home/shm/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1050
res = cache.get(item)
1051 if res is None:
-> 1052 values = self._data.get(item)
1053 res = self._box_item_values(item, values)
1054 cache[item] = res

/home/shm/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item)
2535

2536 if not isnull(item):
-> 2537 loc = self.items.get_loc(item)
2538 else:
2539 indexer = np.arange(len(self.items))[isnull(self.items)]

/home/shm/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in get_loc(self, key)
1154
loc : int if unique index, possibly slice or mask if not
1155 """
-> 1156 return self._engine.get_loc(_values_from_object(key))
1157
1158 def get_value(self, series, key):

/home/shm/anaconda/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3650)()

/home/shm/anaconda/lib/python2.7/site-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3528)()

/home/shm/anaconda/lib/python2.7/site-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11908)()

/home/shm/anaconda/lib/python2.7/site-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11861)()

KeyError: 1

In [25]:
print df.iloc[1:3, 1:5]
print "------------------------"
print df.iloc[-1, :]
        b   c   d   e
two 7 8 9 10
three 13 14 15 16
------------------------
a 24
b 25
c 26
d 27
e 28
f 29
Name: five, dtype: int64
In [31]:
# reindex to change the order of rows or columns or both
newcolumns = list('fedabc')
df_reindex = df.reindex(['three', 'two', 'five', 'four', 'one', 'six'], columns = newcolumns, fill_value = 6)
df_reindex
Out[31]:

        f   e   d   a   b   c
three  17  16  15  12  13  14
two    11  10   9   6   7   8
five   29  28  27  24  25  26
four   23  22  21  18  19  20
one     5   4   3   0   1   2
six     6   6   6   6   6   6
In [39]:
# drop to discard rows or columns(axis = 1) 
# drop rows
print df.drop(['one', 'four'])
print '--------------------'
# drop columns
print df.drop('a', axis = 1)
        a   b   c   d   e   f
two 6 7 8 9 10 11
three 12 13 14 15 16 17
five 24 25 26 27 28 29
--------------------
b c d e f
one 1 2 3 4 5
two 7 8 9 10 11
three 13 14 15 16 17
four 19 20 21 22 23
five 25 26 27 28 29
In [42]:
# filtering: select * from data where c > 5;
df[df['c'] > 15]
Out[42]:

       a   b   c   d   e   f
four  18  19  20  21  22  23
five  24  25  26  27  28  29
In [54]:
# add two DF together, they will match automatically by index and columns; the rest is null
df2 = df.iloc[:3, :3] * .02
print df2 + df
print '--------------------------'
print df2.add(df, fill_value = -999)
           a      b      c   d   e   f
five NaN NaN NaN NaN NaN NaN
four NaN NaN NaN NaN NaN NaN
one 0.00 1.02 2.04 NaN NaN NaN
three 12.24 13.26 14.28 NaN NaN NaN
two 6.12 7.14 8.16 NaN NaN NaN
--------------------------
a b c d e f
five -975.00 -974.00 -973.00 -972 -971 -970
four -981.00 -980.00 -979.00 -978 -977 -976
one 0.00 1.02 2.04 -996 -995 -994
three 12.24 13.26 14.28 -984 -983 -982
two 6.12 7.14 8.16 -990 -989 -988
In [58]:
# DF with Series: broadcasting along row
series = df.ix['one'] * 2
df - series
Out[58]:

        a   b   c   d   e   f
one     0  -1  -2  -3  -4  -5
two     6   5   4   3   2   1
three  12  11  10   9   8   7
four   18  17  16  15  14  13
five   24  23  22  21  20  19
In [61]:
# DF with Series: broadcasting along column(axis = 0)
series2 = df['a'] * 2
df.sub(series2, axis = 0)
Out[61]:

         a    b    c    d    e    f
one      0    1    2    3    4    5
two     -6   -5   -4   -3   -2   -1
three  -12  -11  -10   -9   -8   -7
four   -18  -17  -16  -15  -14  -13
five   -24  -23  -22  -21  -20  -19
In [64]:
# apply a function to a row or a column from DF
f = lambda x: x.min() - x.max()
print df.apply(f)
print '------------------------'
print df.apply(f, axis = 1)
a   -24
b -24
c -24
d -24
e -24
f -24
dtype: int64
------------------------
one -5
two -5
three -5
four -5
five -5
dtype: int64
In [74]:
# sort_index to sort by rows or columns, in SAS it is:
# *proc sort data=df1;
# *   by descending c x;
# *run;
np.random.seed(9999)
df1 = pd.DataFrame(np.random.randn(5, 3), columns = list('xyz'))
df1['c'] = [1, 1, 1, 2, 3]
df1.sort_index(by = ['c', 'x'], ascending = [True, False])
Out[74]:

          x         y         z  c
2  1.864261 -1.371490  2.226275  1
1 -0.488405 -0.354953 -0.391813  1
0 -0.515039  0.591897  0.047873  1
3 -0.118110  0.119848  0.182599  2
4 -0.657739 -1.646232 -1.141584  3
In [81]:
# statistics
print df.mean()
print '--------------'
print df.mean(axis = 1)
print '--------------'
print df.cumsum()
a    12
b 13
c 14
d 15
e 16
f 17
dtype: float64
--------------
one 2.5
two 8.5
three 14.5
four 20.5
five 26.5
dtype: float64
--------------
a b c d e f
one 0 1 2 3 4 5
two 6 8 10 12 14 16
three 18 21 24 27 30 33
four 36 40 44 48 52 56
five 60 65 70 75 80 85
In [85]:
# idxmin, idxmax, argmin, argmax
print df.idxmin()
a    one
b one
c one
d one
e one
f one
dtype: object
In [89]:
# unique, value_counts(), isin
print df['a'].unique()
print '--------------------'
print df['a'].value_counts()
print '--------------------'
print df['a'].isin([0, 6, 7, 7])
[ 0  6 12 18 24]
--------------------
6 1
12 1
24 1
18 1
0 1
dtype: int64
--------------------
one True
two True
three False
four False
five False
Name: a, dtype: bool
In [101]:
# hierarchical indexing (multi-index)
df2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   index = [['a', 'a', 'b', 'b'], [1, 1, 2, 2]],
                   columns = [['ohio', 'ohio', 'colorado'], ['Green', 'Red', 'Green']])
df2
Out[101]:


        ohio      colorado
       Green  Red    Green
a 1        0    1        2
  1        3    4        5
b 2        6    7        8
  2        9   10       11
In [102]:
# swaplevel
df2.index.names = ['key1', 'key2']
df2.columns.names = ['state', 'color']
df2.swaplevel('key1', 'key2')
Out[102]:

state       ohio      colorado
color      Green  Red    Green
key2 key1
1    a         0    1        2
     a         3    4        5
2    b         6    7        8
     b         9   10       11
In [104]:
# statistics over axis
df2.sum(level = 'key2')
Out[104]:
state   ohio      colorado
color  Green  Red    Green
key2
1          3    5        7
2         15   17       19
In [107]:
# set_index: use columns from DF as the row index, drop = False to keep in DF
# reset_index: to change index to a column
df.set_index(['a', 'b'], drop = False)  # now column 'a' 'b' becomes index
Out[107]:


        a   b   c   d   e   f
a  b
0  1    0   1   2   3   4   5
6  7    6   7   8   9  10  11
12 13  12  13  14  15  16  17
18 19  18  19  20  21  22  23
24 25  24  25  26  27  28  29
August 26, 2014
 


1: shape, dtype, axis, indexing, slicing

In [30]:
import numpy as np
np.random.seed(9999)
array1 = np.random.randn(6, 3)  # np.random.randn creates a 6x3 array of standard normal values
# shape returns the dimensions of the array
array1
Out[30]:
array([[-0.51503939,  0.59189672,  0.0478734 ],
[-0.48840469, -0.35495284, -0.39181335],
[ 1.86426106, -1.37148975, 2.22627536],
[-0.11810965, 0.11984837, 0.18259889],
[-0.65773926, -1.64623164, -1.14158407],
[-1.43908939, 0.74499231, -1.73922014]])
In [31]:
array1.shape
Out[31]:
(6L, 3L)
In [32]:
# dtype returns the dtype of the array
array1.dtype
Out[32]:
dtype('float64')
In [33]:
# np.astype to change dtype
array2 = array1.astype(np.float32)
array2.dtype
Out[33]:
dtype('float32')
In [34]:
# axis = 0 aggregates down the rows and returns one result per column
np.sum(array1, axis = 0)
Out[34]:
array([-1.35412131, -1.91593683, -0.81586991])
In [35]:
# slicing is similar to the usage in R
# slicing(it is the view of the original array): subset array
array1[:5, :3]
Out[35]:
array([[-0.51503939,  0.59189672,  0.0478734 ],
[-0.48840469, -0.35495284, -0.39181335],
[ 1.86426106, -1.37148975, 2.22627536],
[-0.11810965, 0.11984837, 0.18259889],
[-0.65773926, -1.64623164, -1.14158407]])
In [36]:
array1[[-2, -3, -5]]
Out[36]:
array([[-0.65773926, -1.64623164, -1.14158407],
[-0.11810965, 0.11984837, 0.18259889],
[-0.48840469, -0.35495284, -0.39181335]])
In [37]:
array2 = np.arange(12).reshape(3, 4)
array2
Out[37]:
array([[ 0,  1,  2,  3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [38]:
array2.T
Out[38]:
array([[ 0,  4,  8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])