April 2, 2017
 

python ngram


# -*- coding: utf-8 -*-
# @DATE    : 2017/4/1 10:39
# @Author  : 
# @File    : ngram.py
from collections import defaultdict


def gen_n_gram(input, sep=" ", n=2):
    input = input.split(sep)
    output = {}
    for i in range(len(input) - n + 1):
        gram = "".join(input[i: i + n])
        output.setdefault(gram, 0)
        output[gram] += 1
    return output


def dict_sum(*dicts):
    ret = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            ret[k] += v
    return ret


def sum_n_gram(inputs, sep=" ", n=2):
    output_sum = defaultdict(int)
    for input in inputs:
        output_sum = dict_sum(output_sum, gen_n_gram(input, sep, n))
    output_sum = sorted(output_sum.items(), key=lambda x: x[1], reverse=True)
    return output_sum


if __name__ == "__main__":
    inputs = ["a a a j 9 3 h d e", "a j 9 3 h", "g g h 9 3"]
    print(gen_n_gram("a a a j 9 3 h d e"))
    output = sum_n_gram(inputs)
    print(output)
    output_file = "dict.txt"
    cnt = len(output)
    with open(output_file, "w") as out:
        for i, value in enumerate(output):
            if i + 1 < cnt:
                out.write("{}:{}\n".format(value[0], value[1]))
            else:
                out.write("{}:{}".format(value[0], value[1]))

Run log


{'aa': 2, 'de': 1, 'j9': 1, 'aj': 1, '3h': 1, '93': 1, 'hd': 1}
[('93', 3), ('aa', 2), ('aj', 2), ('j9', 2), ('3h', 2), ('de', 1), ('gg', 1), ('h9', 1), ('hd', 1), ('gh', 1)]

Process finished with exit code 0
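
As a cross-check of the counts in the log above, the same totals can be reproduced with `collections.Counter` from the standard library. This is only a sketch, not the post's implementation: the names `count_ngrams` and `total_ngrams` are hypothetical, and Python 3 is assumed.

```python
from collections import Counter


def count_ngrams(text, sep=" ", n=2):
    """Count n-grams of sep-separated tokens, joined without a separator."""
    tokens = text.split(sep)
    return Counter("".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def total_ngrams(texts, sep=" ", n=2):
    """Sum n-gram counts over several inputs, most frequent first."""
    total = Counter()
    for text in texts:
        total.update(count_ngrams(text, sep, n))
    return total.most_common()


inputs = ["a a a j 9 3 h d e", "a j 9 3 h", "g g h 9 3"]
totals = total_ngrams(inputs)
print(totals[0])  # ('93', 3)
```

`Counter.update` adds counts rather than replacing them, so it plays the role of `dict_sum` here, and `most_common()` replaces the explicit `sorted` call.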

 
 Posted at 12:32 PM
April 1, 2017
 

Security analytics has gotten a lot of attention in the industry the last few years. That’s not surprising. After all, security analytics can help organizations: Transition from reactive threat firefighting to proactive security risk management. Exploit all available security data to develop better insights and priorities. Maximize the effectiveness of [...]

Security analytics skeptics have nothing to fear was published on SAS Voices by Liz Goldberg

April 1, 2017
 

[Image: Solar farm on SAS campus]

The full text of Fermat's statement, written in Latin, reads "Cubum autem in duos cubos, aut quadrato-quadratum in duos quadrato-quadratos, et generaliter nullam in infinitum ultra quadratum potestatem in duos eiusdem nominis fas est dividere cuius rei demonstrationem mirabilem sane detexi. Hanc marginis exiguitas non caperet."

The English translation is: "It is impossible for a cube to be the sum of two cubes, a fourth power to be the sum of two fourth powers, or in general for any number that is a power greater than the second to be the sum of two like powers. I have discovered a truly marvelous demonstration of this proposition that this margin is too narrow to contain."

Here at SAS, we don’t take challenges lightly. After a short but intensive brainstorming, we came up with a creative and powerful SAS code that effectively proves this long-standing theorem. And it is so simple and short that not only can it be written on the margins of this blog, it can be tweeted!

Drum roll, please!

Here is the SAS code:

data _null_; 
	do n=3 by 1; 
		do a=1 by 1; 
			do b=1 by 1; 
				do c=1 by 1; 
					e = a**n + b**n = c**n;
					if e then stop; 
				end;
			end; 
		end;
	end;
run;

Or written compactly, without unnecessary spaces:

data _null_;do n=3 by 1;do a=1 by 1;do b=1 by 1;do c=1 by 1;e=a**n+b**n=c**n;if e then stop;end;end;end;end;run;

which is exactly 112 characters long – well below the Twitter 140-character threshold.
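
The character count is easy to confirm. The snippet below is just a quick check in Python (not part of the original post); it copies the compact one-liner into a string and measures it.

```python
# The compact SAS one-liner from the post, split across two string literals.
compact = ("data _null_;do n=3 by 1;do a=1 by 1;do b=1 by 1;do c=1 by 1;"
           "e=a**n+b**n=c**n;if e then stop;end;end;end;end;run;")

print(len(compact))         # 112
print(len(compact) <= 140)  # True
```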

Don’t be fooled by the utter simplicity and seeming unfeasibility of this code.  For the naysayers, let me clarify that we run this code in a distributed multithreaded environment where each do-loop runs as a separate thread.

We also use some creative coding techniques:

1.     Do-loop with just two options, count= and by=, but without the to= option (e.g. do c=1 by 1;). It is a valid syntax in SAS and serves the purpose of creating infinite loops when they are necessary (like in this case). You can easily test it by running the following SAS code snippet:

data _null_;
	start = datetime();
	do i=1 by 1;
		if intck('sec',start,datetime()) ge 20 then leave;
	end;
run;

The if-statement here is added solely to specify a wait time (e.g. 20 seconds) long enough to persuade you of the loop’s infiniteness. Skeptics may increase this number to their comfort level or even remove (or comment out) the if-statement and enjoy the unconstrained eternity.
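
For readers more at home in Python, a rough analogue of this open-ended loop might look like the sketch below. It is an illustration rather than a translation of the SAS semantics: `spin_for` is a hypothetical helper, `itertools.count(1)` stands in for `do i=1 by 1;`, and `break` stands in for `leave;`.

```python
import itertools
import time


def spin_for(seconds):
    """Loop with no upper bound, leaving once `seconds` have elapsed."""
    start = time.monotonic()
    for i in itertools.count(1):  # counts 1, 2, 3, ... with no end value
        if time.monotonic() - start >= seconds:
            break                 # the only way out of the loop
    return i


# A short wait keeps the demonstration quick; raise it to taste.
iterations = spin_for(0.05)
print(iterations >= 1)
```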

2.     Expression with two “=” signs in it (e.g. e = a**n + b**n = c**n;). Again, this is a perfectly valid expression in SAS: the first “=” is an assignment and the second is a comparison, so the variable receives the value 0 or 1 resulting from the logical comparison. This expression can be rewritten as

e = a**n + b**n eq c**n;

or even more explicitly as

e = (a**n + b**n eq c**n);
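
The same idea, storing the result of a comparison as 0 or 1, can be sketched in Python. Python writes the comparison as `==`, so the two roles of “=” are visually distinct; `fermat_flag` is a hypothetical name used only for this illustration.

```python
def fermat_flag(a, b, c, n):
    """Return 1 if a**n + b**n equals c**n, otherwise 0."""
    # The comparison yields a bool; int() turns it into 0 or 1.
    return int(a**n + b**n == c**n)


print(fermat_flag(3, 4, 5, 2))  # 1, since 9 + 16 == 25
print(fermat_flag(3, 4, 5, 3))  # 0, since 27 + 64 != 125
```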

As long as the code runs, the theorem is considered proven. If it stops, then the theorem is false.

You can try running this code on your hardware, at your own risk, of course.
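
If you would rather run something that terminates, here is a bounded Python sketch of the same search. It is not the code from this post: the function name `find_counterexample` and the limits `min_n`, `max_n` and `max_base` are assumptions chosen for illustration, and a finite search naturally proves nothing about the general theorem.

```python
def find_counterexample(min_n=3, max_n=6, max_base=30):
    """Search a, b in 1..max_base for a**n + b**n == c**n.

    Returns (a, b, c, n) for the first hit, or None if there is none.
    """
    for n in range(min_n, max_n + 1):
        # a**n + b**n <= 2 * max_base**n < (2 * max_base)**n for n >= 2,
        # so any matching c is below 2 * max_base.
        powers = {c**n: c for c in range(1, 2 * max_base + 1)}
        for a in range(1, max_base + 1):
            for b in range(a, max_base + 1):  # b >= a skips mirrored pairs
                c = powers.get(a**n + b**n)
                if c is not None:
                    return (a, b, c, n)
    return None


print(find_counterexample())                  # None
print(find_counterexample(min_n=2, max_n=2))  # (3, 4, 5, 2)
```

Python integers are exact at any size, so the equality test here is not affected by floating-point rounding.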

We have a dedicated 128-processor UNIX server powered by an on-campus solar farm that has been autonomously running the above code for 40 years now, and there was not a single instance when it stopped running. Except pausing for the scheduled maintenance and equipment replacements.

During the course of this historic experience, we have accumulated an unprecedented amount of big data (all in-memory), converted it into event stream processing, and become a leader in data mining and business analytics.

This leads us to the following scientific conclusion: whether you are a pure mathematician or an empiricist, you can rest assured that Fermat's Last Theorem has been proven with a probability asymptotic to 1 beyond a reasonable doubt.

Have a happy 91st day of the year 2017!

 

SAS code to prove Fermat's Last Theorem was published on SAS Users.

April 1, 2017
 

There's an old song that starts out, "You Can Get Anything You Want at Alice's Restaurant."  Well, maybe you are too young to know that song, but if you’re a SAS user, you’ll be glad to know that you can capture anything produced by any SAS procedure (even if the [...]

The post Capturing output from any procedure with an ODS OUTPUT statement appeared first on SAS Learning Post.

April 1, 2017
 

March Madness is in full swing. And the success of the Dance Card formula powered by SAS -- along with stories about teams like the New York Mets, the Boston Bruins, the Orlando Magic and more, all using analytics -- demonstrates how sports and analytics are becoming more and more [...]

How to make sense of the Madness in March was published on SAS Voices by Hwa Truong

March 31, 2017
 

If you're into data visualization, here's something that might interest you - a free eBook showing several ways to use SAS to visually analyze your data. (Did I mention it's FREE?!?!) We've picked juicy chapters from several books and upcoming books (and a few other sources), to show you what [...]

The post How about a free eBook on data visualization using SAS! appeared first on SAS Learning Post.

March 31, 2017
 

Until the robotic overlords take over, you need people — not just technology and data — to drive growth and innovation in your analytics programs. But how can you plan for the talent you need today and the talent you'll need in the future as your goals and your use [...]

5 mistakes to avoid when devising (or revising) a talent management strategy was published on SAS Voices by Analise Polsky

March 31, 2017
 

The U.S. Marshals Service is the federal agency known for bringing wanted fugitives to justice. Often, the Marshals Service gets attention for these arrests, but once the publicity has died down they face a basic challenge --- where to put the individuals in their custody. The agency uses data to [...]

U.S. Marshals Service use analytics to save more than $200 million was published on SAS Voices by Steve Bennett

March 30, 2017
 

Users frequently ask how to plot their data as markers on a map. There are several ways to do this using SAS software. If you're a Visual Analytics user, you can do it using a point-and-click interface. But if you're a coder, you might need a little help... In this [...]

The post Plotting markers on a map at zip code locations, using GMap or SGplot appeared first on SAS Learning Post.