In natural language processing, word vectors play a key role in making technologies such as machine translation and speech recognition possible. A word vector is a row of numeric values in which each value captures one dimension of the word’s meaning. Each value represents how closely the word relates to the concept behind that dimension, so the semantics of the word are embedded across the dimensions of the vector. Because similar words have similar vectors, representing words as vectors simplifies and unifies the operations performed on them.
Word vectors are generated by training on word-word co-occurrence statistics from a large corpus. You can also use pre-trained word vectors such as GloVe, provided by Stanford University.
Let's talk about how to transform word vector tables from long to wide in SAS, so we can potentially get sentence vectors to process further. Suppose we generate word vectors from the following 3 sentences:
Jack went outside.
Jill likes to draw in the afternoon.
Tony is a boy.
Each word has 2 numeric values (Vector1, Vector2); each value represents how closely the word relates to the concept defined by that dimension. The number of values per word (here VNUM=2) may range from hundreds to thousands in real text analysis scenarios.
The sample code below generates the sample table and sorts it for further processing.
    data HAVE;
      length Word $ 45;
      input SentenceID Word Vector1-Vector2; /* Vector1-Vector300 or more in real data */
      datalines;
    1 Jack 0.24011 0.400996
    1 went -0.047581 0.868716
    1 outside -1.197891 1.162238
    2 Jill -0.199579 0.251252
    2 likes -1.935640 -0.288264
    2 to -0.526053 -1.143420
    2 draw -0.736289 -0.794812
    2 in -2.757234 0.506639
    2 the -0.736289 -0.794812
    2 afternoon -0.047581 0.868716
    3 Tony 0.34032 0.600983
    3 is 0.147531 0.968817
    3 a 1.347543 2.568323
    3 boy -3.257891 3.172238
    ;
    run;

    proc sort data=HAVE;
      by SentenceID;
    run;

    proc print data=HAVE; run;
If we want to transform this long table into a wide table with one row per sentence, how can we do this as efficiently and simply as possible? The 14 words above belong to 3 sentences, so the result has 3 rows and 22 columns (1 + WNUM + WNUM x VNUM = 1 + 7 + 7 x 2 = 22), where WNUM is the maximum number of words in a sentence.
Please note that we can calculate WNUM at runtime with the SAS code below. For the example above, the value of WNUM is 7.
    proc sql noprint;
      select max(count) into :wnum
      from (
        select count(Word) as count
        from HAVE
        group by SentenceID
      );
    quit;
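VNUM can be derived at runtime in a similar way. The following is a minimal sketch, assuming the table is WORK.HAVE, that the vector columns are the only ones whose names start with "Vector", and that &wnum has already been set by the query above. The macro variable name ncols is illustrative.

```sas
/* Count the source columns whose names start with "Vector" (assumed naming convention) */
proc sql noprint;
  select count(*) into :vnum trimmed
  from dictionary.columns
  where libname='WORK' and memname='HAVE'
    and upcase(name) like 'VECTOR%';
quit;

/* Expected width of the wide table: 1 ID column + WNUM words + WNUM x VNUM values */
%let ncols=%eval(1 + &wnum + &wnum*&vnum);
%put NOTE: vnum=&vnum ncols=&ncols;
```

For the sample table this should report vnum=2 and ncols=22, matching the arithmetic above.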
In fact, we don’t need any SAS PROC to handle this kind of transformation. A SAS DATA step provides an efficient and convenient way to transform data. The key is to use an ARRAY to map all word vectors from the source table, and then define two ARRAYs to store the output words and vectors in wide form. These two output arrays must be RETAINed across iterations of the DATA step's implicit loop and KEPT for OUTPUT when the step reaches the last row of each SentenceID.
You can see the full SAS code below with detailed comments.
    /* Long table to wide table */
    %let vnum=2; /* Number of vector values per word */
    %let wnum=7; /* Maximum number of words in a sentence */

    data WANT;
      set HAVE;
      by Sentenceid;
      array _vector_ [*] vector:;            /* Maps to the source vectors */
      array _word [%eval(1*&wnum)] $ 45;     /* Stores the words in the wide table */
      array _vector [%eval(&wnum*&vnum)];    /* Stores the vectors in the wide table */
      retain _word: _vector:;                /* RETAIN across the implicit loop */
      retain _offset_ 0;                     /* Offset of a word in a sentence, base 0 */

      if first.Sentenceid then do;
        call missing(of _word[*]);
        call missing(of _vector[*]);
        _offset_=0;
      end;
      else _offset_=_offset_+1;

      _word[_offset_+1]=word;                /* Cache the current word at [_offset_+1] */
      do i=1 to dim(_vector_);               /* Cache each value at [_offset_*&vnum+i] */
        _vector[_offset_*&vnum+i]=_vector_[i];
      end;

      keep Sentenceid _word: _vector:;       /* Columns kept in the output */
      if last.Sentenceid then output;        /* Output the cached words and vectors */
    run;

    proc print data=WANT; run;
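The index arithmetic in the step above may be easier to follow in isolation. The following sketch lists which element of the wide _vector array each value lands in, using the same formula _offset_ * &vnum + i. The data set name MAP and the variables WordPos and Slot are illustrative and not part of the original code, and WNUM is set to 3 only to keep the listing short.

```sas
%let vnum=2;  /* values per word */
%let wnum=3;  /* small illustrative maximum, not the 7 used above */

data MAP;
  do _offset_=0 to &wnum-1;      /* zero-based word offset within a sentence */
    WordPos=_offset_+1;          /* one-based position, i.e. the index into _word */
    do i=1 to &vnum;             /* i-th value of the current word */
      Slot=_offset_*&vnum+i;     /* index into the wide _vector array */
      output;
    end;
  end;
  keep WordPos i Slot;
run;

proc print data=MAP noobs; run;
```

With two values per word, word 1 occupies _vector1 and _vector2, word 2 occupies _vector3 and _vector4, and so on.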
Accordingly, if we need to transform a word vector table back from wide form to long form, we need to generate up to &WNUM rows, each with &VNUM vector columns, for every sentence. This is simply the reverse of the logic above. The full SAS code with detailed comments is listed below:
    /* Wide table to long table */
    data HAVE2;
      set WANT;
      array _word [*] _word:;          /* Maps to the words in the wide table */
      array _vector_ [*] _vector:;     /* Maps to the vectors in the wide table */
      length Word $ 45;                /* Output word in the long table */
      array Vector[&vnum];             /* Output vectors in the long table */
      keep Sentenceid Word Vector:;

      do i=1 to &wnum;                 /* Unpack each word from array _word */
        Word=_word[i];
        if Word=" " then continue;     /* Skip unused slots in shorter sentences */
        do j=1 to &vnum;               /* Unpack the word's values from array _vector_ */
          Vector[j]=_vector_[j + &vnum*(i-1)];
        end;
        output;                        /* One wide row yields up to &wnum long rows */
      end;
    run;

    proc print data=HAVE2; run;
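One way to check that the round trip is lossless is to compare the rebuilt long table with the original. This is a minimal sketch, assuming HAVE and HAVE2 were created by the steps above; the macro variable name rc is illustrative.

```sas
/* Compare the original long table with the one rebuilt from the wide table */
proc compare base=HAVE compare=HAVE2;
run;

%let rc=&sysinfo;   /* capture immediately; later steps overwrite SYSINFO */
%put NOTE: PROC COMPARE return code is &rc (0 indicates no differences found).;
```

A nonzero return code would suggest that some words or values were lost or reordered, for example if WNUM was set lower than the longest sentence.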
To make the bidirectional transformation above reusable in text analysis, we provide two SAS macros, invoked as follows:
    %Long2Wide(data=Have, vnum=2, wnum=7, sid=SentenceId, word=Word, out=Want);
    proc print data=Want; run;

    %Wide2Long(data=Want, vnum=2, wnum=7, sid=Sentenceid, out=Have2, outword=Word, outvector=Vector);
    proc print data=Have2; run;
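The macro definitions themselves are not reproduced here. As a rough idea of what they might look like, the following is a hypothetical sketch that simply parameterizes the two DATA steps shown earlier. The parameter names are taken from the calls above; everything else (the $ 45 word length, the "Vector" prefix on the source columns, input already sorted by the ID variable, and the helper names _w, _v, _i, _j) is an assumption. The definitions would need to be compiled before the calls are run.

```sas
/* Hypothetical sketch, not the original macros */
%macro Long2Wide(data=, vnum=, wnum=, sid=, word=, out=);
  data &out;
    set &data;                           /* assumed sorted by &sid */
    by &sid;
    array _vector_ [*] vector:;          /* source values; assumes a "Vector" name prefix */
    array _word [&wnum] $ 45;            /* wide words; assumed maximum length of 45 */
    array _vector [%eval(&wnum*&vnum)];  /* wide values */
    retain _word: _vector:;
    retain _offset_ 0;
    if first.&sid then do;
      call missing(of _word[*]);
      call missing(of _vector[*]);
      _offset_=0;
    end;
    else _offset_=_offset_+1;
    _word[_offset_+1]=&word;
    do i=1 to dim(_vector_);
      _vector[_offset_*&vnum+i]=_vector_[i];
    end;
    keep &sid _word: _vector:;
    if last.&sid then output;
  run;
%mend Long2Wide;

%macro Wide2Long(data=, vnum=, wnum=, sid=, out=, outword=Word, outvector=Vector);
  data &out;
    set &data;
    array _w [*] _word:;                 /* wide words */
    array _v [*] _vector:;               /* wide values */
    length &outword $ 45;
    array &outvector [&vnum];            /* creates &outvector.1 to &outvector.&vnum */
    keep &sid &outword &outvector:;
    do _i=1 to &wnum;
      &outword=_w[_i];
      if missing(&outword) then continue;  /* skip unused slots */
      do _j=1 to &vnum;
        &outvector[_j]=_v[(_i-1)*&vnum+_j];
      end;
      output;
    end;
  run;
%mend Wide2Long;
```

Keeping the ID, word, and output names as parameters lets the same logic be applied to tables whose columns are named differently from the sample.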
We have demonstrated how to transform a word vector table from long form to wide form (and back) efficiently with a SAS DATA step. We have also provided two SAS macros that wrap the process for reuse. To learn more, please check out these additional resources:
- Word scatter plot with SAS
- How to calculate Word Mover's Distance with SAS
- Break a sentence into words in SAS