Many of us in the SAS world are responsible for the data that feeds the various business intelligence, analytics and business solutions provided by SAS. We’ve been involved in data integration, data migration, data quality, ETL / ELT, data access projects that support SAS solutions or are used for other [...]
Seth Grimes posted a fantastic article on Text Data Quality yesterday. A must read for anyone in this space. The article points to some of the text quality issues I have mentioned in my last two blogs. Text is in a league of its own when it comes to data quality. And the more you have to work with social media generated data, the more you will run into non-standard text and the need for text cleansing. I presented a workshop where I talked about "The Ten Transgressions of Text" at the Text Analytics Summit in June:
4. Shrt-hnd or clipped text (e.g. hmm tink nid >2 twitter acs; els msgs all jumbled up btwn personal & thots! dilemma!)
6. !!NOISY TEXT!!
8. ♪ Voice ♫
9. Email / Attachments
10. Poor grammar
Customers ask me if we can automatically remove profanity from documents and, yes, WE CAN!
My interest in the sorts of shortened/clipped texts that you get in text messages or via Twitter is huge. There is a lot text analytics users and vendors can do to work with this data. Terms like "cul8r" (see you later), or "LOL" (laughing out loud / lots of love) could be expanded into their intended forms, mapped to other synonyms(we provide ontologies to handle this), or left as is. When a shortened term can mean several different things depending on context, that's when the linguistics can help. I see a big need for including this new 'language' into standard language dictionaries.
Adhering to the standard rules of grammar looks like a thing of the past. As traditional print media loses favor, so will standard grammar in social media (blogs, micro-blogs such as FaceBook, Twitter, Bebo etc.). I'm excited to see how other natural language processing technologies will change to accommodate the new breed of user.