Traditionally, text classification has relied on bag-of-words count features. For some experiments, I wondered whether n-gram counts could make a good feature set. Once I generated the features, I knew I was in trouble: for the WSJ corpus, a trigram model produced about 20 million features. I then checked the literature and found a paper reporting that n-gram features don't help much:
Johannes Fürnkranz, "A Study Using n-gram Features for Text Categorization"
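To make the blow-up concrete, here is a minimal sketch of the kind of n-gram count features I mean, using scikit-learn's CountVectorizer. The three documents are toy stand-ins rather than the WSJ data, and scikit-learn is just one convenient way to do this:

# N-gram count features with scikit-learn's CountVectorizer.
# Even on three toy sentences the feature count grows quickly with n;
# on a full corpus like the WSJ it reaches tens of millions.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the market rallied on strong earnings",
    "earnings reports lifted the market",
    "the court ruled on the appeal",
]

for n in (1, 2, 3):
    vectorizer = CountVectorizer(ngram_range=(1, n))  # all n-grams up to length n
    X = vectorizer.fit_transform(docs)                # sparse document-term count matrix
    print(f"up to {n}-grams: {X.shape[1]} features")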
Bigram and trigram features may give modest gains, but feature selection is clearly required. Selecting features by document frequency or term frequency would be a simple approach, as in the sketch below.
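Here is what that pruning might look like with the same vectorizer: min_df enforces a document-frequency cutoff (an n-gram must appear in at least that many documents), and max_features keeps only the terms with the highest corpus-wide counts. The thresholds here are purely illustrative.

# Simple frequency-based feature selection on top of n-gram counting.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the market rallied on strong earnings",
    "earnings reports lifted the market",
    "the court ruled on the appeal",
]

vectorizer = CountVectorizer(
    ngram_range=(1, 3),
    min_df=2,            # document frequency: drop n-grams seen in fewer than 2 documents
    max_features=50000,  # term frequency: keep at most the 50,000 most frequent n-grams
)
X = vectorizer.fit_transform(docs)
print(f"{X.shape[1]} features after pruning")

On a corpus the size of the WSJ, a document-frequency cutoff alone typically eliminates the bulk of the trigram vocabulary, since most trigrams occur in only a single document.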