What is an N-gram?
by Stephen M. Walker II, Co-Founder / CEO
What is an N-gram?
An N-gram is a contiguous sequence of 'n' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. For instance, in the domain of text analysis, if 'n' is 1, we call it a unigram; if 'n' is 2, it is a bigram; if 'n' is 3, it is a trigram, and so on.
N-grams are used in various areas of computational linguistics and text analysis. They are a simple and effective method for text mining and natural language processing (NLP) tasks, such as text prediction, spelling correction, language modeling, and text classification.
For example, consider the sentence "The cow jumps over the moon". If N=2 (known as bigrams), then the n-grams would be: "the cow", "cow jumps", "jumps over", "over the", "the moon".
In the context of language modeling, an N-gram model predicts the occurrence of a word based on the occurrence of its N – 1 previous words. For instance, a bigram model (N = 2) predicts the occurrence of a word given only its previous word (as N – 1 = 1). Similarly, a trigram model (N = 3) predicts the occurrence of a word based on its previous two words (as N – 1 = 2).
N-grams can be used to predict the next item in a sequence, making them useful for language models in speech recognition, typing prediction, and other generative tasks. They can also serve as features for algorithms that classify documents into categories, such as spam filters or sentiment analysis.
In Python, you can generate N-grams using the NLTK library or by defining a function that splits sentences into tokens and collects the N-grams.
How are n-grams used in sentiment analysis?
N-grams are used in sentiment analysis to capture the context of words and phrases, which is crucial for understanding the overall sentiment of a text. They can provide features that are insensitive to the ordering of words in a sentence, which is helpful for tasks such as sentiment analysis.
In sentiment analysis, N-grams are used to identify and analyze the sentiment of phrases, not just individual words. For example, the bigram "not bad" has a positive sentiment, contrary to what might be inferred if the words "not" and "bad" were analyzed separately.
N-grams can be used as features in machine learning models for sentiment analysis. For instance, a bag of words representation of a document can be created by counting the frequency of each N-gram. This approach can improve the performance of sentiment analysis models, as demonstrated in a case study where a model trained with N-grams ranging from unigrams to 5-grams achieved the best performance.
However, it's important to note that using N-grams can significantly increase the feature space, which might lead to more sparse input features and potentially hamper model performance. Therefore, the optimal range of N-grams to use may vary depending on the specific task and dataset, and it's often determined through experimentation.
In Python, you can perform sentiment analysis using N-grams with libraries such as NLTK. For example, you can use the ngrams
function from NLTK to generate N-grams, and then use a machine learning model, such as Naive Bayes, for sentiment classification.