What is an N-gram?

by Stephen M. Walker II, Co-Founder / CEO


An N-gram is a contiguous sequence of 'n' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. For instance, in the domain of text analysis, if 'n' is 1, we call it a unigram; if 'n' is 2, it is a bigram; if 'n' is 3, it is a trigram, and so on.

N-grams are used in various areas of computational linguistics and text analysis. They are a simple and effective method for text mining and natural language processing (NLP) tasks, such as text prediction, spelling correction, language modeling, and text classification.

For example, consider the sentence "The cow jumps over the moon". If N=2 (known as bigrams), then the n-grams would be: "the cow", "cow jumps", "jumps over", "over the", "the moon".

In the context of language modeling, an N-gram model predicts the occurrence of a word based on the occurrence of its N – 1 previous words. For instance, a bigram model (N = 2) predicts the occurrence of a word given only its previous word (as N – 1 = 1). Similarly, a trigram model (N = 3) predicts the occurrence of a word based on its previous two words (as N – 1 = 2).
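The counting behind such a model can be sketched in a few lines of plain Python. The toy corpus below is illustrative (not from the article); the idea is that the most frequent follower of a word becomes the prediction.

```python
from collections import Counter, defaultdict

# A minimal bigram language model sketch: estimate the likeliest next
# word from raw bigram counts over a tiny toy corpus.
corpus = "the cow jumps over the moon and the cow sleeps".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(prev_word):
    """Return the most frequent word observed after prev_word."""
    following = bigram_counts[prev_word]
    return following.most_common(1)[0][0] if following else None

print(predict_next("the"))  # "cow" follows "the" twice, "moon" only once -> "cow"
```

A real language model would normalize these counts into probabilities and apply smoothing for unseen bigrams, but the count table is the core of the approach.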

N-grams can be used to predict the next item in a sequence, making them useful for language models in speech recognition, typing prediction, and other generative tasks. They can also serve as features for algorithms that classify documents into categories, such as spam filters or sentiment analysis.

In Python, you can generate N-grams using the NLTK library or by defining a function that splits sentences into tokens and collects the N-grams.
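A minimal version of such a function, assuming simple whitespace tokenization, might look like this (NLTK's `nltk.util.ngrams` provides the same operation over token sequences):

```python
def ngrams(text, n):
    """Split text into lowercase tokens and return the list of n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("The cow jumps over the moon", 2))
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
```

Setting `n=1` returns the unigrams, `n=3` the trigrams, and so on.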

How are n-grams used in sentiment analysis?

N-grams are used in sentiment analysis to capture the context of words and phrases, which is crucial for understanding the overall sentiment of a text. Unlike a simple bag of words, which discards word order entirely, N-gram features preserve local word order, and that local context often carries the sentiment signal.

In sentiment analysis, N-grams are used to identify and analyze the sentiment of phrases, not just individual words. For example, the bigram "not bad" has a positive sentiment, contrary to what might be inferred if the words "not" and "bad" were analyzed separately.

N-grams can be used as features in machine learning models for sentiment analysis. For instance, a bag-of-n-grams representation of a document can be created by counting the frequency of each N-gram, just as a bag of words counts individual words. This approach can improve the performance of sentiment analysis models; in one case study, a model trained with N-grams ranging from unigrams to 5-grams achieved the best performance.
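The bag-of-n-grams representation described above can be sketched as a frequency count over a range of n values. This is a minimal illustration assuming whitespace tokenization; libraries such as scikit-learn expose the same idea through an `ngram_range` parameter.

```python
from collections import Counter

def bag_of_ngrams(text, n_range=(1, 2)):
    """Count every n-gram with n in the given inclusive range."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_range[0], n_range[1] + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

features = bag_of_ngrams("not bad not bad at all")
print(features["not bad"])  # the bigram "not bad" occurs twice
```

Note how quickly the feature space grows: every distinct unigram and bigram becomes its own dimension, which is exactly the sparsity concern raised below.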

However, it's important to note that using N-grams can significantly increase the feature space, which might lead to more sparse input features and potentially hamper model performance. Therefore, the optimal range of N-grams to use may vary depending on the specific task and dataset, and it's often determined through experimentation.

In Python, you can perform sentiment analysis using N-grams with libraries such as NLTK. For example, you can use the ngrams function from NLTK to generate N-grams, and then use a machine learning model, such as Naive Bayes, for sentiment classification.
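To make the pipeline concrete, here is a self-contained sketch of Naive Bayes over bigram features, written in plain Python rather than NLTK so it has no dependencies. The four training sentences are made up for illustration; a real model would be trained on a labeled corpus.

```python
import math
from collections import Counter, defaultdict

def bigrams(text):
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

# Tiny illustrative training set (invented examples, not real data).
train = [
    ("not bad at all", "pos"),
    ("really not bad", "pos"),
    ("very bad movie", "neg"),
    ("bad and boring", "neg"),
]

# Count bigram occurrences per class.
class_counts = Counter()
feature_counts = defaultdict(Counter)
for text, label in train:
    class_counts[label] += 1
    for bg in bigrams(text):
        feature_counts[label][bg] += 1

def classify(text):
    """Naive Bayes with add-one smoothing over bigram features."""
    vocab = {bg for counts in feature_counts.values() for bg in counts}
    scores = {}
    for label in class_counts:
        total = sum(feature_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for bg in bigrams(text):
            score += math.log((feature_counts[label][bg] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("not bad"))  # "not bad" appears only in positive examples -> "pos"
```

Because the classifier sees the bigram "not bad" as a single feature, it captures the positive reading that unigram features would miss.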

More terms

What is description logic?

Description Logic (DL) is a family of formal knowledge representation languages. It is used to represent and reason about the knowledge of an application domain. DLs are more expressive than propositional logic but less expressive than first-order logic. However, unlike first-order logic, the core reasoning problems for DLs are usually decidable, and efficient decision procedures have been designed and implemented for these problems.


What is SPARQL?

SPARQL is a robust query language specifically designed for querying and manipulating data stored in the Resource Description Framework (RDF) format, which is a standard for representing information on the Semantic Web. In the context of AI, SPARQL's ability to uncover patterns and retrieve similar data from large RDF datasets is invaluable. It facilitates the extraction of pertinent information, generation of new RDF data for AI model training and testing, and evaluation of AI models for enhanced performance.

