What is an N-gram?

by Stephen M. Walker II, Co-Founder / CEO

An N-gram is a contiguous sequence of 'n' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. For instance, in the domain of text analysis, if 'n' is 1, we call it a unigram; if 'n' is 2, it is a bigram; if 'n' is 3, it is a trigram, and so on.

N-grams are used in various areas of computational linguistics and text analysis. They are a simple and effective method for text mining and natural language processing (NLP) tasks, such as text prediction, spelling correction, language modeling, and text classification.

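For example, consider the sentence "The cow jumps over the moon". If N = 2 (bigrams), then after lowercasing the n-grams are: "the cow", "cow jumps", "jumps over", "over the", "the moon". A sentence of six words yields five bigrams, since each bigram starts at one of the first five words.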

In the context of language modeling, an N-gram model predicts the occurrence of a word based on the occurrence of its N – 1 previous words. For instance, a bigram model (N = 2) predicts the occurrence of a word given only its previous word (as N – 1 = 1). Similarly, a trigram model (N = 3) predicts the occurrence of a word based on its previous two words (as N – 1 = 2).
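As a rough illustration of the bigram case, the sketch below estimates the probability of a word given its previous word by counting adjacent pairs. The toy corpus and the function names `train_bigram_model` and `next_word_probability` are made up for this example, and the estimate is a plain maximum-likelihood ratio with no smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for each word, how often each next word follows it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair every token with its successor: (w1, w2), (w2, w3), ...
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following


def next_word_probability(model, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    counts = model.get(prev)  # .get avoids creating an entry for unseen words
    if not counts:
        return 0.0
    return counts[word] / sum(counts.values())


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cow jumps over the moon",
    "the cow eats grass",
    "the cow sleeps",
]
model = train_bigram_model(corpus)

# "the" is followed by "cow" three times and "moon" once, so P(cow | the) = 3/4.
print(next_word_probability(model, "the", "cow"))  # 0.75
```

A trigram model would follow the same pattern, keyed on the previous two words instead of one.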

N-grams can be used to predict the next item in a sequence, making them useful for language models in speech recognition, typing prediction, and other generative tasks. They can also serve as features for algorithms that classify documents into categories, for tasks such as spam filtering or sentiment analysis.

In Python, you can generate N-grams using the NLTK library or by defining a function that splits sentences into tokens and collects the N-grams.
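A minimal sketch of the hand-written approach, assuming simple whitespace tokenization and lowercasing (real text usually needs a proper tokenizer). The function name `ngrams` here is our own helper, not the NLTK function.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zip n staggered views of the list; zip stops at the shortest view,
    # so a sequence shorter than n produces no n-grams.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "The cow jumps over the moon".lower().split()
bigrams = ngrams(tokens, 2)
print([" ".join(gram) for gram in bigrams])
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
```

Changing the second argument to 3 gives the four trigrams of the same sentence.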

How are n-grams used in sentiment analysis?

N-grams are used in sentiment analysis to capture the local context of words and phrases, which is crucial for understanding the overall sentiment of a text. Unlike single-word (unigram) features, which ignore word order entirely, N-grams with N of 2 or more preserve the order of adjacent words, so negations and intensifiers stay attached to the words they modify.

In sentiment analysis, N-grams are used to identify and analyze the sentiment of phrases, not just individual words. For example, the bigram "not bad" has a positive sentiment, contrary to what might be inferred if the words "not" and "bad" were analyzed separately.

N-grams can be used as features in machine learning models for sentiment analysis. For instance, a bag-of-words representation can be extended to a bag of N-grams by counting how often each N-gram occurs in a document. This approach can improve the performance of sentiment analysis models: in one case study, a model trained on N-grams ranging from unigrams to 5-grams achieved the best performance.
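The counting step might look like the sketch below, which builds a frequency table of every unigram and bigram in one short document. The helper name `ngram_counts` and the example sentence are illustrative assumptions, and tokenization is a plain whitespace split.

```python
from collections import Counter


def ngram_counts(text, min_n=1, max_n=2):
    """Count every n-gram of length min_n..max_n in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_counts("the movie was not bad", 1, 2)
print(features["bad"])      # 1: on its own, this unigram looks negative
print(features["not bad"])  # 1: the bigram keeps the negation attached
print(len(features))        # 9: five unigrams plus four bigrams
```

A classifier that sees the "not bad" feature can learn a weight for the phrase as a whole rather than for "not" and "bad" in isolation.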

However, using N-grams can significantly enlarge the feature space: each additional N-gram order adds a new set of features, most of which occur only rarely. The resulting input is sparser, which can hamper model performance. The optimal range of N therefore varies with the task and dataset, and is usually determined through experimentation.
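To get a feel for this growth, the following sketch counts the distinct N-grams in a tiny made-up corpus as the maximum N increases. The corpus and the helper name `distinct_ngrams` are assumptions for illustration; on real data the growth is typically far steeper.

```python
def distinct_ngrams(sentences, n):
    """Return the set of distinct n-grams found across all sentences."""
    seen = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return seen


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cow jumps over the moon",
    "the cow eats the grass",
    "the moon is bright",
]

feature_count = 0
for n in range(1, 4):
    feature_count += len(distinct_ngrams(corpus, n))
    print(f"N-grams up to n={n}: {feature_count} features")
# N-grams up to n=1: 9 features
# N-grams up to n=2: 19 features
# N-grams up to n=3: 28 features
```

Even here, every trigram occurs exactly once, which hints at why higher-order features tend to be sparse.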

In Python, you can perform sentiment analysis using N-grams with libraries such as NLTK. For example, you can use the ngrams function from NLTK to generate N-grams, and then use a machine learning model, such as Naive Bayes, for sentiment classification.
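The sketch below shows the general idea using only the standard library rather than NLTK, so that it runs without extra dependencies. It is a small multinomial Naive Bayes classifier with add-one smoothing over unigram and bigram features. The class `NaiveBayesSentiment`, the helper `extract_ngrams`, and the six training sentences are all invented for this example, and the dataset is far too small for anything but a demonstration.

```python
import math
from collections import Counter, defaultdict


def extract_ngrams(text, max_n=2):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayesSentiment:
    """Multinomial Naive Bayes over n-gram counts, with add-one smoothing."""

    def __init__(self, max_n=2):
        self.max_n = max_n
        self.label_counts = Counter()               # documents per label
        self.feature_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocabulary = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for feature in extract_ngrams(text, self.max_n):
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        # N-grams never seen in training carry no evidence, so skip them.
        features = [f for f in extract_ngrams(text, self.max_n)
                    if f in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for f in features:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, for illustration only.
training_data = [
    ("the film was not bad", "positive"),
    ("a really good story", "positive"),
    ("not bad at all", "positive"),
    ("the film was not good", "negative"),
    ("not good at all", "negative"),
    ("a really bad story", "negative"),
]

clf = NaiveBayesSentiment(max_n=2)
clf.train(training_data)
print(clf.classify("the acting was not bad"))   # positive
print(clf.classify("the acting was not good"))  # negative
```

With NLTK installed, the feature extraction and classifier could be swapped for the library's own `ngrams` function and Naive Bayes implementation; the structure of the pipeline stays the same.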
