Text Similarity: Selecting similar articles from a dataset

Published by FirstAlign

One of the most basic Machine Learning tasks is to compare objects and decide whether they are the same or different. This can be done for pictures and for voice, but what about text?

In this article, we are going to discuss what ‘text similarity’ is and how it is used. We are also going to build a system that takes the title of an article and returns the 10 articles most similar to it.

Photo in part by Annie Spratt on Unsplash

Text Similarity 

Text similarity is a measure of how close two texts are in both syntactic and semantic terms. This is useful when we want to group, categorize, or label similar documents, and it is a common machine learning task.

To do this we employ the following practical process.

Text similarity flow chart

To check text similarity, we have to perform the following steps.

Step 1: Import Dataset

As with any machine learning task, the first thing we need is data.

In this case, we will be using a set of articles from medium.com, collected from this source. The dataset contains 336 articles, from which we will extract and keep two columns: one containing the title of the article, and one containing its complete text.

The code snippet below shows how to import the dataset and details a snapshot of the final information.
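
A minimal sketch of this step, assuming the dataset has been downloaded as a CSV file; the file name and column names ("title", "text") are assumptions, so adjust them to match the actual file:

```python
import pandas as pd

# Load the dataset and keep only the two columns we need:
# the article title and the full article text.
df = pd.read_csv("articles.csv")
df = df[["title", "text"]]

print(df.shape)   # expected: roughly (336, 2)
print(df.head())
```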

Step 2: Data Preprocessing

Data Preprocessing steps

Data preprocessing consists of three steps. Once we have the text of an article, we first need to clean it. To do this we apply a regular expression that removes numbers and special characters, leaving only a string of alphabetic characters.

We then need to remove words that occur frequently but carry little statistical importance, such as “a” and “an”. To do this we apply stop word removal.

Now that the text contains only alphabetic characters and no stop words, we need to remove any remaining ambiguity between word forms. We achieve this via stemming (see my previous article on this subject here). Each word is reduced to its root form, so that “changing” or “changed” is converted to “change”. This way, weight is given to “change” as a whole rather than to “changing” and “changed” separately.

Applying Preprocessing
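
A sketch of how the three preprocessing steps might look, assuming NLTK for stop words and the Porter stemmer (the original code may use a different library), and reusing the df DataFrame from the import step:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # 1. Regular expression: keep only letters and spaces.
    text = re.sub(r"[^a-zA-Z ]", " ", text.lower())
    # 2. Stop word removal, then 3. stemming of each remaining word.
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(words)

# The "text" and "clean_text" column names are assumptions carried over
# from the import step.
df["clean_text"] = df["text"].apply(preprocess)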

Step 3: Apply CountVectorizer

After the preprocessing step, the data is still in textual form. We can’t pass raw text to the algorithm, so we need to convert it into a matrix of features. To do this we use CountVectorizer from the sklearn package, which builds a matrix containing the frequency of each word in each article.

Now we have data that can be passed to the algorithm. We do this in the next step.

Applying CountVectorizer
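
A minimal sketch of the vectorization, applying sklearn’s CountVectorizer to the cleaned text column from the previous step:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build the term-document matrix: one row per article, one column per
# word, each cell holding that word's count in the article.
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(df["clean_text"])

print(count_matrix.shape)  # (number of articles, vocabulary size)
```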

Step 4: Apply Cosine Similarity

We measure the similarity between vectors using the cosine of the angle between them, which captures how close they are in multidimensional space. Mathematically, it is equal to the dot product of the two vectors divided by the product of their magnitudes.

The formula for cosine similarity
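
Written out, for two article count vectors A and B, the cosine similarity is:

```latex
\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
```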

We apply cosine similarity to the matrix of features produced by CountVectorizer. This gives us our similarity engine. Now it’s time to test it.
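
A sketch of this step using sklearn’s pairwise cosine_similarity on the count matrix built above:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between every pair of articles:
# similarity_matrix[i, j] is the similarity of article i to article j.
similarity_matrix = cosine_similarity(count_matrix)

print(similarity_matrix.shape)  # (336, 336)
```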

Step 5: Test the system

To test how our similarity engine performs, we pass it a title. The system then searches and returns the 10 articles most similar to it, as in the example below.
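
A sketch of how such a query might look, assuming the df DataFrame keeps its default integer index from read_csv; the helper name and the example title are placeholders, not taken from the original code:

```python
def most_similar(title, n=10):
    # Find the row index of the given title, then rank all articles by
    # their cosine similarity to it, skipping the article itself.
    idx = df.index[df["title"] == title][0]
    scores = sorted(enumerate(similarity_matrix[idx]),
                    key=lambda pair: pair[1], reverse=True)
    return [df["title"].iloc[i] for i, _ in scores[1 : n + 1]]

# "Some article title" is a placeholder; use a title from the dataset.
print(most_similar("Some article title"))
```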

Results

We can see from the test that the system performs quite well: the returned articles are strongly similar to the one given. This demonstrates that we can build an engine for a ‘text similarity’ exercise, and that cosine similarity delivers solid first-pass results.

The complete code is available on GitHub. I hope you enjoyed the article; stay tuned for the next one. Happy coding ❤

Know more about us:

You might also be interested in following FirstAlign for the latest updates: https://www.linkedin.com/company/firstalign
