Introduction
After my latest published article about satellite image analysis “Image Analysis and Mapping in Earth Engine Using NDVI“, now it is another article about image analysis again. Unlike the previous article, this article discusses general image analysis, not satellite image analysis. The goal of this discussion is to detect whether two products are the same or not. Each of the two products has image and text names. If the pair of products have similar or the same images or text names, that means that the two products are the same. The data comes from a competition held in Kaggle.
There are 4 basic packages used in this script: NumPy, pandas, matplotlib, and seaborn. There are also other specific packages. “Image” loads and shows image data. “imagehash” computes the similarity of two images. “fuzzywuzzy” detects the similarity of two texts. The package “metric” computes the accuracy score of the true label and predicted label.
# import packages import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from PIL import Image import imagehash from fuzzywuzzy import fuzz from sklearn.tree import DecisionTreeClassifier from sklearn import metrics
Image Similarity
The similarity of the two images is detected using the package “imagehash”. If two images are identical or almost identical, the imagehash difference will be 0. Two images are more similar if the imagehash difference is closer to 0.
Comparing the similarity of two images using imagehash consists of 5 steps. (1) The images are converted into greyscale. (2) The image sizes are reduced to be smaller, for example, into 8×8 pixels by default. (3) The average value of the 64 pixels is computed. (4)The 64 pixels are checked whether they are bigger than the average value. Now, each of the 64 pixels has a boolean value of true or false. (5) Imagehash difference is the number of different values between the two images. Please observe the below illustration.
Image_1 (average: 71.96875)
48 |
20 |
34 |
40 |
40 |
32 |
30 |
32 |
34 |
210 |
38 |
50 |
42 |
41 |
230 |
40 |
47 |
230 |
33 |
44 |
34 |
50 |
245 |
50 |
43 |
230 |
46 |
50 |
36 |
34 |
250 |
30 |
30 |
200 |
190 |
38 |
41 |
240 |
39 |
39 |
38 |
7 |
200 |
210 |
220 |
240 |
50 |
48 |
48 |
8 |
45 |
43 |
47 |
37 |
37 |
47 |
10 |
8 |
6 |
5 |
6 |
6 |
5 |
5 |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
TRUE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
TRUE |
TRUE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
Image_2 (average: 78.4375)
41 |
20 |
39 |
43 |
34 |
39 |
30 |
32 |
35 |
195 |
44 |
46 |
35 |
48 |
232 |
40 |
30 |
243 |
38 |
31 |
34 |
46 |
213 |
50 |
49 |
227 |
44 |
33 |
35 |
224 |
230 |
30 |
46 |
203 |
225 |
44 |
46 |
181 |
184 |
40 |
38 |
241 |
247 |
220 |
228 |
210 |
36 |
38 |
42 |
8 |
35 |
39 |
47 |
31 |
41 |
21 |
3 |
12 |
10 |
18 |
24 |
21 |
6 |
17 |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
TRUE |
TRUE |
FALSE |
FALSE |
TRUE |
TRUE |
FALSE |
FALSE |
TRUE |
TRUE |
FALSE |
FALSE |
TRUE |
TRUE |
TRUE |
TRUE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
The imagehash difference of the two images/matrices above is 3. It means that there are 3 pixels with different boolean values. The two images are relatively similar.
For more clarity, let’s examine imagehash applied to the following 3 pairs of images. The first pair consists of two same images and the imagehash difference is 0. The second pair compares two similar images. The second image (image_b) is actually an edited version of the first image (image_a). The imagehash difference is 6. The last pair shows the comparison of two totally different images. The imagehash difference is 30, which is the farthest from 0.
# First pair
hash1 = imagehash.average_hash(Image.open('D: /image_a.jpg'))
hash2 = imagehash.average_hash(Image.open('D:/ image_a.jpg'))
diff = hash1 - hash2
print(diff)
# 0
# Second pair hash1 = imagehash.average_hash(Image.open('D: /image_a.jpg')) hash2 = imagehash.average_hash(Image.open('D:/ image_b.jpg')) diff = hash1 - hash2 print(diff) # 6
# Third pair hash1 = imagehash.average_hash(Image.open('D: /image_a.jpg')) hash2 = imagehash.average_hash(Image.open('D:/ image_c.jpg')) diff = hash1 - hash2 print(diff) # 30
Here is how the average imagehash looks like
>imagehash.average_hash(Image.open('D:/image_a.jpg')) array([[ True, True, True, True, True, True, True, True], [ True, True, True, True, True, True, True, True], [ True, True, True, True, True, True, True, True], [False, True, False, False, False, False, False, False], [ True, True, False, False, False, False, False, False], [False, False, False, True, False, False, False, False], [False, False, False, True, False, False, False, False], [False, False, False, False, False, False, False, False]])
>imagehash.average_hash(Image.open('D:/image_b.jpg')) array([[ True, True, True, True, True, True, True, True], [ True, True, True, True, True, True, True, True], [False, True, True, True, True, False, False, False], [ True, True, True, False, False, False, False, False], [ True, True, False, False, False, False, False, False], [False, False, False, True, False, False, False, False], [False, False, False, True, False, False, False, False], [False, False, False, False, False, False, False, False]])
>imagehash.average_hash(Image.open('D:/image_c.png')) array([[False, False, False, False, False, False, False, False], [ True, True, True, True, True, True, True, True], [ True, True, True, True, True, True, True, True], [ True, True, True, True, True, True, True, True], [ True, True, True, True, True, True, True, True], [ True, True, True, True, True, True, True, True], [False, False, False, False, True, False, False, False], [False, False, False, False, False, False, False, False]])
Text Similarity
Text similarity can be assessed using Natural Language Processing (NLP). There are 4 ways to compare the similarity of a pair of texts provided by “fuzzywuzzy” package. The function of this package returns an integer value from 0 to 100. The higher value means the higher similarity.
1. fuzz.ratio – is the most simple comparison of the texts. The fuzz.ratio value of “blue shirt” and “blue shirt.” is 95. It means that the two texts are similar or almost the same, but the dot makes them a bit different
The measurement is based on the Levenshtein distance (named after Vladimir Levenshtein). Levenshtein distance measures how similar two texts are. It measures the number of minimum edits, such as inserting, deleting, or substituting, a text into another text. The text “Blue shirt” requires only 1 editing away to be “blue shirt.”. It only needs a single dot to be the same. Hence, the Levenshtein distance is “1”. The fuzz.ratio is calculated with this equation (len(a) + len(b) – lev)/( (len(a) + len(b), where len(a) and len(b) are the lengths of the first and second text, and lev is the Levenshtein distance. The ratio is (10 + 11 – 1)/(10 + 11) = 0.95 or 95%.
2. fuzz.partial_ratio – can detect if a text is a part of another text. But, it cannot detect if the text is in a different order. The example below shows that “blue shirt” is a part of “clean blue shirt” so that the fuzz.partial_ratio is 100. The fuzz.ratio returns the value 74 because it only detects that there is much difference between the two texts.
print(fuzz.ratio('blue shirt','clean blue shirt.')) #74 print(fuzz.partial_ratio('blue shirt','clean blue shirt.')) #100
3. Token_Sort_Ratio – can detect if a text is a part of another text although they are in a different order. Fuzz.token_sort_ratio returns 100 for the text “clean hat and blue shirt” and “blue shirt and clean hat” because they actually mean the same thing, but are in reverse order.
print(fuzz.ratio('clean hat and blue shirt','blue shirt and clean hat')) #42 print(fuzz.partial_ratio('clean hat and blue shirt','blue shirt and clean hat')) #42 print(fuzz.token_sort_ratio('clean hat and blue shirt','blue shirt and clean hat')) #100
4. Token_Set_Ratio – can detect the text-similarity accounting for the partial text, text order, and different text lengths. It can detect that the text “clean hat” and “blue shirt” is part of the text “People want to wear a blue shirt and clean hat” in a different order. In this study, we only use “Token_Set_Ratio” as it is the most suitable.
print(fuzz.ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat')) #53 print(fuzz.partial_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat')) #62 print(fuzz.token_sort_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat')) #71 print(fuzz.token_set_ratio('clean hat and blue shirt','People want to wear blue shirt and clean hat')) #100
The following cell will load the training dataset and add features of hash as well as token set ratio.
# load training set trainingSet = pd.read_csv('D:/new_training_set.csv', index_col=0).reset_index() # Compute imagehash difference hashDiff = [] for i in trainingSet.index: hash1 = imagehash.average_hash(Image.open(path_img + trainingSet.iloc[i,2])) hash2 = imagehash.average_hash(Image.open(path_img + trainingSet.iloc[i,4])) diff = hash1 - hash2 hashDiff.append(diff) trainingSet = trainingSet.iloc[:-1,:] trainingSet['hash'] = hashDiff # Compute token_set_ratio Token_tes = [] for i in trainingSet.index: TokenSet = fuzz.token_set_ratio(trainingSet.iloc[i,1], trainingSet.iloc[i,3]) TokenSet = (i, TokenSet) Token_tes.append(TokenSet) dfToken = pd.DataFrame(Token_tes) trainingSet['Token'] = dfToken
Below is the illustration of the training dataset. It is actually not the original dataset because the original dataset is not in the English language. I create another data in English for understanding. Each row has two products. The columns “text_1” and “image_1” belong to the first product. The columns “text_2” and “image_2” belong to the second product. “Label” defines whether the pairing products are the same (1) or not (0). Notice that there are other two columns: “hash” and “tokenSet”. These two columns are generated, not from the original dataset, but from the above code.
index | text_1 | image_1 | text_2 | image_2 | Label | hash | tokenSet |
0 | Blue shirt | Gdsfdfs.jpg | Blue shirt. | Safsfs.jpg | 1 | 6 | 100 |
1 | Clean hat | Fsdfsa.jpg | Clean trousers | Yjdgfbs.jpg | 0 | 25 | 71 |
2 | mouse | Dfsdfasd.jpg | mouse | Fgasfdg.jpg | 0 | 30 | 100 |
. . . | . . . | . . . | . . . | . . . | . . . | . . . | . . . |
Applying Machine Learning
Now, we know that lower Imagehash difference and higher Token_Set_Ratio indicates that a pair of products are more likely to be the same. The lowest value of imagehash is 0 and the highest value of Token_Set_Ratio is 100. But, the question is how much the thresholds are. To set the thresholds, we can use the Decision Tree Classifier.
A Machine Learning of Decision Tree model is created using the training dataset. The Machine Learning algorithm will find the pattern of imagehash difference and the token set ratio of identical and different products. The Decision Tree is visualized for the cover image of this article. The code below builds a Decision Tree model with Python. (But, the visualization for the cover image is the Decision Tree generated using R because, in my opinion, R visualizes Decision Tree more nicely). Then, it will predict the training dataset again. Finally, we can get the accuracy.
# Create decision tree classifier: hash and token set Dtc = DecisionTreeClassifier(max_depth=4) Dtc = Dtc.fit(trainingSet.loc[:,['hash', 'tokenSet']], trainingSet.loc[:,'Label']) Prediction2 = Dtc.predict(trainingSet.loc[:,['hash', 'tokenSet']]) metrics.accuracy_score(trainingSet.loc[:,'Label'], Prediction2)
The Decision Tree is used to predict the classification of the training dataset again. The accuracy is 0.728. In other words, 72.8% of the training dataset is predicted correctly.
From the Decision Tree, we can extract the information that if the Imagehash difference is smaller than 12, the pair of products are categorized to be identical. If the Imagehash difference is bigger than or equal to 12, we need to check the Token_Set_Ratio value. The Token_Set_Ratio lower than 97 confirms that the pair of products are different. If else, check whether the Imagehash difference value again. If the imagehash difference is bigger than or equal to 22, then the products are identical. Otherwise, the products are different.
Apply to test dataset
Now, we will load the test dataset, generate the Imagehash difference and Token_Set_Ratio, and finally predict whether each product pair matches.
# path to image path_img = 'D:/test_img/' # load test set test = pd.read_csv('D:/new_test_set.csv', index_col=0).reset_index() # hashDiff list hashDiff = [] # Compute image difference for i in test.index[:100]: hash1 = imagehash.average_hash(Image.open(path_img + test.iloc[i,2])) hash2 = imagehash.average_hash(Image.open(path_img + test.iloc[i,4])) diff = hash1 - hash2 hashDiff.append(diff) test['hash'] = hashDiff # Token_set list Token_set = [] # Compute text difference using token set for i in test.index: TokenSet = fuzz.token_set_ratio(test.iloc[i,1], test.iloc[i,3]) Token_set.append(TokenSet) test['token'] = Token_set
After computing the Imagehash difference and Token_Set_ratio, the next thing to do is to apply the Decision Tree for the product match detection.
# Detecting product match test['labelPredict'] = np.where(test['hash']<12, 1, np.where(test['token']<97, 0, np.where(test['hash']>=22, 0, 1))) # or test['labelPredict'] = Dtc.predict(test[['hash','token']])
index | text_1 | image_1 | text_2 | image_2 | hash | tokenSet | labelPredict |
0 | pen | Fdfgsdfhg.jpg | ballpoint | Adxsea.jpg | 8 | 33 | 1 |
1 | harddisk | Sgytueyuyt.jpg | a nice Harddisk | Erewbva.jpg | 20 | 100 | 1 |
2 | eraser | Sadssadad.jpg | stationary | Safdfgs.jpg | 25 | 25 | 0 |
. . . | . . . | . . . | . . . | . . . | . . . | . . . | . . . |
The above table is the illustration of the final result. The focus of this article is to demonstrate how to predict whether two images and two texts are similar or the same. You may find out that the Machine Learning model used is quite simple and there is no hyperparameter-tuning or training and test data splitting. Applying other Machine Learning, such as tree-based ensemble methods, can increase the accuracy. But it is not our discussion focus here. If you are interested to learn other tree-based Machine Learning more accurate than Decision Tree, please find an article here.
About Author
Connect with me here https://www.linkedin.com/in/rendy-kurnia/
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.