The apriori algorithm, an example of machine learning in python.

Machine learning isn’t a mythical beast that only appears every six years when you park a 2014 Toyota Prius next to a dumpster in Buford, Wyoming. It is just a form of statistics that has gathered a lot of hype: find patterns in massive data sets, then apply those patterns to something useful. There are a lot of different ways to find patterns, and they are not necessarily complicated. Today I’m going to walk through a great pattern-finding technique, the Apriori algorithm, in Python.

Apriori is an algorithm that looks for frequent itemsets. Itemsets are groups of things; they can be numbers, images, emojis, etc. The Apriori algorithm keeps only the itemsets whose support clears a minimum threshold, where support is the number of records containing the set divided by the total number of records. You can look for sets of two or more items, but you will probably want to stay around 3–5 for an effective calculation. Apriori is an effective tool because it discovers not just the frequency of an itemset but the itemset itself; you do not need to define the sets in advance. You simply provide a threshold for the support (i.e., frequency) and the number of items in the sets. Let’s take a look at an example.
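Before touching the fight data, here is a toy sketch of what “support” means. The transactions below are made up for illustration and are not from the article’s dataset:

```python
# Support of an itemset = (records containing every item in the set) / (total records).
transactions = [
    {"🔥", "🥊"},
    {"🔥", "🥊", "💰"},
    {"🥊"},
    {"🔥", "🥊"},
    {"💰"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    hits = sum(1 for t in transactions if itemset <= t)  # <= is the subset test for sets
    return hits / len(transactions)

print(support({"🔥", "🥊"}, transactions))  # appears in 3 of 5 records -> 0.6
```

Apriori’s trick is that a set can only be frequent if every subset of it is frequent, which lets it prune the search instead of checking every possible combination.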

This example comes from the FiveThirtyEight dataset of the Mayweather vs. McGregor fight (GitHub dataset). In that article the author uses the emojis tweeted during the fight to explain the feelings of the fans at specific timepoints. Let’s use this as the base for our analysis of the emojis as frequent itemsets.

The libraries we will be using for this example are:

  • re
  • sklearn
  • mlxtend
  • altair
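Assuming the usual aliases, the imports for everything below would look roughly like this (numpy and pandas are pulled in as well, since the helper functions lean on them):

```python
import re  # listed for completeness; the regex work below goes through pandas .str methods
import numpy as np
import pandas as pd
import altair as alt
from sklearn.preprocessing import MultiLabelBinarizer
from mlxtend.frequent_patterns import apriori
```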

Here I will set up a few functions we will need.

List Cleaning:

def listToString(s):  # clean up the lists that the text parsing creates
    str1 = " "
    return str1.join(s)

Next, a function that creates an emoji matrix with sklearn for our Apriori pass. It takes each emoji and pivots the data to produce one column per emoji, then records which emojis appear in each row. Using MultiLabelBinarizer() makes very short work of this.
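To see what MultiLabelBinarizer does on its own, here is a tiny standalone example (toy rows, not the fight data):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Each row is the list of emojis found in one tweet.
rows = [["🔥", "🥊"], ["🥊"], ["💰", "🔥"]]

mlb = MultiLabelBinarizer()
matrix = pd.DataFrame(mlb.fit_transform(rows), columns=mlb.classes_)
print(matrix)
# One column per emoji, a 1 where the tweet contained it and a 0 where it did not.
```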

def enter_the_matrix(df): #creating the 0/1 matrix for apriori
    emoji_set = set(df.emojis.unique())
    df['emojis'] = df.text.apply(lambda text: np.unique([ch for ch in text if ch in emoji_set]))  # 'ch' avoids shadowing the builtin chr
    mlb = MultiLabelBinarizer()
    emoji_matrix = pd.DataFrame(data=mlb.fit_transform(df.emojis), index=df.index, columns=mlb.classes_)

    return emoji_matrix

Now we use mlxtend to process our results. The apriori function in mlxtend takes this matrix and not only puts together the itemsets, it also calculates the support, so we can set thresholds to establish a set that is meaningful to us.

def emoji_frequent_itemsets(emoji_matrix, min_support=0.005, k=3): #setting up the apriori calculation
    emoji_matrix_itemset = apriori(emoji_matrix, min_support=min_support, use_colnames=True)
    out = emoji_matrix_itemset[emoji_matrix_itemset['itemsets'].apply(len) == k]
    return out
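The length filter at the end is the part that pins the results to exactly k items. On a hand-built frame shaped like mlxtend’s apriori output (a support column plus frozenset itemsets — the values here are invented for illustration), it behaves like this:

```python
import pandas as pd

# Hypothetical apriori-style output, built by hand for illustration.
result = pd.DataFrame({
    "support": [0.40, 0.25, 0.10],
    "itemsets": [frozenset({"🔥"}),
                 frozenset({"🔥", "🥊"}),
                 frozenset({"🔥", "🥊", "💰"})],
})

k = 2
pairs = result[result["itemsets"].apply(len) == k]
print(pairs)  # only the two-item {🔥, 🥊} row survives
```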

First things first, let’s set up a tweet data frame of the emojis and the “boxer” set used in the 538 article.

tweets = pd.read_csv('tweets.csv')
tweets['emojis'] = tweets['text'].str.findall(r'[^\w\s.,"@\'?/#!$%\^&\*;:{}=\-_`~()\U0001F1E6-\U0001F1FF]').str.len()
boxer_emojis = ['☘️','🇮🇪','🍀','💸','🤑','💰','💵','😴','😂','🤣','🥊','👊','👏','🇮🇪','💪','🔥','😭','💰']

The above code extracts the emojis from the unicode text. The second line uses a regular expression over unicode ranges to count the emojis per tweet, giving us a column to work from. The third line defines the subset we will strip out: the “boxer” emojis, meaning emojis related specifically to the fan base of each boxer.
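The negated character class is doing the heavy lifting: it matches anything that is not a word character, whitespace, common punctuation, or a regional-indicator (flag) code point, which in practice leaves the emojis. A quick check on made-up tweets, using the same pattern without the trailing .str.len() count so you can see the matches:

```python
import pandas as pd

s = pd.Series(["What a fight! 🔥🥊", "Money time 💰💸", "no emojis here."])
pattern = r'[^\w\s.,"@\'?/#!$%\^&\*;:{}=\-_`~()\U0001F1E6-\U0001F1FF]'
found = s.str.findall(pattern)
print(found.tolist())  # [['🔥', '🥊'], ['💰', '💸'], []]
```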

Now we will make some copies and split the full emoji set from the “boxer” emojis.

tweet_copy = tweets.copy()
tweet_copy2 = tweets.copy()

tweet_copy['emojis'] = tweets['text'].str.findall(r'[^\w\s.,"@\'?/#!$%\^&\*;:{}=\-_`~()\U0001F1E6-\U0001F1FF]')
tweet_copy['emojis']= tweet_copy['emojis'].apply(lambda x: listToString(x))

tweet_copy2['emojis'] = tweets['text'].str.findall(str(boxer_emojis))
tweet_copy2['emojis']= tweet_copy2['emojis'].apply(lambda x: listToString(x))

tweet_all = enter_the_matrix(tweet_copy)
boxer = enter_the_matrix(tweet_copy2)

From this we get two tidy dataframes: one with all the emojis pulled from the tweet data, and one with just the boxer set. The enter_the_matrix function pivots the data into one indicator column per emoji, which the Apriori algorithm uses to calculate support and build the frequent itemsets.

To finalize this we need to do a few more maintenance steps:

tweet_all.reset_index(inplace=True)
tweet_all.drop('datetime', axis=1, inplace=True)

boxer.reset_index(inplace=True)
boxer.drop('datetime', axis=1, inplace=True)

Finally, we adjust the support thresholds to “normalize” the two sets, since the boxer set has less data overall than the main set. The code below uses our emoji_frequent_itemsets function to actually do the Apriori calculation.

tweet_all_frequent_3itemsets = emoji_frequent_itemsets(tweet_all, min_support=0.0005, k=3) #supports need to be different for the item sets as the "boxer" emojis have less data overall
boxer_frequent_3itemsets = emoji_frequent_itemsets(boxer, min_support=0.00001, k=3)
boxer_frequent_3itemsets = boxer_frequent_3itemsets.loc[60:]  # keeping only the 3-itemset rows
tweet_all_frequent_2itemsets = emoji_frequent_itemsets(tweet_all, min_support=0.0025, k=2)
boxer_frequent_2itemsets = emoji_frequent_itemsets(boxer, min_support=0.0005, k=2)
boxer_frequent_2itemsets = boxer_frequent_2itemsets.loc[19:] # keeping only 2 itemsets

Let’s take a look at what we have. First the 3-itemsets: all emojis on the left, “boxer” on the right.

Now the 2-itemsets: all emojis on the left, “boxer” on the right.

Without much trouble we have created simple dataframes showing the most frequent 3- and 2-itemsets of emojis used during this fight. Taking this a step further and mapping the sets to emoji meanings, one could even gauge the emotion of the fans during the fight. But that will be for another article.

So what to do with this data? Let’s graph it. I won’t be going through the Altair graphing library here, so stay tuned for a YouTube video where I provide more in-depth information. I will give a code dump below. First, let’s see a pretty visualization.

I hope you enjoyed this simple example of a machine learning algorithm that you can employ. Here is the code dump for the above visualization.

tweet_all_frequent_3itemsets["itemsets"] = tweet_all_frequent_3itemsets["itemsets"].apply(lambda x: list(x)).astype("unicode")
boxer_frequent_3itemsets["itemsets"] = boxer_frequent_3itemsets["itemsets"].apply(lambda x: list(x)).astype("unicode")
tweet_all_frequent_2itemsets["itemsets"] = tweet_all_frequent_2itemsets["itemsets"].apply(lambda x: list(x)).astype("unicode")
boxer_frequent_2itemsets["itemsets"] = boxer_frequent_2itemsets["itemsets"].apply(lambda x: list(x)).astype("unicode")


chart_all_3 = alt.Chart(tweet_all_frequent_3itemsets).mark_bar(size=10, color='#195190FF').encode(
    x=alt.X('support:Q', title='Support: 3 Itemsets'),
    y=alt.Y('itemsets:N', sort=alt.EncodingSortField(
            field="support",  
            order="ascending"  
            )),

)

annotation_all_3 = alt.Chart(tweet_all_frequent_3itemsets).mark_text(
    align='left',
    baseline='middle',
    lineBreak='\n',
    fontSize = 14
).encode(
    x='support:Q',
    y=alt.Y('itemsets:N', axis=None, sort=alt.EncodingSortField(
            field="support",  
            order="ascending"  
            )),
    text = 'itemsets'
)

chart_box_3 = alt.Chart(boxer_frequent_3itemsets).mark_bar(size=10, color = '#A9A9A9').encode(
    x=alt.X('support:Q'),
    y=alt.Y('itemsets:N', sort=alt.EncodingSortField(
            field="support",  
            order="ascending"  
            )),
)

annotation_box_3 = alt.Chart(boxer_frequent_3itemsets).mark_text(
    align='left',
    baseline='middle',
    lineBreak='\n',
    fontSize = 14
).encode(
    x='support:Q',
    y=alt.Y('itemsets:N', axis=None, sort=alt.EncodingSortField(
            field="support",  
            order="ascending"  
            )),
    text = 'itemsets'
)

chart_all_2 = alt.Chart(tweet_all_frequent_2itemsets).mark_bar(size=10, color='#195190FF').encode(
    x=alt.X('support:Q'),
    y=alt.Y('itemsets:N', sort=alt.EncodingSortField(
            field="support",  
            order="ascending"  
            )),

)

annotation_all_2 = alt.Chart(tweet_all_frequent_2itemsets).mark_text(
    align='left',
    baseline='middle',
    lineBreak='\n',
    fontSize = 14
).encode(
    x='support:Q',
    y=alt.Y('itemsets:N', axis=None, sort=alt.EncodingSortField(
            field="support",  
            order="ascending"  
            )),
    text = 'itemsets'
)

chart_box_2 = alt.Chart(boxer_frequent_2itemsets).mark_bar(size=10, color = '#A9A9A9').encode(
    x=alt.X('support:Q', title = 'Support: 2 Itemsets'),
    y=alt.Y('itemsets:N', sort=alt.EncodingSortField(
            field="support",  
            order="ascending"  
            )),
)

annotation_box_2 = alt.Chart(boxer_frequent_2itemsets).mark_text(
    align='left',
    baseline='middle',
    lineBreak='\n',
    fontSize = 14
).encode(
    x='support:Q',
    y=alt.Y('itemsets:N', axis=None, sort=alt.EncodingSortField(
            field="support",  
            order="ascending"  
            )),
    text = 'itemsets'
)

Title = alt.Chart(
    {"values": [{"text": ['The most common tweet itemsets for all vs "boxer" emojis']}]}
).mark_text(size=24, color='black', lineBreak='\n', align='left', dx=-50, fontStyle='bold').encode(
    text="text:N"
)

subtitle = alt.Chart(
    {"values": [{"text": ['Legend: ']}]}
).mark_text(size=16, color='black', lineBreak='\n', dx=-50, align='left').encode(
    text="text:N"
)

subtitle2 = alt.Chart(
    {"values": [{"text": ['▉ Itemsets - All Emojis']}]}
).mark_text(size=16, color='#195190FF', lineBreak='\n', dx=-50, align='left').encode(
    text="text:N"
)

subtitle3 = alt.Chart(
    {"values": [{"text": ['▉ Itemsets - Boxer Emojis']}]}
).mark_text(size=16, color='#A9A9A9', lineBreak='\n', dx=-50, align='left').encode(
    text=alt.Text("text:N")
)

subtitle4 = alt.Chart(
    {"values": [{"text": ['Tweets were divided into "itemsets" (3-set and 2-set) to evaluate if certain combinations were more prevalent in the \n "Boxer" defined group vs all.']}]}
).mark_text(size=16, color='grey', lineBreak='\n', dx=-50, align='left').encode(
    text=alt.Text("text:N")
)


line = alt.Chart(
    {"values": [{"text": ['_____________________________________________________________________________________']}]}
).mark_text(size=16, color='black', fontStyle='bold', dx=-50, dy=-50, align='left').encode(
    text=alt.Text("text:N")
)

chart1 = (chart_all_3+annotation_all_3)+(chart_box_3+annotation_box_3)
chart2 = (chart_all_2+annotation_all_2)+(chart_box_2+annotation_box_2)
charts = (chart1|chart2)

alt.vconcat(Title, (subtitle|subtitle2|subtitle3),subtitle4,line,charts,background = '#F0F0F0'
           ).configure_axis(
    grid=False,
).configure_view(
    strokeWidth=0,strokeOpacity=0
)
