Using Social Relationships to Determine Spatial Arrangements
An investigation into placing individuals into groups based on their social interactions
To determine how people could be grouped based on social interactions, a set of emails exchanged between employees of an architectural firm was chosen. The data contains 3662 emails exchanged between 1653 email addresses over a period of time. Each email carries the following data: body text as an array of words and their counts, sender email, receiver emails, subject and CC. This data was converted into objects of a custom email class, along with additional aggregated information such as the number of words and the number of people who received the email. At this stage thorough anonymisation is possible: the body text can be reduced to keywords or removed entirely, and individuals can be converted into user-id values.
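The email class and the anonymisation step could look roughly like the sketch below. The class name, field names and the `anonymise` helper are illustrative assumptions, not the original implementation, and counting CC addresses as recipients is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Email:
    """One message; the field names are illustrative, not the original class."""
    sender: str
    recipients: list
    cc: list
    subject: str
    body_counts: Counter  # word -> number of occurrences in the body

    @property
    def word_count(self) -> int:
        # Total number of words in the body.
        return sum(self.body_counts.values())

    @property
    def recipient_count(self) -> int:
        # Unique people who received the email (To and CC combined, an assumption).
        return len(set(self.recipients) | set(self.cc))


def anonymise(emails):
    """Replace addresses with sequential user ids and drop subject and body."""
    ids = {}

    def uid(address):
        # Assign the next free id the first time an address is seen.
        return ids.setdefault(address, f"user-{len(ids)}")

    return [
        Email(uid(e.sender), [uid(r) for r in e.recipients],
              [uid(c) for c in e.cc], "", Counter())
        for e in emails
    ]
```

Here the body is removed entirely; reducing it to keywords instead would keep a filtered `body_counts` in place of the empty `Counter()`.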
Length of Emails
This corpus does not have enough information to understand in detail the 'nature' and 'use' of each email, so we set up a simple keyword filter to remove some of the noise in the emails.
The keywords were identified by first removing all junk words using the tokenisation module of NLTK (Natural Language Toolkit). Then a custom word list was created to remove common words specific to emails in general (e.g. http, mailto, subject) and to this particular office (e.g. names of individuals, addresses, salutations). In addition, all combinations of the employee names were removed as far as possible.
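A minimal sketch of such a keyword filter is shown below. To keep it self-contained it uses three tiny hard-coded word lists in place of the NLTK resources and the real office-specific list; the list contents and the name `keywords` are assumptions for illustration only.

```python
from collections import Counter

# Stand-in lists: a real run would use a full stop-word list and the actual
# office-specific words; these few entries are only for illustration.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "for", "on", "in"}
EMAIL_WORDS = {"http", "mailto", "subject", "regards", "hi", "dear"}
OFFICE_WORDS = {"alice", "bob"}  # hypothetical employee names


def keywords(body_counts):
    """Keep only alphabetic words that appear in none of the removal lists."""
    removed = STOP_WORDS | EMAIL_WORDS | OFFICE_WORDS
    kept = Counter()
    for word, n in body_counts.items():
        w = word.lower()  # merge case variants of the same word
        if w.isalpha() and w not in removed:
            kept[w] += n
    return kept
```

The keyword count of an email is then `sum(keywords(body).values())`, which can be compared with its total word count.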
Then we could visualise how the number of keywords relates to the total number of words in an email.
If we look at the distribution of emails by total word count versus keyword count, we can see that although the number of longer emails (more than 20 words) is significantly higher, the number of emails with a high keyword count is not nearly as large.
Recipients of Emails
We could also visualise the length of each email against how many people received it. A lengthy email reaching a lot of people is more likely to be an office-wide announcement, while a shorter email among a few people could be a project discussion.
We could explore this concept further by visualising the percentage of emails exchanged between individuals, within small groups (fewer than 5 people) and within larger groups (more than 5 people). Based on this data, we can easily see that the majority of communication happened between individuals and small groups, while less than 10% of the emails included more than 5 people. This suggests that email communication in this context could indicate a strong social (work-based) relationship.
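The banding by number of recipients can be sketched as follows. The text leaves an email with exactly five recipients unassigned, so this sketch assumes it counts as a small group; the function names and band labels are illustrative.

```python
from collections import Counter


def recipient_band(n_recipients):
    """Classify an email by how many people received it."""
    if n_recipients <= 1:
        return "individual"
    if n_recipients <= 5:  # assumption: exactly five still counts as small
        return "small group"
    return "large group"


def band_percentages(recipient_counts):
    """Percentage of emails in each band, given one recipient count per email."""
    bands = Counter(recipient_band(n) for n in recipient_counts)
    total = sum(bands.values())
    if total == 0:
        return {}
    return {band: 100 * count / total for band, count in bands.items()}
```

For example, `band_percentages([1, 1, 3, 8])` reports half of the emails as individual and a quarter each as small and large group.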
Distribution of People
If we do a similar aggregation on people based on number of emails sent and received, we may understand the role of each person within the organisation.
Generally, people within a group would send and receive a roughly equal number of emails due to the back-and-forth communication.
A higher number of sent and received emails is indicative of a key person who is a leader of a group, or perhaps involved in several projects.
Alternatively, a few people would send more emails than they receive, indicating a coordinator or PA kind of role, or receive more than they send, indicating someone who might not be much engaged in group communications.
By banding people based on the number of emails sent or received, we could identify the general trend within the organisation. In this case, for the period given, most people sent fewer than five emails, while the number of people who received up to five emails is more than four times as large. This indicates that people generally receive more emails than they send.
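The per-person aggregation and banding could be sketched like this. It assumes each email is reduced to a `(sender, recipients)` pair and that bands are five emails wide (1-5, 6-10, and so on); both are assumptions, as the exact band edges are not stated.

```python
from collections import Counter


def sent_received(emails):
    """Count emails sent and received per person.

    `emails` is an iterable of (sender, recipients) pairs.
    """
    sent, received = Counter(), Counter()
    for sender, recipients in emails:
        sent[sender] += 1
        for person in set(recipients):  # count each recipient once per email
            received[person] += 1
    return sent, received


def band(count, width=5):
    """Label the band that a count falls in, e.g. '1-5' or '6-10'."""
    if count == 0:
        return "0"
    low = (count - 1) // width * width + 1
    return f"{low}-{low + width - 1}"
```

Comparing `sent[p]` with `received[p]` for each person `p` separates the roughly balanced group members from those who mostly send or mostly receive.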
Grouping based on Emails
In order to group people into clusters, we need to break down all the emails into single threads between two people. Once that is done, we can aggregate the total communication between each pair to create a normalised score, where a higher score indicates a stronger social relationship.
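One way to compute such pairwise scores is sketched below. The normalisation is not specified in the text, so dividing by the count of the most active pair is an assumption, as are the `(sender, recipients)` input format and the function name.

```python
from collections import Counter


def pair_scores(emails):
    """Normalised communication score for every unordered pair of people."""
    totals = Counter()
    for sender, recipients in emails:
        for person in set(recipients):
            if person != sender:
                # One thread per sender-recipient pair, direction ignored.
                totals[frozenset((sender, person))] += 1
    if not totals:
        return {}
    strongest = max(totals.values())
    # Assumed normalisation: the most active pair gets a score of 1.0.
    return {pair: count / strongest for pair, count in totals.items()}
```

An email sent to three people therefore contributes one thread to each of the three sender-recipient pairs.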
With this data, it is possible to run a clustering algorithm to create groups of related people. This can then be visualised using a network graph as below. Each circle represents a person, coloured according to their group. Each line indicates a direct communication relationship between two people, while the length of the line indicates the strength of that relationship.
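The clustering algorithm used is not named, so the sketch below swaps in a deliberately simple stand-in: drop links whose score is below a threshold and treat each remaining connected component as a group. The threshold value of 0.2 and the function name are hypothetical; a community-detection method from a graph library would be a more realistic choice.

```python
def cluster(scores, threshold=0.2):
    """Group people linked by pair scores of at least `threshold`.

    `scores` maps frozenset({a, b}) to a normalised score.
    Returns a dict mapping each person to a group index.
    """
    neighbours = {}
    for pair, score in scores.items():
        a, b = tuple(pair)
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if score >= threshold:  # weak links do not join two people
            neighbours[a].add(b)
            neighbours[b].add(a)

    group_of = {}
    next_group = 0
    for start in sorted(neighbours):
        if start in group_of:
            continue
        # Depth-first walk over everyone reachable through strong links.
        stack = [start]
        while stack:
            person = stack.pop()
            if person in group_of:
                continue
            group_of[person] = next_group
            stack.extend(n for n in neighbours[person] if n not in group_of)
        next_group += 1
    return group_of
```

The resulting group index can be used directly as the colour key for the circles in a network graph.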
Once we have the network, we could identify the key people based on their 'connectivity degrees', or number of links to unique individuals. Here we have isolated people who are in contact with more than one person (have a degree of more than 1) and re-clustered them into new groups.
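The degree filter can be sketched as below, assuming the network is held as a list of unordered pairs (`frozenset` objects). The function names are illustrative, and the filter is applied in a single pass using the degrees of the original network, which is an assumption.

```python
from collections import defaultdict


def degrees(pairs):
    """Number of unique contacts per person, given unordered pairs."""
    contacts = defaultdict(set)
    for pair in pairs:
        a, b = tuple(pair)
        contacts[a].add(b)
        contacts[b].add(a)
    return {person: len(others) for person, others in contacts.items()}


def well_connected(pairs, min_degree=2):
    """Keep only links whose two members both have at least `min_degree` contacts."""
    degree = degrees(pairs)
    return [p for p in pairs if all(degree[x] >= min_degree for x in p)]
```

The surviving links can then be passed to the clustering step again to form the new groups.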
Since each email is directional (from a specific person to a specific person), the individual relationships could be better understood by visualising them as a matrix graph with people on both axes: the y-axis represents the senders and the x-axis the recipients. Below is a matrix graph for people before filtering (the colours follow the same groups). Each grid point represents an existing communication from the person on the y-axis to the person on the x-axis.
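The data behind such a matrix graph can be built as a plain directed count matrix. This is a sketch assuming a list of `(sender, recipients)` pairs and alphabetical ordering of people; the function name is illustrative.

```python
def communication_matrix(emails):
    """Return (people, matrix), where matrix[i][j] counts emails from
    people[i] (row, the sender) to people[j] (column, the recipient)."""
    people = sorted({s for s, _ in emails} | {r for _, rs in emails for r in rs})
    index = {person: i for i, person in enumerate(people)}
    matrix = [[0] * len(people) for _ in people]
    for sender, recipients in emails:
        for person in set(recipients):
            matrix[index[sender]][index[person]] += 1
    return people, matrix
```

A non-zero cell corresponds to a grid point in the graph; reordering `people` by group or by degree before building the matrix gives the sorted views.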
We could get a better view by plotting the filtered clusters as below. The coloured threads indicate communication within a group, while the grey ones show communication between people across groups.
If we sort the graph by groups, we can understand the relationships within and across groups more easily. Groups with a lot of internal communication tend to have a more symmetric shape, while groups where a few key people send out most of the emails tend to be more asymmetric.
Some groups tend to have high internal communication with minimal external communication, while others communicate well both internally and across several groups.
By sorting the graph by connectivity degree, we can identify the people who communicate with the most people, along with who those contacts are in terms of grouping and connectivity.
It is also possible to do a similar sorting within each group to analyse who communicates most within their groups and who communicates most across.
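The within-group versus across-group comparison can be sketched as a pair of counters per sender. It assumes `(sender, recipients)` pairs and a `group_of` mapping from person to group, such as the output of a clustering step; both names are illustrative.

```python
from collections import Counter


def within_and_across(emails, group_of):
    """Count, per sender, messages to their own group and to other groups."""
    within, across = Counter(), Counter()
    for sender, recipients in emails:
        for person in set(recipients):
            if group_of[sender] == group_of[person]:
                within[sender] += 1
            else:
                across[sender] += 1
    return within, across
```

Sorting the members of a group by `within` or by `across` then gives the two orderings described above.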
Visualising Key People based on Connectivity
Another way to identify key people is to plot the number of emails sent or received by a person against the number of people they are in touch with. Generally, the more people a person communicates with, the more emails they will send or receive.
But as can be seen below, some people communicate a lot with just a handful of people, while a few are in communication with a lot of people, but do not communicate as much.
If we plot a similar graph using the total number of emails (sent and received), the trend that more communication means more connectivity becomes stronger.
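The points for such a plot could be computed as in the sketch below, which pairs each person's total email count with their number of unique contacts. The input format of `(sender, recipients)` pairs and the function name are assumptions.

```python
from collections import Counter, defaultdict


def activity_and_contacts(emails):
    """Per person: (total emails sent and received, number of unique contacts)."""
    total = Counter()
    contacts = defaultdict(set)
    for sender, recipients in emails:
        others = set(recipients) - {sender}
        total[sender] += 1
        for person in others:
            total[person] += 1
            contacts[sender].add(person)
            contacts[person].add(sender)
    return {p: (total[p], len(contacts[p])) for p in total}
```

People with a high first value and a low second value are those who communicate a lot with just a handful of contacts; the reverse pattern marks those who reach many people with few emails.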