The internet grows every day and has become our biggest data source for solving problems and improving quality of life. However, data collected from the internet is not always structured and organized well enough to be used readily. As a result, having to classify a collection of text data scraped from the internet that is so messy and misleading that it is unsuitable for unsupervised classification is a recurring problem. Here we present a heavily NLP-based, only lightly deep-learning way of solving it.
Authors: Claudio Palmieri and Dr. Banu Turkmen
The problem mentioned above arose in the Omdena project Preventing Gender-Based Violence Through an AI-Driven Support & Reporting Tool, which required finding an AI solution to help combat Gender-Based Violence (GBV) in Nigeria. We did not have a ready-made dataset for our AI solution, so we focused on the problem itself, reviewed many studies of GBV in Nigeria, and drew insights from previous research. Gender-Based Violence, already a very complex problem in itself, appears even more complex in Nigeria. The phenomenon seems widespread and is exacerbated by the insurgency in the North East and, more generally, by COVID-19. One element that emerges from the studies as well as from the data collected on the internet is the reluctance of women to tell their stories of violence openly (a culture of silence), which adds to the complexity of the Nigerian problem. More insights from previous studies and research can be found in the next section.
Overview of Gender-Based Violence research in Nigeria
The State of the World’s Girls Report offers a good analysis of the global picture of online abuse targeting women, surveying 14,000 girls and young women across 22 countries, including Nigeria. Some key findings from this research are:
- More than half of girls surveyed, from around the world, have been harassed and abused online.
- One in four girls abused online feels physically unsafe as a result.
- Online abuse is silencing girls’ voices.
The biggest gender gaps in internet access are found in India, Benin, Guinea, Ghana, and Nigeria (ranked 122nd, last in the world). According to the Report on the Rapid Assessment Survey on GBV Experiences of Women with Disabilities (WWD) in Nigeria, some key findings are:
- Although WWDs appear to possess surface awareness of GBV, they lack an in-depth understanding of what constitutes GBV, as well as the capacity to advocate for their inclusion in GBV intervention programs.
- WWDs lacked easy access to GBV capacity-building and advocacy programs, thereby weakening their voice and reducing their knowledge and capacity to engage relevant stakeholders.
- A review of major GBV and disability rights legal frameworks in Nigeria indicates that the issues and needs of WWDs are not adequately mainstreamed.
The MEAC Findings Report on Gender Norms and Sexism in and Around Maiduguri is based on data collected from December 2020 to January 2021 as part of a phone survey of a randomized sample of 3,117 community members from the region. This data was gathered to help understand the context in which girls and women are recruited into armed groups, particularly the gender norms and gender expectations in the region.
Girls and women in Nigeria face persistent inequalities when it comes to access to education, political representation, health, and labour markets. In 2020, Nigeria was ranked 161 out of 189 countries on the UNDP Gender Inequality Index (GII), which measures differences in three aspects of human development: reproductive health, empowerment, and economic status. These inequalities impact the opportunities available to women and girls, but also likely contribute to violence against them. Here are some news articles selected to provide more insights on GBV in Nigeria:
- Lagos State saw a nearly 40% increase in rape and domestic and sexual violence in 2020, official data showed.
- Some Nigerian women are now acting to address the problem of sexual violence, saying that cases have ended in few prosecutions, widespread stigmatization and a tendency to blame victims.
- Activists have launched centers to support women, an app to report attacks, and a push to protect girl victims from being traumatized again in the legal system.
- Polling group NOIPolls found that 47% of Nigerians blamed rape on indecent dressing, and fewer than half thought offenders should be punished.
- In Lagos, senior lawyer Boma Alabi is rallying others in her profession to protect under-aged victims when cases go to trial.
- In northern Kano state, tech entrepreneur Sa’adat Aliyu in August launched an app, Helpio, for women to report assaults.
The IGC (International Growth Centre) has a good analysis of the "shadow pandemic" that emphasizes the increase in GBV in Nigeria. The pandemic diverted priorities and resources, and Federal Government lockdowns produced a surge in reports of GBV. As Figures 1 and 2 below show, lockdowns can increase domestic violence. Some key findings are highlighted below:
- Compromised support services and access to justice
Strict movement restrictions have meant that survivors are unable to access centers and shelters. Many court proceedings have been postponed, which limits the system’s ability to issue protection and restraining orders that would otherwise have an immediate impact on protecting victims.
- Destroyed livelihoods could push women into transactional sex
Nigerian women are particularly vulnerable in the COVID-19 pandemic, as over 80% of women in the labor force are employed in the informal sector with little or no social protection and safety nets.
- School closures increase the risk of child marriage
Early marriage is already widespread in Nigeria, with 44% of girls married before the age of 18. The United Nations Population Fund predicts that, due to the disruption caused by the COVID-19 pandemic, an additional 13 million child marriages will take place globally in the next 10 years that would otherwise have been prevented. With the third-highest absolute number of child brides in the world, Nigeria is at risk of bearing many of these additional child marriages. UN Women has produced a good brief on GBV during COVID-19: as the world battles the pandemic, emerging evidence indicates a sharp rise in GBV, especially violence against women and girls. Some key findings are:
- 30 percent of girls and women aged between 15 and 49 reported having experienced sexual abuse
- 43 percent of girls married before the age of 18, while 20 percent of women aged 15 to 49 have undergone female genital mutilation (FGM).
- Once girls in Nigeria are married, only 1.2 percent of those aged 15 to 19 have their contraception needs met, leading to high levels of early and teenage pregnancy
- While women and girls are disproportionately affected by GBV, sexual violence against men and boys also occurs, particularly in conflict-affected contexts. Service providers in the North East, for instance, have observed incidents of sexual violence towards men and boys. However, male survivors are less likely than women to report an incident of sexual violence.
- Initial data shows a general increase in GBV across all six geopolitical zones, and service providers have reported sharp increases in cases of intimate partner violence and domestic violence. Data on reported incidents of GBV in Nigeria, based on preliminary information from 24 states, shows that the total number of GBV incidents reported in March was 346, while in the first part of April incident reports spiked to 794, a 56 percent increase in just two weeks of lockdown. Some of these incidents of violence have tragically resulted in the death of victims, the rape of children, including incestual rape, and tenant-landlord assault.
- More than 90 percent of Nigerian women in the labour force work in the informal sector, many of whom have seen their wages evaporate overnight amid lockdowns.
- Of the over 90 million Nigerians estimated to be living in extreme poverty, fewer than 12 percent were registered in the National Social Register of Poor and Vulnerable Households (as of 31 March 2020).
- The COVID-19 pandemic has significant implications for the provision of critical sexual and reproductive health information and services. Indeed, some 47 million women in 114 low- and middle-income countries are projected to be unable to use modern contraceptives if the average lock down or COVID-19-related disruption continues for six months. It is expected that the major disruptions to services would lead to an additional 7 million unintended pregnancies, including unintended pregnancies resulting from rape and unprotected sex
- In Nigeria, nationwide school closures have affected 18,549,010 women and girls across primary, secondary, and tertiary education, many of whom find themselves in low-resource contexts and with additional care duties compared to their male counterparts, making it more difficult to maintain learning.
- The World Bank indicates that based on costing done for various countries globally, GBV costs economies an estimated 1.2 to 3.7 percent of GDP and provides indicative economic costs in Nigeria.
As these selected studies show, GBV after COVID-19 is an even more serious issue in Nigeria, and there are fundamental roadblocks in the country to fighting it. So we needed to use whatever we could collect from social networks as data to identify the trends and types of GBV in the region, which will help combat the issue.
Despite the difficulties, the data collected from the main social networks amounts to approximately 421,182 records, but it is inevitably messy and misleading.
Given that the subject of the data search was gender-based violence against women in Nigeria, a quick reading of one of the datasets immediately raised questions about whether the violence referred to in the records was actually gender-based: attributable to men, directed against women, or gendered at all.
Here are some examples of the ambiguity of records collected from LinkedIn (the lowercase is due to pre-processing): “…, sexual assault and harassment. those stories have meaning. those voices deserve recognition and action. these businesses, this industry needs to be held…”; “1 in 3 women and 1 in 4 men have experienced physical violence by an intimate partner”; “i come from a family with a history of multi-generational involvement with imprisonment. when i was 16, i was placed in youth detention for failing to comply with the foster care system. my crime: running away from a home where i was being sexually abused”. The solution to this problem began with understanding what the datasets contained.
One typical way to understand the content of a dataset is to use topic modeling to extract keywords. However, this approach was not suitable for the type of datasets available, which required a more human and informative classification methodology. So we opted for a less automatic approach.
We started by looking more carefully at the LinkedIn records and built a simple list of selected keywords, then we used the words in the list to check whether they also appeared in all the other datasets. Since that allowed us to roughly understand the content of the datasets, we replaced the keyword list with a set of five actual dictionaries.
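A minimal sketch of this first keyword check might look like the following. The keyword list, the sample records, and the `count_matches` helper are all illustrative, not the project's actual list:

```python
import re

# Illustrative hand-built keyword list; the real list was built by
# reading the LinkedIn records.
keywords = ['abuse', 'assault', 'harassment', 'violence']
pattern = re.compile(r'\b(' + '|'.join(keywords) + r')\b', re.IGNORECASE)

def count_matches(records):
    """Return how many records mention at least one keyword."""
    return sum(1 for text in records if pattern.search(str(text)))

sample = [
    "those stories of harassment have meaning",
    "a post about a tech conference",
    "1 in 3 women have experienced physical violence",
]
print(count_matches(sample))  # 2 of the 3 sample records match
```

Running the same check over every dataset gives a rough, cheap signal of how much GBV-related content each one contains.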
The dictionaries were created according to the following procedure:
- reading the records
- extracting some significant keywords (e.g. ‘sex‘)
- creating code that searches for the keyword in all the datasets to capture any recurring expressions and other keywords:
```python
import re

# Search all records for the keyword ('sex' here) together with its
# surrounding words, to surface recurring expressions.
ptnr = r"(([^\s]+) sex ([^\s]+) ([^\s]+) ([^\s]+) ([^\s]+))"
for index, row in dt.iterrows():
    a = re.findall(ptnr, str(row['body']))
    if len(a) != 0:
        print((a, row['body']))
        print(index)
```
- creating the regular expression (regex) to capture that set of strings:
'(invi(te|tes|ted|ting|tation|tations) (her for|me for|for|to|me|to have|for having) (sex|sexual|sexually))'
- associating the regular expression to a single tag based on the recurring pattern:
'(invi(te|tes|ted|ting|tation|tations) (her for|me for|for|to|me|to have|for having) (sex|sexual|sexually))':["sexual invite"]
- and finally grouping the tags in a general category
'(invi(te|tes|ted|ting|tation|tations) (her for|me for|for|to|me|to have|for having) (sex|sexual|sexually))':["sexual invite - harassments"]
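To illustrate how such a regex-to-tag dictionary could be applied to a record, here is a small sketch; the `tag_record` helper and the sample text are our own illustrative assumptions, while the dictionary entry follows the structure shown above:

```python
import re

# Illustrative regex-to-tag dictionary, following the structure above.
regex_tags = {
    r'(invi(te|tes|ted|ting|tation|tations) (her for|me for|for|to|me|to have|for having) (sex|sexual|sexually))':
        ["sexual invite - harassments"],
}

def tag_record(text, dictionary):
    """Collect every tag whose regex matches the record."""
    tags = []
    for pattern, tag_list in dictionary.items():
        if re.search(pattern, text):
            tags.extend(tag_list)
    return tags

print(tag_record("he invited me for sex after the meeting", regex_tags))
# → ['sexual invite - harassments']
```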
The help of a domain expert and the relatively small number of keywords made this procedure feasible. Each dictionary had a specific task. An initial dictionary (fundamental words) was developed to identify forms of violence against women, so it contained words like clitorectomy, genital mutilation, forced marriage, and so on.
A second dictionary was constructed to identify forms of violence regardless of whether they were directed at a woman; it contained words like exploitative relationships, toxic marriage, beating, and so on.
A third one contained words like woman, husband, girl, female, spouse, etc., with the purpose of determining whether the form of violence identified with the second dictionary could be traced, or at least presumed, to be violence against a woman.
Finally, the fourth one was used to get an overall view of what a text is about without having to read the entire record. It is made up of a set of keywords used with the Counter function to also get the frequency of the words in the record (examples of regexes used in the dictionary: ‘^abus.*?$’, ‘^femin.*?$’, ‘^violen.*?$’).
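A small sketch of how the fourth dictionary could be combined with `Counter`; the `word_frequencies` helper and the sample record are illustrative, while the anchored regexes match those shown above:

```python
import re
from collections import Counter

# Anchored regexes, as in the fourth dictionary's examples.
word_list = ['^abus.*?$', '^femin.*?$', '^violen.*?$']

def word_frequencies(text, patterns):
    """Count the tokens in the record that match any pattern."""
    tokens = text.lower().split()
    matched = [t for t in tokens for p in patterns if re.match(p, t)]
    return Counter(matched)

print(word_frequencies("abusive language and violence violence everywhere", word_list))
# → Counter({'violence': 2, 'abusive': 1})
```

The resulting frequencies give the operator a quick summary of a record's content without reading it in full.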
```python
import ast

# Open a dictionary stored as a Python literal in a text file
def openDictionary(file):
    f = open(file, 'r')
    text = f.read()
    dictio = ast.literal_eval(text)
    f.close()
    return dictio

# Vocabulary of terms that identify violence against women
fund_words = openDictionary('./dictionaries/fundamentalWords.txt')
# Identify violence against women in combination with female terms,
# e.g. abusive marriage with sex, sexual, she, girl, girls, violence
imp_words = openDictionary('./dictionaries/importantWords.txt')
# Female terms
female = openDictionary('./dictionaries/female.txt')
# Overall view of the content of the record
word_list = openDictionary('./dictionaries/wordList.txt')
```
Violence against women
At this point, the question of when a record can be considered to contain forms of violence against women became addressable. According to their tasks, the dictionaries were used to tag the records:
- If the record contained tags from the first dictionary, it was considered definitely about violence against women (A).
- If it contained no tags from the first dictionary but did contain tags from the second and third, it was considered almost certainly about violence against women (B).
- If it contained no tags from the first and third dictionaries but only from the second, it was considered perhaps about violence against women (C).
- If it contained no tags from the first, second, and third dictionaries, but only from the fourth, it was considered as not appearing to be about violence against women (D).
- In all other cases, the record was not considered violence against women (E).
```python
# INOUT: apply the A-E rules to each record
for index, row in df1.iterrows():
    # A (definitely about violence against women)
    if len(row['tags_a']) != 0:
        df1.at[index, 'inout'] = 'A = definitely about violence against women'
    # B (almost certainly about violence against women)
    elif len(row['tags_a']) == 0 and len(row['tags_b']) != 0 and len(row['tags_f']) != 0:
        df1.at[index, 'inout'] = 'B = almost certainly about violence against women'
    # C (perhaps about violence against women)
    elif len(row['tags_a']) == 0 and len(row['tags_b']) != 0 and len(row['tags_f']) == 0 and len(row['tags_c']) != 0:
        df1.at[index, 'inout'] = 'C = perhaps about violence against women'
    # D (it does not seem violence against women)
    elif len(row['tags_a']) == 0 and len(row['tags_b']) == 0 and len(row['tags_f']) != 0 and len(row['tags_c']) != 0:
        df1.at[index, 'inout'] = 'D = it does not seem violence against women'
    # E (should not be violence against women)
    else:
        df1.at[index, 'inout'] = 'E = it is not violence against women'
```
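The same decision cascade can also be expressed as a small pure function over the four tag lists, which makes the rules easy to unit-test in isolation; the `classify` helper and sample tags below are our own illustrative sketch:

```python
def classify(tags_a, tags_b, tags_f, tags_c):
    """Apply the A-E decision cascade to one record's tag lists."""
    if tags_a:
        return 'A'  # definitely about violence against women
    if tags_b and tags_f:
        return 'B'  # almost certainly
    if tags_b and not tags_f and tags_c:
        return 'C'  # perhaps
    if not tags_b and tags_f and tags_c:
        return 'D'  # does not seem to be
    return 'E'      # not violence against women

print(classify(['genital mutilation'], [], [], []))  # A
print(classify([], ['beating'], ['wife'], []))       # B
print(classify([], [], [], ['violence']))            # E
```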
Distinctions among datasets
The procedure above was used to distinguish individual records within datasets, but the operator still needed help disentangling the types of datasets. For this reason, we introduced a distinction between homogeneous and inhomogeneous datasets along with a system to prioritize datasets. Datasets that report only stories of gender-based violence against women for certain are labeled homogeneous; all others are classified as inhomogeneous. Each dataset was also assigned a number from one to three, based on the priority scale below:
- help center data set : 1;
- social media like Facebook, LinkedIn, Twitter: 2;
- any other source, like YouTube and Reddit: 3.
Because they collect only stories of violence against women, help centers were of utmost importance for our purposes. Social networks, such as Facebook, contained stories that were interesting but not easily and safely identifiable as violence against women. For this reason, they came right after the help center dataset in terms of importance. Finally, the least relevant datasets were grouped together.
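A sketch of how this priority could be assigned to each record's source; the source names, the mapping, and the `source_priority` helper are illustrative assumptions:

```python
# Map each data source to its priority class; unknown sources fall
# into the lowest class. Source names are illustrative.
PRIORITY = {
    'help_center': 1,
    'facebook': 2, 'linkedin': 2, 'twitter': 2,
    'youtube': 3, 'reddit': 3,
}

def source_priority(source):
    """Return the 1-3 priority for a source, defaulting to 3."""
    return PRIORITY.get(source.lower(), 3)

print(source_priority('help_center'))  # 1
print(source_priority('LinkedIn'))     # 2
```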
Pseudonymization and human control
An ethical way to build an AI application or model requires at least considering privacy aspects and should always allow human oversight, especially if the data is particularly sensitive, as in this case, and there is a risk of biased results. According to the Wikipedia definition, “[p]seudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. A single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while remaining suitable for data analysis and data processing”. In this regard, pseudonymization was implemented by simply assigning a unique id to each record before detaching it from the original dataset. In this way only authorized operators have access to personal data, while all others are free to use the records without reference to personal data. Everything involves a simple line of code:
```python
# assign a unique id column
df.insert(0, 'unique_id_d', range(0, 0 + len(df)))
```
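One common way to complete this scheme, sketched here under our own assumptions (the `author` column and the split into `mapping` and `public` frames are illustrative), is to keep the id-to-personal-data mapping in a separate table that only authorized operators can access:

```python
import pandas as pd

# Illustrative records; 'author' stands in for a personal-data field.
df = pd.DataFrame({'author': ['user_a', 'user_b'],
                   'body': ['record one', 'record two']})

# Assign a unique id, as in the line above.
df.insert(0, 'unique_id_d', range(0, 0 + len(df)))

# Split: the id-to-author mapping stays with authorized operators,
# while everyone else works on the de-identified records.
mapping = df[['unique_id_d', 'author']]
public = df.drop(columns=['author'])

print(public.columns.tolist())  # ['unique_id_d', 'body']
```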
The system that establishes when a record can be considered to contain forms of violence against women was meant to orient the operator, but it was far from perfect. For this reason, we gave the operator the ability to change the classification of a record. The following is the solution implemented for human control.
```python
# Human control: show each record and let the operator change its
# classification, or type 'ext' to exit.
index_inp = []
for index, row in df.iterrows():
    print()
    print('-----------------------------------------------')
    print('^^^ ', index, ' ^^^')
    print('source_priority: ', row['source_priority'])
    print('body: ', row['body'])
    print('tags: ', row['joined tags'])
    print('inout: ', row['inout'])
    print()
    inout_def = ['A = definitely about violence against women',
                 'B = almost certainly about violence against women',
                 'C = perhaps about violence against women',
                 'D = it does not seem violence against women',
                 'E = it is not violence against women']
    print('inout definition')
    print(inout_def)
    inp = str(input("Change inout or 'ext' for exit "))
    if inp == 'ext':
        print('exit')
        break
    elif inp in inout_def:
        index_inp.append((index, inp))
        print(inp)
    else:
        print("Please enter a valid value")

print(index_inp)
print('Changed:')
for ind, inp in index_inp:
    print('index row: ', ind, ' with ', inp)
    df.at[ind, 'allert_priority'] = inp
```
With this approach, the operator was easily able to understand the content and importance of records, as well as select records and datasets according to their usefulness. Everything was designed to have full control of the datasets, records and tags applied. Taking advantage of these characteristics, we formed a new data set of selected records from the original datasets and built a graphic representation of the forms of violence against women. The result is shown in the pie chart below.
Figure 3: Categories and Subcategories of Violence from the datasets (Source: Omdena)
From tags to deep learning
The classification of the datasets by form of violence posed a multi-label classification problem: any record could be labeled with more than one label.
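In a multi-label setting, each record's tag list is typically turned into a multi-hot target vector before training a classifier; a minimal sketch with illustrative labels (the label names and `multi_hot` helper are our own assumptions):

```python
# Convert per-record tag lists into multi-hot vectors, the usual
# target format for a multi-label classifier. Labels are illustrative.
records_tags = [
    ['sexual invite - harassments'],
    ['domestic violence', 'physical abuse'],
    ['domestic violence'],
]

labels = sorted({t for tags in records_tags for t in tags})
index = {label: i for i, label in enumerate(labels)}

def multi_hot(tags):
    """One position per known label; 1 if the record carries it."""
    vec = [0] * len(labels)
    for t in tags:
        vec[index[t]] = 1
    return vec

y = [multi_hot(tags) for tags in records_tags]
print(labels)  # ['domestic violence', 'physical abuse', 'sexual invite - harassments']
print(y)       # [[0, 0, 1], [1, 1, 0], [1, 0, 0]]
```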
Since the original datasets were not already labeled, we had two typical options: unsupervised classification to label the records via machine learning, or labeling the records ourselves for supervised classification. As mentioned above, given the type of data, unsupervised classification did not seem easy to apply and could have led to misleading results. In addition, we needed a classification system that was particularly informative and human-friendly. The approach above seemed to meet these needs and overcome the problems described. Nonetheless, that solution raised the problem of the rigidity of a subjectively imposed categorization system, which was difficult to measure in performance and difficult to generalize as records increased. Having labeled the records, the generalization problem could be addressed with a supervised classification model. In other words, we built two models. With the first, we used a system of tags that allowed us to easily navigate through the records and to classify forms of violence with particularly informative human labels consisting of categories and subcategories. With the second, we generalized the first model using state-of-the-art deep learning techniques (GloVe + GRU) and obtained a way to measure the performance and reliability of the first model. Essentially, the ability of the deep learning model to function and generalize depended on the quality of the model on which it was trained. With GloVe and GRU, we got 88% accuracy, but that is another story.
Figure 4: Roadmap for Deep Learning Tool (Source: Omdena)
Some recently introduced ethical principles require that AI products be built and used responsibly, transparently, and under human oversight. The solutions presented here seek to embody this new culture, in keeping with Omdena’s mission and style. Model 1 offers a fully controllable, human-friendly data classification system and is therefore particularly informative, but difficult to measure in performance and to generalize. Model 2 generalizes Model 1, making the rigid approach elastic and the performance measurable through AI solutions. Beyond the technical considerations, Omdena gave us the opportunity to work on a real problem and to better understand and study what GBV is. As mentioned, Gender-Based Violence is unfortunately not only a Nigerian problem but a worldwide one. With this contribution, we hope to have sensitized the reader to this particularly execrable issue.
References
- Plan International, Free to Be Online (Girls Online)
- Report on the Rapid Assessment Survey on Gender-Based Violence Experiences of Women with Disabilities in Nigeria
- Data Points on Gender Norms and Sexism in and Around Maiduguri
- Nigerian Women Take Action as Rape, Assault Cases Surge During Pandemic
- The Shadow Pandemic: Gender-Based Violence and COVID-19
- Gender-Based Violence in Nigeria During the COVID-19 Crisis: The Shadow Pandemic