Predicting Disaster-Related Tweets

This project contains our assignments for the course IT3212 - Data Driven Software at NTNU. Our task was to classify whether a tweet is related to a disaster.

Motivation

The rapid spread of information on social media platforms like Twitter is revolutionizing how we learn about emergencies and disasters. However, the sheer volume of data makes manual monitoring impractical. Hence, developing an automated system to filter and verify posts about genuine disaster events could be invaluable for different emergency services. This could enable faster responses to crises and help medical facilities prepare for possible surges in patient numbers, allowing for more efficient resource management in urgent situations.

The machine-learning problem to solve

Creating an automatic filter for determining if a given tweet is associated with a disaster translates to a binary classification problem, which can be solved using machine learning. Before feeding data to a model, the data has to be preprocessed and feature extraction must be considered. This requires thinking critically about how to clean the raw data and extract features to refine and improve the machine learning model.

Preprocessing the data

Our first step involved removing irrelevant data, including certain columns and rows that didn't contribute to our analysis, such as duplicate tweets and rows with uncertain classifications. We also set a confidence score threshold to enhance data reliability.
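The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names (`text`, `label`, `confidence`) and the threshold value of 0.7 are assumptions, since neither the dataset schema nor the cutoff is stated here.

```python
# Hypothetical rows; field names and values are made up for illustration.
rows = [
    {"text": "Forest fire near the highway", "label": 1, "confidence": 1.0},
    {"text": "Forest fire near the highway", "label": 1, "confidence": 1.0},  # duplicate
    {"text": "This mixtape is fire", "label": 0, "confidence": 0.55},  # low confidence
    {"text": "Traffic is a disaster today", "label": 0, "confidence": 0.9},
]

MIN_CONFIDENCE = 0.7  # assumed threshold, not the value used in the project


def clean_rows(rows, min_confidence=MIN_CONFIDENCE):
    """Drop rows below the confidence threshold, then drop duplicate tweets."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["confidence"] < min_confidence:
            continue  # uncertain classification
        key = row["text"].strip().lower()
        if key in seen:
            continue  # tweet text already kept once
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned = clean_rows(rows)
print(len(cleaned))  # 2
```

Filtering on confidence before deduplicating means a low-confidence copy of a tweet cannot shadow a high-confidence copy of the same text.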

Next, we normalized the text by converting all characters to lowercase, removing special characters, links, and English stopwords. Notably, we retained the essential content of hashtags and mentions while discarding irrelevant parts. Lemmatization was applied to reduce words to their base forms, aiding in effective text analysis.
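As a rough sketch of the normalization described above, the function below lowercases a tweet, strips links and special characters, and removes stopwords. It assumes that "retaining the content" of hashtags and mentions means keeping the word and dropping the `#` or `@` symbol. The stopword list is a tiny stand-in, and lemmatization is left out because it needs an external lemmatizer.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "at", "of", "to", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def normalize_tweet(text):
    """Lowercase, remove links and special characters, and drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    # Removing '#' and '@' with the other special characters keeps the
    # hashtag or mention word itself (an assumption about the original step).
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


print(normalize_tweet("Huge #Wildfire in California! Stay safe @RedCross https://t.co/abc123"))
# huge wildfire california stay safe redcross
```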

Our feature extraction process involved techniques like TF-IDF, which weights each word by its frequency in a tweet relative to how common it is across the corpus. Additional features such as text length, hashtag count, presence of URLs, and sentiment enriched our dataset, offering a more nuanced view for our models.
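To make the TF-IDF weighting and the hand-crafted features concrete, here is a small sketch. It uses the textbook form tf * log(N / df); library implementations often use a smoothed variant, so the numbers will differ from theirs. The function names are hypothetical, and the sentiment feature is omitted because it needs a lexicon or model.

```python
import math
import re
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def extra_features(raw_text):
    """Simple hand-crafted features computed on the raw (uncleaned) tweet."""
    return {
        "length": len(raw_text),
        "hashtag_count": raw_text.count("#"),
        "has_url": bool(re.search(r"https?://\S+", raw_text)),
    }


docs = ["fire downtown", "fire drill today", "sunny day today"]
weights = tf_idf(docs)
# "downtown" appears in one document, "fire" in two, so "downtown" weighs more.
print(weights[0]["downtown"] > weights[0]["fire"])  # True
print(extra_features("Flood warning #flood #alert https://t.co/x"))
```

Note that the URL feature has to be computed before links are stripped during normalization.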

Working through these steps showed us how intricate text data is and how much thoughtful, tailored preprocessing matters. We learned that data quality and preprocessing choices have a direct impact on model performance.

Selecting the classification algorithms

In our project, we employed three different machine learning algorithms: logistic regression, AdaBoost, and XGBoost. Each of these models brought unique strengths and challenges to our task of classifying tweets as disaster-related or not.

Logistic Regression

This model is a fundamental approach in machine learning for binary classification problems. In our case, it performed quite well, achieving a test accuracy of 0.7955. Its precision, recall, and F1 score were balanced, indicating a reliable performance across different aspects of classification.
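For reference, the metrics mentioned above can be computed from true and predicted labels as in the sketch below. The labels are toy values made up for illustration, not the project's results.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = disaster-related, 0 = not disaster-related.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there is one false positive and one false negative, so precision, recall, and F1 all come out equal, which is what "balanced" means here.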

AdaBoost

Short for Adaptive Boosting, AdaBoost is a boosting algorithm that builds a series of weak learners and combines them to form a stronger model. It is well suited to binary classification problems such as ours. In training, AdaBoost achieved an accuracy of 0.8236. However, its test accuracy dropped to 0.7808, a gap of about 4 percentage points that indicates some difficulty generalizing to new data.
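To illustrate how weak learners are combined, here is a toy sketch of discrete AdaBoost with decision stumps in plain Python. It is not the implementation used in the project, which presumably relied on a library. Labels are assumed to be -1 and +1, and the data is a made-up one-dimensional example that no single stump can classify correctly.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] >= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] >= thr else -polarity


def adaboost_fit(X, y, n_rounds=3):
    """Fit a list of (alpha, stump) pairs on labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Predict the sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: the negative class sits in the middle, so one threshold is not enough.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

single = adaboost_fit(X, y, n_rounds=1)
boosted = adaboost_fit(X, y, n_rounds=3)
print([adaboost_predict(single, row) for row in X])
print([adaboost_predict(boosted, row) for row in X])
```

With one round the ensemble is a single stump and misclassifies part of the data. After three rounds the weighted vote of the stumps reproduces all six labels.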

XGBoost

Known for its efficiency and performance, XGBoost is a gradient boosting model that constructs an ensemble of decision trees sequentially, with each tree correcting the errors of its predecessor. While it is celebrated for its speed and capability to handle various data complexities, we discovered that our XGBoost model was overfitted, meaning it performed well on training data but was less reliable in classifying new tweets.

In determining the best pipeline, we learned that logistic regression offered the most balanced and reliable results without overfitting. Although XGBoost showed strong individual metrics in precision and F1 score, its overfitting issue made it less reliable for new data. AdaBoost, while effective, had slightly lower scores overall. The performance differences among these models were marginal, at less than 3 percentage points, which shows that all models were reasonably effective for our task.

This observation reinforced our belief that the key determinants of performance were not the models themselves, but rather the quality of the data and the effectiveness of the text vectorization.

Solution

The machine-learning solution

Our implemented pipeline encompassed data preprocessing, feature extraction, and model evaluation. The preprocessing phase, including the removal of duplicate entries and those with low confidence scores, significantly enhanced the dataset's quality. However, we recognized that the TF-IDF method used for vectorizing textual data had limitations in capturing contextual meanings within the text.

The real-world solution

The machine learning solution we developed is particularly useful for emergency response units (ERUs) such as police, ambulance services, firefighters, and search and rescue teams. It enables these units to monitor and analyze tweets in real time, extracting vital information pertinent to emergencies.

It's important to note that our model's predictions are focused solely on identifying whether a tweet is related to a disaster, not on the accuracy of the tweet's content. This functionality serves as a foundational tool for future developments, where the focus can shift towards tailoring predictions more precisely to the needs of specific stakeholders like ERUs.