
🏀 Texas Women’s Basketball Instagram Engagement Analysis 🏀

Unstructured Data · Web Scraping · Computer Vision · Text Analytics · Classification

Project Type: Unstructured Data Analysis + Classification
Tools: Python, Selenium, BeautifulSoup, Pandas, Google Vision API, Scikit-learn, LDA
Data: 500+ Instagram posts from UT Women’s Basketball
Output: Engagement prediction + content strategy insights


 

Project Overview

Social media is one of the primary ways collegiate athletic programs connect with fans, recruits, and alumni. For Texas Women’s Basketball, Instagram plays a key role in telling the program’s story — from game action and player moments to announcements and branding.

 

This project analyzes how visual content and caption text influence Instagram engagement for the University of Texas Women’s Basketball team. Starting from raw, publicly available Instagram content, I built an end-to-end pipeline that scrapes posts, extracts visual and textual signals, trains predictive models, and translates results into actionable content strategy insights.

 

The goal is not just prediction, but insight: What should the program post more of, and why?

Business Question

Which types of Instagram posts generate higher engagement for Texas Women’s Basketball, and what content characteristics consistently drive strong performance?

Data & Problem Setup

Data Sources and Collection

Instagram data is not directly available as a clean, structured dataset, so posts were collected using web scraping and browser automation. Using Selenium and BeautifulSoup, I scraped over 500 posts from the official UT Women’s Basketball Instagram account.

 

For each post, the following fields were collected:

  • Image URL (visual content)

  • Caption text (language, hashtags, emojis)

  • Number of likes (engagement metric)

 

This mirrors a real-world analytics scenario where data must be assembled from unstructured, dynamic webpages before any modeling can begin.

Engagement Definition

To frame this as a classification problem, engagement was defined relative to the account’s own performance:

  • Posts above the median number of likes → High engagement

  • Posts below the median → Low engagement

This avoids arbitrary thresholds and focuses on what outperforms typical content.
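A minimal sketch of the median split, assuming like counts are already in a plain list. The like counts are hypothetical, and labelling posts exactly at the median as low is my assumption, since the rule above only covers posts above and below it.

```python
from statistics import median


def label_engagement(likes):
    """Return 'high' for posts above the account median, 'low' otherwise."""
    cutoff = median(likes)
    # Assumption: posts exactly at the median are labelled "low";
    # the definition above does not say how ties are handled.
    return ["high" if n > cutoff else "low" for n in likes]


# Hypothetical like counts, for illustration only.
likes = [120, 450, 300, 980, 610, 75]
labels = label_engagement(likes)
print(median(likes), labels)
```

Because the cutoff comes from the account's own distribution, the two classes stay close to balanced however popular the account is.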


Feature Construction (Unstructured → Structured)

Image Labeling (Computer Vision)

Each Instagram image was processed using the Google Vision API, which extracts descriptive labels representing what appears in the image (e.g., basketball game, celebration, crowd, athlete). 

This step converts raw visual content into model-ready textual features while preserving interpretability.
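As a rough illustration of that conversion, the sketch below assumes the labels for one image have already been retrieved as (description, score) pairs. The `labels_to_document` helper, the `min_score` cutoff, and the example labels are all hypothetical rather than the project's actual code.

```python
def labels_to_document(labels, min_score=0.0):
    """Join Vision-style (description, score) pairs into one token string."""
    tokens = []
    for description, score in labels:
        # Assumption: low-confidence labels are dropped via a score cutoff.
        if score < min_score:
            continue
        # Keep multi-word labels together so "basketball game" is one token.
        tokens.append(description.lower().replace(" ", "_"))
    return " ".join(tokens)


# Hypothetical labels and confidence scores for a single image.
example = [("Basketball game", 0.97), ("Crowd", 0.88), ("Celebration", 0.61)]
doc = labels_to_document(example, min_score=0.7)
print(doc)
```

Each image then becomes a short "document" of label tokens that the same text tooling used for captions can vectorize.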

Caption Text Processing

Caption text was cleaned and vectorized using a Bag-of-Words approach, capturing:

  • Keywords

  • Hashtags

  • Emojis and short hype phrases

Together, image labels and captions provide two complementary views of each post:

  • What the image shows

  • How the post tells the story
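One way a Bag-of-Words step can keep hashtags and emojis as tokens is sketched below, using only the standard library. The regular expression, the emoji code-point ranges, and the sample caption are my assumptions, not the cleaning rules used in the project.

```python
import re
from collections import Counter

# Hashtags first, then plain words, then single emoji/symbol code points.
# The ranges cover common emoji blocks only, which is an approximation.
TOKEN_PATTERN = re.compile(
    r"#\w+|\w+|[\U0001F300-\U0001FAFF\u2600-\u27BF]"
)


def caption_counts(caption):
    """Lowercase a caption and count words, hashtags, and emojis."""
    return Counter(TOKEN_PATTERN.findall(caption.lower()))


# Hypothetical caption, for illustration only.
counts = caption_counts("Game day 🏀 #HookEm game")
print(counts)
```

Matching hashtags before plain words keeps `#hookem` distinct from the bare word `hookem`, so a model can weigh them separately.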


Engagement Prediction (Logistic Regression)

To understand which sources of information are most predictive of Instagram engagement, I trained three logistic regression models, each using a different feature set:

  • Image labels only (visual content from Google Vision API)

  • Captions only (text, hashtags, emojis)

  • Combined image labels + captions

 

Each model predicts whether a post will receive high or low engagement, defined relative to the account’s median likes.
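The three-model comparison can be sketched with scikit-learn as follows. The toy posts, the 25% hold-out, and the default `CountVectorizer` settings are assumptions for illustration. Note that the default tokenizer drops `#` and emojis, so it is not the project's exact preprocessing.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical posts: (image-label document, caption, 1 = high engagement).
posts = [
    ("poster font text", "game day tomorrow at home", 0),
    ("poster graphic design", "schedule release", 0),
    ("font text logo", "tickets on sale now", 0),
    ("graphic logo poster", "tipoff at seven", 0),
    ("text design font", "final score update", 0),
    ("logo graphic text", "season opener announced", 0),
    ("team smile people", "family forever hookem", 1),
    ("people fun smile", "behind the scenes with the team", 1),
    ("team people event", "birthday love for our captain", 1),
    ("smile fun team", "media day vibes", 1),
    ("people event fun", "senior night memories", 1),
    ("team smile event", "welcome home champions", 1),
]
label_docs = [p[0] for p in posts]
captions = [p[1] for p in posts]
y = [p[2] for p in posts]

# One shared split so all three models see the same train/test posts.
idx_train, idx_test = train_test_split(
    list(range(len(posts))), test_size=0.25, stratify=y, random_state=0
)
y_train = [y[i] for i in idx_train]
y_test = [y[i] for i in idx_test]


def vectorize(docs):
    """Fit a bag-of-words vocabulary on training posts only."""
    vec = CountVectorizer()
    X_train = vec.fit_transform([docs[i] for i in idx_train])
    X_test = vec.transform([docs[i] for i in idx_test])
    return X_train, X_test


Xl_train, Xl_test = vectorize(label_docs)
Xc_train, Xc_test = vectorize(captions)
feature_sets = {
    "labels": (Xl_train, Xl_test),
    "captions": (Xc_train, Xc_test),
    # Combined model: label and caption columns placed side by side.
    "combined": (
        hstack([Xl_train, Xc_train]).tocsr(),
        hstack([Xl_test, Xc_test]).tocsr(),
    ),
}

accuracy = {}
for name, (X_train, X_test) in feature_sets.items():
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)
print(accuracy)
```

Keeping separate vocabularies for labels and captions means a word such as "team" in a caption and the image label "team" get separate coefficients in the combined model.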

Model Results

Caption-based features outperformed image-only features on the test set, indicating that language and context play a stronger direct role in engagement than visual descriptors alone. However, image content still contributed meaningful signal, particularly for emotionally charged or dynamic posts.

 

Test performance (Accuracy):

  • Image labels only: 60.00%

  • Captions only: 64.85%

  • Combined (labels + captions): 64.85%

 

While the captions-only and combined models achieved the same overall accuracy, their error profiles differed, motivating a deeper evaluation beyond headline metrics.


Evaluation & Model Selection

 

The dataset consisted of 548 total posts, split into 383 training and 165 test observations. The engagement target was nearly balanced (275 high vs 273 low), making accuracy an appropriate baseline metric.
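As a quick arithmetic check, the sketch below relates the reported split sizes and test accuracies to whole-post counts. The 30% hold-out fraction is my inference from the numbers (165 of 548), not something stated in the write-up, and the correct-prediction counts are derived from the rounded percentages.

```python
import math

n_posts, n_test = 548, 165
n_train = n_posts - n_test

# Assumption: a 30% hold-out, rounded up, reproduces the reported sizes.
n_test_30 = math.ceil(0.30 * n_posts)


def implied_correct(accuracy_pct, n):
    """Number of correct predictions consistent with a rounded accuracy."""
    return round(accuracy_pct / 100 * n)


reported = [("labels", 60.00), ("captions", 64.85), ("combined", 64.85)]
implied = {name: implied_correct(acc, n_test) for name, acc in reported}

print(n_train, n_test_30)
for name, k in implied.items():
    print(name, k, f"{100 * k / n_test:.2f}%")
```

On 165 test posts, the gap between 60.00% and 64.85% corresponds to 8 additional correctly classified posts, which is worth keeping in mind when reading the differences between models.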

However, the primary objective is to identify content that is likely to perform well, meaning the cost of a false negative (missing a high-performing post) is higher than the cost of a false positive (promoting a post that underperforms).

Reviewing confusion matrices showed that the combined model (image labels + captions) provided stronger detection of high-engagement posts, making it better suited for identifying content worth amplifying. As a result, the combined model was selected as the preferred approach despite having the same overall accuracy as the captions-only model.
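To make the selection logic concrete, here is a small sketch showing how two models can tie on accuracy yet differ in recall for high-engagement posts. The true labels and predictions are hypothetical and are not the project's confusion matrices.

```python
def confusion(y_true, y_pred):
    """Return (tp, fn, fp, tn) with 1 = high engagement as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fn, fp, tn


def accuracy_and_recall(y_true, y_pred):
    tp, fn, fp, tn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true), tp / (tp + fn)


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
pred_a = [1, 1, 0, 0, 0, 0, 0, 1]  # misses two high-engagement posts
pred_b = [1, 1, 1, 0, 0, 0, 1, 1]  # misses one, at the cost of more false positives

print(accuracy_and_recall(y_true, pred_a))
print(accuracy_and_recall(y_true, pred_b))
```

Both sets of predictions score 0.625 accuracy, but the second recovers 3 of 4 high-engagement posts instead of 2 of 4. When false negatives cost more than false positives, that is the kind of difference that justifies preferring one model over the other.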


Key Modeling Takeaways

  • Captions matter more than visuals — but visuals still help. Caption-based models achieved the highest standalone accuracy, while combining image and text features improved detection of high-engagement posts.

  • Emojis and hashtags function more as brand signals than independent drivers. Preserving emojis and hashtags produced similar predictive performance to words alone, suggesting they reinforce identity rather than directly increasing engagement.

  • Model evaluation should reflect real decisions. Selecting models based on recall and error tradeoffs is more meaningful than relying solely on accuracy when the goal is to surface high-performing content.

Interpreting Content Themes (Topic Modeling)

 

Prediction alone doesn’t explain why posts succeed. To uncover underlying patterns, Latent Dirichlet Allocation (LDA) was applied to the image labels to identify recurring visual themes across posts.

A 3-topic solution was selected to balance interpretability and coverage of the content space. Each post was represented as a mixture of these topics, which makes it possible to compare how content themes differ between high- and low-engagement posts.

The three discovered topics can be summarized as:

  • Topic 1 – Posters & Branding:
    Includes design-heavy content such as game posters, announcement graphics, and branded visuals. These posts are characterized by labels related to text, design elements, and structured layouts.

  • Topic 2 – Off-Court / People-Focused Content:
    Captures behind-the-scenes moments, team interactions, and player-focused imagery. Labels emphasize people, groups, and non-game settings.

  • Topic 3 – In-Game / Live Action:
    Represents basketball gameplay and on-court action, including shots, movement, and competitive moments during games.

High vs Low Engagement Themes

Comparing topic proportions between engagement groups revealed meaningful differences:

  • High-engagement posts were most strongly associated with Off-Court / People-Focused content, suggesting that behind-the-scenes moments, player personalities, and human storytelling are key drivers of fan engagement.

  • Low-engagement posts over-indexed on Posters & Branding, indicating that graphic-heavy announcements may underperform when not supported by stronger narrative context.

  • In-Game / Live Action content appeared at similar levels across both groups, suggesting that game photos alone do not consistently differentiate engagement outcomes.
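The group comparison behind these findings can be sketched as an average of topic shares per engagement class. The mixtures below are hypothetical values chosen to mirror the pattern described above, not the project's measured proportions.

```python
def mean_topic_share(mixtures, labels, group):
    """Average each topic's share over the posts in one engagement group."""
    rows = [m for m, label in zip(mixtures, labels) if label == group]
    n_topics = len(rows[0])
    return [sum(r[k] for r in rows) / len(rows) for k in range(n_topics)]


# Hypothetical per-post mixtures: (posters, off-court, in-game).
mixtures = [
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.7, 0.1, 0.2],
    [0.6, 0.2, 0.2],
]
labels = ["high", "high", "low", "low"]

high = mean_topic_share(mixtures, labels, "high")
low = mean_topic_share(mixtures, labels, "low")
print("high:", high)
print("low: ", low)
```

A topic whose average share is similar in both groups, like the third column here, does little to separate high from low engagement.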


Insights & Recommendations

What Works Best

  • Player-focused and behind-the-scenes content

  • Posts that highlight emotion, personality, and connection

  • Captions that add narrative context to visuals

What Underperforms

  • Graphic-heavy promotional or announcement posts without storytelling

  • Content that lacks emotional or personal framing

Content Strategy Recommendations

  • Increase off-court and behind-the-scenes storytelling

  • Pair announcements and posters with narrative-driven captions

  • Use branding graphics strategically rather than as standalone posts

  • Emphasize player identity and team culture, not just game action

Why This Matters

 

This project demonstrates how unstructured data (images and text) can be transformed into actionable insights using modern analytics techniques. Rather than guessing what content works, the approach provides a data-driven framework for shaping social media strategy in collegiate athletics.

 

The same workflow can be applied to:

  • Professional sports teams

  • Brand social media accounts

  • Marketing and content analytics teams

Tools & Skills Demonstrated

  • Computer Vision (Google Vision API)

  • Text Analytics (Bag-of-Words)

  • Classification Modeling (Logistic Regression)

  • Topic Modeling (LDA)

  • Translating models into business recommendations
