Data Intelligence Platform

Sephora Review Intelligence Pipeline

A complete system that collects millions of beauty product reviews, analyzes them with AI, and identifies which products are genuinely loved by real customers.

4.4M Reviews Collected
1.4M Unique Reviewers
17 Years of Data
3 AI Models

Executive Summary

Understanding the scope and power of our review intelligence system

4.4 Million Reviews

Collected from Sephora, spanning 17 years of customer feedback (2008-2025)

1.4 Million Reviewers

Unique reviewers with demographics like skin type, skin tone, and age

Three AI Models

Working together to separate genuine reviews from fake/paid ones

The "Love Score"

Custom formula identifying products people genuinely love, not just inflated ratings

End Goal

Find the best products, then scrape their ingredient lists to understand what formulations actually work.

The Big Picture

Think of this pipeline like a factory with four stations

Collection

Scrape reviews from Sephora

Cleaning

Organize into clean tables

Intelligence

AI models add quality signals

Ranking

Score products by true love

Why this order matters: You can't analyze messy data. You can't rank products without quality signals. Each stage builds on the previous one.

Collecting the Data

Every Sephora product review contains a goldmine of information

What We're Collecting

What We Capture | Why It Matters
Star Rating (1-5) | The obvious signal, but easily manipulated
Review Text | The real story: what did they love or hate?
Would Recommend? | Often more honest than star ratings
Helpful Votes | Crowd wisdom: which reviews are actually useful?
Reviewer Demographics | Skin type, skin tone, age: does it work for people like you?
Photos | Visual proof the reviewer actually used the product
Incentivized Flag | Did Sephora give them free product for this review?
Date Posted | Is this a recent opinion or one from years ago?

How We Get It

Sephora uses a service called BazaarVoice to power their review system. This service has an API (think of it like a data faucet) that we can tap into.

1. Product Discovery: Start with a list of all Sephora products (from their sitemap).

2. API Requests: For each product, request all its reviews from the BazaarVoice API.

3. Handle Challenges: Deal with rate limits, errors, and pagination (reviews come in batches of 100).

4. Storage: Save everything as compressed files for later processing.
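The collection loop described above can be sketched as a small pagination helper. Here `fetch_page` stands in for the real BazaarVoice client call; the function names and batch handling are illustrative, not the pipeline's actual code:

```python
def collect_reviews(fetch_page, batch_size=100):
    """Pull every review for a product from a paginated API.

    fetch_page(offset, limit) returns a list of review dicts; an empty
    or short page signals the end. Rate-limit handling and retries
    would wrap the fetch_page call in the real scraper.
    """
    offset = 0
    while True:
        batch = fetch_page(offset, batch_size)
        if not batch:
            break
        yield from batch
        if len(batch) < batch_size:
            break  # short page: nothing left to fetch
        offset += batch_size

# Usage with a stubbed API serving 250 fake reviews:
fake_store = [{"id": i} for i in range(250)]
reviews = list(collect_reviews(lambda off, lim: fake_store[off:off + lim]))
assert len(reviews) == 250
```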

Organizing the Data

Using a "star schema" - one central table with satellite tables for related information

Why We Split Into Multiple Tables

Imagine if we stored everything in one giant spreadsheet with 100+ columns. It would be:

  • Slow: Every query reads unnecessary data
  • Wasteful: User demographics repeat for every review
  • Messy: Hard to update one piece without touching everything
Database Schema: Hub and Spokes

The hub is the Reviews table (4.4M records). The spokes: User Profiles (demographics), User Aggregates (power users), Engagement (helpful votes), Photos (visual proof), Product Scores (ML outputs), and Metadata (timestamps).

The Reviews Table (The Hub)

4,391,587 reviews

This is the heart of everything. Each row is one review.

Column | What It Contains
Review ID | A unique identifier (like a social security number for reviews)
Product ID | Which product this review is about
Author ID | Who wrote it (links to user info)
Rating | 1-5 stars
Review Text | The actual written review
Title | The headline of the review
Would Recommend | Yes/No
Date Posted | When they wrote it
Key insight: 98% of reviews have actual text, not just star ratings.

User Profiles Table

Demographics

Demographics for each review - who is this person?

Column | What It Contains
Skin Type | Oily, Dry, Combination, Normal
Skin Tone | Fair, Light, Medium, Tan, Deep
Eye Color | For makeup relevance
Hair Color | For hair product relevance
Age Range | 18-24, 25-34, etc.
Is Incentivized | Did they receive free product?
Is Staff | Are they a Sephora employee?

Coverage statistics: Skin Type 82.6%, Skin Tone 74.8%, Age 15%

Engagement Table

Community Reaction

How did the community react to each review?

Column | What It Contains
Helpful Votes | How many people found this useful
Not Helpful Votes | How many disagreed
Helpfulness Score | Calculated ratio (helpful / total votes)
A review with 500 helpful votes is more trustworthy than one with 0.

Photos Table

Visual Proof

Visual proof of product experience.

Column | What It Contains
Review ID | Links to the review
Photo URL | Where the image is hosted

About 20% of reviews include photos. These tend to be more credible: you can't easily fake a before/after photo.

User Aggregates Table

Reviewer History

A summary of each reviewer's history across ALL their reviews.

Column | What It Contains
Author ID | The reviewer
Total Reviews | How many reviews they've written
Average Rating | Their typical rating (are they harsh or generous?)
Organic Reviews | Reviews written without incentive
First/Last Review Date | How long they've been reviewing
A reviewer with 50 reviews and a 3.8 average is more trustworthy than someone who only reviews once and gives 5 stars.

Metadata Table

Technical Details

Technical details about each review's origin.

Column | What It Contains
Created/Updated Timestamps | When records changed
Source Client | Mobile app? Desktop? In-store?
Campaign ID | If part of a marketing campaign

The Key Connectors

Review ID

Every satellite table links back to the main reviews table

Product ID

Groups all reviews for a single product together

Author ID

Connects to user-level statistics
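To make the three connectors concrete, here is a toy hub-and-spoke join. The pipeline itself uses DuckDB over parquet files; this sketch uses Python's built-in sqlite3 so it runs anywhere, and the table and column names are assumptions:

```python
import sqlite3

# In-memory demo of the star schema: one hub (reviews) joined to two
# spokes via the Review ID and Author ID connectors.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE reviews    (review_id TEXT, product_id TEXT,
                             author_id TEXT, rating INTEGER);
    CREATE TABLE engagement (review_id TEXT, helpful_votes INTEGER);
    CREATE TABLE user_aggs  (author_id TEXT, total_reviews INTEGER);

    INSERT INTO reviews    VALUES ('r1', 'p1', 'u1', 5);
    INSERT INTO engagement VALUES ('r1', 312);
    INSERT INTO user_aggs  VALUES ('u1', 47);
""")

row = con.execute("""
    SELECT r.rating, e.helpful_votes, u.total_reviews
    FROM reviews r
    JOIN engagement e ON e.review_id = r.review_id   -- Review ID connector
    JOIN user_aggs  u ON u.author_id = r.author_id   -- Author ID connector
    WHERE r.product_id = 'p1'                        -- Product ID connector
""").fetchone()
print(row)  # (5, 312, 47)
```

The same SELECT, pointed at the parquet tables, is how the intelligence and ranking stages pull review, engagement, and reviewer-history signals together in one pass.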

The Intelligence Layer

Three AI models that separate genuine reviews from fake ones

Raw data only tells part of the story. A product might have 1,000 five-star reviews, but if half of them are from paid reviewers, that's not genuine love. This stage adds three layers of AI-powered intelligence.

Model A

Review Quality Scoring

The Problem

Not all reviews are created equal. Consider these two 5-star reviews:

★★★★★

"Great!"

One word, no details, new reviewer
★★★★★

"I've been using this for 3 months now. My oily T-zone is finally under control without feeling tight..."

Detailed, experienced reviewer, photos, 50 helpful votes

How It Works

We use "incentivized" reviews as a signal. Incentivized reviews tend to be more detailed and structured, but also biased. By learning to detect these patterns, the model learns what makes a review substantive.

11 Features Used

  • Text length
  • Word count
  • Rating
  • Would recommend
  • Helpful votes
  • User's total reviews
  • User's avg rating
  • Photos included
  • +3 more

Output

A continuous quality score from 0.0 (low quality) to 1.0 (high quality).
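A sketch of how such a feature vector might be assembled from a review and its author's aggregates. The field names and the three extra features are assumptions, since only eight of the eleven are named above:

```python
def quality_features(review, user):
    """Build an 11-element feature vector for one review.

    `review` and `user` are plain dicts; field names are illustrative.
    The last three features are assumed stand-ins for the "+3 more".
    """
    text = review.get("text", "")
    words = text.split()
    total_votes = review["helpful_votes"] + review["not_helpful_votes"]
    return [
        len(text),                                  # text length (chars)
        len(words),                                 # word count
        review["rating"],                           # 1-5 stars
        1.0 if review["would_recommend"] else 0.0,  # would recommend
        review["helpful_votes"],                    # helpful votes
        user["total_reviews"],                      # user's total reviews
        user["avg_rating"],                         # user's avg rating
        1.0 if review["has_photos"] else 0.0,       # photos included
        # assumed extras:
        review["helpful_votes"] / total_votes if total_votes else 0.0,
        sum(len(w) for w in words) / len(words) if words else 0.0,
        1.0 if len(words) >= 50 else 0.0,           # substantive flag
    ]

example = quality_features(
    {"text": "My oily T-zone is finally under control.", "rating": 5,
     "would_recommend": True, "helpful_votes": 12,
     "not_helpful_votes": 1, "has_photos": True},
    {"total_reviews": 47, "avg_rating": 3.8},
)
assert len(example) == 11
```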
Model B

Fake Review Detection

The Problem

Some reviews are suspicious: brand employees, paid reviewers, competitors leaving fake negatives, or bot-generated text.

68+ Detection Signals

  • Vocabulary simplicity/repetition
  • Excessive superlatives
  • Uniform sentence structure
  • Suspicious phrases
  • Reading level anomalies
  • Hedge word frequency
  • One-time reviewer detection
  • Always 5-star pattern
  • Organic review count
  • Rating deviation from average
  • New account + immediate review
  • Coordinated review spikes
  • Time of day patterns
  • Helpfulness validation
  • Sentiment-engagement mismatch
  • Outlier detection
  • Product pattern analysis

Ensemble Approach

Traditional ML (fast, interpretable) + a BERT language model (text nuance), blended 50/50.

Output: Fake Probability

A fake probability from 0.0 (genuine) through 0.5 (uncertain) to 1.0 (likely fake).
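The 50/50 blend reduces to a weighted average of the two model scores, clipped to [0, 1]. A minimal sketch; the band thresholds used for labels are illustrative, since the document only anchors the 0.0 / 0.5 / 1.0 points:

```python
def fake_probability(ml_prob: float, bert_prob: float, w: float = 0.5) -> float:
    """Blend the traditional-ML and BERT fake scores (50/50 by default)."""
    p = w * ml_prob + (1 - w) * bert_prob
    return min(1.0, max(0.0, p))

def fake_label(p: float) -> str:
    """Map a blended probability onto the scale's anchor labels.

    Band edges (0.35 / 0.65) are assumptions for illustration.
    """
    if p < 0.35:
        return "genuine"
    if p <= 0.65:
        return "uncertain"
    return "likely fake"

assert abs(fake_probability(0.2, 0.4) - 0.3) < 1e-9
assert fake_label(0.9) == "likely fake"
```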
Model C

Sentiment Analysis

The Problem

Star ratings are blunt instruments. Consider:

★★★★☆

"This is actually my holy grail product, I'd give it 5 but I'm strict with ratings"

★★★★☆

"It's okay I guess, nothing special, might repurchase if on sale"

Both are 4 stars, but the sentiment is completely different.

How It Works

DistilBERT fine-tuned on sentiment reads the review text and outputs a sentiment score.

Output

A sentiment score from 0.0 (negative) through 0.5 (neutral) to 1.0 (positive).
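Sentiment classifiers like the common SST-2 DistilBERT fine-tune emit a label plus a confidence rather than a 0-1 score directly, so an adapter is needed to land on this scale. The mapping below is an assumption about that step; the actual model call is omitted:

```python
def to_sentiment_score(label: str, confidence: float) -> float:
    """Map a classifier's (label, confidence) pair onto the 0-1 scale.

    A confident POSITIVE lands near 1.0, a confident NEGATIVE near 0.0,
    and low-confidence predictions of either drift toward 0.5 (neutral).
    Label names follow the common SST-2 DistilBERT fine-tune.
    """
    if label.upper() == "POSITIVE":
        return 0.5 + 0.5 * confidence
    return 0.5 - 0.5 * confidence

assert to_sentiment_score("POSITIVE", 1.0) == 1.0
assert to_sentiment_score("NEGATIVE", 1.0) == 0.0
```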

Finding the Best Products

The Love Score - identifying products people genuinely love

The Love Score Formula

A custom metric designed to be manipulation-resistant. High ratings from paid reviewers don't count as much. Engagement signals genuine enthusiasm. Recent activity matters.

The Five Ingredients

Organic Quality (35%): What unpaid reviewers actually think
  • 60% average rating from organic reviewers
  • 40% recommendation rate from organic reviewers

Engagement Quality (25%): Community validation signals
  • 50% helpfulness ratio
  • 30% photo review percentage
  • 20% substantive reviews (50+ words)

Authenticity (15%): Percentage of genuine reviews. The ratio of organic reviews to total reviews; default assumption: 13% incentivized.

Diversity (15%): Works across different people
  • 50% skin type diversity
  • 50% skin tone diversity

Trend (10%): Current momentum. Review velocity = reviews in the last 180 days / total reviews.

The Confidence Multiplier

Raw scores don't mean much without confidence. A product with 3 reviews might have a perfect Love Score, but we can't trust it.

confidence = log(organic_reviews + 1) / log(150)
  • 0 reviews: ~0%
  • 10 reviews: ~46%
  • 50 reviews: ~78%
  • 150+ reviews: 100% (capped)
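The multiplier is one line of code; it reproduces the anchor percentages above to within a couple of points (the +1 inside the log nudges the small-count values slightly):

```python
import math

def confidence(organic_reviews: int) -> float:
    """Confidence multiplier: log(n + 1) / log(150), capped at 1.0."""
    return min(1.0, math.log(organic_reviews + 1) / math.log(150))

assert confidence(0) == 0.0        # no reviews, no trust
assert confidence(1000) == 1.0     # capped at 150+ organic reviews
assert 0.77 < confidence(50) < 0.80
```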

Adjustments (Penalties & Boosts)

  • Inflation Penalty (up to -15 pts): When incentivized reviewers rate significantly higher than organic reviewers
  • Staff Penalty (up to -14 pts): When more than 30% of reviews are from Sephora employees
  • Polarization Penalty (-10 pts): When rating std dev > 1.5 AND negative reviews > 15%
  • ML Quality Penalty (up to -12 pts): When the average review quality score < 0.5
  • Rating Trend (±10 pts): Comparing recent ratings to historical ratings
  • Power User Boost (up to +8 pts): When experienced reviewers (21+ organic reviews) love the product

Final Score Calculation

raw_score = (0.35 × organic_quality)
          + (0.25 × engagement_quality)
          + (0.15 × authenticity)
          + (0.15 × diversity)
          + (0.10 × trend)

weighted_score = raw_score × confidence

final_score = weighted_score
            + inflation_penalty
            + staff_penalty
            + polarization_penalty
            + rating_trend
            + power_user_boost
            + ml_quality_penalty

(capped between 0 and 1)
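The calculation above translates directly into a scoring function. One assumption in this sketch: the point-based adjustments are first converted onto the 0-1 scale (e.g. -15 pts becomes -0.15) before being added to the weighted score:

```python
def love_score(components, confidence, adjustments):
    """Mirror the final-score formula.

    components: the five ingredients, each on a 0-1 scale.
    confidence: the log-based multiplier (0-1).
    adjustments: penalties/boosts already rescaled from points to 0-1,
    an assumption about how points map onto the capped score.
    """
    raw = (0.35 * components["organic_quality"]
           + 0.25 * components["engagement_quality"]
           + 0.15 * components["authenticity"]
           + 0.15 * components["diversity"]
           + 0.10 * components["trend"])
    final = raw * confidence + sum(adjustments.values())
    return min(1.0, max(0.0, final))  # capped between 0 and 1

score = love_score(
    {"organic_quality": 0.9, "engagement_quality": 0.7,
     "authenticity": 0.87, "diversity": 0.8, "trend": 0.4},
    confidence=0.78,
    adjustments={"inflation": -0.05, "staff": 0.0, "polarization": 0.0,
                 "ml_quality": 0.0, "rating_trend": 0.02, "power_user": 0.04},
)
assert 0.0 <= score <= 1.0
```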

Product Tiering

For products that need detail scraping (price, ingredients), we use a simpler tiering system:

Review Count
  • 1,000+ reviews: 40 pts
  • 500-999 reviews: 35 pts
  • 100-499 reviews: 25 pts
  • 50-99 reviews: 15 pts
  • 10-49 reviews: 5 pts

Star Average
  • 4.5+ star average: 30 pts
  • 4.0-4.49 stars: 20 pts
  • 3.5-3.99 stars: 10 pts

Substantive Reviews
  • 100+ substantive reviews: 20 pts
  • 50-99 substantive: 15 pts
  • 10-49 substantive: 10 pts

Priority tiers:
  • High Priority (70+ points): Scrape ASAP
  • Medium Priority (40-69 points): Worth scraping
  • Low Priority (15-39 points): Low value
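The tiering rules translate directly into a scoring function. Thresholds are copied from the point table above; the behavior below 15 points ("skip") is an assumption, since the document doesn't name that band:

```python
def tier_points(review_count, star_avg, substantive_count):
    """Sum the three point factors per the tiering table."""
    pts = 0
    if review_count >= 1000:
        pts += 40
    elif review_count >= 500:
        pts += 35
    elif review_count >= 100:
        pts += 25
    elif review_count >= 50:
        pts += 15
    elif review_count >= 10:
        pts += 5
    if star_avg >= 4.5:
        pts += 30
    elif star_avg >= 4.0:
        pts += 20
    elif star_avg >= 3.5:
        pts += 10
    if substantive_count >= 100:
        pts += 20
    elif substantive_count >= 50:
        pts += 15
    elif substantive_count >= 10:
        pts += 10
    return pts

def tier(points):
    if points >= 70:
        return "high"
    if points >= 40:
        return "medium"
    if points >= 15:
        return "low"
    return "skip"  # below 15 pts: assumed not worth scraping

assert tier(tier_points(1200, 4.6, 150)) == "high"   # 40 + 30 + 20 = 90
assert tier(tier_points(200, 4.2, 60)) == "medium"   # 25 + 20 + 15 = 60
```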

Complete Pipeline Order

The correct order to run everything

1. Data Collection: Scrape reviews from Sephora's BazaarVoice API. Output: raw compressed files.
2. Data Cleaning: Transform raw files into organized tables; deduplicate and normalize. Output: 6 parquet tables.
3. Train Review Quality Model: Learn what makes a review high-quality. Output: trained model file.
4. Score Review Quality: Apply the trained model to all 4.4M reviews. Output: quality scores per review and per product.
5. Score Sentiment: Run sentiment analysis on all review text. Output: sentiment score per review.
6. Train Fake Detection Model: Learn to detect fake/incentivized reviews. Output: ML model + BERT model (optional; GPU needed).
7. Score Fake Detection: Apply fake detection to all reviews. Output: fake probability per review.
8. Run Product Finder: Calculate Love Scores using all the signals. Output: ranked product list with full transparency.
9. Run Product Prioritizer: Calculate tier scores for detail scraping. Output: tiered product list.
10. Generate Reports: Create analytics reports from the database. Output: Markdown and JSON reports.

Key Insights from the Data

Important patterns revealed by our analysis

1. Ratings are heavily skewed positive: 5★ 64.3%, 4★ ~18%, 3★ ~8%, 2★ ~4%, 1★ 5.3%. This is why raw average rating is a poor signal.

2. Incentivized reviews are common: ~13% is the default assumption, but the share varies significantly by product.

3. Demographics matter: Different products work for different skin types; products with diverse positive reviews are more universally good.

4. Power users are valuable: 3-10% of reviewers have 21+ organic reviews, and their opinions are more predictive of true quality.

5. Photos indicate genuine usage: ~20% of reviews have photos, and these tend to be more helpful and authentic.

6. Engagement validates quality: Reviews with high helpful votes are crowd-validated; low engagement often correlates with low-quality reviews.

Current Gaps

What's missing and next steps

No Ingredients Data

We know WHAT products are loved, but not WHY from a formulation perspective.

Need: Full ingredients list, product description, how-to-use instructions, size/volume
Next: Use tiered product list to prioritize scraping

No Pricing Information

We can't factor in value-for-money without knowing prices.

Need: Current prices and price history
Next: Add price scraping to product page scraper

No Category Hierarchy

Products aren't fully categorized (e.g., Skincare > Moisturizers > Day Creams).

Need: Full category path from Sephora
Next: Extract from product pages

Glossary

Key terms and definitions

Organic Review
A review written without any incentive (no free product, no compensation). These are the most trustworthy.
Incentivized Review
A review where the reviewer received free product, discount, or other compensation. Tends to be more positive than organic reviews.
Power User
A reviewer who has written 21+ organic reviews. These people are experienced and harder to impress.
Helpfulness Ratio
The percentage of votes on a review that are "helpful" rather than "not helpful."
Substantive Review
A review with more than 50 words of actual content. These provide more insight than "Great product!"
Rating Inflation
When paid/incentivized reviewers give higher ratings than organic reviewers for the same product.
Polarization
When a product has a high standard deviation in ratings - some people love it, others hate it.
Confidence
How much we trust a score based on the amount of data. More reviews = higher confidence.
Parquet
A file format for storing tabular data that's compressed and fast to query. Think of it as a super-efficient spreadsheet format.
DuckDB
A database system that can query parquet files directly without loading them into memory.
JSONL
"JSON Lines" - a text file where each line is a separate JSON object. Easy to process line by line.
Love Score
Our custom metric that identifies products people genuinely love, designed to be manipulation-resistant.