Data Intelligence Platform

Sephora Review Intelligence Pipeline

A complete system that collects millions of beauty product reviews, analyzes them with AI, and identifies which products are genuinely loved by real customers.

4.4M Reviews Collected
1.4M Unique Reviewers
17 Years of Data
3 AI Models

Executive Summary

Understanding the scope and power of our review intelligence system

4.4 Million Reviews

Collected from Sephora, spanning 17 years of customer feedback (2008-2025)

1.4 Million Reviewers

Unique reviewers with demographics like skin type, skin tone, and age

Three AI Models

Working together to separate genuine reviews from fake/paid ones

The "Love Score"

Custom formula identifying products people genuinely love, not just inflated ratings

End Goal

Find the best products, then scrape their ingredient lists to understand what formulations actually work.

The Big Picture

Think of this pipeline like a factory with four stations

Collection

Scrape reviews from Sephora

Cleaning

Organize into clean tables

Intelligence

AI models add quality signals

Ranking

Score products by true love

Why this order matters: You can't analyze messy data. You can't rank products without quality signals. Each stage builds on the previous one.

Collecting the Data

Every Sephora product review contains a goldmine of information

What We're Collecting

What We Capture | Why It Matters
Star Rating (1-5) | The obvious signal, but easily manipulated
Review Text | The real story: what did they love or hate?
Would Recommend? | Often more honest than star ratings
Helpful Votes | Crowd wisdom: which reviews are actually useful?
Reviewer Demographics | Skin type, skin tone, age: does it work for people like you?
Photos | Visual proof the reviewer actually used the product
Incentivized Flag | Did Sephora give them free product for this review?
Date Posted | Is this a recent opinion or one from years ago?

How We Get It

Sephora uses a service called BazaarVoice to power their review system. This service has an API (think of it like a data faucet) that we can tap into.

1. Product Discovery: Start with a list of all Sephora products (from their sitemap).

2. API Requests: For each product, request all its reviews from the BazaarVoice API.

3. Handle Challenges: Deal with rate limits, errors, and pagination (reviews come in batches of 100).

4. Storage: Save everything as compressed files for later processing.
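The collection loop described above can be sketched as a small pagination helper. Here `fetch_page` stands in for the real BazaarVoice client call; the function names and batch handling are illustrative, not the pipeline's actual code:

```python
def collect_reviews(fetch_page, batch_size=100):
    """Pull every review for a product from a paginated API.

    fetch_page(offset, limit) returns a list of review dicts; an empty
    or short page signals the end. Rate-limit handling and retries
    would wrap the fetch_page call in the real scraper.
    """
    offset = 0
    while True:
        batch = fetch_page(offset, batch_size)
        if not batch:
            break
        yield from batch
        if len(batch) < batch_size:
            break  # short page: nothing left to fetch
        offset += batch_size

# Usage with a stubbed API serving 250 fake reviews:
fake_store = [{"id": i} for i in range(250)]
reviews = list(collect_reviews(lambda off, lim: fake_store[off:off + lim]))
assert len(reviews) == 250
```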

Organizing the Data

Using a "star schema" - one central table with satellite tables for related information

Why We Split Into Multiple Tables

Imagine if we stored everything in one giant spreadsheet with 100+ columns. It would be:

  • Slow: Every query reads unnecessary data
  • Wasteful: User demographics repeat for every review
  • Messy: Hard to update one piece without touching everything
Database Schema: Hub and Spokes

The hub is the Reviews table (4.4M records). The spokes: User Profiles (demographics), User Aggregates (power users), Engagement (helpful votes), Photos (visual proof), Product Scores (ML outputs), and Metadata (timestamps).

The Reviews Table (The Hub)

4,391,587 reviews

This is the heart of everything. Each row is one review.

Column | What It Contains
Review ID | A unique identifier (like a social security number for reviews)
Product ID | Which product this review is about
Author ID | Who wrote it (links to user info)
Rating | 1-5 stars
Review Text | The actual written review
Title | The headline of the review
Would Recommend | Yes/No
Date Posted | When they wrote it
Key insight: 98% of reviews have actual text, not just star ratings.

User Profiles Table

Demographics

Demographics for each review - who is this person?

Column | What It Contains
Skin Type | Oily, Dry, Combination, Normal
Skin Tone | Fair, Light, Medium, Tan, Deep
Eye Color | For makeup relevance
Hair Color | For hair product relevance
Age Range | 18-24, 25-34, etc.
Is Incentivized | Did they receive free product?
Is Staff | Are they a Sephora employee?

Coverage statistics: Skin Type 82.6%, Skin Tone 74.8%, Age 15%

Engagement Table

Community Reaction

How did the community react to each review?

Column | What It Contains
Helpful Votes | How many people found this useful
Not Helpful Votes | How many disagreed
Helpfulness Score | Calculated ratio (helpful / total votes)
A review with 500 helpful votes is more trustworthy than one with 0.

Photos Table

Visual Proof

Visual proof of product experience.

Column | What It Contains
Review ID | Links to the review
Photo URL | Where the image is hosted

About 20% of reviews include photos. These tend to be more credible: you can't easily fake a before/after photo.

User Aggregates Table

Reviewer History

A summary of each reviewer's history across ALL their reviews.

Column | What It Contains
Author ID | The reviewer
Total Reviews | How many reviews they've written
Average Rating | Their typical rating (are they harsh or generous?)
Organic Reviews | Reviews written without incentive
First/Last Review Date | How long they've been reviewing
A reviewer with 50 reviews and a 3.8 average is more trustworthy than someone who only reviews once and gives 5 stars.

Metadata Table

Technical Details

Technical details about each review's origin.

Column | What It Contains
Created/Updated Timestamps | When records changed
Source Client | Mobile app? Desktop? In-store?
Campaign ID | If part of a marketing campaign

The Key Connectors

Review ID

Every satellite table links back to the main reviews table

Product ID

Groups all reviews for a single product together

Author ID

Connects to user-level statistics
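To make the three connectors concrete, here is a toy hub-and-spoke join. The pipeline itself uses DuckDB over parquet files; this sketch uses Python's built-in sqlite3 so it runs anywhere, and the table and column names are assumptions:

```python
import sqlite3

# In-memory demo of the star schema: one hub (reviews) joined to two
# spokes via the Review ID and Author ID connectors.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE reviews    (review_id TEXT, product_id TEXT,
                             author_id TEXT, rating INTEGER);
    CREATE TABLE engagement (review_id TEXT, helpful_votes INTEGER);
    CREATE TABLE user_aggs  (author_id TEXT, total_reviews INTEGER);

    INSERT INTO reviews    VALUES ('r1', 'p1', 'u1', 5);
    INSERT INTO engagement VALUES ('r1', 312);
    INSERT INTO user_aggs  VALUES ('u1', 47);
""")

row = con.execute("""
    SELECT r.rating, e.helpful_votes, u.total_reviews
    FROM reviews r
    JOIN engagement e ON e.review_id = r.review_id   -- Review ID connector
    JOIN user_aggs  u ON u.author_id = r.author_id   -- Author ID connector
    WHERE r.product_id = 'p1'                        -- Product ID connector
""").fetchone()
print(row)  # (5, 312, 47)
```

The same SELECT, pointed at the parquet tables, is how the intelligence and ranking stages pull review, engagement, and reviewer-history signals together in one pass.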

The Intelligence Layer

Three AI models that separate genuine reviews from fake ones

Raw data only tells part of the story. A product might have 1,000 five-star reviews, but if half of them are from paid reviewers, that's not genuine love. This stage adds three layers of AI-powered intelligence.

Model A

Review Quality Scoring

The Problem

Not all reviews are created equal. Consider these two 5-star reviews:

★★★★★

"Great!"

One word, no details, new reviewer
★★★★★

"I've been using this for 3 months now. My oily T-zone is finally under control without feeling tight..."

Detailed, experienced reviewer, photos, 50 helpful votes

How It Works

We use "incentivized" reviews as a signal. Incentivized reviews tend to be more detailed and structured, but also biased. By learning to detect these patterns, the model learns what makes a review substantive.

11 Features Used

  • Text length
  • Word count
  • Rating
  • Would recommend
  • Helpful votes
  • User's total reviews
  • User's avg rating
  • Photos included
  • +3 more

Output

A continuous quality score from 0.0 (low quality) to 1.0 (high quality).
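A sketch of how such a feature vector might be assembled from a review and its author's aggregates. The field names and the three extra features are assumptions, since only eight of the eleven are named above:

```python
def quality_features(review, user):
    """Build an 11-element feature vector for one review.

    `review` and `user` are plain dicts; field names are illustrative.
    The last three features are assumed stand-ins for the "+3 more".
    """
    text = review.get("text", "")
    words = text.split()
    total_votes = review["helpful_votes"] + review["not_helpful_votes"]
    return [
        len(text),                                  # text length (chars)
        len(words),                                 # word count
        review["rating"],                           # 1-5 stars
        1.0 if review["would_recommend"] else 0.0,  # would recommend
        review["helpful_votes"],                    # helpful votes
        user["total_reviews"],                      # user's total reviews
        user["avg_rating"],                         # user's avg rating
        1.0 if review["has_photos"] else 0.0,       # photos included
        # assumed extras:
        review["helpful_votes"] / total_votes if total_votes else 0.0,
        sum(len(w) for w in words) / len(words) if words else 0.0,
        1.0 if len(words) >= 50 else 0.0,           # substantive flag
    ]

example = quality_features(
    {"text": "My oily T-zone is finally under control.", "rating": 5,
     "would_recommend": True, "helpful_votes": 12,
     "not_helpful_votes": 1, "has_photos": True},
    {"total_reviews": 47, "avg_rating": 3.8},
)
assert len(example) == 11
```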
Model B

Fake Review Detection

The Problem

Some reviews are suspicious: brand employees, paid reviewers, competitors leaving fake negatives, or bot-generated text.

68+ Detection Signals

  • Vocabulary simplicity/repetition
  • Excessive superlatives
  • Uniform sentence structure
  • Suspicious phrases
  • Reading level anomalies
  • Hedge word frequency
  • One-time reviewer detection
  • Always 5-star pattern
  • Organic review count
  • Rating deviation from average
  • New account + immediate review
  • Coordinated review spikes
  • Time of day patterns
  • Helpfulness validation
  • Sentiment-engagement mismatch
  • Outlier detection
  • Product pattern analysis

Ensemble Approach

Traditional ML (fast, interpretable) + a BERT language model (text nuance), blended 50/50.

Output: Fake Probability

A fake probability from 0.0 (genuine) through 0.5 (uncertain) to 1.0 (likely fake).
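The 50/50 blend reduces to a weighted average of the two model scores, clipped to [0, 1]. A minimal sketch; the band thresholds used for labels are illustrative, since the document only anchors the 0.0 / 0.5 / 1.0 points:

```python
def fake_probability(ml_prob: float, bert_prob: float, w: float = 0.5) -> float:
    """Blend the traditional-ML and BERT fake scores (50/50 by default)."""
    p = w * ml_prob + (1 - w) * bert_prob
    return min(1.0, max(0.0, p))

def fake_label(p: float) -> str:
    """Map a blended probability onto the scale's anchor labels.

    Band edges (0.35 / 0.65) are assumptions for illustration.
    """
    if p < 0.35:
        return "genuine"
    if p <= 0.65:
        return "uncertain"
    return "likely fake"

assert abs(fake_probability(0.2, 0.4) - 0.3) < 1e-9
assert fake_label(0.9) == "likely fake"
```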
Model C

Sentiment Analysis

The Problem

Star ratings are blunt instruments. Consider:

★★★★☆

"This is actually my holy grail product, I'd give it 5 but I'm strict with ratings"

★★★★☆

"It's okay I guess, nothing special, might repurchase if on sale"

Both are 4 stars, but the sentiment is completely different.

How It Works

DistilBERT fine-tuned on sentiment reads the review text and outputs a sentiment score.

Output

A sentiment score from 0.0 (negative) through 0.5 (neutral) to 1.0 (positive).
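Sentiment classifiers like the common SST-2 DistilBERT fine-tune emit a label plus a confidence rather than a 0-1 score directly, so an adapter is needed to land on this scale. The mapping below is an assumption about that step; the actual model call is omitted:

```python
def to_sentiment_score(label: str, confidence: float) -> float:
    """Map a classifier's (label, confidence) pair onto the 0-1 scale.

    A confident POSITIVE lands near 1.0, a confident NEGATIVE near 0.0,
    and low-confidence predictions of either drift toward 0.5 (neutral).
    Label names follow the common SST-2 DistilBERT fine-tune.
    """
    if label.upper() == "POSITIVE":
        return 0.5 + 0.5 * confidence
    return 0.5 - 0.5 * confidence

assert to_sentiment_score("POSITIVE", 1.0) == 1.0
assert to_sentiment_score("NEGATIVE", 1.0) == 0.0
```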

Finding the Best Products

The Love Score - identifying products people genuinely love

The Love Score Formula

A custom metric designed to be manipulation-resistant. High ratings from paid reviewers don't count as much. Engagement signals genuine enthusiasm. Recent activity matters.

The Five Ingredients

Organic Quality (35%): What unpaid reviewers actually think
  • 60% average rating from organic reviewers
  • 40% recommendation rate from organic reviewers

Engagement Quality (25%): Community validation signals
  • 50% helpfulness ratio
  • 30% photo review percentage
  • 20% substantive reviews (50+ words)

Authenticity (15%): Percentage of genuine reviews. The ratio of organic reviews to total reviews; default assumption: 13% incentivized.

Diversity (15%): Works across different people
  • 50% skin type diversity
  • 50% skin tone diversity

Trend (10%): Current momentum. Review velocity = reviews in the last 180 days / total reviews.

The Confidence Multiplier

Raw scores don't mean much without confidence. A product with 3 reviews might have a perfect Love Score, but we can't trust it.

confidence = log(organic_reviews + 1) / log(150)
  • 0 reviews: ~0%
  • 10 reviews: ~46%
  • 50 reviews: ~78%
  • 150+ reviews: 100% (capped)
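The multiplier is one line of code; it reproduces the anchor percentages above to within a couple of points (the +1 inside the log nudges the small-count values slightly):

```python
import math

def confidence(organic_reviews: int) -> float:
    """Confidence multiplier: log(n + 1) / log(150), capped at 1.0."""
    return min(1.0, math.log(organic_reviews + 1) / math.log(150))

assert confidence(0) == 0.0        # no reviews, no trust
assert confidence(1000) == 1.0     # capped at 150+ organic reviews
assert 0.77 < confidence(50) < 0.80
```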

Adjustments (Penalties & Boosts)

  • Inflation Penalty (up to -15 pts): When incentivized reviewers rate significantly higher than organic reviewers
  • Staff Penalty (up to -14 pts): When more than 30% of reviews are from Sephora employees
  • Polarization Penalty (-10 pts): When rating std dev > 1.5 AND negative reviews > 15%
  • ML Quality Penalty (up to -12 pts): When the average review quality score < 0.5
  • Rating Trend (±10 pts): Comparing recent ratings to historical ratings
  • Power User Boost (up to +8 pts): When experienced reviewers (21+ organic reviews) love the product

Final Score Calculation

raw_score = (0.35 × organic_quality)
          + (0.25 × engagement_quality)
          + (0.15 × authenticity)
          + (0.15 × diversity)
          + (0.10 × trend)

weighted_score = raw_score × confidence

final_score = weighted_score
            + inflation_penalty
            + staff_penalty
            + polarization_penalty
            + rating_trend
            + power_user_boost
            + ml_quality_penalty

(capped between 0 and 1)
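The calculation above translates directly into a scoring function. One assumption in this sketch: the point-based adjustments are first converted onto the 0-1 scale (e.g. -15 pts becomes -0.15) before being added to the weighted score:

```python
def love_score(components, confidence, adjustments):
    """Mirror the final-score formula.

    components: the five ingredients, each on a 0-1 scale.
    confidence: the log-based multiplier (0-1).
    adjustments: penalties/boosts already rescaled from points to 0-1,
    an assumption about how points map onto the capped score.
    """
    raw = (0.35 * components["organic_quality"]
           + 0.25 * components["engagement_quality"]
           + 0.15 * components["authenticity"]
           + 0.15 * components["diversity"]
           + 0.10 * components["trend"])
    final = raw * confidence + sum(adjustments.values())
    return min(1.0, max(0.0, final))  # capped between 0 and 1

score = love_score(
    {"organic_quality": 0.9, "engagement_quality": 0.7,
     "authenticity": 0.87, "diversity": 0.8, "trend": 0.4},
    confidence=0.78,
    adjustments={"inflation": -0.05, "staff": 0.0, "polarization": 0.0,
                 "ml_quality": 0.0, "rating_trend": 0.02, "power_user": 0.04},
)
assert 0.0 <= score <= 1.0
```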

Product Tiering

For products that need detail scraping (price, ingredients), we use a simpler tiering system:

Review Count
  • 1,000+ reviews: 40 pts
  • 500-999 reviews: 35 pts
  • 100-499 reviews: 25 pts
  • 50-99 reviews: 15 pts
  • 10-49 reviews: 5 pts

Star Average
  • 4.5+ star average: 30 pts
  • 4.0-4.49 stars: 20 pts
  • 3.5-3.99 stars: 10 pts

Substantive Reviews
  • 100+ substantive reviews: 20 pts
  • 50-99 substantive: 15 pts
  • 10-49 substantive: 10 pts

Priority tiers:
  • High Priority (70+ points): Scrape ASAP
  • Medium Priority (40-69 points): Worth scraping
  • Low Priority (15-39 points): Low value
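The tiering rules translate directly into a scoring function. Thresholds are copied from the point table above; the behavior below 15 points ("skip") is an assumption, since the document doesn't name that band:

```python
def tier_points(review_count, star_avg, substantive_count):
    """Sum the three point factors per the tiering table."""
    pts = 0
    if review_count >= 1000:
        pts += 40
    elif review_count >= 500:
        pts += 35
    elif review_count >= 100:
        pts += 25
    elif review_count >= 50:
        pts += 15
    elif review_count >= 10:
        pts += 5
    if star_avg >= 4.5:
        pts += 30
    elif star_avg >= 4.0:
        pts += 20
    elif star_avg >= 3.5:
        pts += 10
    if substantive_count >= 100:
        pts += 20
    elif substantive_count >= 50:
        pts += 15
    elif substantive_count >= 10:
        pts += 10
    return pts

def tier(points):
    if points >= 70:
        return "high"
    if points >= 40:
        return "medium"
    if points >= 15:
        return "low"
    return "skip"  # below 15 pts: assumed not worth scraping

assert tier(tier_points(1200, 4.6, 150)) == "high"   # 40 + 30 + 20 = 90
assert tier(tier_points(200, 4.2, 60)) == "medium"   # 25 + 20 + 15 = 60
```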

Complete Pipeline Order

The correct order to run everything

1. Data Collection: Scrape reviews from Sephora's BazaarVoice API. Output: raw compressed files.
2. Data Cleaning: Transform raw files into organized tables; deduplicate and normalize. Output: 6 parquet tables.
3. Train Review Quality Model: Learn what makes a review high-quality. Output: trained model file.
4. Score Review Quality: Apply the trained model to all 4.4M reviews. Output: quality scores per review and per product.
5. Score Sentiment: Run sentiment analysis on all review text. Output: sentiment score per review.
6. Train Fake Detection Model: Learn to detect fake/incentivized reviews. Output: ML model + BERT model (optional; GPU needed).
7. Score Fake Detection: Apply fake detection to all reviews. Output: fake probability per review.
8. Run Product Finder: Calculate Love Scores using all the signals. Output: ranked product list with full transparency.
9. Run Product Prioritizer: Calculate tier scores for detail scraping. Output: tiered product list.
10. Generate Reports: Create analytics reports from the database. Output: Markdown and JSON reports.

Key Insights from the Data

Important patterns revealed by our analysis

1. Ratings are heavily skewed positive: 5★ 64.3%, 4★ ~18%, 3★ ~8%, 2★ ~4%, 1★ 5.3%. This is why raw average rating is a poor signal.

2. Incentivized reviews are common: ~13% is the default assumption, but the share varies significantly by product.

3. Demographics matter: Different products work for different skin types; products with diverse positive reviews are more universally good.

4. Power users are valuable: 3-10% of reviewers have 21+ organic reviews, and their opinions are more predictive of true quality.

5. Photos indicate genuine usage: ~20% of reviews have photos, and these tend to be more helpful and authentic.

6. Engagement validates quality: Reviews with high helpful votes are crowd-validated; low engagement often correlates with low-quality reviews.

Current Gaps

What's missing and next steps

No Ingredients Data

We know WHAT products are loved, but not WHY from a formulation perspective.

Need: Full ingredients list, product description, how-to-use instructions, size/volume
Next: Use tiered product list to prioritize scraping

No Pricing Information

We can't factor in value-for-money without knowing prices.

Need: Current prices and price history
Next: Add price scraping to product page scraper

No Category Hierarchy

Products aren't fully categorized (e.g., Skincare > Moisturizers > Day Creams).

Need: Full category path from Sephora
Next: Extract from product pages

Glossary

Key terms and definitions

Organic Review
A review written without any incentive (no free product, no compensation). These are the most trustworthy.
Incentivized Review
A review where the reviewer received free product, discount, or other compensation. Tends to be more positive than organic reviews.
Power User
A reviewer who has written 21+ organic reviews. These people are experienced and harder to impress.
Helpfulness Ratio
The percentage of votes on a review that are "helpful" rather than "not helpful."
Substantive Review
A review with more than 50 words of actual content. These provide more insight than "Great product!"
Rating Inflation
When paid/incentivized reviewers give higher ratings than organic reviewers for the same product.
Polarization
When a product has a high standard deviation in ratings - some people love it, others hate it.
Confidence
How much we trust a score based on the amount of data. More reviews = higher confidence.
Parquet
A file format for storing tabular data that's compressed and fast to query. Think of it as a super-efficient spreadsheet format.
DuckDB
A database system that can query parquet files directly without loading them into memory.
JSONL
"JSON Lines" - a text file where each line is a separate JSON object. Easy to process line by line.
Love Score
Our custom metric that identifies products people genuinely love, designed to be manipulation-resistant.