Created: July 19, 2025
Hackathon Challenge: Decode the Pulse of E-Commerce
Storyline:
In the dynamic world of e-commerce, customer expectations are evolving rapidly. Businesses
struggle to keep up with what customers want, how they feel, and when they might leave. Your
mission is to build an Intelligent Customer Experience Platform—a unified system that listens,
learns, and responds to customer behavior in real time.
You are the data detectives, the experience architects, and the insight engineers. Your goal is to
empower business teams with a smart dashboard that transforms raw data into actionable
intelligence—without relying on large language models or generative AI.
Challenge Objective:
Design and build a backend dashboard that integrates three core capabilities:
```python
import os
import pandas as pd

file_path = '/mnt/data/final_ecommerce_customer_experience_updated.xlsx'
os.path.exists(file_path)
# True

xls = pd.ExcelFile(file_path)
xls.sheet_names
# ['Sheet1']

df = pd.read_excel(xls, 'Sheet1')
df.head()
#   Unnamed: 0  age  gender      geo location  purchase_count  \
# 0   CUST0152   67    Male     Lake Brittany              22
# 1   CUST0742   25    Male  Port Cynthialand              33
# 2   CUST0851   70    Male        Lyonshaven              17
# 3   CUST0083   42    Male         Laurafort              35
# 4   CUST0759   70  Female      North Judith              39
#
#     last_purchase_date churn_label product_id  product_title product_category  \
# 0  2025-03-23 14:02:00          No   PROD0434  Exactly Shirt           Beauty
# 1  2024-12-12 19:19:00          No   PROD0284   Space Gadget             Home
# 2  2025-06-15 09:28:00          No   PROD0164     Join Cream             Home
# 3  2024-09-17 12:58:00          No   PROD0445  Military Ball           Beauty
# 4  2024-10-29 07:17:00          No   PROD0262   Quickly Ball          Fashion
#
#    base_price  quantity purchase_date  total_spend  \
# 0      391.67         4    2024-10-07      1566.68
# 1      367.87         1    2024-09-11       367.87
# 2      275.92         3    2024-08-12       827.76
# 3      279.26         4    2023-12-19      1117.04
# 4      414.01         4    2023-11-14      1656.04
#
#                                        user_feedback delivery_rating  \
# 0  The beauty product met expectations and was we...        Positive
# 1  The item arrived damaged. Received a wrong ite...   Very Positive
# 2  The home item was functional and matched the d...   Very Positive
# 3  Would definitely recommend this product. The r...         Neutral
# 4  Excellent service and fast delivery. The retur...        Positive
#
#    product_quality_rating  support_rating  discount_percentage  pricing_score
# 0                     NaN             5.0                   10       1410.012
# 1                     2.0             5.0                    0        367.870
# 2                     2.0             1.0                   25        620.820
# 3                     3.0             1.0                   15        949.484
# 4                     2.0             4.0                    0       1656.040

# See shape and columns
df.shape
# (3000, 20)

df.columns
# Index(['Unnamed: 0', 'age', 'gender', 'geo location', 'purchase_count',
#        'last_purchase_date', 'churn_label', 'product_id', 'product_title',
#        'product_category', 'base_price', 'quantity', 'purchase_date',
#        'total_spend', 'user_feedback', 'delivery_rating',
#        'product_quality_rating', 'support_rating', 'discount_percentage',
#        'pricing_score'],
#       dtype='object')

df.dtypes
# Unnamed: 0                        object
# age                                int64
# gender                            object
# geo location                      object
# purchase_count                     int64
# last_purchase_date        datetime64[ns]
# churn_label                       object
# product_id                        object
# product_title                     object
# product_category                  object
# base_price                       float64
# quantity                           int64
# purchase_date             datetime64[ns]
# total_spend                      float64
# user_feedback                     object
# delivery_rating                   object
# product_quality_rating           float64
# support_rating                   float64
# discount_percentage                int64
# pricing_score                    float64
# dtype: object
```
Below is a step-by-step blueprint you can follow to turn the Excel data you just shared into an Intelligent Customer Experience Platform with the three required capabilities—churn prediction, feedback analysis, and a lightweight QA assistant—while staying strictly within classic ML + Transformer encoders (no generative LLMs).
What you have | Why it matters | Immediate actions
---|---|---
Excel sheet (`final_ecommerce_customer_experience_updated.xlsx`)<br>3000 rows × 20 columns (customers, transactions, feedback, ratings, discounts, etc.) | One flat file that mixes customer fields (e.g. age, geo), order fields (e.g. product_id, quantity), and text feedback. | 1. Load into a staging table (e.g. Postgres or DuckDB).<br>2. Create three logical tables/views:<br>• customers (1 row per customer_id)<br>• orders (1 row per order_id)<br>• feedback (1 row per feedback_id)<br>3. Convert all date strings to datetime; drop the unnamed index column.
(You can automate the split with a tiny ETL script; this keeps each model’s features tidy.)
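A sketch of that ETL split, assuming DuckDB as the staging store; treating `Unnamed: 0` as the customer ID follows the `CUST…` values in the preview, and the synthetic `order_id`/`feedback_id` keys are assumptions (the sheet has no explicit order or feedback keys):

```python
import duckdb
import pandas as pd

df = pd.read_excel("final_ecommerce_customer_experience_updated.xlsx", sheet_name="Sheet1")
df = df.rename(columns={"Unnamed: 0": "customer_id"})  # holds CUST… ids in the preview
for col in ("last_purchase_date", "purchase_date"):
    df[col] = pd.to_datetime(df[col])

# customers: 1 row per customer_id (demographics + churn label)
customers = df[["customer_id", "age", "gender", "geo location", "purchase_count",
                "last_purchase_date", "churn_label"]].drop_duplicates("customer_id")

# orders: 1 row per order, with a synthetic order_id
orders = df[["customer_id", "product_id", "product_title", "product_category",
             "base_price", "quantity", "purchase_date", "total_spend",
             "discount_percentage", "pricing_score"]].rename_axis("order_id").reset_index()

# feedback: 1 row per feedback entry, keyed the same way
feedback = df[["customer_id", "product_id", "user_feedback", "delivery_rating",
               "product_quality_rating", "support_rating"]].rename_axis("feedback_id").reset_index()

con = duckdb.connect("cx_platform.duckdb")  # hypothetical staging database
for name, frame in {"customers": customers, "orders": orders, "feedback": feedback}.items():
    con.register("staged", frame)
    con.execute(f"CREATE OR REPLACE TABLE {name} AS SELECT * FROM staged")
    con.unregister("staged")
```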
For churn prediction, derive per-customer features such as the following (a pandas sketch follows the table):

Group | Features to derive (examples)
---|---
RFM-style | recency = days_since(last_purchase_date)<br>frequency = purchase_count / customer_lifetime_months<br>monetary = total_spend / customer_lifetime_months
Engagement | avg quantity per order, number_of_products_bought, share_of_discounted_orders
Price Sensitivity | mean(discount_percentage), mean(pricing_score)
Satisfaction | mean(delivery_rating), mean(product_quality_rating), mean(support_rating)
Demographics (encoded) | one-hot gender, age bucket, geo cluster
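The sketch below assumes the flat `df` from the EDA step and approximates `customer_lifetime_months` from each customer's first-to-last purchase span; the brief doesn't define lifetime, so that's an assumption:

```python
import pandas as pd

now = df["purchase_date"].max()
g = df.groupby("customer_id")

# Approximate lifetime in months from first to last purchase; floor at 1 so
# single-purchase customers don't divide by zero (assumption, see above).
lifetime_months = ((g["purchase_date"].max() - g["purchase_date"].min())
                   .dt.days / 30.44).clip(lower=1)

features = pd.DataFrame({
    # RFM-style
    "recency_days": (now - g["last_purchase_date"].max()).dt.days,
    "frequency": g["purchase_count"].max() / lifetime_months,
    "monetary": g["total_spend"].sum() / lifetime_months,
    # Engagement
    "avg_quantity": g["quantity"].mean(),
    "n_products_bought": g["product_id"].nunique(),
    "share_discounted": (df["discount_percentage"] > 0).groupby(df["customer_id"]).mean(),
    # Price sensitivity
    "mean_discount": g["discount_percentage"].mean(),
    "mean_pricing_score": g["pricing_score"].mean(),
    # Satisfaction (numeric ratings only; delivery_rating is text in this sheet)
    "mean_quality_rating": g["product_quality_rating"].mean(),
    "mean_support_rating": g["support_rating"].mean(),
})
# Demographics: one-hot gender (age buckets / geo clusters added the same way)
features = features.join(pd.get_dummies(customers.set_index("customer_id")[["gender"]]))
```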
Model | Why it fits | Notes |
---|---|---|
Gradient Boosting (XGBoost / LightGBM) | Handles mixed feature types, small-to-medium data, interpretable via SHAP | Baseline workhorse |
TabTransformer (Hugging Face `tabtransformer-base`) | Pure-encoder transformer for tabular data—still non-generative. Learns contextual embeddings across categorical columns. | Good hackathon differentiator; fine-tune for 5–10 epochs. |
Logistic Regression | Simple benchmark | Sets a lower bound for lift |
Process: stratified train/validation split (e.g. 70/30), SMOTE or class-weighting if `churn_label` is imbalanced, optimize AUC-ROC & F-β (β > 1 to favor recall).
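A baseline sketch under those choices, using class weighting rather than SMOTE; hyperparameters, the 0.5 threshold, and the assumption that churners are labeled "Yes" are all illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, fbeta_score
from xgboost import XGBClassifier

# Assumes `features` and `customers` from the earlier sketches; "Yes" = churned (assumption).
y = (customers.set_index("customer_id")["churn_label"] == "Yes").astype(int).loc[features.index]

X_tr, X_va, y_tr, y_va = train_test_split(
    features, y, test_size=0.3, stratify=y, random_state=42)

# Class weighting in place of SMOTE: up-weight the positive (churn) class.
model = XGBClassifier(
    n_estimators=400, max_depth=5, learning_rate=0.05,
    scale_pos_weight=(y_tr == 0).sum() / max((y_tr == 1).sum(), 1),
    eval_metric="auc",
)
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_va)[:, 1]
print("AUC-ROC:", roc_auc_score(y_va, proba))
print("F2:     ", fbeta_score(y_va, proba > 0.5, beta=2))  # beta > 1 favors recall
```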
For uplift-based pricing, use causal uplift trees or CausalML's `UpliftRandomForestClassifier`. Output recommended discount tiers, then surface them in the dashboard.
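A hedged sketch of the CausalML route. The `treatment` tiers and `converted` outcome are hypothetical columns; the shared sheet has no explicit discount-experiment log, so treatment labels would need to be derived, e.g. by bucketing `discount_percentage`:

```python
import numpy as np
from causalml.inference.tree import UpliftRandomForestClassifier

# Hypothetical inputs: X (customer feature matrix), treatment (tier name per row,
# e.g. 'control', 'disc_10', 'disc_25'), converted (binary repeat-purchase outcome).
uplift_model = UpliftRandomForestClassifier(control_name="control", n_estimators=100)
uplift_model.fit(X.values, treatment=treatment, y=converted)

# predict() returns one uplift estimate per non-control tier; recommend the tier
# with the highest predicted lift, or no discount when every lift is negative.
lift = uplift_model.predict(X.values)
best_tier = np.where(lift.max(axis=1) > 0, lift.argmax(axis=1), -1)
```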
Feedback sentiment analysis:

Approach | Tools
---|---
Transformer encoder fine-tune | Start with `distilbert-base-uncased-finetuned-sst-2-english` (binary) or `roberta-base-go_emotions` (multi-class).<br>Fine-tune on your own labeled subset (if any).
Rule/lexicon fallback | VADER or TextBlob for low-resource languages or cold start
Store a `sentiment_score` ∈ [-1, 1] and a `sentiment_label` per feedback row.
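A minimal sketch of that scoring pass with the binary model from the table; mapping the positive-class probability to a signed score in [-1, 1] is a convention chosen here, not something the model defines:

```python
import pandas as pd
from transformers import pipeline

clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

def score(text: str) -> pd.Series:
    # Pipeline output: {'label': 'POSITIVE' | 'NEGATIVE', 'score': p in [0, 1]}
    out = clf(text, truncation=True)[0]
    signed = out["score"] if out["label"] == "POSITIVE" else -out["score"]
    return pd.Series({"sentiment_score": signed, "sentiment_label": out["label"]})

feedback[["sentiment_score", "sentiment_label"]] = feedback["user_feedback"].apply(score)
```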
Topic mining (BERTopic) yields, per topic: `topic_id`, `keywords`, and representative feedback excerpts. (No generative summarization: instead show "Top n key phrases" & "Most representative sentences" for each topic with a TextRank extractive method.)
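A minimal BERTopic sketch under those constraints (encoder embeddings only, extractive outputs); `min_topic_size` and the topic id `0` are illustrative:

```python
from bertopic import BERTopic

docs = feedback["user_feedback"].dropna().tolist()
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", min_topic_size=20)
topics, probs = topic_model.fit_transform(docs)

overview = topic_model.get_topic_info()            # topic_id, size, top keywords
keywords = topic_model.get_topic(0)                # [(keyword, c-TF-IDF weight), ...]
examples = topic_model.get_representative_docs(0)  # representative feedback excerpts
```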
Goal 📌: Let ops teams ask “Which customers in California bought ‘Exactly Shirt’ last month and contacted support twice?” without writing SQL.
Layer | Tooling | Rationale |
---|---|---|
Vector index | FAISS or pgvector | Search by semantic similarity |
Embedding model | all-MiniLM-L6-v2 (sentence-transformers) | Tiny, fast, encoder-only |
Keyword/BM25 index | ElasticSearch / OpenSearch | Exact filters & fast metadata facets |
Router | Simple rules | If the query contains "count / how many / average…", send to analytic SQL; else combine BM25 + vector search |
Return a pre-formatted answer card:
• Relevant records (top k rows)
• Aggregates (e.g. `total_spend`) when patterns like "sum / total / average" are detected
(No text generation—just assembled facts.)
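A sketch of the rule router plus the FAISS leg, assuming the `docs` list and encoder from the earlier steps; the regex rules are illustrative, and the BM25 and SQL legs are stubbed (ElasticSearch and the warehouse handle those in the stack above):

```python
import re
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Build the vector index once (normalized embeddings, so inner product = cosine).
emb = encoder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

ANALYTIC = re.compile(r"\b(count|how many|average|sum|total)\b", re.I)

def answer(query: str, k: int = 5) -> dict:
    if ANALYTIC.search(query):
        # Analytic intent: hand off to the SQL engine (stub).
        return {"route": "sql", "query": query}
    q = encoder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, k)
    # In the full stack, merge these hits with BM25 results from ElasticSearch.
    return {"route": "hybrid",
            "rows": [docs[i] for i in ids[0]],
            "scores": scores[0].tolist()}
```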
```mermaid
graph TD
  subgraph Batch["Batch & Streaming"]
    Excel -->|ETL| Staging[(DB)]
    Orders --> FeatEng
    Customers --> FeatEng
    FeatEng -->|Features| ChurnModel
    Feedback -->|BERT| Sentiment
    Feedback -->|BERTopic| Topics
  end
  FeatEng --> Scores[(Churn_Scores)]
  Sentiment --> NLP_DB[(Feedback_Insights)]
  Topics --> NLP_DB
  subgraph API_Layer["API Layer (FastAPI)"]
    Scores & NLP_DB & VectorIndex --> API
  end
  API --> Dashboard[(React / Streamlit)]
  Kafka[Kafka / EventHub] --> FeatEng  %% for real-time adds
```
Batch jobs refresh predictions daily; streaming micro-batch (e.g. every 5 min) can update a single customer right after an event.
Section | API Endpoint | Data source |
---|---|---|
Churn Heatmap | /churn/top_risk?limit=100 | Scores table |
Price Recommendations | /pricing/discount_suggestions?customer_id=… | Uplift model outputs |
Sentiment Trends | /feedback/trends?period=30d | NLP_DB |
Topic Drill-down | /feedback/topic/{topic_id} | NLP_DB, raw feedback |
QA Search | /qa/query?q=… | Vector + BM25 + SQL engine |
Return JSON; the front-end can render with Plotly/Chart.js and trigger CSV exports.
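A FastAPI skeleton matching the table; handler bodies are stubs, and the response shapes are placeholder assumptions:

```python
from fastapi import FastAPI

app = FastAPI(title="Intelligent CX Platform")

@app.get("/churn/top_risk")
def top_risk(limit: int = 100):
    # Stub: read the top-N scores from the Churn_Scores table.
    return {"customers": []}

@app.get("/feedback/trends")
def feedback_trends(period: str = "30d"):
    # Stub: aggregate sentiment_score over the window from NLP_DB.
    return {"period": period, "trend": []}

@app.get("/qa/query")
def qa_query(q: str):
    # Delegates to the router from the QA sketch above.
    return answer(q)
```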
Capability | Offline metric | Online metric |
---|---|---|
Churn | AUC-ROC, PR-AUC, recall@k | Net revenue saved per campaign, uplift vs. control |
Sentiment | F1, MCC | Correlation with CSAT / support tickets |
Topics | Topic coherence | Analyst feedback acceptance rate |
QA | Precision@5 (manual audit) | Query success rate, latency |
Set up MLflow or Weights & Biases for experiment tracking, plus Prometheus/Grafana for live health (latency, missing features, drift).
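If you go the MLflow route, a minimal tracking sketch around the churn baseline (experiment name and logged params are illustrative):

```python
import mlflow
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("churn-baseline")
with mlflow.start_run():
    mlflow.log_params({"model": "xgboost", "n_estimators": 400, "max_depth": 5})
    mlflow.log_metric("auc_roc", roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))
```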
Day | Milestone |
---|---|
1 | Ingest Excel ➜ clean schema ➜ quick EDA |
2 | Build churn feature pipeline ➜ baseline XGBoost model |
3 | Fine-tune DistilBERT for sentiment ➜ run BERTopic |
4 | Stand up vector+BM25 QA service ➜ API skeleton |
5 | Wire APIs into minimal dashboard ➜ iterate on UX |
6 | Add uplift pricing module ➜ evaluate & polish charts |
7 | Prepare demo dataset ➜ write readme & deployment scripts |
You now have a clear blueprint to start building. Let me know if you want code snippets, sample notebooks, or deeper dives into any step.