Amazon-ML-Challenge-2025-3rd

This Repository contains our submission that got us 3rd place in Amazon ML challenge 2025

public

Forked

View on GitHub

Amazon ML Challenge 2025 — 3rd (Second Runner-Up)

Team Name: 00_Team_Rocket
Team Members: Parth Rastogi, Angadjeet Singh, Abhishek Jha, Harsh Kumar

Achievement

Secured Second Runner-Up Position in Amazon ML Challenge 2025

1. Executive Summary

Our team developed a hybrid deep learning approach combining DeBERTa-large-v3 with engineered features through a cross-attention fusion mechanism.
This architecture leverages pretrained language understanding enriched with domain-specific product attributes, resulting in robust and generalizable price prediction performance.

2. Methodology Overview

2.1 Problem Analysis & Key Observations

Through extensive exploratory data analysis (EDA), we observed that product pricing is influenced by both semantic content and explicit attributes.
Key findings include:

Price correlations with quality indicators (organic, gourmet, gluten-free)
Importance of text structure (length, punctuation, capitalization ratio)
Relevance of packaging characteristics (bulk size, unit type)

2.2 Solution Strategy

Approach Type: Hybrid (Pretrained LM + Feature Engineering + Cross-Attention)
Core Innovation:
A two-stream architecture that fuses DeBERTa’s [CLS] embedding with engineered feature embeddings via cross-attention, capturing complex relationships between product descriptions and structured metadata.

Dataset Insight:
As the log(price) distribution approximated a Gaussian, training was performed on log(price + 1).
Final predictions were obtained via exp(pred) - 1.

We also evaluated CLIP-based image embeddings, but analysis via UMAP clustering revealed poor consistency of embedding clusters with price values. Hence, we relied solely on text-based modeling.

Loss Function:
We used Smooth L1 loss instead of MSE, as it yielded lower SMAPE and more stable training.

3. Model Architecture

3.1 Architecture Overview

Stage	Model/Component	SMAPE (Validation)	SMAPE (Unstop)
First Approach	Bert Base	49.1	48.1
Pretraining	DeBERTa (Regression Task)	44.2	43.2
Final Hybrid Training	DeBERTa + Cross-Attention Fusion	39.86	40.329

3.2 Model Components

🧩 Text Processing Pipeline

Tokenization: DeBERTa tokenizer
Preprocessing: text normalization, truncation, special char handling
Model: DeBERTa-large-v3 (1024 hidden dim)
Pretraining task: price regression

⚙️ Feature Engineering Pipeline

Binary: is_organic, is_gourmet, is_gluten_free, is_bulk, special_chars
Categorical: unit_type
Numeric: value, num_words, num_sentences, uppercase_ratio

Fusion Strategy:
Concatenated feature embeddings + DeBERTa [CLS] → Cross-Attention → Linear Regression Head

❌ Other (Failed) Approaches

XGBoost / CatBoost on sparse feature datasets — SMAPE ≈ 56
Reinforcement Learning Finetuning: used SMAPE as a reward on frozen DeBERTa backbone; limited success due to differentiability constraints.

4. Model Performance

Model	Validation SMAPE	Notes
Pretrained DeBERTa Baseline	43.4	Text-only
Final Hybrid Model	39.83	5% hold-out
Challenge Submission	40.329	on Unstop

5. Conclusion

Our hybrid model successfully demonstrates that structured features can enhance the predictive power of pretrained language models for complex pricing regression tasks.
The cross-attention fusion effectively learns the interplay between semantic understanding and explicit product attributes, achieving scalable and interpretable performance gains for real-world e-commerce applications.

📁 Repository Structure

Amazon-ML-Challenge-2025-3rd/
│
├── Data/
│   ├── preprocessed_train.csv
│   ├── preprocessed_test.csv
│
├── Preprocess.py         # Data preprocessing
├── Pretraining.py        # DeBERTa pretraining on regression
├── Main_Training.py      # Final hybrid model training
├── Inference.py          # Inference and submission CSV generation
├── README.md

🔗 Final Model Checkpoint

You can access the final trained model (DeBERTa + Cross-Attention Fusion) used for the final submission here:
👉 Final_Model.pt (Google Drive)

Find me

v0.3.3[beta]