CarPrice Estimate -DataMining

by

in

One of the most challenging parts of the project was not building the models—it was building the dataset.

Like many real-world machine learning projects, the majority of the work happened before any model training began. Collecting and cleaning reliable data proved to be far more difficult than expected.
To build the dataset, we designed a web scraping pipeline that collected vehicle listings from Iranian marketplaces such as Divar and Karnameh. The pipeline went through several iterations as we experimented with different extraction strategies and adapted to changes in website structures, timeouts, and inconsistent responses.

The data collection process consisted of two main stages:
1. URL Collection
First, we gathered listing URLs using a combination of browser-based extraction tools and automated collection methods.
2. Data Extraction
This stage required several different approaches before reaching a stable solution.

A Selenium-based approach was our initial choice, but we struggled with the structure and behavior of some websites.
A Requests + BeautifulSoup pipeline was significantly faster and worked well in many cases, although it was not always reliable.

Eventually, a Selenium-based solution with dynamic element detection provided the most consistent results and became the final implementation.
Each vehicle listing was transformed into a single row in a large DataFrame. While every record contained a common set of fields, many attributes varied across listings depending on the available information. As a result, the final dataset resembled a sparse matrix, where different vehicles contained different combinations of features.

To maintain stability during data collection, several anti-detection measures were incorporated, including custom user-agent headers, randomized delays between requests, browser automation masking, and rendering optimizations to ensure consistent page loading.

Once the raw data was collected, the next challenge was preparing it for machine learning. Categorical variables such as color, body type, transmission type, and vehicle condition had to be encoded into numerical representations. Missing values, inconsistent formats, and duplicated information also required extensive preprocessing before the data could be used for regression and classification models.

In the end, the dataset became one of the most valuable outcomes of the project. Building the model was important, but building a reliable data pipeline was what made the model possible in the first place.

(Technical stack: Selenium, BeautifulSoup, Requests, Pandas, and WebDriver Manager.)


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *