---
title: "2 Million Used Cars and What They Tell Us"
draft: true
date: 2026-02-12
tags: ["Scraping", "Data Engineering", "Grafana", "PostgreSQL"]
summary: "Scraping ~2M used car listings, throwing them into a database, and seeing what shakes out."
code: ""
demo: ""
---
## The Question

Everyone's heard the legend: a VW Passat that just keeps going at 400,000 km. But is it actually the only car that pulls that off? What other models quietly rack up absurd mileage, and what do they cost? In short: what makes a car *last*, and can you get one without overpaying?

Time to find out with data instead of hearsay.
## The Approach

A major used car platform turned out to be surprisingly cooperative when it came to structured data. Its recommendation engine helpfully links each listing to similar ones, so starting from a search, you can just keep crawling through related results.

The haul: roughly **2 million listings** from early 2026, downloaded as JSON and
loaded into a PostgreSQL database. At that point, the recommendation graph
stopped surfacing new entries. A second pass would likely uncover more, since
new listings appear daily, but 2M felt like a solid starting point.
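The crawl described above is essentially a breadth-first traversal of the "related listings" graph that stops once no unseen listing turns up. Here is a minimal sketch of that idea in Python. It is not the scraper actually used: `fetch_listing` and the `related_ids` key are hypothetical stand-ins, since the platform's real response format isn't shown here.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Optional


def crawl_related(
    seed_ids: Iterable[str],
    fetch_listing: Callable[[str], dict],
    max_listings: Optional[int] = None,
) -> Dict[str, dict]:
    """Breadth-first crawl over a graph of related listings.

    `fetch_listing` is a hypothetical callable returning one listing as a
    dict with a `related_ids` list (an assumed shape, not the real API).
    """
    seen = set()
    queue = deque()
    for seed in seed_ids:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    listings: Dict[str, dict] = {}
    while queue:
        if max_listings is not None and len(listings) >= max_listings:
            break  # optional safety cap
        listing_id = queue.popleft()
        listing = fetch_listing(listing_id)
        listings[listing_id] = listing
        # Enqueue every related listing we have not seen yet.
        for related_id in listing.get("related_ids", []):
            if related_id not in seen:
                seen.add(related_id)
                queue.append(related_id)
    # The loop ends on its own once the graph stops surfacing new entries.
    return listings


# Toy in-memory graph standing in for the platform.
toy_graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"], "e": ["a"]}


def fake_fetch(listing_id: str) -> dict:
    return {"id": listing_id, "related_ids": toy_graph[listing_id]}


print(sorted(crawl_related(["a"], fake_fetch)))
```

In the toy graph, listing `e` links to `a` but nothing links to `e`, so the crawl never finds it. That is the same effect that makes a second pass from fresh search results worthwhile.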
## The Data

**2,046,879 listings**, most of them containing the following fields (among others):

`make` · `model` · `model_variant` · `fuel` · `price` · `mileage` · `power_kw` ·
`transmission_type` · `number_of_cylinders` · `body_type` · `body_color` ·
`first_registration_date` · `number_of_previous_owners` · `is_roadworthy` ·
`is_currently_damaged` · `usage_state` · `type` · `zip_code` · `country_code` ·
`city`

That's enough to get interesting.

About 119,500 listings were missing either price or mileage. It's not entirely clear
why, but with nearly 2 million records left, it's barely a dent.

---

## Findings
### Price vs. Mileage

The obvious place to start: how does price relate to mileage?

<!-- dashboard on price vs mileage; scatter plot; clipped at 1,000,000 € and 1,000,000 km -->
{{< grafana url="https://gr.eliaskohout.de/public-dashboards/0852019305114cd189aedb67dea27721" height="450" >}}

Two hot spots jump out immediately: one in the low-mileage/high-price corner and
one in the high-mileage/low-price corner. That's exactly what you'd expect: expensive
special cars that barely leave the garage, and daily drivers with six-figure
odometers priced to move.

The vast majority of listings, though, cluster in relatively low-price and
low-mileage territory compared to the extremes.

**A note on clipping.** The plot caps at €1,000,000 and 1,000,000 km because the
tails get absurd. The highest listed price was €999,999,999, obviously not
real. Six listings exceeded €10 million, none of them serious. A handful
between €1M and €10M could be genuine exotics. On the mileage side, the maximum
was 100,000,000 km. The highest plausible reading I found was an Iveco truck
with 897,000 km on the odometer. The roughly 570 listings beyond that appeared to be
typos or placeholder values for "mileage unknown."

Across the cleaned dataset, averages land at roughly **€28,400** for price and
**75,600 km** for mileage. The standard deviations are enormous (€731k and 109k
km respectively), which tells you just how wide the spread really is.

Taken together, those averages give an overall ratio of about **€375 per 1,000 km**. In other
words: for each 1,000 km on the odometer, the average listing costs about €375.
This isn't a depreciation rate in the strict sense, since we're looking at a
cross-sectional snapshot of listed prices, not tracking individual cars losing
value over time. But it turns out to be a useful back-of-the-napkin metric for
comparing brands.
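To make the cleaning step and the ratio concrete, here is a small Python sketch. It assumes the listings are already loaded as dicts with `price` and `mileage` keys (the field names from the data section). Reusing the plot's clipping limits as cleaning bounds is my assumption for illustration, not necessarily the exact filter behind the numbers above.

```python
from statistics import mean

# Assumed cleaning bounds, borrowed from the plot's clipping limits.
MAX_PRICE_EUR = 1_000_000
MAX_MILEAGE_KM = 1_000_000


def clean(listings):
    """Drop rows with missing price/mileage or values outside the bounds."""
    return [
        row
        for row in listings
        if row.get("price") is not None
        and row.get("mileage") is not None
        and 0 < row["price"] <= MAX_PRICE_EUR
        and 0 <= row["mileage"] <= MAX_MILEAGE_KM
    ]


def price_per_1000_km(listings):
    """Ratio of averages: mean price divided by mean mileage, per 1,000 km."""
    avg_price = mean(row["price"] for row in listings)
    avg_mileage = mean(row["mileage"] for row in listings)
    return avg_price / avg_mileage * 1000


# Made-up sample rows, not taken from the dataset.
sample = [
    {"price": 30_000, "mileage": 50_000},
    {"price": 10_000, "mileage": 150_000},
    {"price": 999_999_999, "mileage": 10},      # placeholder price
    {"price": 5_000, "mileage": None},          # missing mileage
    {"price": 8_000, "mileage": 100_000_000},   # placeholder mileage
]

cleaned = clean(sample)
print(len(cleaned), price_per_1000_km(cleaned))

# Sanity check with the rounded averages quoted in the text.
print(28_400 / 75_600 * 1000)
```

Note that this is a ratio of averages. Averaging each listing's own price-per-km instead would be dominated by near-zero-mileage cars, so the two numbers can differ a lot.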
### By Brand

The dataset contains **346 distinct makes**. Of those, 72 (~21%) have more than
500 listings, enough for halfway meaningful statistics. The rest are too sparse
to generalize from, so brand-level analysis focuses on these 72.

{{/* dashboard on price per mileage; table with make and euro/km; ordered by price per mileage */}}
{{< grafana url="https://gr.eliaskohout.de/public-dashboards/1777bb018e9b47639b93ef31d97f9c89" height="450" >}}

The ranking roughly mirrors the common perception of luxury brands: makes with
a reputation for being expensive also tend to show up with high price-per-km
values. No surprises there.

But this metric has a blind spot: **age**. Brands with almost no older cars on
the market look disproportionately expensive per km, simply because their
listings haven't had time to depreciate. BYD, for example, ranks just below
Ferrari and Rolls-Royce, not because a BYD is a luxury vehicle, but because the
average BYD listing is only **0.7 years** old, compared to the overall average
of **6.7 years**. Leapmotor is even more extreme at **0.5 years**. Give these
brands a few years to accumulate used inventory at higher mileages, and their
ratios will settle down considerably.
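One way to keep the age blind spot visible is to report average listing age right next to the price-per-km figure for each make. The sketch below does that in plain Python. It assumes `first_registration_date` is available as a `datetime.date`, and the snapshot date is a placeholder I picked, since the scrape is only dated "early 2026".

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Assumed reference date for computing age; not stated exactly in the post.
SNAPSHOT = date(2026, 2, 1)


def stats_by_make(listings, min_listings=500):
    """Per-make price per 1,000 km and average age, for well-covered makes."""
    groups = defaultdict(list)
    for row in listings:
        groups[row["make"]].append(row)

    stats = {}
    for make, rows in groups.items():
        if len(rows) <= min_listings:
            continue  # keep only makes with MORE than `min_listings` rows
        avg_price = mean(r["price"] for r in rows)
        avg_mileage = mean(r["mileage"] for r in rows)
        avg_age = mean(
            (SNAPSHOT - r["first_registration_date"]).days / 365.25
            for r in rows
        )
        stats[make] = {
            "listings": len(rows),
            # None if every listing of the make has zero mileage.
            "eur_per_1000_km": (
                avg_price / avg_mileage * 1000 if avg_mileage else None
            ),
            "avg_age_years": avg_age,
        }
    return stats


# Made-up rows; the threshold is lowered to 1 so the toy data passes it.
rows = [
    {"make": "A", "price": 40_000, "mileage": 10_000,
     "first_registration_date": date(2025, 2, 1)},
    {"make": "A", "price": 20_000, "mileage": 20_000,
     "first_registration_date": date(2025, 2, 1)},
    {"make": "B", "price": 5_000, "mileage": 200_000,
     "first_registration_date": date(2010, 6, 1)},
]
stats = stats_by_make(rows, min_listings=1)
print(stats)
```

Sorting the result by `eur_per_1000_km` while keeping `avg_age_years` in view makes cases like very young brands easy to spot.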
### Depreciation Curves

You'd expect the price-per-km ratio to fall as mileage increases: older,
high-mileage cars are cheaper, and the new-car premium fades fast. You'd also
expect the decline to be roughly exponential: a car loses a percentage of its
current value per additional kilometer, not a fixed euro amount.

Both hold up in the data:

{{/* dashboard on price vs mileage and price/km vs mileage */}}
{{< grafana url="https://gr.eliaskohout.de/public-dashboards/e999ce3c237b4cae95b3331331a26261" height="400" >}}

The curves above show average prices over mileage for BMW, Volkswagen, and Fiat.
The brand-level differences are immediately visible (different starting points,
different values throughout the decay), but the overall shape is the same: steep
early depreciation that gradually flattens out.
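The "percentage of current value per kilometer" idea corresponds to a model of the form price = p0 · exp(−k · mileage), which becomes a straight line after taking the logarithm of the price. The following Python sketch fits that line with ordinary least squares. The numbers are synthetic and only demonstrate the mechanics. They are not fitted values from the dataset.

```python
import math


def fit_exponential_decay(mileages_km, prices_eur):
    """Fit price = p0 * exp(-k * mileage) via least squares on log(price).

    Returns (p0, k). Prices must be positive and mileages must not all be
    equal, otherwise the line is undefined.
    """
    n = len(mileages_km)
    logs = [math.log(p) for p in prices_eur]
    mean_x = sum(mileages_km) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in mileages_km)
    if sxx == 0:
        raise ValueError("need at least two distinct mileage values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mileages_km, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic curve: starts at 40,000 EUR, decay constant 1e-5 per km.
mileages = [0, 50_000, 100_000, 150_000, 200_000]
prices = [40_000 * math.exp(-1e-5 * m) for m in mileages]

p0, k = fit_exponential_decay(mileages, prices)
half_value_km = math.log(2) / k  # mileage at which the price has halved
print(p0, k, half_value_km)
```

Applied per make to binned average prices, `p0` would capture the different starting points and `k` the different decay speeds. Whether a single exponential fits well across the whole mileage range is something the data would have to confirm.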
* Only toward the high-mileage end do the boundaries blur; 200k km can be taken as a rough threshold beyond which the data gets somewhat more chaotic.
* That could also be because there are simply fewer data points out there, which makes the averages less reliable.
* So let's take a closer look at the distributions at this end of the scale.
### The Survivors: Cars Beyond 250k km
{{/*
==============================================
PLAN: Analysis chapters to write
==============================================

1. Price vs. Mileage Relationship
- Scatter/heatmap of price vs. mileage across all listings
- Depreciation curves: how fast do different makes lose value?
- The "sweet spot": best mileage-to-price ratio by model

2. The Survivors: Cars Beyond 300k km
- Which makes/models appear most often at extreme mileage?
- Fuel type breakdown (diesel vs. petrol at high mileage)
- Average price of high-mileage cars: are they dirt cheap or still holding value?

+ estimate failure rate

4. Fuel & Drivetrain Trends
- Fuel type distribution (diesel/petrol/electric/hybrid/LPG)
- Price and mileage by fuel type
- Are EVs showing up in used markets yet? At what price?

5. Geography
- Listings by country and region (zip code clusters)
- Regional price differences for the same model
- Where are the cheap cars?

==============================================
*/}}