personal-website/hugo/eliaskohout.de/content/projects/gebrauchtwagen-datenbank.md
Elias Kohout 561a04fded
---
title: "2 Million Used Cars and What They Tell Us"
draft: true
date: 2026-02-12
tags: ["Scraping", "Data Engineering", "Grafana", "PostgreSQL"]
summary: "Scraping ~2M used car listings, throwing them into a database, and seeing what shakes out."
code: ""
demo: ""
---
## The Question
Everyone's heard the legend: a VW Passat that just keeps going at 400,000 km. But is it actually the only car that pulls that off? What other models quietly rack up absurd mileage — and what do they cost? In short: what makes a car *last*, and can you get one without overpaying?
Time to find out with data instead of hearsay.
## The Approach
A major used car platform turned out to be surprisingly cooperative when it came to structured data. Their recommendation engine helpfully links to similar listings — so starting from a search, you can just keep crawling through related results.
The haul: roughly **2 million listings** from early 2026, downloaded as JSON and
loaded into a PostgreSQL database. At that point, the recommendation graph
stopped surfacing new entries — a second pass would likely uncover more, since
new listings appear daily. But 2M felt like a solid starting point.
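The crawl described above is essentially a breadth-first walk over the recommendation graph that stops once no unseen listing turns up. Here is a minimal sketch of that idea, not the actual scraper: `crawl` and `fetch_related` are hypothetical names, and the toy graph stands in for the platform's real responses.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seeds: Iterable[str], fetch_related: Callable[[str], Iterable[str]]) -> set[str]:
    """Breadth-first walk over listing IDs until no new IDs turn up."""
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        listing_id = queue.popleft()
        # fetch_related is a placeholder for "ask the platform for similar listings".
        for related_id in fetch_related(listing_id):
            if related_id not in seen:
                seen.add(related_id)
                queue.append(related_id)
    return seen


# Toy stand-in for the platform: a hand-written recommendation graph.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
    "e": ["a"],  # nothing reachable from "a" recommends "e"
}
found = crawl(["a"], lambda listing_id: toy_graph.get(listing_id, []))
print(sorted(found))
```

Listing `e` is never found because nothing reachable from the seed links to it. That is the same reason a second pass with fresh seeds would likely surface more entries.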
## The Data
**2,046,879 listings**, most of them containing the following fields (among others):
`make` · `model` · `model_variant` · `fuel` · `price` · `mileage` · `power_kw` ·
`transmission_type` · `number_of_cylinders` · `body_type` · `body_color` ·
`first_registration_date` · `number_of_previous_owners` · `is_roadworthy` ·
`is_currently_damaged` · `usage_state` · `type` · `zip_code` · `country_code` ·
`city`
That's enough to get interesting.
About 119,500 listings were missing either price or mileage — not entirely clear
why, but with nearly 2 million records left, it's barely a dent.
---
## Findings
### Price vs. Mileage
The obvious place to start: how does price relate to mileage?
<!-- dashboard on price vs mileage; scatter plot; clipped for 1,000,000 € and 1,000,000 km -->
{{< grafana url="https://gr.eliaskohout.de/public-dashboards/0852019305114cd189aedb67dea27721" height="450" >}}
Two hot spots jump out immediately. One in the low-mileage/high-price corner,
one in the high-mileage/low-price corner — exactly what you'd expect. Expensive
special cars that barely leave the garage, and daily drivers with six-figure
odometers priced to move.
The vast majority of listings, though, cluster in relatively low-price and
low-mileage territory compared to the extremes.
**A note on clipping.** The plot caps at €1,000,000 and 1,000,000 km because the
tails get absurd. The highest listed price was €999,999,999 — obviously not
real. Six listings exceeded €10 million, none of them serious. A handful
between €1M and €10M could be genuine exotics. On the mileage side, the maximum
was 100,000,000 km. The highest plausible reading I found was an Iveco truck at
897,000 km on the odometer. The roughly 570 listings beyond that appeared to be
typos or placeholder values for "mileage unknown."
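As a rough sketch of that clipping step, a filter along these lines drops listings that miss price or mileage and anything beyond the plot's caps. The function name `is_plottable` and the sample rows are made up for illustration; only the two caps and the extreme values come from the text above.

```python
MAX_PRICE_EUR = 1_000_000   # plot cap from the text
MAX_MILEAGE_KM = 1_000_000  # plot cap from the text


def is_plottable(listing: dict) -> bool:
    """Keep listings that have both values and stay within the plot's caps."""
    price = listing.get("price")
    mileage = listing.get("mileage")
    if price is None or mileage is None:
        return False
    return price <= MAX_PRICE_EUR and mileage <= MAX_MILEAGE_KM


# Illustrative rows only; the makes and the Iveco price are invented.
sample = [
    {"make": "Iveco", "price": 15_000, "mileage": 897_000},
    {"make": "X", "price": 999_999_999, "mileage": 10},
    {"make": "Y", "price": 5_000, "mileage": 100_000_000},
    {"make": "Z", "price": None, "mileage": 50_000},
]
kept = [row for row in sample if is_plottable(row)]
print([row["make"] for row in kept])
```

Hard caps are a blunt instrument: they keep a plausible 897,000 km truck but would also keep a typo at 950,000 km, so they are a starting point rather than a full plausibility check.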
Across the cleaned dataset, averages land at roughly **€28,400** for price and
**75,600 km** for mileage. The standard deviations are enormous — €731k and 109k
km respectively — which tells you just how wide the spread really is.
That gives an overall average ratio of about **€375 per 1,000 km**. In other
words: for each 1,000 km on the odometer, the average listing costs about €375.
This isn't a depreciation rate in the strict sense — we're looking at a
cross-sectional snapshot of listed prices, not tracking individual cars losing
value over time. But it turns out to be a useful back-of-the-napkin metric for
comparing brands.
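The arithmetic behind that figure is just the average price divided by the average mileage. The sketch below redoes it with the rounded averages quoted above. It also shows, on two invented listings, why a ratio of averages is not the same thing as an average of per-listing ratios. The helper names are mine, not part of any pipeline.

```python
AVG_PRICE_EUR = 28_400   # rounded average from the text
AVG_MILEAGE_KM = 75_600  # rounded average from the text

# Ratio of the two averages, scaled to 1,000 km: roughly 375.7 with these inputs.
eur_per_1000_km = AVG_PRICE_EUR / AVG_MILEAGE_KM * 1000
print(f"{eur_per_1000_km:.1f} EUR per 1,000 km")


def ratio_of_means(listings):
    """Total price over total mileage, per 1,000 km."""
    total_price = sum(price for price, _ in listings)
    total_km = sum(km for _, km in listings)
    return total_price / total_km * 1000


def mean_of_ratios(listings):
    """Average of each listing's own price per 1,000 km (needs km > 0)."""
    return sum(price / km * 1000 for price, km in listings) / len(listings)


# Two invented listings: a nearly new car and a high-mileage one.
toy = [(50_000, 1_000), (5_000, 200_000)]
print(ratio_of_means(toy), mean_of_ratios(toy))
```

On the toy pair, the per-listing average is dominated by the nearly new car, while the ratio of averages is not. That is one reason to be explicit about which of the two a "price per km" number refers to.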
### By Brand
The dataset contains **346 distinct makes**. Of those, 72 (~21%) have more than
500 listings — enough for halfway meaningful statistics. The rest are too sparse
to generalize from, so brand-level analysis focuses on these 72.
{{/* dashboard on price per mileage; table with make and euro/km; ordered by price per mileage */}}
{{< grafana url="https://gr.eliaskohout.de/public-dashboards/1777bb018e9b47639b93ef31d97f9c89" height="450" >}}
The ranking roughly mirrors the common perception of luxury brands — makes with
a reputation for being expensive also tend to show up with high price-per-km
values. No surprises there.
But this metric has a blind spot: **age**. Brands with almost no older cars on
the market look disproportionately expensive per km, simply because their
listings haven't had time to depreciate. BYD, for example, ranks just below
Ferrari and Rolls-Royce — not because a BYD is a luxury vehicle, but because the
average BYD listing is only **0.7 years** old, compared to the overall average
of **6.7 years**. Leapmotor is even more extreme at **0.5 years**. Give these
brands a few years to accumulate used inventory at higher mileages, and their
ratios will settle down considerably.
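A ranking like this, with average age alongside it to expose the blind spot, could be computed roughly as follows. This is a sketch under assumptions: `rank_makes` is a hypothetical helper, `age_years` is a derived field I am assuming (for example from `first_registration_date`), and the dashboard may well aggregate differently.

```python
from collections import defaultdict


def rank_makes(listings, min_listings=500):
    """Price per 1,000 km and mean age per make, most expensive first."""
    groups = defaultdict(list)
    for row in listings:
        groups[row["make"]].append(row)

    ranking = []
    for make, rows in groups.items():
        if len(rows) <= min_listings:
            continue  # keep only makes with more than min_listings listings
        total_price = sum(r["price"] for r in rows)
        total_km = sum(r["mileage"] for r in rows)
        if total_km == 0:
            continue  # avoid dividing by zero for all-new inventory
        mean_age = sum(r["age_years"] for r in rows) / len(rows)
        ranking.append((make, total_price / total_km * 1000, mean_age))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Invented makes and numbers, purely to show the effect.
toy = [
    {"make": "NewBrand", "price": 30_000, "mileage": 5_000, "age_years": 0.5},
    {"make": "NewBrand", "price": 28_000, "mileage": 8_000, "age_years": 0.7},
    {"make": "OldBrand", "price": 12_000, "mileage": 120_000, "age_years": 8},
    {"make": "OldBrand", "price": 9_000, "mileage": 160_000, "age_years": 10},
    {"make": "Sparse", "price": 20_000, "mileage": 40_000, "age_years": 3},
]
for make, eur_per_1000_km, mean_age in rank_makes(toy, min_listings=1):
    print(f"{make}: {eur_per_1000_km:.0f} EUR per 1,000 km, {mean_age:.1f} years")
```

The young toy brand lands far above the old one despite similar list prices when new, which is the same pattern as BYD and Leapmotor above. Reading the ratio next to the mean age makes that easy to spot.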
### Depreciation Curves
You'd expect the price-per-km ratio to fall as mileage increases — older,
high-mileage cars are cheaper, and the new-car premium fades fast. You'd also
expect the decline to be roughly exponential: a car loses a percentage of its
current value per additional kilometer, not a fixed euro amount.
Both hold up in the data:
{{/* dashboard on price vs mileage and price/km vs mileage */}}
{{< grafana url="https://gr.eliaskohout.de/public-dashboards/e999ce3c237b4cae95b3331331a26261" height="400" >}}
The curves above show average prices over mileage for BMW, Volkswagen, and Fiat.
The brand-level differences are immediately visible — different starting points,
different values throughout the decay — but the overall shape is the same: steep
early depreciation that gradually flattens out.
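One way to check the "roughly exponential" claim is to fit a straight line to the logarithm of price over mileage: if price follows `p0 * exp(-k * mileage)`, then `ln(price)` is linear in mileage. The sketch below does that with plain least squares on a synthetic curve. The function name and all numbers are illustrative and not taken from the dataset.

```python
import math


def fit_exponential(mileages_km, prices_eur):
    """Least-squares fit of ln(price) = ln(p0) - k * mileage. Returns (p0, k)."""
    n = len(mileages_km)
    logs = [math.log(p) for p in prices_eur]
    mean_x = sum(mileages_km) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in mileages_km)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mileages_km, logs))
    slope = sxy / sxx
    p0 = math.exp(mean_y - slope * mean_x)
    return p0, -slope


# Synthetic curve, not real data: starts at 40,000 EUR, loses 1% per 1,000 km.
true_p0 = 40_000.0
true_k = -math.log(0.99) / 1000
mileages = [0, 50_000, 100_000, 150_000, 200_000]
prices = [true_p0 * math.exp(-true_k * m) for m in mileages]

p0, k = fit_exponential(mileages, prices)
loss_per_1000_km = 1 - math.exp(-k * 1000)
print(f"p0 = {p0:.0f} EUR, loss per 1,000 km = {loss_per_1000_km:.2%}")
```

On real averages per mileage bucket, the fitted `p0` and `k` would give each brand a starting point and a decay rate to compare. How well a straight line fits the log-prices also indicates how exponential the curve really is.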
* Only toward the high-mileage end do the lines start to blur; roughly 200,000 km can be taken as the threshold beyond which the data gets more chaotic
* That may also be because there are simply fewer data points out there, which makes the averages less reliable
* So let's take a closer look at the distributions at this end of the scale
### The Survivors: Cars Beyond 250k km
{{/*
==============================================
PLAN: Analysis chapters to write
==============================================
1. Price vs. Mileage Relationship
- Scatter/heatmap of price vs. mileage across all listings
- Depreciation curves: how fast do different makes lose value?
- The "sweet spot": best mileage-to-price ratio by model
2. The Survivors: Cars Beyond 300k km
- Which makes/models appear most often at extreme mileage?
- Fuel type breakdown (diesel vs. petrol at high mileage)
- Average price of high-mileage cars — are they dirt cheap or still holding value?
- Estimate the failure rate
4. Fuel & Drivetrain Trends
- Fuel type distribution (diesel/petrol/electric/hybrid/LPG)
- Price and mileage by fuel type
- Are EVs showing up in used markets yet? At what price?
5. Geography
- Listings by country and region (zip code clusters)
- Regional price differences for the same model
- Where are the cheap cars?
==============================================
*/}}