Search HAR Capture and CSV Tools

This project contains small tools for collecting browser network data from search engines and converting the resulting HAR files into CSV datasets.

The intended workflow is:

  1. Capture one or more .har files with Playwright.
  2. Convert the HAR files to CSV.
  3. Use the generated CSV files in Excel, Power Query, Power BI, or another analysis tool.

Tools

capture_search_har

Captures search result pages as HAR files.

Supported search engines:

  • Google
  • DuckDuckGo
  • Bing
  • Brave Search

Supported browsers:

  • Firefox
  • Chromium

har_entries_to_csv

Converts HAR files to two CSV files:

  • har_entries.csv: one row per HAR log.entries[] item.
  • har_summary.csv: one row per HAR file, with aggregated measurements.
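
The "one row per log.entries[] item" idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name har_entries_to_rows and the output keys are hypothetical, while the HAR field names (request.url, request.cookies, request.queryString, request.postData, response.status, response.cookies) come from the HAR 1.2 format.

```python
import json
from urllib.parse import urlsplit


def har_entries_to_rows(har: dict) -> list[dict]:
    """Flatten HAR log.entries[] into one plain dict per request."""
    rows = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        url = request.get("url", "")
        rows.append({
            "url": url,
            "domain": urlsplit(url).hostname or "",
            "method": request.get("method", ""),
            "status": response.get("status", 0),
            "request_cookies": len(request.get("cookies", [])),
            "response_cookies": len(response.get("cookies", [])),
            "query_params": len(request.get("queryString", [])),
            # postData is optional in HAR; only its presence is recorded here.
            "has_post_data": "postData" in request,
        })
    return rows


# Tiny hand-written HAR document, for illustration only.
sample_har = json.loads("""
{"log": {"entries": [
  {"request": {"method": "GET",
               "url": "https://example.com/search?q=weather",
               "cookies": [],
               "queryString": [{"name": "q", "value": "weather"}]},
   "response": {"status": 200,
                "cookies": [{"name": "sid", "value": "abc"}]}}
]}}
""")
rows = har_entries_to_rows(sample_har)
```

Each dict in rows can then be written with csv.DictWriter to produce a file shaped like har_entries.csv.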

Requirements

Activate the project environment before running the tools:

source .noroff-env/bin/activate

Playwright must be installed in the active environment:

pip install playwright
playwright install firefox chromium

Check that the commands are available:

which capture_search_har
which har_entries_to_csv

Basic Capture

Run a search on all supported search engines with Firefox, the browser used in this example because --browser is not given:

capture_search_har \
  --query "weather oslo"

Run a search with Chromium:

capture_search_har \
  --query "weather oslo" \
  --browser chromium

Show the browser window during capture:

capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --headed

Set the page-load wait condition and a timeout in milliseconds:

capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --timeout-ms 60000 \
  --headed
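
For readers who want to see roughly what such a capture involves, the following sketch records a HAR file with Playwright's Python API. It is an assumption about how a tool like this could work, not the actual implementation of capture_search_har: the SEARCH_URLS templates, build_search_url, and capture_har are hypothetical names. The Playwright calls (launch, new_context with record_har_path, goto with wait_until and timeout) are part of the public sync API.

```python
from urllib.parse import quote_plus

# Assumed search URL templates; the real tool may build its URLs differently.
SEARCH_URLS = {
    "google": "https://www.google.com/search?q={query}",
    "duckduckgo": "https://duckduckgo.com/?q={query}",
    "bing": "https://www.bing.com/search?q={query}",
    "brave": "https://search.brave.com/search?q={query}",
}


def build_search_url(engine: str, query: str) -> str:
    """Fill the engine's URL template with the URL-encoded query."""
    return SEARCH_URLS[engine].format(query=quote_plus(query))


def capture_har(engine, query, har_path, browser_name="firefox",
                wait_until="load", timeout_ms=60000, headed=False):
    """Open one search result page and record its network traffic as HAR."""
    # Imported lazily so the URL helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = getattr(p, browser_name).launch(headless=not headed)
        context = browser.new_context(record_har_path=har_path)
        page = context.new_page()
        page.goto(build_search_url(engine, query),
                  wait_until=wait_until, timeout=timeout_ms)
        context.close()  # the HAR file is written when the context closes
        browser.close()
```

A call such as capture_har("google", "weather oslo", "google.har", browser_name="chromium", headed=True) would correspond loosely to the command above.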

Capture Selected Search Engines

Only Google:

capture_search_har \
  --query "weather oslo" \
  --engines google

Google and DuckDuckGo:

capture_search_har \
  --query "weather oslo" \
  --engines google duckduckgo

Output Directory

Use --output-dir to choose where HAR files are written:

capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --output-dir normal_chromium

The generated HAR filenames include timestamp, search engine, and query:

20260508_144327_google_weather_oslo.har
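
Because the filename carries this metadata, it can be recovered later without opening the HAR file. The sketch below assumes the pattern date_time_engine_query seen in the example, that engine names contain no underscore, and that spaces in the query were replaced by underscores. The function name parse_har_filename is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def parse_har_filename(filename: str) -> dict:
    """Split '<YYYYMMDD>_<HHMMSS>_<engine>_<query>.har' into its parts."""
    stem = Path(filename).stem
    # Split on the first three underscores only; the rest is the query.
    date_part, time_part, engine, query_slug = stem.split("_", 3)
    return {
        "captured_at": datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S"),
        "search_engine": engine,
        "query_text": query_slug.replace("_", " "),
    }


info = parse_har_filename("20260508_144327_google_weather_oslo.har")
```

A query that itself contained underscores could not be restored exactly with this approach, which is one reason to treat the parsed query text as a label, not as the literal search string.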

Tor / Proxy Capture

If Tor is available as a SOCKS proxy on 127.0.0.1:9050:

capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --timeout-ms 60000 \
  --headed \
  --proxy socks5://127.0.0.1:9050 \
  --output-dir tor_chromium

Test Tor before capture:

curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip

The response should contain:

"IsTor": true
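
When the check is scripted instead of read by eye, the response body can be parsed as JSON. This is a small sketch under the assumption that the body is a JSON object with a boolean IsTor field, as shown above. The function name is_tor_response is hypothetical, and the sample bodies use a documentation-only IP address.

```python
import json


def is_tor_response(body: str) -> bool:
    """Return True only if the JSON body has IsTor set to true."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("IsTor") is True


# Hand-written sample bodies, for illustration only.
via_tor = is_tor_response('{"IsTor": true, "IP": "192.0.2.1"}')
not_via_tor = is_tor_response('{"IsTor": false, "IP": "192.0.2.1"}')
broken_reply = is_tor_response("<html>error</html>")
```

Treating an unparsable reply as "not Tor" is a deliberately cautious choice: a capture meant for a tor_* folder should not start unless the check clearly succeeds.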

Convert HAR to CSV

Run from a folder that contains a data/ directory with HAR files:

har_entries_to_csv

This reads:

data/*.har

and writes:

har_entries.csv
har_summary.csv

Use a custom input folder:

har_entries_to_csv \
  --input-dir normal_chromium

Use custom output filenames:

har_entries_to_csv \
  --input-dir normal_chromium \
  --entries-output entries_normal_chromium.csv \
  --summary-output summary_normal_chromium.csv

One practical setup is to keep each test condition in its own folder and run har_entries_to_csv from inside that folder:

work/
  normal_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  normal_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  tor_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  tor_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv

Alternatively, write HAR files directly into condition folders and pass those folders to har_entries_to_csv with --input-dir.
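
With one har_summary.csv per condition folder, a comparison across conditions needs the files merged and labelled. The following sketch does that with the standard library. It assumes the folder layout shown above. The function name combine_summaries, the added condition column, and the output name combined_summary.csv are hypothetical, and the demo values are made up.

```python
import csv
import tempfile
from pathlib import Path


def combine_summaries(work_dir: Path, output_csv: Path) -> int:
    """Merge every <condition>/har_summary.csv into one CSV with a condition column."""
    rows = []
    for summary_path in sorted(work_dir.glob("*/har_summary.csv")):
        with summary_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # The folder name doubles as the condition label.
                rows.append({"condition": summary_path.parent.name, **row})
    if not rows:
        return 0
    # Union of all column names, keeping first-seen order.
    fieldnames = list(dict.fromkeys(name for row in rows for name in row))
    with output_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo with made-up numbers in a temporary work/ directory.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    for condition, total in [("normal_chromium", "42"), ("tor_chromium", "17")]:
        (work / condition).mkdir()
        (work / condition / "har_summary.csv").write_text(
            "har_filename,requests_total\nsample.har," + total + "\n",
            encoding="utf-8",
        )
    combined_count = combine_summaries(work, work / "combined_summary.csv")
    combined_text = (work / "combined_summary.csv").read_text(encoding="utf-8")
```

The combined file can be loaded into Excel or Power BI as a single table, with condition available as a filter or chart axis.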

CSV Outputs

har_entries.csv

One row per network request, that is, one row per log.entries[] item in each HAR file.

Useful for inspecting details such as:

  • request URL
  • domain
  • HTTP method
  • status code
  • request cookies
  • response cookies
  • query parameters
  • POST data presence
  • approximate transfer size

har_summary.csv

One row per HAR file.

Useful for analysis and visualisation. Important columns include:

  • har_filename
  • search_engine
  • query_text
  • requests_total
  • unique_domains
  • third_party_requests
  • request_cookies_total
  • response_cookies_total
  • query_params_total
  • post_requests_total
  • tracking_hint_requests
  • transferred_kb_approx
  • page_load_ms
  • status_2xx
  • status_3xx
  • status_4xx
  • status_5xx
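
Several of these columns are simple aggregates over the entries of one HAR file. The sketch below shows how such counts could be derived. It is not the tool's actual logic: the function name summarise_entries and the first_party_suffix parameter are hypothetical, and the rule "a request is third-party when its host is neither the first-party domain nor a subdomain of it" is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarise_entries(entries: list[dict], first_party_suffix: str) -> dict:
    """Aggregate HAR entries into a few summary counts."""
    domains = set()
    status_classes = Counter()
    third_party = 0
    post_requests = 0
    for entry in entries:
        request = entry.get("request", {})
        host = urlsplit(request.get("url", "")).hostname or ""
        domains.add(host)
        # Assumed rule: first-party means the suffix itself or any subdomain of it.
        if host != first_party_suffix and not host.endswith("." + first_party_suffix):
            third_party += 1
        if request.get("method") == "POST":
            post_requests += 1
        status = entry.get("response", {}).get("status", 0)
        if 200 <= status < 600:
            status_classes[f"status_{status // 100}xx"] += 1
    return {
        "requests_total": len(entries),
        "unique_domains": len(domains),
        "third_party_requests": third_party,
        "post_requests_total": post_requests,
        "status_2xx": status_classes["status_2xx"],
        "status_3xx": status_classes["status_3xx"],
        "status_4xx": status_classes["status_4xx"],
        "status_5xx": status_classes["status_5xx"],
    }


def _entry(method: str, url: str, status: int) -> dict:
    """Build a minimal HAR-like entry for the demo."""
    return {"request": {"method": method, "url": url},
            "response": {"status": status}}


# Hand-written entries, for illustration only.
sample_entries = [
    _entry("GET", "https://www.example.com/search", 200),
    _entry("GET", "https://cdn.example.com/app.js", 304),
    _entry("POST", "https://metrics.example.net/collect", 204),
    _entry("GET", "https://www.example.com/missing", 404),
]
summary = summarise_entries(sample_entries, "example.com")
```

A suffix comparison like this is only a rough first-party test. It does not know that two different registered domains can belong to the same company, so third_party_requests should be read as "requests to other domains".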

Notes on Interpretation

The HAR files and CSV files show only the network activity that is observable from the browser. They do not show what a search engine stores or processes on its servers.

tracking_hint_requests is a keyword-based flag. It is useful for filtering and inspection, but it is not proof of tracking by itself.
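
A keyword-based flag of this kind can be as simple as a substring test on the URL, which also shows why it produces false positives. The keyword list below is hypothetical and is not the list used by har_entries_to_csv. The function name has_tracking_hint is likewise an assumption.

```python
# Hypothetical keyword list; the real tool's keywords are not documented here.
TRACKING_HINTS = ("track", "pixel", "beacon", "analytics", "collect")


def has_tracking_hint(url: str, hints=TRACKING_HINTS) -> bool:
    """Return True if any keyword occurs anywhere in the lowercased URL."""
    lowered = url.lower()
    return any(hint in lowered for hint in hints)


flag_pixel = has_tracking_hint("https://example.com/pixel.gif?id=1")
flag_search = has_tracking_hint("https://example.com/search?q=weather")
# False positive: "racetrack" contains "track" but has nothing to do with tracking.
flag_racetrack = has_tracking_hint("https://example.com/racetrack-results")
```

The third example is flagged although it is harmless, so flagged rows are best treated as candidates for manual inspection in har_entries.csv.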

har_entries.csv may contain sensitive data such as full URLs, cookie values, headers, and query parameters. Treat it as raw data: store it carefully and do not share it unredacted.

For charts and reporting, har_summary.csv is usually the better starting point.
