# Search HAR Capture and CSV Tools

This project contains small tools for collecting browser network data from search engines and converting the resulting HAR files into CSV datasets.

The intended workflow is:

1. Capture one or more `.har` files with Playwright.
2. Convert the HAR files to CSV.
3. Use the generated CSV files in Excel, Power Query, Power BI, or another analysis tool.

## Tools

### `capture_search_har`

Captures search result pages as HAR files.

Supported search engines:

- Google
- DuckDuckGo
- Bing
- Brave Search

Supported browsers:

- Firefox
- Chromium

### `har_entries_to_csv`

Converts HAR files to two CSV files:

- `har_entries.csv`: one row per HAR `log.entries[]` item.
- `har_summary.csv`: one row per HAR file, with aggregated measurements.

## Requirements

Activate the project environment before running the tools:

```bash
source .noroff-env/bin/activate
```

Playwright must be installed in the active environment:

```bash
pip install playwright
playwright install firefox chromium
```

Check that the commands are available:

```bash
which capture_search_har
which har_entries_to_csv
```

## Basic Capture

Run a search with all supported search engines using Firefox:

```bash
capture_search_har \
  --query "weather oslo"
```

Run a search with Chromium:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium
```

Show the browser window during capture:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --headed
```

Use a fixed wait condition and timeout:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --timeout-ms 60000 \
  --headed
```

## Capture Selected Search Engines

Only Google:

```bash
capture_search_har \
  --query "weather oslo" \
  --engines google
```

Google and DuckDuckGo:

```bash
capture_search_har \
  --query "weather oslo" \
  --engines google duckduckgo
```

## Output Directory

Use `--output-dir` to choose where HAR files are written:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --output-dir normal_chromium
```

The generated HAR filenames include timestamp, search engine, and query:

```text
20260508_144327_google_weather_oslo.har
```

## Tor / Proxy Capture

If Tor is available as a SOCKS proxy on `127.0.0.1:9050`:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --timeout-ms 60000 \
  --headed \
  --proxy socks5://127.0.0.1:9050 \
  --output-dir tor_chromium
```

Test Tor before capture:

```bash
curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
```

The response should contain:

```json
"IsTor": true
```

## Convert HAR to CSV

Run from a folder that contains a `data/` directory with HAR files:

```bash
har_entries_to_csv
```

This reads:

```text
data/*.har
```

and writes:

```text
har_entries.csv
har_summary.csv
```

Use a custom input folder:

```bash
har_entries_to_csv \
  --input-dir normal_chromium
```

Use custom output filenames:

```bash
har_entries_to_csv \
  --input-dir normal_chromium \
  --entries-output entries_normal_chromium.csv \
  --summary-output summary_normal_chromium.csv
```

## Recommended Folder Structure

One practical setup is to keep each test condition in its own folder:

```text
work/
  normal_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  normal_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  tor_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  tor_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv
```

Alternatively, write HAR files directly into condition folders and pass those folders to `har_entries_to_csv` with `--input-dir`.

## CSV Outputs

### `har_entries.csv`

One row per network request in the HAR file.

Useful for inspecting details such as:

- request URL
- domain
- HTTP method
- status code
- request cookies
- response cookies
- query parameters
- POST data presence
- approximate transfer size

### `har_summary.csv`

One row per HAR file.

Useful for analysis and visualisation.

Important columns include:

- `har_filename`
- `search_engine`
- `query_text`
- `requests_total`
- `unique_domains`
- `third_party_requests`
- `request_cookies_total`
- `response_cookies_total`
- `query_params_total`
- `post_requests_total`
- `tracking_hint_requests`
- `transferred_kb_approx`
- `page_load_ms`
- `status_2xx`
- `status_3xx`
- `status_4xx`
- `status_5xx`

## Notes on Interpretation

The HAR files and CSV files show observable browser-side network activity only. They do not show what a search engine stores internally on the server side.

`tracking_hint_requests` is a keyword-based flag. It is useful for filtering and inspection, but it is not proof of tracking by itself.

`har_entries.csv` may contain sensitive data such as full URLs, cookie values, headers, and query parameters. Treat it as raw data.

For charts and reporting, `har_summary.csv` is usually the better starting point.
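
To illustrate why a keyword-based flag such as `tracking_hint_requests` is only a hint, here is a minimal sketch of how a flag of this kind could work. The keyword list and the function name `has_tracking_hint` are hypothetical and chosen for illustration only; the terms actually used by `har_entries_to_csv` may differ.

```python
from urllib.parse import urlsplit

# Hypothetical keyword list, for illustration only; the real tool may use other terms.
TRACKING_KEYWORDS = ("analytics", "beacon", "collect", "pixel", "track")


def has_tracking_hint(url: str) -> bool:
    """Return True if the URL's host, path, or query string contains any keyword."""
    parts = urlsplit(url.lower())
    haystack = f"{parts.netloc}{parts.path}?{parts.query}"
    return any(keyword in haystack for keyword in TRACKING_KEYWORDS)


if __name__ == "__main__":
    examples = [
        "https://example.com/collect?v=1",            # flagged: contains "collect"
        "https://example.com/docs/stamp-collecting",  # also flagged, although it is an ordinary page
        "https://example.com/search?q=weather+oslo",  # not flagged
    ]
    for url in examples:
        print(has_tracking_hint(url), url)
```

The second example shows the limitation: a substring match cannot tell a measurement endpoint from an unrelated page whose URL happens to contain a keyword, so flagged rows still need manual inspection.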
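
As a starting point for reporting from `har_summary.csv`, the sketch below computes the share of third-party requests per search engine. It uses the column names `search_engine`, `requests_total`, and `third_party_requests` listed above and assumes that the two count columns contain plain integers. The function name `third_party_share_by_engine` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def third_party_share_by_engine(csv_file) -> dict[str, float]:
    """Return third_party_requests / requests_total, aggregated per search engine."""
    totals = defaultdict(lambda: [0, 0])  # engine -> [third_party, total]
    for row in csv.DictReader(csv_file):
        engine = row["search_engine"]
        totals[engine][0] += int(row["third_party_requests"])
        totals[engine][1] += int(row["requests_total"])
    return {
        engine: third_party / total if total else 0.0
        for engine, (third_party, total) in totals.items()
    }


if __name__ == "__main__":
    # Made-up sample rows; real values come from har_summary.csv.
    sample = io.StringIO(
        "har_filename,search_engine,requests_total,third_party_requests\n"
        "a.har,google,40,10\n"
        "b.har,google,60,10\n"
        "c.har,bing,50,25\n"
    )
    for engine, share in third_party_share_by_engine(sample).items():
        print(f"{engine}: {share:.0%}")
```

With a real file, pass an open file object instead of the in-memory sample, for example `open("har_summary.csv", newline="", encoding="utf-8")`.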