diff --git a/README.md b/README.md
index 2cd91b8..b1feb21 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,284 @@
-# NoroffExam
+# Search HAR Capture and CSV Tools
+
+This project contains small tools for collecting browser network data from
+search engines and converting the resulting HAR files into CSV datasets.
+
+The intended workflow is:
+
+1. Capture one or more `.har` files with Playwright.
+2. Convert the HAR files to CSV.
+3. Use the generated CSV files in Excel, Power Query, Power BI, or another
+   analysis tool.
+
+## Tools
+
+### `capture_search_har`
+
+Captures search result pages as HAR files.
+
+Supported search engines:
+
+- Google
+- DuckDuckGo
+- Bing
+- Brave Search
+
+Supported browsers:
+
+- Firefox
+- Chromium
+
+### `har_entries_to_csv`
+
+Converts HAR files to two CSV files:
+
+- `har_entries.csv`: one row per HAR `log.entries[]` item.
+- `har_summary.csv`: one row per HAR file, with aggregated measurements.
+
+## Requirements
+
+Activate the project environment before running the tools:
+
+```bash
+source .noroff-env/bin/activate
+```
+
+Playwright must be installed in the active environment:
+
+```bash
+pip install playwright
+playwright install firefox chromium
+```
+
+Check that the commands are available:
+
+```bash
+which capture_search_har
+which har_entries_to_csv
+```
+
+## Basic Capture
+
+Run a search with all supported search engines using Firefox:
+
+```bash
+capture_search_har \
+  --query "weather oslo"
+```
+
+Run a search with Chromium:
+
+```bash
+capture_search_har \
+  --query "weather oslo" \
+  --browser chromium
+```
+
+Show the browser window during capture:
+
+```bash
+capture_search_har \
+  --query "weather oslo" \
+  --browser chromium \
+  --headed
+```
+
+Use a fixed wait condition and timeout:
+
+```bash
+capture_search_har \
+  --query "weather oslo" \
+  --browser chromium \
+  --wait-until load \
+  --timeout-ms 60000 \
+  --headed
+```
+
+## Capture Selected Search Engines
+
+Only Google:
+
+```bash
+capture_search_har \
+  --query "weather oslo" \
+  --engines google
+```
+
+Google and DuckDuckGo:
+
+```bash
+capture_search_har \
+  --query "weather oslo" \
+  --engines google duckduckgo
+```
+
+## Output Directory
+
+Use `--output-dir` to choose where HAR files are written:
+
+```bash
+capture_search_har \
+  --query "weather oslo" \
+  --browser chromium \
+  --output-dir normal_chromium
+```
+
+The generated HAR filenames include timestamp, search engine, and query:
+
+```text
+20260508_144327_google_weather_oslo.har
+```
+
+## Tor / Proxy Capture
+
+If Tor is available as a SOCKS proxy on `127.0.0.1:9050`:
+
+```bash
+capture_search_har \
+  --query "weather oslo" \
+  --browser chromium \
+  --wait-until load \
+  --timeout-ms 60000 \
+  --headed \
+  --proxy socks5://127.0.0.1:9050 \
+  --output-dir tor_chromium
+```
+
+Test Tor before capture:
+
+```bash
+curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
+```
+
+The response should contain:
+
+```json
+"IsTor": true
+```
+
+## Convert HAR to CSV
+
+Run from a folder that contains a `data/` directory with HAR files:
+
+```bash
+har_entries_to_csv
+```
+
+This reads:
+
+```text
+data/*.har
+```
+
+and writes:
+
+```text
+har_entries.csv
+har_summary.csv
+```
+
+Use a custom input folder:
+
+```bash
+har_entries_to_csv \
+  --input-dir normal_chromium
+```
+
+Use custom output filenames:
+
+```bash
+har_entries_to_csv \
+  --input-dir normal_chromium \
+  --entries-output entries_normal_chromium.csv \
+  --summary-output summary_normal_chromium.csv
+```
+
+## Recommended Folder Structure
+
+One practical setup is to keep each test condition in its own folder:
+
+```text
+work/
+  normal_chromium/
+    data/
+      *.har
+    har_entries.csv
+    har_summary.csv
+
+  normal_firefox/
+    data/
+      *.har
+    har_entries.csv
+    har_summary.csv
+
+  tor_chromium/
+    data/
+      *.har
+    har_entries.csv
+    har_summary.csv
+
+  tor_firefox/
+    data/
+      *.har
+    har_entries.csv
+    har_summary.csv
+```
+
+Alternatively, write HAR files directly into condition folders and pass those
+folders to `har_entries_to_csv` with `--input-dir`.
+
+## CSV Outputs
+
+### `har_entries.csv`
+
+One row per network request in the HAR file.
+
+Useful for inspecting details such as:
+
+- request URL
+- domain
+- HTTP method
+- status code
+- request cookies
+- response cookies
+- query parameters
+- POST data presence
+- approximate transfer size
+
+### `har_summary.csv`
+
+One row per HAR file.
+
+Useful for analysis and visualisation. Important columns include:
+
+- `har_filename`
+- `search_engine`
+- `query_text`
+- `requests_total`
+- `unique_domains`
+- `third_party_requests`
+- `request_cookies_total`
+- `response_cookies_total`
+- `query_params_total`
+- `post_requests_total`
+- `tracking_hint_requests`
+- `transferred_kb_approx`
+- `page_load_ms`
+- `status_2xx`
+- `status_3xx`
+- `status_4xx`
+- `status_5xx`
+
+## Notes on Interpretation
+
+The HAR files and CSV files show observable browser-side network activity only.
+They do not show what a search engine stores internally on the server side.
+
+`tracking_hint_requests` is a keyword-based flag. It is useful for filtering and
+inspection, but it is not proof of tracking by itself.
+
+`har_entries.csv` may contain sensitive data such as full URLs, cookie values,
+headers, and query parameters. Treat it as raw data.
+
+For charts and reporting, `har_summary.csv` is usually the better starting point.
 
-Analysis of Cookie Activity in Web Search Traffic
\ No newline at end of file