NoroffExam/README.md
2026-05-11 13:51:35 +02:00


# Search HAR Capture and CSV Tools
This project contains small tools for collecting browser network data from
search engines and converting the resulting HAR files into CSV datasets.
The intended workflow is:
1. Capture one or more `.har` files with Playwright.
2. Convert the HAR files to CSV.
3. Use the generated CSV files in Excel, Power Query, Power BI, or another
analysis tool.
## Tools
### `capture_search_har`
Captures search result pages as HAR files.
Supported search engines:
- Google
- DuckDuckGo
- Bing
- Brave Search
Supported browsers:
- Firefox
- Chromium
### `har_entries_to_csv`
Converts HAR files to two CSV files:
- `har_entries.csv`: one row per HAR `log.entries[]` item.
- `har_summary.csv`: one row per HAR file, with aggregated measurements.
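To make the per-entry output concrete, here is a minimal sketch of how one HAR `log.entries[]` item could be flattened into a row. It is not the actual implementation of `har_entries_to_csv`. The function names `entry_to_row` and `har_to_rows` and the column names are illustrative assumptions; only the HAR fields being read (`request.url`, `request.method`, `response.status`, `cookies`, `queryString`, `postData`) come from the HAR format itself.

```python
from urllib.parse import urlsplit


def entry_to_row(entry: dict) -> dict:
    """Flatten one HAR log.entries[] item into a flat dict.

    The column names are hypothetical; the real tool may use different ones.
    """
    request = entry.get("request", {})
    response = entry.get("response", {})
    url = request.get("url", "")
    return {
        "url": url,
        "domain": urlsplit(url).hostname or "",
        "method": request.get("method", ""),
        "status": response.get("status", 0),
        "request_cookies": len(request.get("cookies", [])),
        "response_cookies": len(response.get("cookies", [])),
        "query_params": len(request.get("queryString", [])),
        "has_post_data": "postData" in request,
    }


def har_to_rows(har: dict) -> list[dict]:
    """Return one row per entry of a parsed HAR document."""
    return [entry_to_row(e) for e in har.get("log", {}).get("entries", [])]
```

A parsed HAR file (`json.load` on the `.har` file) can be passed directly to `har_to_rows`, and the resulting dicts written with `csv.DictWriter`.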
## Requirements
Activate the project environment before running the tools:
```bash
source .noroff-env/bin/activate
```
Playwright must be installed in the active environment:
```bash
pip install playwright
playwright install firefox chromium
```
Check that the commands are available:
```bash
which capture_search_har
which har_entries_to_csv
```
## Basic Capture
Run a search on all supported search engines using Firefox (no `--engines` or `--browser` option is given):
```bash
capture_search_har \
--query "weather oslo"
```
Run a search with Chromium:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium
```
Show the browser window during capture:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium \
--headed
```
Set an explicit wait condition and a timeout in milliseconds:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium \
--wait-until load \
--timeout-ms 60000 \
--headed
```
## Capture Selected Search Engines
Only Google:
```bash
capture_search_har \
--query "weather oslo" \
--engines google
```
Google and DuckDuckGo:
```bash
capture_search_har \
--query "weather oslo" \
--engines google duckduckgo
```
## Output Directory
Use `--output-dir` to choose where HAR files are written:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium \
--output-dir normal_chromium
```
Generated HAR filenames include the capture timestamp, the search engine, and the query:
```text
20260508_144327_google_weather_oslo.har
```
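If the filename parts are needed in your own scripts, they can be recovered along these lines. This is a sketch based only on the single example filename shown here: it assumes the layout `<date>_<time>_<engine>_<query>.har`, that the engine token contains no underscore, and that spaces in the query were written as underscores. The function name `parse_har_filename` is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def parse_har_filename(name: str) -> dict:
    """Split a HAR filename into timestamp, engine and query.

    Assumes <YYYYMMDD>_<HHMMSS>_<engine>_<query>.har with a single-token engine.
    """
    stem = Path(name).stem
    # maxsplit=3 keeps any further underscores inside the query part.
    date_part, time_part, engine, query = stem.split("_", 3)
    return {
        "captured_at": datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S"),
        "search_engine": engine,
        "query_text": query.replace("_", " "),
    }
```

Note that a query which originally contained underscores cannot be distinguished from one that contained spaces, so `har_summary.csv` remains the more reliable source for `search_engine` and `query_text`.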
## Tor / Proxy Capture
If Tor is available as a SOCKS proxy on `127.0.0.1:9050`, route the capture through it with `--proxy`:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium \
--wait-until load \
--timeout-ms 60000 \
--headed \
--proxy socks5://127.0.0.1:9050 \
--output-dir tor_chromium
```
Test Tor before capture:
```bash
curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
```
The response should contain:
```json
"IsTor": true
```
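When the check is part of a script, the response body can be tested programmatically instead of by eye. The following is a small sketch, assuming the body returned by the `curl` command above is passed in as a string; the helper name `is_tor_response` is hypothetical.

```python
import json


def is_tor_response(body: str) -> bool:
    """Return True only if the JSON body has "IsTor" set to true."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Anything that is not valid JSON counts as a failed check.
        return False
    return isinstance(data, dict) and data.get("IsTor") is True
```

Treating invalid or unexpected output as a failure is the cautious choice here: a capture labelled as Tor traffic should not start unless the check clearly passed.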
## Convert HAR to CSV
Run from a folder that contains a `data/` directory with HAR files:
```bash
har_entries_to_csv
```
This reads:
```text
data/*.har
```
and writes:
```text
har_entries.csv
har_summary.csv
```
Use a custom input folder:
```bash
har_entries_to_csv \
--input-dir normal_chromium
```
Use custom output filenames:
```bash
har_entries_to_csv \
--input-dir normal_chromium \
--entries-output entries_normal_chromium.csv \
--summary-output summary_normal_chromium.csv
```
## Recommended Folder Structure
One practical setup is to keep each test condition in its own folder:
```text
work/
  normal_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  normal_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  tor_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  tor_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv
```
Alternatively, write HAR files directly into condition folders and pass those
folders to `har_entries_to_csv` with `--input-dir`.
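With several condition folders, the conversion step can be scripted. The sketch below builds one `har_entries_to_csv` call per folder using only the options documented above (`--input-dir`, `--entries-output`, `--summary-output`). The helper names and the `entries_<folder>.csv` / `summary_<folder>.csv` naming pattern are assumptions modelled on the earlier example.

```python
import subprocess
from pathlib import Path


def converter_command(condition_dir: str) -> list[str]:
    """Build the har_entries_to_csv argument list for one condition folder."""
    name = Path(condition_dir).name
    return [
        "har_entries_to_csv",
        "--input-dir", condition_dir,
        "--entries-output", f"entries_{name}.csv",
        "--summary-output", f"summary_{name}.csv",
    ]


def convert_all(condition_dirs: list[str]) -> None:
    """Run the converter once per folder, stopping at the first failure."""
    for condition_dir in condition_dirs:
        subprocess.run(converter_command(condition_dir), check=True)
```

For example, `convert_all(["normal_chromium", "tor_chromium"])` would run the converter twice, provided the project environment is active so that `har_entries_to_csv` is on the `PATH`.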
## CSV Outputs
### `har_entries.csv`
One row per network request in the HAR file.
Useful for inspecting details such as:
- request URL
- domain
- HTTP method
- status code
- request cookies
- response cookies
- query parameters
- POST data presence
- approximate transfer size
### `har_summary.csv`
One row per HAR file.
Useful for analysis and visualisation. Important columns include:
- `har_filename`
- `search_engine`
- `query_text`
- `requests_total`
- `unique_domains`
- `third_party_requests`
- `request_cookies_total`
- `response_cookies_total`
- `query_params_total`
- `post_requests_total`
- `tracking_hint_requests`
- `transferred_kb_approx`
- `page_load_ms`
- `status_2xx`
- `status_3xx`
- `status_4xx`
- `status_5xx`
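As an example of working with these columns outside Excel or Power BI, the sketch below computes the share of third-party requests per search engine from `har_summary.csv`. It relies on the column names listed above and assumes that `requests_total` and `third_party_requests` hold plain integer counts; the function name `third_party_share` is hypothetical.

```python
import csv
from collections import defaultdict
from typing import TextIO


def third_party_share(summary_file: TextIO) -> dict[str, float]:
    """Return third_party_requests / requests_total per search engine."""
    totals = defaultdict(lambda: [0, 0])  # engine -> [third_party, total]
    for row in csv.DictReader(summary_file):
        engine = row["search_engine"]
        # Assumption: both columns are written as integers.
        totals[engine][0] += int(row["third_party_requests"])
        totals[engine][1] += int(row["requests_total"])
    # Engines with zero requests are skipped to avoid dividing by zero.
    return {e: tp / total for e, (tp, total) in totals.items() if total}
```

Usage, from the folder that contains the summary file:

```python
with open("har_summary.csv", newline="") as f:
    print(third_party_share(f))
```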
## Notes on Interpretation
The HAR files and CSV files show observable browser-side network activity only.
They do not show what a search engine stores internally on the server side.
`tracking_hint_requests` is a keyword-based flag. It is useful for filtering and
inspection, but it is not proof of tracking by itself.
`har_entries.csv` may contain sensitive data such as full URLs, cookie values,
headers, and query parameters. Treat it as raw data: keep it out of public
repositories and reports, and review or redact it before sharing.
For charts and reporting, `har_summary.csv` is usually the better starting point.
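To illustrate why a keyword-based flag such as `tracking_hint_requests` is a hint rather than proof, here is a minimal sketch of that kind of check. The keyword list and the function name `has_tracking_hint` are assumptions made for illustration; the keywords used by `har_entries_to_csv` are not documented here and may differ.

```python
# Hypothetical keyword list, for illustration only.
TRACKING_KEYWORDS = ("analytics", "beacon", "pixel", "track", "telemetry")


def has_tracking_hint(url: str, keywords: tuple[str, ...] = TRACKING_KEYWORDS) -> bool:
    """Return True if any keyword occurs anywhere in the URL (case-insensitive)."""
    lowered = url.lower()
    return any(keyword in lowered for keyword in keywords)
```

A plain substring match flags `https://example.com/soundtrack.mp3` because it contains `track`, and it misses any tracker whose URL avoids the listed words. Flagged rows are therefore a starting point for manual inspection, not a measurement of tracking.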