# Search HAR Capture and CSV Tools

This project contains small tools for collecting browser network data from
search engines and converting the resulting HAR files into CSV datasets.

The intended workflow is:

1. Capture one or more `.har` files with Playwright.
2. Convert the HAR files to CSV.
3. Use the generated CSV files in Excel, Power Query, Power BI, or another
   analysis tool.

## Tools

### `capture_search_har`

Captures search result pages as HAR files.

Supported search engines:

- Google
- DuckDuckGo
- Bing
- Brave Search

Supported browsers:

- Firefox
- Chromium

### `har_entries_to_csv`

Converts HAR files to two CSV files:

- `har_entries.csv`: one row per HAR `log.entries[]` item.
- `har_summary.csv`: one row per HAR file, with aggregated measurements.

## Requirements

Activate the project environment before running the tools:

```bash
source .noroff-env/bin/activate
```

Playwright must be installed in the active environment:

```bash
pip install playwright
playwright install firefox chromium
```

Check that the commands are available:

```bash
which capture_search_har
which har_entries_to_csv
```

## Basic Capture

Run a search with all supported search engines using Firefox:

```bash
capture_search_har \
  --query "weather oslo"
```

Run a search with Chromium:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium
```

Show the browser window during capture:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --headed
```

Use a fixed wait condition and timeout:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --timeout-ms 60000 \
  --headed
```

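For orientation, the options above map roughly onto Playwright's Python API as in the sketch below. This is not the source of `capture_search_har`: the names `SEARCH_URLS`, `search_url`, and `capture` are illustrative, and the URL templates are assumptions about how each engine is queried.

```python
from urllib.parse import quote_plus

# Assumed search URL templates; the real tool may build its URLs differently.
SEARCH_URLS = {
    "google": "https://www.google.com/search?q={q}",
    "duckduckgo": "https://duckduckgo.com/?q={q}",
    "bing": "https://www.bing.com/search?q={q}",
    "brave": "https://search.brave.com/search?q={q}",
}


def search_url(engine: str, query: str) -> str:
    """Return the results-page URL for one engine and query."""
    return SEARCH_URLS[engine].format(q=quote_plus(query))


def capture(engine, query, har_path, browser="firefox", headed=False,
            wait_until="load", timeout_ms=60000, proxy=None):
    """Load one results page and record its network traffic to har_path."""
    # Imported here so the URL helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        launcher = getattr(p, browser)  # p.firefox or p.chromium
        instance = launcher.launch(
            headless=not headed,
            proxy={"server": proxy} if proxy else None,
        )
        context = instance.new_context(record_har_path=har_path)
        page = context.new_page()
        page.goto(search_url(engine, query), wait_until=wait_until,
                  timeout=timeout_ms)
        context.close()  # Playwright writes the HAR file when the context closes
        instance.close()
```

Under these assumptions, `capture("google", "weather oslo", "google.har", browser="chromium", headed=True)` would correspond loosely to the `--browser chromium --headed` example.
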
## Capture Selected Search Engines

Only Google:

```bash
capture_search_har \
  --query "weather oslo" \
  --engines google
```

Google and DuckDuckGo:

```bash
capture_search_har \
  --query "weather oslo" \
  --engines google duckduckgo
```

## Output Directory

Use `--output-dir` to choose where HAR files are written:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --output-dir normal_chromium
```

The generated HAR filenames include timestamp, search engine, and query:

```text
20260508_144327_google_weather_oslo.har
```

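The naming pattern can be reproduced with a few lines of Python. This is a minimal sketch, not the tool's own code: `har_filename` is a hypothetical helper, and the rule for turning the query into a filename-safe slug is an assumption inferred from the example above.

```python
import re
from datetime import datetime


def har_filename(engine: str, query: str, when: datetime) -> str:
    """Build '<timestamp>_<engine>_<query>.har' for one capture."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    # Assumed rule: lowercase the query and collapse non-alphanumerics to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", query.lower()).strip("_")
    return f"{stamp}_{engine}_{slug}.har"


# Reproduces the example filename shown above.
print(har_filename("google", "weather oslo", datetime(2026, 5, 8, 14, 43, 27)))
```
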
## Tor / Proxy Capture

If Tor is available as a SOCKS proxy on `127.0.0.1:9050`:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --timeout-ms 60000 \
  --headed \
  --proxy socks5://127.0.0.1:9050 \
  --output-dir tor_chromium
```

Test Tor before capture:

```bash
curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
```

The response should contain:

```json
"IsTor": true
```

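If the check is scripted, the response body can be validated instead of read by eye. The sketch below only parses a body that has already been fetched (for example, the output of the `curl` command above); `is_tor_response` is a hypothetical helper, and the only field it relies on is the `IsTor` key mentioned above.

```python
import json


def is_tor_response(body: str) -> bool:
    """Return True only if the JSON body reports "IsTor": true."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("IsTor") is True


# Example with a made-up response body (the IP is a documentation address).
print(is_tor_response('{"IsTor": true, "IP": "203.0.113.7"}'))
```

Treating anything other than a literal `true` as a failure is deliberate: an error page or an empty body should stop the capture rather than pass silently.
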
## Convert HAR to CSV

Run from a folder that contains a `data/` directory with HAR files:

```bash
har_entries_to_csv
```

This reads:

```text
data/*.har
```

and writes:

```text
har_entries.csv
har_summary.csv
```

Use a custom input folder:

```bash
har_entries_to_csv \
  --input-dir normal_chromium
```

Use custom output filenames:

```bash
har_entries_to_csv \
  --input-dir normal_chromium \
  --entries-output entries_normal_chromium.csv \
  --summary-output summary_normal_chromium.csv
```

## Recommended Folder Structure

One practical setup is to keep each test condition in its own folder:

```text
work/
  normal_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  normal_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  tor_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  tor_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv
```

Alternatively, write HAR files directly into condition folders and pass those
folders to `har_entries_to_csv` with `--input-dir`.

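With one folder per condition, the per-condition summaries can be combined into a single table for comparison. The following is a sketch under the assumption that each condition folder contains a `har_summary.csv`; `merge_summaries` and the added `condition` column are illustrative names and not part of the tools.

```python
import csv
from pathlib import Path


def merge_summaries(work_dir, output_path):
    """Concatenate <condition>/har_summary.csv files, tagging each row."""
    rows, fieldnames = [], ["condition"]
    for summary in sorted(Path(work_dir).glob("*/har_summary.csv")):
        with summary.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Collect the union of column names across all summaries.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                # The folder name (e.g. "tor_chromium") identifies the condition.
                rows.append({"condition": summary.parent.name, **row})
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `merge_summaries("work", "all_summaries.csv")` would then give one file that Excel or Power BI can filter by condition.
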
## CSV Outputs

### `har_entries.csv`

One row per network request in the HAR file.

Useful for inspecting details such as:

- request URL
- domain
- HTTP method
- status code
- request cookies
- response cookies
- query parameters
- POST data presence
- approximate transfer size

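To make the row structure concrete, the sketch below flattens the `log.entries[]` list of a parsed HAR into one dict per request. It uses standard HAR 1.2 field names, but the output column names and the `entry_rows` helper are assumptions and may not match what `har_entries_to_csv` actually writes.

```python
from urllib.parse import urlsplit


def entry_rows(har: dict, har_filename: str = ""):
    """Yield one flat dict per item in the HAR's log.entries[] list."""
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        url = request.get("url", "")
        yield {
            "har_filename": har_filename,
            "url": url,
            "domain": urlsplit(url).hostname or "",
            "method": request.get("method", ""),
            "status": response.get("status", 0),
            "request_cookies": len(request.get("cookies", [])),
            "response_cookies": len(response.get("cookies", [])),
            "query_params": len(request.get("queryString", [])),
            "has_post_data": "postData" in request,
            # HAR uses -1 for unknown sizes, so clamp to 0 before adding.
            "transfer_bytes_approx": max(response.get("headersSize", 0), 0)
            + max(response.get("bodySize", 0), 0),
        }
```

The rows can be passed to `csv.DictWriter` after loading a HAR file with `json.load`. Because HAR sizes may be reported as unknown, the size column is only an approximation.
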
### `har_summary.csv`

One row per HAR file.

Useful for analysis and visualisation. Important columns include:

- `har_filename`
- `search_engine`
- `query_text`
- `requests_total`
- `unique_domains`
- `third_party_requests`
- `request_cookies_total`
- `response_cookies_total`
- `query_params_total`
- `post_requests_total`
- `tracking_hint_requests`
- `transferred_kb_approx`
- `page_load_ms`
- `status_2xx`
- `status_3xx`
- `status_4xx`
- `status_5xx`

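Several of these columns are simple aggregates over the entries of one HAR file. The sketch below shows one plausible way to compute a subset of them; the `summarize` helper and its `first_party_suffix` parameter are hypothetical, and the third-party rule in particular is an assumption, not the tool's documented definition.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize(har: dict, first_party_suffix: str) -> dict:
    """Aggregate one parsed HAR into a single summary row."""
    entries = har.get("log", {}).get("entries", [])
    domains = [urlsplit(e["request"]["url"]).hostname or "" for e in entries]
    # Bucket status codes by their first digit: 200 -> "2xx", 404 -> "4xx".
    status = Counter(f"{e['response']['status'] // 100}xx" for e in entries)
    return {
        "requests_total": len(entries),
        "unique_domains": len(set(domains)),
        # Assumed rule: a request is third-party when its host is neither the
        # first-party domain nor one of its subdomains.
        "third_party_requests": sum(
            not (d == first_party_suffix or d.endswith("." + first_party_suffix))
            for d in domains
        ),
        "request_cookies_total": sum(
            len(e["request"].get("cookies", [])) for e in entries),
        "response_cookies_total": sum(
            len(e["response"].get("cookies", [])) for e in entries),
        "post_requests_total": sum(
            e["request"].get("method") == "POST" for e in entries),
        "status_2xx": status["2xx"],
        "status_3xx": status["3xx"],
        "status_4xx": status["4xx"],
        "status_5xx": status["5xx"],
    }
```

For a Google capture, `summarize(har, "google.com")` would count requests to `www.google.com` as first-party and everything else as third-party under this rule.
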
## Notes on Interpretation

The HAR files and CSV files show observable browser-side network activity only.
They do not show what a search engine stores internally on the server side.

`tracking_hint_requests` is a keyword-based flag. It is useful for filtering and
inspection, but it is not proof of tracking by itself.

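A keyword-based flag of this kind can be illustrated in a few lines. The keyword list below is hypothetical (the tool's actual keywords are not listed in this README), and `has_tracking_hint` is an illustrative helper; the second call shows why a match is only a hint.

```python
# Hypothetical keyword list; the real tool's keywords are not documented here.
TRACKING_HINTS = ("analytics", "pixel", "beacon", "collect", "telemetry")


def has_tracking_hint(url: str, hints=TRACKING_HINTS) -> bool:
    """Return True if any hint keyword appears anywhere in the URL."""
    lowered = url.lower()
    return any(hint in lowered for hint in hints)


# A typical measurement endpoint is flagged.
print(has_tracking_hint("https://example.com/collect?v=1"))
# So is an unrelated page whose path happens to contain "collect".
print(has_tracking_hint("https://example.com/stamp-collection"))
```
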
`har_entries.csv` may contain sensitive data such as full URLs, cookie values,
headers, and query parameters. Treat it as raw data.

For charts and reporting, `har_summary.csv` is usually the better starting point.

Analysis of Cookie Activity in Web Search Traffic