# NoroffExam

# Search HAR Capture and CSV Tools

This project contains small tools for collecting browser network data from
search engines and converting the resulting HAR files into CSV datasets.

The intended workflow is:

1. Capture one or more `.har` files with Playwright.
2. Convert the HAR files to CSV.
3. Use the generated CSV files in Excel, Power Query, Power BI, or another
   analysis tool.

## Tools

### `capture_search_har`

Captures search result pages as HAR files.

Supported search engines:

- Google
- DuckDuckGo
- Bing
- Brave Search

Supported browsers:

- Firefox
- Chromium

### `har_entries_to_csv`

Converts HAR files to two CSV files:

- `har_entries.csv`: one row per HAR `log.entries[]` item.
- `har_summary.csv`: one row per HAR file, with aggregated measurements.

## Requirements

Activate the project environment before running the tools:

```bash
source .noroff-env/bin/activate
```

Playwright must be installed in the active environment:

```bash
pip install playwright
playwright install firefox chromium
```

Check that the commands are available:

```bash
which capture_search_har
which har_entries_to_csv
```

## Basic Capture

Run a search against all supported search engines with Firefox, which is what
the tool uses when no `--engines` or `--browser` option is given:

```bash
capture_search_har \
  --query "weather oslo"
```

Run a search with Chromium:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium
```

Show the browser window during capture:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --headed
```

Set an explicit wait condition and timeout:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --timeout-ms 60000 \
  --headed
```

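For orientation, the options above map fairly directly onto Playwright's Python API. The following is a minimal sketch under stated assumptions, not the actual implementation of `capture_search_har`: the names `SEARCH_URLS`, `search_url`, and `capture_har`, the engine keys `bing` and `brave`, and the URL templates are all chosen here for illustration.

```python
from urllib.parse import quote_plus

# Assumed URL templates; the real tool may build its URLs differently.
SEARCH_URLS = {
    "google": "https://www.google.com/search?q={q}",
    "duckduckgo": "https://duckduckgo.com/?q={q}",
    "bing": "https://www.bing.com/search?q={q}",
    "brave": "https://search.brave.com/search?q={q}",
}


def search_url(engine: str, query: str) -> str:
    """Return the search URL for one engine with the query URL-encoded."""
    return SEARCH_URLS[engine].format(q=quote_plus(query))


def capture_har(engine, query, har_path, browser_name="firefox",
                wait_until="load", timeout_ms=60000, headed=False, proxy=None):
    """Load one search result page and record its traffic as a HAR file."""
    # Imported here so the URL helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        launch_args = {"headless": not headed}
        if proxy:
            # Same value as --proxy, e.g. "socks5://127.0.0.1:9050".
            launch_args["proxy"] = {"server": proxy}
        browser = getattr(p, browser_name).launch(**launch_args)
        context = browser.new_context(record_har_path=har_path)
        page = context.new_page()
        page.goto(search_url(engine, query),
                  wait_until=wait_until, timeout=timeout_ms)
        context.close()  # Playwright writes the HAR file when the context closes
        browser.close()
```

In this sketch, `capture_har("google", "weather oslo", "google.har", browser_name="chromium", headed=True)` would roughly correspond to the `--browser chromium --headed` example above, restricted to a single engine.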
## Capture Selected Search Engines

Only Google:

```bash
capture_search_har \
  --query "weather oslo" \
  --engines google
```

Google and DuckDuckGo:

```bash
capture_search_har \
  --query "weather oslo" \
  --engines google duckduckgo
```

## Output Directory

Use `--output-dir` to choose where HAR files are written:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --output-dir normal_chromium
```

The generated HAR filenames include timestamp, search engine, and query:

```text
20260508_144327_google_weather_oslo.har
```

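The naming pattern above can be reproduced with a few lines of Python. This is a sketch only: the function name `har_filename` and the exact rule for turning the query into a filename-safe slug are assumptions, chosen so that the example filename above comes out unchanged.

```python
import re
from datetime import datetime


def har_filename(engine: str, query: str, when: datetime) -> str:
    """Build '<timestamp>_<engine>_<query>.har' from its three parts."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    # Assumption: lowercase the query and collapse non-alphanumeric runs to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", query.lower()).strip("_")
    return f"{stamp}_{engine}_{slug}.har"


print(har_filename("google", "weather oslo", datetime(2026, 5, 8, 14, 43, 27)))
# prints 20260508_144327_google_weather_oslo.har
```

Because the timestamp comes first, files sort chronologically in a directory listing, and the engine and query can be recovered later by splitting the name.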
## Tor / Proxy Capture

If Tor is available as a SOCKS proxy on `127.0.0.1:9050`:

```bash
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --timeout-ms 60000 \
  --headed \
  --proxy socks5://127.0.0.1:9050 \
  --output-dir tor_chromium
```

Test Tor before capture:

```bash
curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
```

The response should contain:

```json
"IsTor": true
```

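If the Tor check is scripted rather than read by eye, the response body can be parsed instead of searched as text. The helper below is a small sketch; the name `is_tor` is hypothetical, and it assumes the response is a JSON object with a boolean `IsTor` field, as the output above suggests.

```python
import json


def is_tor(response_text: str) -> bool:
    """Return True only if the JSON body reports IsTor as true."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # An error page or empty body is treated as "not using Tor".
        return False
    return isinstance(payload, dict) and payload.get("IsTor") is True
```

Failing closed on unparseable output is deliberate here: a capture that is meant to go through Tor should not start when the check cannot be confirmed.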
## Convert HAR to CSV

Run from a folder that contains a `data/` directory with HAR files:

```bash
har_entries_to_csv
```

This reads:

```text
data/*.har
```

and writes:

```text
har_entries.csv
har_summary.csv
```

Use a custom input folder:

```bash
har_entries_to_csv \
  --input-dir normal_chromium
```

Use custom output filenames:

```bash
har_entries_to_csv \
  --input-dir normal_chromium \
  --entries-output entries_normal_chromium.csv \
  --summary-output summary_normal_chromium.csv
```

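The input side of the conversion amounts to globbing a directory for `*.har` and parsing each file as JSON. The sketch below illustrates that step only; `load_hars` is a hypothetical helper name, and the real tool may read files differently.

```python
import json
from pathlib import Path


def load_hars(input_dir: str = "data") -> dict:
    """Map each .har filename in input_dir to its parsed JSON document."""
    hars = {}
    for path in sorted(Path(input_dir).glob("*.har")):
        # HAR files are JSON; "utf-8-sig" also tolerates a leading BOM.
        with path.open(encoding="utf-8-sig") as handle:
            hars[path.name] = json.load(handle)
    return hars
```

Calling `load_hars("normal_chromium")` would mirror the `--input-dir normal_chromium` example above.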
## Recommended Folder Structure

One practical setup is to keep each test condition in its own folder:

```text
work/
  normal_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  normal_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  tor_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv

  tor_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv
```

Alternatively, write HAR files directly into condition folders and pass those
folders to `har_entries_to_csv` with `--input-dir`.

## CSV Outputs

### `har_entries.csv`

One row per network request in the HAR file.

Useful for inspecting details such as:

- request URL
- domain
- HTTP method
- status code
- request cookies
- response cookies
- query parameters
- POST data presence
- approximate transfer size

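To make the mapping from HAR to rows concrete, the sketch below flattens one `log.entries[]` item into the kinds of fields listed above. The column names in `FIELDS` and the helper names are assumptions, and the real `har_entries.csv` may use different columns; only the HAR field names (`request.url`, `response.status`, `queryString`, and so on) come from the HAR format itself.

```python
import csv
import io
from urllib.parse import urlsplit

# Assumed column names; the real har_entries.csv may use different ones.
FIELDS = ["url", "domain", "method", "status", "request_cookies",
          "response_cookies", "query_params", "has_post_data", "transfer_bytes"]


def entry_to_row(entry: dict) -> dict:
    """Flatten one HAR log.entries[] item into a flat CSV row."""
    request = entry.get("request", {})
    response = entry.get("response", {})
    url = request.get("url", "")
    # HAR uses -1 for unknown sizes; those are left out of the total.
    sizes = (response.get("headersSize", -1), response.get("bodySize", -1))
    return {
        "url": url,
        "domain": urlsplit(url).hostname or "",
        "method": request.get("method", ""),
        "status": response.get("status", 0),
        "request_cookies": len(request.get("cookies", [])),
        "response_cookies": len(response.get("cookies", [])),
        "query_params": len(request.get("queryString", [])),
        "has_post_data": "postData" in request,
        "transfer_bytes": sum(s for s in sizes if s and s > 0),
    }


def entries_to_csv(har: dict) -> str:
    """Render all entries of a parsed HAR document as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for entry in har.get("log", {}).get("entries", []):
        writer.writerow(entry_to_row(entry))
    return out.getvalue()
```

The transfer size is approximate in this sketch because browsers do not always report header and body sizes, which is why unknown values are skipped rather than counted.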
### `har_summary.csv`

One row per HAR file.

Useful for analysis and visualisation. Important columns include:

- `har_filename`
- `search_engine`
- `query_text`
- `requests_total`
- `unique_domains`
- `third_party_requests`
- `request_cookies_total`
- `response_cookies_total`
- `query_params_total`
- `post_requests_total`
- `tracking_hint_requests`
- `transferred_kb_approx`
- `page_load_ms`
- `status_2xx`
- `status_3xx`
- `status_4xx`
- `status_5xx`

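Several of the summary columns above are simple aggregates over the entries of one HAR file. The sketch below shows how a subset of them could be computed. It is not the tool's actual logic: the function name `summarize`, the `first_party_domain` parameter, and in particular the rule for what counts as a third-party request are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize(har: dict, first_party_domain: str) -> dict:
    """Aggregate one parsed HAR document into a single summary row."""
    entries = har.get("log", {}).get("entries", [])
    domains = [urlsplit(e["request"]["url"]).hostname or "" for e in entries]
    # Bucket status codes by their hundreds digit: 200 -> 2, 404 -> 4.
    classes = Counter(e["response"]["status"] // 100 for e in entries)
    summary = {
        "requests_total": len(entries),
        "unique_domains": len(set(domains)),
        # Assumption: a request is third-party when its host is neither the
        # first-party domain nor one of its subdomains.
        "third_party_requests": sum(
            1 for d in domains
            if d != first_party_domain
            and not d.endswith("." + first_party_domain)
        ),
        "request_cookies_total": sum(
            len(e["request"].get("cookies", [])) for e in entries),
        "response_cookies_total": sum(
            len(e["response"].get("cookies", [])) for e in entries),
        "post_requests_total": sum(
            1 for e in entries if e["request"].get("method") == "POST"),
    }
    for digit in (2, 3, 4, 5):
        summary[f"status_{digit}xx"] = classes.get(digit, 0)
    return summary
```

A suffix match like this is a simplification; it treats every subdomain as first-party and knows nothing about sibling domains owned by the same company, so the resulting count is best read as an estimate.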
## Notes on Interpretation

The HAR files and CSV files show observable browser-side network activity only.
They do not show what a search engine stores internally on the server side.

`tracking_hint_requests` is a keyword-based flag. It is useful for filtering and
inspection, but it is not proof of tracking by itself.

`har_entries.csv` may contain sensitive data such as full URLs, cookie values,
headers, and query parameters. Treat it as raw data.

For charts and reporting, `har_summary.csv` is usually the better starting point.

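The caveat about `tracking_hint_requests` is easy to see in a small example. The sketch below shows what a keyword-based flag might look like. The keyword list and the name `has_tracking_hint` are hypothetical, since the tool's actual keywords are not listed in this README.

```python
# Hypothetical keyword list; the tool's actual keywords are not documented here.
TRACKING_HINTS = ("track", "pixel", "beacon", "analytics", "collect")


def has_tracking_hint(url: str, keywords=TRACKING_HINTS) -> bool:
    """Return True if any keyword occurs in the URL (case-insensitive)."""
    lowered = url.lower()
    return any(keyword in lowered for keyword in keywords)


# A match only marks the request for inspection. This URL is flagged even
# though the word "track" refers to a sport, not to user tracking.
print(has_tracking_hint("https://example.com/sports/track-and-field"))
# prints True
```

Substring matching produces false positives like the one above, and it misses trackers whose URLs contain none of the keywords, so the flag is a filter for manual review rather than a measurement.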
Analysis of Cookie Activity in Web Search Traffic