Compare commits


9 Commits

Author SHA1 Message Date
77be37b8f5 report finished 2026-05-28 17:48:53 +02:00
78ce34d1bc teori 2026-05-22 16:38:58 +02:00
e356879542 ontheway 2026-05-20 08:41:59 +02:00
4b2e1455c9 writingmethod 2026-05-15 14:30:58 +02:00
ffa26797a1 starting method 2026-05-15 09:29:36 +02:00
069b3acb32 ignore 2026-05-14 15:17:49 +02:00
500d67efd2 pngstatus 2026-05-14 15:14:42 +02:00
a9044c3e46 today 2026-05-12 16:30:01 +02:00
8ab1d547ef README 2026-05-11 13:51:35 +02:00
23 changed files with 1707 additions and 143 deletions

5
.gitignore vendored

@@ -9,7 +9,7 @@
# PDF (optional)
*.pdf
!main.pdf
# Temporary
*.blg
*.bbl
@@ -35,6 +35,7 @@ work/*
# But keep shell scripts
!work/*.sh
*.png
*.pptx
.noroff-env

285
README.md

@@ -1,3 +1,284 @@
# NoroffExam
# Search HAR Capture and CSV Tools
This project contains small tools for collecting browser network data from
search engines and converting the resulting HAR files into CSV datasets.
The intended workflow is:
1. Capture one or more `.har` files with Playwright.
2. Convert the HAR files to CSV.
3. Use the generated CSV files in Excel, Power Query, Power BI, or another
analysis tool.
## Tools
### `capture_search_har`
Captures search result pages as HAR files.
Supported search engines:
- Google
- DuckDuckGo
- Bing
- Brave Search
Supported browsers:
- Firefox
- Chromium
### `har_entries_to_csv`
Converts HAR files to two CSV files:
- `har_entries.csv`: one row per HAR `log.entries[]` item.
- `har_summary.csv`: one row per HAR file, with aggregated measurements.
## Requirements
Activate the project environment before running the tools:
```bash
source .noroff-env/bin/activate
```
Playwright must be installed in the active environment:
```bash
pip install playwright
playwright install firefox chromium
```
Check that the commands are available:
```bash
which capture_search_har
which har_entries_to_csv
```
## Basic Capture
Run a search with all supported search engines using Firefox:
```bash
capture_search_har \
--query "weather oslo"
```
Run a search with Chromium:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium
```
Show the browser window during capture:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium \
--headed
```
Use a fixed wait condition and timeout:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium \
--wait-until load \
--timeout-ms 60000 \
--headed
```
## Capture Selected Search Engines
Only Google:
```bash
capture_search_har \
--query "weather oslo" \
--engines google
```
Google and DuckDuckGo:
```bash
capture_search_har \
--query "weather oslo" \
--engines google duckduckgo
```
## Output Directory
Use `--output-dir` to choose where HAR files are written:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium \
--output-dir normal_chromium
```
The generated HAR filenames include timestamp, search engine, and query:
```text
20260508_144327_google_weather_oslo.har
```
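Because the filename carries the timestamp, engine, and query, that metadata can be recovered later without opening the HAR file. The following is a minimal sketch, assuming the `<date>_<time>_<engine>_<query>.har` pattern shown above and engine names without underscores; `parse_har_filename` is a hypothetical helper, not part of the tools.

```python
from datetime import datetime
from pathlib import Path


def parse_har_filename(name: str) -> dict:
    """Split '<date>_<time>_<engine>_<query>.har' into its parts (hypothetical helper)."""
    stem = Path(name).stem
    # The first two underscore-separated fields form the timestamp, the third
    # is the engine, and everything after that is the sanitised query.
    date_part, time_part, engine, query = stem.split("_", 3)
    captured_at = datetime.strptime(f"{date_part}_{time_part}", "%Y%m%d_%H%M%S")
    return {"captured_at": captured_at, "engine": engine, "query": query}


info = parse_har_filename("20260508_144327_google_weather_oslo.har")
print(info["engine"], info["query"], info["captured_at"].isoformat())
```

Note that the query part is the sanitised form (lowercase, underscores instead of spaces), so it may not match the original query text exactly.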
## Tor / Proxy Capture
If Tor is available as a SOCKS proxy on `127.0.0.1:9050`:
```bash
capture_search_har \
--query "weather oslo" \
--browser chromium \
--wait-until load \
--timeout-ms 60000 \
--headed \
--proxy socks5://127.0.0.1:9050 \
--output-dir tor_chromium
```
Test Tor before capture:
```bash
curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
```
The response should contain:
```json
"IsTor": true
```
## Convert HAR to CSV
Run from a folder that contains a `data/` directory with HAR files:
```bash
har_entries_to_csv
```
This reads:
```text
data/*.har
```
and writes:
```text
har_entries.csv
har_summary.csv
```
Use a custom input folder:
```bash
har_entries_to_csv \
--input-dir normal_chromium
```
Use custom output filenames:
```bash
har_entries_to_csv \
--input-dir normal_chromium \
--entries-output entries_normal_chromium.csv \
--summary-output summary_normal_chromium.csv
```
## Recommended Folder Structure
One practical setup is to keep each test condition in its own folder:
```text
work/
  normal_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  normal_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  tor_chromium/
    data/
      *.har
    har_entries.csv
    har_summary.csv
  tor_firefox/
    data/
      *.har
    har_entries.csv
    har_summary.csv
```
Alternatively, write HAR files directly into condition folders and pass those
folders to `har_entries_to_csv` with `--input-dir`.
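With one folder per condition, the folder name itself can serve as the source of the proxy and browser labels during analysis. The sketch below assumes the `<normal|tor>_<browser>` naming used above; `condition_from_folder` and `list_conditions` are hypothetical helpers and not part of the tools.

```python
from pathlib import Path


def condition_from_folder(folder: Path) -> dict:
    """Derive proxy and browser labels from a '<normal|tor>_<browser>' folder name."""
    network, _, browser = folder.name.partition("_")
    return {
        "proxy": "Tor" if network == "tor" else "None",
        "browser": browser.capitalize(),
    }


def list_conditions(work_dir: Path) -> dict:
    """Map each condition folder under work_dir to its labels and HAR file count."""
    conditions = {}
    for folder in sorted(p for p in work_dir.iterdir() if p.is_dir()):
        labels = condition_from_folder(folder)
        # HAR files may sit in '<condition>/data/' or directly in '<condition>/'.
        har_files = list(folder.glob("data/*.har")) + list(folder.glob("*.har"))
        conditions[folder.name] = {**labels, "har_files": len(har_files)}
    return conditions


print(condition_from_folder(Path("work/tor_chromium")))
```

Calling `list_conditions(Path("work"))` before conversion gives a quick check that every condition folder actually contains captures.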
## CSV Outputs
### `har_entries.csv`
One row per network request in the HAR file.
Useful for inspecting details such as:
- request URL
- domain
- HTTP method
- status code
- request cookies
- response cookies
- query parameters
- POST data presence
- approximate transfer size
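To make the "one row per request" idea concrete, the sketch below flattens a single HAR entry into a small dictionary. It is a simplified illustration with a reduced column set, not the converter itself; `SAMPLE_HAR` and `entry_to_row` are hypothetical, and the HAR field names (`log.entries`, `request.cookies`, `request.queryString`, `response.status`) are the ones the converter reads.

```python
import json
from urllib.parse import urlparse

# A made-up, minimal HAR document with a single entry.
SAMPLE_HAR = """
{"log": {"entries": [
  {"request": {"method": "GET",
               "url": "https://example.com/search?q=weather+oslo",
               "cookies": [{"name": "sid", "value": "abc"}],
               "queryString": [{"name": "q", "value": "weather oslo"}]},
   "response": {"status": 200, "cookies": []}}
]}}
"""


def entry_to_row(entry: dict) -> dict:
    """Flatten one log.entries[] item into a flat, CSV-friendly dictionary."""
    request = entry.get("request", {}) or {}
    response = entry.get("response", {}) or {}
    url = request.get("url", "")
    return {
        "method": request.get("method", ""),
        "url": url,
        "domain": urlparse(url).netloc.lower(),
        "status": response.get("status", ""),
        "request_cookie_count": len(request.get("cookies", []) or []),
        "response_cookie_count": len(response.get("cookies", []) or []),
        "query_param_count": len(request.get("queryString", []) or []),
    }


rows = [entry_to_row(e) for e in json.loads(SAMPLE_HAR)["log"]["entries"]]
print(rows[0])
```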
### `har_summary.csv`
One row per HAR file.
Useful for analysis and visualisation. Important columns include:
- `har_filename`
- `search_engine`
- `query_text`
- `requests_total`
- `unique_domains`
- `third_party_requests`
- `request_cookies_total`
- `response_cookies_total`
- `query_params_total`
- `post_requests_total`
- `tracking_hint_requests`
- `transferred_kb_approx`
- `page_load_ms`
- `status_2xx`
- `status_3xx`
- `status_4xx`
- `status_5xx`
## Notes on Interpretation
The HAR files and CSV files show observable browser-side network activity only.
They do not show what a search engine stores internally on the server side.
`tracking_hint_requests` is a keyword-based flag. It is useful for filtering and
inspection, but it is not proof of tracking by itself.
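The reason for this caution is that the flag is a plain substring match, so short keywords also match unrelated words. The sketch below uses a subset of the converter's keyword list; `matched_words` is a hypothetical helper and the URLs are made-up examples.

```python
# Subset of the keyword list used by the converter.
TRACKING_WORDS = ["ads", "analytics", "collect", "gen_204", "log", "track"]


def matched_words(url: str) -> list[str]:
    """Return the keywords that occur anywhere in the URL (substring match)."""
    text = url.lower()
    return [word for word in TRACKING_WORDS if word in text]


# A harmless stylesheet URL is flagged: "ads" is in "downloads", "log" in "catalog".
print(matched_words("https://example.com/downloads/catalog.css"))
# A typical reporting endpoint is flagged for the intended reason.
print(matched_words("https://www.google.com/gen_204?atyp=i"))
```

Looking at which keyword matched, rather than only at the yes/no flag, makes it easier to filter out false positives during inspection.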
`har_entries.csv` may contain sensitive data such as full URLs, cookie values,
headers, and query parameters. Treat it as raw data.
For charts and reporting, `har_summary.csv` is usually the better starting point.
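For a quick look outside Power BI, the summary file can also be aggregated with the Python standard library. This is a minimal sketch: `mean_by_engine` is a hypothetical helper, and the rows in `SAMPLE` are placeholder values for illustration only, not measured results.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Placeholder rows for illustration only; these are not measured results.
SAMPLE = """har_filename,search_engine,tracking_hint_requests
a.har,Google,4
b.har,Google,6
c.har,DuckDuckGo,1
"""


def mean_by_engine(csv_file, column: str) -> dict:
    """Average a numeric summary column per search engine."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["search_engine"]].append(float(row[column]))
    return {engine: mean(values) for engine, values in grouped.items()}


print(mean_by_engine(io.StringIO(SAMPLE), "tracking_hint_requests"))
```

For a real file, pass an open file handle instead of the in-memory sample, for example one opened with `open("har_summary.csv", newline="", encoding="utf-8")`.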
Analysis of Cookie Activity in Web Search Traffic


@@ -28,6 +28,10 @@
}
],
"ltex.language": "en-US",
"ltex.enabled": [
"latex"
],
"latex-workshop.latex.clean.fileTypes": [
"*.aux",
"*.bbl",

BIN
report/main.pdf Normal file

Binary file not shown.


@@ -5,7 +5,8 @@
\usepackage[T1]{fontenc}
\usepackage[english]{babel}
\usepackage{lmodern}
\usepackage{enumitem}
\usepackage{multicol}
\usepackage[a4paper,margin=2.5cm]{geometry}
\usepackage{setspace}
\onehalfspacing
@@ -17,9 +18,9 @@
\usepackage{xcolor}
\usepackage{listings}
\usepackage{lipsum}
\usepackage{float}
\usepackage{cleveref}
\usepackage{csquotes}
\usepackage[backend=biber,style=authoryear]{biblatex}
\addbibresource{references.bib}
@@ -45,13 +46,22 @@
\author{Tord-Vincent Heggland}
\date{\today}
\begin{document} \begin{document}
\pagenumbering{roman}
\maketitle
\begin{abstract}
\label{abs:abstr}
\centering
\lipsum[1]
%This report aims to find trends in Web~Search activity connected to \texttt{tracking\_hints} and cookies. Playwright is used automate the process of performing web-queries. Data is downloaded as \texttt{HAR}, and transformed to \texttt{CSV} files using automation scripts. \texttt{CSV} files are loaded into Power~BI, to be transformed and visualized using Power~Query. It is found that private-focused tools tend to respect privacy in web-queries. And at last, Google leaks all user privacy.
This report aims to find trends in Web~Search activity connected to \texttt{tracking\_hints} and cookies. Playwright is used to automate the process of performing web-queries. Data is downloaded as \texttt{HAR} files and transformed into \texttt{CSV} files using automation scripts. The \texttt{CSV} files are loaded into Power~BI, where the data is transformed and visualized using Power~Query. It is found that privacy-focused tools tend to better respect privacy in web-queries, while Google shows significantly more tracking-related behavior and privacy leakage compared to the other Search~Engines.
\end{abstract}
\clearpage
\tableofcontents
@@ -60,21 +70,19 @@
\setcounter{page}{1} \setcounter{page}{1}
\input{sections/introduction.tex} \input{sections/01_introduction.tex}
\input{sections/method.tex} \input{sections/01A_theory.tex}
\input{sections/02_method.tex}
\input{sections/03_results.tex}
\input{sections/04_discussion.tex}
\input{sections/05_conclusion.tex}
\input{sections/results.tex}
%\begin{figure}[h]
% \centering
% \includegraphics[width=\textwidth]{figures/Figure1.png}
% \caption{Total video game sales by genre in North America (millions of units).}
%\label{fig:Figure1}
%\end{figure}
\input{sections/discussion.tex}
\input{sections/conclusion.tex}
\clearpage
\cite{noauthor_video_nodate}
\clearpage \clearpage
\printbibliography[title={References}] \printbibliography[title={References}]
\clearpage
\appendix
\renewcommand{\thepage}{A-\arabic{page}}
\setcounter{page}{1}
\input{sections/99_appendix.tex}
\end{document} \end{document}


@@ -1,9 +1,83 @@
@online{noauthor_video_nodate,
title = {Video Game Sales Dataset Updated -Extra Feat},
url = {https://www.kaggle.com/datasets/ibriiee/video-games-sales-dataset-2022-updated-extra-feat},
abstract = {Uncover the Gaming Industry Trends with the Most Comprehensive Sales Data},
urldate = {2026-04-21},
langid = {english},
file = {Snapshot:/home/tvh/snap/zotero-snap/common/Zotero/storage/C5LJ5QMG/video-games-sales-dataset-2022-updated-extra-feat.html:text/html},
}
@online{TOR,
title = {About Tor Browser},
url = {https://support.torproject.org/tor-browser/getting-started/about-tor-browser/},
abstract = {Tor Browser is a privacy-focused web browser that routes your traffic through the Tor network, hiding your real {IP} address, preventing tracking, and protecting you against surveillance and censorship. Tor Browser uses the Tor network to protect your privacy and anonymity.},
titleaddon = {Support},
author = {{Tor Project, Inc}},
urldate = {2026-05-15},
langid = {english},
file = {Snapshot:/home/tvh/snap/zotero-snap/common/Zotero/storage/R5P9688K/about-tor-browser.html:text/html},
}
@online{CloudflareProxy,
author = {{Cloudflare}},
title = {What is a reverse proxy? Proxy servers explained},
url = {https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/},
urldate = {2026-05-22}
}
@online{MDNcookies,
author = {{MDN Web Docs}},
title = {Using HTTP cookies},
url = {https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Cookies},
urldate = {2026-05-22}
}
@online{HAR,
title = {Network request list — Firefox Source Docs documentation},
url = {https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/request_list/index.html?utm_source=chatgpt.com},
urldate = {2026-05-15},
file = {Network request list — Firefox Source Docs documentation:/home/tvh/snap/zotero-snap/common/Zotero/storage/P7S338MU/index.html:text/html},
}
@online{Playwright,
title = {Installation {\textbar} Playwright Python},
url = {https://playwright.dev/python/docs/intro},
abstract = {Introduction},
urldate = {2026-05-15},
langid = {english},
file = {Snapshot:/home/tvh/snap/zotero-snap/common/Zotero/storage/M3HT6FNN/intro.html:text/html},
}
@online{VENV,
title = {12. Virtual Environments and Packages},
url = {https://docs.python.org/3/tutorial/venv.html},
abstract = {Introduction: Python applications will often use packages and modules that dont come as part of the standard library. Applications will sometimes need a specific version of a library, because the ...},
titleaddon = {Python documentation},
urldate = {2026-05-15},
langid = {english},
file = {Snapshot:/home/tvh/snap/zotero-snap/common/Zotero/storage/QEN5QM2A/venv.html:text/html},
}
@article{PDF,
title = {Book: Module 7. Lessons and Tasks},
author = {{Noroff}},
langid = {english},
file = {PDF:/home/tvh/snap/zotero-snap/common/Zotero/storage/RVWQE24L/Heggland - Book Module 7. Lessons and Tasks.pdf:application/pdf},
}
@article{heggland_book_nodate-1,
title = {Book: Module 4. Lessons and Tasks},
author = {{Noroff}},
langid = {english},
file = {PDF:/home/tvh/snap/zotero-snap/common/Zotero/storage/YC4C99HY/Heggland - Book Module 4. Lessons and Tasks.pdf:application/pdf},
}
@article{heggland_book_nodate-2,
title = {Book: Module 2. Lessons and Tasks},
author = {{Noroff}},
langid = {english},
file = {PDF:/home/tvh/snap/zotero-snap/common/Zotero/storage/ZUATB293/Heggland - Book Module 2. Lessons and Tasks.pdf:application/pdf},
}
@misc{noroff_modules,
author = {{Noroff}},
title = {Modules 2, 4 and 7: Lessons and Tasks},
year = {2026},
note = {Internal course material used in the Data Analytics programme},
langid = {english},
file = {
PDF:/path/module2.pdf:application/pdf;
PDF:/path/module4.pdf:application/pdf;
PDF:/path/module7.pdf:application/pdf
}
}


@@ -0,0 +1,130 @@
#!/usr/bin/env python3
"""
Capture HAR files for search engine result pages using Playwright.
This script starts a fresh browser context per search engine, navigates to the
configured search URL, and writes one HAR file per engine.
It can use Tor if you pass --proxy socks5://HOST:PORT.
"""
from __future__ import annotations
import argparse
from datetime import datetime
from pathlib import Path
from urllib.parse import quote_plus
from playwright.sync_api import sync_playwright
SEARCH_ENGINES = {
"google": "https://www.google.com/search?q={query}",
"duckduckgo": "https://duckduckgo.com/?q={query}&ia=web",
"bing": "https://www.bing.com/search?q={query}",
"brave": "https://search.brave.com/search?q={query}",
}
def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Capture search result HAR files with Playwright."
    )
    parser.add_argument(
        "--query",
        required=True,
        help="Search query to use, for example: 'migraine symptoms'",
    )
    parser.add_argument(
        "--engines",
        nargs="+",
        default=list(SEARCH_ENGINES),
        choices=sorted(SEARCH_ENGINES),
        help="Search engines to capture. Default: all",
    )
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("data"),
        help="Directory where HAR files are written. Default: data",
    )
    parser.add_argument(
        "--proxy",
        default="",
        help="Optional proxy, for example: socks5://127.0.0.1:9050",
    )
    parser.add_argument(
        "--browser",
        choices=["firefox", "chromium"],
        default="firefox",
        help="Browser engine to use. Default: firefox",
    )
    parser.add_argument(
        "--timeout-ms",
        type=int,
        default=45000,
        help="Navigation timeout in milliseconds. Default: 45000",
    )
    parser.add_argument(
        "--wait-until",
        choices=["load", "domcontentloaded", "networkidle"],
        default="networkidle",
        help="Navigation wait condition. Default: networkidle",
    )
    parser.add_argument(
        "--headed",
        action="store_true",
        help="Show the browser window instead of running headless.",
    )
    return parser.parse_args()


def safe_filename_part(value: str) -> str:
    """Reduce a query to lowercase alphanumerics and underscores for filenames."""
    keep = []
    for char in value.lower():
        if char.isalnum():
            keep.append(char)
        elif char in {" ", "-", "_"}:
            keep.append("_")
    cleaned = "".join(keep).strip("_")
    return cleaned[:80] or "query"


def main() -> None:
    args = parse_args()
    args.output_dir.mkdir(parents=True, exist_ok=True)
    encoded_query = quote_plus(args.query)
    query_part = safe_filename_part(args.query)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    with sync_playwright() as playwright:
        browser_launcher = getattr(playwright, args.browser)
        launch_options = {"headless": not args.headed}
        if args.proxy:
            launch_options["proxy"] = {"server": args.proxy}
        browser = browser_launcher.launch(**launch_options)
        try:
            for engine in args.engines:
                search_url = SEARCH_ENGINES[engine].format(query=encoded_query)
                har_path = args.output_dir / f"{timestamp}_{engine}_{query_part}.har"
                # A fresh context per engine keeps cookies from leaking between engines.
                context = browser.new_context(
                    record_har_path=str(har_path),
                    record_har_content="embed",
                )
                try:
                    page = context.new_page()
                    page.set_default_timeout(args.timeout_ms)
                    page.goto(search_url, wait_until=args.wait_until, timeout=args.timeout_ms)
                finally:
                    # The HAR file is only written when the context is closed,
                    # so close it even if navigation fails or times out.
                    context.close()
                print(f"{engine}: {har_path}")
        finally:
            browser.close()


if __name__ == "__main__":
    main()


@@ -0,0 +1,352 @@
#!/usr/bin/env python3
"""
Convert HAR files to readable CSV files.
Output 1: har_entries.csv
One row per entry in log.entries. This is the most direct way to inspect
the HAR structure: each { ... } inside entries[] becomes one CSV row.
Output 2: har_summary.csv
One row per HAR file with simple totals.
The script does not remove cookie values or URLs. Treat the output as sensitive.
"""
from __future__ import annotations
import argparse
import csv
import json
from pathlib import Path
from urllib.parse import parse_qs, urlparse
ENTRY_FIELDS = [
"har_filename",
"search_engine",
"entry_index",
"startedDateTime",
"time_ms",
"method",
"url",
"domain",
"path",
"query_text",
"status",
"statusText",
"request_cookie_count",
"request_cookie_names",
"request_cookie_values",
"response_cookie_count",
"response_cookie_names",
"response_cookie_values",
"query_param_count",
"query_param_names",
"query_param_values",
"request_header_count",
"response_header_count",
"post_data_present",
"request_body_size",
"response_body_size",
"response_content_size",
"transferred_bytes_approx",
"is_third_party_domain",
"tracking_hint",
]
SUMMARY_FIELDS = [
"har_filename",
"search_engine",
"query_text",
"requests_total",
"unique_domains",
"third_party_requests",
"request_cookies_total",
"response_cookies_total",
"query_params_total",
"post_requests_total",
"tracking_hint_requests",
"transferred_kb_approx",
"page_load_ms",
"status_2xx",
"status_3xx",
"status_4xx",
"status_5xx",
]
TRACKING_WORDS = [
"ads",
"adservice",
"analytics",
"collect",
"conversion",
"doubleclick",
"event",
"gen_204",
"googleadservices",
"improving",
"log",
"metrics",
"pagead",
"telemetry",
"track",
]
# Engine key as used in the HAR filename -> label written to the CSV files.
ENGINE_LABELS = {
    "google": "Google",
    "duckduckgo": "DuckDuckGo",
    "bing": "Bing",
    "brave": "Brave",
}

# Substring that identifies first-party domains for each engine.
MAIN_DOMAINS = {
    "Google": "google.",
    "DuckDuckGo": "duckduckgo.com",
    "Bing": "bing.com",
    "Brave": "brave.com",
}


def detect_search_engine(har_path: Path) -> str:
    name = har_path.stem.lower()
    # capture_search_har names files '<date>_<time>_<engine>_<query>.har', so
    # prefer the engine field over a substring match on the whole name.
    parts = name.split("_", 3)
    if len(parts) >= 3 and parts[2] in ENGINE_LABELS:
        return ENGINE_LABELS[parts[2]]
    for key, label in ENGINE_LABELS.items():
        if key in name:
            return label
    return "Unknown"


def read_har(path: Path) -> dict:
    with path.open(encoding="utf-8", errors="replace") as file:
        return json.load(file)


def entries_from_har(har_data: dict) -> list[dict]:
    return har_data.get("log", {}).get("entries", []) or []


def pages_from_har(har_data: dict) -> list[dict]:
    return har_data.get("log", {}).get("pages", []) or []


def cookie_names(cookies: list[dict]) -> str:
    return "|".join(cookie.get("name", "") for cookie in cookies)


def cookie_values(cookies: list[dict]) -> str:
    return "|".join(cookie.get("value", "") for cookie in cookies)


def query_names(query_items: list[dict]) -> str:
    return "|".join(item.get("name", "") for item in query_items)


def query_values(query_items: list[dict]) -> str:
    return "|".join(item.get("value", "") for item in query_items)


def positive_number(value: object) -> int:
    # HAR uses -1 for unknown sizes, so only positive numbers are counted.
    if isinstance(value, (int, float)) and value > 0:
        return int(value)
    return 0


def approximate_transferred_bytes(entry: dict) -> int:
    request = entry.get("request", {}) or {}
    response = entry.get("response", {}) or {}
    content = response.get("content", {}) or {}
    # Rough estimate only: when both response.bodySize and content.size are
    # reported, the response body is counted twice.
    return (
        positive_number(request.get("headersSize"))
        + positive_number(request.get("bodySize"))
        + positive_number(response.get("headersSize"))
        + positive_number(response.get("bodySize"))
        + positive_number(content.get("size"))
    )


def extract_query_text_from_url(url: str) -> str:
    parsed = urlparse(url)
    query = parse_qs(parsed.query, keep_blank_values=True)
    values = query.get("q", [])
    return values[0] if values else ""


def has_tracking_hint(domain: str, path: str, url: str) -> str:
    # Plain substring match against TRACKING_WORDS; a hint, not proof of tracking.
    text = f"{domain} {path} {url}".lower()
    return "yes" if any(word in text for word in TRACKING_WORDS) else "no"


def max_page_load_ms(entries: list[dict], pages: list[dict]) -> float:
    max_time = 0.0
    for page in pages:
        on_load = (page.get("pageTimings", {}) or {}).get("onLoad", -1)
        if isinstance(on_load, (int, float)) and on_load > max_time:
            max_time = float(on_load)
    for entry in entries:
        entry_time = entry.get("time", -1)
        if isinstance(entry_time, (int, float)) and entry_time > max_time:
            max_time = float(entry_time)
    return max_time


def main_domain_for_engine(search_engine: str) -> str:
    return MAIN_DOMAINS.get(search_engine, "")


def make_entry_rows(har_path: Path) -> list[dict]:
    har_data = read_har(har_path)
    entries = entries_from_har(har_data)
    search_engine = detect_search_engine(har_path)
    main_domain = main_domain_for_engine(search_engine)
    rows = []
    for index, entry in enumerate(entries, start=1):
        request = entry.get("request", {}) or {}
        response = entry.get("response", {}) or {}
        content = response.get("content", {}) or {}
        url = request.get("url", "")
        parsed = urlparse(url)
        request_cookies = request.get("cookies", []) or []
        response_cookies = response.get("cookies", []) or []
        query_items = request.get("queryString", []) or []
        domain = parsed.netloc.lower()
        path = parsed.path
        query_text = extract_query_text_from_url(url)
        third_party = "no"
        if main_domain and domain and main_domain not in domain:
            third_party = "yes"
        rows.append(
            {
                "har_filename": har_path.name,
                "search_engine": search_engine,
                "entry_index": index,
                "startedDateTime": entry.get("startedDateTime", ""),
                "time_ms": entry.get("time", ""),
                "method": request.get("method", ""),
                "url": url,
                "domain": domain,
                "path": path,
                "query_text": query_text,
                "status": response.get("status", ""),
                "statusText": response.get("statusText", ""),
                "request_cookie_count": len(request_cookies),
                "request_cookie_names": cookie_names(request_cookies),
                "request_cookie_values": cookie_values(request_cookies),
                "response_cookie_count": len(response_cookies),
                "response_cookie_names": cookie_names(response_cookies),
                "response_cookie_values": cookie_values(response_cookies),
                "query_param_count": len(query_items),
                "query_param_names": query_names(query_items),
                "query_param_values": query_values(query_items),
                "request_header_count": len(request.get("headers", []) or []),
                "response_header_count": len(response.get("headers", []) or []),
                "post_data_present": "yes" if request.get("postData") else "no",
                "request_body_size": request.get("bodySize", ""),
                "response_body_size": response.get("bodySize", ""),
                "response_content_size": content.get("size", ""),
                "transferred_bytes_approx": approximate_transferred_bytes(entry),
                "is_third_party_domain": third_party,
                "tracking_hint": has_tracking_hint(domain, path, url),
            }
        )
    return rows


def make_summary_row(har_path: Path, entry_rows: list[dict]) -> dict:
    har_data = read_har(har_path)
    entries = entries_from_har(har_data)
    pages = pages_from_har(har_data)
    domains = {row["domain"] for row in entry_rows if row["domain"]}
    status_counts = {2: 0, 3: 0, 4: 0, 5: 0}
    query_text = ""
    for row in entry_rows:
        # Use the first non-empty 'q' parameter as the query text for the file.
        if row["query_text"] and not query_text:
            query_text = row["query_text"]
        status = row["status"]
        if isinstance(status, int):
            group = status // 100
            if group in status_counts:
                status_counts[group] += 1
    transferred_bytes = sum(int(row["transferred_bytes_approx"]) for row in entry_rows)
    return {
        "har_filename": har_path.name,
        "search_engine": detect_search_engine(har_path),
        "query_text": query_text,
        "requests_total": len(entry_rows),
        "unique_domains": len(domains),
        "third_party_requests": sum(
            1 for row in entry_rows if row["is_third_party_domain"] == "yes"
        ),
        "request_cookies_total": sum(int(row["request_cookie_count"]) for row in entry_rows),
        "response_cookies_total": sum(
            int(row["response_cookie_count"]) for row in entry_rows
        ),
        "query_params_total": sum(int(row["query_param_count"]) for row in entry_rows),
        "post_requests_total": sum(1 for row in entry_rows if row["method"] == "POST"),
        "tracking_hint_requests": sum(1 for row in entry_rows if row["tracking_hint"] == "yes"),
        "transferred_kb_approx": round(transferred_bytes / 1024, 2),
        "page_load_ms": round(max_page_load_ms(entries, pages), 2),
        "status_2xx": status_counts[2],
        "status_3xx": status_counts[3],
        "status_4xx": status_counts[4],
        "status_5xx": status_counts[5],
    }


def write_csv(path: Path, fieldnames: list[str], rows: list[dict]) -> None:
    with path.open("w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Convert HAR files to readable CSV files.")
    parser.add_argument(
        "--input-dir",
        type=Path,
        default=Path("data"),
        help="Folder with .har files. Default: data",
    )
    parser.add_argument(
        "--entries-output",
        type=Path,
        default=Path("har_entries.csv"),
        help="CSV with one row per log.entries item. Default: har_entries.csv",
    )
    parser.add_argument(
        "--summary-output",
        type=Path,
        default=Path("har_summary.csv"),
        help="CSV with one row per HAR file. Default: har_summary.csv",
    )
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    har_files = sorted(args.input_dir.glob("*.har"))
    if not har_files:
        raise SystemExit(f"No HAR files found in {args.input_dir}")
    all_entry_rows = []
    summary_rows = []
    for har_path in har_files:
        entry_rows = make_entry_rows(har_path)
        all_entry_rows.extend(entry_rows)
        summary_rows.append(make_summary_row(har_path, entry_rows))
    write_csv(args.entries_output, ENTRY_FIELDS, all_entry_rows)
    write_csv(args.summary_output, SUMMARY_FIELDS, summary_rows)
    print(f"Wrote {len(all_entry_rows)} entry rows to {args.entries_output}")
    print(f"Wrote {len(summary_rows)} summary rows to {args.summary_output}")


if __name__ == "__main__":
    main()

55
report/scripts/many_search.sh Executable file

@@ -0,0 +1,55 @@
#!/usr/bin/env bash
set -euo pipefail
QUERIES=(
"weather oslo"
"migraine symptoms"
"vitamin d deficiency"
"running shoes"
"coffee grinder"
"best laptop for students"
"electric car charging"
"cheap flights to london"
"home insurance"
"python list tutorial"
"banana bread recipe"
"news norway"
)
for query in "${QUERIES[@]}"; do
echo "Running query: $query"
capture_search_har \
--query "$query" \
--browser chromium \
--wait-until load \
--headed \
--output-dir normal_chromium \
--timeout-ms 60000
capture_search_har \
--query "$query" \
--browser chromium \
--wait-until load \
--headed \
--output-dir tor_chromium \
--timeout-ms 60000 \
--proxy socks5://127.0.0.1:9050
capture_search_har \
--query "$query" \
--browser firefox \
--wait-until load \
--headed \
--output-dir tor_firefox \
--timeout-ms 60000 \
--proxy socks5://127.0.0.1:9050
capture_search_har \
--query "$query" \
--browser firefox \
--wait-until load \
--headed \
--output-dir normal_firefox \
--timeout-ms 60000
done


@@ -0,0 +1,82 @@
let
Kilde = Csv.Document(
Web.Contents(
"https://example.sharepoint.com/.../tor_chromium/har_entries.csv"
),
[
Delimiter = ",",
Columns = 30,
QuoteStyle = QuoteStyle.None
]
),
#"Promoted Headers" =
Table.PromoteHeaders(
Kilde,
[PromoteAllScalars = true]
),
#"Changed Column Types" =
Table.TransformColumnTypes(
#"Promoted Headers",
{
{"har_filename", type text},
{"search_engine", type text},
{"entry_index", Int64.Type},
{"startedDateTime", type datetime},
{"time_ms", type text},
{"method", type text},
{"url", type text},
{"domain", type text},
{"path", type text},
{"query_text", type text},
{"status", Int64.Type},
{"statusText", type text},
{"request_cookie_count", Int64.Type},
{"response_cookie_count", Int64.Type},
{"query_param_count", Int64.Type},
{"request_header_count", Int64.Type},
{"response_header_count", Int64.Type},
{"tracking_hint", type text}
},
"en"
),
#"Added Search Engine Column" =
Table.AddColumn(
#"Changed Column Types",
"SearchEngine",
each
if Text.Contains([har_filename], "bing")
then "Bing"
else if Text.Contains([har_filename], "google")
then "Google"
else if Text.Contains([har_filename], "duckduckgo")
then "DuckDuckGo"
else if Text.Contains([har_filename], "brave")
then "Brave"
else "Unknown"
),
#"Added Proxy Column" =
Table.TransformColumnTypes(
Table.AddColumn(
#"Added Search Engine Column",
"Proxy",
each "Tor"
),
{{"Proxy", type text}}
),
#"Added Browser Column" =
Table.TransformColumnTypes(
Table.AddColumn(
#"Added Proxy Column",
"Browser",
each "Chromium"
),
{{"Browser", type text}}
)
in
#"Added Browser Column"


@@ -0,0 +1,9 @@
let
Kilde = Table.Combine({
har_summary_normal_chromium,
har_summary_normal_firefox,
har_summary_tor_chromium,
har_summary_tor_firefox
})
in
Kilde


@@ -0,0 +1,102 @@
\section{Theory\label{sec:theor}}
This section presents the theory needed for this work. It has two parts: one based on Noroff's course material, and one covering the topics needed to follow the rest of the report. \texttt{EDA - Exploratory Data Analysis}, \texttt{VENV - Virtual Environment}, and \texttt{HAR - HTTP Archive} are covered in subsections~\ref{sec:theor:eda}, \ref{sec:theor:venv}, and~\ref{sec:theor:har} respectively, and proxies in subsection~\ref{sec:theor:proxy}.
\subsection{EDA - Exploratory Data Analysis\label{sec:theor:eda}}
Exploratory Data Analysis (EDA) is a crucial part of a data analysis process. It starts from the raw dataset as retrieved and investigates how the data can be presented in a professional and academic way \cite{noroff_modules}.
Before a raw, unfiltered dataset can be visualized, some decisions and restrictions have to be made. Take, for example, a dataset where much of the data is missing and some of the values are misleading. The EDA process prepares such data for visualization. During that process, several questions have to be asked in order to narrow the research topic and the findings to one specific theme.
One way the EDA process may narrow the dataset is to focus on one specific variable and investigate the trends related to it. In this work, for instance, all datasets are filtered to the entries where the variable \texttt{tracking\_hint} equals \texttt{yes}, because the work tries to identify trends in the \texttt{HTTP} requests and responses that are flagged as possibly tracking-related.
The EDA process may also investigate outliers, identified visually, that do not seem to reflect the rest of the dataset, for example entries with far more cookies than other entries. If the outliers do reflect the data correctly, the next question is why the trend differs in those cases.
\subsection{VENV - Virtual Environment\label{sec:theor:venv}}
This subsection covers the theory behind the tooling used in the method. To automate the retrieval of raw data as \texttt{HAR}~files, a Python virtual environment was used to create and run scripts that replace the manual process of saving \texttt{HAR}~files while performing web~queries. More on that in \autoref{sec:metho}.
\begin{lstlisting}[caption={Creating the virtual environment using python for this work}, label={lst:theor:venv}]
cd "$WORKSPACE"
python -m venv .example-env
source .example-env/bin/activate
pip install pytest-playwright
playwright install chromium
playwright install firefox
\end{lstlisting}
\autoref{lst:theor:venv} demonstrates how to create a Python virtual environment in which specific scripts can run with specific packages, and how to install Playwright and two web~browsers for it. Python packages installed into the virtual environment, together with their command-line programs, are stored in the folder \texttt{"\$WORKSPACE"/.example-env/}, which is created by line~2 of \autoref{lst:theor:venv}. The browser binaries downloaded by \texttt{playwright install} are, by default, kept in Playwright's own cache directory rather than in this folder. Once line~3 has been executed, the programs in \texttt{"\$WORKSPACE"/.example-env/bin/} can be run directly from the command~line \autocite{VENV,Playwright}.
\subsection{HAR - HTTP Archive\label{sec:theor:har}}
\texttt{HAR} is short for HTTP Archive. Every time a web-browser is loaded, it performs many thousand requests and returns several responses. Each one is stored in a table in the web~browser, where all parameters may be visualized. It could either be a \texttt{POST} request, or a \texttt{GET} request \autocite{HAR}, either way, each response or request returns as an entry in this table. The table may be downloaded for inspection. All data for the latest web~searches may be found in that table. When \texttt{HAR} table is downloaded, it is not downloaded as a \texttt{CSV}~files, but instead as a \texttt{.har} file. And in order to make use of that dataset, it has to be transformed into something that Power~BI may read. The process of this is explained in \autoref{sec:metho}.
Cookies are essential for many websites to work normally. A cookie is a small piece of data that a website stores in the web~browser and receives back with later requests, for instance login information for the current session. For example, when a customer is logged in to a web shop, the shopping~cart uses cookies to store the current session, so that the next time the customer opens the same website, the same items are still in the shopping~cart \autocite{MDNcookies}. This is convenient in most cases, but cookies can also be used for tracking and for personalizing advertisements in the web~browser.
\subsection{Proxy\label{sec:theor:proxy}}
A proxy acts as an intermediary between a client and the destination website. For every web~query performed, the web traffic goes through another server before it reaches its destination. This helps to hide the user's identity on the web, because the destination can only see the intermediary and not the client itself. A visited website therefore does not see the user's public IP address, only the public IP address of the proxy being used.
This is where Tor comes in. Tor routes the traffic through multiple servers using encryption. Therefore, when a request reaches its destination, its source is very difficult to trace \autocite{TOR,CloudflareProxy}.
% %\subsection{EDA - Exploratory Data Analysis}
% %\subsubsection{Data preprocessing and cleaning}
% %\subsubsection{Data reliability and consistency}
% %\subsubsection{Data visualisation principles}
% %\subsection{Web traffic}
% % HAR files
% %\subsubsection{HTTP requests and responses}
% %
% %\subsubsection{Cookies and tracking parameters}
% %\subsubsection{Search engines and privacy}
% %%%
% \subsection{Web Tracking Technologies}
% \subsubsection{HTTP Requests and Responses}
% \subsubsection{Cookies and Tracking Parameters}
% \subsection{Data Collection and Preprocessing}
% \subsubsection{HAR Files}
% \subsubsection{Data Preprocessing and Cleaning}
% \subsection{Exploratory Data Analysis (EDA)}
% \subsection{Data Visualisation Principles}
% % “data preprocessing” → you actually do the HAR → CSV transformation
% % “EDA” → you create concrete Power BI visualisations
% % “data pipeline” → you automate the whole workflow
% % “data reliability” → you standardise browser/proxy conditions
% % “data collection methodology” → you document the Playwright/Tor setup


@@ -0,0 +1,39 @@
\section{Introduction\label{sec:intro}}
\subsection*{Background}
This project was assigned by Noroff Academy. A student could choose to work in a group or independently. The topic could either be set by a professor or teacher at Noroff Academy, or chosen by the student.
For this project, the student chose to work independently on a self-chosen topic. In the student's experience, work loses its appeal if there is no curiosity driving the project. For that reason, the topic of this work is to find trends in \texttt{tracking\_hints} in web~queries. The student has a strong interest in networking, system administration, and Linux-based environments, built up through personal projects and private experimentation. The topic was chosen to make use of these personal skills in an academic work, and because it is of personal interest to the student, it provides motivation to complete the project. Overall, the main work environment and methods used in this project are taught at Noroff Academy.
As visualized in \autoref{fig:intro:01_thirdparty-domains}, which presents entries returning \texttt{tracking\_hints=yes}, Google was the only Search~Engine that consistently interacted with third-party domains. This finding is significant because Google also heavily used cookies on entries associated with \texttt{tracking\_hints}, as further discussed throughout this paper.
\subsection*{Hypothesis}
The student's hypothesis for this work is that DuckDuckGo and Brave will return fewer instances of \texttt{tracking\_hints} and cookies. On the other hand, the student expects Google and Bing to use cookies alongside \texttt{tracking\_hints}, as these Search~Engines are known for tracking user activity and providing personalised advertisements in the web browser.
\begin{figure}[H]
\centering
\includegraphics[
width=0.95\linewidth,
]{figures/png/01_intro.png}
\caption{Third-party domains across search engines}
\label{fig:intro:01_thirdparty-domains}
\end{figure}


@@ -0,0 +1,203 @@
\section{Method\label{sec:metho}}
%This section describes the methodology used throughout the research process. Some technical concepts and terminology referenced in this section are further explained in the Theory section and later discussed in the Discussion section.
This section describes the methodology used in this research. Technical concepts and terminology referenced in this section are further explained in \autoref{sec:theor} and discussed in \autoref{sec:discu}.
\subsection{Research design\label{sec:metho:research_design}}
% Keywords:
% observational study
% browser-based network measurements
% same searches across search engines
% comparison between search engines, browsers, and network modes
%This research is design using tools to simulate human interaction with simple web searches. Each search is design to be anomynous, with no browser histories and cookies before each search was done. For each web search the browser history is cleaned and cookies removed. Browser profile used has no login data, so the web queries can not connect the search to any real person.
%4 Search Engines are used for this prosess, once in each webbrowser, Firefox and Chromium. Which means each web query is don 8 times. For this work, several queries are create to widen the data collection. The web quires are following:
This research is designed using tools to simulate human interaction with simple web searches. Each search is designed to be anonymous, with no browser history or cookies stored before the search is performed. Before each search, the browser history is cleared and cookies are removed. The browser profiles used contain no login data, preventing the web queries from being connected to any real person.
Four search engines are used in this process (Brave, Bing, DuckDuckGo and Google), once in each web browser: Firefox and Chromium. This means that each web query is performed eight times. For this work, several queries are created to widen the data collection. The web queries are as follows:
\begin{multicols}{2}
\begin{itemize}[noitemsep, topsep=0pt]
\item weather oslo
\item migraine symptoms
\item vitamin d deficiency
\item running shoes
\item coffee grinder
\item best laptop for students
\item electric car charging
\item cheap flights to london
\item home insurance
\item python list tutorial
\item banana bread recipe
\item news norway
\end{itemize}
\end{multicols}
Data collection is performed using either a Tor proxy to help hide the identity of the person performing the web searches, or a normal network connection where web traffic may be used to identify the user.
%ata collected is filtered using either Tor proxy to hide the identity of the person premforing websearch, and not using any proxy, where any web traffic can identify you traffic.
\subsection{Test environment\label{sec:metho:test_environment}}
% Keywords:
% operating system / controlled environment
% Playwright
% Chromium and Firefox
% normal network and Tor proxy
% clean browser context
% cookies allowed
% same wait condition and timeout
%When you tap ctrl + shift + C
Pressing \texttt{Ctrl + Shift + C} in the web~browser and clicking on \texttt{Network} opens a log of the network traffic. With this panel open, the browsing history is cleared before a web~search is performed manually. This gives a clean, anonymous log of the web traffic from a single web~query, as shown in \autoref{fig:metho:manually_har}. To the right of \texttt{"No throttling"} is a settings icon. Clicking on that button gives the options shown in \autoref{fig:metho:export_har}.
Each query could be performed manually, or the data collection could be automated. For this research, the collection of first-hand raw data was automated. The tool used to automate the web~queries is Playwright for Python \parencite{Playwright}, installed in a Python virtual environment \parencite{VENV}. All data collection is done in a Linux shell, and the EDA is done in Microsoft Power~BI.
Once Playwright is installed, the web~browsers of choice can be installed through it from within the virtual environment. Firefox and Chromium were installed for this work. The environment is then ready for retrieving raw, real-world data for this analysis.
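To make the setup more concrete, the following is a minimal sketch of how a single search could be recorded as a \texttt{HAR}~file with Playwright's synchronous Python API. It is not the script used in this work. The function names and the DuckDuckGo query URL are assumptions made for illustration, and the timeout of 60000~ms is an assumed value.

\begin{lstlisting}[language=Python, caption={Sketch: recording one search as a HAR file with Playwright}, basicstyle=\ttfamily\small, breaklines=true]
from urllib.parse import quote_plus


def proxy_settings(proxy_url=None):
    """Return Playwright's proxy option, or None for a direct connection."""
    return {"server": proxy_url} if proxy_url else None


def search_url(query):
    # Assumption: DuckDuckGo's public query URL; other engines differ.
    return "https://duckduckgo.com/?q=" + quote_plus(query)


def capture_har(query, har_path, browser_name="chromium", proxy_url=None):
    """Run one search in a fresh browser context and record it as HAR."""
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser_type = getattr(p, browser_name)  # "chromium" or "firefox"
        browser = browser_type.launch(
            headless=False,  # visible window, like a headed run
            proxy=proxy_settings(proxy_url),
        )
        # A new context starts without cookies, history, or login data.
        context = browser.new_context(record_har_path=har_path)
        page = context.new_page()
        page.goto(search_url(query), wait_until="load", timeout=60000)
        context.close()  # the HAR file is written when the context closes
        browser.close()
\end{lstlisting}

A call such as \texttt{capture\_har("weather oslo", "weather.har", proxy\_url="socks5://127.0.0.1:9050")} would route the search through a local Tor SOCKS proxy, provided that Tor is running on that port.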
\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/09_importing_har_manually.png}
\caption{Network traffic from a simple web search}
\label{fig:metho:manually_har}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[
width=0.27\linewidth,
]{figures/png/10_har_options.png}
\caption{Options for downloading \texttt{HAR} files}
\label{fig:metho:export_har}
\end{figure}
%\subsection{Search engines and search queries\label{sec:metho:search_engines}}
% Keywords:
% Google
% Bing
% DuckDuckGo
% Brave Search
% list of search queries
% same query used across all engines
%\subsection{Variables and measurements\label{sec:metho:Variables_measurements}}
% Keywords:
% requests_total
% unique_domains
% third_party_requests
% request_cookies_total
% response_cookies_total
% query_params_total
% post_requests_total
% tracking_hint_requests
% transferred_kb_approx
% page_load_ms
% HTTP status groups
\subsection{Data collection\label{sec:metho:data_collection}}
Three scripts, two Python files and one Bash file, are used to automate the data collection process. The files can be found in the folder \texttt{./scripts/}. The Bash file (\path{./scripts/many_search.sh}) uses the Python file (\path{./scripts/capture_search_har.py}) to retrieve the data. It essentially loops over each query, each web~browser, and each proxy setting, and stores the results in separate folders.
\begin{lstlisting}[language=bash, caption={Playwright data collection command}, label={lst:metho:playwright_command}, basicstyle=\ttfamily\small, breaklines=true]
capture_search_har \
--query "weather oslo" \
--browser chromium \
--wait-until load \
--headed \
--output-dir tor_chromium \
--proxy socks5://127.0.0.1:9050
\end{lstlisting}
\autoref{lst:metho:playwright_command} is an example of a command that uses the Python script to automate the data retrieval. It opens a web~browser, performs a web~search, and saves the output to a directory of choice, or to the current directory by default. \verb|--query| is the only mandatory input. All other options are optional and fall back to default values. If \verb|--proxy| is not given, the web~search is performed from the desktop's current public IP address. If \verb|--proxy| is specified, the provided value is used as the proxy endpoint, meaning that the search engine only sees, in this instance, the public IP address of the Tor endpoint rather than that of the desktop \parencite{TOR}. \verb|--headed| and \verb|--wait-until load| are important. The first tells Playwright to open a visible browser window instead of running the browser headless, and the second tells Playwright to wait until the page has fired its \texttt{load} event. The remaining inputs are self-explanatory.
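The loop over queries, browsers, and proxy settings can be sketched in Python as follows. This is an illustration and not the Bash script used in this work. It only uses the options shown in \autoref{lst:metho:playwright_command}. Apart from \texttt{tor\_chromium}, the folder names (for example \texttt{normal\_firefox}) are assumptions, and the choice of search engine is left to the defaults of \texttt{capture\_search\_har}.

\begin{lstlisting}[language=Python, caption={Sketch: building one capture command per query, browser, and network mode}, basicstyle=\ttfamily\small, breaklines=true]
from itertools import product

QUERIES = [
    "weather oslo", "migraine symptoms", "vitamin d deficiency",
    "running shoes", "coffee grinder", "best laptop for students",
    "electric car charging", "cheap flights to london", "home insurance",
    "python list tutorial", "banana bread recipe", "news norway",
]
BROWSERS = ["chromium", "firefox"]
# None means a direct connection; the mode names are assumptions.
PROXIES = {"normal": None, "tor": "socks5://127.0.0.1:9050"}


def build_commands():
    """Return one argument list per query, browser, and network mode."""
    commands = []
    cases = product(PROXIES.items(), BROWSERS, QUERIES)
    for (mode, proxy), browser, query in cases:
        cmd = [
            "capture_search_har",
            "--query", query,
            "--browser", browser,
            "--wait-until", "load",
            "--headed",
            "--output-dir", f"{mode}_{browser}",
        ]
        if proxy:
            cmd += ["--proxy", proxy]
        commands.append(cmd)
    return commands
\end{lstlisting}

Each argument list could then be executed with \texttt{subprocess.run(cmd, check=True)}. Passing the arguments as a list avoids quoting problems with queries that contain spaces.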
After retrieving all the HAR files through the automated Python workflow, the dataset is ready for processing.
% Keywords:
% HAR files
% one HAR file per search engine/query/browser/network mode
% capture_search_har script
% headed browser
% wait-until load
% timeout 60000 ms
% Tor via SOCKS proxy where applicable
\subsection{Data processing\label{sec:metho:data_processing}}
This section covers the steps in the data processing pipeline, all leading to the exploratory data analysis (EDA) performed in Microsoft Power~BI. At this point, the dataset is spread across several HAR files, which Power~BI cannot read directly. The section follows the process step by step, from raw data collection to finished visualizations in Power~BI.
\subsubsection{From HAR to CSV files\label{sec:metho:har_to_csv}}
To perform data analysis in Power~BI, the dataset had to be converted from HAR files to CSV files. Once the data had been collected as described in \autoref{sec:metho:data_collection}, the entries had to be prepared before they could be extracted, transformed, and loaded into tables in \autoref{sec:metho:etl}.
A Python script (\path{./scripts/har_entries_to_csv.py}) reads all the \texttt{.har} files in the folder \texttt{./data/} and writes two output files: \texttt{./har\_entries.csv} and \texttt{./har\_summary.csv}. Four of each were created, one for each combination of proxy type and web~browser. The working directory determines which proxy and web~browser a given data collection belongs to. More on that in \autoref{sec:discu}. \texttt{./har\_entries.csv} contains every request from the web~searches as one row, without any aggregation: each entry in a \texttt{.har} file becomes one row in the CSV file. \texttt{./har\_summary.csv} summarises its respective \texttt{./har\_entries.csv} file, with each \texttt{.har} file condensed into a single row.
The file \texttt{./har\_summary.csv} was discarded in favour of \texttt{./har\_entries.csv}, which contains the raw data and is used in the ETL process in Power~BI in \autoref{sec:metho:etl}.
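The idea of writing one CSV row per HAR entry can be sketched as follows. This is a simplified illustration and not the script used in this work. The column names are assumptions and cover only a few of the fields that a HAR entry contains.

\begin{lstlisting}[language=Python, caption={Sketch: flattening HAR entries into CSV rows}, basicstyle=\ttfamily\small, breaklines=true]
import csv
import json

# Column names are illustrative and may differ from the real output.
COLUMNS = ["har_file", "startedDateTime", "method", "url", "status",
           "request_cookies", "response_cookies"]


def entries_to_rows(har_text, har_name):
    """Yield one flat row per entry in a HAR document given as JSON text."""
    har = json.loads(har_text)
    for entry in har["log"]["entries"]:
        request, response = entry["request"], entry["response"]
        yield {
            "har_file": har_name,
            "startedDateTime": entry.get("startedDateTime", ""),
            "method": request.get("method", ""),
            "url": request.get("url", ""),
            "status": response.get("status", ""),
            # HAR stores cookies as lists on the request and the response.
            "request_cookies": len(request.get("cookies", [])),
            "response_cookies": len(response.get("cookies", [])),
        }


def write_entries_csv(rows, handle):
    """Write the rows to an open text file as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
\end{lstlisting}

When writing to disk, the output file should be opened with \texttt{newline=""}, as the Python \texttt{csv} module recommends, so that line endings are not translated twice.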
\subsubsection{ETL process in Power~BI\label{sec:metho:etl}}
ETL stands for Extract, Transform, and Load. Some of the Extract and Transform steps were done in \autoref{sec:metho:har_to_csv}. The dataset is now ready to be loaded and merged in Power~BI.
%The dataset is separated in four \texttt{CSV} files in each folder which represents the case of those entries. For instance the raw data in the case of proxy is Tor, and browser used is Chromium, the location of the dataset is as following: \texttt{./tor_chromium}, as \autoref{lst:metho:playwright_command} indicates.
The dataset is separated into four \texttt{CSV} files, one in each folder, where the folder represents the case of its entries. For instance, if the proxy used is Tor and the browser used is Chromium, the location of the dataset is \texttt{./tor\_chromium/}, as \autoref{lst:metho:playwright_command} indicates. The \texttt{CSV} files all have the same file name, which is generated by the Python script \texttt{har\_entries\_to\_csv}. All \texttt{CSV} files were uploaded to separate folders in the student's private working area on SharePoint. \autoref{lst:appen:pq} in \autoref{sec:appen:pq} shows the complete query used in Power~BI for each instance. When a \texttt{CSV} file is loaded as a source into Power~BI, some automatic formatting is done by Power~BI itself. \autoref{lst:appen:pq} shows the code that Power~BI generates, together with some additional formatting.
As explained, only the folder name describes which proxy and browser were used for each instance. To account for this, each \texttt{./har\_entries.csv} had to be loaded into Power~BI manually. Once a table was loaded, two new columns were added that specify the proxy and browser for its entries. Before merging, each table was named \texttt{har\_entries\_<proxy>\_<browser>} for its respective proxy and browser. After this was done, all four tables could be merged into one main table, named \texttt{har\_entries\_all}, which was used to create the figures for this work. These are presented in \autoref{sec:resul}.
Finally, for all observations, the table \texttt{har\_entries\_all} was filtered so that the variable \texttt{tracking\_hint} equals yes. Every entry that showed no hint of tracking was thereby filtered out.
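The tagging, merging, and filtering steps can be mirrored in a few lines of Python. This sketch illustrates the logic only and is not the Power~Query code used in this work. It assumes that each CSV file has a \texttt{tracking\_hint} column with the values \texttt{yes} and \texttt{no}, and the function names are chosen here for illustration.

\begin{lstlisting}[language=Python, caption={Sketch: tagging, merging, and filtering the four entry tables}, basicstyle=\ttfamily\small, breaklines=true]
import csv


def load_case(handle, proxy, browser):
    """Read one har_entries.csv and tag every row with its case."""
    return [dict(row, Proxy=proxy, Browser=browser)
            for row in csv.DictReader(handle)]


def merge_cases(cases):
    """Append the tagged tables into one table (har_entries_all)."""
    return [row for case in cases for row in case]


def keep_tracking_hints(rows):
    """Keep only the entries where tracking_hint equals yes."""
    return [row for row in rows if row["tracking_hint"] == "yes"]
\end{lstlisting}

Because the \texttt{Proxy} and \texttt{Browser} values are attached before merging, every row in the merged table can still be traced back to the folder it came from.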
% Keywords:
% HAR files converted to CSV
% har_entries.csv: one row per HAR entry/request
% har_summary.csv: one row per HAR file
% Power Query used to combine summary files
% folder names used to identify browser/network mode
\subsection{Limitations of the method\label{sec:metho:limitations}}
Some limitations of this work relate to how the analysis is performed. A retrieved \texttt{.har} file is a deeply nested text file that contains large amounts of noise. The scripts do not always guarantee data consistency, and the output files do not specify which proxy or browser was used.
% Keywords:
% HAR shows observable browser-side traffic only
% cannot prove server-side storage
% Playwright may differ from manual browsing
% Tor may change website behaviour
% cookie consent state affects results
% tracking_hint is keyword-based, not proof of tracking


@@ -0,0 +1,70 @@
\section{Results\label{sec:resul}}
%\includegraphics[width=\linewidth]{figures/pdf/01_3rdparty.pdf}
This section presents the findings of this work. The main focus is the variable \texttt{tracking\_hints}, as the work tries to identify relationships between tracking hints and cookies. Every graph in this work is filtered for \texttt{tracking\_hints=yes}, meaning that the tables retrieved are larger than those visualized here.
\begin{figure}[h]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/02_browser.png}
\caption{Tracking hints in Chromium and Firefox}
\label{fig:resul:02_browser}
\end{figure}
The first figure, \autoref{fig:resul:02_browser}, illustrates the distribution of tracking hints across browsers and search engines. DuckDuckGo appears to be the Search~Engine that catches and identifies the most hints of tracking. Bing and Brave appear to be conservative in addressing tracking hints. Only for DuckDuckGo and Google does the choice of web~browser seem to play a notable role. With Google, Chromium addresses more hints of tracking than Firefox does.
Furthermore, request cookie counts and response cookie counts are presented below and are the main focus of the remaining results.
\begin{figure}[h]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/03_proxy-request.png}
\caption{Proxy mode configuration for request cookies}
\label{fig:resul:03_proxy_request}
\end{figure}
Figures~\ref{fig:resul:03_proxy_request} and~\ref{fig:resul:04_proxy_response} illustrate the cookie counts for each tracking hint. DuckDuckGo and Brave show no cookies across all \texttt{tracking\_hints}. In contrast, only Bing and Google use cookies on entries identified as tracking hints. The number above the name of the Search~Engine indicates the number of cookies, and the left bar indicates how many instances of those are associated with a tracking hint. Both Bing and Google show cookie counts on requests, but only Google shows cookie counts on responses.
Figures~\ref{fig:resul:03_proxy_request} and~\ref{fig:resul:04_proxy_response} also show how using a proxy influences tracking hints on the web. Using Tor as a proxy returns no response cookies on tracking hints, while some cookies still leak through as request cookies associated with tracking hints.
\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/04_proxy-response.png}
\caption{Proxy mode configuration for response cookies}
\label{fig:resul:04_proxy_response}
\end{figure}
Additional visualizations comparing cookies on requests and responses are included in \autoref{sec:appen:figures}.
% \subsection{Dataset overview}
% \subsection{Exploratory Data Analysis}
% \subsection{Cookie activity by search engine}
% \subsection{Third-party request analysis}
% \subsection{Tracking-related domains}
% \subsection{Temporal patterns}
% \subsection{Outliers and anomalies}
% \subsection{Summary of findings}


@@ -0,0 +1,136 @@
\section{Discussion\label{sec:discu}}
Throughout this paper, several statements and choices have been made without being accounted for in earlier sections. This section seeks to explain them and tie up the loose threads: why \texttt{HAR} files were used, the order of the methodology, and each part of it. Lastly, the results are discussed in \autoref{sec:discu:result}.
%%% Differences in in Search Engines and browser
%%% Proxy reduces the cookies left behind on the web. Tracking hints go down.
% Use of AI.
% 4.1 Discussion of Methodology and Theoretical Choices
\subsection{Discussion of Methodology and Theory\label{sec:discu:choices}}
% Here you discuss:
As explained, this work aims to connect the trends between \texttt{tracking\_hint} and cookies in web~queries. There are many ways to approach this. Each web~query had to be performed anonymously, because any cookies stored in a web~browser leak personalized data about the current user when a web~query is performed. To make the data collection as anonymous as possible, a new browser profile was created for this purpose, and the cookies in this profile had to be deleted before each web~query. In addition, all DNS blocking and personalized configuration of the student's private network were disabled for a limited period of time. The student's personal computer was used in this process. The only data that might point to the student is the public IP address. Such IP addresses are dynamically assigned by the ISP, so this is not a critical leakage of data.
The only parameters visualized in Power~BI are those of importance for supporting the student's hypothesis, such as \texttt{Request Cookies}, \texttt{Response Cookies} and \texttt{tracking\_hint}.
\subsubsection*{Dataset parameters}
For this work, only entries where \texttt{tracking\_hints=yes} are inspected. Any entry where \texttt{tracking\_hints=no} is filtered out. This follows from the main problem of this work, which is to find trends in web~queries when an entry hints at tracking. All other entries are therefore of no interest for this work.
The parameter \texttt{tracking\_hint} marks any request or response entry that could be related to tracking. A flagged entry is not necessarily tracking; it is only flagged as possibly being so. The number of \texttt{tracking\_hints} therefore does not reflect how much tracking actually takes place in a given web~query. Instead, if a Search~Engine shows many \texttt{tracking\_hints}, it may reflect that the Search~Engine is wary of anything that hints at tracking, much like a motorcyclist who is wary of every car on the road. This is a good thing, because it allows the user to inspect the entries that hint at tracking and where they originate. What is worth noticing, however, is when an entry with a tracking hint also carries request or response cookies. Cookies hold information about the user's current session in the web~browser. If an entry returns \texttt{tracking\_hints} = yes and no cookies, the server that the request was sent to has received no cookie data about the user through that entry. In contrast, if the entry returns both cookies and \texttt{tracking\_hints} = yes, the server on the other end, which might track the user's activity on the web, may hold personal data about the user from the web~queries.
DuckDuckGo's high occurrence rate of \texttt{tracking\_hints} could be taken to mean that the Search~Engine did not filter out entries that may point to tracking. However, that is not the case: the high rate reflects that it addresses all cases that could be tracking, including those that other Search~Engines did not detect. A high occurrence rate of \texttt{tracking\_hints} therefore does not mean that the user's data is actually being tracked. It is only when an entry has a tracking hint equal to yes and also returns a cookie count that the data could be tracked. In the figures in \autoref{sec:resul}, only Google and Bing return cookies on entries where \texttt{tracking\_hints} is equal to yes. When performing the data collection, the student by habit rejected cookies on Google, and the Search~Engine still returned cookies, while DuckDuckGo and Brave returned no cookies on either requests or responses.
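The distinction between tracking hints with and without cookies can be expressed as a small counting routine. The sketch below is an illustration in Python and not the Power~BI logic used in this work. It assumes rows with the columns \texttt{Search Engine}, \texttt{Request Cookies}, \texttt{Response Cookies} and \texttt{tracking\_hint}, where the two cookie columns hold counts.

\begin{lstlisting}[language=Python, caption={Sketch: counting tracking hints with and without cookies}, basicstyle=\ttfamily\small, breaklines=true]
from collections import Counter


def hints_by_cookie_presence(rows):
    """Count tracking-hint entries per search engine and cookie presence."""
    counts = Counter()
    for row in rows:
        if row["tracking_hint"] != "yes":
            continue  # entries without a tracking hint are ignored
        has_cookies = (int(row["Request Cookies"]) > 0
                       or int(row["Response Cookies"]) > 0)
        counts[(row["Search Engine"], has_cookies)] += 1
    return counts
\end{lstlisting}

A key such as \texttt{("Google", True)} then holds the number of tracking-hint entries for that Search~Engine that also carry at least one cookie, which is the combination that this discussion treats as the most relevant.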
\subsubsection*{Tor proxy}
Tor proxy was used to help hide the user's identity on the web. As explained in \autoref{sec:theor:proxy}, a proxy routes traffic through an intermediary server before accessing the web. Tor proxy tunnels traffic through several relay servers, making direct identification of the user more difficult. This is because the destination web-server does not see the user's original IP address, but instead observes the IP address of the Tor exit relay.
The results show fewer cookie entries for \texttt{tracking\_hints} when a Tor proxy is used. A likely reason is that web servers see the IP address of the Tor exit relay rather than that of the user, so any tracking does not point directly to the user. This makes it harder for malware or unethical hackers to track the user's activity on the web. However, ordinary websites are able to identify Tor exit relays. For example, finn.no blocks incoming requests from Tor exit relays, because traffic from Tor is not treated as trusted. Trusted traffic normally comes from home routers in ordinary households using, for example, Telenor as ISP, since anyone accessing the web through Telenor is assumed to be a person in Norway. Interestingly, even when a Tor proxy is used, only Google and Bing return cookies on requests, and only Google returns cookies on responses as well. This means that Google still uses cookies when the user is behind a proxy.
%%Tor
\subsubsection*{Working environment} %% such as Playwright, why it was used instead of manually performing web queires
%The working environment for this project was segregated into two steps. One for data collection, and one for data analysis. The process of data collection used in this project is out of the scope for the learning materials at Noroff Academy. However, the EDA process plays a crucial role for this process. What parameters to show, how to retrieve data efficiently, etc. The count of different queries is twelve, meaning, it would take too much time resources to collect this data doing it manually. Codex was used to create the automated scripts to automate the process of retrieving data, see sections \ref{sec:appen:auto}. However, the student has investigated the dataset, and anyone can investigate them themselves. After inspecting the automation scripts, it can be confirmed that the only thing it does is to open a web~browser with a qeury using the Search~Engine of choice, and downloading the result into a raw \texttt{HAR} file. It is essensially doing the same as explained in Methodology, visualized in \autoref{fig:metho:manually_har} and \autoref{fig:metho:export_har}, over and over again. Just to confirm, this process is not plagiarism, it is an automated process of collecting a large amount of data. And larger the dataset is, the more trustworthy is the results. Because any outliers or tendencies that do not follow the overall trend tend to be neglected in a larger data collections.
The working environment for this project was divided into two parts: one for data collection and one for data analysis.
\subsubsection*{Working environment - data collection}
The data collection process used in this project is outside the scope of the learning materials at Noroff Academy. However, the EDA process plays a crucial role in this work, for instance in deciding what parameters to show and how to retrieve data efficiently. There are twelve different queries, meaning that it would take too much time and too many resources to collect this data manually. Codex was used to create the scripts that automate the data retrieval, see \autoref{sec:appen:auto}. However, the student has inspected the dataset, and anyone can inspect it themselves. After inspecting the automation scripts, it can be confirmed that the only thing they do is open a web~browser with a query using the Search~Engine of choice and save the result as a raw \texttt{HAR} file. This is essentially the same as the manual process explained in \autoref{sec:metho} and visualized in \autoref{fig:metho:manually_har} and \autoref{fig:metho:export_har}, repeated throughout the data collection process. A larger dataset may reduce the impact of isolated outliers or irregular observations that do not reflect the overall trend.
\subsubsection*{Working environment - data analysis}
For this part, Power~Query is used to perform data analysis of large dataset. With twelve different queries done once in each browser of choice and Search~Engine of choice, the dataset would be too enormous for use in an ordinary Excel sheet. Power~BI is created to handle such large data set. First performing data table convention, and them merging them into one large dataset. When one HAR file could contain entries between 2 and 500, approximately, it is reasonable that such large dataset is too big for Excel to handle. Excel only handles about 200 entries before it starts to lag on a normal student's computer. Power~BI is created to handle such large dataset, and that is the reason for performing data analysis in Power~BI. The automation scripts do not filter for which proxy used and which browser used. This means a student could prompt the automation scripts to use a specific proxy and browser, but the output \texttt{HAR} files do not indicate any proxy nor browser of choice. That is why the \autoref{lst:appen:bash} was created to retrieve raw data in a controlled environment, sorting the results into different folders that segregates any proxy and browser of choice. Each \texttt{HAR} files are stored into a folder \texttt{./data/} inside the case folder because each case created 48 \texttt{HAR} files. By sorting all \texttt{HAR} files into the folder \texttt{./data/} inside each case folder, running the python-scripts became much cleaner when visualizing the content for each case folder and later uploading the data into separate tables in Power~BI.
\autoref{lst:appen:pq} is an example for how one table was read into Power~BI, it was done once for each four cases that make up for using two browsers and two proxies. The methodology is identical for each four cases. \autoref{lst:appen:pq} performs data cleaning and data conversion. Power~BI automatically recognized data type when a new data source is loaded into Power~BI. So the only thing needed to be done was to make sure that all convention is finished, and the duplicate the query, and change data source according to which table to read into Power~BI. For each query, two columns were added. Each to hardcode the Proxy of case, and the Browser of case. This is important, because \autoref{lst:appen:merge} merges all queries into one large table that is used to create all the figures in \autoref{sec:resul}. If the dataset did not have any hardcoded variable for Proxy and Browser, it would not be possible to filter for those variables when creating those figures.
Power~BI allows for only to select variables of interest when creating figures. This feature saves time used for data analysis, because the student could just ignore any parameter of no interest for the case of this work. And only chose parameters that is important for this work when creating figures. For example, parameter such as \texttt{Proxy}, \texttt{Browser}, \texttt{Search Engine} and \texttt{tracking\_hint} are essential for this work. In contrast, parameters such as \texttt{time\_ms} and \texttt{startedDateTime} are not essential for the results.
Due to the size and complexity of the dataset, the scope of the analysis had to be narrowed to selected variables and trends relevant to the research topic. Some parameters would, for example, be interesting to analyse for how they trend with \texttt{tracking\_hints}. However, this work already contains many figures, and adding too many would be overwhelming rather than supporting the overall findings. In future work, it would be interesting to look into the parameter \texttt{is\_third\_party\_domain} and visualize how it trends with other parameters. In this work, \texttt{is\_third\_party\_domain} is only used in \autoref{sec:intro} as an introduction to what the trends may indicate.
% why HAR files were used
% why cookies/\texttt{tracking\_hints} were chosen
% why EDA was relevant
% why visualizations were useful
% why controlled environments were necessary
% proxy/Tor as a methodological choice
% reliability limitations
% preprocessing challenges
% noisy/unstructured data
% This matches the curriculum very well regarding:
\subsection{Discussion of the Results\label{sec:discu:result}}
Now that the dataset, working environment, and parameters have been discussed, the findings and results of this work can be further analyzed. This includes discussing how the findings may affect users and what can be learned from the statistics presented in this work.
All discussion in this section refers to the figures in \autoref{sec:resul}. DuckDuckGo appears to identify many instances of \texttt{tracking\_hints}, while the other Search~Engines do not show as many. However, as discussed in \autoref{sec:discu:choices}, that does not mean that DuckDuckGo leaks user information. It means that DuckDuckGo is the most aggressive Search~Engine in addressing possible hints of tracking. Keep in mind that all these results come only from single web queries, not from browsing over a longer period of time. All queries were also performed with a clean slate of cookies, meaning that the web browser and Search~Engines had no previous data on the user. This is because Playwright starts each new query instance with no previous data, which is more efficient than manually deleting cookies between each query for every Browser and Search~Engine. In total, 192 web queries were performed for this work, which is why the automation scripts came in handy.
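The total follows from the study design, assuming that every combination was run exactly once: twelve queries, four Search~Engines (Google, Bing, DuckDuckGo and Brave), two browsers and two proxy settings give
\[
12 \times 4 \times 2 \times 2 = 192
\]
web queries, or $12 \times 4 = 48$ for each combination of browser and proxy.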
One aspect that can be learned from the results is how data security relates to web tracking and cookies. Google tries to track activity on the web even when the user chooses not to share any cookies, which the student did by habit. The common pattern is that tools that promote security actually work. For instance, \autoref{fig:appen:08_google-cookies} in \autoref{sec:appen:figures} shows that using Firefox reduces cookies on entries with \texttt{tracking\_hints}, and overall Firefox caches fewer \texttt{tracking\_hints} than Chromium. Using a Proxy such as Tor also reduces cookies on \texttt{tracking\_hints}. Overall, only Brave and DuckDuckGo do not return any cookies on \texttt{tracking\_hints}, while Google and Bing do.
% reliability
% data quality
% error handling
% EDA
% communication of findings
% 4.2 Discussion of Results
% Here you discuss:
% what the results actually showed
% differences between Google/Bing and DDG/Brave
% the proxy effect
% \texttt{tracking\_hints} without cookies
% patterns
% anomalies
% real-world implications

@@ -0,0 +1,24 @@
\clearpage
\section{Conclusion\label{sec:concl}}
This section concludes the work presented in this paper.
%From the discussion in \autoref{sec:discu} regarding the findings in \autoref{sec:resul}, it can be concluded that Google appears to be a less privacy-focused Search~Engine when it comes to tracking-related behavior, and Bing follows Google just behind. Brave and DuckDuckGo were the only ones that respected privacy in tracking-related behavior, and DuckDuckGo aggressively identifies more tracking behavior than all the other cookies. After all, there is no privacy leakage related to that due to cookies leakage on DuckDuckGo.
From the discussion in \autoref{sec:discu} regarding the findings in \autoref{sec:resul}, it can be concluded that Google appears to be a less privacy-focused Search~Engine when it comes to tracking-related behavior, with Bing following closely behind. Brave and DuckDuckGo were the only Search~Engines that consistently respected privacy in terms of tracking-related behavior. DuckDuckGo also identified significantly more instances of \texttt{tracking\_hints} than the other Search~Engines. However, despite these detections, DuckDuckGo did not return cookies related to those entries, indicating that identifying potential tracking-related behavior does not necessarily imply privacy leakage through cookies. Using Firefox rather than Chromium, and using a Tor proxy, also reduces instances of cookies related to \texttt{tracking\_hints}.
%The student's hypothesis were partly true. The student though that using tools related to privacy-focus would reduce tracking instances, which was right. However, what surprised is that DuckDuckGo had a high count to \texttt{tracking\_hints}, however no cookies were related to those instaces.
The student's hypothesis was partly correct. The student expected that using privacy-focused tools would reduce tracking-related instances, which proved to be correct. However, it was surprising that DuckDuckGo returned a high count of \texttt{tracking\_hints}, while no cookies were associated with those instances. For future work, it would be relevant to further investigate the parameter \texttt{is\_third\_party\_domain}, as mentioned in \autoref{sec:discu}.
% \subsection{Summary}
% \subsection{Hypothesis evaluation}
% \subsection{Future work}

@@ -0,0 +1,109 @@
\section{Appendices\label{sec:appen}}
\subsection{Additional visualizations of browser-specific cookies\label{sec:appen:figures}}
\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/05_bing-cookies.png}
\caption{Cookies comparison in Bing}
\label{fig:appen:05_bing-cookies}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/06_brave-cookies.png}
\caption{Cookies comparison in Brave}
\label{fig:appen:06_brave-cookies}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/07_ddg-cookies.png}
\caption{Cookies comparison in DuckDuckGo}
\label{fig:appen:07_ddg-cookies}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[
width=\linewidth,
]{figures/png/08_google-cookies.png}
\caption{Cookies comparison in Google}
\label{fig:appen:08_google-cookies}
\end{figure}
\subsection{Automation scripts\label{sec:appen:auto}}
\lstinputlisting[
basicstyle=\ttfamily\scriptsize,
language={},
caption={Bash automation script},
breaklines=true,
label={lst:appen:bash}
]{scripts/many_search.sh}
\lstinputlisting[
basicstyle=\ttfamily\scriptsize,
language={},
caption={Python capture HAR script},
breaklines=true,
label={lst:appen:capture}
]{scripts/capture_search_har.py}
\lstinputlisting[
basicstyle=\ttfamily\scriptsize,
language={},
caption={Python HAR to CSV script},
breaklines=true,
label={lst:appen:hartocsv}
]{scripts/har_entries_to_csv.py}
\subsection{Power Query transformation\label{sec:appen:pq}}
\lstinputlisting[
basicstyle=\ttfamily\scriptsize,
language={},
caption={Power Query ETL script},
breaklines=true,
label={lst:appen:pq}
]{scripts/power_query_etl.txt}
\lstinputlisting[
basicstyle=\ttfamily\scriptsize,
language={},
caption={Power Query merge script},
breaklines=true,
label={lst:appen:merge}
]{scripts/power_query_merge.txt}

View File

@@ -1,8 +0,0 @@
\section{Conclusion\label{sec:concl}}
% \subsection{Summary}
% \subsection{Hypothesis evaluation}
% \subsection{Future work}

View File

@@ -1,10 +0,0 @@
\section{Discussion\label{sec:discu}}
% \subsection{Interpretation of findings}
% \subsection{Privacy implications}
% \subsection{Reliability and limitations}
% \subsection{Ethical considerations}

View File

@@ -1,8 +0,0 @@
\section{Introduction\label{sec:intro}}
\subsection{Background\label{sec:intro:background}}
\subsection{Problem statement\label{sec:intro:statement}}
\subsection{Research objectives\label{sec:intro:research}}
\subsection{Hypotheses\label{sec:intro:hypotheses}}

View File

@@ -1,71 +0,0 @@
\section{Method\label{sec:metho}}
\subsection{Research design\label{sec:metho:research_design}}
% Keywords:
% observational study
% browser-based network measurements
% same searches across search engines
% comparison between search engines, browsers, and network modes
\subsection{Test environment\label{sec:metho:test_environment}}
% Keywords:
% operating system / controlled environment
% Playwright
% Chromium and Firefox
% normal network and Tor proxy
% clean browser context
% cookies allowed
% same wait condition and timeout
\subsection{Search engines and search queries\label{sec:metho:search_engines}}
% Keywords:
% Google
% Bing
% DuckDuckGo
% Brave Search
% list of search queries
% same query used across all engines
\subsection{Variables and measurements\label{sec:metho:Variables_measurements}}
% Keywords:
% requests_total
% unique_domains
% third_party_requests
% request_cookies_total
% response_cookies_total
% query_params_total
% post_requests_total
% tracking_hint_requests
% transferred_kb_approx
% page_load_ms
% HTTP status groups
\subsection{Data collection\label{sec:metho:data_collection}}
% Keywords:
% HAR files
% one HAR file per search engine/query/browser/network mode
% capture_search_har script
% headed browser
% wait-until load
% timeout 60000 ms
% Tor via SOCKS proxy where applicable
\subsection{Data processing\label{sec:metho:data_processing}}
% Keywords:
% HAR files converted to CSV
% har_entries.csv: one row per HAR entry/request
% har_summary.csv: one row per HAR file
% Power Query used to combine summary files
% folder names used to identify browser/network mode
\subsection{Limitations of the method\label{sec:metho:limitations}}
% Keywords:
% HAR shows observable browser-side traffic only
% cannot prove server-side storage
% Playwright may differ from manual browsing
% Tor may change website behaviour
% cookie consent state affects results
% tracking_hint is keyword-based, not proof of tracking

View File

@@ -1,18 +0,0 @@
\section{Results\label{sec:resul}}
% \subsection{Dataset overview}
% \subsection{Exploratory Data Analysis}
% \subsection{Cookie activity by search engine}
% \subsection{Third-party request analysis}
% \subsection{Tracking-related domains}
% \subsection{Temporal patterns}
% \subsection{Outliers and anomalies}
% \subsection{Summary of findings}