\section{Method\label{sec:metho}}

This section describes the methodology used in this research. Technical concepts and terminology referenced in this section are explained further in \autoref{sec:theor} and discussed in \autoref{sec:discu}.

\subsection{Research design\label{sec:metho:research_design}}
% Keywords:
% observational study
% browser-based network measurements
% same searches across search engines
% comparison between search engines, browsers, and network modes

This research uses tools that simulate human interaction with simple web searches. Each search is designed to be anonymous: before each search, the browser history is cleared and cookies are removed. The browser profiles contain no login data, which prevents the web queries from being linked to any real person.

Four search engines are used (Brave, Bing, DuckDuckGo and Google), each in two web browsers, Firefox and Chromium. Each web query is therefore performed eight times. Several queries are used to widen the data collection.
The web queries are as follows:
\begin{multicols}{2}
\begin{itemize}[noitemsep, topsep=0pt]
    \item weather oslo
    \item migraine symptoms
    \item vitamin d deficiency
    \item running shoes
    \item coffee grinder
    \item best laptop for students
    \item electric car charging
    \item cheap flights to london
    \item home insurance
    \item python list tutorial
    \item banana bread recipe
    \item news norway
\end{itemize}
\end{multicols}

Data collection is performed either through a Tor proxy, which helps hide the identity of the person performing the web searches, or over a normal network connection, where web traffic may be used to identify the user.

\subsection{Test environment\label{sec:metho:test_environment}}
% Keywords:
% operating system / controlled environment
% Playwright
% Chromium and Firefox
% normal network and Tor proxy
% clean browser context
% cookies allowed
% same wait condition and timeout

Pressing \texttt{Ctrl + Shift + C} in the browser and selecting the \texttt{Network} tab opens a log of network traffic. With this panel open and the browsing history cleared, a web search is performed manually. This yields a clean, anonymous log of the web traffic from a single web query, as shown in \autoref{fig:metho:manually_har}. To the right of \texttt{"No throttling"} is a settings icon. Clicking that button gives the options shown in \autoref{fig:metho:export_har}.

Each query could be performed manually, or the data collection could be automated. For this research, the collection of first-hand raw data was automated using Python with Playwright \parencite{Playwright}. Playwright is installed in a Python virtual environment \parencite{VENV}. All data collection is done in a Linux shell, and the EDA is done in Microsoft Power~BI.
Once the installation is done, the web browsers of choice can be installed inside the virtual environment. Firefox and Chromium were installed this way. The environment is now ready for retrieving the raw, real-world data used in this analysis.

\begin{figure}[H]
    \centering
    \includegraphics[
        width=\linewidth,
    ]{figures/png/09_importing_har_manually.png}
    \caption{Network traffic from a simple web search.}
    \label{fig:metho:manually_har}
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[
        width=0.27\linewidth,
    ]{figures/png/10_har_options.png}
    \caption{Downloading HAR files.}
    \label{fig:metho:export_har}
\end{figure}

%\subsection{Search engines and search queries\label{sec:metho:search_engines}}
% Keywords:
% Google
% Bing
% DuckDuckGo
% Brave Search
% list of search queries
% same query used across all engines

%\subsection{Variables and measurements\label{sec:metho:Variables_measurements}}
% Keywords:
% requests_total
% unique_domains
% third_party_requests
% request_cookies_total
% response_cookies_total
% query_params_total
% post_requests_total
% tracking_hint_requests
% transferred_kb_approx
% page_load_ms
% HTTP status groups

\subsection{Data collection\label{sec:metho:data_collection}}

Three scripts, two Python files and one Bash file, are used to automate the data collection process. The files can be found in the folder \texttt{./scripts/}. The Bash file (\path{./scripts/many_search.sh}) uses the Python file (\path{./scripts/capture_search_har.py}) to automate the retrieval of data. It loops over each query, each web browser, and each proxy setting, and stores the results in separate folders.
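The nested loop over queries, browsers, and proxy settings can be sketched in Python as follows. This is only an illustration and not the contents of \path{./scripts/many_search.sh}: the flags are those used in \autoref{lst:metho:playwright_command}, while the shortened query list, the \texttt{direct} label for the non-proxy mode, and the folder naming scheme are assumptions made for the example.

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\small, breaklines=true]
import shlex
from itertools import product

# Assumptions for illustration: a shortened query list, and "direct" as
# the (hypothetical) label for searches made without a proxy.
QUERIES = ["weather oslo", "migraine symptoms"]
BROWSERS = ["chromium", "firefox"]
NETWORK_MODES = {"direct": None, "tor": "socks5://127.0.0.1:9050"}


def build_commands(queries, browsers, network_modes):
    """Return one capture_search_har argument list per combination."""
    commands = []
    combos = product(queries, browsers, network_modes.items())
    for query, browser, (mode, proxy) in combos:
        cmd = [
            "capture_search_har",
            "--query", query,
            "--browser", browser,
            "--wait-until", "load",
            "--headed",
            # Assumed folder naming scheme: <mode>_<browser>
            "--output-dir", f"{mode}_{browser}",
        ]
        if proxy is not None:
            # Only add the proxy flag when a proxy endpoint is configured
            cmd += ["--proxy", proxy]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them
    for cmd in build_commands(QUERIES, BROWSERS, NETWORK_MODES):
        print(shlex.join(cmd))
\end{lstlisting}

Printing the commands rather than executing them keeps the sketch free of side effects. Each argument list could instead be passed to \texttt{subprocess.run} to perform the searches.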
\begin{lstlisting}[language=bash, caption={Playwright data collection command}, label={lst:metho:playwright_command}, basicstyle=\ttfamily\small, breaklines=true]
capture_search_har \
  --query "weather oslo" \
  --browser chromium \
  --wait-until load \
  --headed \
  --output-dir tor_chromium \
  --proxy socks5://127.0.0.1:9050
\end{lstlisting}

\autoref{lst:metho:playwright_command} shows an example command that uses the Python script to automate data retrieval. It opens a web browser, performs a web search, and saves the output to a directory of choice, or to the current directory by default. The \verb|--query| argument is the only mandatory input. All other arguments are optional and fall back to default values. If \verb|--proxy| is omitted, the search is performed from the machine's own public IP address. If \verb|--proxy| is specified, the provided value is used as the proxy endpoint, so that the search engine only sees, in this instance, the Tor exit node and its public IP address \parencite{TOR}.

The arguments \verb|--headed| and \verb|--wait-until load| are important. The first tells Playwright to open a visible browser window when performing a web search, rather than running without one. The second tells Playwright to wait until the page has fully loaded. The remaining arguments are self-explanatory. After all the HAR files have been retrieved through the automated Python workflow, the dataset is ready for processing.

% Keywords:
% HAR files
% one HAR file per search engine/query/browser/network mode
% capture_search_har script
% headed browser
% wait-until load
% timeout 60000 ms
% Tor via SOCKS proxy where applicable

\subsection{Data processing\label{sec:metho:data_processing}}

This section covers the main steps in the data processing pipeline, all leading to the exploratory data analysis (EDA) performed in Microsoft Power~BI.
At this stage, the dataset is split across several HAR files, which Microsoft Power~BI cannot read directly. This section describes, step by step, the process from raw data collection to finished, visualized tables in Power~BI.

\subsubsection{From HAR to CSV files\label{sec:metho:har_to_csv}}

To perform the data analysis in Power~BI, the dataset had to be converted from HAR files to CSV files. After the data were collected in \autoref{sec:metho:data_collection}, the entries had to be prepared before they could be extracted, transformed and loaded into tables in \autoref{sec:metho:etl}.

A Python script (\path{./scripts/har_entries_to_csv.py}) reads all the \texttt{.har} files in the folder \texttt{./data/} and writes two output files: \texttt{./har\_entries.csv} and \texttt{./har\_summary.csv}. Four of each were created, one for each combination of proxy type and web browser. The working directory determines which proxy and web browser a given data collection belongs to. More on that in \autoref{sec:discu}.

\texttt{./har\_entries.csv} contains every request from a web search as one row, taken from the raw data without any further processing. \texttt{./har\_summary.csv} summarises its respective \texttt{./har\_entries.csv} file: it takes the input from the \texttt{.har} files and condenses each \texttt{.har} file into a single row.

The file \texttt{./har\_summary.csv} was discarded in favour of \texttt{./har\_entries.csv}, which contains the raw data and is used in the ETL process in Power~BI in \autoref{sec:metho:etl}.

\subsubsection{ETL process in Power~BI\label{sec:metho:etl}}

ETL stands for Extract, Transform, and Load.
Part of the Extract and Transform process was done in \autoref{sec:metho:har_to_csv}. The dataset is now ready to be loaded and merged into Power~BI.

The dataset is separated into four \texttt{CSV} files, each in its own folder, where the folder name identifies the case of those entries. For instance, if the proxy used is Tor and the browser used is Chromium, the dataset is located in \texttt{./tor\_chromium/}, as \autoref{lst:metho:playwright_command} indicates. The \texttt{CSV} files all share the same file name, which is generated by the Python script \texttt{capture\_search\_har}. All \texttt{CSV} files were uploaded to separate folders in the student's private working area on SharePoint.

\autoref{lst:appen:pq} in \autoref{sec:appen:pq} shows the complete query performed in Power~BI for each instance. When a \texttt{CSV} file is loaded as a source into Power~BI, Power~BI applies some automatic formatting. \autoref{lst:appen:pq} shows the code Power~BI generates, together with some additional formatting steps.

As explained, only the folder name indicates which proxy and browser were used for each instance. To account for this, each \texttt{./har\_entries.csv} file had to be loaded manually into Power~BI. Once a table was loaded, two new columns were added that specify the proxy and browser for its entries. After this was done, all four tables could be merged into one main table. Before merging, each table was named as follows: \texttt{har\_entries\_\_} for its respective proxy and browser. Once merged, the new table was named \texttt{har\_entries\_all}, which was then used to create the tables for this work.
Those tables are presented in \autoref{sec:resul}. Finally, for all observations, the table \texttt{har\_entries\_all} was filtered so that the variable \texttt{tracking\_hint} equals \texttt{yes}. This means that every entry without any hint of tracking was filtered out.

% Keywords:
% HAR files converted to CSV
% har_entries.csv: one row per HAR entry/request
% har_summary.csv: one row per HAR file
% Power Query used to combine summary files
% folder names used to identify browser/network mode

\subsection{Limitations of the method\label{sec:metho:limitations}}

Some limitations of this work relate to the process by which the analysis is performed. A retrieved \texttt{.har} file is a text file that is not in tabular form and contains large amounts of noise. The scripts do not always guarantee data consistency. The output files do not specify which proxy or browser was used.

% Keywords:
% HAR shows observable browser-side traffic only
% cannot prove server-side storage
% Playwright may differ from manual browsing
% Tor may change website behaviour
% cookie consent state affects results
% tracking_hint is keyword-based, not proof of tracking
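The last limitation, that the output files do not record the proxy or browser, could in principle be worked around before the data reach Power~BI. The following is a minimal sketch of one such approach and not the procedure used in this work. It assumes that every folder is named after its proxy and browser, joined by an underscore as in \texttt{tor\_chromium}, and that each folder holds a well-formed \texttt{har\_entries.csv}. The function names and the column names \texttt{proxy} and \texttt{browser} are hypothetical.

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\small, breaklines=true]
import csv
from pathlib import Path


def tag_entries(root, filename="har_entries.csv"):
    """Yield every row found under root, tagged with proxy and browser.

    Assumes folders named like "tor_chromium": proxy, underscore, browser.
    """
    for csv_path in sorted(Path(root).glob(f"*_*/{filename}")):
        proxy, browser = csv_path.parent.name.split("_", 1)
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["proxy"] = proxy      # hypothetical column name
                row["browser"] = browser  # hypothetical column name
                yield row


def merge_entries(root, output="har_entries_all.csv"):
    """Write all tagged rows to one CSV file and return the row count."""
    rows = list(tag_entries(root))
    if not rows:
        return 0
    # Union of all column names, in first-seen order
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(output, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
\end{lstlisting}

A single merged file of this kind would already contain the two columns that otherwise have to be added by hand for each table.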