ontheway

2026-05-20 08:41:59 +02:00
parent 4b2e1455c9
commit e356879542
12 changed files with 816 additions and 14 deletions
--- a/report/sections/02_method.tex
+++ b/report/sections/02_method.tex
@@ -2,7 +2,7 @@

 %This section describes the methodology used throughout the research process. Some technical concepts and terminology referenced in this section are further explained in the Theory section and later discussed in the Discussion section.

-This section describes the methodology used in this research. Any technical concepts and terminology references in this section are further explained in Section~\ref{sec:theor}, and discussed in Section~\ref{sec:discu}. 
+This section describes the methodology used in this research. Technical concepts and terminology referenced in this section are explained further in \autoref{sec:theor} and discussed in \autoref{sec:discu}.

 \subsection{Research design\label{sec:metho:research_design}}
 % Stikkord:
@@ -59,11 +59,16 @@ Data collection is performed using either a Tor proxy to help hide the identity


 %When you tap ctrl + shift + C
-When pressing \texttt{Ctrl + Shift + C} and click on \texttt{Network}, a log of Network traffic shows up. This window is open and the web-history is emptied before performing a web-search manually. This process gives a clean anonymous log web traffic from only one web query as known in Figure~\ref{fig:metho:manually_har}. Right to \texttt{"No throttling"} is a settings icon. Clicking on that bottom gives the options on Figure~\ref{fig:metho:export_har}.
-Each query could be done manually, or the processes of collecting data could be automated. For this research the process of collecting first-hand raw data was automated. A tool used to automate web-queries is python using Playwright \parencite{Playwright}. Playwright is installed in a virtual environment packages using python \parencite{VENV} \texttt{script/capture\_search\_har.py}
+Pressing \texttt{Ctrl + Shift + C} and clicking \texttt{Network} opens a log of network traffic. This window is kept open, and the web history is cleared, before a web search is performed manually. This process gives a clean, anonymous log of the web traffic from a single web query, as shown in \autoref{fig:metho:manually_har}. To the right of \texttt{"No throttling"} is a settings icon. Clicking that button gives the options shown in \autoref{fig:metho:export_har}.
+Each query could be done manually, or the process of collecting data could be automated. For this research, the collection of first-hand raw data was automated. The tool used to automate the web queries is Python with Playwright \parencite{Playwright}. Playwright is installed in a Python virtual environment \parencite{VENV}. All data collection is done in a Linux shell, and the EDA is done in Microsoft Power~BI.
+
+Once the installation is done, the web browsers of choice can be installed inside the virtual environment. Firefox and Chromium were installed for this work. The environment is now ready for retrieving raw, real-world data for this analysis.


-\begin{figure}[h]
+
+
+
+\begin{figure}[H]
    \centering
    \includegraphics[
        width=\linewidth,
@@ -105,6 +110,38 @@ Each query could be done manually, or the processes of collecting data could be
 % HTTP status groups

 \subsection{Data collection\label{sec:metho:data_collection}}
+
+
+
+Three scripts, two Python files and one Bash file, are used to automate the data collection process. The files can be found in the folder \texttt{./scripts/}. The Bash file (\path{./scripts/many_search.sh}) uses the Python file (\path{./scripts/capture_search_har.py}) to automate the retrieval of data. It essentially loops over each query, each web browser, and each proxy used, and stores the results in separate folders.
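+
+The nested loop can be sketched as follows. This is a minimal illustration and not the content of \path{./scripts/many_search.sh}: it is written in Python rather than Bash, the lists of queries, browsers, and proxies are placeholder values, and the label \texttt{direct} for runs without a proxy is a hypothetical name. The command-line options are the ones shown in \autoref{lst:metho:playwright_command}.
+
+\begin{lstlisting}[language=Python, caption={Sketch of the loop over queries, browsers, and proxies (illustrative)}, basicstyle=\ttfamily\small, breaklines=true]
+import itertools
+import shlex
+
+# Placeholder inputs (assumption); the real lists live in many_search.sh.
+QUERIES = ["weather oslo"]
+BROWSERS = ["chromium", "firefox"]
+# Folder label -> proxy URL; None means the --proxy option is left out.
+# The label "direct" is a hypothetical name for runs without a proxy.
+PROXIES = {"direct": None, "tor": "socks5://127.0.0.1:9050"}
+
+
+def build_commands():
+    """Return one capture_search_har argument list per combination."""
+    commands = []
+    for query, browser, (label, proxy) in itertools.product(
+        QUERIES, BROWSERS, PROXIES.items()
+    ):
+        cmd = [
+            "capture_search_har",
+            "--query", query,
+            "--browser", browser,
+            "--wait-until", "load",
+            "--headed",
+            # One output folder per proxy/browser pair, e.g. tor_chromium.
+            "--output-dir", f"{label}_{browser}",
+        ]
+        if proxy is not None:
+            cmd += ["--proxy", proxy]
+        commands.append(cmd)
+    return commands
+
+
+if __name__ == "__main__":
+    # Print the commands instead of running them, so the sketch is safe to try.
+    for cmd in build_commands():
+        print(shlex.join(cmd))
+\end{lstlisting}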
+
+
+\begin{lstlisting}[language=bash, caption={Playwright data collection command}, label={lst:metho:playwright_command}, basicstyle=\ttfamily\small, breaklines=true]
+capture_search_har \
+    --query "weather oslo" \
+    --browser chromium \
+    --wait-until load \
+    --headed \
+    --output-dir tor_chromium \
+    --proxy socks5://127.0.0.1:9050
+\end{lstlisting}
+
+\autoref{lst:metho:playwright_command} is an example of a command that uses the Python file to automate data retrieval. It opens a web browser, performs a web search, and saves the output to a directory of choice, or to the current directory by default. The \verb|--query| input is the only mandatory input for this command. All other options are optional and fall back to a default value. The \verb|--proxy| input is optional; if it is not used, the web search is performed from the desktop's current public IP address. If the \verb|--proxy| option is specified, the provided value is used as the proxy endpoint. In this instance, this means that the web search only sees the Tor endpoint and its public IP address \parencite{TOR}. The options \verb|--headed| and \verb|--wait-until load| are important. The first tells Playwright to open a visible browser window when performing a web search, rather than running without one; the second tells Playwright to wait until the page has fully loaded. The rest of the inputs are self-explanatory.
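+
+What such a capture involves can be sketched with Playwright's synchronous Python API. This is a minimal sketch under stated assumptions and not the content of \path{./scripts/capture_search_har.py}: the function name \texttt{capture}, the output file name, and the search URL are hypothetical placeholders, and the mapping of the command-line options to Playwright arguments is assumed.
+
+\begin{lstlisting}[language=Python, caption={Sketch of a single HAR capture with Playwright (illustrative)}, basicstyle=\ttfamily\small, breaklines=true]
+from urllib.parse import quote_plus
+
+from playwright.sync_api import sync_playwright
+
+# Placeholder URL (assumption); replace with the search engine under study.
+SEARCH_URL = "https://example.org/?q={}"
+
+
+def capture(query, har_path, proxy=None):
+    """Run one query in a visible Chromium window and record a HAR file."""
+    with sync_playwright() as p:
+        # headless=False opens a visible window (assumed to match --headed).
+        launch_args = {"headless": False}
+        if proxy:
+            # Assumed to match --proxy, e.g. socks5://127.0.0.1:9050 for Tor.
+            launch_args["proxy"] = {"server": proxy}
+        browser = p.chromium.launch(**launch_args)
+        # All traffic of this browser context is recorded to har_path.
+        context = browser.new_context(record_har_path=har_path)
+        page = context.new_page()
+        # wait_until="load" returns once the page's load event has fired.
+        page.goto(SEARCH_URL.format(quote_plus(query)), wait_until="load")
+        # The HAR file is written to disk when the context is closed.
+        context.close()
+        browser.close()
+
+
+if __name__ == "__main__":
+    capture("weather oslo", "weather_oslo.har",
+            proxy="socks5://127.0.0.1:9050")
+\end{lstlisting}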
+
+After retrieving all the HAR files through the automated Python workflow, the dataset is ready for processing.
+
+
+
+
+
+
+
+
+
+
+
+
+
 % Stikkord:
 % HAR files
 % one HAR file per search engine/query/browser/network mode
@@ -115,6 +152,35 @@ Each query could be done manually, or the processes of collecting data could be
 % Tor via SOCKS proxy where applicable

 \subsection{Data processing\label{sec:metho:data_processing}}
+
+This section contains several important steps in the data processing pipeline, all leading to the exploratory data analysis (EDA) performed in Microsoft Power~BI. At this stage, the dataset is split across several HAR files, which Power~BI cannot read directly. This section is a step-by-step walk-through from raw data collection to finished, visualized tables in Power~BI.
+
+
+
+\subsubsection{From HAR to CSV files\label{sec:metho:har_to_csv}}
+
+In order to perform data analysis in Power~BI, the dataset had to be converted from HAR files to CSV files. Once the data had been collected as described in \autoref{sec:metho:data_collection}, several data entries needed to be prepared before they could be extracted, transformed, and loaded into tables in \autoref{sec:metho:etl}.
+
+A Python script (\path{./scripts/har_entries_to_csv.py}) reads all the \texttt{.har} files in the folder \texttt{./data/} and writes two output files: \texttt{./har\_entries.csv} and \texttt{./har\_summary.csv}. Four of each of these files were created, one for each combination of proxy type and web browser. The working directory determines which proxy and web browser a given data collection belongs to; more on that in \autoref{sec:discu}. \texttt{./har\_entries.csv} contains every request from the web search as one entry, that is, one row in the CSV file. \texttt{./har\_summary.csv} summarizes its respective \texttt{./har\_entries.csv} file: it takes the input from several \texttt{.har} files and summarizes each \texttt{.har} file in one row. In contrast, \texttt{./har\_entries.csv} does not summarize the \texttt{.har} files; it presents each entry in a \texttt{.har} file as one row and does no further data processing.
+
+The file \texttt{./har\_summary.csv} was discarded in favour of \texttt{./har\_entries.csv}. The latter contains the raw data and is used in the ETL process in Power~BI in \autoref{sec:metho:etl}.
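+
+The one-row-per-entry conversion can be sketched with the Python standard library. This is a minimal sketch under stated assumptions and not the content of \path{./scripts/har_entries_to_csv.py}: the selected columns and the function names are illustrative. The field names read from each entry (\texttt{startedDateTime}, \texttt{time}, \texttt{request}, \texttt{response}) follow the HAR format.
+
+\begin{lstlisting}[language=Python, caption={Sketch of a HAR-to-CSV conversion with one row per entry (illustrative)}, basicstyle=\ttfamily\small, breaklines=true]
+import csv
+import json
+from pathlib import Path
+
+# Illustrative column choice (assumption); the real script may export more.
+FIELDS = ["har_file", "started", "method", "url", "status", "time_ms"]
+
+
+def har_rows(har_path):
+    """Yield one dictionary per request/response entry in a HAR file."""
+    with open(har_path, encoding="utf-8") as handle:
+        har = json.load(handle)
+    for entry in har["log"]["entries"]:
+        yield {
+            "har_file": Path(har_path).name,
+            "started": entry["startedDateTime"],
+            "method": entry["request"]["method"],
+            "url": entry["request"]["url"],
+            "status": entry["response"]["status"],
+            "time_ms": entry["time"],
+        }
+
+
+def write_entries(data_dir, csv_path):
+    """Write every entry of every .har file in data_dir to one CSV file."""
+    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
+        writer = csv.DictWriter(handle, fieldnames=FIELDS)
+        writer.writeheader()
+        for har_path in sorted(Path(data_dir).glob("*.har")):
+            writer.writerows(har_rows(har_path))
+
+
+if __name__ == "__main__":
+    write_entries("./data", "./har_entries.csv")
+\end{lstlisting}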
+
+
+
+\subsubsection{ETL process in Power~BI\label{sec:metho:etl}}
+
+ETL stands for Extract, Transform, and Load. Part of the Extract and Transform process was done in \autoref{sec:metho:har_to_csv}. The dataset is now ready to be loaded and merged in Power~BI.
+
+%The dataset is separated in four \texttt{CSV} files in each folder which represents the case of those entries. For instance the raw data in the case of proxy is Tor, and browser used is Chromium, the location of the dataset is as following: \texttt{./tor_chromium}, as \autoref{lst:metho:playwright_command} indicates.
+
+The dataset is separated into four \texttt{CSV} files, one in each folder, where the folder represents the case of those entries. For instance, if the proxy used is Tor and the browser used is Chromium, the location of the dataset is as follows: \texttt{./tor\_chromium/}, as \autoref{lst:metho:playwright_command} indicates. The \texttt{CSV} files all share the same file name, which is generated by the Python script \texttt{capture\_search\_har}. All \texttt{CSV} files were uploaded to separate folders in the student's private working area on SharePoint. \autoref{lst:appen:pq} in \autoref{sec:appen:pq} shows the complete query run in Power~BI for each instance. When a \texttt{CSV} file is loaded as a source into Power~BI, Power~BI applies some automatic formatting itself; \autoref{lst:appen:pq} illustrates the code that Power~BI generates, along with some additional formatting.
+
+
+As explained, only the folder name describes which proxy and which browser were used for each instance. To account for this, each \texttt{./har\_entries.csv} had to be loaded manually into Power~BI. Once a table was loaded, two new columns were added that specify the proxy and browser for its entries. After this was done, all four tables could be merged into one main table. Before merging, each table was named \texttt{har\_entries\_<proxy>\_<browser>} according to its respective proxy and browser. Once merged, the new table was named \texttt{har\_entries\_all}, which was then used to create the tables for this work. Those tables are presented in \autoref{sec:resul}.
+
+
+Finally, for all observations, the table \texttt{har\_entries\_all} was filtered so that the variable \texttt{tracking\_hint} equals yes. This means that every entry without any hint of tracking was filtered out.
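+
+The tagging, merging, and filtering steps can also be expressed outside Power~BI. The following is a minimal Python sketch of the same logic, not the Power Query code used in this work: the label \texttt{direct} for runs without a proxy is a hypothetical name, and the value \texttt{yes} is assumed to be stored as plain text in the \texttt{tracking\_hint} column.
+
+\begin{lstlisting}[language=Python, caption={Sketch of tagging, merging, and filtering the entry tables (illustrative)}, basicstyle=\ttfamily\small, breaklines=true]
+import csv
+
+# Folder names follow the <proxy>_<browser> pattern; "direct" is a
+# hypothetical label for runs without a proxy (assumption).
+CASES = [("tor", "chromium"), ("tor", "firefox"),
+         ("direct", "chromium"), ("direct", "firefox")]
+
+
+def load_all(cases=CASES):
+    """Read each har_entries.csv and tag its rows with proxy and browser."""
+    rows = []
+    for proxy, browser in cases:
+        path = f"./{proxy}_{browser}/har_entries.csv"
+        with open(path, newline="", encoding="utf-8") as handle:
+            for row in csv.DictReader(handle):
+                # The two extra columns carry what only the folder name knew.
+                row["proxy"] = proxy
+                row["browser"] = browser
+                rows.append(row)
+    return rows
+
+
+def tracking_only(rows):
+    """Keep only rows whose tracking_hint column equals "yes"."""
+    return [r for r in rows
+            if (r.get("tracking_hint") or "").strip().lower() == "yes"]
+
+
+if __name__ == "__main__":
+    har_entries_all = load_all()
+    print(len(har_entries_all), len(tracking_only(har_entries_all)))
+\end{lstlisting}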
+
 % Stikkord:
 % HAR files converted to CSV
 % har_entries.csv: one row per HAR entry/request
@@ -125,6 +191,9 @@ Each query could be done manually, or the processes of collecting data could be


 \subsection{Limitations of the method\label{sec:metho:limitations}}
+
+Some limitations of this work relate to the process by which the analysis is performed. When a \texttt{.har} file is retrieved, the text file is unstructured and contains a large amount of noise. The scripts do not always guarantee data consistency. The output files do not specify which proxy or browser was used.
+
 % Stikkord:
 % HAR shows observable browser-side traffic only
 % cannot prove server-side storage