<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>project | Joe Liang</title><link>https://joeliang0520.github.io/tag/project/</link><atom:link href="https://joeliang0520.github.io/tag/project/index.xml" rel="self" type="application/rss+xml"/><description>project</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 20 Apr 2024 00:00:00 +0000</lastBuildDate><image><url>https://joeliang0520.github.io/media/icon_hu00eb1932855fc2e3835b26f5de7e6bcd_204974_512x512_fill_lanczos_center_3.png</url><title>project</title><link>https://joeliang0520.github.io/tag/project/</link></image><item><title>ProtectYourVoice: Detecting AI Generated Voice using Deep Learning</title><link>https://joeliang0520.github.io/project/protectyourvoice/</link><pubDate>Sat, 20 Apr 2024 00:00:00 +0000</pubDate><guid>https://joeliang0520.github.io/project/protectyourvoice/</guid><description>
&lt;details class="toc-inpage d-print-none " open>
&lt;summary class="font-weight-bold">Table of Contents&lt;/summary>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#introduction">Introduction&lt;/a>&lt;/li>
&lt;li>&lt;a href="#methodology">Methodology&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#current-appoarch">Current approach&lt;/a>&lt;/li>
&lt;li>&lt;a href="#pipline">Pipeline&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#import-packages">Import packages&lt;/a>&lt;/li>
&lt;li>&lt;a href="#dataset">Dataset&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#deep-voice-deepfake-voice-recognition-from-kaggle">“DEEP-VOICE: DeepFake Voice Recognition” from Kaggle&lt;/a>&lt;/li>
&lt;li>&lt;a href="#asvspoof-2019-from-datashare">ASVspoof 2019 from DataShare&lt;/a>&lt;/li>
&lt;li>&lt;a href="#unseen-samples-for-model-evaluation">Unseen samples for model evaluation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#model-training-and-eveluation">Model training and evaluation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#baseline-model">Baseline Model&lt;/a>&lt;/li>
&lt;li>&lt;a href="#transfer-learning-pre-train-res-net">Transfer Learning: Pre-train Res-Net&lt;/a>&lt;/li>
&lt;li>&lt;a href="#unseen-samples">Unseen samples&lt;/a>&lt;/li>
&lt;li>&lt;a href="#qualitative-explanation">Qualitative explanation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#discussion">Discussion&lt;/a>&lt;/li>
&lt;li>&lt;a href="#references">References&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/details>
&lt;p>&lt;strong>Disclaimer: This project, developed by Jiazhou Liang, Joe Liu, Tunghoi Yeung, and Yuechen Shi, is provided for non-commercial use. You are welcome to utilize the code for non-commercial purposes. Please note that the datasets incorporated in this project retain their original copyright held by their respective authors, as cited below.&lt;/strong>&lt;/p>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>AI-generated and AI-converted voices have become increasingly popular, but they also create challenges for fraud prevention, for example when someone is impersonated to make fraudulent calls.&lt;/p>
&lt;p>Our deepfake voice detector targets the proliferation of audio-based misinformation and fraud enabled by advances in deep learning. With deepfake technology, individuals can manipulate audio recordings to create convincing fake voices for malicious purposes such as spreading false information, impersonating others, or committing fraud.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://i.insider.com/64949b2c1465b6001998a9e0?width=2000&amp;amp;format=jpeg&amp;amp;auto=webp&amp;amp;quality=90,90" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We think this project is crucial for safeguarding the authenticity and reliability of audio content, thereby preserving trust in communication channels and preventing harm caused by misinformation or malicious activities.&lt;/p>
&lt;h2 id="methodology">Methodology&lt;/h2>
&lt;p>To address this issue, we implement a deep learning model that detects whether a piece of audio comes from a real person or is AI-generated, focusing specifically on fake voices produced by deep learning models, known as audio deepfakes (AD).&lt;/p>
&lt;h3 id="current-appoarch">Current approach&lt;/h3>
&lt;p>Current approaches, such as the work of Bird et al., rely on classical statistical learning models like XGBoost and SVM for this task. Most of these approaches also have limitations: they require special data processing to perform well, are not robust to noise, and are limited to English speech. In addition, the dataset used by Bird et al. is relatively small. We aim to explore how deep learning models perform in this problem domain, with a focus on resolving some of these limitations.&lt;/p>
&lt;h3 id="pipline">Pipeline&lt;/h3>
&lt;p>The proposed approach involves four steps:&lt;/p>
&lt;ul>
&lt;li>Converting each audio clip to a fixed length (30 seconds in this project).&lt;/li>
&lt;li>Converting the raw audio (signal) into a spectrogram, which is a visual representation of its Fourier transformation.&lt;/li>
&lt;li>Applying a CNN model, treating the spectrogram as an RGB image, to find the unique features within it.&lt;/li>
&lt;li>Performing binary classification on these features to detect whether the audio is AI-generated or from a real human.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="pipline.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
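&lt;p>The first two steps of the pipeline can be sketched with NumPy alone. This is a minimal illustration rather than the code used in the project: the helper names &lt;code>fix_length&lt;/code> and &lt;code>log_spectrogram&lt;/code>, the 16 kHz sampling rate, the zero-padding strategy, and the frame and hop sizes are all assumptions chosen for the example.&lt;/p>

```python
import numpy as np

def fix_length(signal, target_len):
    """Zero-pad or truncate a 1-D signal to exactly target_len samples."""
    if len(signal) >= target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - len(signal)))

def log_spectrogram(signal, frame_len=512, hop=256):
    """Log-magnitude short-time Fourier transform.

    Returns an array of shape (frequency bins, time frames).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return (20 * np.log10(magnitude + 1e-10)).T

sample_rate = 16000                       # assumed sampling rate in Hz
t = np.arange(2 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)       # 2-second, 1 kHz test tone
clip = fix_length(tone, 3 * sample_rate)  # 3 seconds here; the project uses 30
spec = log_spectrogram(clip)
print(spec.shape)
```

&lt;p>Each column of &lt;code>spec&lt;/code> is the spectrum of one short window of audio, so the array can be rendered as an image and passed to a CNN in the later steps.&lt;/p>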
&lt;p>We will demonstrate the training process of this proposed approach, including dataset extraction, model selection, and performance evaluation.&lt;/p>
&lt;h2 id="import-packages">Import packages&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># install missing packages in colab&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># for audio segment&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">pip&lt;/span> &lt;span class="n">install&lt;/span> &lt;span class="n">pydub&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># for dataset download&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">pip&lt;/span> &lt;span class="n">install&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">q&lt;/span> &lt;span class="n">kaggle&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#import required packages&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pydub&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">AudioSegment&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">scipy.io&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">wavfile&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">soundfile&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">sf&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">tqdm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">tqdm&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">np&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">pandas&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">pd&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">sklearn.model_selection&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">train_test_split&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.image&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">mpimg&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">shutil&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">location&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># for better visualization&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">warnings&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">warnings&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filterwarnings&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;ignore&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># mount google drive folder&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">drive&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">drive&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mount&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">location&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="dataset">Dataset&lt;/h2>
&lt;p>It is worth noting that there are various AD datasets available.&lt;/p>
&lt;ol>
&lt;li>This project uses the &amp;ldquo;DEEP-VOICE: DeepFake Voice Recognition&amp;rdquo; dataset from Bird et al., published on &lt;a href="https://www.kaggle.com/datasets/birdy654/deep-voice-deepfake-voice-recognition" target="_blank" rel="noopener">Kaggle&lt;/a>. It is a small dataset of 64 raw audio files: 8 real speeches by public figures and 56 fake versions created by converting those speeches into other people&amp;rsquo;s voices using Retrieval-based Voice Conversion.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2039603%2F921dc2241837cd784329955d570f7802%2Fdfcover.png?generation=1692897655324630&amp;amp;alt=media" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ol>
&lt;p>(The illustration above shows how the fake data is generated, according to Bird &amp;amp; Lotfi, 2023.)
Each sample is about 600 seconds long on average. To increase the sample size and shorten each sample, we cut every recording into 30-second clips.&lt;/p>
&lt;ol start="2">
&lt;li>While the Kaggle data contains audio from, or mimicking, a small number of celebrities, we would like our model to perform well on a wide variety of voices. ASVspoof 2019 is the database used for the Third Automatic Speaker Verification Spoofing and Countermeasures Challenge. We use one partition of ASVspoof 2019 (the &amp;ldquo;LA&amp;rdquo; folder in the original dataset), in which the fake audio was generated by text-to-speech or voice conversion systems from speech captured from 107 speakers (46 male, 61 female) reading a list of text corpora. The original dataset is available at &lt;a href="https://datashare.ed.ac.uk/handle/10283/3336" target="_blank" rel="noopener">https://datashare.ed.ac.uk/handle/10283/3336&lt;/a>.&lt;/li>
&lt;/ol>
&lt;p>However, in contrast to the &amp;lsquo;Deep-Voice&amp;rsquo; dataset, most of the ASVspoof clips are only around 2 to 6 seconds long. We therefore concatenate all clips that share the same label (real or fake) and the same speaker. The concatenated audio is then closer in length to the Kaggle recordings, and we can split it into 30-second intervals in the same way.&lt;/p>
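&lt;p>The concatenate-then-split idea can be illustrated on toy data. The sketch below is not the project&amp;rsquo;s preprocessing code: the function names, the speaker IDs, and the use of short integer lists in place of real audio samples are all hypothetical.&lt;/p>

```python
from collections import defaultdict

def concat_by_speaker_and_label(clips):
    """Group clips by (speaker, label) and join their samples in order."""
    grouped = defaultdict(list)
    for speaker, label, samples in clips:
        grouped[(speaker, label)].extend(samples)
    return dict(grouped)

def split_fixed(samples, segment_len):
    """Cut a sequence into consecutive pieces of segment_len items."""
    return [samples[i:i + segment_len]
            for i in range(0, len(samples), segment_len)]

# Toy clips: (hypothetical speaker ID, label, audio samples)
clips = [
    ("spk_A", "real", [1, 2, 3]),
    ("spk_A", "fake", [9, 9]),
    ("spk_A", "real", [4, 5]),
    ("spk_B", "fake", [7, 7, 7, 7]),
]

merged = concat_by_speaker_and_label(clips)
segments = {key: split_fixed(samples, 2) for key, samples in merged.items()}
print(segments[("spk_A", "real")])
```

&lt;p>Because clips are only joined within the same speaker and label, every resulting segment still has a single, unambiguous label.&lt;/p>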
&lt;p>Let us discuss the preprocessing of each dataset separately.&lt;/p>
&lt;h3 id="deep-voice-deepfake-voice-recognition-from-kaggle">“DEEP-VOICE: DeepFake Voice Recognition” from Kaggle&lt;/h3>
&lt;h4 id="download-and-process-dataset">Download and process dataset&lt;/h4>
&lt;p>On the first run, we need to download both datasets, process them, and store the results in Google Drive for future use.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#upload your kaggle api json (https://www.kaggle.com/docs/api#interacting-with-datasets)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">files&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">upload&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">ls&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">lha&lt;/span> &lt;span class="n">kaggle&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">json&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">mkdir&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">p&lt;/span> &lt;span class="o">~/.&lt;/span>&lt;span class="n">kaggle&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">cp&lt;/span> &lt;span class="n">kaggle&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">json&lt;/span> &lt;span class="o">~/.&lt;/span>&lt;span class="n">kaggle&lt;/span>&lt;span class="o">/&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">chmod&lt;/span> &lt;span class="mi">600&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">root&lt;/span>&lt;span class="o">/.&lt;/span>&lt;span class="n">kaggle&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">kaggle&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">json&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">pwd&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">kaggle&lt;/span> &lt;span class="n">datasets&lt;/span> &lt;span class="n">download&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">d&lt;/span> &lt;span class="n">birdy654&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">deep&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">voice&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">deepfake&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">voice&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">recognition&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Unzip and process the dataset:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">unzip&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">content&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">deep&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">voice&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">deepfake&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">voice&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">recognition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">zip&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/KAGGLE/AUDIO/FAKE&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/KAGGLE/AUDIO/REAL&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;Real voice samples: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_files&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;Fake voice samples: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_files&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Real voice samples: 8
Fake voice samples: 56
&lt;/code>&lt;/pre>
&lt;p>It is worth noting that this dataset is imbalanced (far more fake samples than real samples), so we need to take this into account when evaluating the model.&lt;/p>
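&lt;p>To see why the imbalance matters, consider a trivial classifier that labels every sample as fake. The sketch below uses the 8 real and 56 fake counts printed above; the two metric functions are written out by hand for illustration, and the choice of balanced accuracy is an assumption for this example rather than the metric used in the project.&lt;/p>

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

y_true = ["real"] * 8 + ["fake"] * 56   # class counts from the dataset
y_pred = ["fake"] * 64                  # always predict the majority class

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc, bal)  # 0.875 0.5
```

&lt;p>Plain accuracy rewards this useless classifier with 87.5%, while balanced accuracy exposes it as no better than chance.&lt;/p>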
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">audio&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">AudioSegment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">from_wav&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/KAGGLE/AUDIO/REAL/&amp;#39;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">real_files&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">length_audio&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">audio&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;a random sample from real voice has &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">length_audio&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> millisecond&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>a random sample from real voice has 600176 millisecond
&lt;/code>&lt;/pre>
&lt;p>We first need to convert the audio from a waveform into a spectrogram before feeding it into a deep learning model. As mentioned above, each voice sample in the &amp;lsquo;Deep-Voice&amp;rsquo; dataset is relatively long (about 600 seconds) and the overall sample size is small (64 samples). We therefore use a data augmentation technique: cutting each sample into smaller segments, which reduces the number of pixels per sample and increases the sample size.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">## some helper functions&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">remove_all_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">folder_path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">    Remove all old files in a folder and its subfolders.
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">    &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">for&lt;/span> &lt;span class="n">filename&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">folder_path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="n">file_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">folder_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">filename&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">if&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">isfile&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">remove&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="n">remove_all_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">split_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment_length&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">30000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">random_length&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">-&amp;gt;&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">    Split the audio file into segments of given length (in milliseconds).
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">    Returns a list of audio segments.
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">    &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">audio&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">AudioSegment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">from_wav&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">length_audio&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">audio&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="c1"># Random Length: None or [lower, upper]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="c1"># Generating a random segment length from given lower to upper range&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">if&lt;/span> &lt;span class="n">random_length&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="n">current_length&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">while&lt;/span> &lt;span class="n">current_length&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">audio&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="c1"># int() keeps the drawn length a plain scalar instead of a 1-element array&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="n">segment_length&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randint&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">random_length&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">random_length&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="k">if&lt;/span> &lt;span class="n">current_length&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">segment_length&lt;/span> &lt;span class="o">&amp;gt;=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">audio&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">                &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">audio&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">current_length&lt;/span>&lt;span class="p">:]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">                &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">audio&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">current_length&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="n">current_length&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">segment_length&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="n">current_length&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">segment_length&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="n">segments&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="c1"># fix length: cut samples based on given fixed segment_length&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length_audio&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment_length&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">audio&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">segment_length&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">            &lt;span class="n">segments&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">return&lt;/span> &lt;span class="n">segments&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Our audio-cutting function supports both fixed-length cutting (30 seconds by default) and random-length cutting (within a specified range). The random-length mode reflects the fact that real-world voice recordings vary in length, and it leaves room for future expansion. For now, we cut each voice sample into fixed 30-second intervals and convert each interval into a spectrogram.&lt;/p>
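&lt;p>As a quick check of the fixed-length mode, the sketch below reproduces the segment boundaries that &lt;code>range(0, length_audio, segment_length)&lt;/code> yields for the 600176 ms sample printed earlier. The helper &lt;code>segment_bounds&lt;/code> is hypothetical and works on plain integers instead of pydub objects, so it runs without any audio files.&lt;/p>

```python
def segment_bounds(total_ms, segment_ms=30000):
    """Return (start, end) pairs in milliseconds for fixed-length cutting."""
    return [(start, min(start + segment_ms, total_ms))
            for start in range(0, total_ms, segment_ms)]

bounds = segment_bounds(600176)
print(len(bounds))   # 21
print(bounds[0])     # (0, 30000)
print(bounds[-1])    # (600000, 600176)
```

&lt;p>The recording yields 20 full 30-second segments plus a final segment of only 176 ms. Whether to keep, pad, or drop such a short remainder is a design choice worth making explicit before the segments are converted into spectrograms.&lt;/p>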
&lt;h4 id="converting-real-voice">Converting real voice&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># obtain all files from fake and real folders&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/KAGGLE/AUDIO/FAKE&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/KAGGLE/AUDIO/REAL&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># remove old files if any exist&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">remove_all_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># iterate over all audio files in the real folder&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">wav_file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">real_files&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing: &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">output_folder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;real&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># split into 30-second segments&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/KAGGLE/AUDIO/REAL/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_channels&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># convert each segment into a spectrogram (NFFT must be larger than noverlap)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">NFFT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1024&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">noverlap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># save the spectrogram for later use (strip the 4-character &amp;#39;.wav&amp;#39; extension)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Processing: linus-original.wav
Processing: biden-original.wav
Processing: ryan-original.wav
Processing: margot-original.wav
Processing: taylor-original.wav
Processing: musk-original.wav
Processing: obama-original.wav
Processing: trump-original.wav
&lt;/code>&lt;/pre>
&lt;h4 id="convert-fake-voice">Convert fake voice&lt;/h4>
&lt;p>Using the same process as for the real samples, we convert all AI-generated samples into spectrograms of 30-second segments.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># remove old files if any exist&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#remove_all_files(&amp;#39;fake&amp;#39;)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">wav_file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fake_files&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">46&lt;/span>&lt;span class="p">:]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing: &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/KAGGLE/AUDIO/FAKE/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_channels&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">NFFT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1024&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">noverlap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Processing: musk-to-taylor.wav
Processing: margot-to-taylor.wav
Processing: margot-to-ryan.wav
Processing: linus-to-musk.wav
Processing: ryan-to-trump.wav
Processing: taylor-to-biden.wav
Processing: trump-to-linus.wav
Processing: taylor-to-margot.wav
Processing: margot-to-obama.wav
Processing: biden-to-margot.wav
&lt;/code>&lt;/pre>
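&lt;p>To get a feel for what the NFFT and noverlap settings imply, the size of the underlying spectrogram matrix can be estimated by hand. The sketch below is a rough calculation under stated assumptions: a one-sided spectrum of real-valued audio, a segment at least NFFT samples long, and a hypothetical sample rate of 44100 Hz (the actual rate comes from segment.frame_rate). The helper name specgram_shape is made up for illustration.&lt;/p>

```python
def specgram_shape(n_samples, nfft=1024, noverlap=512):
    """Estimate (frequency_bins, time_frames) of a one-sided spectrogram."""
    if noverlap >= nfft:
        # overlapping windows must advance by at least one sample
        raise ValueError("noverlap must be smaller than nfft")
    step = nfft - noverlap                 # hop between consecutive windows
    freq_bins = nfft // 2 + 1              # one-sided spectrum for an even nfft
    time_frames = (n_samples - noverlap) // step
    return freq_bins, time_frames


frame_rate = 44100                         # assumed sample rate, not taken from the dataset
n_samples = 30 * frame_rate                # one 30-second mono segment
print(specgram_shape(n_samples))           # (513, 2582)
```

&lt;p>Note that the saved JPEG does not have these dimensions: savefig renders the matrix into the figure canvas, so the image size depends on the figure size and resolution rather than on the number of bins and frames.&lt;/p>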
&lt;h4 id="some-examples">Some examples&lt;/h4>
&lt;p>Let&amp;rsquo;s look at some examples of real and fake voices as spectrograms.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># spectrograms of the converted segments&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/fake&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Display five random real and fake voice spectrograms&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">count&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">choice&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_spectro&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">real_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/real&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">fake_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;fake voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">count&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="output_24_0.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>(539, 542, 3)
&lt;/code>&lt;/pre>
&lt;p>The resulting spectrogram is stored as an RGB image of 539 × 542 pixels (height × width). We will transform it into a fixed dimension before feeding it to the CNN model.&lt;/p>
&lt;h4 id="traintest-splitting-and-regrouping-samples-into-pytorch-imagefolder-structure">Train/Test splitting and regrouping samples into PyTorch ImageFolder structure&lt;/h4>
&lt;p>At this point, the spectrograms of real and fake voice segments are stored in two separate folders. To use the convenient features of torchvision&amp;rsquo;s ImageFolder, we have to split the dataset into train and test sets and regroup the files into the directory layout it expects, with one subfolder per class label.&lt;/p>
&lt;p>The train/test split happens in this same step, with 20% of the samples held out for testing (a test size of 0.2).&lt;/p>
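&lt;p>Before the notebook code, here is a compact sketch of the target layout, in which each split folder contains one subfolder per label. It uses only the standard library and differs from the notebook in two labeled ways: it splits each class separately (a stratified split) instead of calling train_test_split on the combined list, and it copies files with shutil rather than re-saving them through PIL. The function name split_into_imagefolder and the placeholder files are hypothetical.&lt;/p>

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_into_imagefolder(class_dirs, out_root, test_size=0.2, seed=42):
    """Copy files into out_root/train/LABEL and out_root/test/LABEL.

    class_dirs maps a label (for example '1' for real, '0' for fake)
    to the folder holding that class. Returns the number of files
    copied per (split, label) pair.
    """
    rng = random.Random(seed)
    counts = {}
    for label, src in class_dirs.items():
        files = sorted(p for p in Path(src).iterdir() if p.is_file())
        rng.shuffle(files)
        n_test = round(len(files) * test_size)
        for idx, path in enumerate(files):
            # the first n_test shuffled files of each class go to the test set
            split = "train" if idx >= n_test else "test"
            dest = Path(out_root) / split / label
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, dest / path.name)
            counts[(split, label)] = counts.get((split, label), 0) + 1
    return counts


# Demonstration on placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name, n in (("real", 10), ("fake", 20)):
        (tmp / name).mkdir()
        for i in range(n):
            (tmp / name / f"{name}_{i}.jpg").write_bytes(b"placeholder")
    counts = split_into_imagefolder(
        {"1": tmp / "real", "0": tmp / "fake"}, tmp / "dataset"
    )
    print(counts)
```

&lt;p>Splitting per class keeps the real/fake ratio the same in both sets, which can matter when one class has far fewer segments than the other.&lt;/p>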
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">sklearn.model_selection&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">train_test_split&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">PIL&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Image&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># if Google Drive is mounted, store the dataset there for later use&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;drive&amp;#39;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">location&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># otherwise, store it in the runtime (the data must be downloaded and processed again every session)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">location&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;train&amp;#39;&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Create the directory.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;train&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#real samples&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;train&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;1&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#fake samples&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;train&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;0&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># combine all spectrogram&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">all_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">real_spectro&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">fake_spectro&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># create labels: 1 = real, 0 = fake&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">all_spectro_label&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_spectro&lt;/span>&lt;span class="p">))]&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_spectro&lt;/span>&lt;span class="p">))]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#split&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">X_train&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">X_test&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y_train&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y_test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train_test_split&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">all_spectro&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">all_spectro_label&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">test_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">0.2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">random_state&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">42&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">X_train&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">y_train&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">destination_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;train&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">X_train&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="n">Image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">open&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/fake/&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/real/&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">X_train&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">])&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">img&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">save&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">destination_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;test&amp;#39;&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;test&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;test&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;1&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;test&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;0&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y_test&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">y_test&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">destination_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;test&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">X_test&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="n">Image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">open&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/fake/&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/real/&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">X_test&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">])&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">img&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">save&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">destination_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>After cutting the audio, converting it into spectrograms, and reorganizing the file structure, we obtain a cleaned &amp;lsquo;Deep-Voice&amp;rsquo; dataset that is ready for model training. The cleaned dataset is stored in the &amp;lsquo;kaggle&amp;rsquo; folder of the shared Google Drive.&lt;/p>
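&lt;p>The train and test loops above differ only in the split name and the lists they read, so the same re-ordering could be expressed as one helper. The sketch below is an illustration under assumptions, not the code we ran: copy_split and its parameters are hypothetical names, and it copies files with shutil.copy2 instead of opening and re-saving them with Image.&lt;/p>

```python
import os
import shutil


def copy_split(filenames, labels, split, location, source_dirs):
    """Copy each file into location/split/label/, creating folders as needed.

    source_dirs is indexed by label (0 = fake, 1 = real), mirroring the
    two-element list of folders used in the loops above.
    """
    copied = []
    for name, label in zip(filenames, labels):
        target_dir = os.path.join(location, split, str(label))
        # exist_ok=True makes the call safe to repeat on later runs
        os.makedirs(target_dir, exist_ok=True)
        source = os.path.join(source_dirs[label], name)
        destination = os.path.join(target_dir, name)
        shutil.copy2(source, destination)
        copied.append(destination)
    return copied
```

&lt;p>With such a helper, the two loops would reduce to one call for the train lists and one for the test lists, each passing the fake and real source folders in label order.&lt;/p>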
&lt;h3 id="asvspoof-2019-from-datashare">ASVspoof 2019 from DataShare&lt;/h3>
&lt;p>Next, let&amp;rsquo;s process the ASVspoof dataset. It is much larger than the previous Kaggle dataset, so we download it directly into Colab, which takes some time.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">wget&lt;/span> &lt;span class="n">https&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="o">//&lt;/span>&lt;span class="n">datashare&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ed&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ac&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">uk&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">bitstream&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">handle&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="mi">10283&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="mi">3336&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">LA&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">zip&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>--2024-03-29 15:35:48-- https://datashare.ed.ac.uk/bitstream/handle/10283/3336/LA.zip
Resolving datashare.ed.ac.uk (datashare.ed.ac.uk)... 129.215.67.172
Connecting to datashare.ed.ac.uk (datashare.ed.ac.uk)|129.215.67.172|:443... connected.
HTTP request sent, awaiting response... 200 200
Length: 7640952520 (7.1G) [application/zip]
Saving to: ‘LA.zip’
LA.zip 100%[===================&amp;gt;] 7.12G 26.0MB/s in 5m 34s
2024-03-29 15:41:22 (21.8 MB/s) - ‘LA.zip’ saved [7640952520/7640952520]
&lt;/code>&lt;/pre>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">unzip&lt;/span> &lt;span class="n">LA&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">zip&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Note: As the file paths in the following code cells indicate, we processed ASVspoof2019 LA locally. Processing it on Google Colab requires a large amount of RAM and drive space, and is likely to crash on a free Colab account.&lt;/strong>&lt;/p>
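&lt;p>Since drive space is the main concern, one way to check before extracting is to compare the archive&amp;rsquo;s uncompressed size with the free space at the destination. This is only a sketch using the standard library: the function names and the 10% safety margin are assumptions, and it was not part of our original workflow.&lt;/p>

```python
import shutil
import zipfile


def uncompressed_size(zip_path):
    """Total size in bytes of all archive members once extracted."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def has_room_to_unzip(zip_path, destination='.', margin=1.1):
    """True if destination has at least margin times the extracted size free.

    margin is an arbitrary safety factor chosen for this sketch.
    """
    needed = uncompressed_size(zip_path) * margin
    free = shutil.disk_usage(destination).free
    return free >= needed
```

&lt;p>Calling has_room_to_unzip with the path to LA.zip before running unzip would show whether the current machine is likely to have enough room.&lt;/p>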
&lt;h4 id="basic-statistic-of-dataset">Basic statistic of dataset&lt;/h4>
&lt;p>The ASVspoof dataset has a predefined train/validation/test split. We keep the original split and retrieve and process the train/val/test data separately to ensure each is handled properly.&lt;/p>
&lt;p>&lt;strong>training set&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">train_df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read_csv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;LA/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.train.trn.txt&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">sep&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34; &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">header&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">train_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">columns&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;filename&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;system_id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;null&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;class_name&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">train_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">columns&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;null&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="n">inplace&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">train_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;filepath&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;LA/ASVspoof2019_LA_train/flac/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">train_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filename&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># fake audios are labeled 0, real (bona fide) are labeled 1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">train_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">train_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">class_name&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="s1">&amp;#39;bonafide&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;int32&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_speakers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">train_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_speakers&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> real speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_speakers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">train_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_speakers&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> fake speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">sum&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">f_speaker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">real_speakers&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f_speaker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fake_speakers&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> common speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>20 real speakers
20 fake speakers
20 common speakers
&lt;/code>&lt;/pre>
&lt;p>There are 20 real speakers and 20 fake speakers, and all 20 appear in both groups, which means every speaker&amp;rsquo;s voice appears in both real and fake samples.&lt;/p>
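&lt;p>The same three counts can also be computed with Python sets, which makes the overlap explicit. The helper below is a hypothetical sketch (speaker_overlap is not part of the original notebook), and the toy speaker ids are made up purely to show the call.&lt;/p>

```python
def speaker_overlap(speaker_ids, targets):
    """Return (real, fake, common) speaker counts.

    targets follow the labeling used above: 1 = real (bona fide), 0 = fake.
    """
    real, fake = set(), set()
    for speaker, target in zip(speaker_ids, targets):
        if target == 1:
            real.add(speaker)
        else:
            fake.add(speaker)
    return len(real), len(fake), len(real.intersection(fake))


# Toy input with made-up speaker ids, only to illustrate the call
toy_speakers = ['S1', 'S1', 'S2', 'S3']
toy_targets = [1, 0, 1, 0]
print(speaker_overlap(toy_speakers, toy_targets))
```

&lt;p>Because the function only iterates over its inputs, the speaker_id and target columns of any of the three data frames could be passed in directly.&lt;/p>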
&lt;p>&lt;strong>validation set&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">valid_df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read_csv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;LA/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.dev.trl.txt&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">sep&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34; &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">header&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">valid_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">columns&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;filename&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;system_id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;null&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;class_name&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">valid_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">columns&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;null&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="n">inplace&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">valid_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;filepath&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;LA/ASVspoof2019_LA_dev/flac/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">valid_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filename&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># fake audios are labeled 0, real (bona fide) are labeled 1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">valid_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">valid_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">class_name&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="s1">&amp;#39;bonafide&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;int32&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_speakers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">valid_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">valid_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_speakers&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> real speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_speakers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">valid_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">valid_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_speakers&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> fake speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">sum&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">f_speaker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">real_speakers&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f_speaker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fake_speakers&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> common speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>20 real speakers
10 fake speakers
10 common speakers
&lt;/code>&lt;/pre>
&lt;p>The real portion of the validation set contains 10 additional speakers who have no fake samples.&lt;/p>
&lt;p>&lt;strong>test set&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read_csv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;LA/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.eval.trl.txt&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">sep&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34; &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">header&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">columns&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;filename&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;system_id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;null&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;class_name&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">columns&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;null&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="n">inplace&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;filepath&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;LA/ASVspoof2019_LA_eval/flac/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">test_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filename&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># fake audios are labeled 0, real (bona fide) are labeled 1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">test_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">class_name&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="s1">&amp;#39;bonafide&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;int32&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_speakers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">test_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">test_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_speakers&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> real speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_speakers&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">test_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">test_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">==&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_speakers&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> fake speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">sum&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">f_speaker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">real_speakers&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">f_speaker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fake_speakers&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> common speakers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>67 real speakers
48 fake speakers
48 common speakers
&lt;/code>&lt;/pre>
&lt;p>As in the validation set, some speakers (19 here) have no fake samples. Let us examine the length of some samples.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">index&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randint&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">train_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train_df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;filepath&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">sound&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">AudioSegment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">from_file&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">item&lt;/span>&lt;span class="p">(),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;flac&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Lengths (in seconds) of samples: &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">sound&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">duration_seconds&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Lengths (in seconds) of samples: 3.74325
Lengths (in seconds) of samples: 4.227125
Lengths (in seconds) of samples: 4.0983125
Lengths (in seconds) of samples: 2.2131875
Lengths (in seconds) of samples: 4.6369375
Lengths (in seconds) of samples: 3.8778125
Lengths (in seconds) of samples: 3.84675
Lengths (in seconds) of samples: 2.6333125
Lengths (in seconds) of samples: 1.569875
Lengths (in seconds) of samples: 1.510125
&lt;/code>&lt;/pre>
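&lt;p>To put a number on these lengths, the ten durations printed above can be summarized with the standard library. This is a quick illustration only: the samples were drawn at random, so the figures describe this particular draw and not the whole training set.&lt;/p>

```python
from statistics import mean

# The ten durations (in seconds) printed above
durations = [3.74325, 4.227125, 4.0983125, 2.2131875, 4.6369375,
             3.8778125, 3.84675, 2.6333125, 1.569875, 1.510125]

summary = {
    'shortest': min(durations),
    'longest': max(durations),
    'mean': round(mean(durations), 2),
}
print(summary)
```

&lt;p>For this draw, every clip is shorter than five seconds and the mean is a little over three seconds.&lt;/p>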
&lt;h4 id="combining-samples">Combining samples&lt;/h4>
&lt;p>We observed that the length of each sample is quite short. To accommodate real-world scenarios, we decided to combine the voices from the same speaker together (fake with fake, real with real) to create longer voice segments.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">os.path&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">exists&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># load the csv that contains the speaker id of each audio, and concat the audio into a single audio&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">concat_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ref&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dataset&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">speaker&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ref&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">current_speaker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">speaker&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">current_speaker&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">real&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">empty&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,),&lt;/span> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">int16&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fake&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">empty&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,),&lt;/span> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">int16&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># group by speaker&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">voice_from_speaker&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ref&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">ref&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;speaker_id&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">current_speaker&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">row&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">voice_from_speaker&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">iterrows&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">data&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">sample_rate&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">sf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;filepath&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#group by label&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;target&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">real&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fake&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">real&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">!=&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Real Exist&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">dataset&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;train&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/train/real&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">elif&lt;/span> &lt;span class="n">dataset&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;validation&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/validation/real&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/test/real&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">out_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#combine and save&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">sf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">out_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">current_speaker&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">real&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sample_rate&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">fake&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">!=&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Fake Exist&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">dataset&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;train&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/train/fake&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">elif&lt;/span> &lt;span class="n">dataset&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;validation&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/validation/fake&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/test/fake&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">out_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">sf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">out_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">current_speaker&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">fake&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sample_rate&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Concatenate the audio for each split (train, validation, test) separately:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">concat_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train_df&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;train&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">concat_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">valid_df&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;validation&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">concat_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_df&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;test&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Check the length of the concatenated voice segments for a few randomly chosen speakers:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/train/real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">index&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randint&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">files&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">files&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">path&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">sound&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">AudioSegment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">from_file&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/train/real/&amp;#39;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;flac&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Lengths (in seconds) of real samples: &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">sound&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">duration_seconds&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">sound&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">AudioSegment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">from_file&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/train/fake/&amp;#39;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;flac&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Lengths (in seconds) of fake samples: &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">sound&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">duration_seconds&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Lengths (in seconds) of real samples: 438.0238125
Lengths (in seconds) of fake samples: 4188.132875
Lengths (in seconds) of real samples: 424.0614375
Lengths (in seconds) of fake samples: 3624.6846875
Lengths (in seconds) of real samples: 350.6946875
Lengths (in seconds) of fake samples: 3809.648
Lengths (in seconds) of real samples: 458.1223125
Lengths (in seconds) of fake samples: 4087.7876875
Lengths (in seconds) of real samples: 439.1545625
Lengths (in seconds) of fake samples: 4109.019375
&lt;/code>&lt;/pre>
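&lt;p>A side note on the concatenation step: &lt;code>np.append&lt;/code> copies the whole accumulated array on every call, so building a long recording clip by clip gets slower as it grows. A minimal sketch of an alternative, assuming the clips are already loaded as 1-D NumPy arrays, is to collect them in a list and call &lt;code>np.concatenate&lt;/code> once. The helper name &lt;code>concat_clips&lt;/code>, the 16 kHz sample rate, and the synthetic clips below are illustrative assumptions and not part of the original notebook.&lt;/p>

```python
import numpy as np


def concat_clips(clips):
    """Join a list of 1-D audio arrays into one array with a single copy.

    Hypothetical helper for illustration; returns an empty float array
    when there are no clips, so the caller can still check shape[0].
    """
    if len(clips) == 0:
        return np.empty((0,), dtype=np.float64)
    return np.concatenate(clips)


# Synthetic stand-ins for decoded clips (sf.read returns float64 by default).
sample_rate = 16000  # assumed sample rate for this example
clips = [np.zeros(sample_rate * 2), np.ones(sample_rate * 3)]

combined = concat_clips(clips)
print(combined.shape[0] / sample_rate)  # duration in seconds: 5.0
```

&lt;p>Inside a loop like the one in &lt;code>concat_audio&lt;/code>, this would mean appending each decoded clip to a Python list and concatenating once per speaker and label.&lt;/p>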
&lt;h4 id="converting-into-spectrogram">Converting into spectrogram&lt;/h4>
&lt;p>With the audio concatenated, we can follow the same process as for the previous Kaggle dataset: split each file into 30-second segments and convert each segment into a spectrogram.&lt;/p>
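&lt;p>To make the fixed-length split concrete, here is a minimal sketch in plain Python that computes only the segment boundaries in milliseconds, without loading any audio. The function name &lt;code>segment_bounds&lt;/code> is hypothetical, and the 438,024 ms input is an assumed rounding of the first real sample printed above (about 438 seconds).&lt;/p>

```python
def segment_bounds(total_ms, segment_ms=30000):
    """Return (start, end) pairs in milliseconds covering the whole recording.

    Every segment is segment_ms long except possibly the last one,
    which holds whatever audio remains.
    """
    return [(start, min(start + segment_ms, total_ms))
            for start in range(0, total_ms, segment_ms)]


bounds = segment_bounds(438_024)
print(len(bounds))                    # number of segments: 15
print(bounds[-1][1] - bounds[-1][0])  # length of the last segment in ms: 18024
```

&lt;p>The last segment is usually shorter than 30 seconds. The splitting function in this post keeps that shorter final segment, so its spectrogram covers less audio than the others.&lt;/p>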
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># this function is similar to the previous splitting function, but takes .flac files as input instead&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">split_audio_flac&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment_length&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">30000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">random_length&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Split the audio file into segments of given length (in milliseconds).
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Returns a list of audio segments.
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">audio&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">AudioSegment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">from_file&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;flac&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">length_audio&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">audio&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Random Length: None or [lower, upper]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Generating a random segment length (in milliseconds) from lower (inclusive) to upper (exclusive)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">random_length&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">current_length&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">while&lt;/span> &lt;span class="n">current_length&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">audio&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segment_length&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randint&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">random_length&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="n">random_length&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">current_length&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">segment_length&lt;/span> &lt;span class="o">&amp;gt;=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">audio&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">audio&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">current_length&lt;/span>&lt;span class="p">:]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">audio&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">current_length&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="n">current_length&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">segment_length&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">current_length&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">segment_length&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># fixed length: every segment is segment_length ms long (the last one may be shorter)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length_audio&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment_length&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">audio&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="n">i&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">segment_length&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">segments&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Training Set&lt;/strong>&lt;/p>
&lt;p>We first process the training set from the original ASVspoof 2019 challenge.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/train/fake&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/train/real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Convert the real voices into 30-second segments and save each segment as a spectrogram image:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># remove old files if they exist&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">remove_all_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">wav_file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">real_files&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing: &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">output_folder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;real&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio_flac&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/train/real/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">NFFT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1024&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">noverlap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Do the same for the fake voices:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># remove_all_files(&amp;#39;fake&amp;#39;)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">wav_file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fake_files&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing: &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio_flac&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/train/fake/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">NFFT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1024&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">noverlap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/train/1&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/train/0&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;There are &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_spectro&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> real samples in the training set, each containing 30 seconds of voice as a spectrogram&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;There are &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_spectro&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1"> fake samples in the training set&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>There are 303 real samples in the training set, each containing 30 seconds of voice as a spectrogram
There are 2615 fake samples in the training set
&lt;/code>&lt;/pre>
&lt;p>Note: this is still an imbalanced dataset, with roughly 8.6 fake samples for every real one (2615 vs. 303).&lt;/p>
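&lt;p>One common way to compensate for this kind of imbalance is to weight each class inversely to its frequency in the loss. The sketch below is not taken from this project: it assumes the counts printed above, the folder label convention (1 = real, 0 = fake), and the widely used "balanced" heuristic &lt;code>n_samples / (n_classes * class_count)&lt;/code>. The helper name &lt;code>balanced_class_weights&lt;/code> is hypothetical.&lt;/p>

```python
# Counts taken from the printed output; labels follow the folder names (1 = real, 0 = fake).
counts = {0: 2615, 1: 303}


def balanced_class_weights(counts):
    """Return one weight per class, inversely proportional to its frequency."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


class_weights = balanced_class_weights(counts)
for label, weight in sorted(class_weights.items()):
    print(f"class {label}: weight {weight:.3f}")
```

&lt;p>With these weights, a misclassified real sample would count roughly 8.6 times as much as a misclassified fake one, which mirrors the class ratio.&lt;/p>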
&lt;p>Some sample spectrograms from each class:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Display five random real and fake voice spectrograms&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">count&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">choice&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_spectro&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">replace&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">real_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">fake_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;fake voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">count&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="output_65_0.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
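&lt;p>Each spectrogram is produced by &lt;code>plt.specgram&lt;/code> with &lt;code>NFFT=1024&lt;/code> and &lt;code>noverlap=512&lt;/code>, so the size of the underlying time-frequency grid can be worked out by hand. The sketch below is illustrative only: the 30-second segment length comes from the training-set printout, while the 16 kHz sample rate is an assumption (the conversion code reads the real value from &lt;code>segment.frame_rate&lt;/code>). The helper name &lt;code>specgram_shape&lt;/code> is hypothetical.&lt;/p>

```python
# Parameters passed to plt.specgram in the conversion loops.
NFFT = 1024
NOVERLAP = 512

# Assumptions for illustration only: 16 kHz audio, 30-second segments.
SAMPLE_RATE = 16_000
SEGMENT_SECONDS = 30


def specgram_shape(n_samples, nfft=NFFT, noverlap=NOVERLAP):
    """Return (frequency_bins, time_frames) for a one-sided spectrogram of real input."""
    hop = nfft - noverlap                      # samples between consecutive windows
    freq_bins = nfft // 2 + 1                  # one-sided spectrum of a real signal
    time_frames = (n_samples - noverlap) // hop
    return freq_bins, time_frames


n_samples = SAMPLE_RATE * SEGMENT_SECONDS
freq_bins, time_frames = specgram_shape(n_samples)
print(f"frequency bins: {freq_bins}, time frames: {time_frames}")
print(f"frequency resolution: {SAMPLE_RATE / NFFT} Hz")
print(f"hop between frames: {(NFFT - NOVERLAP) / SAMPLE_RATE * 1000} ms")
```

&lt;p>Note that the saved JPG is rendered at the pixel size of the matplotlib figure, so this grid is resampled when the image is written to disk.&lt;/p>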
&lt;p>&lt;strong>Test and Validation Sets&lt;/strong>&lt;/p>
&lt;p>Similarly, we converted the test set into spectrograms using the same process as for the training set.&lt;/p>
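&lt;p>Because every split repeats the same loop, the path and file-name logic can be factored into a few small helpers. This is only a sketch of that idea, not the code used in this project: the helper names and the example file name are hypothetical, while the folder layout (&lt;code>LA_Spectrogram/split/label&lt;/code>, with 1 = real and 0 = fake) and the &lt;code>name_index.jpg&lt;/code> pattern follow the conversion code in this post.&lt;/p>

```python
import os

# Folder convention used in this post: real -> label 1, fake -> label 0.
LABEL_DIRS = {"real": "1", "fake": "0"}
FLAC_EXT = ".flac"


def is_flac(audio_file):
    """Same check as comparing the last five characters with '.flac'."""
    return audio_file.endswith(FLAC_EXT)


def spectrogram_name(audio_file, segment_index):
    """Image name for one segment: strip '.flac', append a 1-based index."""
    stem = audio_file[:-len(FLAC_EXT)]
    return f"{stem}_{segment_index + 1}.jpg"


def output_path(split, kind, audio_file, segment_index, root="LA_Spectrogram"):
    """Where the spectrogram of one segment would be saved."""
    name = spectrogram_name(audio_file, segment_index)
    return os.path.join(root, split, LABEL_DIRS[kind], name)


# Hypothetical file name, used only to show the resulting path.
print(output_path("test", "real", "speaker_01.flac", 0))
```

&lt;p>With helpers like these, the real and fake loops for each split would differ only in the &lt;code>split&lt;/code> and &lt;code>kind&lt;/code> arguments.&lt;/p>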
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/test/fake&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/test/real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#converting real&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">output_folder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_Spectrogram/test/1&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">wav_file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">real_files&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing: &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio_flac&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/test/real/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">NFFT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1024&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">noverlap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#converting fake&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">output_folder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_Spectrogram/test/0&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">wav_file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fake_files&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">:]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing: &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio_flac&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;LA_concated/test/fake/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">NFFT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1024&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">noverlap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The validation set is converted the same way:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/validation/fake&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_files&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/validation/real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#converting real&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">output_folder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_Spectrogram/validation/1&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">wav_file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">real_files&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing: &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio_flac&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/validation/real/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">NFFT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1024&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">noverlap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#converting fake&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">output_folder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;LA_Spectrogram/validation/0&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">wav_file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fake_files&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">7&lt;/span>&lt;span class="p">:]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Processing: &amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.flac&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio_flac&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s1">&amp;#39;LA_concated/validation/fake/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">wav_file&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">NFFT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1024&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">noverlap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">512&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">wav_file&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>All samples are pre-split, so we can convert the dataset directly in this way, and the resulting file structure is ready for torchvision to use.&lt;/p>
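&lt;p>As a rough illustration of why this layout is convenient: the torchvision &lt;code>ImageFolder&lt;/code> dataset treats each sub-folder as one class and assigns integer labels by sorting the folder names. The sketch below mimics that rule in plain Python on a temporary directory. The helper name &lt;code>folder_labels&lt;/code> and the file names are made up for this example and are not part of the project code.&lt;/p>

```python
import os
import tempfile


def folder_labels(root):
    """Return (class_to_idx, samples) for a root/CLASS/file layout.

    Hypothetical helper: class folders are sorted by name and numbered
    from 0, which is the same rule torchvision's ImageFolder applies.
    """
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    samples = []
    for name in classes:
        for fname in sorted(os.listdir(os.path.join(root, name))):
            samples.append((os.path.join(name, fname), class_to_idx[name]))
    return class_to_idx, samples


# Build a tiny stand-in for a spectrogram folder, using empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for label, fname in [("0", "fake_a_1.jpg"), ("1", "real_a_1.jpg")]:
        os.makedirs(os.path.join(root, label), exist_ok=True)
        open(os.path.join(root, label, fname), "w").close()
    class_to_idx, samples = folder_labels(root)

# Folder "0" (fake) gets label 0 and folder "1" (real) gets label 1.
print(class_to_idx)
```

&lt;p>Naming the folders &lt;code>0&lt;/code> and &lt;code>1&lt;/code> therefore keeps the integer labels identical to the folder names.&lt;/p>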
&lt;h3 id="unseen-samples-for-model-evaluation">Unseen samples for model evaluation&lt;/h3>
&lt;p>We need to evaluate the model performance on unseen samples.&lt;/p>
&lt;p>To evaluate our final model, we use another dataset consisting of original voice recordings and fake recordings obtained by imitation using Efficient Wavelet Masking (Rodríguez, 2019).&lt;/p>
&lt;p>This dataset was converted into spectrograms in the same way as before, except that each clip keeps its original length instead of being cut into fixed 30-second segments, since this better reflects real-world usage.&lt;/p>
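&lt;p>To make the difference between fixed 30-second segments and original-length clips concrete, here is a small sketch. The function &lt;code>segment_bounds&lt;/code> is a hypothetical stand-in, not the &lt;code>split_audio&lt;/code> helper used in this project, and it assumes that a trailing remainder shorter than the window is kept as its own segment.&lt;/p>

```python
def segment_bounds(duration_ms, window_ms=30_000):
    """Return (start, end) millisecond bounds of consecutive windows.

    Assumption for this sketch: a final remainder shorter than window_ms
    is kept, so a clip shorter than the window yields exactly one segment.
    """
    return [
        (start, min(start + window_ms, duration_ms))
        for start in range(0, duration_ms, window_ms)
    ]


long_clip = segment_bounds(65_000)   # a 65-second recording is cut into windows
short_clip = segment_bounds(12_000)  # a 12-second recording is kept whole
print(long_clip)
print(short_clip)
```

&lt;p>Under this assumption, every recording shorter than 30 seconds produces a single spectrogram covering the whole clip.&lt;/p>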
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_unseen&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/Fake_voice_recordings_Imitation/real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_unseen&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/Fake_voice_recordings_Imitation/fake&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#remove old files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">remove_all_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/test/real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">remove_all_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/test/fake&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">file_path&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">real_unseen&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">file_path&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.wav&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/Fake_voice_recordings_Imitation/real/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># the audios have various lengths and are shorter than 30 seconds,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># so split_audio just keeps the whole sequence without creating 30-second intervals&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_channels&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">im&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axes&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_xaxis&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_visible&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">im&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axes&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_yaxis&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_visible&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">output_folder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;test&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;1&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">file_path&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">file_path&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">fake_unseen&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">file_path&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">:]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;.wav&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segments&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">split_audio&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/Fake_voice_recordings_Imitation/fake/&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">file_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">segment&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">enumerate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segments&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">segment&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_channels&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">samples&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">array&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_array_of_samples&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">specgram&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freqs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">times&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">im&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">specgram&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">samples&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Fs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">segment&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">frame_rate&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">im&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axes&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_xaxis&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_visible&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">im&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axes&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_yaxis&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_visible&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#change y range&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">output_folder&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;test&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;0&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">savefig&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_folder&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">file_path&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">4&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;_&amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;.jpg&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="nb">format&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">close&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;all&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here are a few of the unseen samples:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/test/1&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/test/0&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">count&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">choice&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_spectro&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">real_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">fake_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fake_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;real voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;fake voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">count&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="output_76_0.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="model-training-and-eveluation">Model training and eveluation&lt;/h2>
&lt;p>We first defined some helper functions which created dataloaders to access training data and validation data, and computed the accuracy and the F1 score of the model.&lt;/p>
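&lt;p>For reference, accuracy and the F1 score can be computed from true and predicted labels as in the sketch below. It is a plain-Python illustration rather than the helper defined in this post, the function name &lt;code>accuracy_and_f1&lt;/code> is hypothetical, and it assumes that label 1 (real voice) is treated as the positive class.&lt;/p>

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for binary labels.

    Assumption for this sketch: `positive` is the class treated as
    positive (1 = real voice in the folder layout used here).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(acc, f1)
```

&lt;p>Reporting F1 alongside accuracy is useful when the real and fake classes are not equally represented, because accuracy alone can look high for a model that favours the larger class.&lt;/p>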
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torch&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torch.nn&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">nn&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torch.optim&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">optim&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torch.nn.functional&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">torchvision&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">transforms&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">torch.utils.data&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">DataLoader&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">torch.utils.data&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">random_split&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">spliting_data&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batch_size&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">split&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">0.2&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># If Google Drive is mounted, we can store the dataset in Google Drive&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># for future use (make sure to create a shortcut to the &amp;#39;shared&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># project folder&amp;#39; in &amp;#39;My Drive&amp;#39;).&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;drive&amp;#39;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">location&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># If not, store it in the runtime (requires downloading and processing the data every time)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">location&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#kaggle dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">data_dir&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#resize the spectrogram&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">transform&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">transforms&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Compose&lt;/span>&lt;span class="p">([&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">transforms&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Resize&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">224&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">224&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">transforms&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ToTensor&lt;/span>&lt;span class="p">(),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Reading Data&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># using both datasets&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">path&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;both&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dataset1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/train&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">val_dataset1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/validation&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># the Kaggle dataset is not pre-split, so split it here&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dataset2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/train&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">train_size&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dataset2&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">val_size&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dataset2&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">train_size&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dataset2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">val_dataset2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">random_split&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dataset2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">train_size&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">val_size&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">train_dataset&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">utils&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ConcatDataset&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">dataset1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dataset2&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">val_dataset&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">utils&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ConcatDataset&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">val_dataset1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">val_dataset2&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># using a single dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">path&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;kaggle&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#using kaggle&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dataset&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/train&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">train_size&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dataset&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">val_size&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dataset&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">train_size&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">train_dataset&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">val_dataset&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">random_split&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dataset&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">train_size&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">val_size&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># using ASVspoof&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">train_dataset&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/train&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">val_dataset&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/validation&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;Training Sample: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train_dataset&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;Validation Sample: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">val_dataset&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">train_loader&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DataLoader&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train_dataset&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batch_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">batch_size&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">shuffle&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">val_loader&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DataLoader&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">val_dataset&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batch_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">batch_size&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">shuffle&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">train_loader&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">val_loader&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># evaluate model performance (average loss, F1, and accuracy per batch)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">test_loader&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">eval&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">device&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cuda:0&amp;#34;&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cuda&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">is_available&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="s2">&amp;#34;cpu&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">test_loss&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">test_accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">test_f1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">criterion&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">CrossEntropyLoss&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">no_grad&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">inputs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labels&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">test_loader&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">inputs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labels&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">inputs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">labels&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">outputs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputs&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">test_loss&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">criterion&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">outputs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labels&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">_&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">predicted&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">max&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">outputs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">labels&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">labels&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cpu&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">predicted&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">predicted&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cpu&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">test_accuracy&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">accuracy_score&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">predicted&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">test_f1&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">f1_score&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">predicted&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">test_loss&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">detach&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cpu&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_loader&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">test_f1&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_loader&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="n">test_accuracy&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_loader&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Next, we define another helper function for training:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">sklearn.metrics&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">f1_score&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">sklearn.metrics&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">accuracy_score&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">train&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dataset_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">epoches&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">30&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">learning_rate&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">0.0001&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">batch_size&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">32&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">check_point&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">random_state&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">42&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">plot&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">True&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">manual_seed&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">random_state&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">train_loader&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">val_loader&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spliting_data&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dataset_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">batch_size&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Define the loss function and optimizer&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">criterion&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">CrossEntropyLoss&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">optimizer&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">optim&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Adam&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">parameters&lt;/span>&lt;span class="p">(),&lt;/span> &lt;span class="n">lr&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">learning_rate&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Move the model to the GPU if available&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">device&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cuda:0&amp;#34;&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cuda&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">is_available&lt;/span>&lt;span class="p">()&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="s2">&amp;#34;cpu&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">num_epochs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">epoches&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">training_loss&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">training_accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">training_f1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">validation_loss&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">validation_accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">validation_f1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">epoch&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">num_epochs&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">leave&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">False&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">desc&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;Epoch&amp;#39;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Set the model to training mode&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">train&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">running_loss&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">correct&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">total&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">inputs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labels&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">train_loader&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">inputs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labels&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">inputs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">labels&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">optimizer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">zero_grad&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">outputs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputs&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">loss&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">criterion&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">outputs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labels&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">loss&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">backward&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">optimizer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">step&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">running_loss&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">loss&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">item&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Calculate accuracy&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">_&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">predicted&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">max&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">outputs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">labels&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">labels&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cpu&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">predicted&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">predicted&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cpu&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">accuracy&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">accuracy_score&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">predicted&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f1&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="n">f1_score&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">predicted&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">epoch_loss&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">running_loss&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train_loader&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">epoch_f1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">f1&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train_loader&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">epoch_accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">accuracy&lt;/span>&lt;span class="o">/&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train_loader&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Every 2 epochs: store metrics, evaluate, checkpoint, and print&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">epoch&lt;/span> &lt;span class="o">%&lt;/span> &lt;span class="mi">2&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">training_loss&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">epoch_loss&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">training_accuracy&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">epoch_accuracy&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">training_f1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">epoch_f1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">epoch_val_loss&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">epoch_val_f1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">epoch_val_accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">val_loader&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">validation_loss&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">epoch_val_loss&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">validation_accuracy&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">epoch_val_accuracy&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">validation_f1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">epoch_val_f1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">#save checkpoint&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">check_point&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">checkpoint&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">{&lt;/span>&lt;span class="s1">&amp;#39;epoch&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">epoch&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;model_state_dict&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">state_dict&lt;/span>&lt;span class="p">(),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s1">&amp;#39;optimizer_state_dict&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">optimizer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">state_dict&lt;/span>&lt;span class="p">(),&lt;/span>&lt;span class="s1">&amp;#39;loss&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">loss&lt;/span>&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">out_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">dataset_path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/&amp;#39;&lt;/span>&lt;span class="p">)[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">/lr_&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">learning_rate&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">_batch_&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">batch_size&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">_new&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">out_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">exist_ok&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">checkpoint_filename&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;checkpoint_epoch&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">epoch&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">.pth&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">save&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">checkpoint&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">out_path&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">checkpoint_filename&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;&amp;#39;&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s1"> Epoch &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">epoch&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">/&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_epochs&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s1"> Train Loss: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">epoch_loss&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s1">.4f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">, Train Accuracy: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">epoch_accuracy&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s1">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">, Train F1: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">epoch_f1&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s1">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s1"> Val Loss: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">epoch_val_loss&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s1">.4f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">, Val Accuracy: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">epoch_val_accuracy&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s1">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">, Val F1: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">epoch_val_f1&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s1">.2f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s1"> &amp;#39;&amp;#39;&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">plot&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Plot the training loss and accuracy curves&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">figure&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">figsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">18&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">6&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_epochs&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">training_loss&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;Training Loss&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_epochs&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">validation_loss&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;Validation Loss&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xlabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Epochs&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Loss&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Training Loss Curve&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_epochs&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">training_accuracy&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;Training Accuracy&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_epochs&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">validation_accuracy&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;Validation Accuracy&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xlabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Epochs&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Accuracy&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Training Accuracy Curve&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_epochs&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">training_f1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;Training F1&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_epochs&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">validation_f1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;Validation F1&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xlabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Epochs&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;F1 Score&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;F1 Score Curve&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">model&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="baseline-model">Baseline Model&lt;/h3>
&lt;p>We first trained a CNN without transfer learning from ResNet as the baseline model. This CNN has 2 convolutional layers, each followed by a ReLU activation and max-pooling, and 3 fully-connected layers. ReLU is also the activation function between the fully-connected layers.&lt;/p>
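&lt;p>The &lt;code>16 * 53 * 53&lt;/code> input size of the first fully-connected layer in the class below follows from the convolution and pooling arithmetic. The sketch here is only an illustration: the helper &lt;code>conv_out&lt;/code> is hypothetical, and the 224 x 224 input size is an assumption inferred from the 53 x 53 feature map rather than something stated in this section.&lt;/p>

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224  # assumed input height/width; not stated in this section
size = conv_out(size, kernel=5)            # conv1: 224 -> 220
size = conv_out(size, kernel=2, stride=2)  # pool:  220 -> 110
size = conv_out(size, kernel=5)            # conv2: 110 -> 106
size = conv_out(size, kernel=2, stride=2)  # pool:  106 -> 53
flattened = 16 * size * size               # conv2 produces 16 channels
print(size, flattened)
```

&lt;p>Under that assumption the flattened size is 44944, so a different input resolution would require a different &lt;code>fc1&lt;/code> input size.&lt;/p>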
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">class&lt;/span> &lt;span class="nc">SimpleCNN&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Module&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="fm">__init__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">super&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">SimpleCNN&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="fm">__init__&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;SimpleCNN&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">conv1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Conv2d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">6&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pool&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">MaxPool2d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">conv2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Conv2d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">6&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">16&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Linear&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">16&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">53&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">53&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">120&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Linear&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">120&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">84&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc3&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Linear&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">84&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="nf">forward&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pool&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">relu&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">conv1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pool&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">relu&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">conv2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">view&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">16&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">53&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">53&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">relu&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">relu&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc3&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">x&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="training">Training&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">SimpleCNN&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;#39;both&amp;#39; combines the two datasets&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;#39;kaggle/train&amp;#39; uses only the Kaggle dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;#39;LA_Spectrogram/train&amp;#39; uses the ASVspoof dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">train&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;both&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">batch_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">learning_rate&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">0.001&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Reading Data
Training Sample: 2986
Validation Sample: 747
Epoch: 3%|▎ | 1/30 [05:24&amp;lt;2:36:47, 324.39s/it]
Epoch 1/30
Train Loss: 0.3613, Train Accuracy: 0.89, Train F1: 0.00
Val Loss: 0.3500, Val Accuracy: 0.89, Val F1: 0.00
Epoch: 10%|█ | 3/30 [06:35&amp;lt;45:24, 100.90s/it]
Epoch 3/30
Train Loss: 0.3125, Train Accuracy: 0.90, Train F1: 0.06
Val Loss: 0.3126, Val Accuracy: 0.91, Val F1: 0.27
Epoch: 17%|█▋ | 5/30 [07:48&amp;lt;25:43, 61.74s/it]
Epoch 5/30
Train Loss: 0.2594, Train Accuracy: 0.91, Train F1: 0.26
Val Loss: 0.2729, Val Accuracy: 0.90, Val F1: 0.27
Epoch: 23%|██▎ | 7/30 [08:56&amp;lt;17:52, 46.63s/it]
Epoch 7/30
Train Loss: 0.1809, Train Accuracy: 0.93, Train F1: 0.55
Val Loss: 0.2037, Val Accuracy: 0.92, Val F1: 0.45
Epoch: 30%|███ | 9/30 [10:04&amp;lt;14:10, 40.49s/it]
Epoch 9/30
Train Loss: 0.0904, Train Accuracy: 0.97, Train F1: 0.80
Val Loss: 0.1039, Val Accuracy: 0.96, Val F1: 0.81
Epoch: 37%|███▋ | 11/30 [11:14&amp;lt;12:00, 37.92s/it]
Epoch 11/30
Train Loss: 0.0697, Train Accuracy: 0.97, Train F1: 0.86
Val Loss: 0.0878, Val Accuracy: 0.96, Val F1: 0.82
Epoch: 43%|████▎ | 13/30 [12:23&amp;lt;10:18, 36.41s/it]
Epoch 13/30
Train Loss: 0.0398, Train Accuracy: 0.99, Train F1: 0.92
Val Loss: 0.0806, Val Accuracy: 0.96, Val F1: 0.83
Epoch: 50%|█████ | 15/30 [13:32&amp;lt;08:58, 35.90s/it]
Epoch 15/30
Train Loss: 0.0238, Train Accuracy: 0.99, Train F1: 0.97
Val Loss: 0.0811, Val Accuracy: 0.97, Val F1: 0.83
Epoch: 57%|█████▋ | 17/30 [14:41&amp;lt;07:41, 35.52s/it]
Epoch 17/30
Train Loss: 0.0092, Train Accuracy: 1.00, Train F1: 0.99
Val Loss: 0.0564, Val Accuracy: 0.98, Val F1: 0.89
Epoch: 63%|██████▎ | 19/30 [15:49&amp;lt;06:26, 35.15s/it]
Epoch 19/30
Train Loss: 0.0018, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0630, Val Accuracy: 0.98, Val F1: 0.89
Epoch: 70%|███████ | 21/30 [17:00&amp;lt;05:21, 35.68s/it]
Epoch 21/30
Train Loss: 0.0007, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0677, Val Accuracy: 0.98, Val F1: 0.90
Epoch: 77%|███████▋ | 23/30 [18:09&amp;lt;04:06, 35.25s/it]
Epoch 23/30
Train Loss: 0.0004, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0718, Val Accuracy: 0.98, Val F1: 0.91
Epoch: 83%|████████▎ | 25/30 [19:17&amp;lt;02:54, 34.99s/it]
Epoch 25/30
Train Loss: 0.0003, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0697, Val Accuracy: 0.98, Val F1: 0.89
Epoch: 90%|█████████ | 27/30 [20:30&amp;lt;01:47, 35.86s/it]
Epoch 27/30
Train Loss: 0.0002, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0737, Val Accuracy: 0.98, Val F1: 0.90
Epoch: 97%|█████████▋| 29/30 [21:41&amp;lt;00:35, 35.89s/it]
Epoch 29/30
Train Loss: 0.0002, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0744, Val Accuracy: 0.98, Val F1: 0.89
&lt;/code>&lt;/pre>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="output_86_32.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;pre>&lt;code>SimpleCNN(
(conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
(pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=44944, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=2, bias=True)
)
&lt;/code>&lt;/pre>
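&lt;p>The value in_features=44944 of fc1 can be checked by hand. The sketch below assumes a 224 x 224 input (the size used by the Resize transform in this post) and assumes that the pooling layer is applied after each of the two convolutions; the forward pass is not shown in this section, so that second point is an assumption. The helper conv_out is a hypothetical name introduced only for this illustration.&lt;/p>

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224                                 # assumed input height and width
size = conv_out(size, kernel=5)            # conv1: 5x5 kernel, stride 1, gives 220
size = conv_out(size, kernel=2, stride=2)  # pool: 2x2 window, stride 2, gives 110
size = conv_out(size, kernel=5)            # conv2: 5x5 kernel, stride 1, gives 106
size = conv_out(size, kernel=2, stride=2)  # pool again (assumed), gives 53

# conv2 has 16 output channels, so the flattened feature vector has 16 * 53 * 53 entries.
flat_features = 16 * size * size
print(size, flat_features)  # 53 44944
```

&lt;p>The result matches the printed architecture, which is consistent with the two assumptions above.&lt;/p>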
&lt;p>The highest validation F1 was 0.91 (epoch 23), with a corresponding validation accuracy of 0.98. A validation F1 of 0.91 is very good (Buhl, 2023). However, because we tuned hyperparameters against the validation F1, the model may have overfit to the validation data, so we also checked its performance on the test data.&lt;/p>
&lt;h4 id="evaluation">Evaluation&lt;/h4>
&lt;p>We loaded the trained baseline model.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">SimpleCNN&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># load the checkpoint with the highest validation F1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s1">&amp;#39;/both/SimpleCNN/lr_0.001_batch_64/checkpoint_epoch23.pth&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">checkpoint&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model_state_dict&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">checkpoint&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;model_state_dict&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load_state_dict&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model_state_dict&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>&amp;lt;All keys matched successfully&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>Then we evaluated the model&amp;rsquo;s performance on the test data. Note that the test data consisted of test samples from both &amp;ldquo;Deep-Voice&amp;rdquo; on Kaggle and ASVspoof2019 LA.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">transform&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">transforms&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Compose&lt;/span>&lt;span class="p">([&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">transforms&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Resize&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">224&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">224&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">transforms&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ToTensor&lt;/span>&lt;span class="p">(),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_dataset1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s1">&amp;#39;kaggle/test&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_dataset2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="n">location&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s1">&amp;#39;LA_Spectrogram/test&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_dataset&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">utils&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ConcatDataset&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">test_dataset1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">test_dataset2&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_loader&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DataLoader&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_dataset&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batch_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">shuffle&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loss&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">test_loader&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;The F1-Score for Testing Dataset is &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">f1&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;The accuracy for Testing Dataset is &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">accuracy&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>The F1-Score for Testing Dataset is 0.7845597579699278
The accuracy for Testing Dataset is 0.9271012931034484
&lt;/code>&lt;/pre>
&lt;p>Although the test accuracy of the baseline model was high (0.927), its test F1 score was considerably lower (0.785). The model therefore did not balance precision and recall as well on the test set as it did on the validation set. Since our training, validation and test data all had more fake samples than real samples, the baseline model might be classifying many samples as fake without having learned sufficiently discriminative features.&lt;/p>
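&lt;p>To see how accuracy and F1 can diverge on imbalanced data, here is a minimal sketch that computes both from confusion-matrix counts. All counts are hypothetical and chosen only for illustration. It assumes that F1 is computed with the minority class as the positive class, which the epoch 1 log of the baseline (accuracy 0.89, F1 0.00) suggests but does not prove. The function name binary_metrics is also an assumption, not part of our training code.&lt;/p>

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return accuracy, precision, recall, f1


# Hypothetical split: 890 majority-class (negative) and 110 minority-class (positive) samples.
# A model that always predicts the majority class never produces a true positive.
always_majority = binary_metrics(tp=0, fp=0, fn=110, tn=890)
print(always_majority)  # (0.89, 0.0, 0.0, 0.0)

# Hypothetical model that recovers 80 of the 110 positives and raises 20 false alarms.
better = binary_metrics(tp=80, fp=20, fn=30, tn=870)
print([round(m, 3) for m in better])  # [0.95, 0.8, 0.727, 0.762]
```

&lt;p>In both hypothetical cases the accuracy looks strong while the F1 score exposes the missed minority-class samples, which is why we report F1 alongside accuracy.&lt;/p>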
&lt;h3 id="transfer-learning-pre-train-res-net">Transfer Learning: Pre-trained ResNet&lt;/h3>
&lt;p>Our next step was to build a model that used transfer learning from ResNet18.
The architecture consisted of one convolution layer, followed by the pretrained ResNet18 and two fully-connected layers at the end. We did not freeze the ResNet18 parameters; in other words, they were also updated by backpropagation. Our justification was that ResNet18 was pretrained on ImageNet, which did not contain any spectrograms or other signal visualizations, so its features needed to adapt to our data.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torch&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torchvision.models&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">models&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torch.nn&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">nn&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">class&lt;/span> &lt;span class="nc">PreTrainedResNet18&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Module&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">def&lt;/span> &lt;span class="fm">__init__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="nb">super&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="fm">__init__&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;PreRes&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="c1"># Extra input convolution (randomly initialized) placed in front of ResNet18&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">conv1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Conv2d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">kernel_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">7&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">stride&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padding&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">bias&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="c1"># pretrained=True already loads the ImageNet weights into self.resnet, so no manual copy is needed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">resnet&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">models&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">resnet18&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">pretrained&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="c1"># Classification head: 1000 ResNet18 outputs to 100 hidden units to 2 classes&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Linear&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">100&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Linear&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">100&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">activation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ReLU&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dropout&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nn&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Dropout&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">p&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">0.1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">def&lt;/span> &lt;span class="nf">forward&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">conv1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">resnet&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">activation&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc1&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dropout&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fc2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">return&lt;/span> &lt;span class="n">x&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="model-training-and-fine-tuning">Model Training and Fine-tuning&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># training&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model_resnet&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PreTrainedResNet18&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model_resnet&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model_resnet&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;both&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">batch_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">256&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">learning_rate&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">0.0001&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Downloading: &amp;quot;https://download.pytorch.org/models/resnet18-f37072fd.pth&amp;quot; to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00&amp;lt;00:00, 120MB/s]
Reading Data
Training Sample: 2986
Validation Sample: 747
Epoch: 3%|▎ | 1/30 [01:09&amp;lt;33:42, 69.73s/it]
Epoch 1/30
Train Loss: 0.3109, Train Accuracy: 0.86, Train F1: 0.31
Val Loss: 0.3418, Val Accuracy: 0.89, Val F1: 0.00
Epoch: 10%|█ | 3/30 [02:22&amp;lt;20:03, 44.57s/it]
Epoch 3/30
Train Loss: 0.0240, Train Accuracy: 0.99, Train F1: 0.97
Val Loss: 0.2347, Val Accuracy: 0.91, Val F1: 0.35
Epoch: 17%|█▋ | 5/30 [03:34&amp;lt;16:26, 39.48s/it]
Epoch 5/30
Train Loss: 0.0028, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0516, Val Accuracy: 0.98, Val F1: 0.91
Epoch: 23%|██▎ | 7/30 [04:44&amp;lt;14:23, 37.53s/it]
Epoch 7/30
Train Loss: 0.0004, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0246, Val Accuracy: 0.99, Val F1: 0.96
Epoch: 30%|███ | 9/30 [05:57&amp;lt;13:00, 37.17s/it]
Epoch 9/30
Train Loss: 0.0001, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0205, Val Accuracy: 1.00, Val F1: 0.98
Epoch: 37%|███▋ | 11/30 [07:08&amp;lt;11:36, 36.66s/it]
Epoch 11/30
Train Loss: 0.0001, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0209, Val Accuracy: 0.99, Val F1: 0.97
Epoch: 43%|████▎ | 13/30 [08:20&amp;lt;10:23, 36.66s/it]
Epoch 13/30
Train Loss: 0.0001, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0204, Val Accuracy: 0.99, Val F1: 0.96
Epoch: 50%|█████ | 15/30 [09:32&amp;lt;09:07, 36.53s/it]
Epoch 15/30
Train Loss: 0.0001, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0209, Val Accuracy: 0.99, Val F1: 0.97
Epoch: 57%|█████▋ | 17/30 [10:43&amp;lt;07:52, 36.37s/it]
Epoch 17/30
Train Loss: 0.0001, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0198, Val Accuracy: 0.99, Val F1: 0.97
Epoch: 63%|██████▎ | 19/30 [11:59&amp;lt;06:58, 38.02s/it]
Epoch 19/30
Train Loss: 0.0001, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0207, Val Accuracy: 0.99, Val F1: 0.97
Epoch: 70%|███████ | 21/30 [13:10&amp;lt;05:33, 37.06s/it]
Epoch 21/30
Train Loss: 0.0000, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0204, Val Accuracy: 0.99, Val F1: 0.97
Epoch: 77%|███████▋ | 23/30 [14:21&amp;lt;04:16, 36.60s/it]
Epoch 23/30
Train Loss: 0.0000, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0210, Val Accuracy: 0.99, Val F1: 0.96
Epoch: 83%|████████▎ | 25/30 [15:33&amp;lt;03:02, 36.54s/it]
Epoch 25/30
Train Loss: 0.0000, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0209, Val Accuracy: 0.99, Val F1: 0.97
Epoch: 90%|█████████ | 27/30 [16:46&amp;lt;01:49, 36.60s/it]
Epoch 27/30
Train Loss: 0.0000, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0216, Val Accuracy: 0.99, Val F1: 0.97
Epoch: 97%|█████████▋| 29/30 [17:56&amp;lt;00:36, 36.26s/it]
Epoch 29/30
Train Loss: 0.0000, Train Accuracy: 1.00, Train F1: 1.00
Val Loss: 0.0222, Val Accuracy: 0.99, Val F1: 0.97
&lt;/code>&lt;/pre>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="output_98_33.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The highest validation F1 was 0.98 (epoch 9), with a validation accuracy of approximately 1.00. This was an improvement over the baseline model. Comparing the training curves, we also saw that the ResNet18 model converged faster, in terms of epochs, and more smoothly. We then checked its performance on the test data.&lt;/p>
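&lt;p>Choosing which checkpoint to load can be sketched as a small lookup over the logged validation F1 values. The numbers below are copied from the ResNet18 training log above, which only prints every second epoch, so unlogged epochs are not considered. The variable names and the file-name pattern are assumptions modeled on the checkpoint names used in this post.&lt;/p>

```python
# Validation F1 per logged epoch, copied from the ResNet18 training log.
val_f1_by_epoch = {
    1: 0.00, 3: 0.35, 5: 0.91, 7: 0.96, 9: 0.98,
    11: 0.97, 13: 0.96, 15: 0.97, 17: 0.97, 19: 0.97,
    21: 0.97, 23: 0.96, 25: 0.97, 27: 0.97, 29: 0.97,
}

# Pick the epoch with the highest validation F1; on ties, max() keeps the earliest epoch.
best_epoch = max(val_f1_by_epoch, key=val_f1_by_epoch.get)

# Hypothetical file-name pattern, modeled on the checkpoint names used in this post.
checkpoint_name = f"checkpoint_epoch{best_epoch}.pth"
print(best_epoch, val_f1_by_epoch[best_epoch], checkpoint_name)  # 9 0.98 checkpoint_epoch9.pth
```

&lt;p>Preferring the earliest epoch on ties is a design choice in this sketch: it favors the checkpoint with the least training once the validation score stops improving.&lt;/p>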
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># load the best model&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model_resnet&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PreTrainedResNet18&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">location&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39;both/PreRes/lr_0.0001_batch_256/checkpoint_epoch9.pth&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">checkpoint&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">map_location&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;cpu&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model_state_dict&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">checkpoint&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;model_state_dict&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model_resnet&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load_state_dict&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model_state_dict&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Downloading: &amp;quot;https://download.pytorch.org/models/resnet18-f37072fd.pth&amp;quot; to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00&amp;lt;00:00, 77.1MB/s]
&amp;lt;All keys matched successfully&amp;gt;
&lt;/code>&lt;/pre>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># evaluate on the test dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loss&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model_resnet&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">test_loader&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;The F1-Score for Testing Dataset is &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">f1&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;The accuracy for Testing Dataset is &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">accuracy&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>The F1-Score for Testing Dataset is 0.8942586360721817
The accuracy for Testing Dataset is 0.9690193965517241
&lt;/code>&lt;/pre>
&lt;p>While the test accuracies were close (0.969 vs. 0.927), the test F1 of the ResNet18 model was considerably higher (0.894 vs. 0.785), so using ResNet18 helped us achieve a better balance between precision and recall. We therefore chose the ResNet18 model as our final model.&lt;/p>
&lt;h3 id="unseen-samples">Unseen samples&lt;/h3>
&lt;p>After model training, we tested our best model on the unseen samples we had prepared earlier. There were 50 fake samples and 50 real samples.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_dataset&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ImageFolder&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/test&amp;#39;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_loader&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DataLoader&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_dataset&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batch_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">shuffle&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loss&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">f1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">evaluate&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model_resnet&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">test_loader&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;The F1-Score for Testing Dataset is &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">f1&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;The accuracy for Testing Dataset is &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">accuracy&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>The F1-Score for Testing Dataset is 0.8282732447817838
The accuracy for Testing Dataset is 0.8402777777777778
&lt;/code>&lt;/pre>
&lt;p>At this point we tested our final ResNet18 model on new spectrograms. A drop in the F1 score and the accuracy was expected, because the audio clips behind the new spectrograms were generated with a single method that was likely not included in
ASVspoof2019 LA (whose generation methods are labeled anonymously). The F1 score did not drop much (0.828 vs. 0.894) and was still satisfactory (Buhl, 2023).&lt;/p>
&lt;p>We think the model exhibits fairly good performance on unseen samples from a completely new dataset. Although its performance is lower than on the previous test dataset, it&amp;rsquo;s noteworthy that the AI-generated voices originate from different models, use distinct methods, feature various speakers, and differ in audio length compared to our training data. Therefore, we believe our model&amp;rsquo;s performance remains reasonably robust.&lt;/p>
&lt;p>Here is an example of an unseen sample:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#read one image and transform&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">PIL&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Image&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ax&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">figsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/test/0/fake5_10_1.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">label&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">test_image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unsqueeze&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">_&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">predicted&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">max&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model_resnet&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_image&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Predicted: &amp;#39;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;Fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Real&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">predicted&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">item&lt;/span>&lt;span class="p">()]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39; Label: &amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;Fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Real&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/unseen/test/1/speaker1_6_1.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">label&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">test_image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unsqueeze&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">_&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">predicted&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">max&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model_resnet&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_image&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Predicted: &amp;#39;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;Fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Real&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">predicted&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">item&lt;/span>&lt;span class="p">()]&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="s1">&amp;#39; Label: &amp;#39;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;Fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Real&amp;#39;&lt;/span>&lt;span class="p">][&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="output_107_0.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The two spectrograms shown came from two audio clips that we could not classify ourselves by listening, yet the ResNet18 model classified both correctly.&lt;/p>
&lt;h3 id="qualitative-explanation">Qualitative explanation&lt;/h3>
&lt;p>The key reason that even the baseline CNN model achieved a high validation accuracy of 0.98 lies in the spectrograms themselves. We revisited some spectrograms in the training sets from &amp;ldquo;Deep-Voice&amp;rdquo; on Kaggle and ASVspoof2019 LA.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/real&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/fake&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Display five randomly chosen real and fake voice spectrograms&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">seed&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">count&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">choice&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_spectro&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">real_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/real&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">fake_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/kaggle/fake&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Kaggle, real voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Kaggle, fake voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">count&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="output_111_0.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">real_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/train/1&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fake_spectro&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">listdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/train/0&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">seed&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">count&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">choice&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">real_spectro&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">real_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/train/1&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image_path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">fake_spectro&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">mpimg&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/drive/MyDrive/shared-project-folder/LA_Spectrogram/train/0&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">image_path&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;ASVspoof, real voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">axs&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;ASVspoof, fake voice&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">count&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="output_112_0.png" alt="png" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The spectrograms of the fake audio clips showed some obvious visual differences from those of the real ones, including:&lt;/p>
&lt;ol>
&lt;li>certain frequencies being intensified/diminished&lt;/li>
&lt;li>different distances between vertical stripes, i.e. different lengths of silent intervals&lt;/li>
&lt;/ol>
&lt;p>These differences can be captured by a CNN, and even more easily by a ResNet, which incorporates skip connections. However, when we listened to the audio ourselves, we could not hear these differences.&lt;/p>
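&lt;p>As a rough illustration of the first difference, the sketch below averages the energy in each frequency row of two spectrograms and reports the rows where they differ most. It is only a sketch under stated assumptions: the spectrograms are small hand-made 2D lists (rows are frequency bins, columns are time frames), and &lt;code>band_energy&lt;/code> and &lt;code>largest_band_gaps&lt;/code> are hypothetical helper names that are not part of our pipeline.&lt;/p>

```python
def band_energy(spectrogram):
    """Mean energy of each frequency row (rows = frequency bins, columns = time frames)."""
    return [sum(row) / len(row) for row in spectrogram]


def largest_band_gaps(real, fake, top_k=2):
    """Return (row_index, fake_minus_real) for the top_k rows with the largest absolute gap."""
    gaps = [f - r for r, f in zip(band_energy(real), band_energy(fake))]
    ranked = sorted(range(len(gaps)), key=lambda i: abs(gaps[i]), reverse=True)
    return [(i, gaps[i]) for i in ranked[:top_k]]


# Toy data (made up for illustration): 4 frequency bins x 3 time frames.
real = [[1.0, 1.0, 1.0],
        [2.0, 2.0, 2.0],
        [3.0, 3.0, 3.0],
        [1.0, 1.0, 1.0]]
fake = [[1.0, 1.0, 1.0],
        [2.0, 2.0, 2.0],
        [6.0, 6.0, 6.0],   # intensified band
        [0.0, 0.0, 0.0]]   # diminished band

print(largest_band_gaps(real, fake))
```

&lt;p>On this toy input, the intensified row (index 2) and the diminished row (index 3) are reported first. A CNN is not restricted to such row averages, so this only hints at the kind of cue the network may pick up.&lt;/p>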
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>In our project, we encountered several challenges related to datasets.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Initially, we thought finding the desired datasets would be straightforward.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>However, we later found that fewer datasets were available than expected. Some datasets feature only non-human audio, some have an imbalance between real and fake samples, and some provide only fake audio or only real audio.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Another challenge is the length of the audio files; most are under 10 seconds, which means we need to preprocess them to make them suitable for training.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Lastly, limiting our search to English audio reduced our dataset options. Whether non-English audio could be used for our task is an open question that we still need to explore.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The most important lesson we learned is that we should be daring when facing unfamiliar tasks.&lt;/p>
&lt;ul>
&lt;li>When we came up with the idea of building a fake voice detector, we were concerned that we had not worked with audio during the course.&lt;/li>
&lt;li>Using spectrograms, which are images, as the input seemed questionable because we suspected that they might not capture enough features of the audio.&lt;/li>
&lt;li>Nevertheless, the ResNet18 model turned out to perform well.&lt;/li>
&lt;/ul>
&lt;p>It is worth noting that our final model may be vulnerable to adversarial attacks. For instance, it is possible to train a &amp;ldquo;modifier&amp;rdquo; that erases the extra frequencies present in the spectrograms of fake audio, as illustrated in section &amp;lsquo;Unseen Samples&amp;rsquo;, in order to trick our model into misclassifying fake audio as real.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Jordan J. Bird, Ahmad Lotfi. Real-Time Detection of AI-Generated Speech for DeepFake Voice Conversion, 24 Aug 2023. &lt;a href="https://arxiv.org/pdf/2308.12734.pdf" target="_blank" rel="noopener">https://arxiv.org/pdf/2308.12734.pdf&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Yamagishi, Junichi; Todisco, Massimiliano; Sahidullah, Md; Delgado, Héctor; Wang, Xin; Evans, Nicolas; Kinnunen, Tomi; Lee, Kong Aik; Vestman, Ville; Nautsch, Andreas. (2019). ASVspoof 2019: The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge database, [sound]. University of Edinburgh. The Centre for Speech Technology Research (CSTR). &lt;a href="https://doi.org/10.7488/ds/2555" target="_blank" rel="noopener">https://doi.org/10.7488/ds/2555&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Rodríguez, Yohanna; Ballesteros L, Dora Maria; Renza, Diego (2019), “Fake voice recordings (Imitation)”, Mendeley Data, V1, doi: 10.17632/ytkv9w92t6.1&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Nikolaj Buhl. F1 Score in Machine Learning. &lt;em>Encord&lt;/em>, 18 July 2023. &lt;a href="https://encord.com/blog/f1-score-in-machine-learning/" target="_blank" rel="noopener">https://encord.com/blog/f1-score-in-machine-learning/&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol></description></item><item><title>Unfair-ToS: A GPT-Based Unfair Term Of Service Detector</title><link>https://joeliang0520.github.io/project/unfair-tos/</link><pubDate>Fri, 01 Sep 2023 00:00:00 +0000</pubDate><guid>https://joeliang0520.github.io/project/unfair-tos/</guid><description>
&lt;details class="toc-inpage d-print-none " open>
&lt;summary class="font-weight-bold">Table of Contents&lt;/summary>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#introduction">Introduction&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#related-works">Related Works&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#model-architecture">Model Architecture&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#model-showcase">Model Showcase&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#current-work">Current Work&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/details>
&lt;p>The &lt;a href="https://github.com/joeliang0520/Unfair-ToS" target="_blank" rel="noopener">GitHub repo&lt;/a> contains the training datasets, source code, prompts, and other useful analyses for this project.&lt;/p>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h4 id="problems-in-terms-of-service">Problems in Terms of Service&lt;/h4>
&lt;p>Terms of service (ToS) are agreements between service providers and their users. These ToS documents often contain complex legal language that users struggle to understand. Some of their clauses may violate consumer laws, compromise users&amp;rsquo; rights, and raise privacy concerns.&lt;/p>
&lt;h4 id="why-llms">Why LLMs?&lt;/h4>
&lt;p>Large language models (LLMs) have a proven ability to efficiently summarize complex texts, which makes them well suited to addressing the complexity of ToS. Unfair-ToS employs a GPT-based framework to highlight crucial ToS sentences, offer simplified explanations, evaluate their fairness, and give reasons for any terms judged unfair.&lt;/p>
&lt;h2 id="related-works">Related Works&lt;/h2>
&lt;h4 id="claudette">CLAUDETTE&lt;/h4>
&lt;p>Lippi et al. addressed the presence of unfair terms in rarely read contracts in their paper ‘CLAUDETTE: an Automated Detector of Potentially Unfair Clauses in Online Terms of Service’. The authors proposed using machine learning models, including SVM, CNN, and several hybrid models, to identify potentially unfair clauses in ToS. While the classification results from this paper are acceptable, there is room for improvement.&lt;/p>
&lt;p>Building on this foundation and keeping pace with the rapid evolution of machine learning, a more recent paper expanded the dataset to 100 ToS, covering a broader range of service providers. It applied the same annotation scheme as the previous paper and achieved improved classification results using Memory-Augmented Neural Networks.&lt;/p>
&lt;h4 id="current-issues">Current Issues&lt;/h4>
&lt;p>However, in existing approaches, many crucial terms are annotated as fair simply because they do not violate any laws. Users could benefit from being made aware of such terms without having to read through all the fair terms. Furthermore, fine-tuned LLMs, a popular approach in legal studies [7], remain unexplored in the context of unfair term classification. These issues prompted us to design a more comprehensive model.&lt;/p>
&lt;h2 id="model-architecture">Model Architecture&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://github.com/joeliang0520/Unfair-ToS/assets/50597009/9993fc4a-9042-4100-b9ed-93859384475d" alt="Add a little bit of body text (1) (1)" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="important-terms-hightlighting">Important Terms Highlighting&lt;/h4>
&lt;p>This model prompts GPT-4 to highlight important sentences in the cleaned input Terms of Service (ToS) and to generate a simplification for each highlighted sentence. The prompt was refined through prompt engineering on 11 ToS documents comprising approximately 1500 sentences. The sentences that GPT-4 identifies as important serve as input to the text classification model.&lt;/p>
&lt;h4 id="unfair-terms-classification">Unfair Terms Classification&lt;/h4>
&lt;p>Each sentence is tokenized using HuggingFace&amp;rsquo;s predefined &amp;lsquo;GPT2&amp;rsquo; tokenizer and then fed into the pre-trained &amp;lsquo;gpt2&amp;rsquo; model, which has been fine-tuned on a dataset of 100 ToS using the same tokenizer, with inputs padded to the maximum length. The model outputs one of five classes. Its purpose is to assign a label to each highlighted sentence, helping users judge the fairness of each sentence.&lt;/p>
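&lt;p>The padding step can be sketched without any library. This is a minimal illustration, not our training code: &lt;code>pad_to_max_length&lt;/code> is a hypothetical helper, the token ids are made up, and reusing the GPT-2 end-of-text id (50256) as the padding id is an assumption, a common workaround because the GPT-2 tokenizer has no dedicated padding token by default.&lt;/p>

```python
PAD_ID = 50256  # assumption: reuse the GPT-2 end-of-text id as the padding id


def pad_to_max_length(sequences, pad_id=PAD_ID):
    """Right-pad every token-id list to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks


# Made-up token ids for two sentences of different length.
batch = [[10, 11, 12], [20]]
padded, masks = pad_to_max_length(batch)
print(padded)
print(masks)
```

&lt;p>The attention mask lets the model ignore the padded positions, so the shorter sentence is not affected by the filler ids.&lt;/p>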
&lt;h2 id="model-showcase">Model Showcase&lt;/h2>
&lt;h4 id="report">Report&lt;/h4>
&lt;p>&lt;a href="https://joeliang0520.github.io/files/Unfair-ToS.Report.pdf" target="_blank">Please read our report&lt;/a>
to learn more about our project motivation, background information, and model evaluation.&lt;/p>
&lt;h4 id="some-highlight">Some highlights&lt;/h4>
&lt;p>Inspired by Yoon Kim&amp;rsquo;s paper, we implemented a baseline CNN classification model with k1 = 4 and k2 = 4 for fair/unfair classification. However, its F1 score is only 0.193, indicating a bias towards the majority class, which is &amp;lsquo;fair&amp;rsquo;. Consequently, the baseline CNN struggles to identify &amp;lsquo;unfair&amp;rsquo; sentences.&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/joeliang0520/Unfair-ToS/assets/50597009/dbff8063-99d6-4515-a139-88e42efc5f92" alt="drawing" width="900"/>
&lt;/p>
&lt;p>In contrast, our fine-tuned GPT-2 model surpasses the baseline in both metrics. These findings suggest that the GPT-2 model exhibits greater resilience to class imbalance, particularly when &amp;lsquo;fair&amp;rsquo; sentences dominate the corpus, as is often the case.&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/joeliang0520/Unfair-ToS/assets/50597009/9d9a30b3-f4fc-40f3-b49a-e25725edf6a0" alt="drawing" width="700"/>
&lt;/p>
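&lt;p>To see why a classifier biased towards the majority class can look accurate yet score a low F1, consider the toy example below. The confusion counts are invented for illustration and are not our results; &lt;code>f1_score&lt;/code> here is a small hand-written helper, with &amp;lsquo;unfair&amp;rsquo; treated as the positive class.&lt;/p>

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall); 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts: 900 'fair' and 100 'unfair' sentences.
# The model labels almost everything as 'fair'.
tp, fp, fn, tn = 10, 5, 90, 895

accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

&lt;p>With these invented counts, accuracy is 0.905 while F1 is only about 0.174, because recall on the &amp;lsquo;unfair&amp;rsquo; class is just 0.1.&lt;/p>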
&lt;h4 id="prompt">Prompt&lt;/h4>
&lt;p>The model uses GPT-4 through prompts to highlight sentences in the cleaned input ToS and to generate a simplification for each highlighted sentence. The prompt was refined through prompt engineering on 11 ToS documents comprising approximately 3300 sentences.&lt;/p>
&lt;p>We used chain-of-thought prompting to guide GPT-4 in text highlighting, so that the output is tailored to the user group of the specific service provider.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">### Steps ###&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">think&lt;/span> &lt;span class="n">step&lt;/span> &lt;span class="n">by&lt;/span> &lt;span class="n">step&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="mf">1.&lt;/span> &lt;span class="n">who&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="n">the&lt;/span> &lt;span class="n">service&lt;/span> &lt;span class="n">provider&lt;/span>&lt;span class="err">?&lt;/span> &lt;span class="n">who&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="n">its&lt;/span> &lt;span class="n">user&lt;/span> &lt;span class="n">population&lt;/span>&lt;span class="err">?&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="mf">2.&lt;/span> &lt;span class="n">what&lt;/span> &lt;span class="n">should&lt;/span> &lt;span class="n">be&lt;/span> &lt;span class="n">considered&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">important&lt;/span> &lt;span class="n">sentences&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">users&lt;/span> &lt;span class="n">to&lt;/span> &lt;span class="n">read&lt;/span>&lt;span class="err">?&lt;/span> &lt;span class="n">using&lt;/span> &lt;span class="n">the&lt;/span> &lt;span class="n">definition&lt;/span> &lt;span class="n">given&lt;/span> &lt;span class="n">below&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="mf">3.&lt;/span> &lt;span class="n">How&lt;/span> &lt;span class="n">can&lt;/span> &lt;span class="n">you&lt;/span> &lt;span class="n">quantify&lt;/span> &lt;span class="n">the&lt;/span> &lt;span class="n">importance&lt;/span> &lt;span class="n">of&lt;/span> &lt;span class="n">a&lt;/span> &lt;span class="n">sentence&lt;/span> &lt;span class="n">using&lt;/span> &lt;span class="n">this&lt;/span> &lt;span class="n">definition&lt;/span>&lt;span class="err">?&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">*&lt;/span>&lt;span class="mf">4.&lt;/span> &lt;span class="n">What&lt;/span> &lt;span class="n">are&lt;/span> &lt;span class="n">the&lt;/span> &lt;span class="mi">50&lt;/span> &lt;span class="n">most&lt;/span> &lt;span class="n">important&lt;/span> &lt;span class="n">sentences&lt;/span>&lt;span class="err">?&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="current-work">Current Work&lt;/h2>
&lt;p>We are currently developing a graphical user interface (GUI) to showcase the capabilities of our model. An early beta version is available in the Application folder. It lets users upload a txt or csv file, or copy and paste a Terms of Service (ToS) document, and then applies our large language model (LLM) prompts with a choice of models, using your personal OpenAI API key, to produce highlighted text. The fair/unfair classification feature will be integrated into the GUI in a future release.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://github.com/joeliang0520/Unfair-ToS/assets/50597009/57137b42-7cf6-4219-bfbc-0f27a09a491b" alt="Screenshot 2024-01-14 at 11 31 06 PM" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To use this GUI, clone this repository to your local machine and install any missing packages. Then run the &amp;lsquo;GUI.py&amp;rsquo; file to start the application.&lt;/p>
&lt;p>We have provided a demo Terms of Service (ToS) document for you to experiment with using our GUI. Please click the &amp;ldquo;SHOW DEMO&amp;rdquo; button for more information about this demonstration. If you wish to highlight your own ToS document, you must obtain an OpenAI API key (with sufficient credits based on the length of the ToS and the selected model) and upload it using the SETTING function.&lt;/p>
&lt;p>&lt;strong>Note: This version is an early beta release. Please use it at your own risk. We are not liable for any costs or information leaks incurred while using our services.&lt;/strong>&lt;/p>
&lt;p>We also welcome any contributions to assist us in completing the front-end design!&lt;/p></description></item><item><title>Text Classification and Data Analysis on Cryptocurrency Related Tweets in PySpark Environment</title><link>https://joeliang0520.github.io/project/fiancial/</link><pubDate>Sat, 01 Jan 2022 00:00:00 +0000</pubDate><guid>https://joeliang0520.github.io/project/fiancial/</guid><description>&lt;h2 id="disclaimer-and-background">Disclaimer and Background&lt;/h2>
&lt;p>This project is an improvement on the final project for the upper-year CS
course &amp;quot;Data-Intensive Distributed Analytics&amp;quot; at the University of
Waterloo by &lt;a href="https://github.com/hughyyyy" target="_blank" rel="noopener">Hugh Chung&lt;/a>, &lt;a href="https://github.com/JOeOJ520" target="_blank" rel="noopener">Joe
Liang&lt;/a>, and &lt;a href="https://github.com/Shawn-Personal" target="_blank" rel="noopener">Shawn
Li&lt;/a>. The code for setting up the
PySpark environment in this project is credited to &lt;a href="https://cs.uwaterloo.ca/~a2abedi/" target="_blank" rel="noopener">Ali
Abedi&lt;/a>, the course instructor in Winter
2022.&lt;/p>
&lt;p>The data in this project comes from the Kaggle dataset &amp;quot;Bitcoin Tweets&amp;quot;, released under the
CC0: Public Domain license. It includes tweets that have #Bitcoin
and #btc hashtags from 2016. Additional information about this dataset
can be found
&lt;a href="https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>&lt;a href="https://en.wikipedia.org/wiki/Cryptocurrency" target="_blank" rel="noopener">Cryptocurrency&lt;/a> has become a
popular topic on social media and in financial markets. On 30 November
2020, bitcoin hit a new all-time high of $19,860. NLP analysis of
cryptocurrency-related posts on social media is therefore an interesting
area of study.&lt;/p>
&lt;p>The goal of this project is to demonstrate the use of PySpark
and big data computing for text data analysis and supervised learning,
specifically tweet text classification, and to use the trained model to construct an
automatic hashtagging system for incoming tweets.&lt;/p>
&lt;p>The project is built mainly on
&lt;a href="https://spark.apache.org/docs/latest/api/python/#:~:text=PySpark%20is%20an%20interface%20for,data%20in%20a%20distributed%20environment" target="_blank" rel="noopener">PySpark&lt;/a>
with its RDD and DataFrame interfaces. In addition, &lt;a href="https://keras.io/" target="_blank" rel="noopener">Keras&lt;/a>
in TensorFlow, together with &lt;a href="https://pandas.pydata.org/" target="_blank" rel="noopener">Pandas&lt;/a>, is used to train
neural network models.&lt;/p>
&lt;h2 id="pyspark-environment">PySpark Environment&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">shutil&lt;/span>&lt;span class="o">,&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">isdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;CryptoTweets&amp;#39;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">shutil&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rmtree&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;CryptoTweets&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span> &lt;span class="n">git&lt;/span> &lt;span class="n">clone&lt;/span> &lt;span class="n">https&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="o">//&lt;/span>&lt;span class="n">github&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">com&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">JOeOJ520&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">CryptoTweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">git&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To get started, let's initialize Spark.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">apt&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">get&lt;/span> &lt;span class="n">update&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">qq&lt;/span> &lt;span class="o">&amp;gt;&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">dev&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">null&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">apt&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">get&lt;/span> &lt;span class="n">install&lt;/span> &lt;span class="n">openjdk&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">8&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">jdk&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">headless&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">qq&lt;/span> &lt;span class="o">&amp;gt;&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">dev&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">null&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">wget&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">q&lt;/span> &lt;span class="n">https&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="o">//&lt;/span>&lt;span class="n">downloads&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">apache&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">org&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">spark&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">spark&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mf">2.4.8&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">spark&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mf">2.4.8&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="nb">bin&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">hadoop2&lt;/span>&lt;span class="mf">.7&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tgz&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">tar&lt;/span> &lt;span class="n">xf&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mf">2.4.8&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="nb">bin&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">hadoop2&lt;/span>&lt;span class="mf">.7&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tgz&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">pip&lt;/span> &lt;span class="n">install&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">q&lt;/span> &lt;span class="n">findspark&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">tar&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">xzf&lt;/span> &lt;span class="n">CryptoTweets&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tgz&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># install required packages&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">pip&lt;/span> &lt;span class="n">install&lt;/span> &lt;span class="n">pycountry&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">pip&lt;/span> &lt;span class="n">install&lt;/span> &lt;span class="n">pyecharts&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To use Spark SQL and the DataFrame interface, create a &lt;code>SparkSession&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;JAVA_HOME&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;/usr/lib/jvm/java-8-openjdk-amd64&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">environ&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;SPARK_HOME&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;/content/spark-2.4.8-bin-hadoop2.7&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">findspark&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">findspark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">init&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">SparkSession&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">random&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">spark&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">SparkSession&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">builder&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">appName&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;YourTest&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">master&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;local[2]&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">config&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;spark.ui.port&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">random&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randrange&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">4000&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">5000&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">config&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;spark.driver.memory&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;9g&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">getOrCreate&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="data-preprocessing">Data Preprocessing&lt;/h4>
&lt;p>The Bitcoin_tweets.csv file contains 1.09 GB of tweets about bitcoin
and cryptocurrencies. The section below requires a
&lt;a href="https://www.kaggle.com/docs/api" target="_blank" rel="noopener">kaggle.json&lt;/a> file for authentication
in order to download the dataset.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">uploaded&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">files&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">upload&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Upload kaggle account verification&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">fn&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">uploaded&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">keys&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;User uploaded file &amp;#34;&lt;/span>&lt;span class="si">{name}&lt;/span>&lt;span class="s1">&amp;#34; with length &lt;/span>&lt;span class="si">{length}&lt;/span>&lt;span class="s1"> bytes&amp;#39;&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">format&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">fn&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">length&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">uploaded&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">fn&lt;/span>&lt;span class="p">])))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Then move kaggle.json into the folder where the API expects to find it.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">mkdir&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">p&lt;/span> &lt;span class="o">~/.&lt;/span>&lt;span class="n">kaggle&lt;/span>&lt;span class="o">/&lt;/span> &lt;span class="o">&amp;amp;&amp;amp;&lt;/span> &lt;span class="n">mv&lt;/span> &lt;span class="n">kaggle&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">json&lt;/span> &lt;span class="o">~/.&lt;/span>&lt;span class="n">kaggle&lt;/span>&lt;span class="o">/&lt;/span> &lt;span class="o">&amp;amp;&amp;amp;&lt;/span> &lt;span class="n">chmod&lt;/span> &lt;span class="mi">600&lt;/span> &lt;span class="o">~/.&lt;/span>&lt;span class="n">kaggle&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">kaggle&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">json&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Download dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">kaggle&lt;/span> &lt;span class="n">datasets&lt;/span> &lt;span class="n">download&lt;/span> &lt;span class="s2">&amp;#34;kaushiksuresh147/bitcoin-tweets&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#unzips&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">unzip&lt;/span> &lt;span class="n">bitcoin&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">zip&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We use the PySpark SQL interface to load Bitcoin_tweets.csv into a DataFrame.
The first 10 rows of the dataset are shown below.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Read the csv and construct pyspark dataframe&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets_raw&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">format&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;csv&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">option&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;header&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;true&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Bitcoin_tweets.csv&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Register a temp view named tweet for the Spark SQL queries below&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets_raw&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">createOrReplaceTempView&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;tweet&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#better visual&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets_raw&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">limit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">toPandas&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div> &lt;div id="df-b2e9cd57-463a-465a-8be9-3cf7b172dc6c">
&lt;div class="colab-df-container">
&lt;div>
&lt;table border="1" class="dataframe">
&lt;thead>
&lt;tr style="text-align: right;">
&lt;th>&lt;/th>
&lt;th>user_name&lt;/th>
&lt;th>user_location&lt;/th>
&lt;th>user_description&lt;/th>
&lt;th>user_created&lt;/th>
&lt;th>user_followers&lt;/th>
&lt;th>user_friends&lt;/th>
&lt;th>user_favourites&lt;/th>
&lt;th>user_verified&lt;/th>
&lt;th>date&lt;/th>
&lt;th>text&lt;/th>
&lt;th>hashtags&lt;/th>
&lt;th>source&lt;/th>
&lt;th>is_retweet&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;th>0&lt;/th>
&lt;td>DeSota Wilson&lt;/td>
&lt;td>Atlanta, GA&lt;/td>
&lt;td>Biz Consultant, real estate, fintech, startups...&lt;/td>
&lt;td>2009-04-26 20:05:09&lt;/td>
&lt;td>8534.0&lt;/td>
&lt;td>7605&lt;/td>
&lt;td>4838&lt;/td>
&lt;td>False&lt;/td>
&lt;td>2021-02-10 23:59:04&lt;/td>
&lt;td>Blue Ridge Bank shares halted by NYSE after #b...&lt;/td>
&lt;td>['bitcoin']&lt;/td>
&lt;td>Twitter Web App&lt;/td>
&lt;td>False&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>1&lt;/th>
&lt;td>CryptoND&lt;/td>
&lt;td>None&lt;/td>
&lt;td>😎 BITCOINLIVE is a Dutch platform aimed at inf...&lt;/td>
&lt;td>2019-10-17 20:12:10&lt;/td>
&lt;td>6769.0&lt;/td>
&lt;td>1532&lt;/td>
&lt;td>25483&lt;/td>
&lt;td>False&lt;/td>
&lt;td>2021-02-10 23:58:48&lt;/td>
&lt;td>"😎 Today, that's this #Thursday, we will do a ...&lt;/td>
&lt;td>#Btc #wallet #security expe… https://t.co/go6...&lt;/td>
&lt;td>['Thursday', 'Btc', 'wallet', 'security']&lt;/td>
&lt;td>Twitter for Android&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>2&lt;/th>
&lt;td>Tdlmatias&lt;/td>
&lt;td>London, England&lt;/td>
&lt;td>IM Academy : The best #forex, #SelfEducation, ...&lt;/td>
&lt;td>2014-11-10 10:50:37&lt;/td>
&lt;td>128.0&lt;/td>
&lt;td>332&lt;/td>
&lt;td>924&lt;/td>
&lt;td>False&lt;/td>
&lt;td>2021-02-10 23:54:48&lt;/td>
&lt;td>Guys evening, I have read this article about B...&lt;/td>
&lt;td>None&lt;/td>
&lt;td>Twitter Web App&lt;/td>
&lt;td>False&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>3&lt;/th>
&lt;td>Crypto is the future&lt;/td>
&lt;td>None&lt;/td>
&lt;td>I will post a lot of buying signals for BTC tr...&lt;/td>
&lt;td>2019-09-28 16:48:12&lt;/td>
&lt;td>625.0&lt;/td>
&lt;td>129&lt;/td>
&lt;td>14&lt;/td>
&lt;td>False&lt;/td>
&lt;td>2021-02-10 23:54:33&lt;/td>
&lt;td>$BTC A big chance in a billion! Price: \487264...&lt;/td>
&lt;td>['Bitcoin', 'FX', 'BTC', 'crypto']&lt;/td>
&lt;td>dlvr.it&lt;/td>
&lt;td>False&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>4&lt;/th>
&lt;td>Alex Kirchmaier 🇦🇹🇸🇪 #FactsSuperspreader&lt;/td>
&lt;td>Europa&lt;/td>
&lt;td>Co-founder @RENJERJerky | Forbes 30Under30 | I...&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>5&lt;/th>
&lt;td>#Bitcoin"&lt;/td>
&lt;td>2016-02-03 13:15:55&lt;/td>
&lt;td>1249.0&lt;/td>
&lt;td>1472&lt;/td>
&lt;td>10482&lt;/td>
&lt;td>False&lt;/td>
&lt;td>2021-02-10 23:54:06&lt;/td>
&lt;td>This network is secured by 9 508 nodes as of t...&lt;/td>
&lt;td>['BTC']&lt;/td>
&lt;td>Twitter Web App&lt;/td>
&lt;td>False&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>6&lt;/th>
&lt;td>ZerrBenz™ ⚔ ✪ 20732&lt;/td>
&lt;td>Bkk, Thailand&lt;/td>
&lt;td>I'm a cat slave 🐱 Interested in Blockchain · T...&lt;/td>
&lt;td>2010-01-12 07:00:04&lt;/td>
&lt;td>742.0&lt;/td>
&lt;td>716&lt;/td>
&lt;td>2444&lt;/td>
&lt;td>False&lt;/td>
&lt;td>2021-02-10 23:53:30&lt;/td>
&lt;td>💹 Trade #Crypto on #Binance&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>7&lt;/th>
&lt;td>📌 Enjoy #Cashback 10% of the Trading fee&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>8&lt;/th>
&lt;td>📌 Sign up link 👉 https://t.co/T4WttWeohc… http...&lt;/td>
&lt;td>['Crypto', 'Binance', 'Cashback']&lt;/td>
&lt;td>Twitter Web App&lt;/td>
&lt;td>False&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>9&lt;/th>
&lt;td>Bitcoin-Bot&lt;/td>
&lt;td>Florida, USA&lt;/td>
&lt;td>Bot to generate Bitcoin picture as combination...&lt;/td>
&lt;td>2019-12-23 16:49:16&lt;/td>
&lt;td>131.0&lt;/td>
&lt;td>84&lt;/td>
&lt;td>5728&lt;/td>
&lt;td>False&lt;/td>
&lt;td>2021-02-10 23:53:17&lt;/td>
&lt;td>&amp;amp;lt;'fire' &amp;amp;amp; 'man'&amp;amp;gt;&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;td>None&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;/div>
&lt;/div>
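&lt;p>Several rows in the preview above look shifted, with tweet text appearing in the user_name column. One plausible cause, which has not been confirmed for this dataset, is that some quoted fields contain line breaks, so a reader that treats every physical line as a record splits one tweet across several rows. The sketch below reproduces the effect on a made-up three-column CSV using only the Python standard library.&lt;/p>

```python
import csv
import io

# Hypothetical two-record CSV: the second record's text field contains a line break.
raw = (
    'user_name,text,hashtags\n'
    'alice,"Single line tweet",btc\n'
    'bob,"First line\nSecond line #Bitcoin",Bitcoin\n'
)

# Naive approach: treat every physical line after the header as one record.
naive_rows = [line.split(",") for line in raw.splitlines()[1:]]

# Quote-aware approach: the csv module keeps quoted line breaks inside the field.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

print(len(naive_rows))   # 3
print(len(parsed_rows))  # 2
print(parsed_rows[1][1])
```

&lt;p>Spark&amp;rsquo;s CSV reader has a &lt;code>multiLine&lt;/code> option for records that span several lines. It may be worth trying here, although it can slow down reading of large files.&lt;/p>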
&lt;p>The raw dataset has 11176654 rows and 13 columns. This
sample size exceeds what our project goal needs and could lead to an
extremely high computational cost.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="n">tweets_raw&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">(),&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets_raw&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">columns&lt;/span>&lt;span class="p">)))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Therefore, we clean up the DataFrame by removing samples with missing
values and performing type conversion on multiple columns.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">pyspark.sql.functions&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql.types&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">IntegerType&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Remove all null rows&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">spark&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sql&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;SELECT * FROM tweet WHERE user_name != &amp;#39;null&amp;#39; AND user_description != &amp;#39;null&amp;#39; &lt;/span>&lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="s2">AND user_location != &amp;#39;null&amp;#39; AND user_created != &amp;#39;null&amp;#39; AND user_followers != &amp;#39;null&amp;#39; AND user_friends != &amp;#39;null&amp;#39; &lt;/span>&lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="s2">AND user_favourites != &amp;#39;null&amp;#39; AND user_verified != &amp;#39;null&amp;#39; AND date != &amp;#39;null&amp;#39; AND text != &amp;#39;null&amp;#39; AND hashtags != &amp;#39;null&amp;#39; &lt;/span>&lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="s2">AND source != &amp;#39;null&amp;#39; AND is_retweet != &amp;#39;null&amp;#39;&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Convert string into datetime for date col&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;date&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to_timestamp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;date&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;yyyy-MM-dd&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;user_created&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to_timestamp&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;user_created&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;yyyy-MM-dd&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Convert string to Number/Int&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;user_followers&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;user_followers&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cast&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">IntegerType&lt;/span>&lt;span class="p">()))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;user_friends&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;user_friends&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cast&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">IntegerType&lt;/span>&lt;span class="p">()))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;user_favourites&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;user_favourites&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cast&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">IntegerType&lt;/span>&lt;span class="p">()))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Remove rows where a conversion produced null&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;date is not NULL AND user_created is not NULL &lt;/span>&lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="s2">AND user_followers is not NULL AND user_friends is not NULL&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Tweets left after cleaning&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets_left&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Total tweets before cleaning&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets_total&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets_raw&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;The total number of tweets is: &amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets_total&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;The total number of tweets after cleaning the data types is &amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets_left&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Percentage of tweets removed: &amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">tweets_left&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">tweets_total&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">The total number of tweets is: 11804338
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">The total number of tweets after cleaning the data types is 497152
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Percentage of tweets removed: 0.9578839575755964
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="explanatory-data-analysis">Exploratory Data Analysis&lt;/h2>
&lt;h3 id="categorical-data">Categorical Data&lt;/h3>
&lt;p>We standardize the &amp;quot;user_location&amp;quot; column into countries using the
&amp;quot;pycountry&amp;quot; package. We then use the PySpark RDD interface to compute,
in parallel, the frequency of each value in &amp;quot;hashtags&amp;quot;,
&amp;quot;user_location&amp;quot;, &amp;quot;source&amp;quot;, &amp;quot;user_verified&amp;quot;, and &amp;quot;user_name&amp;quot;.&lt;/p>
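&lt;p>The helper functions imported from otherstr.py are not reproduced in this post. As a rough illustration only, the same map-then-reduce-by-key pattern can be sketched in plain Python. The alias table COUNTRY_ALIASES, the functions to_country and count_by_key, and the sample locations are all hypothetical stand-ins (the project itself relies on pycountry and Spark RDDs).&lt;/p>

```python
from collections import defaultdict

# Hypothetical lookup standing in for pycountry-based standardization.
COUNTRY_ALIASES = {
    "usa": "United States",
    "new york, usa": "United States",
    "london": "United Kingdom",
    "toronto, canada": "Canada",
}

def to_country(raw_location):
    """Map a free-text location to a country name, or 'others' if unknown."""
    return COUNTRY_ALIASES.get(raw_location.strip().lower(), "others")

def count_by_key(pairs):
    """Sum the values for each key, like a reduce-by-key step."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

raw_locations = ["USA", "London", "the moon", "New York, USA", "Toronto, Canada", "?"]

# Map step: emit (country, 1) for every record.
pairs = [(to_country(loc), 1) for loc in raw_locations]

# Reduce step, then sort by descending frequency (ties broken by name).
frequencies = sorted(count_by_key(pairs).items(), key=lambda kv: (-kv[1], kv[0]))
print(frequencies)
```

&lt;p>Locations that cannot be matched fall into an 'others' bucket, which is why that bucket can dominate the counts when the location field is free text.&lt;/p>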
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">nltk.stem&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">PorterStemmer&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">CryptoTweets.otherstr&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="o">*&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">CryptoTweets.simple_tokenize&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">simple_tokenize&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Functions can be found in otherstr.py file in Github&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># sources&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sor&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">source&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Hashtags&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">hashs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">hashtags&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># verification&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">verified&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">user_verified&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># user name&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">user_name&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Locations&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loca&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">user_location&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here is an example of the results from the above calculations. The
number indicates the frequency of each category in the dataset.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">loca&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[(&amp;#39;others&amp;#39;, 397175),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;United States&amp;#39;, 13592),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;United Kingdom&amp;#39;, 9272),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;Canada&amp;#39;, 8544),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;India&amp;#39;, 7791),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;Australia&amp;#39;, 4415),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;Bangladesh&amp;#39;, 3545),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;South Africa&amp;#39;, 3314),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;Niger&amp;#39;, 3206),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> (&amp;#39;France&amp;#39;, 2697)]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plots for tweet source&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x_t&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_t&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">sor&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ax1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">explode&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mf">0.05&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">colors&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;#8B0000&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;#db6777&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;#e6d1d4&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;#e8dcde&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pie&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y_t&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">x_t&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">colors&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">colors&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">autopct&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">%1.1f%%&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">startangle&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">pctdistance&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">0.85&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">explode&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">explode&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#draw circle&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">centre_circle&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Circle&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mf">0.70&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">fc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;white&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gcf&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gca&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">add_artist&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">centre_circle&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Equal aspect ratio ensures that pie is drawn as a circle&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;equal&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tight_layout&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/94f94f3ec97494d4240cb34f7c7dbca1465bb05f_hu407d309b237934210dee6430b16c6207_15626_27a9bc2419fdbb3b4b5cd81daa2e6bde.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/94f94f3ec97494d4240cb34f7c7dbca1465bb05f_hu407d309b237934210dee6430b16c6207_15626_debf1e3fd929d2c79442dbbd4841190f.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/94f94f3ec97494d4240cb34f7c7dbca1465bb05f_hu407d309b237934210dee6430b16c6207_15626_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/94f94f3ec97494d4240cb34f7c7dbca1465bb05f_hu407d309b237934210dee6430b16c6207_15626_27a9bc2419fdbb3b4b5cd81daa2e6bde.webp"
width="438"
height="280"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>User Location&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Importing required library&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyecharts.charts&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Bar&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyecharts&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">options&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">opts&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Obtaining x and y axis from Location lists&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x_hash&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_hash&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">loca&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bar&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">init_opts&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">opts&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">InitOpts&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">add_xaxis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">add_yaxis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Frequency&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">set_global_opts&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">title_opts&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">opts&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TitleOpts&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Top 10 User Locations&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">subtitle&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;standardized, with others removed&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bar&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">render_notebook&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rotation&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">horizontalalignment&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;right&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontweight&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;light&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;x-large&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">y_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c73d663b202d823f8e6c4cdbec30fcc26b4da314_huaa831178ad7e184e173506b8dd838f38_16771_e71bc028ea0e20cef427dce9ae4f7fa6.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c73d663b202d823f8e6c4cdbec30fcc26b4da314_huaa831178ad7e184e173506b8dd838f38_16771_f38019332ac20b41303a212972d4422c.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c73d663b202d823f8e6c4cdbec30fcc26b4da314_huaa831178ad7e184e173506b8dd838f38_16771_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c73d663b202d823f8e6c4cdbec30fcc26b4da314_huaa831178ad7e184e173506b8dd838f38_16771_e71bc028ea0e20cef427dce9ae4f7fa6.webp"
width="399"
height="330"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
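&lt;p>The location plots above unpack the (country, count) pairs with zip(*...) and slice from index 1 so that the leading 'others' bucket is left out. A minimal sketch of that step, using only the ten pairs printed earlier for loca[:10]; the variable names here are illustrative, and the shares are relative to the nine named countries listed, not to the whole dataset.&lt;/p>

```python
# The ten (country, count) pairs shown earlier for loca[:10].
loca_top = [
    ("others", 397175), ("United States", 13592), ("United Kingdom", 9272),
    ("Canada", 8544), ("India", 7791), ("Australia", 4415),
    ("Bangladesh", 3545), ("South Africa", 3314), ("Niger", 3206),
    ("France", 2697),
]

# zip(*pairs) transposes the list of pairs into a tuple of names and a tuple of counts.
names, counts = zip(*loca_top)

# Index 0 is the 'others' bucket, so slicing from 1 keeps only named countries.
named, named_counts = names[1:], counts[1:]

# Share of each named country among the named countries listed here.
total_named = sum(named_counts)
shares = {name: count / total_named for name, count in zip(named, named_counts)}
print(named[0], round(shares[named[0]], 3))
```

&lt;p>Because only ten pairs are available in this sketch, the slice yields nine countries; on the full list, a slice such as [1:11] would yield ten.&lt;/p>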
&lt;p>User Verification&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plots for verified&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x_t&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_t&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">verified&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ax1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">explode&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">colors&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;#6067e0&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;#fc0303&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pie&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y_t&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">x_t&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">colors&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">colors&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">explode&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">explode&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">autopct&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">%1.1f%%&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">shadow&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">startangle&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;equal&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Equal aspect ratio ensures that pie is drawn as a circle.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#draw circle&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">centre_circle&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Circle&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mf">0.70&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">fc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;white&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gcf&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gca&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">add_artist&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">centre_circle&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tight_layout&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/ba3b043c565f58fc685cdf7639f4c7e0996b6551_hu69340c523ab22ef82936cecd7fe5f0bc_14527_d2a72d3fff25f9dfd0eec22f47956945.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/ba3b043c565f58fc685cdf7639f4c7e0996b6551_hu69340c523ab22ef82936cecd7fe5f0bc_14527_40218927276072ed5486bdfcd6c48b91.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/ba3b043c565f58fc685cdf7639f4c7e0996b6551_hu69340c523ab22ef82936cecd7fe5f0bc_14527_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/ba3b043c565f58fc685cdf7639f4c7e0996b6551_hu69340c523ab22ef82936cecd7fe5f0bc_14527_d2a72d3fff25f9dfd0eec22f47956945.webp"
width="424"
height="290"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Hashtag (Response Variable)&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Counting the frequency of each hashtags&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x_hash&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_hash&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">hashs&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bar&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">init_opts&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">opts&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">InitOpts&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">add_xaxis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">add_yaxis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Frequency&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">set_global_opts&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">title_opts&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">opts&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TitleOpts&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Top 10 Hashtags in the Tweets&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bar&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">render_notebook&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;script>
require.config({
paths: {
'echarts':'https://assets.pyecharts.org/assets/echarts.min'
}
});
&lt;/script>
&lt;div id="8d97781c87bf495ca55aa7e8b30493a5" style="width:900px; height:500px;">&lt;/div>
&lt;script>
require(['echarts'], function(echarts) {
var chart_8d97781c87bf495ca55aa7e8b30493a5 = echarts.init(
document.getElementById('8d97781c87bf495ca55aa7e8b30493a5'), 'white', {renderer: 'canvas'});
var option_8d97781c87bf495ca55aa7e8b30493a5 = {
"animation": true,
"animationThreshold": 2000,
"animationDuration": 1000,
"animationEasing": "cubicOut",
"animationDelay": 0,
"animationDurationUpdate": 300,
"animationEasingUpdate": "cubicOut",
"animationDelayUpdate": 0,
"color": [
"#c23531",
"#2f4554",
"#61a0a8",
"#d48265",
"#749f83",
"#ca8622",
"#bda29a",
"#6e7074",
"#546570",
"#c4ccd3",
"#f05b72",
"#ef5b9c",
"#f47920",
"#905a3d",
"#fab27b",
"#2a5caa",
"#444693",
"#726930",
"#b2d235",
"#6d8346",
"#ac6767",
"#1d953f",
"#6950a1",
"#918597"
],
"series": [
{
"type": "bar",
"name": "Frequency",
"legendHoverLink": true,
"data": [
171516,
76007,
41042,
33246,
22127,
20762,
18837,
17877,
14717,
12045
],
"showBackground": false,
"barMinHeight": 0,
"barCategoryGap": "20%",
"barGap": "30%",
"large": false,
"largeThreshold": 400,
"seriesLayoutBy": "column",
"datasetIndex": 0,
"clip": true,
"zlevel": 0,
"z": 2,
"label": {
"show": true,
"position": "top",
"margin": 8
}
}
],
"legend": [
{
"data": [
"Frequency"
],
"selected": {
"Frequency": true
},
"show": true,
"padding": 5,
"itemGap": 10,
"itemWidth": 25,
"itemHeight": 14
}
],
"tooltip": {
"show": true,
"trigger": "item",
"triggerOn": "mousemove|click",
"axisPointer": {
"type": "line"
},
"showContent": true,
"alwaysShowContent": false,
"showDelay": 0,
"hideDelay": 100,
"textStyle": {
"fontSize": 14
},
"borderWidth": 0,
"padding": 5
},
"xAxis": [
{
"show": true,
"scale": false,
"nameLocation": "end",
"nameGap": 15,
"gridIndex": 0,
"inverse": false,
"offset": 0,
"splitNumber": 5,
"minInterval": 0,
"splitLine": {
"show": false,
"lineStyle": {
"show": true,
"width": 1,
"opacity": 1,
"curveness": 0,
"type": "solid"
}
},
"data": [
"cryptocurrency",
"etherenum",
"dogecoin",
"binanc",
"nft",
"blockchain",
"gift",
"shop",
"altcoin",
"affiliatemarket"
]
}
],
"yAxis": [
{
"show": true,
"scale": false,
"nameLocation": "end",
"nameGap": 15,
"gridIndex": 0,
"inverse": false,
"offset": 0,
"splitNumber": 5,
"minInterval": 0,
"splitLine": {
"show": false,
"lineStyle": {
"show": true,
"width": 1,
"opacity": 1,
"curveness": 0,
"type": "solid"
}
}
}
],
"title": [
{
"text": "Top 10 Hashtags in the Tweets",
"padding": 5,
"itemGap": 10
}
]
};
chart_8d97781c87bf495ca55aa7e8b30493a5.setOption(option_8d97781c87bf495ca55aa7e8b30493a5);
});
&lt;/script>
&lt;p>A non-interactive version of the plot, in case the interactive plot above fails to load:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rotation&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">horizontalalignment&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;right&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontweight&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;light&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;x-large&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">y_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">11&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/d2f8d3958ddd53c890092783c5bcb751665226b0_hu35d947dd22ff35783b04b20f58c2a41c_15271_17933579cae1a62b3e01f9ccc44bf1bb.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/d2f8d3958ddd53c890092783c5bcb751665226b0_hu35d947dd22ff35783b04b20f58c2a41c_15271_a40248da1b8ccda197a6ac604573e05a.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/d2f8d3958ddd53c890092783c5bcb751665226b0_hu35d947dd22ff35783b04b20f58c2a41c_15271_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/d2f8d3958ddd53c890092783c5bcb751665226b0_hu35d947dd22ff35783b04b20f58c2a41c_15271_17933579cae1a62b3e01f9ccc44bf1bb.webp"
width="406"
height="325"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We decided to use four of the most frequent hashtags, together with
&amp;quot;bitcoin&amp;quot;, as the five response classes for supervised text
classification. The goal is therefore to classify each tweet into one of
these five categories using the models trained in later sections.&lt;/p>
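&lt;p>As a rough illustration of how a tweet could be mapped to one of the five classes, the sketch below assigns the first matching target hashtag. The label list (the name TARGET_LABELS, with "bitcoin" plus the first four hashtags from the chart above) and the first-match rule in assign_label are assumptions made for this example, not necessarily the labelling used later in the project.&lt;/p>

```python
# Hypothetical label set: "bitcoin" plus the first four hashtags in the chart above.
TARGET_LABELS = ["bitcoin", "cryptocurrency", "etherenum", "dogecoin", "binanc"]


def assign_label(hashtags, labels=TARGET_LABELS):
    """Return the first target label found among a tweet's hashtags, else None."""
    tags = {tag.lower() for tag in hashtags}
    for label in labels:
        if label in tags:
            return label
    return None


print(assign_label(["NFT", "Dogecoin"]))  # dogecoin
print(assign_label(["gift", "shop"]))     # None
```

&lt;p>Tweets that return None carry none of the five target hashtags and would be left out of the supervised training set under this assumption.&lt;/p>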
&lt;h3 id="numerical-variables">Numerical Variables&lt;/h3>
&lt;p>Next, we summarize the numerical variables in the dataset, such as
&amp;quot;Post date&amp;quot;, &amp;quot;user created date&amp;quot;, &amp;quot;Number of followers&amp;quot;, and
others, by computing how often each value occurs in the samples.&lt;/p>
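&lt;p>The frequency counts in this section follow one pattern: emit a (value, 1) pair per row, sum the pairs by key, and sort by key. A minimal plain-Python sketch of the same idea is shown here; the helper name value_counts and the sample dates are made up for illustration and do not come from the dataset.&lt;/p>

```python
from collections import Counter


def value_counts(values):
    """Return (value, count) pairs sorted by value."""
    return sorted(Counter(values).items())


# Hypothetical sample of post dates (ISO strings sort chronologically).
sample_dates = ["2021-07-02", "2021-07-01", "2021-07-02", "2021-07-03", "2021-07-02"]
date_count = value_counts(sample_dates)

# Split the pairs into x values (dates) and y values (counts) for plotting.
date_x, date_y = zip(*date_count)
print(date_count)
```

&lt;p>On a dataset that fits in memory this gives the same pairs as the Spark version; the RDD pipeline is what lets the count scale to the full tweet collection.&lt;/p>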
&lt;p>Number of tweets over the most recent two years&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">date_count&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;date&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">flatMap&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">row&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[(&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)])&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">reduceByKey&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sortBy&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">date_x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">date_y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">date_count&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">figure&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">figsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">date_x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">date_y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Total tweets by Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xlabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Number of Tweets&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/1ebd175881907a5c2c59f13c27c331a9f99dd7a3_huf015f474bdcfcaa2af151deed52a230b_45293_02e96aa35ce92a6488b3b7be80d9f9a1.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/1ebd175881907a5c2c59f13c27c331a9f99dd7a3_huf015f474bdcfcaa2af151deed52a230b_45293_9ee866413b3d5a248bc429d70e967ce8.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/1ebd175881907a5c2c59f13c27c331a9f99dd7a3_huf015f474bdcfcaa2af151deed52a230b_45293_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/1ebd175881907a5c2c59f13c27c331a9f99dd7a3_huf015f474bdcfcaa2af151deed52a230b_45293_02e96aa35ce92a6488b3b7be80d9f9a1.webp"
width="760"
height="280"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We also performed a time series decomposition on the tweet counts to
check for possible seasonal patterns and trends.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">pandas&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">pd&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">statsmodels.tsa.seasonal&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">seasonal_decompose&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">series&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">DataFrame&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">date_count&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">result&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">seasonal_decompose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">series&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;additive&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">period&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">12&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">result&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c90e00cf66340d15316855182c2cc0d8c31f5605_hu056c82ec8373415c127c83c85d995d9f_43517_e0a87b6b08b239f4a3a99e1a743e9d68.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c90e00cf66340d15316855182c2cc0d8c31f5605_hu056c82ec8373415c127c83c85d995d9f_43517_ab699820c7299346de16bbbeed7dbaf2.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c90e00cf66340d15316855182c2cc0d8c31f5605_hu056c82ec8373415c127c83c85d995d9f_43517_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c90e00cf66340d15316855182c2cc0d8c31f5605_hu056c82ec8373415c127c83c85d995d9f_43517_e0a87b6b08b239f4a3a99e1a743e9d68.webp"
width="424"
height="280"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The trend component increases at the beginning of the series, where the
residuals are very small. After 07/2021, however, there is no clear pattern.&lt;/p>
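&lt;p>To make the trend, seasonal, and residual components more concrete, here is a simplified pure-Python sketch of a classical additive decomposition: a centered moving average estimates the trend, and the average detrended value at each position in the period estimates the seasonal component. It is an illustration only and not the statsmodels implementation; the function names and the toy series are assumptions.&lt;/p>

```python
def centered_moving_average(values, period):
    """Estimate the trend with a centered moving average of length `period`.

    For an even period the two end points get half weight so the window
    stays centered. Positions without a full window are returned as None.
    """
    half = period // 2
    if period % 2 == 0:
        weights = [0.5] + [1.0] * (period - 1) + [0.5]
    else:
        weights = [1.0] * period
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        trend[i] = sum(w * v for w, v in zip(weights, window)) / period
    return trend


def additive_decompose(values, period):
    """Split a series into trend + seasonal + residual (needs 2 * period points)."""
    trend = centered_moving_average(values, period)
    # Average the detrended values at each position within the period.
    buckets = [[] for _ in range(period)]
    for i, (v, t) in enumerate(zip(values, trend)):
        if t is not None:
            buckets[i % period].append(v - t)
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period  # centre the seasonal component on zero
    seasonal = [means[i % period] - centre for i in range(len(values))]
    resid = [None if t is None else v - t - s
             for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, resid


# Toy series: a linear trend plus a repeating pattern of period 4.
pattern = [1, -1, 2, -2]
series = [i + pattern[i % 4] for i in range(16)]
trend, seasonal, resid = additive_decompose(series, period=4)
print(trend[2:6], seasonal[:4], resid[2:6])
```

&lt;p>On this toy series the residuals are zero because the data are exactly trend plus seasonality. Large or structured residuals, as seen after 07/2021 in the plot above, indicate variation that the two components do not explain.&lt;/p>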
&lt;p>Account Created Date&lt;/p>
&lt;p>We apply a log transformation to the number of accounts created on each date.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">log&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">ln&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">created_count&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;user_created&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">flatMap&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">row&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[(&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)])&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">reduceByKey&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sortBy&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Split the list of (date, count) tuples into dates and counts&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">date_x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">date_y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">created_count&lt;/span> &lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">figure&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">figsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">date_x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ln&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">date_y&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Total Account Created by Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xlabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Number of Account&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c7894ce1d300247081e05cc2c6f18ccd3b8e39bd_hu95a3a15ea0f466dae101b67ffcfce779_37051_9e6a98779c95c1b98c43bdc565f7ea17.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c7894ce1d300247081e05cc2c6f18ccd3b8e39bd_hu95a3a15ea0f466dae101b67ffcfce779_37051_73e0e42fd7bec37483f839836745f7ee.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c7894ce1d300247081e05cc2c6f18ccd3b8e39bd_hu95a3a15ea0f466dae101b67ffcfce779_37051_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c7894ce1d300247081e05cc2c6f18ccd3b8e39bd_hu95a3a15ea0f466dae101b67ffcfce779_37051_9e6a98779c95c1b98c43bdc565f7ea17.webp"
width="760"
height="288"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">series&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">DataFrame&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">created_count&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">result&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">seasonal_decompose&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ln&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">series&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]),&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;additive&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">period&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">12&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">result&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/caf8c13e09fb4a1b4ee1bee2038162dab8cfc857_hu643f6741cacab607bb9d1976c217cc49_27153_6ed775f3f5647544b3b064142d495f05.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/caf8c13e09fb4a1b4ee1bee2038162dab8cfc857_hu643f6741cacab607bb9d1976c217cc49_27153_5c80ae9581e0771705f8fcd2e7c60123.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/caf8c13e09fb4a1b4ee1bee2038162dab8cfc857_hu643f6741cacab607bb9d1976c217cc49_27153_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/caf8c13e09fb4a1b4ee1bee2038162dab8cfc857_hu643f6741cacab607bb9d1976c217cc49_27153_6ed775f3f5647544b3b064142d495f05.webp"
width="424"
height="280"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>There is no significant evidence in the time series decomposition to
support the existence of a seasonal pattern in the samples.&lt;/p>
&lt;p>Number of friends (accounts followed) of each tweet user&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Get the count of number of friends for the accounts&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">checker&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="mi">50&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;1&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">elif&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="mi">100&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;2&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">elif&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="mi">200&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;3&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">elif&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">&amp;lt;&lt;/span> &lt;span class="mi">1000&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;4&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;5&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">friends&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;user_friends&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">flatMap&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">row&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[(&lt;/span>&lt;span class="n">row&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)])&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">checker&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reduceByKey&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Convert list of tuple into two lists&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">friends_x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">friends_y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">friends&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ax1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">colors&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;#ffbaba&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;#ff7b7b&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;#ff5252&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;#ff0000&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;#a70000&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pie&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">friends_y&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;&amp;lt; 50&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;50-100&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;100-200&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;200-1000&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;&amp;gt;1000&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span> \
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">autopct&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="si">%1.1f%%&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">colors&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">colors&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">startangle&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">pctdistance&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">0.85&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#draw circle&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">centre_circle&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Circle&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mf">0.70&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">fc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;white&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gcf&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gca&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">add_artist&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">centre_circle&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Equal aspect ratio ensures that pie is drawn as a circle&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;equal&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tight_layout&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Total number of followers&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/e53c7440324a8c00a58a427dff02706d597f55f0_hu15e3b7c3c8743bda42994a70afbc1ec7_15758_9b07a4a71270fef2c78458aa8a80993c.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/e53c7440324a8c00a58a427dff02706d597f55f0_hu15e3b7c3c8743bda42994a70afbc1ec7_15758_50f87f004a68104d4711344cc7272808.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/e53c7440324a8c00a58a427dff02706d597f55f0_hu15e3b7c3c8743bda42994a70afbc1ec7_15758_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/e53c7440324a8c00a58a427dff02706d597f55f0_hu15e3b7c3c8743bda42994a70afbc1ec7_15758_9b07a4a71270fef2c78458aa8a80993c.webp"
width="424"
height="295"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
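&lt;p>The bucketing step can also be written so that every bucket always appears, in a fixed order that matches the pie labels. The following is a minimal sketch in plain Python rather than Spark; the edges in &lt;code>EDGES&lt;/code> are an assumption read off the chart labels, and &lt;code>bucket&lt;/code> and &lt;code>bucket_counts&lt;/code> are hypothetical helper names that do not appear in the original notebook.&lt;/p>

```python
from bisect import bisect_right
from collections import Counter

# Assumed bucket edges, read off the pie-chart labels (not confirmed by the source).
EDGES = [50, 100, 200, 1000]
LABELS = ["under 50", "50-100", "100-200", "200-1000", "1000 and over"]


def bucket(count):
    """Return the index (0 to 4) of the bucket that `count` falls into."""
    # bisect_right places a value equal to an edge into the upper bucket,
    # so 1000 lands in the last bucket and 999 in the one before it.
    return bisect_right(EDGES, count)


def bucket_counts(values):
    """Count values per bucket, returning one entry per label in label order."""
    counts = Counter(bucket(v) for v in values)
    return [counts.get(i, 0) for i in range(len(LABELS))]


# Hypothetical sample of friend counts, for illustration only.
sample = [3, 49, 50, 150, 999, 1000, 7605]
print(list(zip(LABELS, bucket_counts(sample))))
```

&lt;p>Because &lt;code>bucket_counts&lt;/code> returns a zero for an empty bucket, the resulting list stays aligned with the labels even when a bucket has no users, and it does not depend on the order in which &lt;code>collect()&lt;/code> returns the keys.&lt;/p>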
&lt;h2 id="data-cleaning-for-text-classification">Data cleaning for text classification&lt;/h2>
&lt;p>The exploratory analysis showed that several variables could be converted
into a format better suited to machine learning. For example, the response
variable &amp;quot;hashtags&amp;quot; can be collapsed into five coin categories plus a
catch-all &amp;quot;other&amp;quot;, and the variable &amp;quot;source&amp;quot;, a categorical
variable with four levels, was converted into indicator variables.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql.functions&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">when&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql.functions&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">monotonically_increasing_id&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql.functions&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">udf&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql.functions&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">year&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">month&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.sql&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">functions&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">F&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Cleaning response variables&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_ml&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;hashtags&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;dog&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="s1">&amp;#39;Dogecoin&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;eth&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="s1">&amp;#39;Etherenum&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;bnb&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="s2">&amp;#34;binance&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;bin&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="s2">&amp;#34;binance&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;crypto&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="s1">&amp;#39;Cryptocurrency&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;btc&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="s1">&amp;#39;Bitcoin&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">otherwise&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;other&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Assigning unique_id to each row&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">unique_id&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">monotonically_increasing_id&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_ml&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;*&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">unique_id&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Cleaning locations&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">pycountry&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">get_country&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">country&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">pycountry&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">countries&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">country&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">name&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">country&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">name&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="s2">&amp;#34;others&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">get_countryudf&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">udf&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">z&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">get_country&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">z&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Creating indicator variables for categorical variables data&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_ml&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">na&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;user_location&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">get_countryudf&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;user_location&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;user_verified&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">user_verified&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;True&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">otherwise&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">))&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;source_Iphone&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">source&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;iPhone&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">otherwise&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">))&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;source_Web&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">source&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Web&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">otherwise&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">))&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;source_Android&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">when&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">source&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Android&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">otherwise&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">))&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;post_year&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">year&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">date&lt;/span>&lt;span class="p">))&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;post_month&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">month&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">date&lt;/span>&lt;span class="p">))&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;created_year&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">year&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">user_created&lt;/span>&lt;span class="p">))&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;created_month&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">month&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">user_created&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;is_retweet&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;user_name&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;user_created&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;date&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s2">&amp;#34;source&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># better visual&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">limit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">toPandas&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div> &lt;div id="df-eca29686-df36-4fbd-afd5-c88e322b5724">
&lt;div class="colab-df-container">
&lt;div>
&lt;style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
&lt;/style>
&lt;table border="1" class="dataframe">
&lt;thead>
&lt;tr style="text-align: right;">
&lt;th>&lt;/th>
&lt;th>user_location&lt;/th>
&lt;th>user_description&lt;/th>
&lt;th>user_followers&lt;/th>
&lt;th>user_friends&lt;/th>
&lt;th>user_favourites&lt;/th>
&lt;th>user_verified&lt;/th>
&lt;th>text&lt;/th>
&lt;th>hashtags&lt;/th>
&lt;th>id&lt;/th>
&lt;th>source_Iphone&lt;/th>
&lt;th>source_Web&lt;/th>
&lt;th>source_Android&lt;/th>
&lt;th>post_year&lt;/th>
&lt;th>post_month&lt;/th>
&lt;th>created_year&lt;/th>
&lt;th>created_month&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;th>0&lt;/th>
&lt;td>others&lt;/td>
&lt;td>Biz Consultant, real estate, fintech, startups...&lt;/td>
&lt;td>8534&lt;/td>
&lt;td>7605&lt;/td>
&lt;td>4838&lt;/td>
&lt;td>0&lt;/td>
&lt;td>Blue Ridge Bank shares halted by NYSE after #b...&lt;/td>
&lt;td>other&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2009&lt;/td>
&lt;td>4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>1&lt;/th>
&lt;td>others&lt;/td>
&lt;td>Biz Consultant, real estate, fintech, startups...&lt;/td>
&lt;td>8534&lt;/td>
&lt;td>7605&lt;/td>
&lt;td>4838&lt;/td>
&lt;td>0&lt;/td>
&lt;td>.@Tesla’s #bitcoin investment is revolutionary...&lt;/td>
&lt;td>Cryptocurrency&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2009&lt;/td>
&lt;td>4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>2&lt;/th>
&lt;td>others&lt;/td>
&lt;td>Persistent. to the extreme... #FREEPALESTINE #...&lt;/td>
&lt;td>1159&lt;/td>
&lt;td>2185&lt;/td>
&lt;td>30852&lt;/td>
&lt;td>0&lt;/td>
&lt;td>Annnd #btc #Bitcoin is headed even higher now....&lt;/td>
&lt;td>Bitcoin&lt;/td>
&lt;td>2&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2009&lt;/td>
&lt;td>1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>3&lt;/th>
&lt;td>others&lt;/td>
&lt;td>#Bitcoin&lt;/td>
&lt;td>4&lt;/td>
&lt;td>32&lt;/td>
&lt;td>139&lt;/td>
&lt;td>0&lt;/td>
&lt;td>Buy #Bitcoin with 5% LIFETIME cashback on fees...&lt;/td>
&lt;td>Cryptocurrency&lt;/td>
&lt;td>3&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2010&lt;/td>
&lt;td>7&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>4&lt;/th>
&lt;td>others&lt;/td>
&lt;td>Biz Consultant, real estate, fintech, startups...&lt;/td>
&lt;td>8534&lt;/td>
&lt;td>7605&lt;/td>
&lt;td>4838&lt;/td>
&lt;td>0&lt;/td>
&lt;td>#Bitcoin institutional demand accelerates in 2...&lt;/td>
&lt;td>Cryptocurrency&lt;/td>
&lt;td>4&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2009&lt;/td>
&lt;td>4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>5&lt;/th>
&lt;td>others&lt;/td>
&lt;td>CEO &amp;amp; PRESIDENT SG GROUP&lt;/td>
&lt;td>62&lt;/td>
&lt;td>288&lt;/td>
&lt;td>2656&lt;/td>
&lt;td>0&lt;/td>
&lt;td>#Bitcoin #BTC #ADA #DOT Mastercard Will Let Me...&lt;/td>
&lt;td>other&lt;/td>
&lt;td>5&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2009&lt;/td>
&lt;td>6&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>6&lt;/th>
&lt;td>others&lt;/td>
&lt;td>Biz Consultant, real estate, fintech, startups...&lt;/td>
&lt;td>8534&lt;/td>
&lt;td>7605&lt;/td>
&lt;td>4838&lt;/td>
&lt;td>0&lt;/td>
&lt;td>After @Tesla: @Twitter considers adding #bitco...&lt;/td>
&lt;td>other&lt;/td>
&lt;td>6&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2009&lt;/td>
&lt;td>4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>7&lt;/th>
&lt;td>Portugal&lt;/td>
&lt;td>#bitcoin Entrepreneur, Master in Communication...&lt;/td>
&lt;td>872&lt;/td>
&lt;td>158&lt;/td>
&lt;td>1080&lt;/td>
&lt;td>0&lt;/td>
&lt;td>#BTC/USD 4H. #Bitcoin consolidating between su...&lt;/td>
&lt;td>other&lt;/td>
&lt;td>7&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2020&lt;/td>
&lt;td>9&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>8&lt;/th>
&lt;td>others&lt;/td>
&lt;td>Biz Consultant, real estate, fintech, startups...&lt;/td>
&lt;td>8534&lt;/td>
&lt;td>7605&lt;/td>
&lt;td>4838&lt;/td>
&lt;td>0&lt;/td>
&lt;td>The @Grayscale #Bitcoin Trust: What it is and ...&lt;/td>
&lt;td>other&lt;/td>
&lt;td>8&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2009&lt;/td>
&lt;td>4&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>9&lt;/th>
&lt;td>others&lt;/td>
&lt;td>One bet every day. Join our team and become pa...&lt;/td>
&lt;td>2019&lt;/td>
&lt;td>104&lt;/td>
&lt;td>71&lt;/td>
&lt;td>0&lt;/td>
&lt;td>We accept #Bitcoin, #BitcoinCash #Litecoin and...&lt;/td>
&lt;td>Dogecoin&lt;/td>
&lt;td>9&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>2021&lt;/td>
&lt;td>2&lt;/td>
&lt;td>2014&lt;/td>
&lt;td>12&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;/div>
&lt;button class="colab-df-convert" onclick="convertToInteractive('df-eca29686-df36-4fbd-afd5-c88e322b5724')"
title="Convert this dataframe to an interactive table."
style="display:none;">
&lt;p>&amp;lt;svg xmlns=&amp;ldquo;&lt;a href="http://www.w3.org/2000/svg%22" target="_blank" rel="noopener">http://www.w3.org/2000/svg"&lt;/a> height=&amp;ldquo;24px&amp;quot;viewBox=&amp;ldquo;0 0 24 24&amp;rdquo;
width=&amp;ldquo;24px&amp;rdquo;&amp;gt;
&lt;path d="M0 0h24v24H0V0z" fill="none"/>
&lt;path d="M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z"/>&lt;path d="M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z"/>
&lt;/svg>
&lt;/button>&lt;/p>
&lt;style>
.colab-df-container {
display:flex;
flex-wrap:wrap;
gap: 12px;
}
.colab-df-convert {
background-color: #E8F0FE;
border: none;
border-radius: 50%;
cursor: pointer;
display: none;
fill: #1967D2;
height: 32px;
padding: 0 0 0 0;
width: 32px;
}
.colab-df-convert:hover {
background-color: #E2EBFA;
box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);
fill: #174EA6;
}
[theme=dark] .colab-df-convert {
background-color: #3B4455;
fill: #D2E3FC;
}
[theme=dark] .colab-df-convert:hover {
background-color: #434B5C;
box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);
filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));
fill: #FFFFFF;
}
&lt;/style>
&lt;script>
const buttonEl =
document.querySelector('#df-eca29686-df36-4fbd-afd5-c88e322b5724 button.colab-df-convert');
buttonEl.style.display =
google.colab.kernel.accessAllowed ? 'block' : 'none';
async function convertToInteractive(key) {
const element = document.querySelector('#df-eca29686-df36-4fbd-afd5-c88e322b5724');
const dataTable =
await google.colab.kernel.invokeFunction('convertToInteractive',
[key], {});
if (!dataTable) return;
const docLinkHtml = 'Like what you see? Visit the ' +
'&lt;a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook&lt;/a>'
+ ' to learn more about interactive tables.';
element.innerHTML = '';
dataTable['output_type'] = 'display_data';
await google.colab.output.renderOutput(dataTable, element);
const docLink = document.createElement('div');
docLink.innerHTML = docLinkHtml;
element.appendChild(docLink);
}
&lt;/script>
&lt;/div>
&lt;/div>
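&lt;p>The hashtag cleaning above is a first-match-wins rule chain: the order of the &lt;code>when&lt;/code> clauses decides the label whenever a tweet carries several coin hashtags. The sketch below restates that logic in plain Python so the behaviour is easy to check. It is an illustration under assumptions: the names &lt;code>RULES&lt;/code>, &lt;code>categorize&lt;/code> and &lt;code>source_indicators&lt;/code> are hypothetical, the label spellings are illustrative, the sample strings are made up, and the lower-casing step is an addition that the original chain does not perform.&lt;/p>

```python
# Ordered (substring, label) rules mirroring the when(...) chain above.
# Earlier rules take priority over later ones.
RULES = [
    ("dog", "Dogecoin"),
    ("eth", "Ethereum"),
    ("bnb", "Binance"),
    ("bin", "Binance"),
    ("crypto", "Cryptocurrency"),
    ("btc", "Bitcoin"),
]


def categorize(hashtags, rules=RULES, default="other"):
    """Return the label of the first rule whose substring occurs in `hashtags`."""
    # Lower-casing is an assumption added for this sketch; without it,
    # "BTC" would not match the "btc" rule.
    text = (hashtags or "").lower()
    for needle, label in rules:
        if needle in text:
            return label
    return default


def source_indicators(source, levels=("iPhone", "Web", "Android")):
    """Return one 0/1 flag per level; a source matching none is the all-zero baseline."""
    return {"source_" + level: int(level in (source or "")) for level in levels}


# Hypothetical inputs, for illustration only.
print(categorize("['Bitcoin', 'BTC', 'ADA']"))
print(categorize("['crypto', 'dogecoin']"))
print(source_indicators("Twitter Web App"))
```

&lt;p>With three indicator columns, the fourth source level is represented implicitly by all three flags being zero, which avoids a redundant column in the model matrix.&lt;/p>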
&lt;h2 id="nature-language-processing-on-user-descriptions-and-tweets-tokens">Nature Language Processing on user descriptions and tweets: Tokens&lt;/h2>
&lt;p>The user descriptions and tweets are both natural human language and
share several characteristics: long sentences, emojis, and unwanted
symbols.&lt;/p>
&lt;p>To analyze these two variables, we first convert each text into a bag of
words: we lowercase and stem the tokens and remove stopwords.&lt;/p>
&lt;p>We then calculate the frequency of each word and select the 20 most
frequent words for inclusion in our text classification model.&lt;/p>
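&lt;p>As a rough illustration of the steps just described, the following plain-Python sketch tokenizes, filters and counts words without Spark. It is not the notebook's implementation: &lt;code>tokenize&lt;/code> and &lt;code>top_words&lt;/code> are hypothetical helpers, the stopword set is a tiny stand-in for the file-based list, and stemming is left as a pluggable &lt;code>stem&lt;/code> argument (for example, the &lt;code>stem&lt;/code> method of NLTK's &lt;code>PorterStemmer&lt;/code> could be passed in).&lt;/p>

```python
import re
from collections import Counter

# A tiny illustrative stopword set; the article loads its own list from a file.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in", "on"}


def tokenize(text):
    """Lower-case the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_words(texts, n=20, stopwords=STOPWORDS, stem=lambda word: word):
    """Return the n most frequent stemmed, non-stopword tokens with their counts."""
    counts = Counter()
    for text in texts:
        for token in tokenize(text):
            if token not in stopwords:
                counts[stem(token)] += 1
    return counts.most_common(n)


# Hypothetical documents, for illustration only.
docs = [
    "Bitcoin is headed higher",
    "Buy #Bitcoin and hold bitcoin",
    "The price of BTC",
]
print(top_words(docs, n=2))
```

&lt;p>Stopwords are filtered before stemming here, so the stopword list should contain unstemmed forms; if the list were already stemmed, the two steps would need to be swapped.&lt;/p>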
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">CryptoTweets.simple_tokenize&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">simple_tokenize&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">nltk.stem&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">PorterStemmer&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">re&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Top 20 words&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">n&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">20&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Take the text&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets_text&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;text&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#Take the user description&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweets_ud&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;user_description&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Stemming using Porter Stemmer&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">st&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">PorterStemmer&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">#stop words&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;CryptoTweets/CommonEnglishWord.txt&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lines&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">readlines&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">lower&lt;/span>&lt;span class="p">(),&lt;/span>&lt;span class="n">lines&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;-&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;it&amp;#39;s&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;going&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;it’s&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;via&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;|&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;&amp;amp;&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;/&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;•&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lst&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;http&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Remove emojis, since handling them is beyond the scope of this project&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">deEmojify&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">text&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">regrex_pattern&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">compile&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">pattern&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;[&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">u&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="se">\U0001F600&lt;/span>&lt;span class="s2">-&lt;/span>&lt;span class="se">\U0001F64F&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="c1"># emoticons&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">u&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="se">\U0001F300&lt;/span>&lt;span class="s2">-&lt;/span>&lt;span class="se">\U0001F5FF&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="c1"># symbols &amp;amp; pictographs&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">u&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="se">\U0001F680&lt;/span>&lt;span class="s2">-&lt;/span>&lt;span class="se">\U0001F6FF&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="c1"># transport &amp;amp; map symbols&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="sa">u&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="se">\U0001F1E0&lt;/span>&lt;span class="s2">-&lt;/span>&lt;span class="se">\U0001F1FF&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="c1"># flags (iOS)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;]+&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">flags&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">UNICODE&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">regrex_pattern&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">r&lt;/span>&lt;span class="s1">&amp;#39;&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">text&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Bag of words, stemming, lowercase, stop words&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">rddtext&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets_text&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">flatMap&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">simple_tokenize&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">deEmojify&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])))&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">st&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stem&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">lst&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">&amp;gt;&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">lower&lt;/span>&lt;span class="p">(),&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reduceByKey&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sortBy&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="n">ascending&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cache&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then we create one variable for each of the top 20 words. Each value
records whether the word occurs in the current text (1) or not (0), a binary
form of term frequency. The following table shows the resulting variables
for the first five samples.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Calculate the frequency and return the result as tuple&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">calcfreq&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">wordtup&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">t&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">wordlist&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">tweet&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">t&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">lower&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">wordtup&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">wordlist&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">list&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">wordlist&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">i&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tweet&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">i&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+=&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">result&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">t&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">wordlist&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">result&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="nb">tuple&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">result&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">reinit_list&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">rddtext&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">take&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">most_frequent_tweet&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">rddtext&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">take&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">words&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freq&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">most_frequent_tweet&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">words&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">words&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">words&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">insert&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;text&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">reinit_rdd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets_text&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">reinit_list&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">calc&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reinit_rdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">calcfreq&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">table_tweet&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">calc&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">toDF&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">words&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">table_tweet&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">limit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">toPandas&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div> &lt;div id="df-7da8d144-8147-406a-bb19-8c320269f662">
&lt;div class="colab-df-container">
&lt;div>
&lt;style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
&lt;/style>
&lt;table border="1" class="dataframe">
&lt;thead>
&lt;tr style="text-align: right;">
&lt;th>&lt;/th>
&lt;th>text&lt;/th>
&lt;th>bitcoin&lt;/th>
&lt;th>co&lt;/th>
&lt;th>btc&lt;/th>
&lt;th>crypto&lt;/th>
&lt;th>thi&lt;/th>
&lt;th>cryptocurr&lt;/th>
&lt;th>eth&lt;/th>
&lt;th>ethereum&lt;/th>
&lt;th>price&lt;/th>
&lt;th>...&lt;/th>
&lt;th>binanc&lt;/th>
&lt;th>blockchain&lt;/th>
&lt;th>dogecoin&lt;/th>
&lt;th>ha&lt;/th>
&lt;th>gift&lt;/th>
&lt;th>amp&lt;/th>
&lt;th>wa&lt;/th>
&lt;th>invest&lt;/th>
&lt;th>altcoin&lt;/th>
&lt;th>doge&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;th>0&lt;/th>
&lt;td>Blue Ridge Bank shares halted by NYSE after #b...&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>1&lt;/th>
&lt;td>.@Tesla’s #bitcoin investment is revolutionary...&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>2&lt;/th>
&lt;td>Annnd #btc #Bitcoin is headed even higher now....&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>3&lt;/th>
&lt;td>Buy #Bitcoin with 5% LIFETIME cashback on fees...&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>4&lt;/th>
&lt;td>#Bitcoin institutional demand accelerates in 2...&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>5 rows × 21 columns&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div>
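&lt;p>The feature construction above can be sketched in plain Python as follows. The function name indicator_row, the three-word vocabulary, and the sample text are assumptions for illustration. Like the code above, the sketch uses a substring test, so a short word such as co also matches inside bitcoin; matching against whole tokens instead would avoid that.&lt;/p>

```python
def indicator_row(text, vocabulary):
    """Return (text, flag_1, ..., flag_k); a flag is 1 if the word occurs in text."""
    lowered = text.lower()
    # Substring test: the word may match inside a longer token.
    return (text,) + tuple(int(word in lowered) for word in vocabulary)


# Hypothetical vocabulary; the post uses the top 20 stemmed words.
vocab = ["bitcoin", "btc", "crypto"]
row = indicator_row("Annnd #btc #Bitcoin is headed even higher now", vocab)
print(row)
```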
&lt;p>The following plot shows the frequency distribution of the top 20 words.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">x_t&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_t&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">most_frequent_tweet&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bar&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">init_opts&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">opts&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">InitOpts&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">add_xaxis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_t&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">add_yaxis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Frequency&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_t&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">set_global_opts&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">title_opts&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">opts&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TitleOpts&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Top 20 words in the Tweets&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bar&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">render_notebook&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;script>
require.config({
paths: {
'echarts':'https://assets.pyecharts.org/assets/echarts.min'
}
});
&lt;/script>
&lt;div id="0c81b88fcab84b0ebc1683a83d5ca866" style="width:900px; height:500px;">&lt;/div>
&lt;script>
require(['echarts'], function(echarts) {
var chart_0c81b88fcab84b0ebc1683a83d5ca866 = echarts.init(
document.getElementById('0c81b88fcab84b0ebc1683a83d5ca866'), 'white', {renderer: 'canvas'});
var option_0c81b88fcab84b0ebc1683a83d5ca866 = {
"animation": true,
"animationThreshold": 2000,
"animationDuration": 1000,
"animationEasing": "cubicOut",
"animationDelay": 0,
"animationDurationUpdate": 300,
"animationEasingUpdate": "cubicOut",
"animationDelayUpdate": 0,
"color": [
"#c23531",
"#2f4554",
"#61a0a8",
"#d48265",
"#749f83",
"#ca8622",
"#bda29a",
"#6e7074",
"#546570",
"#c4ccd3",
"#f05b72",
"#ef5b9c",
"#f47920",
"#905a3d",
"#fab27b",
"#2a5caa",
"#444693",
"#726930",
"#b2d235",
"#6d8346",
"#ac6767",
"#1d953f",
"#6950a1",
"#918597"
],
"series": [
{
"type": "bar",
"name": "Frequency",
"legendHoverLink": true,
"data": [
442749,
326597,
228976,
104846,
71983,
68579,
44451,
36081,
33958,
28528,
26239,
25499,
22375,
20703,
19922,
18472,
18026,
17181,
16820,
16101
],
"showBackground": false,
"barMinHeight": 0,
"barCategoryGap": "20%",
"barGap": "30%",
"large": false,
"largeThreshold": 400,
"seriesLayoutBy": "column",
"datasetIndex": 0,
"clip": true,
"zlevel": 0,
"z": 2,
"label": {
"show": true,
"position": "top",
"margin": 8
}
}
],
"legend": [
{
"data": [
"Frequency"
],
"selected": {
"Frequency": true
},
"show": true,
"padding": 5,
"itemGap": 10,
"itemWidth": 25,
"itemHeight": 14
}
],
"tooltip": {
"show": true,
"trigger": "item",
"triggerOn": "mousemove|click",
"axisPointer": {
"type": "line"
},
"showContent": true,
"alwaysShowContent": false,
"showDelay": 0,
"hideDelay": 100,
"textStyle": {
"fontSize": 14
},
"borderWidth": 0,
"padding": 5
},
"xAxis": [
{
"show": true,
"scale": false,
"nameLocation": "end",
"nameGap": 15,
"gridIndex": 0,
"inverse": false,
"offset": 0,
"splitNumber": 5,
"minInterval": 0,
"splitLine": {
"show": false,
"lineStyle": {
"show": true,
"width": 1,
"opacity": 1,
"curveness": 0,
"type": "solid"
}
},
"data": [
"bitcoin",
"co",
"btc",
"crypto",
"thi",
"cryptocurr",
"eth",
"ethereum",
"price",
"nft",
"binanc",
"blockchain",
"dogecoin",
"ha",
"gift",
"amp",
"wa",
"invest",
"altcoin",
"doge"
]
}
],
"yAxis": [
{
"show": true,
"scale": false,
"nameLocation": "end",
"nameGap": 15,
"gridIndex": 0,
"inverse": false,
"offset": 0,
"splitNumber": 5,
"minInterval": 0,
"splitLine": {
"show": false,
"lineStyle": {
"show": true,
"width": 1,
"opacity": 1,
"curveness": 0,
"type": "solid"
}
}
}
],
"title": [
{
"text": "Top 20 words in the Tweets",
"padding": 5,
"itemGap": 10
}
]
};
chart_0c81b88fcab84b0ebc1683a83d5ca866.setOption(option_0c81b88fcab84b0ebc1683a83d5ca866);
});
&lt;/script>
&lt;p>A non-interactive version of the plot, in case the interactive plot above fails to load:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rotation&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">horizontalalignment&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;right&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontweight&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;light&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;x-large&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_t&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y_t&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c748810ef034a8c12add55bada408376f3519e87_hu2cd4b369e7c5a226b59437bca1e25544_16671_1dbc431e2d8d998104343bb0faf1a104.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c748810ef034a8c12add55bada408376f3519e87_hu2cd4b369e7c5a226b59437bca1e25544_16671_ad4d72d77e5e0076c9f3596deb40314d.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c748810ef034a8c12add55bada408376f3519e87_hu2cd4b369e7c5a226b59437bca1e25544_16671_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/c748810ef034a8c12add55bada408376f3519e87_hu2cd4b369e7c5a226b59437bca1e25544_16671_1dbc431e2d8d998104343bb0faf1a104.webp"
width="393"
height="303"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">table_tweet&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">table_tweet&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;*&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;idtweet&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">unique_id&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_ml&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">table_tweet&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">id&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">table_tweet&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">idtweet&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;inner&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;text&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;_1&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;idtweet&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The same preparation and feature creation is applied to the &amp;quot;user
description&amp;quot; variable.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">udrdd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets_ud&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">flatMap&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">simple_tokenize&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">deEmojify&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])))&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">st&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stem&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">lst&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">&amp;gt;&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">lower&lt;/span>&lt;span class="p">(),&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reduceByKey&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sortBy&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="n">ascending&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cache&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">reinit_udlist&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">udrdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">take&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">most_frequent_ud&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">udrdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">take&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">n&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">words&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freq&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">most_frequent_ud&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">words&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">words&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">words&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">insert&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;text&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">reinit_udrdd&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweets_ud&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">reinit_udlist&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">calcud&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reinit_udrdd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">calcfreq&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">table_ud&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">calcud&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">toDF&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">words&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">table_ud&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">table_ud&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;*&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">withColumn&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;idud&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">unique_id&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">table_ud&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">table_ud&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">F&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">col&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">c&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">alias&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;ud&amp;#34;&lt;/span>&lt;span class="o">+&lt;/span>&lt;span class="n">c&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">c&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">table_ud&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">columns&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_ml&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">table_ud&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">id&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">table_ud&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">udidud&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;inner&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;udtext&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;ud_1&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;udidud&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cache&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">x_t&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_t&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">most_frequent_ud&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bar&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">init_opts&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">opts&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">InitOpts&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">add_xaxis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_t&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">add_yaxis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Frequency&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">y_t&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">set_global_opts&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">title_opts&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">opts&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TitleOpts&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Top 20 words in the User Descriptions&amp;#34;&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">bar&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">render_notebook&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;script>
require.config({
paths: {
'echarts':'https://assets.pyecharts.org/assets/echarts.min'
}
});
&lt;/script>
&lt;div id="381439a01c6c488787d31b7bed284317" style="width:900px; height:500px;">&lt;/div>
&lt;script>
require(['echarts'], function(echarts) {
var chart_381439a01c6c488787d31b7bed284317 = echarts.init(
document.getElementById('381439a01c6c488787d31b7bed284317'), 'white', {renderer: 'canvas'});
var option_381439a01c6c488787d31b7bed284317 = {
"animation": true,
"animationThreshold": 2000,
"animationDuration": 1000,
"animationEasing": "cubicOut",
"animationDelay": 0,
"animationDurationUpdate": 300,
"animationEasingUpdate": "cubicOut",
"animationDelayUpdate": 0,
"color": [
"#c23531",
"#2f4554",
"#61a0a8",
"#d48265",
"#749f83",
"#ca8622",
"#bda29a",
"#6e7074",
"#546570",
"#c4ccd3",
"#f05b72",
"#ef5b9c",
"#f47920",
"#905a3d",
"#fab27b",
"#2a5caa",
"#444693",
"#726930",
"#b2d235",
"#6d8346",
"#ac6767",
"#1d953f",
"#6950a1",
"#918597"
],
"series": [
{
"type": "bar",
"name": "Frequency",
"legendHoverLink": true,
"data": [
210136,
129333,
78451,
59091,
58586,
50059,
43675,
42968,
34820,
32164,
31355,
27193,
24923,
24729,
21688,
21052,
20221,
19947,
18904,
17492
],
"showBackground": false,
"barMinHeight": 0,
"barCategoryGap": "20%",
"barGap": "30%",
"large": false,
"largeThreshold": 400,
"seriesLayoutBy": "column",
"datasetIndex": 0,
"clip": true,
"zlevel": 0,
"z": 2,
"label": {
"show": true,
"position": "top",
"margin": 8
}
}
],
"legend": [
{
"data": [
"Frequency"
],
"selected": {
"Frequency": true
},
"show": true,
"padding": 5,
"itemGap": 10,
"itemWidth": 25,
"itemHeight": 14
}
],
"tooltip": {
"show": true,
"trigger": "item",
"triggerOn": "mousemove|click",
"axisPointer": {
"type": "line"
},
"showContent": true,
"alwaysShowContent": false,
"showDelay": 0,
"hideDelay": 100,
"textStyle": {
"fontSize": 14
},
"borderWidth": 0,
"padding": 5
},
"xAxis": [
{
"show": true,
"scale": false,
"nameLocation": "end",
"nameGap": 15,
"gridIndex": 0,
"inverse": false,
"offset": 0,
"splitNumber": 5,
"minInterval": 0,
"splitLine": {
"show": false,
"lineStyle": {
"show": true,
"width": 1,
"opacity": 1,
"curveness": 0,
"type": "solid"
}
},
"data": [
"bitcoin",
"crypto",
"co",
"btc",
"cryptocurr",
"blockchain",
"financi",
"news",
"eth",
"advic",
"investor",
"trader",
"tweet",
"nft",
"busi",
"ethereum",
"enthusiast",
"invest",
"latest",
"doge"
]
}
],
"yAxis": [
{
"show": true,
"scale": false,
"nameLocation": "end",
"nameGap": 15,
"gridIndex": 0,
"inverse": false,
"offset": 0,
"splitNumber": 5,
"minInterval": 0,
"splitLine": {
"show": false,
"lineStyle": {
"show": true,
"width": 1,
"opacity": 1,
"curveness": 0,
"type": "solid"
}
}
}
],
"title": [
{
"text": "Top 20 words in the User Descriptions",
"padding": 5,
"itemGap": 10
}
]
};
chart_381439a01c6c488787d31b7bed284317.setOption(option_381439a01c6c488787d31b7bed284317);
});
&lt;/script>
&lt;p>A non-interactive version of the plot, in case the interactive plot above fails to load:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rotation&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">horizontalalignment&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;right&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontweight&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;light&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;x-large&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_t&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y_t&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/7df2031c54842231304ce61485050ccf71110a18_hu88bd45430a230496a55a71c59c02b3cc_19109_404334e93ca638a82043fe1e2cb36984.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/7df2031c54842231304ce61485050ccf71110a18_hu88bd45430a230496a55a71c59c02b3cc_19109_dc84e51c5f177cbed76c4b050e751afe.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/7df2031c54842231304ce61485050ccf71110a18_hu88bd45430a230496a55a71c59c02b3cc_19109_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/7df2031c54842231304ce61485050ccf71110a18_hu88bd45430a230496a55a71c59c02b3cc_19109_404334e93ca638a82043fe1e2cb36984.webp"
width="394"
height="303"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Here are the first five samples of the dataset after applying natural
language processing to tweets and user descriptions.&lt;/p>
&lt;p>We could also use TF-IDF to vectorize the top 20 words in each
sample. Compared to raw frequency, TF-IDF has the advantage of assigning a
larger weight to words that appear in fewer documents. This is left as a
future improvement.&lt;/p>
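&lt;p>As a rough illustration of the TF-IDF weighting described above, here is a minimal sketch in plain Python. It is not part of the PySpark pipeline used in this project: the function name tf_idf, the toy token lists, and the smoothed IDF formula (one common variant) are all assumptions made for illustration.&lt;/p>

```python
import math
from collections import Counter


def tf_idf(docs, vocabulary):
    """Return one {word: weight} dict per document, restricted to `vocabulary`.

    Hypothetical helper for illustration only; not used elsewhere in this post.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each vocabulary word.
    doc_freq = Counter()
    for tokens in docs:
        for word in set(tokens):
            if word in vocabulary:
                doc_freq[word] += 1

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        row = {}
        for word in vocabulary:
            tf = counts[word] / total
            # Smoothed IDF: words found in fewer documents get a larger weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            row[word] = tf * idf
        weights.append(row)
    return weights


# Toy, made-up token lists standing in for stemmed tweets.
docs = [
    ["bitcoin", "price", "bitcoin"],
    ["bitcoin", "nft"],
    ["doge", "price"],
]
vocab = ["bitcoin", "price", "nft", "doge"]
for row in tf_idf(docs, vocab):
    print({w: round(v, 3) for w, v in row.items()})
```

&lt;p>In the second toy document, "nft" and "bitcoin" have the same term frequency, but "nft" receives the larger weight because it occurs in fewer documents. A raw frequency count cannot make that distinction.&lt;/p>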
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_ml&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;user_description&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">limit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">toPandas&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div> &lt;div id="df-44c66e86-0d6f-4729-9cc9-b9d507066ebe">
&lt;div class="colab-df-container">
&lt;div>
&lt;style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
&lt;/style>
&lt;table border="1" class="dataframe">
&lt;thead>
&lt;tr style="text-align: right;">
&lt;th>&lt;/th>
&lt;th>user_location&lt;/th>
&lt;th>user_followers&lt;/th>
&lt;th>user_friends&lt;/th>
&lt;th>user_favourites&lt;/th>
&lt;th>user_verified&lt;/th>
&lt;th>hashtags&lt;/th>
&lt;th>id&lt;/th>
&lt;th>source_Iphone&lt;/th>
&lt;th>source_Web&lt;/th>
&lt;th>source_Android&lt;/th>
&lt;th>...&lt;/th>
&lt;th>udinvestor&lt;/th>
&lt;th>udtrader&lt;/th>
&lt;th>udtweet&lt;/th>
&lt;th>udnft&lt;/th>
&lt;th>udbusi&lt;/th>
&lt;th>udethereum&lt;/th>
&lt;th>udenthusiast&lt;/th>
&lt;th>udinvest&lt;/th>
&lt;th>udlatest&lt;/th>
&lt;th>uddoge&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;th>0&lt;/th>
&lt;td>others&lt;/td>
&lt;td>8&lt;/td>
&lt;td>0&lt;/td>
&lt;td>49&lt;/td>
&lt;td>0&lt;/td>
&lt;td>Dogecoin&lt;/td>
&lt;td>26&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>1&lt;/th>
&lt;td>others&lt;/td>
&lt;td>94&lt;/td>
&lt;td>189&lt;/td>
&lt;td>753&lt;/td>
&lt;td>0&lt;/td>
&lt;td>other&lt;/td>
&lt;td>29&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>2&lt;/th>
&lt;td>others&lt;/td>
&lt;td>5366&lt;/td>
&lt;td>927&lt;/td>
&lt;td>34484&lt;/td>
&lt;td>0&lt;/td>
&lt;td>other&lt;/td>
&lt;td>474&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>1&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>3&lt;/th>
&lt;td>United States&lt;/td>
&lt;td>68&lt;/td>
&lt;td>84&lt;/td>
&lt;td>427&lt;/td>
&lt;td>0&lt;/td>
&lt;td>other&lt;/td>
&lt;td>964&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>4&lt;/th>
&lt;td>others&lt;/td>
&lt;td>275&lt;/td>
&lt;td>789&lt;/td>
&lt;td>3654&lt;/td>
&lt;td>0&lt;/td>
&lt;td>other&lt;/td>
&lt;td>1677&lt;/td>
&lt;td>1&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>...&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;td>0&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>5 rows × 54 columns&lt;/p>
&lt;/div>
&lt;/div>
&lt;/div>
&lt;h2 id="training-testing-and-validation-dataset">Training, Testing, and Validation Dataset&lt;/h2>
&lt;p>Not every sample belongs to one of the five response variables (such
samples have hashtags = 'other' in the table above). We use these unclassified
samples as our testing dataset to demonstrate the output of the
trained model in later sections.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_train&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="o">~&lt;/span>&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;other&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cache&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">tweet_test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_ml&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contains&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;other&amp;#39;&lt;/span>&lt;span class="p">))&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cache&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here is the final distribution of the response variables in the training
set used for text classification.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">hashtag_ml&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="n">hashtags&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_train&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">collect&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Cleaned Hashtag&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x_hash&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">y_hash&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">hashtag_ml&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x_hash&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y_hash&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rotation&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">horizontalalignment&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;right&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontweight&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;light&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;x-large&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">bar&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">y_hash&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/cd530fb69e47cfd2bbfccbcc8d24162504c14a1e_huf8f8d51d72df1981ba798010a38dea62_10018_a841ce21f9aeda62b0f89b5456ce821a.webp 400w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/cd530fb69e47cfd2bbfccbcc8d24162504c14a1e_huf8f8d51d72df1981ba798010a38dea62_10018_dd52401211238b97b092c62280b1a14b.webp 760w,
/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/cd530fb69e47cfd2bbfccbcc8d24162504c14a1e_huf8f8d51d72df1981ba798010a38dea62_10018_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://joeliang0520.github.io/project/fiancial/vertopal_cd4f1a3a239a4680984e7af5e860469f/cd530fb69e47cfd2bbfccbcc8d24162504c14a1e_huf8f8d51d72df1981ba798010a38dea62_10018_a841ce21f9aeda62b0f89b5456ce821a.webp"
width="387"
height="304"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We then split the training dataset into training data and validation data
(80% and 20%) to evaluate the performance of the trained models.&lt;/p>
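&lt;p>As a minimal sketch of what a weighted random split does, the pure-Python snippet below assigns each row independently to one of two sets with an 80/20 chance. The helper name random_split and the toy row list are illustrative assumptions, not the Spark implementation; the point is that the resulting sizes are only approximately 80% and 20%.&lt;/p>
&lt;pre>&lt;code class="language-python">import random

def random_split(rows, train_weight=0.8, seed=2022):
    """Assign each row independently to train or validation."""
    rng = random.Random(seed)
    train, validation = [], []
    for row in rows:
        # Each row lands in validation with probability 1 - train_weight.
        if rng.random() >= train_weight:
            validation.append(row)
        else:
            train.append(row)
    return train, validation

rows = list(range(1000))  # toy stand-in for the tweets
train_rows, validation_rows = random_split(rows)
print(len(train_rows), len(validation_rows))
&lt;/code>&lt;/pre>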
&lt;h4 id="multinomial-regression-model">Multinomial Regression Model&lt;/h4>
&lt;p>We set up the PySpark machine learning environment. User location is
removed because it has too many levels, which leads to an extremely high
computational cost.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.classification&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">LogisticRegression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.feature&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">VectorAssembler&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.feature&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">StringIndexer&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Y-variables&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">stringIndexer&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">StringIndexer&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;hashtags&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">outputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;label&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">stringIndexer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_train&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">td&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_train&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># X-variables&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">assembler&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">VectorAssembler&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">outputCol&lt;/span>&lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;features&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">.&lt;/span>&lt;span class="n">setInputCols&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_train&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;hashtags&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;user_location&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">columns&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Setting up Multinomial Logistic Regression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lr&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LogisticRegression&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">maxIter&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">family&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;multinomial&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">assembler_df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">assembler&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">td&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">train&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">validation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">assembler_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">randomSplit&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="mf">0.8&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">0.2&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2022&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Fit a multinomial logistic regression model using the training data.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">lrModel&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">lr&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Checking the performance of the fitted model on the validation data.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">predictions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">lrModel&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">validation&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">predictions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">predictions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">label&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">predictions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">prediction&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="nb">float&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">predictions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;The accuracy of prediction in Validation Data&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">accuracy&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">The accuracy of prediction in Validation Data 0.6524463640869503
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="random-forest">Random Forest&lt;/h4>
&lt;p>Using the training set to train a Random Forest with 100 trees,
where each split considers a random subset of about sqrt(number of features) variables.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.classification&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">RandomForestClassifier&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">rf&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">RandomForestClassifier&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">featuresCol&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;features&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelCol&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;label&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">numTrees&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">100&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">rfModel&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">rf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Checking the performance using validation data&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">predictions&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">rfModel&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">validation&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">accuracy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">predictions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">filter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">predictions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">label&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">predictions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">prediction&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="nb">float&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">predictions&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;The accuracy of prediction in Validation Data&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">accuracy&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>The accuracy of prediction in Validation Data 0.8449691991786448
&lt;/code>&lt;/pre>
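&lt;p>For a sense of scale, the sketch below computes how many features a sqrt-based subset would contain for the Random Forest above. The feature count of 51 is taken from the input shape of the Keras model later in this post and is assumed to match the assembled feature vector; rounding up is likewise an assumption about how the square root is turned into an integer.&lt;/p>
&lt;pre>&lt;code class="language-python">import math

num_features = 51  # assumed: same columns as the Keras input later in the post
features_per_split = math.ceil(math.sqrt(num_features))
print(features_per_split)  # 8
&lt;/code>&lt;/pre>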
&lt;h4 id="neural-network-with-keras-api">Neural Network with Keras API&lt;/h4>
&lt;p>We convert the data to a Pandas DataFrame for better compatibility with
Keras, and create dummy (one-hot) variables for the response variable
&amp;quot;hashtags&amp;quot;, since the neural network requires numeric rather than
string targets.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">train_df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_train&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;id&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">toPandas&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">response&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;hashtags&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">train_df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">train_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">columns&lt;/span>&lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;user_location&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">response&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_dummies&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">response&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We set up the Keras environment and a neural network model with two
hidden layers of 128 and 256 nodes. The hidden layers use ReLU activation
functions for non-linearity, and the output layer uses softmax for multi-class
classification.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">tensorflow&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">tf&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">tensorflow&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">keras&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">tensorflow.keras&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">layers&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">inputs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">keras&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Input&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">51&lt;/span>&lt;span class="p">,))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">layers&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Dense&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">activation&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;relu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;dense_1&amp;#34;&lt;/span>&lt;span class="p">)(&lt;/span>&lt;span class="n">inputs&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">layers&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Dense&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">256&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">activation&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;relu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;dense_2&amp;#34;&lt;/span>&lt;span class="p">)(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">outputs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">layers&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Dense&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">activation&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;softmax&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">name&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;classification&amp;#34;&lt;/span>&lt;span class="p">)(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">keras&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Model&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">inputs&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">outputs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">outputs&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">summary&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Model: &amp;quot;model_1&amp;quot;
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 51)]              0
dense_1 (Dense)              (None, 128)               6656
dense_2 (Dense)              (None, 256)               33024
classification (Dense)       (None, 5)                 1285
=================================================================
Total params: 40,965
Trainable params: 40,965
Non-trainable params: 0
_________________________________________________________________
&lt;/code>&lt;/pre>
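&lt;p>The parameter counts in the summary can be checked by hand: a dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. A small sketch of that arithmetic (the helper name dense_params is ours, not part of Keras):&lt;/p>
&lt;pre>&lt;code class="language-python">def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

layer_sizes = [51, 128, 256, 5]  # input, dense_1, dense_2, classification
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [6656, 33024, 1285]
print(sum(counts))  # 40965
&lt;/code>&lt;/pre>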
&lt;p>The summary above shows the structure of our final model. We fit it on the
training data and monitor it on a validation split.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">compile&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">optimizer&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;adam&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">loss&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;categorical_crossentropy&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">metrics&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;accuracy&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">checkpoint_filepath&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;/tmp/checkpoint&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model_checkpoint_callback&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">keras&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">callbacks&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ModelCheckpoint&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">filepath&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">checkpoint_filepath&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">save_weights_only&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">monitor&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;val_loss&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>  &lt;span class="c1"># track validation loss, not training loss&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="n">mode&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;min&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>  &lt;span class="c1"># lower loss is better&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">save_best_only&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">train_df&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">response&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">batch_size&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">32&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">epochs&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">20&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">validation_split&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">0.2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">callbacks&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">model_checkpoint_callback&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>Epoch 1/20
3548/3548 [==============================] - 14s 4ms/step - loss: 116.8460 - accuracy: 0.4272 - val_loss: 21.8834 - val_accuracy: 0.4696
Epoch 2/20
3548/3548 [==============================] - 11s 3ms/step - loss: 11.9903 - accuracy: 0.4798 - val_loss: 1.3669 - val_accuracy: 0.4488
Epoch 3/20
3548/3548 [==============================] - 11s 3ms/step - loss: 1.2508 - accuracy: 0.4696 - val_loss: 1.6729 - val_accuracy: 0.4723
Epoch 4/20
3548/3548 [==============================] - 11s 3ms/step - loss: 1.0349 - accuracy: 0.5547 - val_loss: 0.8407 - val_accuracy: 0.6537
Epoch 5/20
3548/3548 [==============================] - 13s 4ms/step - loss: 0.7887 - accuracy: 0.6968 - val_loss: 0.7531 - val_accuracy: 0.7269
Epoch 6/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.7540 - accuracy: 0.7144 - val_loss: 0.8589 - val_accuracy: 0.6768
Epoch 7/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.7876 - accuracy: 0.7020 - val_loss: 0.7817 - val_accuracy: 0.6801
Epoch 8/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.7321 - accuracy: 0.7294 - val_loss: 0.6161 - val_accuracy: 0.7643
Epoch 9/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.9909 - accuracy: 0.6766 - val_loss: 0.9638 - val_accuracy: 0.6175
Epoch 10/20
3548/3548 [==============================] - 10s 3ms/step - loss: 0.8489 - accuracy: 0.6843 - val_loss: 0.7968 - val_accuracy: 0.6997
Epoch 11/20
3548/3548 [==============================] - 10s 3ms/step - loss: 0.8047 - accuracy: 0.7123 - val_loss: 0.8172 - val_accuracy: 0.6729
Epoch 12/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.7652 - accuracy: 0.7191 - val_loss: 1.0272 - val_accuracy: 0.5862
Epoch 13/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.8149 - accuracy: 0.6901 - val_loss: 0.6500 - val_accuracy: 0.7623
Epoch 14/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.8350 - accuracy: 0.6905 - val_loss: 0.9390 - val_accuracy: 0.6809
Epoch 15/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.7641 - accuracy: 0.7348 - val_loss: 0.6914 - val_accuracy: 0.7622
Epoch 16/20
3548/3548 [==============================] - 10s 3ms/step - loss: 0.7264 - accuracy: 0.7385 - val_loss: 0.6975 - val_accuracy: 0.7328
Epoch 17/20
3548/3548 [==============================] - 10s 3ms/step - loss: 0.7544 - accuracy: 0.7320 - val_loss: 0.9025 - val_accuracy: 0.7154
Epoch 18/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.7749 - accuracy: 0.7043 - val_loss: 0.6575 - val_accuracy: 0.7617
Epoch 19/20
3548/3548 [==============================] - 11s 3ms/step - loss: 0.7928 - accuracy: 0.7032 - val_loss: 0.9805 - val_accuracy: 0.6677
Epoch 20/20
3548/3548 [==============================] - 10s 3ms/step - loss: 0.6914 - accuracy: 0.7542 - val_loss: 0.6055 - val_accuracy: 0.7811
&amp;lt;keras.callbacks.History at 0x7f1ebf43ec10&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>The log above shows the training progress of the neural network
in each epoch. To avoid overfitting, we should use the model
with the lowest validation loss (val_loss) in the above result. The
checkpoint callback writes model weights to the checkpoint path, so they
can be reloaded for future use.&lt;/p>
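&lt;p>As an illustration of this selection rule, the sketch below picks the epoch with the lowest val_loss. The values are copied from the training log above; the variable names are ours.&lt;/p>
&lt;pre>&lt;code class="language-python"># val_loss per epoch, copied from the training log above
val_loss = [
    21.8834, 1.3669, 1.6729, 0.8407, 0.7531,
    0.8589, 0.7817, 0.6161, 0.9638, 0.7968,
    0.8172, 1.0272, 0.6500, 0.9390, 0.6914,
    0.6975, 0.9025, 0.6575, 0.9805, 0.6055,
]

# Epochs are 1-indexed in the Keras log.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])  # 20 0.6055
&lt;/code>&lt;/pre>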
&lt;h4 id="prediction">Prediction&lt;/h4>
&lt;p>We will use the testing dataset to demonstrate the outcome of the three
trained models and how the auto-hashtagging system works on incoming tweets.&lt;/p>
&lt;p>Logistic Regression&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pyspark.ml.feature&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">IndexToString&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">assembler&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">tweet_test&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_prediction&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">lrModel&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">backtoshash&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">IndexToString&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;prediction&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">outputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;hashes&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;Cryptocurrency&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Bitcoin&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Dogecoin&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Etherenum&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;binance&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_prediction&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">backtoshash&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_prediction&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_prediction&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;probability&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>+----+------------------------------------------------------------------------------------------------------+
|id |probability |
+----+------------------------------------------------------------------------------------------------------+
|29 |[0.3077313892095184,0.371686180208301,0.16157765845611455,0.15900331096889617,1.4611571697825333E-6] |
|474 |[0.33327388585249207,0.3741639190318949,0.1714864279527474,0.12107430591892694,1.4612439387007495E-6] |
|964 |[0.3913783911510319,0.3027252404883092,0.17921642595022608,0.1266785265102223,1.4159002104801821E-6] |
|1677|[0.33295682110539726,0.3273996767104981,0.16134718703164241,0.17829488835432508,1.4267981371979524E-6]|
|1950|[0.3600914851863604,0.24751148870741563,0.15465675775777693,0.23773885594117347,1.4124072734451633E-6]|
|2040|[0.42485527012649427,0.26050835049863025,0.16349387679216867,0.1511411729427666,1.3296399402649282E-6]|
|2214|[0.4419353500007448,0.2527123390106577,0.16882974288092872,0.13652119501290258,1.3730947663475665E-6] |
|2453|[0.4589381243296783,0.14606804491493697,0.2937284738352546,0.10126408940553218,1.2675145980027649E-6] |
|2509|[0.3321477952505388,0.37096013204945966,0.16393197503583595,0.13295870829424924,1.389369916457664E-6] |
|2529|[0.3048276563348717,0.23617205430574892,0.2686116148297133,0.19038718971765053,1.484812015595407E-6] |
+----+------------------------------------------------------------------------------------------------------+
only showing top 10 rows
&lt;/code>&lt;/pre>
&lt;p>The hashtag with the highest probability is the predicted
category for the corresponding sample (the Bayes optimal decision rule).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_prediction&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;hashes&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>+----+--------------+
|id |hashes |
+----+--------------+
|29 |Bitcoin |
|474 |Bitcoin |
|964 |Cryptocurrency|
|1677|Cryptocurrency|
|1950|Cryptocurrency|
|2040|Cryptocurrency|
|2214|Cryptocurrency|
|2453|Cryptocurrency|
|2509|Bitcoin |
|2529|Cryptocurrency|
+----+--------------+
only showing top 10 rows
&lt;/code>&lt;/pre>
&lt;p>Random Forest: the same steps apply to the Random Forest model and the Neural Network model.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_prediction&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">rfModel&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">backtoshash&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">IndexToString&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">inputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;prediction&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">outputCol&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;hashes&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">labels&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;Cryptocurrency&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Bitcoin&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Dogecoin&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Etherenum&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;binance&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_prediction&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">backtoshash&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_prediction&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_prediction&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;probability&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_prediction&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">select&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;id&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;hashes&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>+----+-----------------------------------------------------------------------------------------------------+
|id |probability |
+----+-----------------------------------------------------------------------------------------------------+
|29 |[0.11408553240964694,0.7291978857094307,0.06664959177602166,0.055024281883508205,0.03504270822139252]|
|474 |[0.1652564305753442,0.6331858397395639,0.08087177629844432,0.037126306578410845,0.08355964680823667] |
|964 |[0.3305776828116616,0.36432798930213495,0.14284028703120938,0.06453066609337645,0.0977233747616177] |
|1677|[0.31612928358927095,0.4227445340598757,0.11989082364177438,0.05751233645147351,0.08372302225760536] |
|1950|[0.08574619059640502,0.16340390330095428,0.03166511566726569,0.7003261461305691,0.01885864430480592] |
|2040|[0.5943068155443776,0.2726191016966887,0.07905983918076226,0.03180926486801036,0.022204978710161125] |
|2214|[0.6667520715299391,0.1644187178079488,0.10068119920639075,0.033609698746054184,0.03453831270966719] |
|2453|[0.4000689206511194,0.02974922008303281,0.5285874094943018,0.028858445070719826,0.012736004700826165]|
|2509|[0.1381026785114682,0.739044371954166,0.06888786575378644,0.027324394280704465,0.026640689499874904] |
|2529|[0.19059071422634632,0.15944406647246737,0.554794414228412,0.05290261631835627,0.042268188754417985] |
+----+-----------------------------------------------------------------------------------------------------+
only showing top 10 rows
+----+--------------+
|id |hashes |
+----+--------------+
|29 |Bitcoin |
|474 |Bitcoin |
|964 |Bitcoin |
|1677|Bitcoin |
|1950|Etherenum |
|2040|Cryptocurrency|
|2214|Cryptocurrency|
|2453|Dogecoin |
|2509|Bitcoin |
|2529|Dogecoin |
+----+--------------+
only showing top 10 rows
&lt;/code>&lt;/pre>
&lt;p>Neural Network&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">tweet_test&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;id&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">toPandas&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">response&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">test_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">pop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;hashtags&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">test_df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">test_df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">columns&lt;/span>&lt;span class="o">=&lt;/span> &lt;span class="s1">&amp;#39;user_location&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load_weights&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">checkpoint_filepath&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">prediction&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">predict&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">test_df&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">prediction&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>array([[2.35893115e-01, 3.70337307e-01, 8.22583735e-02, 3.05431724e-01,
6.07949682e-03],
[1.47559753e-04, 9.99817193e-01, 1.51092713e-12, 3.53277137e-05,
2.90755020e-22],
[3.74551356e-01, 2.91717589e-01, 1.80127278e-01, 1.19809344e-01,
3.37944217e-02],
[8.34352300e-02, 4.13003623e-01, 1.82284534e-01, 2.89828300e-01,
3.14484239e-02],
[3.59519780e-01, 4.05398160e-01, 1.29304364e-01, 1.38947032e-02,
9.18831453e-02],
[0.00000000e+00, 9.99999642e-01, 0.00000000e+00, 4.06629681e-07,
0.00000000e+00],
[9.14618373e-02, 5.94880283e-01, 1.25181660e-01, 1.41764238e-01,
4.67120372e-02],
[3.66317110e-09, 9.95616674e-01, 1.36035316e-08, 2.10694573e-03,
2.27643130e-03],
[5.13970926e-02, 3.42191756e-01, 3.40411484e-01, 9.60622579e-02,
1.69937387e-01],
[1.38813183e-01, 2.94758379e-01, 1.67302951e-01, 3.93695772e-01,
5.42974332e-03]], dtype=float32)
&lt;/code>&lt;/pre>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">np&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">label&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;Bitcoin&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Cryptocurrency&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Dogecoin&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;Etherenum&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="s1">&amp;#39;binance&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">hashtags&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">prob&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">prediction&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">index_max&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">argmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">prob&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">hashtags&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">append&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">index_max&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">hashtags&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre>&lt;code>['Cryptocurrency',
'Cryptocurrency',
'Bitcoin',
'Cryptocurrency',
'Cryptocurrency',
'Cryptocurrency',
'Cryptocurrency',
'Cryptocurrency',
'Cryptocurrency',
'Etherenum']
&lt;/code>&lt;/pre>
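&lt;p>The loop above can also be written as a single vectorized call. The following is a minimal sketch, not the project's code: it assumes the same &lt;code>label&lt;/code> order as the loop, and &lt;code>sample_prediction&lt;/code> is a hypothetical name holding three rows copied (rounded) from the &lt;code>prediction&lt;/code> output shown earlier.&lt;/p>

```python
import numpy as np

# Label order copied from the loop above (spelling kept as in the source).
label = np.array(['Bitcoin', 'Cryptocurrency', 'Dogecoin', 'Etherenum', 'binance'])

# Rows 0, 2 and 9 of the `prediction` array printed above, rounded to 4 decimals.
sample_prediction = np.array([
    [0.2359, 0.3703, 0.0823, 0.3054, 0.0061],
    [0.3746, 0.2917, 0.1801, 0.1198, 0.0338],
    [0.1388, 0.2948, 0.1673, 0.3937, 0.0054],
])

# argmax along axis 1 gives the index of the largest probability in each row;
# indexing the label array with those indices maps every row to its hashtag at once.
sample_hashtags = label[np.argmax(sample_prediction, axis=1)].tolist()
print(sample_hashtags)  # ['Cryptocurrency', 'Bitcoin', 'Etherenum']
```

&lt;p>The three results agree with entries 1, 3 and 10 of the list printed by the loop.&lt;/p>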
&lt;h4 id="conclusion-and-possible-improvement">Conclusion and Possible Improvement&lt;/h4>
&lt;p>This project demonstrates data analysis of tweets and related
information in a PySpark environment, followed by a simple text
classification using three popular supervised learning models: logistic
regression, Random Forest, and ANN. The Random Forest model achieves 84%
accuracy in the hashtag classification of the validation dataset.
Therefore, the prediction can be an important input for an automatic
tagging system for new tweets.&lt;/p>
&lt;p>There are alternative choices of machine learning models for text
classification in this project, for example the vectorized version of
the neural network, support vector machine, KNN, and others. Some of
these models might perform better than the models above, and trying them is
a possible improvement for future development.&lt;/p>
&lt;p>Also, &lt;a href="https://developer.twitter.com/en/docs/twitter-api" target="_blank" rel="noopener">Twitter API&lt;/a>
provides more possibilities for data mining, such as but not limited to
streaming, recent search, and particular user search. Therefore, the
Kaggle dataset in this project can be replaced with other sources to
improve the performance.&lt;/p></description></item><item><title>Traditional Statistical Learning: Classification in Self-Assessed Financial Health Status</title><link>https://joeliang0520.github.io/project/cosmetics/</link><pubDate>Sat, 01 Jan 2022 00:00:00 +0000</pubDate><guid>https://joeliang0520.github.io/project/cosmetics/</guid><description>&lt;h2 id="disclaimer">Disclaimer&lt;/h2>
&lt;p>This project is an improved version of the final project for the upper-year Statistics course &amp;ldquo;STAT441: Statistical Learning - Classification&amp;rdquo; at the University of Waterloo, by Bolun Cui and Joe Liang.&lt;/p>
&lt;h2 id="video">Video&lt;/h2>
&lt;p>A video explanation of this project can be found &lt;a href="https://youtu.be/LncR2eTuaW8" target="_blank" rel="noopener">here&lt;/a>. (Note: The video was made for the university course project, so some parts of the video might not match the file.)&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>The dataset in this project corresponds to the responses in the &lt;a href="https://www.gesis.org/en/allbus/allbus-home/general-information" target="_blank" rel="noopener">German General Social Survey (ALLBUS)&lt;/a> between 2005 and 2019. The target variable for machine learning is the last variable, &amp;ldquo;health&amp;rdquo;. It is an ordinal variable with five categories from 1 to 5 and represents the &amp;ldquo;self-assessed financial health&amp;rdquo; of each survey response.&lt;/p>
&lt;p>The dataset has two parts, &amp;ldquo;train.csv&amp;rdquo; and &amp;ldquo;test.csv&amp;rdquo;. The samples in &amp;ldquo;train.csv&amp;rdquo; include the &amp;ldquo;health&amp;rdquo; variable and are used for model training, while &amp;ldquo;test.csv&amp;rdquo; does not have the &amp;ldquo;health&amp;rdquo; variable. The goal of this project is to train a classification model using &amp;ldquo;train.csv&amp;rdquo; to classify survey responses in &amp;ldquo;test.csv&amp;rdquo; into one of the financial health categories.&lt;/p>
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;p>Complete documentation can be found in the &amp;ldquo;Supervised Learning code.rmd&amp;rdquo; file.&lt;/p>
&lt;h3 id="exploratory-data-analysis">Exploratory Data Analysis&lt;/h3>
&lt;ul>
&lt;li>Outlier analysis&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167510234-a40d0857-18ba-49ab-9211-a77439907923.png" alt="Screen Shot 2022-05-09 at 6 39 19 PM" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Target variable distribution analysis and normalization&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167510495-4e9d4185-cff5-4c59-8182-0ef97cfe02a8.png" alt="Screen Shot 2022-05-09 at 6 43 30 PM" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="feature-engineering">Feature Engineering&lt;/h3>
&lt;p>During analysis, based on our domain knowledge, we derived a new x-variable: the average living space in m² per person in the household.&lt;/p>
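&lt;p>The derived variable is a simple ratio. The sketch below is written in Python for illustration only (the project itself uses R); the function name and both argument names are hypothetical, since the survey column names are not given here, and returning a missing value for an empty or unknown household size is an assumption.&lt;/p>

```python
def living_space_per_person(living_space_m2, household_size):
    """Average living space in square metres per household member."""
    if not household_size:
        # Unknown or zero household size: keep the derived value missing
        # instead of dividing by zero (an assumed convention).
        return None
    return living_space_m2 / household_size

print(living_space_per_person(90.0, 3))  # 30.0
```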
&lt;h3 id="random-foresting">Random Foresting&lt;/h3>
&lt;ul>
&lt;li>Out-of-bag (OOB) sample tuning of the number of variables to choose from and the number of trees&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167512133-90c355a2-6c07-428c-a0f7-16f52dadebda.png" alt="Screen Shot 2022-05-09 at 6 59 31 PM" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Importance of variables from OOB samples (randomly permute each variable and measure the decrease in accuracy)&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167512382-42b851ae-18c9-4aed-aecd-d52827a579c1.png" alt="image" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
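&lt;p>The permutation idea behind this importance measure can be sketched as follows. This is a simplified illustration in Python, not the project's R code: it shuffles each column on one fixed sample rather than on each tree's OOB samples, and the toy data, &lt;code>toy_model&lt;/code>, and the helper names are all hypothetical.&lt;/p>

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which predict(row) equals the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop observed after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return drops

# Toy data (hypothetical): the label depends only on feature 0.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

def toy_model(row):
    # Stand-in for a fitted classifier; it reads feature 0 only.
    return int(row[0] > 0.5)

drops = permutation_importance(toy_model, rows, labels, n_features=2)
print(drops)
```

&lt;p>Shuffling feature 1 leaves the accuracy unchanged because the toy model never reads it, while shuffling feature 0 removes a large share of the accuracy, so feature 0 ranks as the important variable.&lt;/p>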
&lt;ul>
&lt;li>Tracing the performance for different numbers of trees&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167513470-1775942b-3a12-40d8-a8e1-b487fae8f6c3.gif" alt="ezgif-5-6cdd33b368" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="neural-network">Neural Network&lt;/h3>
&lt;ul>
&lt;li>Pipeline implementation of a neural network with two hidden layers&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167514019-340214b8-ed37-49ac-b301-653565b66ac0.gif" alt="pip" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Tuning epochs (number of iterations) to balance the bias-variance tradeoff&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167511333-45ce2f77-87f2-42df-a84e-372223eaca53.gif" alt="neural" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>Tuning the number of nodes and layers using validation cross-entropy&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167513662-b1bc3c49-a2f1-4192-96d8-fbb189e31487.png" alt="image" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167514574-f7be8d49-a08a-4911-8820-7f718171eabb.png" alt="image" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="performance-of-the-model">Performance of the Model&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://user-images.githubusercontent.com/50597009/167514689-ae11686b-a180-4d94-84a5-35e78f941d42.png" alt="image" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="enviorment">Enviorment&lt;/h2>
&lt;p>This project uses R with R markdown for better visualization. Please visit the official websites for documentation and installation of &lt;a href="https://www.r-project.org/" target="_blank" rel="noopener">R&lt;/a>, and &lt;a href="https://rmarkdown.rstudio.com/" target="_blank" rel="noopener">R Markdown&lt;/a>. &lt;a href="https://www.rstudio.com/" target="_blank" rel="noopener">R studio&lt;/a> is recommended to open the .rmd file.&lt;/p>
&lt;p>The packages required to execute the code in the .rmd file are listed below and can be installed from CRAN using&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">install.packages&lt;span class="o">(&lt;/span>&lt;span class="s2">&amp;#34;package_name&amp;#34;&lt;/span>&lt;span class="o">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>in R or RStudio.&lt;/p>
&lt;ul>
&lt;li>randomForest: a comprehensive package for Random Forest Model training&lt;/li>
&lt;li>caret: a machine learning platform with many integrated features, such as cross-validation&lt;/li>
&lt;li>fastDummies: a package that converts categorical variables into indicator (dummy) variables&lt;/li>
&lt;li>keras: a comprehensive neural network package built on TensorFlow (a TensorFlow installation is required)&lt;/li>
&lt;li>gbm: supports the Generalized Boosting Model&lt;/li>
&lt;li>nnet: supports the multinomial logistic regression model&lt;/li>
&lt;/ul></description></item></channel></rss>