<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Chengyin Eng: Writing]]></title><description><![CDATA[On AI]]></description><link>https://www.chengyineng.com/s/writing</link><image><url>https://substackcdn.com/image/fetch/$s_!mBDd!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6af98-6e11-4a35-ac44-46a57b8cdad6_1280x1280.png</url><title>Chengyin Eng: Writing</title><link>https://www.chengyineng.com/s/writing</link></image><generator>Substack</generator><lastBuildDate>Sun, 05 Apr 2026 13:47:12 GMT</lastBuildDate><atom:link href="https://www.chengyineng.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Chengyin Eng]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[chengyineng@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[chengyineng@substack.com]]></itunes:email><itunes:name><![CDATA[Chengyin Eng]]></itunes:name></itunes:owner><itunes:author><![CDATA[Chengyin Eng]]></itunes:author><googleplay:owner><![CDATA[chengyineng@substack.com]]></googleplay:owner><googleplay:email><![CDATA[chengyineng@substack.com]]></googleplay:email><googleplay:author><![CDATA[Chengyin Eng]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Saving Mothers with ML: How CareSource uses MLOps to Improve Healthcare in High-Risk Obstetrics]]></title><description><![CDATA[Published on https://www.databricks.com/blog/2023/04/03/saving-mothers-ml-how-mlops-improves-healthcare-high-risk-obstetrics.html]]></description><link>https://www.chengyineng.com/p/saving-mothers-with-ml-how-caresource-2ad</link><guid 
isPermaLink="false">https://www.chengyineng.com/p/saving-mothers-with-ml-how-caresource-2ad</guid><dc:creator><![CDATA[Chengyin Eng]]></dc:creator><pubDate>Wed, 25 Sep 2024 23:03:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4423392f-bfbf-4031-a746-4973fee7fe79_1200x661.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Published on <a href="https://www.databricks.com/blog/2023/04/03/saving-mothers-ml-how-mlops-improves-healthcare-high-risk-obstetrics.html">https://www.databricks.com/blog/2023/04/03/saving-mothers-ml-how-mlops-improves-healthcare-high-risk-obstetrics.html</a></p>]]></content:encoded></item><item><title><![CDATA[Just Enough Theoretical Underpinnings for NLP]]></title><description><![CDATA[Note: This article was originally posted on Open Data Science Conference here.]]></description><link>https://www.chengyineng.com/p/just-enough-theoretical-underpinnings</link><guid isPermaLink="false">https://www.chengyineng.com/p/just-enough-theoretical-underpinnings</guid><dc:creator><![CDATA[Chengyin Eng]]></dc:creator><pubDate>Wed, 25 Sep 2024 22:52:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!v7zE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: This article was originally posted on Open Data Science Conference </em><a href="https://odsc.com/blog/just-enough-theoretical-underpinnings-for-nlp/">here</a>.</p><p>&#8220;How do you say good morning in Spanish?&#8221; &#8211; This is an early research example in the field of Natural Language Processing (NLP), dating back to the 1950s. Using statistical techniques to analyze human language became popular in the 1990s; two decades later, in the 2010s, research efforts shifted toward deep neural networks for NLP. 
We keep hearing about new models with millions or billions of parameters, such as BERT, XLNet, and ALBERT, but are they really entirely different from one another? What about methods like TF-IDF and Word2Vec? Are they still relevant and useful? How do we keep up with the constant evolution and innovation in the NLP field? How do we start understanding the model details behind the scenes, rather than just using these models?&nbsp;</p><p>Start with the common, basic theoretical foundation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v7zE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v7zE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png 424w, https://substackcdn.com/image/fetch/$s_!v7zE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png 848w, https://substackcdn.com/image/fetch/$s_!v7zE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png 1272w, https://substackcdn.com/image/fetch/$s_!v7zE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!v7zE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png" width="1456" height="542" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da976a49-446e-4682-91ef-d9c1783b885b_2500x931.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!v7zE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png 424w, https://substackcdn.com/image/fetch/$s_!v7zE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png 848w, https://substackcdn.com/image/fetch/$s_!v7zE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png 1272w, https://substackcdn.com/image/fetch/$s_!v7zE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda976a49-446e-4682-91ef-d9c1783b885b_2500x931.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Deep learning or not, probabilistic language modeling is a common thread across all NLP tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Dh6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Dh6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png 424w, 
https://substackcdn.com/image/fetch/$s_!-Dh6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png 848w, https://substackcdn.com/image/fetch/$s_!-Dh6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!-Dh6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-Dh6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png" width="1456" height="1278" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f383830d-1813-4078-89f3-35b388d429f6_1568x1376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1278,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-Dh6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png 424w, 
https://substackcdn.com/image/fetch/$s_!-Dh6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png 848w, https://substackcdn.com/image/fetch/$s_!-Dh6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!-Dh6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff383830d-1813-4078-89f3-35b388d429f6_1568x1376.png 1456w" sizes="100vw"></picture></div></a></figure></div><h3>Count-based Approaches</h3><p>In the non-deep learning realm, probabilistic language models (LMs) leverage count-based approaches, which represent words in terms of their frequencies. As an example, let&#8217;s now try to set up n-grams and bag-of-words (BOW) in the framework of a Naive Bayes classification model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Bs5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Bs5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png 424w, https://substackcdn.com/image/fetch/$s_!1Bs5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png 848w, https://substackcdn.com/image/fetch/$s_!1Bs5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png 1272w, https://substackcdn.com/image/fetch/$s_!1Bs5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Bs5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png" 
width="1456" height="778" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97812279-dfae-4d47-829b-83f3193a7222_1752x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:778,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!1Bs5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png 424w, https://substackcdn.com/image/fetch/$s_!1Bs5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png 848w, https://substackcdn.com/image/fetch/$s_!1Bs5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png 1272w, https://substackcdn.com/image/fetch/$s_!1Bs5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97812279-dfae-4d47-829b-83f3193a7222_1752x936.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Naive Bayes treats the entire input text as a bag of words and models the probability of text based on unigrams.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sYzp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sYzp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png 424w, 
https://substackcdn.com/image/fetch/$s_!sYzp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png 848w, https://substackcdn.com/image/fetch/$s_!sYzp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png 1272w, https://substackcdn.com/image/fetch/$s_!sYzp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sYzp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png" width="1143" height="901" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:901,&quot;width&quot;:1143,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!sYzp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png 424w, 
https://substackcdn.com/image/fetch/$s_!sYzp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png 848w, https://substackcdn.com/image/fetch/$s_!sYzp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png 1272w, https://substackcdn.com/image/fetch/$s_!sYzp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18263b7c-f28b-445a-84d1-da838d97f015_1143x901.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The diagram above shows that Naive Bayes infers the probability distribution of unlabeled data from the probability distribution of labeled data, all by counting!</p><p>Positive pointwise mutual information (PPMI), on the other hand, frames the probabilistic LM problem by asking, &#8220;Do words w1 and w2 co-occur more often than they would if they were independent?&#8221; Term Frequency-Inverse Document Frequency (TF-IDF) scores a term&#8217;s importance by weighing how often it appears in a document against how many documents in the corpus contain it; there have also been attempts to estimate the probability that any given document <em>d</em> contains a term <em>t</em>.&nbsp;</p><h3>Prediction-based Approaches&nbsp;</h3><p>Count-based methods produce highly sparse word representation vectors and typically do not generalize well &#8211; what if there are new words not present in your training vocabulary? We encounter very different vocabulary when reading The Lord of the Rings vs. The New York Times vs. Twitter. The idea that we could use the text at hand as implicitly supervised training data was a paradigm shift in framing NLP tasks. NLP approaches started to pivot from count-based to prediction-based methods.&nbsp;</p><p>Word2Vec is a shallow neural network with a single hidden layer, trained on binary prediction tasks over a window of nearby words; for instance, is the word &#8220;hobbit&#8221; likely to show up near &#8220;ring&#8221;? We then use the learned model weights as the word embeddings, which are vector representations of words. Word2Vec is quick to train but, because it only looks at local windows, does not take global context into account.&nbsp;</p><p>In contrast, since words that occur together may encode some form of meaning, Global Vectors for Word Representation (GloVe) generates embeddings from an aggregated global matrix of word co-occurrence counts. 
For example, one would expect to see the words &#8220;ice&#8221; and &#8220;solid&#8221; more frequently together than &#8220;ice&#8221; and &#8220;gas&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JdxP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JdxP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png 424w, https://substackcdn.com/image/fetch/$s_!JdxP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png 848w, https://substackcdn.com/image/fetch/$s_!JdxP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png 1272w, https://substackcdn.com/image/fetch/$s_!JdxP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JdxP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png" width="700" height="148" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:148,&quot;width&quot;:700,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JdxP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png 424w, https://substackcdn.com/image/fetch/$s_!JdxP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png 848w, https://substackcdn.com/image/fetch/$s_!JdxP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png 1272w, https://substackcdn.com/image/fetch/$s_!JdxP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ed8998-efdc-44ad-af45-0e1e27a1025c_700x148.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Image sourced from <a href="https://nlp.stanford.edu/projects/glove/">nlp.stanford.edu</a></em></figcaption></figure></div><p>One fundamental limitation that both Word2Vec and GloVe share is that they do not know how to deal with out-of-vocabulary (OOV) words. This is where fastText comes in. 
<a href="https://arxiv.org/pdf/1607.04606v2.pdf">fastText</a> looks at substrings of words: each word is represented as the sum of its character n-grams, with each n-gram ranging from 3 to 6 characters.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bkq4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bkq4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png 424w, https://substackcdn.com/image/fetch/$s_!bkq4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png 848w, https://substackcdn.com/image/fetch/$s_!bkq4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png 1272w, https://substackcdn.com/image/fetch/$s_!bkq4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bkq4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png" width="624" height="826" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a07271b-0691-4527-b665-14b6c157a636_624x826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:826,&quot;width&quot;:624,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!bkq4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png 424w, https://substackcdn.com/image/fetch/$s_!bkq4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png 848w, https://substackcdn.com/image/fetch/$s_!bkq4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png 1272w, https://substackcdn.com/image/fetch/$s_!bkq4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a07271b-0691-4527-b665-14b6c157a636_624x826.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image sourced from the <a href="https://arxiv.org/pdf/1607.04606v2.pdf">fastText</a> paper</em></figcaption></figure></div><h3>Contextual Approaches (Deep Learning)</h3><p>None of the methods discussed above is able to capture contextual information well.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CslN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CslN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png 424w, 
https://substackcdn.com/image/fetch/$s_!CslN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png 848w, https://substackcdn.com/image/fetch/$s_!CslN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png 1272w, https://substackcdn.com/image/fetch/$s_!CslN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CslN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png" width="1456" height="880" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:880,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CslN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png 424w, 
https://substackcdn.com/image/fetch/$s_!CslN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png 848w, https://substackcdn.com/image/fetch/$s_!CslN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png 1272w, https://substackcdn.com/image/fetch/$s_!CslN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a5f21c-2902-41a1-bb1b-704914ed820c_1536x928.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Hence, we turn to neural network architectures to process long sequences. Here, probabilistic LMs manifest in that a neural network predicts the probability distribution of the next word, given the words that the model has seen so far. First, let&#8217;s discuss recurrent neural networks (RNNs) briefly. RNNs contain feedback loops: they step through a sequence one element at a time and carry a hidden state forward from each step to the next. The most popular RNN variant in NLP is Long Short-Term Memory (LSTM). While vanilla RNNs have difficulty accessing information from many steps back (a bottleneck), LSTMs have a separate memory cell with built-in forget gates to <em>forget </em>unimportant information.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zgJp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zgJp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png 424w, https://substackcdn.com/image/fetch/$s_!zgJp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png 848w, https://substackcdn.com/image/fetch/$s_!zgJp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zgJp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zgJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png" width="702" height="356" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:702,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zgJp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png 424w, https://substackcdn.com/image/fetch/$s_!zgJp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png 848w, https://substackcdn.com/image/fetch/$s_!zgJp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zgJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e2e2437-d6ea-4aa0-beb1-a141fd9da13c_702x356.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image sourced from </em><a href="https://arxiv.org/pdf/1706.08924.pdf">Raseem et al. 2017</a></figcaption></figure></div><p>A notable model architecture that uses LSTMs is Embeddings from Language Models (ELMo). 
ELMo is bidirectional: it reads the input text from left to right and from right to left, mirroring a human reader&#8217;s experience, and concatenates the two resulting representations. However, because the two directions are only concatenated, the training process cannot take advantage of left and right contexts simultaneously, nor does it resolve the bottleneck problem completely. This leads to our most popular architecture: Transformers. Transformers leverage attention mechanisms; as the name implies, attention <em>attends </em>to its inputs, which means it pays the most attention to the most important words. Furthermore, transformers process all the words of a text simultaneously rather than one at a time, and they additionally encode the word positions. Therefore, transformers train faster and, thanks to the positional encodings, we do not have to worry about word order being jumbled up.</p><p>BERT (Bidirectional Encoder Representations from Transformers), published in 2019, is actually a variant of the <a href="https://arxiv.org/abs/1706.03762">original transformer architecture</a> introduced in 2017. Unlike the original encoder-decoder transformer architecture, which was meant for sequence generation tasks such as machine translation, BERT does not have a decoder component. This architectural difference indicates that BERT is not trained for sequence generation tasks.</p><p>Today, we hear about a variety of other transformer architectures, such as RoBERTa, XLNet, ALBERT, and so on. These models build on BERT and share many commonalities in architecture, while improving upon the training process. 
In order to keep up with the evolution of the NLP landscape, it is necessary to understand only the foundation of <a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">attention</a> and the <a href="https://jalammar.github.io/illustrated-bert/">BERT architecture</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ekQK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ekQK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ekQK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ekQK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ekQK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ekQK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg" width="1280" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ekQK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ekQK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ekQK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ekQK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf29cadb-fc69-4959-9e76-7cfb80e36044_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Summary</h3><p>In this article, you learned that probabilistic language modeling is at the core of NLP tasks. You also learned the motivation for moving from count-based to prediction-based to deep contextual models, and the high-level differences and similarities among these models. 
You now have just enough theoretical background to get started in NLP!&nbsp;</p>]]></content:encoded></item><item><title><![CDATA[Is the power of random forests due to the number of trees?]]></title><description><![CDATA[Written in March 2022.]]></description><link>https://www.chengyineng.com/p/is-the-power-of-random-forests-due</link><guid isPermaLink="false">https://www.chengyineng.com/p/is-the-power-of-random-forests-due</guid><dc:creator><![CDATA[Chengyin Eng]]></dc:creator><pubDate>Wed, 25 Sep 2024 22:48:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mBDd!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6af98-6e11-4a35-ac44-46a57b8cdad6_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Written in March 2022. </p><p></p><p>To pay homage to my ecology background, it is only right that I begin discussing machine learning methods with random forest.</p><p>Random forest is one of my favorite algorithms - it&#8217;s effective and elegant. In real life, a forest is made up of many trees. The same goes for random forest - it is an ensemble of decision trees that are trained independently of each other and ultimately cast a majority vote. To answer the question that the title of this post poses, we must start by investigating:</p><h2>What&#8217;s random about random forests?</h2><p>There are two elements of randomness in a random forest:<br>1. random sampling of training data points when building trees<br>2. random subsets of features considered</p><p>During training, each tree learns from a random sample of the training data drawn with replacement (bootstrapping), so some instances are used multiple times in a tree. As a result, each tree sees only about 2/3 of the distinct training instances; the remaining 1/3 or so are left out of that tree. Therefore, each tree learns different parts of the training data. 
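</p><p>To check the &#8220;about 2/3&#8221; figure for yourself, here is a minimal sketch in plain Python. It is only an illustration: the function name <code>bootstrap_unique_fraction</code>, the sample size, and the seed are my own assumptions and not part of any random forest library. It draws one bootstrap sample and measures the share of distinct instances that end up in it, which should land near 1 - 1/e (about 0.63), leaving roughly 0.37 out of the bag.</p><pre><code><code>import random


def bootstrap_unique_fraction(n_samples, seed=0):
    """Share of distinct instances in one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    # Draw n_samples row indices with replacement (bootstrapping).
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Indices drawn at least once are "in-bag" for this hypothetical tree.
    return len(set(drawn)) / n_samples


in_bag = bootstrap_unique_fraction(10_000)
print(f"in-bag share: {in_bag:.3f}, out-of-bag share: {1 - in_bag:.3f}")</code></code></pre><p>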
Random subsets of features are considered when splitting nodes. At test time, a random forest makes predictions by aggregating the predictions of its trees: a majority vote for classification, or an average for regression. This is why you often hear the term &#8220;bagging&#8221;: it is simply a combination of two processes, <strong>b</strong>ootstrap and <strong>agg</strong>regat<strong>ing</strong>.</p><h2>Why is randomness powerful then?</h2><p>The random selection of features considered can reduce the effect of correlation between inputs while maintaining the strength of each tree. According to Leo Breiman (2001), the author of the random forest paper, random selection of features helps the resulting forest to be relatively robust to outliers and noise. Moreover, the random nature of feature selection allows this algorithm to be parallelized to reduce training time.</p><p>Random sampling of data points allows us to generate ongoing estimates of the generalization error, strength, and correlation (since every tree&#8217;s knowledge within the ensemble is different, we can truly test if the forest can generalize to unseen data).</p><p>It is perhaps obvious at this point, but Breiman also pointed out that the accuracy of random forests depends on two things: the strength of individual trees and a measure of dependence between them.</p><p>It&#8217;s worth noting that while random forest can help reduce the problem of multicollinearity of features, the variable importance of the correlated features will be reduced. It is still important to use your domain knowledge, if possible, to inform which features to use or remove. If you lack domain expertise, it is worth running different random forest experiments to test different combinations of features and see for yourself the differences in model performance and variable importance values.</p><h5>Increasing the number of trees (alone) does not result in overfitting</h5><p>This is perhaps a shocking and confusing statement to you. 
Many of you know that as model complexity increases, the tendency to overfit also increases. However, in the context of random forests, the number of trees built is not a problem because each tree is built <strong>independently</strong> of the others, meaning each tree always starts from scratch. If you need math to convince you, Leo Breiman (2001) proved this using the Strong Law of Large Numbers: error rates converge to a certain value. Simply put, while keeping all other hyperparameters constant, beyond a sufficiently large number of trees, error rates plateau and do not increase anymore. Following up on Breiman&#8217;s paper, Probst and Boulesteix (2018) also concluded empirically that we should not tune the number of trees and that more trees do not degrade performance. In other words, theoretically, you should pick the number of trees based on computational cost.</p><p>However, Probst and Boulesteix (2018) also found empirically that the largest performance gain often comes from the first 100 trees. Oftentimes, we see that the testing performance might decrease when we increase the number of trees, even if all else stays constant. When you see this, you should take it as a sign of distributional differences between the training data and the testing data. If both training and testing data share the same distribution, perhaps your trees are too &#8220;fully grown&#8221;, which means your trees are too deep; in that case, it&#8217;s time to reduce the depth of the trees. Generally, to reach convergence of the error rate, you need more trees when you have:</p><ol><li><p>lower sample size for each tree while training</p></li><li><p>bigger constraints on tree depth</p></li><li><p>more variables (since they lead to less correlated trees)</p></li></ol><h2>Conclusion</h2><p>The name of random forests gives away the reason why they are so powerful. It&#8217;s because of the embedded randomness within the algorithm. 
Don&#8217;t blame the number of trees alone when your model overfits. Check the tree depth, the sample size for each tree, and the number of variables.</p><h2>References</h2><ol><li><p>Breiman, Leo (2001), Random Forests, <a href="https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf">https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf</a></p></li><li><p>Probst, Philipp and Boulesteix, Anne-Laure (2018), To Tune or Not to Tune the Number of Trees in Random Forest, <a href="http://www.jmlr.org/papers/volume18/17-269/17-269.pdf">http://www.jmlr.org/papers/volume18/17-269/17-269.pdf</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[How to begin NLP with CORD-19 Data?]]></title><description><![CDATA[Written in March 2022.]]></description><link>https://www.chengyineng.com/p/how-to-begin-nlp-with-cord-19-data</link><guid isPermaLink="false">https://www.chengyineng.com/p/how-to-begin-nlp-with-cord-19-data</guid><dc:creator><![CDATA[Chengyin Eng]]></dc:creator><pubDate>Wed, 25 Sep 2024 22:45:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZRTm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Written in March 2022. </p><p></p><p>It&#8217;s hard to lead &#8220;normal&#8221; lives while confronting the &#8220;new normal&#8221; as the COVID-19 pandemic sweeps across the nation and the rest of the world. Aside from social distancing, many of us want to use our skills to help, as we spend more time at home.</p><p>This post is meant to provide some starter ideas on how to begin analyzing the COVID-19 Open Research Dataset (CORD-19), which compiles scholarly articles on coronaviruses. I first learned about this dataset because multiple tech giants and other organizations partnered (a rare effort) to create it. 
They launched a Kaggle challenge which you can read more about <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge#">here</a> .</p><h2>About the data</h2><p>As mentioned, this data is text-based and is provided in <code>json</code> format. There are three data files altogether: peer-reviewed commercial use subset (9000 papers), peer-reviewed non-commercial use subset (1973 papers), and another non-peer-reviewed dataset (803 papers). In this post, I will be using the Databricks environment to analyze the dataset for commercial use, since it has more files than the other two.</p><pre><code><code>comm_use_subset.printSchema()</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZRTm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZRTm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ZRTm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZRTm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!ZRTm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZRTm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg" width="1456" height="393" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;comm_use_subset_schema&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="comm_use_subset_schema" title="comm_use_subset_schema" srcset="https://substackcdn.com/image/fetch/$s_!ZRTm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ZRTm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZRTm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!ZRTm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652f9fbe-cb14-42a4-8d69-4cc5cd722338_2409x651.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>As you can see, the file is very deeply nested. You can probably tell by now that cleaning this data would be a significant undertaking. 
But I am going to take the easy way out here for now, and show you how we can very quickly make use of the data in its current state and generate some simple Natural Language Processing (NLP) analysis.</p><p>Many associate NLP with deep learning now and I won&#8217;t blame them, because deep learning is indeed a very powerful tool in analyzing text. However, I am going to show you simple NLP approaches with and without deep learning methods.</p><h2>Non-deep learning</h2><p>Say that I am interested in knowing, at a glance, which are the most common words that appear in all the titles of these peer-reviewed papers. I can generate a word cloud.</p><pre><code><code>comm_use_subset.select("metadata.title").show(3)</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!epOU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!epOU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg 424w, https://substackcdn.com/image/fetch/$s_!epOU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg 848w, https://substackcdn.com/image/fetch/$s_!epOU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!epOU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!epOU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg" width="1456" height="152" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:152,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;metadata title&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="metadata title" title="metadata title" srcset="https://substackcdn.com/image/fetch/$s_!epOU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg 424w, https://substackcdn.com/image/fetch/$s_!epOU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg 848w, https://substackcdn.com/image/fetch/$s_!epOU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!epOU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d94ec9-c9a5-4a66-a743-1389ab905945_2466x258.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>First, I concatenated all the available titles of papers.</p><pre><code><code>from pyspark.sql.functions import concat_ws, collect_list

all_title_df = comm_use_subset.agg(concat_ws(", ", collect_list(comm_use_subset['metadata.title'])).alias('all_titles'))
display(all_title_df)</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lrde!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lrde!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Lrde!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Lrde!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Lrde!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lrde!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg" width="1456" height="210" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:210,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;all_titles&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="all_titles" title="all_titles" srcset="https://substackcdn.com/image/fetch/$s_!Lrde!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Lrde!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Lrde!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Lrde!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e2f4a-1d89-4075-91b0-b62edd92070b_2450x354.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Then, I wrote a simple function to plot a word cloud using an existing Python library called <code>wordcloud</code>. I first tokenized the text by splitting it into individual words. Next, I passed the library&#8217;s default set of stopwords to <code>WordCloud</code> so that those words are left out of the plot. 
The <code>WordCloud()</code>.<code>generate()</code> function counts how many times each word appears. This means that the size of each word in the word cloud we see later directly reflects how often that word appears in the text.</p><pre><code><code>import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def wordcloud_draw(text, color = 'white'):
    """
    Plots wordcloud of string text after removing stopwords
    """
    # Tokenize on whitespace, then rejoin into a single string for WordCloud
    cleaned_word = " ".join([word for word in text.split()])
    wordcloud = WordCloud(stopwords=STOPWORDS,
                  background_color=color,
                  width=1000,
                  height=1000
                 ).generate(cleaned_word)
    plt.figure(1,figsize=(8, 8))
    plt.imshow(wordcloud)
    plt.axis('off')
    display(plt.show())</code></code></pre><pre><code><code>wordcloud_draw(str(all_title_df.select('all_titles').collect()[0]))</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CSr1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CSr1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CSr1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CSr1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CSr1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CSr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg" width="1290" height="1276" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1276,&quot;width&quot;:1290,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;word+cloud+1&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="word+cloud+1" title="word+cloud+1" srcset="https://substackcdn.com/image/fetch/$s_!CSr1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CSr1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CSr1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CSr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc87086b9-d8f9-4fa1-aadc-52a5181c24d8_1290x1276.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>As you can see, there are quite a few non-meaningful words in the word cloud, e.g. &#8220;using&#8221;, &#8220;based&#8221;, and &#8220;analysis&#8221;. Let&#8217;s remove some of them by making a very minor change to our <code>wordcloud_draw</code> function: call <code>update()</code> on <code>STOPWORDS</code> to add these custom stopwords.</p><pre><code><code>def custom_wordcloud_draw(text, color = 'white'):
    """
    Plots wordcloud of string text after removing stopwords
    """
    cleaned_word = " ".join([word for word in text.split()])
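    # set.update() mutates STOPWORDS in place and returns None, so call it before building the WordCloud
    STOPWORDS.update(['using', 'based', 'analysis', 'study', 'research', 'viruses'])
    wordcloud = WordCloud(stopwords=STOPWORDS,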
                      background_color=color,
                      width=1000,
                      height=1000
                     ).generate(cleaned_word)
    plt.figure(1,figsize=(8, 8))
    plt.imshow(wordcloud)
    plt.axis('off')
    display(plt.show())</code></code></pre><pre><code><code>custom_wordcloud_draw(str(all_title_df.select('all_titles').collect()[0]))</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xKSq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xKSq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png 424w, https://substackcdn.com/image/fetch/$s_!xKSq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png 848w, https://substackcdn.com/image/fetch/$s_!xKSq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!xKSq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xKSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png" width="1290" height="1272" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1272,&quot;width&quot;:1290,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Screen Shot 2020-03-26 at 8.15.44 AM.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Screen Shot 2020-03-26 at 8.15.44 AM.png" title="Screen Shot 2020-03-26 at 8.15.44 AM.png" srcset="https://substackcdn.com/image/fetch/$s_!xKSq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png 424w, https://substackcdn.com/image/fetch/$s_!xKSq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png 848w, https://substackcdn.com/image/fetch/$s_!xKSq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!xKSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eaf4a4b-f83e-45d1-aa4d-cc8b148622ff_1290x1272.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Yayy! We successfully removed the non-meaningful words we are not interested in.</p><p>Now, let&#8217;s move on to NLP analysis using deep learning methods.</p><h2>Deep Learning</h2><p>We are going to use an extractive summarizer that combines BERT sentence embeddings with K-means clustering and was initially used to summarize lectures! Refer to this <a href="https://arxiv.org/abs/1906.04165">paper</a> to read more about it. To use this library, you need to pip install <code>bert-extractive-summarizer</code>.</p><p>As with all machine learning models, you should be well aware of your model&#8217;s limitations. 
For this summarizer model, the known limitation is that it does not do well at summarizing text that has over 100 sentences.</p><p>In this section, my goal is to summarize abstracts of papers, so that I need to read even fewer words to get the gist of the paper!</p><p>Let me start by reading the abstract of one of the papers.</p><pre><code><code>from summarizer import Summarizer

abstract = str(comm_use_subset.select("abstract.text").take(2)[1])
abstract</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eB1g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eB1g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png 424w, https://substackcdn.com/image/fetch/$s_!eB1g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png 848w, https://substackcdn.com/image/fetch/$s_!eB1g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!eB1g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eB1g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png" width="1456" height="812" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;This abstract is so long that it can&#8217;t fit into my browser window!&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="This abstract is so long that it can&#8217;t fit into my browser window!" title="This abstract is so long that it can&#8217;t fit into my browser window!" srcset="https://substackcdn.com/image/fetch/$s_!eB1g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png 424w, https://substackcdn.com/image/fetch/$s_!eB1g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png 848w, https://substackcdn.com/image/fetch/$s_!eB1g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!eB1g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c56aae-02a2-4ff6-8b74-25aaf23fcb51_2004x1118.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">This abstract is so long that it can&#8217;t fit into my browser window!</figcaption></figure></div><p>Now, let me instantiate the pre-trained summarizer model and drop any sentences that have fewer than 20 characters using the <code>min_length</code> parameter.</p><pre><code><code>model = Summarizer()
abstract_summary = model(str(abstract), min_length=20)

full_abstract = ''.join(abstract_summary)
print(full_abstract)</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ie5x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ie5x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png 424w, https://substackcdn.com/image/fetch/$s_!Ie5x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png 848w, https://substackcdn.com/image/fetch/$s_!Ie5x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png 1272w, https://substackcdn.com/image/fetch/$s_!Ie5x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ie5x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png" width="1456" height="347" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Yayy! 
The generated summary of the abstract can fit within my browser window. It&#8217;s much shorter to read now! You can also experiment with the max_length parameter to see if your summary looks different from mine!&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Yayy! The generated summary of the abstract can fit within my browser window. It&#8217;s much shorter to read now! You can also experiment with the max_length parameter to see if your summary looks different from mine!" title="Yayy! The generated summary of the abstract can fit within my browser window. It&#8217;s much shorter to read now! You can also experiment with the max_length parameter to see if your summary looks different from mine!" srcset="https://substackcdn.com/image/fetch/$s_!Ie5x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png 424w, https://substackcdn.com/image/fetch/$s_!Ie5x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png 848w, https://substackcdn.com/image/fetch/$s_!Ie5x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png 1272w, https://substackcdn.com/image/fetch/$s_!Ie5x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f3ee38-bd2c-4c5b-a724-3d4e4086a090_2014x480.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Yayy! 
The generated summary of the abstract can fit within my browser window. It&#8217;s much shorter to read now! You can also experiment with the <code>max_length</code> parameter to see if your summary looks different from mine!</figcaption></figure></div><h2>The End</h2><p>By now, hopefully you feel more motivated to start your own analysis, without letting the messiness of this dataset deter you! You probably noticed that I didn&#8217;t clean my data at all prior to performing these two quick-and-dirty NLP methods (generating a word cloud visualization and an abstract summary). Yes, these two results don&#8217;t yield much actionable insight. But hopefully this has helped you jumpstart some of your own analysis! You could try clustering the published coronavirus papers by category, so that other researchers can direct their effort to less well-studied areas!</p><p>If you are interested in non-NLP analysis on this particular dataset, check out the <a href="https://www.youtube.com/watch?t=6s&amp;v=A0uBdY4Crlg">YouTube video</a> that I shared. There are coding notebooks linked in the YouTube description too.</p><p>It&#8217;s hard to stay hopeful these days, but I sincerely hope that this disorienting and disheartening COVID-19 situation will soon pass! I hope you and your loved ones are able to draw near to each other and stay healthy!</p><p>Code Repository</p><ol><li><p>Code presented in this post is available on my <a href="https://github.com/chengyin38/databricks/tree/master/CORD-19%20Literature%20NLP%20Analysis">GitHub</a>.</p></li><li><p>My friends and I presented our analysis of coronavirus-related data at a webinar; feel free to check out this <a href="https://www.youtube.com/watch?t=6s&amp;v=A0uBdY4Crlg">YouTube page</a> for the walkthroughs. 
They have also provided their notebook links in the webinar description as well.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Analyzing fatal force with Databricks SQL]]></title><description><![CDATA[https://www.databricks.com/blog/2020/11/16/fatal-force-exploring-police-shootings-with-sql-analytics.html]]></description><link>https://www.chengyineng.com/p/coming-soon</link><guid isPermaLink="false">https://www.chengyineng.com/p/coming-soon</guid><dc:creator><![CDATA[Chengyin Eng]]></dc:creator><pubDate>Wed, 25 Sep 2024 22:23:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Aq6B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Aq6B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Aq6B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png 424w, https://substackcdn.com/image/fetch/$s_!Aq6B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png 848w, https://substackcdn.com/image/fetch/$s_!Aq6B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Aq6B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Aq6B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png" width="1220" height="696" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:696,&quot;width&quot;:1220,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:467255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Aq6B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png 424w, https://substackcdn.com/image/fetch/$s_!Aq6B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png 848w, https://substackcdn.com/image/fetch/$s_!Aq6B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Aq6B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d7fe587-bae8-43a5-8752-353e38ef9562_1220x696.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://www.databricks.com/blog/2020/11/16/fatal-force-exploring-police-shootings-with-sql-analytics.html">https://www.databricks.com/blog/2020/11/16/fatal-force-exploring-police-shootings-with-sql-analytics.html</a></p>]]></content:encoded></item></channel></rss>