- There were some great analyses focused on the non-textual components of the submissions, for example, their timing, the email addresses used, and other metadata. Shout out to the work of Jeffrey Fossett, who did a first-pass analysis of the partially submitted comments in May that also inspired this post and some of the methods used in the analysis, to Chris Sinchok, GravWell, and many other posts I studied in preparing this analysis.
- Let me know here if you have any questions or would like access to the dataset I scraped from the FCC’s ECFS submission system. If enough people request it, I may host the dataset on Google BigQuery so that you can run SQL queries on the ~64 GB dataset on your own.
¹ I.e., not from a spambot or part of an identified campaign.
² Full disclosure: I was a summer law clerk for Commissioner Clyburn in 2010, and though I greatly admire her current work championing net neutrality, the opinions and POV in this post are my own.
³ Not clustered as part of a comment submission campaign, not a duplicate comment.
⁴ Data collected from the beginning of submissions (April 2017) until Oct 27th, 2017. The long-running comment scraping script suffered from a few disconnections and I estimate that I lost ~50,000 comments because of it. Even though the Net Neutrality Public Comment Period ended on August 30, 2017, the FCC ECFS system continued to accept comments afterwards, which were included in the analysis.
⁵ I used an md5 hash function, which had a low enough collision rate and allowed me to (relatively) quickly find and count up duplicates. I tossed out submissions without any comment text but otherwise did not do any other preprocessing on the text before encoding and clustering, in order to retain artifacts in the text that would give clues as to the method of submission.
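The duplicate count described in footnote 5 can be sketched roughly as follows. This is a minimal illustration, not the script used for the analysis: the function name `count_duplicates` and the sample comments are hypothetical, and the only details taken from the footnote are the md5 hash, the dropping of empty comments, and the absence of any other preprocessing.

```python
import hashlib
from collections import Counter

def count_duplicates(comments):
    """Count exact-duplicate comments by hashing the raw, unprocessed text."""
    counts = Counter()
    for text in comments:
        if not text:  # drop submissions with no comment text
            continue
        # No lowercasing or whitespace stripping: text artifacts are kept on purpose.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        counts[digest] += 1
    return counts

# Hypothetical sample: two exact duplicates, one empty, one differing only by case.
sample = ["Keep net neutrality.", "Keep net neutrality.", "", "keep net neutrality."]
counts = count_duplicates(sample)
duplicated = {h: n for h, n in counts.items() if n > 1}
```

Because the text is hashed as-is, the lowercase variant counts as a separate comment, which is why a second, meaning-based grouping step is still needed afterwards.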
⁶ A large proportion of these ~3 million “unique” comments were actually duplicates, differing only by a few characters or words or having a different signature. In order to conclusively and exhaustively categorize these comments, I chose to group comments by meaning. Comments were turned into document vectors made from the average of all word vectors in the comment. The word vectors were obtained from spaCy, which uses the word vectors from the paper by Levy and Goldberg (2014). [Correction from Matthew Honnibal: spaCy now uses the GloVe vectors by Pennington et al.]
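To make the "average of all word vectors" idea in footnote 6 concrete, here is a small sketch in plain Python. The three-dimensional `toy_vectors` table and the whitespace tokenization are assumptions made for illustration only; real pretrained vectors have hundreds of dimensions and a proper tokenizer would be used in practice.

```python
def document_vector(text, word_vectors):
    """Average the vectors of all known words in a comment."""
    vectors = [word_vectors[w] for w in text.split() if w in word_vectors]
    if not vectors:
        return None  # no known words, so no vector can be built
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical toy vectors, standing in for pretrained word embeddings.
toy_vectors = {
    "repeal": [1.0, 0.0, 0.0],
    "title": [0.0, 1.0, 0.0],
    "ii": [0.0, 1.0, 1.0],
}
doc = document_vector("repeal title ii", toy_vectors)
```

Two comments that swap a few synonyms end up with nearby document vectors, which is what lets near-duplicates be grouped even when their hashes differ.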
⁷ I made two passes at clustering the document vectors. First, I used DBSCAN with a euclidean distance metric at a very low epsilon to identify obvious clusters and cull them out manually using a string signature. This left ~2 million unique comments. From that 2 million, I used HDBSCAN on a 100,000-comment sample with cosine distance to identify ‘looser’ clusters, and then used approximate_predict() to classify the remaining comments as either within those identified clusters or as outliers. Removing duplicates, this resulted in fewer than 800,000 unique outlier “organic” comments. [Correction: As HDBSCAN author Leland McInnes notes below, cosine distances don’t yet play well with HDBSCAN. To be precise, I used the euclidean distance metric between l2-normalized document vectors, which generally works well as a substitute.]
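The correction in footnote 7 rests on a simple identity: for vectors scaled to unit length, the squared euclidean distance equals twice the cosine distance, so the two metrics rank neighbors the same way. The sketch below checks that identity on two made-up vectors; the helper names and the example vectors are assumptions, and this is not the clustering code itself.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical document vectors.
a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
d_euc = euclidean(l2_normalize(a), l2_normalize(b))
d_cos = cosine_distance(a, b)
# For unit vectors: d_euc ** 2 == 2 * d_cos (up to floating-point error).
```

In the hdbscan library, `approximate_predict` expects a clusterer that was fit with `prediction_data=True`; whether the original analysis set any other options is not stated in the footnote.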
⁸ Sized from the dozens to the tens of millions.
⁹ Regular expression in this pastebin.
¹⁰ This is because the number of possible comment configurations grows exponentially with every set of synonyms introduced. Also, to be precise, there were some mad-lib comments that were duplicated once, but not more than that.
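The growth described in footnote 10 is just a product of slot sizes: each additional set of synonyms multiplies the number of distinct comments a template can produce. A minimal sketch, using entirely hypothetical synonym slots rather than the template from the analysis:

```python
import itertools
import math

# Hypothetical mad-lib template: one list of interchangeable phrases per slot.
slots = [
    ["I urge", "I ask", "I encourage"],
    ["the FCC", "the Commission"],
    ["to repeal", "to undo", "to reverse"],
    ["Title II.", "the 2015 rules."],
]

# Number of distinct comments = product of the slot sizes.
total = math.prod(len(options) for options in slots)

# Enumerating them confirms every combination yields a different string.
variants = {" ".join(choice) for choice in itertools.product(*slots)}
```

With only four small slots there are already 36 variants; a template with a couple of dozen slots has far more combinations than comments submitted, so exact repeats become rare.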
¹¹ Page 3 of the Verizon Comments (submitted August 30, 2017)
¹² FCC Chairman Pai’s Statement re the Draft Order (released November 21, 2017)
¹³ While there are certainly other possible explanations for this set of results, I think Occam’s Razor should apply. More investigation into the timing and emails used for this particular campaign would provide further corroborating evidence.
¹⁴ Plotted on a log scale so that you can still see the color of the smaller bars.
¹⁵ As the author of the Gravwell study states: “[The evidence] forces us to conclude that either the very act of going to the FCC comment site and providing a comment is only attractive to those of a certain political leaning, or that the bulk submission data is full of lies.”
¹⁶ Pro-repeal comments are on lines 176, 228, 930 in the pastebin. There also appeared to be three net neutrality supporters who seemed confused about the terminology (lines 332, 366, 901) and one script kiddie (line 261). It’s possible I may have missed one or two, and I’m happy to correct any mistakes in this comment set if you find them.
¹⁷ My more statistically-inclined colleague informs me that the central limit theorem breaks down at the extreme limits (where the population proportion is near 0% or 100% of a population), which I have taken his word/expertise for, for now, and will review later. [Edit: I have found a good addition to this in a reddit comment. The interval is 99.12% to 99.90%, 19 times out of 20.]
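One common way to get a sensible interval when a proportion sits near 0% or 100% is the Wilson score interval, which does not rely on the plain normal approximation. The sketch below is an illustration under stated assumptions: the counts (997 of 1,000) are chosen for the example and are not given in this footnote, and whether the reddit comment used this exact method is also an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Assumed, illustrative counts: 997 matching comments in a sample of 1,000.
lo, hi = wilson_interval(997, 1000)
print(f"{lo:.2%} to {hi:.2%}")
```

With these assumed counts the result lands close to the range quoted in the edit above. Unlike the naive interval, the upper bound stays below 100%.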
¹⁸ Line 102 in the pastebin.
¹⁹ [A final late addition: Lest I am unintentionally giving the wrong impression to folks who haven’t been following the net neutrality debate as closely, I want to be clear that there were suspicious campaigns from all sides of the debate from the text-only analysis; however, none were as numerous and as intentionally disguised as the 1.3M ‘unique’ comments identified in the post.]