30 June 2017

The RNC Files: Inside the Largest US Voter Data Leak


In what is the largest known data exposure of its kind, UpGuard’s Cyber Risk Team can now confirm that a misconfigured database containing the sensitive personal details of over 198 million American voters was left exposed to the internet by a firm working on behalf of the Republican National Committee (RNC) in its efforts to elect Donald Trump. The data, which was stored in a publicly accessible cloud server owned by Republican data firm Deep Root Analytics, included 1.1 terabytes of entirely unsecured personal information compiled by DRA and at least two other Republican contractors, TargetPoint Consulting, Inc. and Data Trust. In total, the personal information of potentially nearly all of America’s 200 million registered voters was exposed, including names, dates of birth, home addresses, phone numbers, and voter registration details, as well as data described as “modeled” voter ethnicities and religions.

This disclosure dwarfs previous breaches of electoral data in Mexico (also discovered by Chris Vickery) and the Philippines, exceeding each by well over 100 million affected individuals and exposing the personal information of over sixty-one percent of the entire US population.

The data exposure provides insight into the inner workings of the Republican National Committee’s $100 million data operation for the 2016 presidential election, an undertaking of monumental scope and painstaking detail launched in the wake of Mitt Romney’s loss in 2012. Deep Root Analytics, TargetPoint, and Data Trust—all Republican data firms—were among the RNC-hired outfits working as the core of the Trump campaign’s 2016 general election data team, relied upon in the GOP effort to influence potential voters and accurately predict their behavior. The RNC data repository would ultimately acquire roughly 9.5 billion data points regarding three out of every five Americans, scoring 198 million potential US voters on their likely political preferences using advanced algorithmic modeling across forty-eight different categories. 

Spreadsheets containing this accumulated data—last updated around the January 2017 presidential inauguration—constitute a treasure trove of political data and modeled preferences used by the Trump campaign. This data was also exposed in the misconfigured database and had been for an unknown period of time. 

UpGuard’s discovery, perhaps the largest known exposure of voter information in history, is corroborated by technical evidence, as well as by the public statements of the responsible firms and political staffers.

The Discovery 

In the early evening of June 12th, UpGuard Cyber Risk Analyst Chris Vickery discovered an open cloud repository while searching for misconfigured data sources on behalf of the Cyber Risk Team, a research unit of UpGuard devoted to finding, securing, and raising public awareness of such exposures. The data repository, an Amazon Web Services S3 bucket, lacked any protection against access. As such, anyone with an internet connection could have accessed the Republican data operation used to power Donald Trump’s presidential victory, simply by navigating to a six-character Amazon subdomain: “dra-dw”. 

Upon inspection of the contents, “dra-dw” is shown to stand for “Deep Root Analytics Data Warehouse.” The concept of a “data warehouse” is common in modern business: essentially, it is a massive collection of data prepared specifically for complex analysis. Deep Root Analytics confirmed they owned and operated the dra-dw bucket, which was subsequently secured against public access the night of June 14th, shortly after Vickery notified federal authorities.
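The report does not say which setting left the bucket readable, so the following is only a sketch of one common cause: an S3 access control list (ACL) that grants permissions to the predefined "all users" group. It inspects a dictionary shaped like the response of an S3 GetBucketAcl call. The function name `public_grants` and both sample ACLs are hypothetical and hand-written for illustration; none of it comes from the exposed bucket.

```python
# URIs AWS uses for its predefined "everyone" and "any AWS account" grantee groups.
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"
BROAD_GROUPS = (ALL_USERS, AUTHENTICATED_USERS)


def public_grants(acl):
    """Return the permissions an ACL document grants to broad groups."""
    found = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") == "Group" and grantee.get("URI") in BROAD_GROUPS:
            found.append(grant.get("Permission"))
    return found


# Hypothetical ACL documents (assumed shapes, made-up owner ID).
open_acl = {"Grants": [
    {"Grantee": {"Type": "CanonicalUser", "ID": "owner-id"}, "Permission": "FULL_CONTROL"},
    {"Grantee": {"Type": "Group", "URI": ALL_USERS}, "Permission": "READ"},
]}
private_acl = {"Grants": [
    {"Grantee": {"Type": "CanonicalUser", "ID": "owner-id"}, "Permission": "FULL_CONTROL"},
]}

print(public_grants(open_acl))     # ['READ']
print(public_grants(private_acl))  # []
```

A bucket policy can also open a bucket to the public, so a real audit would need to check policies as well as ACLs.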

In total, 1.1 terabytes of data in the warehouse, an amount roughly equivalent to 500 hours’ worth of video, was fully downloadable. Among these files were clear indications of the repository’s political importance, with file directories named for a number of high-powered and influential Republican political organizations. As such, the exposed Deep Root Analytics warehouse contained a remarkable amount of fully accessible data.

Yet this was not all. An additional 24 terabytes of data was stored in the warehouse, but had been configured to prevent public access. Ultimately, the amount of data stored in the misconfigured database was equivalent in size to about 10 billion pages of text. 

Less clear was the significance of intriguing but inaccessible files, such as one titled “for_strategy_xroads_updated_FINAL,” which may refer in some capacity to American Crossroads, the Super PAC co-founded by former George W. Bush adviser Karl Rove that was very active in 2016 electoral financing. Also found was a large cache of Reddit posts, saved as text.

It would ultimately take days, from June 12th to June 14th, for Vickery to download 1.1 TB of publicly accessible files, which included two critical directories titled “data_trust” and “target_point.” 

The Operation 


Deep Root Analytics, the Republican data firm which created and maintained the exposed data warehouse, was co-founded in 2013 by Alex Lundry, a Republican campaign data scientist who had served as data director in Mitt Romney’s unsuccessful 2012 presidential campaign. The company bills itself as “the most experienced group of targeters in Republican politics,” offering media analytics services to corporations, lobbying groups, and GOP political campaigns seeking to reach specific target demographics. Deep Root claims to be able to more effectively reach these desired demographics by “microtargeting” using big data analytics, allowing clients to make better-informed decisions when purchasing advertising. 

It was a pedigree that would earn Lundry a position as “Chief Analytics Officer” with the 2016 Republican presidential campaign of former Florida Governor Jeb Bush. While Bush would fail to win the nomination even after assembling a well-credentialed data team, Trump would have the inverse problem, winning the nomination without having created a robust data operation within his campaign. Following the formal conclusion of GOP primary season in July 2016 with Trump’s nomination, the RNC would move quickly in coordinating their data team’s efforts with those of the Trump campaign in the upcoming general election fight against Hillary Clinton.

In order to win the election, the RNC would need to draw heavily upon the resources of several private firms specializing in data analytics. Among these private consultancies was Data Trust, a Washington-based firm that claims to “continually develop a Republican and conservative data ecosystem through voter file collection, development, and enhancement.” 

Data Trust, “the GOP’s exclusive data provider,” was created by the RNC in 2011, per National Review, “to shoulder the cost of building and managing the GOP’s voter file”—its repository of detailed voter information crucial to any successful electoral advertising and get-out-the-vote efforts. As reported by Slate, Data Trust operates as a private-sector satellite of the RNC—“a hybrid, a private company that party bosses built but can’t formally run.” 

Within the Deep Root Analytics database, the folder “data_trust” appears to contain nothing less than the full fruits of this RNC/Data Trust effort to house as comprehensive and detailed a repository of potential 2016 voter information as possible.

Within “data_trust” are two massive stores of personal information collectively representing up to 198 million potential voters: a 256 GB folder for the 2008 presidential election and a 233 GB folder for 2012, each containing fifty-one files, one for every state as well as the District of Columbia. Each file, formatted as comma-separated values (.csv), lists an internal, 32-character alphanumeric “RNC ID” (for example, 530C2598-6EF4-4A56-9A7X-2FCA466FX2E2) used to uniquely identify every potential voter in the database. These RNC IDs link disparate data sets together, combining dozens of sensitive and personally identifying data points and making it possible to piece together a striking amount of detail on individual Americans specified by name.

Both Vickery and this reporter looked themselves up in these spreadsheets, confirming that the files contained accurate and sensitive personal information. Listed here are the .csv categories: 

"RNCID", "RNC_RegID", "State", "SOURCEID", "Juriscode", "Jurisname", "CountyFIPS", "MCD", "CNTY", "Town", "Ward", "Precinct", "Ballotbox", "PrecinctName", "CD_Current", "CD_NextElection", "SD_Current", "SDProper_Current", "SD_NextElection", "SDProper_NextElection", "LD_Current", "LDS_Current", "LDProper_Current", "LD_NextElection", "LDS_NextElection", "LDProper_NextElection", "NamePrefix", "FirstName", "MiddleName", "LastName", "NameSuffix", "Sex", "BirthYear", "BirthMonth", "BirthDay", "OfficialParty", "StateCalcParty", "RNCCalcParty", "StateVoterID", "JurisdictionVoterID", "AffidavitID", "LegacyID", "LastActiveDate", "RegistrationDate", "VoterStatus", "PermAbs", "SelfReportedDemographic", "ModeledEthnicity", "ModeledReligion", "ModeledEthnicGroup", "HHSEQ", "HTSEQ", "RegistrationAddr1", "RegistrationAddr2", "RegHouseNum", "RegHouseSfx", "RegStPrefix", "RegStName", "RegStType", "RegstPost", "RegUnitType", "RegUnitNumber", "RegCity", "RegSta", "RegZip5", "RegZip4", "RegLatitude", "RegLongitude", "RegGeocodeLevel", "RADR_LastCleanse", "RADR_LastGeoCode", "RADR_LastCOA", "ChangeOfAddress", "COADate", "COAType", "MailingAddr1", "MailingAddr2", "MailHouseNum", "MailHouseSfx", "MailStPrefix", "MailStName", "MailStType", "MailStPost", "MailUnitType", "MailUnitNumber", "MailCity", "MailSta", "MailZip5", "MailZip4", "MailSortCodeRoute", "MailDeliveryPt", "MailDeliveryPtChkDigit", "MailLineOfTravel", "MailLineOfTravelOrder", "MailDPVStatus", "MADR_LastCleanse", "MADR_LastCOA", "AreaCode", "TelephoneNUm", "TelSourceCode", "TelMatchLevel", "TelReliability", "FTC_DoNotCall", "PhoneAppendDate", "VH12G", "VH12P", "VH12PP", "VH11G", "VH11P", "VH10G", "VH10P", "VH09G", "VH09P", "VH08G", "VH08P", "VH08PP", "VH07G", "VH07P", "VH06G", "VH06P", "VH05G", "VH05P", "VH04G", "VH04P", "VH04PP", "VH03G", "VH03P", "VH02G", "VH02P", "MT10_Party", "MT10_GenericBallot", "MT10_Turnout", "MT10_ObamaDisapproval", "MT10_Jobs", "MT10_Healthcare", "MT10_SoCo", "PG01", "PG02", "PG03", "PG04", "PG05", 
"PG06", "PG07", "PG08", "PG09", "PG10", "PG11", "PG12", "PG13", "PG14", "PG15", "PG16", "PG17", "PG18", "PG19", "PG20", "PG21", "PG22", "PG23", "PG24", "PG25", "PG26", "PG27", "PG28", "PG29", "PG30", "PG31", "PG32", "PG33", "PG34", "PG35", "PG36", "PG37", "PG38", "PG39" 

Starting with the potential voter’s first and last names, which limits even the barest possibility of the data sets masking the identities of those described, the files go on to list a great deal more data, including the voter’s date of birth, home and mailing addresses, phone number, registered party, self-reported racial demographic, voter registration status, and even whether they are on the federal “Do Not Call” list. Also included as data fields are the “modeled ethnicity” and “modeled religion” of the potential voter, particularly sensitive personal details that have historically been a source of controversy for data collection.

While not every field is populated for each individual, if the answer is known, it appears to have been included. A smaller folder for the 2016 election was also included in the database, but unlike the 2008 and 2012 folders, it included .csv files only for Ohio and Florida, arguably the two most crucial battleground states. The entire “data_trust” folder, it bears repeating, was fully downloadable by anyone accessing the URL of the database.


This exposure of the personal information of millions of Americans was not, perhaps, the most damaging pool of data exposed. To understand its significance, additional context is necessary. 


The RNC’s multiyear effort in building a world-class data operation would come to employ Deep Root Analytics in a partnership with other data firms to do for the RNC what Obama’s data team had done for the Democrats, as reported by Ad Age in a detailed post-election profile of the RNC data operation:


“In this case, the people doing most of the data modeling and voter scoring – especially for field operations, voter contact and television advertising – were from a collective of three data firms hired by the RNC: TargetPoint Consulting, Causeway Solutions, and Deep Root Analytics, which officially worked with the RNC through a new subsidiary called Needle Drop.”


RNC payments to two of the firms mentioned in the database totaled over $5 million, as Ad Age also reported:


“Between January 2015 and November 2016, the RNC paid TargetPoint $4.2 million for data services, and gave Causeway around $500,000 in that time, according to Federal Election Commission reports. Deep Root, acting as Needle Drop, was paid $983,000 by the RNC.”


Needle Drop principal TargetPoint Consulting—where Deep Root Analytics founder Alex Lundry was employed as “Chief Data Scientist” from 2005 to 2015—is referenced in the database with a folder titled “target_point.” TargetPoint, a GOP-aligned, Alexandria, Virginia-based “full service market research and knowledge management firm,” specializes in microtargeting key demographics on behalf of corporate and political clients—a tactic they claim to have pioneered “after President George W. Bush deployed our services for his successful 2004 campaign.” 


TargetPoint is a trusted and well-established authority on data operations within conservative political circles, having worked in the past on Rudy Giuliani’s 2008 presidential bid, the 2008 McCain/Palin campaign, and the National Republican Senatorial Committee’s reelection efforts. TargetPoint founder Alexander Gage, a former polling and market researcher, explained to the Washington Post in 2007 his philosophy of data analytics while serving as presidential candidate Mitt Romney’s Director of Strategy: 


“‘Microtargeting is trying to unravel your political DNA,’ [Gage] said. ‘The more information I have about you, the better.’ The more information [Gage] has, the better he can group people into "target clusters" with names such as ‘Flag and Family Republicans’ or ‘Tax and Terrorism Moderates.’ Once a person is defined, finding the right message from the campaign becomes fairly simple.” 


While it may be better for data firms like TargetPoint to stockpile your most sensitive personal information, for the 198 million Americans whose sensitive identifying details and potential political inclinations were compiled on a public-facing cloud server lacking any security barriers, the view may be different. 


The contents of the “target_point” folder were even more intrusive than those of the Data Trust repository, if less obviously intimidating at first glance: fourteen files saved in the Alteryx Database format (.yxdb), a file format designed specifically for large-scale data analysis. Most of the files were last updated in mid- to late January 2017, and several were labeled “Contact File,” with differing dates signifying when each was updated.


Contained within these “Contact File” spreadsheets are the aforementioned 32-character alphanumeric RNC IDs for 198 million potential American voters, as well as the corresponding names and addresses of the voters. The clear linkage between every RNC ID and the name and identifying personal details of all 198 million people ensures all data using the RNC ID as an identifier can be tied back to the person’s real name. 


The remaining files provide a rare glimpse into a systematic large-scale analytics operation being performed using a massive repository of 198 million potential voters, combining personal details, backgrounds, and political behavior to, paraphrasing Gage, “unravel their political DNA”. The result is a database of grand scope and scale, collecting the modeled personal and political preferences of most of the country—adding up to an unsecured political treasure trove of data which was free to download online. 


The file dates and names indicate that the other files largely concern post-election data analytics conducted in the run-up to and around Trump’s inauguration on January 20th, 2017. Some of the files align with public statements by RNC and TargetPoint officials about the kind of targeted analysis performed over the course of the campaign. A file titled “DRA Post Elect 2016 Reluctant DJT scores 1-6-17.yxdb,” for example, contains 69 million rows and is illustrative of the kind of post-election analysis in the repository executed by the GOP data team. That this analysis is a product of the RNC data team is corroborated by public disclosures in the press of similar microtargeting, such as TargetPoint’s analysis of “‘DJT Underperform’ voters, or Republicans still unconvinced about supporting Mr. Trump.”


In the 50 GB file titled “DRA Post Elect 2016 All Scores 1-12-17.yxdb,” each potential voter is scored with a decimal fraction between zero and one across forty-six columns. Each of the fields under each of the forty-six columns signifies the potential voter’s modeled likelihood of supporting the policy, political candidate, or belief listed at the top of the column, with zero indicating very unlikely, and one indicating very likely. 


RNC_RegID, State, 2012ObamaVoter_DRA_12_16, 2012RomneyVoter_DRA_12_16, 2016ClintonVoter_DRA_12_16, 2016TrumpVoter_DRA_12_16, AmericaFirstForeignPolicy_agree_DRA_12_16, AmericaFirstForeignPolicy_disagree_DRA_12_16, AutoCompaniesShipJobsOverseas_agree_DRA_12_16, AutoCompaniesShipJobsOverseas_disagree_DRA_12_16, CorpReputs_AmericanMakers_DRA_12_16, CorpReputs_DailyLives_DRA_12_16, CorpReputs_Egalitarians_DRA_12_16, CorpReputs_EnviroConscious_DRA_12_16, CorpReputs_OpportunitySeekers_DRA_12_16, CorpReputs_STEMSupporters_DRA_12_16, CorpReputs_SupplyChainers_DRA_12_16, CorpReputs_Unifers_DRA_12_16, DemLeadersStandUpToTrump_DRA_12_16, DemLeadersWorkWithTrump_DRA_12_16, DParty_DRA_12_16, FinancialServicesHarmful_agree_DRA_12_16, FinancialServicesHarmful_disagree_DRA_12_16, FinServicesCompany_Dreamers_DRA_12_16, FinServicesCompany_RiskMitigators_DRA_12_16, FossilFuelsImportantForUSEnergySecurity_DRA_12_16, FossilFuelsNeedToMoveAwayFrom_DRA_12_16, InvestInfrastructure_agree_DRA_12_16, InvestInfrastructure_disagree_DRA_12_16, LowerTaxes_agree_DRA_12_16, LowerTaxes_disagree_DRA_12_16, NonReluctantDJTVoter_DRA_12_16, NonReluctantHRCVoter_DRA_12_16, PharmaCompsDoGreatDamage_agree_DRA_12_16, PharmaCompsDoGreatDamage_disagree_DRA_12_16, ReformGovtRegulations_agree_DRA_12_16, ReformGovtRegulations_disagree_DRA_12_16, ReluctantDJT_Above.5_DRA_12_16, ReluctantHRCVoter_DRA_12_16, RepealObamacare_agree_DRA_12_16, RepealObamacare_disagree_DRA_12_16, RParty_DRA_12_16, StopIllegalImmigration_agree_DRA_12_16, StopIllegalImmigration_disagree_DRA_12_16, TrumpStandUpToDems_DRA_12_16, TrumpWorkWithDems_DRA_12_16, USAFinancialSituation_Optimistic_DRA_12_16, USAFinancialSituation_Pessimistic_DRA_12_16


Calculated for 198 million potential voters, this adds up to a spreadsheet of 9.5 billion modeled probabilities, for questions ranging from how likely it is the individual voted for Obama in 2012, to whether they agree with the Trump foreign policy of “America First,” to how likely they are to be concerned with auto manufacturing as an issue, among others. 
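The figure can be checked with a line of arithmetic. The sketch below assumes the 9.5 billion count covers all forty-eight columns in the header listed above (the two identifier columns plus the forty-six score columns); that reading is an inference from the numbers, not something stated by the firms involved. The variable names are illustrative only.

```python
voters = 198_000_000   # potential voters, per the report
score_columns = 46     # modeled-score columns in the "All Scores" file
id_columns = 2         # RNC_RegID and State

# Assumption: every column, identifiers included, is counted as a data point.
all_cells = voters * (score_columns + id_columns)
# Counting only the modeled scores gives a somewhat smaller total.
score_cells = voters * score_columns

print(f"{all_cells:,}")    # 9,504,000,000
print(f"{score_cells:,}")  # 9,108,000,000
```

Either way the total is on the order of nine to ten billion values, consistent with the “roughly 9.5 billion data points” cited earlier.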



The spreadsheet is an impressive deployment of analytical might. However, while each potential voter is signified by their 32-character RNC internal ID, it is a one-step process to determine the real name associated with the modeled policy preferences, as the aforementioned “Contact File” also exposed in the database links the RNC ID to the potential voter’s actual identity. 
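That one-step linkage amounts to a key lookup. The sketch below uses synthetic rows only. The column names RNC_RegID, State, and 2016TrumpVoter_DRA_12_16 come from the header listed above; the contact-file columns are an assumption (borrowed from the Data Trust field names), since the report does not list them, and the IDs, names, and scores are made up.

```python
import csv
import io

# Synthetic stand-ins for the two files; every value here is invented.
contact_csv = """RNC_RegID,FirstName,LastName
AAAA0001,Jane,Doe
AAAA0002,John,Roe
"""
scores_csv = """RNC_RegID,State,2016TrumpVoter_DRA_12_16
AAAA0001,OH,0.82
AAAA0002,FL,0.17
"""

# Index contact rows by ID so each score row resolves to a name in one lookup.
names_by_id = {
    row["RNC_RegID"]: f'{row["FirstName"]} {row["LastName"]}'
    for row in csv.DictReader(io.StringIO(contact_csv))
}

# Attach a name to every score row whose ID appears in the contact index.
linked = [
    (names_by_id[row["RNC_RegID"]], row["State"], float(row["2016TrumpVoter_DRA_12_16"]))
    for row in csv.DictReader(io.StringIO(scores_csv))
    if row["RNC_RegID"] in names_by_id
]

print(linked)  # [('Jane Doe', 'OH', 0.82), ('John Roe', 'FL', 0.17)]
```

The point of the sketch is that an internal identifier offers no anonymity once any file mapping it to a name sits alongside the scored data.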


This reporter was able, after determining his RNC ID, to view his modeled policy preferences and political actions as calculated by TargetPoint. It is a testament both to their talents, and to the real danger of this exposure, that the results were astoundingly accurate.

The Significance


This exposure raises significant questions about the privacy and security Americans can expect for their most privileged information. It also comes at a time when the integrity of the US electoral process has been tested by a series of cyber assaults against state voter databases, sparking concern that cyber risk could increasingly pose a threat to our most important democratic and governmental institutions. 


That such an enormous national database could be created and hosted online, missing even the simplest of protections against the data being publicly accessible, is troubling. The ability to collect such information and store it insecurely further calls into question the responsibilities owed by private corporations and political campaigns to those citizens targeted by increasingly high-powered data analytics operations. 

What is beyond debate in 2017 is the increasing inability to trust in the integrity of information technology systems, particularly at scale. As reliance on technology increases, so too grows the cyber risk surface; as more and more functions of life migrate onto digital platforms, more and more functions of life invite cyber risk. Beyond the almost limitless criminal applications of the exposed data for purposes of identity theft, fraud, and resale on the black market, the heft of the data and analytical power of the modeling could be applied to even more ambitious efforts: corporate marketing, spam, advanced political targeting. Any of these potential misuses of private information can be prevented, provided stakeholders obey a few simple precepts in collecting and storing data.

The fundamental problems which exposed this data are not rare, uncommon, or consigned to one side of the partisan divide; indeed, while those responsible for this exposure are of one party, the 198 million Americans affected span the entire political spectrum, their information revealed regardless of their political beliefs. The same factors that have resulted in thousands of previous data breaches—forgotten databases, third-party vendor risks, inappropriate permissions—combined with the RNC campaign operation to create a nearly unprecedented data breach. 

Despite the breadth of this breach, it will doubtless be topped in the future, likely to far more damaging effect, if the ethos of cyber resilience across all platforms does not become the common language of all internet-facing systems.
