Ethics in software & computing

Between 2009 and 2013, I took a break from the software industry to go be a teacher.  I loved it, and in that time went from being hired to teach courses in ColdFusion and SQL to eventually running multiple Bachelor’s degree programs covering Web and Mobile development.  Since returning to the industry, I’ve continued to help guide future revisions to those and related programs as they constantly readjust to an ever-evolving ecosystem.

One of the things I harp on every time I give feedback is the absolute criticality of ethics courses in software and computing programs.  The general argument against them is that ethics courses don’t contribute to the graduate’s technical skills, so they’re seen as less relevant than, say, Technical Writing.

I’d assert that all the crap we see today, from Theranos to Cambridge Analytica and beyond, is a direct consequence of the lack of ethics education in software-related programs.

But those are the easy examples.  How about a harder one?

I’m not going to link this, because I don’t want to give the algorithms any traffic, but watch this 20-minute video where a group of “data scientists” present their “evidence” of fraud in the Georgia 2020 elections:

youtube dot com/watch?v=IKiyAy9vjrk

This is a more subtle, but just as direct, consequence of the lack of ethics in software education.  You get people like this who intentionally use their position of technical knowledge to mislead others.  And yet, they present themselves as “unbiased” and “just concerned data scientists”.  It’s people like this who erode public trust in science.

Let’s break down what a lack of ethics looks like.

The claims made in this video boil down to:

  • There were irregularities in the unaggregated data. (The “90%” thing.)
  • There were irregularities in the downstream aggregated time series data. (The “negative” thing.)
  • There are vulnerabilities in the chain of custody which ensures the integrity of the data. (The “SD card” thing.)

I’m going to take the last one first because it’s easiest.

The SD Card Thing

I agree with their assertion: it’s absolutely vital that we have controls to inspect and verify the data and metadata at every step of that chain of custody. (For clarity: the “data” here is the raw vote information, while “metadata” is the information about that data, like when it was collected, what its source was, etc.) Having said that, the presence of weak links in that chain is not, in and of itself, evidence of fraud.

Personally, I’d love to see redundant systems in place here. For example, the Dominion machine spitting out not just an SD card but maybe also printed authenticity and integrity keys which could be handled separately from the SD cards. Maybe those controls are actually in place, and this presentation was just trying to keep things simple for the non-technical audience.  Maybe.  The presenters were clearly on a limited time window, so it’s possible.
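To make the "printed integrity key" idea concrete, here's a minimal sketch of how it could work.  Everything here is hypothetical (the key name, the CSV format, the tag length are all invented for illustration), but the underlying primitive, an HMAC over the raw results file, is a standard technique: the machine prints a short tag that travels separately from the SD card, and anyone holding the key can later recompute it to prove the card's contents weren't altered in transit.

```python
import hashlib
import hmac

# Hypothetical sketch only: a tabulator could compute an HMAC over its raw
# results file using a key held by election officials. The printed tag
# travels separately from the SD card; re-verification proves integrity.
SECRET_KEY = b"held-by-election-officials"  # invented for illustration

def integrity_tag(results_bytes: bytes) -> str:
    """Compute a printable tag to accompany (but travel apart from) the data."""
    return hmac.new(SECRET_KEY, results_bytes, hashlib.sha256).hexdigest()

def verify(results_bytes: bytes, printed_tag: str) -> bool:
    """At the receiving end, recompute the tag and compare in constant time."""
    return hmac.compare_digest(integrity_tag(results_bytes), printed_tag)

# Invented sample data standing in for an SD card's contents.
ballots = b"precinct-42,candidate-a,812\nprecinct-42,candidate-b,391\n"
tag = integrity_tag(ballots)
assert verify(ballots, tag)               # untampered data passes
assert not verify(ballots + b"x", tag)    # any modification fails
```

The point of separating the tag from the card is exactly the redundancy argued for above: compromising the data now requires compromising two independent custody chains.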

But to be clear: this part was FUD. Fear, uncertainty, and doubt. They made no actual accusations about improper handling of the data. They brought it up only to sow mistrust. They tried to cloak it in “we’re reasonable people here, we’re not talking about dead people voting or anything silly”, but then they spent minutes of their time fearmongering without actually accusing anything. (This particular section, by the way, is precisely why cases keep getting thrown out for lack of evidence.)

The 90% Thing

The assertion here, that it’s statistically impossible that 90% of the vote in a particular precinct went to one candidate, is just incorrect. This is also where the data scientist and educator in me starts to get angry, because they are bordering on unethical at this point. There’s a chance they just haven’t thought things through, but that then implies that they are, at the very least, just very bad at their jobs.

The chart they show at 9:10 points to them being incorrect. You see that nice, gentle curve? The one that looks like the zoomed-in hump in a normal distribution? Yeah. That’s literally exactly what we’d expect to see.

Is it really surprising to see overwhelming blue turnout after the absolutely massive efforts to get blue-leaning people to vote? Especially when the red side’s message was the exact opposite (“don’t bother voting, it’s all a fraud anyway”)?

Is it possible the counts were changed and it just happened to end up producing a normal curve?  Sure.  Is it possible it was done on purpose?  Sure.  But is it impossible that real world votes produced those numbers, as these people claimed?  Nope.  So now we have these “data scientists” declaring “facts” based on their technical knowledge which are not true.
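A five-line simulation shows why 90%-plus precincts are unremarkable.  The numbers below are invented for illustration, not real election data: if precinct-level support for one candidate is roughly normally distributed with a wide spread (and geographic sorting makes real spreads wide), the tail of that distribution alone produces plenty of landslide precincts.

```python
import random

random.seed(0)

# Illustrative simulation, not real election data: precinct-level support
# for one candidate drawn from a normal distribution centered at 55% with
# a wide spread, clipped to [0, 1]. Urban/rural geographic sorting makes
# real-world spreads at least this wide.
precinct_support = [min(max(random.gauss(0.55, 0.18), 0.0), 1.0)
                    for _ in range(2500)]

landslides = sum(1 for p in precinct_support if p >= 0.90)
print(f"{landslides} of 2500 simulated precincts broke 90% for one side")
```

No fraud module in that code; the 90% precincts fall straight out of the bell curve they themselves put on screen.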

If they wanted to present the exact same information as opinion, I’d have no problem with it. I have a problem with anyone calling themselves a data scientist who presents their opinions as facts.

The “votes going negative” Thing

The blatantly unethical part here is that these “data scientists” have switched what they’re talking about, and are intentionally conflating two different things. They are attempting to use their knowledge and authority as data scientists to mislead. This is, not to put too fine a point on it, fucked up.

Before this, they were talking about the adjudication process, where a ballot needs to be reviewed by humans to help the computer understand the intent of the voter. They then jump immediately into “strange activity in incremental votes”. They want to imply these two things have something to do with each other. They want the audience to think “oh, if humans can override what the reader sees, then they can change the votes wholesale”. But the part they’ve left out is that they are no longer talking about the raw ballot data — they are talking about shifts in the NYT dataset, which is far downstream from the raw ballot data.

What they’re not telling you, but should be, is that any data pipeline like this of any significant size is going to have the ability to detect, investigate, and correct problems with the data stream. Sometimes those controls are automated — there are plenty of machine learning systems which understand nothing about the underlying data, but use automated statistics to learn the “shape” of the stream and then raise alarms if something anomalous shows up. But sometimes those controls are done by humans. For example, maybe once an hour a human pulls a sample from the stream up in Excel and looks at it just to see if everything seems to be running smoothly. Generally, we prefer to use the former, the automated systems, but they are expensive and take time to set up, so it’s often much cheaper and easier to have humans eyeball it.
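A toy version of that automated control looks like the sketch below.  This is my own illustrative example, not anything from an actual election pipeline: it learns the "shape" of a stream from a rolling window of recent increments and flags anything wildly outside it, exactly the kind of alarm that gets a human to go re-check the upstream source.

```python
from collections import deque
from statistics import mean, stdev

# Toy sketch of the automated control described above: learn the "shape"
# of a stream from a rolling window and flag increments far outside it.
def find_anomalies(stream, window=20, threshold=4.0):
    """Yield (index, value) for increments far outside the recent window."""
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        if len(recent) == recent.maxlen:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield (i, value)
        recent.append(value)

# Simulated per-update vote increments with one bad entry (say, a value
# typed into the wrong spreadsheet cell) that the control should catch.
increments = [120, 135, 118, 141, 129, 133, 125, 138, 122, 131,
              127, 140, 119, 136, 124, 130, 128, 137, 121, 134,
              132, 126, 98000, 129, 135]
flagged = list(find_anomalies(increments))  # catches the 98000 outlier
```

When an alarm like this fires and a human corrects the upstream number, the downstream aggregate (like the NYT dataset) shows a “negative” shift.  That’s the control working, not fraud.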

What we saw repeatedly in November were stories of the human parts of that chain making errors, downstream controls detecting those errors and raising alarms, and then humans going back and correcting them in the upstream source. In some cases that favored blue, but in other cases it favored red. IIRC, there was an election (in NV, maybe?) which was originally called for blue, but then the blue election officials discovered and corrected a problem, ultimately leading to the race going to red. At least one other error came down to a human putting a value in the wrong cell of a spreadsheet.  Yes, sometimes pipelines this large and complex are still subject to human error.

But that doesn’t fit their narrative, so they conveniently leave it out.

Why is the system so fragile?

All modern large-scale systems are built on the concept of risk management — laypeople would know it as the “A times B times C” monologue from Fight Club, which was oversimplifying but fundamentally correct: we evaluate the risk of something going wrong against the cost of mitigating that risk, and we make a call whether to invest in the mitigation.  It’s brutal, but it’s how the world works, from manufacturing to health care.
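That “A times B times C” calculation is just an expected-cost comparison.  The sketch below uses numbers I invented purely for illustration, but the shape of the decision is exactly what risk management in any industry looks like:

```python
# Toy version of the "A x B x C" expected-cost calculation: compare the
# expected cost of failures against the cost of mitigating them.
# All numbers below are invented for illustration.
def should_mitigate(units_in_field, failure_rate, cost_per_failure,
                    mitigation_cost):
    """Mitigate only when expected failure cost exceeds the mitigation cost."""
    expected_cost = units_in_field * failure_rate * cost_per_failure
    return expected_cost > mitigation_cost

# 500k units, 0.1% failure rate, $2,000 per incident -> $1M expected cost.
assert should_mitigate(500_000, 0.001, 2_000, 750_000) is True
# Same risk, but the fix costs $5M -> the math says don't pay for it.
assert should_mitigate(500_000, 0.001, 2_000, 5_000_000) is False
```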

Anyone working in any kind of engineering field, computers and networks or otherwise, can tell you that the cost of those mitigations goes up exponentially with each one that is added.  At some point you say that the investment isn’t worth it anymore. Instead, you rely on an industry standard: Defense in Depth. That is, you rely on other controls, such as humans standing there watching, to catch any problems.

The data pipeline for vote counting is going to be hard to get right. It really only operates one month out of every 2 or 4 years. And when it does operate, it goes from zero to hundreds of millions of records in days or weeks. Our republic actually works against us here: we don’t have a single, unified voting data pipeline because legally we can’t. If any politician tried to advocate for one, they’d be burned in effigy for “expanding the government” and “trashing the sovereign right of the states to manage their own elections”.

So is it really surprising that the patchwork mess of a Rube Goldberg machine that is our vote counting data pipeline occasionally shows some irregularities which need to be investigated and corrected?

If you actually find a problem which can’t be corrected, call it out.  Investigate it.  But I have problems with attempting to present the lumps and eccentricities of an absurdly complex system as “evidence” of anything.  Doing so is nothing more than fearmongering, and is why good people don’t trust science and don’t trust data.


Look, long story long, I do think we should investigate when things look fishy. I do. Absolutely.

But the shit in this video is just FUD. It’s people masquerading as “impartial data scientists” and abusing their technical skills to try to do sleight of hand trickery to non-technical people. It’s fucked up.

And it’s what happens when you don’t teach smart people the difference between acceptable and unacceptable uses of their skills.  It will keep happening, as long as we continue to undervalue ethics and other “soft skills” in our STEM programs.  We are the reason our dystopian corporatized and politicized future continues to slouch toward us, unabated.
