Five Secure by Design Practices You Can Implement Today

Most developers don’t write insecure code on purpose. They write it because the defaults weren’t secure, the error handling was an afterthought, or the service account inherited permissions from a template nobody revisited. In most cases we’ve seen, the issue wasn’t that someone was careless. The guardrails just weren’t there.
This post covers five practices you can apply in your current sprint. No reorganization required, no dependency on a security team initiative. Each one targets a recurring source of vulnerabilities that we see across projects.
Practice 1: Scope Your Service Accounts Down to the Minimum
Service accounts tend to start with broad permissions and never lose them. Someone sets up a deployment pipeline, copies an IAM policy from a working example, and moves on. Six months later, that account still has write access to production databases it never touches. Default policies in cloud platforms ship broader than needed, and shared accounts across teams make it impossible to trace who did what.
- One account, one job. If a service account serves multiple pipelines, split it. When one gets compromised, the blast radius stays small.
- Replace static credentials with short-lived tokens. Hardcoded API keys in environment variables are easy to set up and easy to forget. Switch to workload identity federation or equivalent mechanisms that issue temporary credentials on demand.
- Use just-in-time elevation instead of permanent admin roles. If a CI/CD step needs elevated privileges for a specific task, grant them for that task and revoke them automatically afterward.
- Audit what’s actually being used. Most cloud providers offer access analysis tooling that shows which permissions an account exercises versus what it’s been granted. The gap is your attack surface.
The IAM policy is where the copy-paste problem shows up most clearly. Here’s the same deployment service account written two ways: the version that usually ships, and the version that should. Both policies cover the same task of updating a specific Lambda function family and reading deployment artifacts from S3. The difference is how much trust each one extends to get there.
Before: Open Version
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "lambda:*",
        "iam:*"
      ],
      "Resource": "*"
    }
  ]
}
After: Scoped Properly
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DeployPaymentServiceLambdas",
      "Effect": "Allow",
      "Action": [
        "lambda:UpdateFunctionCode",
        "lambda:UpdateFunctionConfiguration",
        "lambda:GetFunction",
        "lambda:PublishVersion"
      ],
      "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:payment-service-*"
    },
    {
      "Sid": "ReadDeploymentArtifacts",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::deploy-artifacts-prod/payment-service/*",
      "Condition": {
        "StringEquals": {
          "aws:SourceVpc": "vpc-0abc1234def567890"
        }
      }
    }
  ]
}
The scoped version replaces wildcard actions with explicit verbs – lambda:UpdateFunctionCode, s3:GetObject – so the account can do only what the pipeline requires. Resource ARNs are locked to a specific service family rather than applying across the account. A condition block adds a second constraint on the S3 statement: the call must originate from a known VPC, so a stolen credential can’t read artifacts from outside that boundary. Note that aws:SourceVpc is only present when the request arrives through a VPC endpoint, so the pipeline has to reach S3 that way. Statements are split with Sid identifiers, which lets a reviewer map each grant to a specific deployment step when auditing the policy.
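One way to keep the open version from creeping back in is a small check that flags wildcard grants before a policy is applied. The following is a minimal sketch, not an AWS tool: findWildcardGrants is a hypothetical helper that only inspects the policy document, and it only looks for a literal "*" or service-wide patterns such as "s3:*" in Allow statements.

```javascript
// Hypothetical helper: flags Allow statements that grant wildcard actions or resources.
// It reads the policy document only; it does not call any AWS API.
function findWildcardGrants(policy) {
  const toArray = (value) => (Array.isArray(value) ? value : [value]);
  const findings = [];

  for (const [index, statement] of toArray(policy.Statement).entries()) {
    if (statement.Effect !== 'Allow') continue;

    const label = statement.Sid || `statement #${index}`;
    const actions = toArray(statement.Action || []);
    const resources = toArray(statement.Resource || []);

    for (const action of actions) {
      // Matches "*" and service-wide grants such as "s3:*"
      if (action === '*' || action.endsWith(':*')) {
        findings.push(`${label}: wildcard action ${action}`);
      }
    }
    for (const resource of resources) {
      // Only a bare "*" is flagged; scoped prefixes like ".../payment-service/*" pass
      if (resource === '*') {
        findings.push(`${label}: wildcard resource`);
      }
    }
  }
  return findings;
}

const openPolicy = {
  Version: '2012-10-17',
  Statement: [{ Effect: 'Allow', Action: ['s3:*', 'lambda:*', 'iam:*'], Resource: '*' }],
};

const scopedPolicy = {
  Version: '2012-10-17',
  Statement: [{
    Sid: 'ReadDeploymentArtifacts',
    Effect: 'Allow',
    Action: ['s3:GetObject'],
    Resource: 'arn:aws:s3:::deploy-artifacts-prod/payment-service/*',
  }],
};

console.log(findWildcardGrants(openPolicy));   // three wildcard actions plus one wildcard resource
console.log(findWildcardGrants(scopedPolicy)); // []
```

A check like this is deliberately crude. It won’t judge whether a scoped ARN is scoped enough, but it does catch the copy-paste case described above.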
Practice 2: Make Your Error Handling Fail-Secure, Not Fail-Open
A service goes down, and the load balancer routes traffic to a fallback that skips authentication. An unhandled exception dumps a full stack trace into the API response, complete with database connection strings. The common thread: the system hit an unexpected state and the failure path hadn’t been designed. Firewalls that default to “allow” when the rule engine crashes, auth services that return “authorized” when they can’t reach the identity provider. Each of these turns a temporary outage into a security incident.
- Default to deny, not allow. Every access decision should require an explicit “yes” from your auth logic. If the check fails, times out, or throws an exception, the answer is “no.”
- Catch exceptions before they leak information. Wrap external-facing error responses in generic messages and route the details to your logging pipeline. Stack traces, internal paths, and tokens in API responses or error messages are free reconnaissance for attackers.
- Lock down your fallback configurations. If your application can’t read its config, it should start in a minimal, locked state rather than falling back to permissive defaults. Missing values should disable functionality, not enable it.
- Test your failure paths. Kill a dependency in staging and watch how your service reacts. If you’ve never done that, the answer is probably “not well.”
Here’s what fail-secure error handling looks like in an Express.js application. First the default that ships, then a version designed for production.
Before: Default that leaks internals
app.use((err, req, res, next) => {
  res.status(500).send(err.stack);
});
After: Generic response to client, full details in logs
const logger = require('./logger');
const { randomUUID } = require('crypto');

app.use((err, req, res, next) => {
  const correlationId = req.correlationId || randomUUID();
  const status = err.statusCode || 500;

  // Everything sensitive goes to the log pipeline, never to the client
  logger.error({
    correlationId,
    method: req.method,
    path: req.path,
    userId: req.user?.id,
    ip: req.headers['x-forwarded-for'],
    err: { name: err.name, message: err.message, stack: err.stack }
  }, 'Unhandled exception');

  // 5xx: hide details. 4xx: only expose a message the error class marked safe.
  // Assumption: errors carry an `expose` flag, as in the http-errors convention.
  const clientMessage = status >= 500
    ? 'Internal Server Error'
    : (err.expose === true ? err.message : 'Bad Request');

  res.status(status).json({ error: clientMessage, correlationId });
});
The revised handler never sends sensitive details to the client. Stack traces, internal paths, and request metadata go to the structured logging pipeline instead, where your engineers can find them and attackers can’t. The client receives a generic message and a correlation ID: enough for a support conversation, useless for reconnaissance. The 5xx path returns only ‘Internal Server Error’; 4xx responses expose only what the error class explicitly marks safe. The correlationId comes from req.correlationId (assumed here to be set by server-side middleware) or from crypto.randomUUID(), never from a user-supplied value, so it can’t be manipulated to infer internal state.
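The “default to deny” bullet in this practice can be sketched in a few lines as well. This is a minimal illustration under stated assumptions, not a drop-in library: isAllowed and its timeoutMs parameter are hypothetical names, and checkAccess stands for whatever async call your auth logic makes. The point is that a thrown error, a timeout, or anything other than an explicit true ends in “no.”

```javascript
// Hypothetical fail-closed wrapper around an authorization check.
// `checkAccess` is any async function that resolves to true when access is granted.
async function isAllowed(checkAccess, request, timeoutMs = 500) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });

  try {
    // Whichever settles first wins; a timeout resolves to "deny"
    const decision = await Promise.race([checkAccess(request), timeout]);
    // Only an explicit boolean true counts as "yes"
    return decision === true;
  } catch (err) {
    // Identity provider unreachable, bug in the check, and so on: deny
    return false;
  } finally {
    clearTimeout(timer);
  }
}

async function demo() {
  const allow = async () => true;
  const broken = async () => { throw new Error('identity provider unreachable'); };
  const truthyButNotTrue = async () => 'authorized';

  console.log(await isAllowed(allow, {}));            // true
  console.log(await isAllowed(broken, {}));           // false
  console.log(await isAllowed(truthyButNotTrue, {})); // false
}
demo();
```

Comparing against true rather than testing for truthiness is a deliberate choice: an error object or a stray string from a misbehaving dependency should never read as approval.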
Practice 3: Validate Every Request, Not Just the First One
A user logs in, gets a session token, and from that point on, every API call sails through because the system checked once and decided to trust. Thirty minutes later, the user’s role has changed, the token is still valid, and the requests keep going through with outdated permissions. This is one of the quieter vulnerabilities because it doesn’t look like a bug. The auth system works. The problem is that authorization was treated as a one-time event. Cached permissions that never refresh, long-lived tokens, internal endpoints without auth logic because “only our services call them” – in a microservices architecture, this compounds fast.
- Route authorization through a single enforcement point. One place that makes access decisions, applied consistently across all services. Scattering auth logic across individual endpoints is how checks get forgotten.
- Keep tokens and sessions short-lived. Issue tokens with short expiry windows and force revalidation at meaningful intervals. If a user’s permissions change mid-session, the next request should reflect that.
- Treat internal APIs the same as external ones. “This endpoint is only called by our own services” is not a security boundary. Service-to-service calls need auth checks too.
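The bullets above can be sketched as one function that every request passes through. This is a simplified illustration: the in-memory currentRoles and permissions maps stand in for whatever role store and policy source you actually use, the token shape is hypothetical, and the token is assumed to have been signature-verified already.

```javascript
// Hypothetical in-memory stand-ins for a role store and a permission table.
const currentRoles = new Map([['alice', 'admin']]);
const permissions = new Map([
  ['admin', new Set(['invoice:read', 'invoice:delete'])],
  ['viewer', new Set(['invoice:read'])],
]);

// Single enforcement point: every request, internal or external, goes through here.
// `token` is assumed to be already verified: { sub, expiresAt } with expiresAt in ms since epoch.
function authorize(token, action, now = Date.now()) {
  if (!token || typeof token.expiresAt !== 'number' || now >= token.expiresAt) {
    return false; // missing or expired token: deny
  }
  // Look the role up at request time instead of trusting a role baked into the token
  const role = currentRoles.get(token.sub);
  const allowed = permissions.get(role);
  return allowed !== undefined && allowed.has(action);
}

const token = { sub: 'alice', expiresAt: Date.now() + 5 * 60 * 1000 }; // 5-minute token

console.log(authorize(token, 'invoice:delete')); // true while alice is admin
currentRoles.set('alice', 'viewer');             // role changes mid-session
console.log(authorize(token, 'invoice:delete')); // false on the very next request
console.log(authorize(token, 'invoice:read'));   // true
```

Looking up the role on each call costs a read per request. In practice that lookup is often cached for seconds rather than for the lifetime of the session, which keeps the window for stale permissions short.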
Practice 4: Set Secure Defaults – Don’t Rely on Configuration
A developer scaffolds a new service, pushes it to staging, and it works. Nobody questions the defaults. Months later, a security review reveals CORS set to allow all origins, session cookies without the Secure flag, and the admin panel reachable without authentication. Frameworks and cloud platforms optimize for getting started quickly, which means shipping with permissive settings. The assumption is that teams will harden things before production. In practice, the config that worked in development is the config that ships.
- Harden your project templates. If your team uses starter repos, Helm charts, or Terraform modules, lock them down: strict CORS, secure cookie flags (e.g., Secure, HttpOnly, SameSite), TLS required, admin interfaces disabled. Every project that inherits those templates starts secure without anyone having to remember.
- Make risky features opt-in, not opt-out. Debug modes, verbose logging, permissive firewall rules: these should require a deliberate change to enable. If turning something dangerous on is as easy as not turning it off, someone will ship it by accident.
- Fail on missing configuration. If an environment variable is missing, the application should refuse to start rather than silently applying a permissive fallback. A loud startup failure beats a quietly running service with open permissions.
A Docker Compose file is a practical place to see this gap. Both configurations run the same service; what changes is how much of the host the container can reach.
Before: What frameworks generate by default
services:
  api:
    image: my-api:latest
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
After: Hardened Version
services:
  api:
    image: my-api:1.4.2@sha256:abc123def456... # pinned digest, not :latest
    ports:
      - "127.0.0.1:3000:3000" # bind to loopback, expose via reverse proxy
    environment:
      - NODE_ENV=production
    user: "10001:10001" # non-root UID/GID
    read_only: true # immutable root filesystem
    tmpfs:
      - /tmp:size=64M,mode=1777 # writable scratch space only where needed
    cap_drop:
      - ALL # drop all Linux capabilities, add back only what's needed
    networks:
      - backend
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
    healthcheck:
      test: ["CMD", "node", "healthcheck.js"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
networks:
  backend:
    driver: bridge
    internal: true # no direct egress; outbound via explicit proxy
Three settings in the hardened version can break applications on first run:
- read_only: true fails immediately if the application writes to any path not explicitly covered by a tmpfs mount. Check for log directories, temp files, and PID files before enabling it.
- cap_drop: ALL removes all Linux capabilities, which breaks applications that bind to port 80 or otherwise rely on CAP_NET_BIND_SERVICE. Add back only the specific capability you need.
- internal: true cuts all outbound traffic from the container. Remove it if the service makes legitimate calls to external APIs, and route outbound traffic through an explicit proxy instead.
Beyond those caveats: :latest is replaced with a pinned version and SHA256 digest, which prevents silent upstream image swaps from introducing changes you didn’t review. The port binding is restricted to 127.0.0.1, so external traffic has to come through a reverse proxy rather than directly to the container. The container runs as a non-root user – UID/GID 10001 – which limits what a compromised process can do on the host. Memory and CPU limits are set explicitly, which bounds the blast radius of a resource exhaustion attack or a crypto-mining payload that makes it through.
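The same idea applies inside the application. The “fail on missing configuration” and “opt-in, not opt-out” bullets from this practice can be sketched as a small config loader. The variable names below (DATABASE_URL, SESSION_SECRET, ALLOWED_ORIGIN, DEBUG) are hypothetical placeholders, and loadConfig is an illustration, not a prescribed pattern.

```javascript
// Hypothetical variable names; adjust to what your service actually needs.
const REQUIRED_VARS = ['DATABASE_URL', 'SESSION_SECRET', 'ALLOWED_ORIGIN'];

function loadConfig(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name] || env[name].trim() === '');
  if (missing.length > 0) {
    // Refuse to start rather than fall back to permissive defaults
    throw new Error(`Missing required configuration: ${missing.join(', ')}`);
  }
  return Object.freeze({
    databaseUrl: env.DATABASE_URL,
    sessionSecret: env.SESSION_SECRET,
    allowedOrigin: env.ALLOWED_ORIGIN, // one explicit origin instead of a wildcard
    debug: env.DEBUG === 'true',       // risky feature is opt-in; anything else means off
  });
}

// Example with an explicit environment object (values are placeholders)
const config = loadConfig({
  DATABASE_URL: 'postgres://db.internal/app',
  SESSION_SECRET: 'example-only',
  ALLOWED_ORIGIN: 'https://app.example.com',
});
console.log(config.debug); // false, because DEBUG was never set to 'true'
```

At startup you would call loadConfig() with no arguments and let the exception stop the process. An unset DEBUG leaves debug mode off, so forgetting a variable can only make the service stricter.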
Practice 5: Automate One Security Check in Your Pipeline Today
You already run linters and tests before merging code. Adding a security check is the same principle: catch problems before they reach production, automatically. Yet in a lot of teams, security validation still happens manually or not at all until weeks before a release. The other failure mode is overcorrection: three scanners at once, a flood of false positives, and within a month the team clicks “ignore” on everything. Start with one check. Get the noise level under control, then expand.
- Dependency scanning is the lowest-friction starting point. Tools like Dependabot or OWASP Dependency-Check flag known vulnerabilities with minimal configuration. A CVE in a dependency is a verifiable fact, not a heuristic guess, so the signal-to-noise ratio is better than most SAST tools.
- Add a secret scanner alongside your dependency checks. Tools like gitleaks or TruffleHog scan commits for API keys, tokens, and hardcoded secrets before they reach a shared branch.
- Set a severity threshold your team actually respects. Break the build on critical and high, surface medium as warnings, suppress low. Adjust based on what your team acts on. If everything blocks, developers will route around it.
- Assign ownership for findings, not just the tool. A scanner without someone responsible for triaging its output is a scanner that gets ignored. Decide upfront who reviews, who decides fix-or-suppress, and how quickly.
Here’s a minimal GitHub Actions workflow using Grype – one scanner, one severity threshold, and the results in a place where someone is actually accountable for them.
GitHub Actions: Grype
name: security-scan
on:
  pull_request:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * 1' # weekly, Mondays 06:00 UTC - catches CVEs disclosed after last commit
  workflow_dispatch: # allow manual triggering
jobs:
  dependency-scan:
    name: Dependency vulnerability scan
    runs-on: ubuntu-latest
    timeout-minutes: 10 # prevent hung jobs from blocking the queue
    permissions:
      contents: read
      security-events: write # required to upload SARIF
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Scan repository with Grype
        id: scan
        uses: anchore/scan-action@v4 # pin to a specific version
        with:
          path: "."
          severity-cutoff: high # fail build on HIGH or CRITICAL findings
          fail-build: true
          output-format: sarif
          only-fixed: true # skip vulns with no available fix - noise reduction
      - name: Upload findings to GitHub Security tab
        if: always() # upload even when the previous step fails the build
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: ${{ steps.scan.outputs.sarif }}
          category: grype # disambiguates findings if you add more scanners later
The workflow runs on every pull request, every push to main, and on a weekly schedule. The schedule matters because CVEs are disclosed continuously, not just when code changes. severity-cutoff: high means HIGH and CRITICAL findings fail the build; everything below surfaces as informational without blocking the pipeline. only-fixed: true is worth noting: it filters out vulnerabilities with no available patch, which removes a significant source of alert fatigue. Findings are exported as SARIF and uploaded to the GitHub Security tab, which gives them a UI, a triage workflow, and an assignable owner instead of disappearing into CI logs that nobody reviews after the build passes.
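If your scanner doesn’t offer a built-in cutoff, the threshold rule from the list above (break on critical and high, warn on medium, suppress low) is simple enough to sketch yourself. The finding shape and the triage function below are hypothetical and not tied to any scanner’s output format. Treating unknown severities as blocking is an assumption made here, in keeping with the fail-secure practice.

```javascript
// Hypothetical policy table: severity -> what the pipeline does with it.
const SEVERITY_POLICY = new Map([
  ['critical', 'block'],
  ['high', 'block'],
  ['medium', 'warn'],
  ['low', 'suppress'],
]);

// Hypothetical finding shape: { id, severity }
function triage(findings) {
  const result = { block: [], warn: [], suppress: [] };
  for (const finding of findings) {
    const severity = String(finding.severity).toLowerCase();
    // Assumption: unknown or missing severities block, so nothing slips through unreviewed
    const decision = SEVERITY_POLICY.get(severity) ?? 'block';
    result[decision].push(finding.id);
  }
  return result;
}

const report = triage([
  { id: 'FINDING-1', severity: 'Critical' },
  { id: 'FINDING-2', severity: 'medium' },
  { id: 'FINDING-3', severity: 'low' },
  { id: 'FINDING-4', severity: 'negligible' },
]);

console.log(report.block);    // [ 'FINDING-1', 'FINDING-4' ]
console.log(report.warn);     // [ 'FINDING-2' ]
console.log(report.suppress); // [ 'FINDING-3' ]
```

In a pipeline, a non-empty block list would translate into a failing exit code. Keeping the policy in one table makes “adjust based on what your team acts on” a one-line change.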
From First Practice to Secure by Design Habit
Five practices, none of which require a reorganization. Scope your service accounts, fail secure, validate every request, harden your defaults, add one automated check.
If you’ve implemented one and it worked, pick the next. That’s how security practice builds: through incremental changes that compound over time. The hard part isn’t any individual practice. It’s building the habit of treating security as part of the work rather than as a separate task.




