Production

Why Does My Production Line Keep Stopping?

Lasse Ran Carlsen · 10 min read

Most unplanned stops come from a short list of causes that compound each other. Sensor blind spots, delayed maintenance response, equipment running past rated cycles, power quality events, raw material drift, PLC faults, and shift-change errors account for the majority of lost production time. Fixing them requires data correlation, not more dashboards.

1. Sensor Blind Spots

About 23% of unplanned stops [1] trace back to sensors that weren't monitoring the right parameter: vibration on a gearbox bearing, temperature on a motor winding, signals that were never wired to the PLC. You can't catch a failure mode you aren't measuring. Most plants discover these gaps only after a post-mortem.

The typical line has dozens of sensors, but they were installed based on the OEM's default I/O list. That list covers the obvious stuff: main drive current, hydraulic pressure, discharge temperature. It rarely covers secondary failure modes.

Consider a gearbox on a conveyor drive. The OEM monitors oil temperature and output speed. But the most common failure mode for that gearbox is bearing wear on the input shaft. Without a vibration sensor on that bearing housing, the first sign of trouble is a seized bearing and a dead line.

The fix isn't buying more sensors blindly. Walk your downtime Pareto. Pull the last 12 months of unplanned stops. For each one, ask: was there a sensor that could have caught this 30 minutes earlier? If not, that's a blind spot.
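
A minimal sketch of that audit, assuming you can export the stop log to a CSV; the column names (root_cause, duration_min, had_early_warning_sensor) are illustrative placeholders for whatever your CMMS or historian actually provides:

```python
import pandas as pd

# Illustrative column names; adjust to match your own stop-log export.
stops = pd.read_csv("unplanned_stops_last_12_months.csv")

# Pareto: total minutes lost per root cause, worst first.
pareto = (
    stops.groupby("root_cause")["duration_min"]
    .sum()
    .sort_values(ascending=False)
)
print(pareto.head(10))

# Blind spots: causes where no installed sensor could have given early warning.
blind_spots = (
    stops[stops["had_early_warning_sensor"] == "no"]
    .groupby("root_cause")["duration_min"]
    .sum()
    .sort_values(ascending=False)
)
print(blind_spots.head(10))
```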

Common blind spots we see repeatedly:

  • Motor winding temperature on auxiliary drives (not just the main spindle)
  • Vibration on gearbox input bearings (not just output)
  • Pressure differential across filters that get changed on a calendar, not on condition
  • Ambient temperature near VFDs in enclosed panels

Retrofitting a wireless vibration sensor costs $200-400 per point. Compare that to the cost of one unplanned stop. The math usually takes about five seconds.

2. Delayed Maintenance Response

The average time between failure detection and wrench-on-machine is 47 minutes in plants without mobile alerts [2]. Work orders sit in a queue. Techs don't know which machine triggered the alarm or what parts to bring. Half the downtime is just logistics, not repair.

A sensor catches a fault. The PLC throws an alarm. The operator calls maintenance. The maintenance coordinator checks the board, assigns a tech, and the tech walks to the crib for parts. That chain is where the 47 minutes disappear.

Break that down. The alarm-to-call step takes 3-5 minutes because the operator tries a restart first. The call-to-dispatch step takes 8-15 minutes because the coordinator is handling three other requests. The dispatch-to-arrival step takes another 10-20 minutes because the tech has to figure out which machine, pull up the manual, and grab parts.

The actual repair? Often under 15 minutes.

Plants that cut response time in half typically do three things. First, they send alarms directly to the assigned tech's phone with machine ID, fault code, and location. Second, they pre-stage common spare parts at the line instead of in a central crib. Third, they attach a short troubleshooting checklist to each alarm code so the tech arrives with a plan.

None of this requires new technology. It requires changing the information flow so the right person gets the right context immediately. A CMMS with mobile push notifications handles most of it. The bottleneck is almost never the repair skill. It's the time spent figuring out what happened and where.
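
One lightweight way to get that information flow, if your CMMS exposes a notification API, is a small script that enriches the raw PLC alarm before it goes out. The endpoint URL, lookup tables, and field names below are hypothetical:

```python
import requests

# Hypothetical lookup tables; in practice these come from your CMMS or asset register.
MACHINE_LOCATIONS = {"PRESS-04": "Building B, Line 2, Bay 7"}
FAULT_CHECKLISTS = {"F117": ["Check DC bus voltage", "Inspect input fuses", "Verify encoder cable"]}

def dispatch_alarm(machine_id: str, fault_code: str, assigned_tech: str) -> None:
    """Enrich a raw PLC alarm with location and a troubleshooting checklist, then push it."""
    payload = {
        "tech": assigned_tech,
        "machine": machine_id,
        "fault_code": fault_code,
        "location": MACHINE_LOCATIONS.get(machine_id, "unknown"),
        "checklist": FAULT_CHECKLISTS.get(fault_code, []),
    }
    # Placeholder URL -- replace with your CMMS or push-notification endpoint.
    requests.post("https://cmms.example.com/api/notify", json=payload, timeout=5)

# Example call (commented out because the endpoint above is a placeholder):
# dispatch_alarm("PRESS-04", "F117", "j.larsen")
```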

Maintenance tool cart with a two-way radio parked beside a stopped conveyor belt, amber stack light glowing in the background

3. Equipment Running Past Rated Cycles

OEMs rate servo motors for roughly 20,000 hours [3]. Plants routinely run them to 28,000 or more. Cycle count tracking lives in a spreadsheet, if it exists at all. When a servo fails at 26,000 hours, it's not a surprise. It's a predictable event that nobody was tracking.

Every rotating component has a rated life. Servo motors, spindle bearings, ball screws, pneumatic cylinders. OEMs publish these numbers, but they end up in a binder on a shelf.

The problem isn't that plants ignore maintenance. It's that hour-based or cycle-based replacement intervals aren't connected to the actual running hours of the machine. A press that runs two shifts accumulates hours twice as fast as the same press on one shift. The calendar-based PM schedule treats them the same.

Here's what typically happens. A servo motor rated for 20,000 hours is installed. The PM schedule says "replace every 3 years," an interval that assumed single-shift operation. The plant adds a second shift six months later. Nobody updates the PM interval. The motor accumulates run hours twice as fast as the schedule assumed, blows past its 20,000-hour rating long before the three-year replacement comes due, and fails.

The fix is tracking actual run hours per component, not per machine. Most PLCs already count cycles or run hours internally. The data is there. It's just not connected to the maintenance system.

Start with your top 10 downtime contributors. For each one, find the rated life of the wear components. Compare that to the actual accumulated hours. You'll likely find several components running 30-40% past their rated life. Those are your next unplanned stops.
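
A minimal sketch of that comparison, with made-up components and hours standing in for your own run-hour counters and OEM ratings:

```python
# (component, accumulated_hours, rated_hours) -- illustrative values only.
components = [
    ("Conveyor 2 gearbox input bearing", 31_000, 25_000),
    ("Press 4 main servo motor", 18_500, 20_000),
    ("Filler spindle ball screw", 26_400, 22_000),
]

for name, actual, rated in components:
    usage = actual / rated
    if usage >= 1.0:
        status = f"PAST rated life by {usage - 1:.0%}"
    elif usage >= 0.8:
        status = f"approaching rated life ({usage:.0%})"
    else:
        status = f"ok ({usage:.0%})"
    print(f"{name}: {status}")
```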

4. Power Quality Issues

Voltage sags below 90% of nominal cause VFDs to trip. Most plants experience 5-12 sag events per month [4]. The line faults, the operator restarts, and nobody logs the root cause. Without power quality monitoring at the panel level, these events look like random stops.

Variable frequency drives are sensitive to input power quality. A voltage sag to 85% of nominal for even 100 milliseconds can cause a DC bus undervoltage fault. The drive trips, the motor stops, the line goes down.

The frustrating part is that these events are intermittent and hard to reproduce. The operator sees a drive fault, resets it, and the line runs fine for hours. It looks random. But if you install a power quality meter on the incoming feed, you'll often find a pattern.

Common sources of voltage sags:

  • Large motors starting on other feeders in the same plant
  • Utility switching events (capacitor banks, tap changers)
  • Welding operations on shared circuits
  • Compressor starts without soft starters

Plants with 5-12 sag events per month lose 30-90 minutes of production time per event when you include the restart sequence, quality checks, and re-establishing steady state.

Power quality meters at the main distribution panel cost $2,000-5,000 installed. They log every sag, swell, and transient with a timestamp. Once you have that data, you can correlate drive faults with power events. The fix might be as simple as adding a DC bus reactor to the VFD, installing a ride-through module, or moving a welding circuit to a different feeder.

Without the data, you're chasing ghosts.
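
The correlation itself is just a time-window join once both logs exist. A sketch with pandas, assuming a drive fault export and a power quality event export that both carry timestamps; file and column names are placeholders:

```python
import pandas as pd

faults = pd.read_csv("drive_faults.csv", parse_dates=["timestamp"]).sort_values("timestamp")
sags = pd.read_csv("pq_events.csv", parse_dates=["timestamp"]).sort_values("timestamp")

# For each drive fault, find the nearest logged power quality event within 2 seconds.
matched = pd.merge_asof(
    faults,
    sags,
    on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta("2s"),
    suffixes=("_fault", "_pq"),
)

hit_rate = matched["event_type"].notna().mean()
print(f"{hit_rate:.0%} of drive faults coincide with a logged power quality event")
```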

Power quality analyzer clamped onto cables inside an open electrical panel, waveform display showing a voltage sag event

5. Raw Material Variability

Incoming material properties like moisture content, viscosity, and particle size drift between batches and sometimes within a batch. If you're not correlating material lot numbers with line speed and reject rates, you won't see the pattern. The line looks unreliable, but the real variable is the feedstock.

A packaging line runs perfectly for three days, then suddenly starts jamming every 20 minutes. The maintenance team checks everything mechanical. Nothing is wrong. Then the material lot changes and the problem disappears.

This pattern is common in food and beverage, pharma, plastics, and any process where raw material properties matter. Film thickness varies by 5-10 microns between rolls. Resin melt flow index shifts between suppliers. Powder moisture content changes with warehouse humidity.

The line was tuned for a specific material spec. When the material drifts, the line parameters are no longer optimal. But because the material change isn't tracked alongside machine performance, nobody connects the two.

What helps:

  • Log material lot numbers against production runs in your MES or even a shared spreadsheet.
  • Track incoming material certificates of analysis. Note the actual values, not just pass/fail against spec.
  • When you get a run of stoppages, check if the material lot changed in the last 2-4 hours.
  • Build a simple scatter plot of reject rate vs. key material properties over 30 days.

You'll often find that 80% of your "random" stops correlate with material lots from specific suppliers or specific property ranges. That's actionable. You can tighten incoming specs, adjust line parameters per material batch, or set up automatic recipe changes based on incoming test results.
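
A sketch of that lot-level check, assuming production runs are already tagged with a material lot ID and the certificate-of-analysis values sit in a second table; column names are assumptions:

```python
import pandas as pd

runs = pd.read_csv("production_runs.csv")       # one row per run: lot_id, reject_rate, stops_per_hour
coa = pd.read_csv("material_certificates.csv")  # one row per lot: lot_id, moisture_pct, melt_flow_index

merged = runs.merge(coa, on="lot_id", how="left")

# Correlation between reject rate and each material property.
print(merged[["reject_rate", "moisture_pct", "melt_flow_index"]].corr()["reject_rate"])

# Worst lots by stoppage frequency -- often a handful of lots dominate.
print(merged.groupby("lot_id")["stops_per_hour"].mean().sort_values(ascending=False).head(5))
```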

6. PLC and Software Faults

PLC firmware updates, HMI patches, and recipe changes introduce faults that are hard to trace. One wrong timer value and the line faults every 200 cycles. Software changes aren't always version-controlled or documented. The machine ran fine yesterday because yesterday's code was different.

A production line starts faulting intermittently. The alarm log shows a sequence timeout on station 4. Maintenance checks the mechanical components. Everything looks fine. The real cause? Someone adjusted a timer value in the PLC program last Tuesday to fix a different issue, and it created a race condition that only triggers at high cycle rates.

This happens more than most plants admit. PLC code is often modified by multiple people: the OEM's integrator, the plant's controls engineer, the third-party contractor who added the new vision system. Changes are made online, sometimes without saving the project file. There's no diff, no commit log, no rollback path.

Common software-related stoppages:

  • Timer values changed to fix one issue that end up creating a fault in a different sequence
  • Recipe parameters uploaded from an old backup that don't match current tooling
  • HMI firmware updates that change the behavior of communication drivers
  • Safety PLC logic changes that alter the restart sequence timing

The fix is basic version control discipline. Save the PLC project file after every change. Use a naming convention with dates. Better yet, use PLC version control software that diffs ladder logic and function blocks.

Before any software change goes live, document what changed, why, and what the rollback steps are. When the line starts faulting, the first question should be: did anyone touch the code in the last 48 hours?
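
If dedicated PLC version control software isn't in the budget yet, even a small archiving script enforces the habit. This sketch copies the exported project file into a dated archive and appends a change-log entry recording what changed, why, and the rollback path; the paths and file names are placeholders:

```python
import shutil
from datetime import datetime
from pathlib import Path

def archive_plc_project(project_file: str, author: str, reason: str, rollback: str) -> Path:
    """Copy the exported PLC project into a dated archive and record what changed and why."""
    archive_dir = Path("plc_archive")
    archive_dir.mkdir(exist_ok=True)

    stamp = datetime.now().strftime("%Y-%m-%d_%H%M")
    src = Path(project_file)
    dest = archive_dir / f"{src.stem}_{stamp}{src.suffix}"
    shutil.copy2(src, dest)

    with open(archive_dir / "changelog.txt", "a") as log:
        log.write(f"{stamp} | {src.name} | {author} | change: {reason} | rollback: {rollback}\n")
    return dest

# Example usage (placeholder file name and details):
# archive_plc_project("Line2_Station4.ap17", "controls_eng", "extended dwell timer T4", "restore previous archive copy")
```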

Laptop displaying ladder logic connected to a PLC rack inside a control cabinet, patch cables and documentation binder visible

7. Operator Errors During Shift Changes

68% of plants report higher scrap rates [5] in the first 30 minutes after a shift change. Outgoing operators don't always pass along machine quirks, active alarms, or parameter adjustments they made during their shift. The incoming crew starts cold, and the line pays for it.

Shift handoffs are a known risk. The data confirms it. Plants that track scrap and downtime by time of day consistently see spikes in the first 30 minutes after a shift transition. The pattern is predictable, but most plants treat it as unavoidable.

What goes wrong during handoffs:

  • The outgoing operator adjusted a temperature setpoint 2 degrees up to compensate for a material batch. They didn't mention it. The incoming operator sees the non-standard setpoint and changes it back.
  • A machine had an intermittent fault that the outgoing operator learned to work around. The incoming operator doesn't know the workaround and calls maintenance.
  • The outgoing operator ran the last 30 minutes of their shift at reduced speed to avoid a known jam point. The incoming operator starts at full speed.

Plants that reduce shift-change losses do it with structured handoff procedures. Not a clipboard that nobody reads. A 5-minute standing meeting at the machine with both operators present.

The handoff should cover three things: what's different from standard right now, what problems came up and how they were handled, and what's queued for the next shift. Some plants use a digital shift log displayed on the HMI. Others use a whiteboard at the line. The format matters less than the habit.
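
If you go the digital route, the record can be as simple as one structured entry per handoff covering those three points. A sketch, with field names chosen purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ShiftHandoff:
    line: str
    outgoing: str
    incoming: str
    deviations_from_standard: list[str] = field(default_factory=list)   # setpoints changed and why
    problems_and_workarounds: list[str] = field(default_factory=list)
    queued_for_next_shift: list[str] = field(default_factory=list)

handoff = ShiftHandoff(
    line="Line 2",
    outgoing="A-shift",
    incoming="B-shift",
    deviations_from_standard=["Zone 3 temp setpoint +2 C for lot 4471 (wet batch)"],
    problems_and_workarounds=["Station 4 intermittent jam -- run infeed at 90% until guide rail is replaced"],
    queued_for_next_shift=["Filter change on press hydraulics due this shift"],
)
print(handoff)
```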

One approach that works well: the incoming operator shadows the outgoing operator for 10 minutes before the official shift change. They run the machine together. Problems that would have taken 20 minutes to diagnose take 2 minutes because the context is right there.

What Modern Teams Do Differently

Plants that cut unplanned downtime by 30% or more share a common approach. They layer spatial context on top of sensor data using digital twin technology. Instead of monitoring individual data points in isolation, they correlate machine state, environmental conditions, and material data in a single visual model.

Traditional monitoring gives you time-series charts on a screen. You see that motor 7 hit 85°C at 2:14 PM. What you don't see is that motor 7 sits next to a steam line that was leaking, in a corner of the plant where airflow is restricted, during a shift that was running 10% above target speed.

Digital twin platforms solve this by mapping sensor data onto a spatial model of the production floor. Every sensor has a location. Every machine has neighbors. Every environmental factor has a zone of influence.

This spatial correlation catches patterns that flat dashboards miss. When a VFD trips, the digital twin shows you that three other VFDs in the same electrical panel have been trending warm for a week. A traditional alarm system treats each drive independently.
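
A toy illustration of that spatial layer: once each sensor record carries a zone (here, an electrical panel), a trip on one drive can immediately pull up the recent trend of everything co-located with it. Names and readings are invented:

```python
# Each sensor carries a zone and a recent temperature trend (degrees C per day) -- invented values.
sensors = {
    "VFD-07": {"zone": "Panel-3", "trend_c_per_day": 1.8},
    "VFD-08": {"zone": "Panel-3", "trend_c_per_day": 1.5},
    "VFD-09": {"zone": "Panel-3", "trend_c_per_day": 1.6},
    "VFD-12": {"zone": "Panel-5", "trend_c_per_day": 0.1},
}

def co_located_context(tripped: str) -> list[str]:
    """When a drive trips, list neighbours in the same panel that are trending warm."""
    zone = sensors[tripped]["zone"]
    return [
        name
        for name, s in sensors.items()
        if name != tripped and s["zone"] == zone and s["trend_c_per_day"] > 1.0
    ]

print(co_located_context("VFD-07"))  # -> ['VFD-08', 'VFD-09']
```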

The approach works because manufacturing problems are rarely isolated. A stopping event on one machine is usually the visible symptom of a system-level issue involving power, environment, material, and operating parameters interacting. Seeing those interactions requires a model that preserves spatial relationships.

Teams using digital twins for root cause analysis report finding the actual cause of chronic stops 40-60% faster than teams working from alarm logs [6] and spreadsheets alone. The speed comes from being able to visually trace upstream and downstream effects without walking the floor for every investigation.

The tooling has caught up in the last two years. What used to require months of modeling and custom development can now be deployed in days using floor plan imports and automated sensor mapping. The barrier for most plants isn't cost or complexity. It's awareness that the category exists.


Frequently Asked Questions

What does unplanned downtime actually cost?

It varies widely by industry and line value. Automotive plants commonly cite $22,000 per minute [7]. For mid-size discrete manufacturing, $5,000-15,000 per hour is a reasonable range. The key is calculating your own number: take your daily output value and divide by production hours. That gives you the revenue cost per hour. Add labor, scrap, and restart costs on top.
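
A back-of-envelope version of that calculation, with placeholder figures:

```python
daily_output_value = 480_000     # value of a normal day's production, in your currency
production_hours_per_day = 16    # e.g. two shifts

revenue_cost_per_hour = daily_output_value / production_hours_per_day
labor_scrap_restart_per_hour = 2_500  # idle labor, scrap, and restart losses, estimated

print(f"Downtime cost: roughly {revenue_cost_per_hour + labor_scrap_restart_per_hour:,.0f} per hour")
```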

How do I find my biggest downtime causes?

Pull your alarm log and maintenance work orders for the last 90 days. Categorize each stop by root cause, not symptom. Group them into buckets: mechanical, electrical, software, material, operator. Sort by total minutes lost, not number of events. A single 4-hour stop outweighs twenty 5-minute nuisance alarms. Focus your first improvement project on the top bucket.

Should I start with predictive monitoring or fix preventive maintenance first?

Fix preventive maintenance first. If your PM completion rate is below 90%, adding predictive sensors will generate alerts that your team doesn't have bandwidth to act on. Get the basics right: PMs completed on time, spare parts stocked for critical components, and run-hour tracking on wear items. Then layer in vibration and thermal monitoring on your top 5 downtime contributors.

How many additional sensors does my line need?

There's no universal number. Start with your downtime data, not a sensor catalog. Identify the top 10 failure modes by total downtime minutes. For each one, ask: what physical parameter would have given us 30 minutes of warning? That tells you what to measure and where. Most lines need 15-30 additional monitoring points beyond what the OEM installed.

What OEE should I be targeting?

If your OEE is below 65%, you have significant low-hanging fruit. World-class discrete manufacturing runs 85%+, but jumping straight to that target isn't useful. Focus on the gap between your current OEE and 75%. That gap is almost always dominated by 2-3 chronic issues. Fix those first. Chasing the last 5% of OEE requires a different level of investment and is a later-stage problem.

Related Resources

Tool

Manufacturing Downtime Cost Calculator

Calculate the true cost of unplanned downtime across your production lines. Includes lost revenue, labor waste, and scrap costs. Free, instant results.

Comparison

Digital Twin vs SCADA

A practical comparison of SCADA and digital twin platforms for manufacturing. Covers data models, visualization, alerting, and deployment trade-offs.

Comparison

Digital Twin vs MES

A practical comparison of MES and digital twin platforms for manufacturing. Covers ISA-95 levels, OEE tracking, production traceability, and how the two systems complement each other.

Solution

Unplanned Downtime Prevention

Most manufacturers discover downtime after it costs them. Sandhed gives you the visibility to catch equipment issues before they shut down production.

Solution

Maintenance Management

Maintenance teams lose hours tracking down service records, chasing overdue tasks, and figuring out what was done last time. Sandhed puts every work order, service record, and maintenance schedule on your 3D floor plan where you can see it.

Answer

How to Reduce Equipment Downtime: 8 Strategies Ranked by Impact

Reducing equipment downtime starts with knowing where you're losing time, not with buying technology. The eight strategies below are ranked by how much downtime they typically eliminate in the first year. The top three are organizational fixes that cost almost nothing. The rest require incremental investment but build on each other.

Answer

How to Monitor a Factory Floor in Real Time

Real-time factory monitoring means having sensor data from machines, environment, and process parameters available within seconds, not hours. It starts with choosing the right sensors for your top failure modes, runs through an edge-to-cloud architecture that handles the data volume, and works only if the alert design respects your operators' attention.


Stop Guessing Why Your Lines Go Down

See how spatial sensor correlation finds the root causes that alarm logs miss. Get a walkthrough with your own floor plan data.