[ Upstream commit 3cf74230c139f208b7fb313ae0054386eee31a81 ]
If qi_submit_sync() is invoked with 0 invalidation descriptors (for
instance, for DMA draining purposes), we can run into a bug where a
submitting thread fails to detect the completion of invalidation_wait.
Subsequently, this led to a soft lockup. Currently, there is no impact
by this bug on the existing users because no callers are submitting
invalidations with 0 descriptors. This fix will enable future users
(such as DMA drain) calling qi_submit_sync() with 0 count.
Suppose thread T1 invokes qi_submit_sync() with non-zero descriptors, while
concurrently, thread T2 calls qi_submit_sync() with zero descriptors. Both
threads then enter a while loop, waiting for their respective descriptors
to complete. T1 detects its completion (i.e., T1's invalidation_wait status
changes to QI_DONE by HW) and proceeds to call reclaim_free_desc() to
reclaim all descriptors, potentially including adjacent ones of other
threads that are also marked as QI_DONE.
During this time, while T2 is waiting to acquire the qi->q_lock, the IOMMU
hardware may complete the invalidation for T2, setting its status to
QI_DONE. However, if T1's execution of reclaim_free_desc() frees T2's
invalidation_wait descriptor and changes its status to QI_FREE, T2 will
not observe the QI_DONE status for its invalidation_wait and will
indefinitely remain stuck.
This soft lockup does not occur when only non-zero descriptors are
submitted.In such cases, invalidation descriptors are interspersed among
wait descriptors with the status QI_IN_USE, acting as barriers. These
barriers prevent the reclaim code from mistakenly freeing descriptors
belonging to other submitters.
Considered the following example timeline:
T1 T2
========================================
ID1
WD1
while(WD1!=QI_DONE)
unlock
lock
WD1=QI_DONE* WD2
while(WD2!=QI_DONE)
unlock
lock
WD1==QI_DONE?
ID1=QI_DONE WD2=DONE*
reclaim()
ID1=FREE
WD1=FREE
WD2=FREE
unlock
soft lockup! T2 never sees QI_DONE in WD2
Where:
ID = invalidation descriptor
WD = wait descriptor
* Written by hardware
The root of the problem is that the descriptor status QI_DONE flag is used
for two conflicting purposes:
1. signal a descriptor is ready for reclaim (to be freed)
2. signal by the hardware that a wait descriptor is complete
The solution (in this patch) is state separation by using QI_FREE flag
for #1.
Once a thread's invalidation descriptors are complete, their status would
be set to QI_FREE. The reclaim_free_desc() function would then only
free descriptors marked as QI_FREE instead of those marked as
QI_DONE. This change ensures that T2 (from the previous example) will
correctly observe the completion of its invalidation_wait (marked as
QI_DONE).
Signed-off-by: Sanjay K Kumar <sanjay.k.kumar@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240728210059.1964602-1-jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2c13012e09190174614fd6901857a1b8c199e17d ]
We will use a global static identity domain. Reserve a static domain ID
for it.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/20240809055431.36513-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3a8990b8a778219327c5f8ecf10b5d81377b925a ]
On qcom msm8998, writing to the last context bank of lpass_q6_smmu
(base address 0x05100000) produces a system freeze & reboot.
The hardware/hypervisor reports 13 context banks for the LPASS SMMU
on msm8998, but only the first 12 are accessible...
Override the number of context banks
[ 2.546101] arm-smmu 5100000.iommu: probing hardware configuration...
[ 2.552439] arm-smmu 5100000.iommu: SMMUv2 with:
[ 2.558945] arm-smmu 5100000.iommu: stage 1 translation
[ 2.563627] arm-smmu 5100000.iommu: address translation ops
[ 2.568923] arm-smmu 5100000.iommu: non-coherent table walk
[ 2.574566] arm-smmu 5100000.iommu: (IDR0.CTTW overridden by FW configuration)
[ 2.580220] arm-smmu 5100000.iommu: stream matching with 12 register groups
[ 2.587263] arm-smmu 5100000.iommu: 13 context banks (0 stage-2 only)
[ 2.614447] arm-smmu 5100000.iommu: Supported page sizes: 0x63315000
[ 2.621358] arm-smmu 5100000.iommu: Stage-1: 36-bit VA -> 36-bit IPA
[ 2.627772] arm-smmu 5100000.iommu: preserved 0 boot mappings
Specifically, the crashes occur here:
qsmmu->bypass_cbndx = smmu->num_context_banks - 1;
arm_smmu_cb_write(smmu, qsmmu->bypass_cbndx, ARM_SMMU_CB_SCTLR, 0);
and here:
arm_smmu_write_context_bank(smmu, i);
arm_smmu_cb_write(smmu, i, ARM_SMMU_CB_FSR, ARM_SMMU_CB_FSR_FAULT);
It is likely that FW reserves the last context bank for its own use,
thus a simple work-around is: DON'T USE IT in Linux.
If we decrease the number of context banks, last one will be "hidden".
Signed-off-by: Marc Gonzalez <mgonzalez@freebox.fr>
Reviewed-by: Caleb Connolly <caleb.connolly@linaro.org>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20240820-smmu-v3-1-2f71483b00ec@freebox.fr
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b5e86a95541cea737394a1da967df4cd4d8f7182 ]
Queued invalidation wait descriptor status is volatile in that IOMMU
hardware writes the data upon completion.
Use READ_ONCE() to prevent compiler optimizations which ensures memory
reads every time. As a side effect, READ_ONCE() also enforces strict
types and may add an extra instruction. But it should not have negative
performance impact since we use cpu_relax anyway and the extra time(by
adding an instruction) may allow IOMMU HW request cacheline ownership
easier.
e.g. gcc 12.3
BEFORE:
81 38 ad de 00 00 cmpl $0x2,(%rax)
AFTER (with READ_ONCE())
772f: 8b 00 mov (%rax),%eax
7731: 3d ad de 00 00 cmp $0x2,%eax
//status data is 32 bit
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20240607173817.3914600-1-jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20240702130839.108139-2-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 927c70c93d929f4c2dcaf72f51b31bb7d118a51a ]
The Allwinner H6 IOMMU has a bypass register, which allows to circumvent
the page tables for each possible master. The reset value for this
register is 0, which disables the bypass.
The Allwinner H616 IOMMU resets this register to 0x7f, which activates
the bypass for all masters, which is not what we want.
Always clear this register to 0, to enforce the usage of page tables,
and make this driver compatible with the H616 in this respect.
Signed-off-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Chen-Yu Tsai <wens@csie.org>
Link: https://lore.kernel.org/r/20240616224056.29159-2-andre.przywara@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
kmemleak_ignore() shouldn't be used on the gen_pool allocations since they
aren't slab allocations. This leads to a flurry of warnings from kmemleak;
fix it by using kmemleak_ignore() correctly.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[ Upstream commit 80fea979dd9d48d67c5b48d2f690c5da3e543ebd ]
If devm_add_action() returns -ENOMEM, then MSIs are allocated but not
not freed on teardown. Use devm_add_action_or_reset() instead to keep
the static analyser happy.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Aleksandr Aprelkov <aaprelkov@usergate.com>
Link: https://lore.kernel.org/r/20240403053759.643164-1-aaprelkov@usergate.com
[will: Tweak commit message, remove warning message]
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 404ec4e4c169fb64da6b2a38b471c13ac0897c76 ]
Newer AMD systems can support multiple PCI segments, where each segment
contains one or more IOMMU instances. However, an IOMMU instance can only
support a single PCI segment.
Current code assumes that system contains only one pci segment (segment 0)
and creates global data structures such as device table, rlookup table,
etc.
Introducing per PCI segment data structure, which contains segment
specific data structures. This will eventually replace the global
data structures.
Also update `amd_iommu->pci_seg` variable to point to PCI segment
structure instead of PCI segment ID.
Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20220706113825.25582-3-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Stable-dep-of: a295ec52c862 ("iommu/amd: Fix sysfs leak in iommu init")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a34f3e20ddff02c4f12df2c0635367394e64c63d ]
The page request queue is per IOMMU, its allocation should be made
NUMA-aware for performance reasons.
Fixes: a222a7f0bb6c ("iommu/vt-d: Implement page request handling")
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240403214007.985600-1-jacob.jun.pan@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0feda94c868d396fac3b3cb14089d2d989a07c72 ]
On many systems that have an AMD IOMMU the following sequence of
warnings is observed during bootup.
```
pci 0000:00:00.2 can't derive routing for PCI INT A
pci 0000:00:00.2: PCI INT A: not connected
```
This series of events happens because of the IOMMU initialization
sequence order and the lack of _PRT entries for the IOMMU.
During initialization the IOMMU driver first enables the PCI device
using pci_enable_device(). This will call acpi_pci_irq_enable()
which will check if the interrupt is declared in a PCI routing table
(_PRT) entry. According to the PCI spec [1] these routing entries
are only required under PCI root bridges:
The _PRT object is required under all PCI root bridges
The IOMMU is directly connected to the root complex, so there is no
parent bridge to look for a _PRT entry. The first warning is emitted
since no entry could be found in the hierarchy. The second warning is
then emitted because the interrupt hasn't yet been configured to any
value. The pin was configured in pci_read_irq() but the byte in
PCI_INTERRUPT_LINE return 0xff which means "Unknown".
After that sequence of events pci_enable_msi() is called and this
will allocate an interrupt.
That is both of these warnings are totally harmless because the IOMMU
uses MSI for interrupts. To avoid even trying to probe for a _PRT
entry mark the IOMMU as IRQ managed. This avoids both warnings.
Link: https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/06_Device_Configuration/Device_Configuration.html?highlight=_prt#prt-pci-routing-table [1]
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Fixes: cffe0a2b5a34 ("x86, irq: Keep balance of IOAPIC pin reference count")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240122233400.1802-1-mario.limonciello@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit afc95681c3068956fed1241a1ff1612c066c75ac upstream.
In some cases the firmware expects cbndx 1 to be assigned to the GMU,
so we also want the default domain for the GMU to be an identy domain.
This way it does not get a context bank assigned. Without this, both
of_dma_configure() and drm/msm's iommu_domain_attach() will trigger
allocating and configuring a context bank. So GMU ends up attached to
both cbndx 1 and later cbndx 2. This arrangement seemingly confounds
and surprises the firmware if the GPU later triggers a translation
fault, resulting (on sc8280xp / lenovo x13s, at least) in the SMMU
getting wedged and the GPU stuck without memory access.
Cc: stable@vger.kernel.org
Signed-off-by: Rob Clark <robdclark@chromium.org>
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20231210180655.75542-1-robdclark@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 85b80fdffa867d75dfb9084a839e7949e29064e8 upstream.
The VT-d spec requires (10.4.4 Global Command Register, TE field) that:
Hardware implementations supporting DMA draining must drain any in-flight
DMA read/write requests queued within the Root-Complex before switching
address translation on or off and reflecting the status of the command
through the TES field in the Global Status register.
Unfortunately, some integrated graphic devices fail to do so after some
kind of power state transition. As the result, the system might stuck in
iommu_disable_translation(), waiting for the completion of TE transition.
Add MTL to the quirk list for those devices and skips TE disabling if the
qurik hits.
Fixes: b1012ca8dc4f ("iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu")
Cc: stable@vger.kernel.org
Signed-off-by: Abdul Halim, Mohd Syazwan <mohd.syazwan.abdul.halim@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20231116022324.30120-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Lots of subsystems, such as the TPU, occasionally spam hundreds of
thousands of IOMMU faults which are not only resource heavy due to the IRQ
overhead, but also destroy dmesg/ramoops with tons of spam. These errors
appear to be nonfatal and don't seem actionable for anyone outside of
Samsung or Google, so turn them off by default.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>