AI search in creative storage is mostly three jobs running quietly in the background: a model watches your frames, another listens to your audio, and a third turns both into numbers a computer can compare. When an editor types "drone shot over the coastline at golden hour" and the right clip appears at the right timecode, that is the payoff. This is a plain-language field guide to what these systems actually index, where the processing happens, and what each pricing model costs you, because those three answers decide whether AI search is a quiet convenience or a line item that grows with your archive.
What it actually indexes
Almost every AI search feature in creative storage is built from the same handful of signals, no matter whose logo is on the box. There is visual analysis, where a model samples frames and records what it sees: objects, scenes, on-screen text, faces, logos. There is audio analysis, where speech is transcribed into searchable text and speakers are separated out. And there is the layer most people mean when they say "semantic" search, where both the pictures and the words get converted into vectors so the system can match meaning rather than exact spelling.
LucidLink's media-library AI, built on Moments Lab's MXT-2 engine, is a good map of the full menu: it detects faces, objects, logos, on-screen text, light changes, motion, and camera angles on the visual side, uses speaker diarization to separate voices on the audio side, transcribes speech, and clusters related shots into coherent "moments" (LucidLink blog, checked Jun 2026). iconik runs a similar set, generating transcripts, detecting faces, identifying objects and scenes, and summarizing content from the transcript so you can read the gist before opening a file (iconik AI page, checked Jun 2026). Shade indexes scenes, faces with name-based clustering, transcription with timecodes, and even specific objects like jersey numbers and product labels (Shade AI search page, checked Jun 2026).
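To make the menu above concrete, here is a minimal sketch of what a timecoded index entry could look like once visual labels, diarization, and transcription have all run. The `Segment` record, its field names, and the sample clips are hypothetical illustrations, not any vendor's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Segment:
    """One timecoded slice of a clip with the signals found in it (hypothetical schema)."""
    clip: str
    start_s: float
    end_s: float
    labels: set = field(default_factory=set)  # visual analysis: objects, scenes, logos
    speaker: Optional[str] = None              # audio analysis: diarization
    transcript: str = ""                       # audio analysis: speech-to-text


def find(segments, *, label=None, phrase=None):
    """Return segments matching a visual label and/or a spoken phrase (case-insensitive)."""
    hits = []
    for seg in segments:
        if label is not None and label.lower() not in {l.lower() for l in seg.labels}:
            continue
        if phrase is not None and phrase.lower() not in seg.transcript.lower():
            continue
        hits.append(seg)
    return hits


def timecode(seconds):
    """Format a position in seconds as HH:MM:SS for display."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


# Made-up sample data for illustration only.
index = [
    Segment("interview_01.mov", 0.0, 12.5, {"person", "office"}, "Speaker 1",
            "Thanks for joining us today."),
    Segment("interview_01.mov", 12.5, 40.0, {"person", "logo"}, "Speaker 2",
            "We shot the launch film on the coast."),
    Segment("broll_07.mov", 0.0, 8.0, {"drone", "coastline"}),
]

for seg in find(index, phrase="coast"):
    print(seg.clip, timecode(seg.start_s), seg.speaker)
```

This is exact-match lookup only: it finds "coast" because someone said it, and "drone" because a model labeled it. Matching on meaning is a separate layer built on vectors.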
The "semantic" part is worth a quick analogy. A model like OpenAI's CLIP turns an image into a list of numbers (512 of them in the common base variants, more in the larger ones) that describe what the picture means, not what its pixels are. Two completely different photos of the Eiffel Tower land near each other in that number-space, and your typed query gets turned into the same kind of list. Search then becomes "find the stored vectors closest to my query vector." That is why you can search for a concept you never tagged: the meaning was captured at ingest, whether or not anyone wrote it down.
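The "find the closest vectors" step can be sketched in a few lines. This is a toy illustration using cosine similarity over hand-written 4-number vectors. The file names and values are invented, real embeddings come from a model and are hundreds of numbers long, and production systems use a dedicated vector index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def search(query_vec, stored, top_k=2):
    """Rank stored vectors by similarity to the query and keep the best top_k."""
    scored = [(cosine(query_vec, vec), name) for name, vec in stored.items()]
    scored.sort(reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:top_k]]


# Hypothetical embeddings written at ingest (toy values, not model output).
stored = {
    "eiffel_day.jpg":   [0.9, 0.1, 0.0, 0.2],
    "eiffel_night.jpg": [0.8, 0.2, 0.1, 0.3],
    "beach_drone.mov":  [0.1, 0.9, 0.3, 0.0],
}

# Pretend this is the embedding of the typed query "Eiffel Tower".
query = [0.85, 0.15, 0.05, 0.25]

results = search(query, stored)
for name, score in results:
    print(name, score)
```

Both Eiffel Tower files rank above the beach clip even though nothing here compares file names or tags, only the direction of the vectors.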
Where the processing happens
This is the question that decides both your bill and your confidentiality posture, and it is the one vendors are least precise about. There are three places the work can run: in the vendor's cloud, on third-party AI cloud services the vendor calls out to, or locally on your own machine or NAS.
The cloud-and-call-out model is the common one. iconik is candid that visual analysis runs on proxy files, audio is extracted for transcription, full-resolution originals are not shipped to external providers, and customer content is not used to train external models (iconik AI page, checked Jun 2026). That is a reasonable design, but read it carefully: a proxy and an audio track still leave your environment and reach a third-party engine. Shade markets a "privacy-first" stance with end-to-end encryption and an AI privacy guarantee that data is never used for training, while still being a cloud platform that indexes on ingest (Shade, checked Jun 2026). LucidLink's feature is a partnership: Moments Lab's MXT-2 does the indexing, which means the intelligence lives with a separate AI vendor that your footage has to reach.
Local processing is the other end of the spectrum, and it is more practical than it used to be. whisper.cpp, a C/C++ port of OpenAI's Whisper, transcribes on a plain CPU and accelerates on Apple Silicon (through Metal, with optional Core ML support for the Neural Engine), NVIDIA CUDA, or cross-vendor Vulkan, with no per-minute fee and audio that never leaves the machine (whisper.cpp project, checked Jun 2026). The same is true of a CLIP-style embedding model, which is small enough to index frames on a workstation or a NAS with a modest GPU. The honest tradeoff: local indexing is bounded by the hardware you own, so a one-time backfill of a 200 TB archive will take longer than renting a fleet of cloud GPUs for an afternoon. For ongoing ingest on a small team, local keeps up fine. If you want the deeper version of this argument, see local vs cloud AI indexing and the privacy cost of cloud AI search.
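The hardware-bound tradeoff is easy to estimate on paper. The sketch below assumes you have measured how many times faster than real time one machine processes footage. That `realtime_factor` and the worker counts are hypothetical inputs you would measure yourself, not benchmarks from any project.

```python
def backfill_days(footage_hours, realtime_factor, workers=1, hours_per_day=24.0):
    """Estimate days needed to index an archive.

    footage_hours   -- total running time of the archive
    realtime_factor -- measured speed of one worker (4.0 = four times faster than real time)
    workers         -- machines or GPUs running in parallel
    hours_per_day   -- how many hours per day the workers are allowed to run
    """
    if realtime_factor <= 0 or workers < 1 or hours_per_day <= 0:
        raise ValueError("realtime_factor and hours_per_day must be positive, workers >= 1")
    processing_hours = footage_hours / (realtime_factor * workers)
    return processing_hours / hours_per_day


# Hypothetical comparison: one workstation versus a rented fleet of 20 workers,
# both assumed to run at 4x real time per worker.
one_workstation = backfill_days(1000, realtime_factor=4.0)
rented_fleet = backfill_days(1000, realtime_factor=4.0, workers=20)

print(f"one workstation: {one_workstation:.1f} days")
print(f"rented fleet:    {rented_fleet:.1f} days")
```

Under those assumed numbers the single machine needs a bit over ten days of continuous running and the fleet finishes in about half a day, which is the shape of the tradeoff described above. Measure the speed on your own footage before trusting either figure.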
What it costs, and how the meter runs
There are really only two pricing shapes, and the difference matters more than the headline number. Per-seat-plus-storage pricing is predictable: you pay for people and capacity, and indexing is bundled. Metered pricing scales with how much footage you push through, which is great until you backfill an archive. Most platforms blend the two, with a generous-sounding seat price and a credit meter humming underneath.
| Option | Billing shape | Representative price | The catch |
|---|---|---|---|
| Shade | Per seat, AI bundled | $20/seat/mo annual ($25 monthly), 500 GB active storage per seat, unlimited indexing | Storage beyond the per-seat allotment pushes you to custom Enterprise; reported average contract is $10,000-$15,000/yr for a 10-user, 25 TB setup |
| iconik | Per seat plus metered AI credits | Standard user $65/mo, Power user $120/mo, Collaborator $9/mo; AI is pay-per-use credits | AI credit and per-minute rates are not published; transcription and recognition draw down credits as you index |
| LucidLink + Moments Lab | Storage subscription plus partner AI | Custom; MXT-2 priced by contract duration and usage | Two relationships and two bills; AI indexing terms negotiated separately from your Filespace |
| Raw cloud AI (AWS, DIY) | Strictly metered, per minute | Rekognition label detection $0.10/min; Transcribe from $0.024/min (tier 1) | You build the pipeline yourself; a 1,000-hour archive backfill is roughly $6,000 of label detection plus about $1,400 of transcription |
| Local (whisper.cpp, local CLIP) | One-time hardware, no meter | $0 per minute after your existing machine | Bounded by your own GPU/CPU; large backfills are slower than renting cloud capacity |
The metered numbers are the ones that surprise people. Amazon Rekognition video label detection runs $0.10 per minute and Amazon Transcribe starts at $0.024 per minute at tier 1 (AWS pricing, checked Jun 2026). Those sound trivial until you do the archive math: a thousand hours of footage is 60,000 minutes, so a single full-library label-and-transcribe pass lands near $7,400 of raw API cost before anyone has built a search box on top. Vendors who bundle AI into a seat price are absorbing some version of that meter, which is exactly why their storage allotments are capped and overages route to a sales call. For a full cross-platform teardown of the add-on math, see the true cost of AI add-ons across the major platforms.
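The archive math is simple enough to keep in a script and rerun whenever your library grows. This sketch hard-codes the two per-minute rates quoted above and assumes every minute bills at the tier 1 rate. It ignores volume tiers, free tiers, storage, and data transfer, so treat the result as a rough estimate.

```python
# Rates as quoted in this article (AWS pricing, checked Jun 2026).
REKOGNITION_LABELS_PER_MIN = 0.10
TRANSCRIBE_TIER1_PER_MIN = 0.024


def metered_pass_cost(footage_hours):
    """Raw API cost of one full label-and-transcribe pass over an archive."""
    minutes = footage_hours * 60
    labels = round(minutes * REKOGNITION_LABELS_PER_MIN, 2)
    transcription = round(minutes * TRANSCRIBE_TIER1_PER_MIN, 2)
    return {
        "minutes": minutes,
        "labels": labels,
        "transcription": transcription,
        "total": round(labels + transcription, 2),
    }


cost = metered_pass_cost(1000)
print(f"{cost['minutes']:,} minutes")
print(f"label detection: ${cost['labels']:,.2f}")
print(f"transcription:   ${cost['transcription']:,.2f}")
print(f"total:           ${cost['total']:,.2f}")
```

For 1,000 hours this works out to $6,000 of label detection plus $1,440 of transcription, $7,440 in total. Re-indexing with a newer model means paying for the pass again.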
Why two products with the same features still search differently
Two platforms can both claim "objects, faces, transcription, semantic search" and still feel completely different in daily use, because the quality lives in choices the marketing page does not mention. How often does the model sample frames? A system that grabs one frame every two seconds will usually miss a half-second insert shot that one sampling every quarter-second is guaranteed to catch. How good is the embedding model at your specific content, sports versus interviews versus drone b-roll? And how is the vector index built? That determines whether a search across millions of clips returns in a blink or after a long pause.
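The frame-sampling point can be checked with a small simulation. It assumes the simplest possible sampler, one frame grabbed at fixed intervals starting at time zero. Real products may sample adaptively or on shot boundaries, and the function names here are hypothetical.

```python
import math


def shot_is_sampled(shot_start, shot_len, interval):
    """True if a frame grabbed at 0, interval, 2*interval, ... lands inside the shot."""
    first_sample_at_or_after = math.ceil(shot_start / interval) * interval
    return first_sample_at_or_after < shot_start + shot_len


def catch_rate(shot_len, interval, steps=1000):
    """Fraction of evenly spaced start offsets at which the shot gets sampled."""
    caught = sum(
        shot_is_sampled(i * interval / steps, shot_len, interval)
        for i in range(steps)
    )
    return caught / steps


# A half-second insert shot under the two sampling rates discussed above.
slow = catch_rate(shot_len=0.5, interval=2.0)
fast = catch_rate(shot_len=0.5, interval=0.25)

print(f"one frame every 2 s:    caught {slow:.0%} of the time")
print(f"one frame every 0.25 s: caught {fast:.0%} of the time")
```

With these inputs the two-second sampler catches the insert only about a quarter of the time, depending on where the shot happens to start, while the quarter-second sampler always lands at least one frame inside it. Denser sampling also means more frames to analyze, which is where the cost side of the tradeoff comes from.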
Shade is a useful illustration of where the engineering effort goes. The company uses OpenAI for some custom metadata generation but built its own distilled semantic-search engine in-house specifically so it can run searches across millions of assets on ordinary CPU infrastructure instead of expensive GPUs (TechCrunch, Apr 2026). That is not a cosmetic decision. It is the difference between an index that stays affordable at scale and one whose cost climbs with every clip. iconik reports its engine has completed 5.7 million AI analyses, 5.1 million transcriptions, and recognized 578,000 faces, which is a fair signal of a mature pipeline rather than a fresh bolt-on (iconik AI page, checked Jun 2026). When you evaluate, test on your own footage, not the demo reel, and watch for the moment the system should have found and did not. For the question of whether all this actually saves editing hours, see does AI search actually save editors time.
Where this fits into a self-hosted mount
If your footage lives on your own NAS behind a real mount, the AI-search question changes shape: you are no longer choosing whose cloud indexes your media, you are choosing what to run next to it. The local route covered above, whisper.cpp for transcription and a CLIP-style model for frame embeddings, runs perfectly well against storage you already own, and the index stays on your side of the network. JuiceMount keeps a local search index for fast filename and metadata lookups, which covers "find the file" but is not, and does not claim to be, a frame-by-frame semantic engine. That deeper layer is something you bolt on with the open models above, and it is honest to say the polished, turnkey "type a sentence, get the moment" experience is still easier to buy from Shade or iconik than to assemble yourself today.
The reason to keep it local is not that the cloud products are bad. They are genuinely good, and for many teams the bundled convenience is worth the bill. The reason is ownership: with metered AI search, indexing a large archive is a recurring or one-time cost measured in thousands of dollars, and your footage has to travel to a third party to earn it. On owned storage, the meter is your hardware, and the embargoed cut never leaves the building. Which side of that tradeoff is right depends on your archive size, your client contracts, and your appetite for assembling tools, which is the whole point of knowing what is actually running.
Sources, checked June 2026
- LucidLink blog, "Unlock your media library with AI and real-time access" (Moments Lab MXT-2 partnership, what it indexes: faces, objects, logos, on-screen text, diarization, semantic clustering).
- iconik artificial-intelligence page (transcription, facial recognition, object/scene detection, summarization; proxy-only analysis, originals not sent externally, no training on customer content; 5.7M analyses, 5.1M transcriptions, 578K faces).
- iconik pricing page (Browse $0, Collaborator $9/mo, Standard $65/mo, Power $120/mo; pay-per-use AI credits).
- Shade AI search and product pages (scene detection, face clustering and name search, timecoded transcription, object/product labeling; privacy-first, end-to-end encryption, no training on customer data).
- Shade pricing (Growth $20/seat/mo annual, $25 monthly, 500 GB active storage per seat, unlimited AI indexing).
- TechCrunch, "Shade lands $14M..." Apr 2026 (in-house distilled semantic engine running on CPU; OpenAI used for some metadata; reported average contract $10,000-$15,000/yr for 10 users and 25 TB).
- Amazon Rekognition pricing (video label detection $0.10/min, tier 1).
- Amazon Transcribe pricing (from $0.024/min at tier 1, billed per second).
- whisper.cpp project (local CPU/Metal/CUDA/Vulkan transcription, no per-minute fee, audio stays on device).
- Vultr and community references on CLIP frame embeddings and FAISS vector search (512-dim embeddings, shared text/image space, cosine-similarity retrieval).