All Roads Lead to Archive: Episode 3 - A Chat with Grau Data - #AI can be a lot more affordable

Every time your AI processes a file, you're paying a hidden tax. And most companies are paying it again and again without realizing it. It's called the AI token tax. And for data-intensive organizations, it quietly drains millions in redundant compute and storage every year. Today, we're going to show you how to stop it.

All right. Welcome back to All Roads Lead to Archive, a Qualstar series about data, media, and archives. I'm Jeff Sengpiehl. I'm joined by David Cerf, the chief data evangelist at Grau Data, a company with deep roots in tape and large-scale archive. David, Grau's background is in safeguarding massive archives. How does that history prepare you for this new AI landscape?

>> Well, first, thanks for having me on today, Jeff. Appreciate it. We love to share what we're working on. And, you know, our history is deep in archive. It's taught us that customers don't really want to just store their unstructured data. They want to be able to turn it into something that is beneficial to the business. And now AI is creating a new pressure. Machine-generated data from satellites, genomic sequencers, production sensors, autonomous systems, and anything that has M&E workflows is generating millions of files every day. Yet about 80 to 90% of that data isn't even accessible for AI or analytics. So most of it just sits on some type of deep archive storage, whether that's an object store or a tape library. You might be able to search a bit for it, but most of it's invisible, particularly some of the critical insights that are hidden inside a file. And so the only way to get there might be some type of

manual process. That's what's also keeping project costs climbing and creating real, you know, laggards in the way we're seeing performance in AI return on investment. So the background from Grau Data was our very large, massive archive customers saying to us, how do we better use our unstructured data?

>> And I know from the media and entertainment side, there's a ton of unstructured data that folks would love to turn into something usable, but there's a cost to get there. So let's get right to that cost. What exactly is the AI token tax?

>> That's a great question. Imagine running the same five-minute video through three different AI processing tools, like a transcription, maybe a sentiment analyzer, compliance scanning, and each tool reads the full file. It breaks it into tokens, right? So we take a file, it goes through an embedding engine that generates a token, and that token then gets stored into a vector database, and that burns GPU cycles. So with three different AI tools hitting it, you've just paid three times for the same work. Now scale that across 100 million files. That's millions of dollars a year wasted in redundant processing. It's the hidden cost which is making it very hard for organizations to realize their AI objectives. It balloons budgets, slows down workflows, consumes resources, and also limits the access to critical data, because getting that information out of the file is actually pretty hard.

>> Yep. So, let's take a look at how a million dollars would add up in that.

>> So, the breakdown in this image is really talking about where do we really pay the penalty, or what we call the AI

tax. And you'll notice that it's in the data prep. There's a certain amount for compute, of course, the GPUs that you're putting together and the associated cost, but there's this cost of extraction, which a lot of times is individuals. It's the human cost to prepare that data. And of course, there's all the workflow that's related to that: I've got the file sitting here, I've got to open the file. That's a large cost, touching that file to open it. There's the harvesting of data out of that file, which is the generation of the insights that come from the file, and many of these files will have embedded metadata which is very difficult to get to. So it's this processing cost that involves human labor that can consume 80-plus percent of the expense in data prep.

>> And that's where everyone thinks, oh, it's AI, it's not going to cost me anything. But because the data is not structured, you are going to have to pay to get it in order to then allow AI to touch it. I know tons of libraries of data sitting around on older generations of tape. Because they didn't have the benefit of this concept when they were created, it's going to cost a lot to try to get insights from them. So I think also, looking forward, if you are looking to repack or repackage an archive, going through this process is going to save you or someone you're trying to sell that data to. Whether it's for AI inference and learning for someone else's models, or to sell the entire library and remonetize that particular set of content, they're not going to have to pay repeatedly to get these insights, to pay the tokens on getting it for transcription, for translation, for the

emotional sentiment, to find the Dodge logos all through it, all that fun sort of stuff. So, I mean, what we're solving here, it sounds like it's a fragmented pipeline that we're trying to get through.

>> It's true. I mean, what we have is a bit of legacy in the way that these AI workflows work with the current access to storage. We all understand the tiers in storage. You have performance tiers, and you have your nearline, and then the deeper archive, as you were pointing out. And the real challenge is in these large environments where we're generating massive volumes of data. The goal certainly is to get it off the expensive tier fairly quickly. We want to get it into an archive, but the AI workflows have changed to where we don't know what we need. So there is the core of the problem: we're keeping everything up in an expensive tier, and we're touching it over and over and over again. And to break that cycle, it's what we call harvest once, reuse multiple times: the ability to get the insights out on file creation. That's really where you break it. So as data is coming in, we'll go in and open the file on creation, harvest those insights, and generate a persistent proxy of all this metadata. And then as new data is generated again and again, it's just expanding the proxy, and that proxy can now feed multiple workflows. So it can feed an embedding engine that will generate the vectors, and that is now servicing these multiple AI workflows without having to go back and touch the

file over and over and over again.

>> And I think that would also eliminate having to have duplicate storage and the duplicate cost, especially at the high end, letting you get to that archive sooner. I mean, it also kind of eliminates these silos, like the visual here is going to talk about.

>> Absolutely. So this is where you're really benefiting from an active archive architecture, right? It's the fact that by impacting the workflow and capturing the insights up front, you can keep a flow of that file to wherever it needs to be. Whether there's one or more copies, those are compliance and governance requirements. But getting it off primary storage definitely reduces the redundancy, because most of the time, between 80 and 90% of the time, you never need the file again. What you want is these insights, this embedded metadata and related contextual information that's being extracted. And so all of a sudden you're using less performance storage, but more importantly, you have 80% fewer tokens being generated, 80% less compute being used, you know, 30% overall lower AI infrastructure costs. All of these things now can service these key workflows like search or RAG and archive requirements, all from a singular source of data.

>> And that's breaking that whole cycle that we've been in and just thought was the way we always should do it. Something new comes along, we'll just run it again. We'll just run it again. We'll just run it again. Oh, they're going to turn on Three Mile Island so we can run it again. We're just going to run it again. We don't have to run it again, because the country is not

going to have enough power to continue to run everything again anyway. Not to mention the fact that most organizations aren't going to want to pay for this over and over and over again when they're seeing only incremental possible changes in the amount of information they're gathering from the new AI runs each time as well. So, I mean, how do we stop this cycle?

>> Well, I think what you just said is important, which is that the idea was that they were going to be able to run it again. And so we've seen storage focus in AI primarily around performance. Everything has been faster, and there have been tremendous things done to accelerate that. And there's a lot of data that is real time that wouldn't necessarily fit into this model, and so you could think of those as instantaneous elements that are being generated, so faster access is critical. But in verticals that generate these massively large volumes, so anything that has imaging of any sort, life sciences where it's genomics files, manufacturing where it's sensors off a production line, this machine-generated data is fixed content, and its value is from within the insight. So it's not like they really need to hit that file. So with performance on accessing that file, even though you can make it instantaneously faster, that redundancy doesn't go away. You have to change the architecture. So this is a disruptive approach to thinking about how you store that data and make it available back to the data users, whether it's AI workflows or it could also be feeding general analytic workflows, like if you're provisioning back into other applications or data users are trying to

feed internal processes. You have this one metadata proxy that can service whatever needs to access the data. And so that is a very unique way to think of it. And of course the key in the storage architecture is you can now take that file and put it where it's most effectively and most cost-effectively stored.

>> And I know from my point of view, I'm all about the most cost-effective storage. The thing here is it's a disruptive workflow upon a workflow that's already disruptive. And the thing is, you know, AI is disruptive in unique ways: power, cooling, data prep. And being able to disrupt the things that are going to cause that hockey stick of the cost there, that sounds like a wonderful thing for most organizations. So speaking of most organizations, who's doing this today with MetadataHub?

>> I could share, you know, two really distinct use cases. One is from a life science institute, Zuse Institute Berlin. They've been one of the earliest adopters of MetadataHub and have actually helped us to find new use cases constantly, which is pretty exciting. They're very large. They have 200 petabytes, so they have a very large infrastructure. That's all on tape. They also have a petabyte of high-performance storage. And their goal is that when files are created, they capture the insights, the metadata, instantly, so they can move that file to tape as fast as they can. They were able to catch on to the idea of that ROI through storage archiving

very, very quickly. But their initial use case with us was driven by their researchers who were generating these files and needed deeper insights that came from inspecting the file and extracting this metadata, which can then drive what they were trying to do in discovery. And so that ability to be driven by the business case, which is, you know, faster discovery in microscopy and genomics research, and to feed and provision into those applications, drove their IT infrastructure to realize that once we've captured this data, we could store it better. So that was really part of the learning curve for us: watching customers discover something in real time that we could repeat and share with anybody. So, you know, life sciences look very similar across anything research related, high-performance computing. These all look very, very similar. But we have a very different use case with an automotive manufacturer, a global manufacturer with multiple production facilities, that had a requirement on legal hold. On each of their production lines there are cameras, so there was computer vision where they need to do inspection from the manufacturing line, and they're producing tens of millions of new files per year, about 50 million new files per year, that have very strict governance and legal hold requirements. So each file has a unique identifier that is created based on that production line, and that unique identifier drives and triggers a workflow. So the MetadataHub is extracting the information off the image files and combining that with the unique identifier that then drives a

data orchestrator, a data mover. And that data mover can then say, based on these elements that we've surfaced, we're driving their policy. It actually sets the triggers that then say where this data gets stored and for how long. And so the customer is able to create a fully agentic workflow that allows them to go from file creation all the way to preservation and compliance through their process. And that was critical because it's a very large entity. They didn't want to have errors, you know, human errors that could be very costly for them. And this was able to reduce their storage cost by 50% in the way that they were able to use archival storage. It accelerated their entire data handling and provided this transparent, automated workflow for them. So this was a very powerful use case that you see commonly in anything in manufacturing. We're seeing more and more computer vision as we see more drones or even robotics that are going to constantly have this streaming video content. A lot of that would be something that they might want to use for risk assurance when they're checking, you know, buildings, bridges, infrastructure. Those are interesting use cases. Anything that's flying in the air, satellite streaming imaging, these are all interesting, similar use cases.

>> And what I'm seeing there is you're not only providing the proxies and providing agentic workflows, but you're giving further agency back to the original media and materials, because it now has the ability to more quickly get processed and turned into actionable insights. The orchestrator shows this

sort of thing happening all the time. Oh, we need to make a production shift, or we need this different sort of imagery coming in from the microscope, and, you know, it lets you change things on the fly a lot faster. And I'm assuming in some industries that's probably going to save a good chunk of money.

>> You know, that's a great point, and I think it's worth noting here, Jeff, that the MetadataHub is an enabling technology. It's enabling the way we can do things that come from these deeper, accessible insights. So you're enabling a better, more effective active archive. You're enabling data provenance, which improves RAG, right? So it's not like we're solving a RAG problem, for example. We're enabling it to be solved or improved. And same with this embedding and the AI tax: it's the ability to enable just a better workflow and architecture. So that's where MetadataHub fits. It improves something from your unstructured data and makes it more usable.

>> Yep. And that's what we've always been about: making the unstructured data more usable, as, you know, we create more and more of this stuff as we go along. So MetadataHub is providing the why, and the data mover is providing the how. How do these two play together nicely?

>> You said it exactly, which is that MetadataHub is providing this intelligence layer, which is the ability to extract these insights that could be fed into the data mover orchestration tools and drive those workflows, drive those AIs. But one thing might be worth noting here, if I

may, is that if you're interested in being able to use tape, we're seeing such a resurgence in tape, driven by data sovereignty. So people are looking to bring data back in house. That's a big, big issue. There's also the ability to take advantage of the additional security through air gap, and of course cost, but, you know, cost has sensitivity. If you're only going to keep the data short term, maybe you don't recognize it, but for anything that's long term, it's a buy-versus-rent decision. It's owning tape versus renting some type of service. And so Grau has another solution called XtremeStore, which is able to turn any tape library into an S3 Glacier-style deep archive. So when that's combined with the Qualstar library, all of a sudden you have this on-premises Glacier that is massively scalable. So the Qualstar libraries can scale in rack size, and with an XtremeStore you can put one or many libraries behind it to present a single S3 deep archive that can go into tens of petabytes. And so this allows a customer to have an even cleaner stack. You've got the source data that can come from your file systems, your on-premises or cloud-based archives. The MetadataHub is extracting the embedded metadata and then harmonizing it. So it's bringing it together from disparate sources, so it's all searchable and then can be provisioned. And it can provision into a data mover, and that data mover can then write to the S3 front end, XtremeStore, which can then leverage the tape. So when we look at the stack, this is how this all comes together to improve whatever the customer is

looking to drive.

>> Yep. And the thing is, a cleaner library is always good, and from the user's perspective and from the software perspective, from a lot of applications out there, it's going to look just like S3. Right?

>> Exactly. Exactly the same APIs. And as a matter of fact, I think we're the only purpose-built S3 tape object archive. There are certainly other products that do it, but they came from legacy, et cetera, and they all have their purpose. But at Grau, in our decades of tape, we saw this coming, and we developed XtremeStore purpose-built for this long-term S3 object behavior that would emulate pretty close to 100% what your experience would be with a cloud-based object store.

>> That's perfect. And I think that's probably a critical point for the audience here. It lets tape be as easy to use as cloud storage is in the concept of your workflow.

>> Well, I think most applications are all moving to object-based workflows. So, I mean, there is a difference: you do need to be able to understand there's S3 and then there's deep archive, and deep archive is the ability to say, when you're writing up to the cloud, your application knows that there could be a delay. So for any application that already supports deep archive, writing to the XtremeStore is going to look no different. And I think the world's heading there. The volume of data is driving it. You don't really have a choice but to be able to better use it. We're seeing this: most cloud vendors are already supporting it. Applications are aware of

it. So, it's becoming pretty standard.

>> Yep. And the thing is, if you've got it on prem, it's a lot less than the cost of the cloud, which is just going to be tape in someone else's data center. And you get that full sovereignty. So that researcher in the earlier example is never going to lose access. They can still always see the file in the MetadataHub catalog no matter where it is.

>> Well, with a data orchestrator. So the MetadataHub is always going to provide a persistent view of the file and actually knows where that file is. So you might have multiple sources that have been harvested from. And so when the user is looking at that, if a file is being moved, so you have an orchestrator that moves it from A to B, the orchestrators can integrate with the metadata. So as the file moves, all of that updates into the metadata tags, and it provides a consistency for that user, which is ideal. So they can now better use an archive again and search. But in most cases, as I mentioned, a lot of times you just don't need that file. So yes, they can have a view of it, but it's parked safe, secure, you know, protected based on what the business requirement is for retention.

>> Yeah. And one of the things I'm hearing a lot of is, you know, concern about the sovereignty of data and it being used by someone else for insights. And since you've got smaller proxies, if those are in the cloud gaining new insights, getting rid of those is a lot less painful when you still have all your original data where you can also

have a copy of those proxies as well. So it basically allows you to have the best of both worlds, both on prem and in the cloud, to keep your data safe and have the cloud to get those new insights as they come along, because you're simply just running it on the proxy and pulling all that metadata that you've already had.

>> You know, I think we've all seen that there is no one-size-fits-all. So we have data ecosystems that need to be able to support multiple different types of use cases, and that's really what needs to be supported. So the MetadataHub is able to, you know, connect everything to provide visibility and a self-aware data ecosystem. You combine that with automated workflows to move files back and forth, and they're policy driven. Well, those policies are now driven by the content value, not just arbitrary when-was-the-file-created type drivers in a policy engine. So you're really getting to what I think we've been waiting for for over 20 years, right? Which is a level of intelligence inside the data movement and workflow retention. All of it is, again, let's come back to what's driving it: AI. AI workflows are changing it due to performance, cost, and access to that data. And so having an intelligent infrastructure that works within your AI workflow is critical to that.

>> So CFOs, close your ears. It's not just a cost story. You're changing the entire workflow, making the entire front end before the AI more intelligent.

>> You are. And you know, it's interesting. I shared that use case with Zuse, where it started with the data scientists and their need to do their job.

What it really does turn out, though, is at some point you've got to pay for it. So, you know, the CFO does care, right? He's going to benefit, not just from the fact that you're reducing what is killing AI deployment. I think most people have probably seen the headline: AI is not reaching, you know, its objectives and ROI. And a lot of that just comes down to how something was implemented. But most of it's probably cost. It's more expensive than people were thinking, just because we're still early in the curve and we're solving lots of things, right? We're getting better power utilization in GPUs and more efficiency, and the MetadataHub is just one of those efficiencies. It's helping remove something that is, you know, a redundant cost in the system that doesn't need to be there. But at the end of the day, you know, a huge part of this is we're going to keep this data forever, right? That's just the way the world is. And so you need to bring it across your entire system, and storage is still a big cost. It may not be the most urgent thing, because people have to get results, so they always are pushing the application layer first. But when you can deploy something that helps you solve both, more efficient use of your data and immediately helping you reduce your cost in the infrastructure and storage, that's a pretty powerful statement for the CFO.

>> It is. And, you know, who should be paying the closest attention to this conversation?

>> Well, certainly the researchers, the guys that consume the data, the data architects, and anyone who is thinking about how they're implementing their AI

infrastructure. So we find it's those guys, from the ones who have to process and touch that physical data to the ones architecting the systems, who are the guys that are hands-on. But we're also finding the C-level guys, because they're the guys that have to answer for this. You know, companies are being asked, so you're implementing AI, and what are your results? They need to be able to have better control to get more efficiency, so we're getting the C-level attention as well. So when we think about how an organization can benefit, let me just grab an average size for a big enterprise, or a mid-size enterprise actually. Most enterprises probably have about a million files or so. Many of them are even growing at a million files per year. And when you think of the usage, probably they're not going to process all of those through AI. But if they capture the metadata on file creation, so it's in the repository, they're probably touching between 10 and 20% of those files per year. So when you think of the vector processing cost, you've probably got a couple of different workflows that are running per month, times 12 months. You add this up, and you're looking at what could be a million dollars a year in cost that could be saved for an organization. You do that over a three-year period, and it's compounded, because every year we're going to add more AI applications and processing. So those use cases can increase, the data volumes increase. Nothing gets smaller, right? And so over time, you're looking at greater than $3 million over three years. That's a pretty powerful ROI.

>> It is a really powerful ROI. And the

thing I think you're missing out on there is, because I came from the world of big data storage for video, I think if you're saving 3 million on 1 million files over three years, that's great. But I think your file counts are low. I think these files get into the billions and billions in each organization. So, you know, on the video side, I know we used to have file maximums for spinning disk storage sets at 4 million. They were more than exceeded, and these file sets are much larger today. So I think if we use a use case of a million dollars saved for a million files per year, you can see how that will scale inside of your organization and allow there to be a lot more cost savings. I think there's a huge benefit here.

>> For sure. And I was trying to just put it in context for the average enterprise, but you're absolutely right that verticals are going to have huge variations. So with anything like research or anything in streaming imaging, most of these customers are generating hundreds of millions of files per year, with billions in the archive. So I think what they're all trying to get a handle on is, one, we want everything to be usable. So harvest the metadata on file creation, because then it's AI ready. Once you've done the harvest of the metadata, it doesn't matter what you do in storing the file. It's ready to then process, whether or not you run the vector, run the embedding. There are two options. We recommend as best practice that once you've harvested the metadata, just run the embedding. It's a trivial cost, so then everything's in the vector

database, ready and set. The other option is to just leave it all in the metadata proxy, and then you can selectively query, and you can go back and embed as you need it. So it depends on what they're looking to do. But I think that we move forward with AI, and I think most people agree that nothing's been more impactful, you know, than AI. Nothing's going to be more impactful than what's coming from AI. We're only going to use this data more and more. So again, we're not going backwards, right? It's all forward looking: we're going to use more of it. We just don't know how yet.

>> Yep. We're going to continue to spend more on it. So let's figure out how we can really spend less on it. And this is a way to get there.

>> Exactly.

>> And the other thing I'm hearing out there is a lot of organizations simply have one question: how much are we spending on AI? They're not asking, what is the return on investment we're getting from AI? And that's the thing that really moves the needle, because we can't just continue to shovel cash into a GPU and think that it's going to be a wonderful thing. It needs to have real business results for real business problems. And I think that's really the road from AI chaos to this intelligent archive. You've got the one harvest, the multiple workflows, and you've got your compute and storage working together. David, thanks for joining us.

>> Well, thanks, Jeff. And for anybody who would like to actually assess this, we have an ability to provide a landscape report where they can see what they have. I think the question a lot of companies have is

how many files do I have? So we can provide quick audits and let people see how this can make a difference with their own data.

>> Exactly. And I think you've got a link where you can go to moremetadata.com and request a free AI tax audit of your own. It's always good to have a tax audit as the end of the year approaches. That'll show you how much you're going to possibly save and what those redundant costs really are to your organization. David, thanks again for joining us. Folks, if you're here, you need to be here. So I'd like to see if you can like and subscribe, and I look forward to our next episode of All Roads Lead to Archive.
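
The "harvest once, reuse multiple times" idea from this conversation can be illustrated with a small sketch. This is not Grau Data's implementation and not the MetadataHub API: `extract_metadata`, `MetadataProxy`, `run_naive`, and `run_with_proxy` are hypothetical names made up for the illustration, and the "expensive" step is only simulated. It simply counts how often an original file has to be opened when three workflows each read every file, versus when they share one persistent metadata proxy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def extract_metadata(file_id: str) -> Dict[str, str]:
    """Stand-in for the expensive step: opening a file and harvesting its metadata."""
    return {"file_id": file_id, "summary": f"metadata for {file_id}"}


@dataclass
class MetadataProxy:
    """Hypothetical persistent proxy: harvested metadata keyed by file ID."""
    records: Dict[str, Dict[str, str]] = field(default_factory=dict)
    file_opens: int = 0  # how many times an original file was actually opened

    def get(self, file_id: str) -> Dict[str, str]:
        # Harvest on first sight only; later callers reuse the stored record.
        if file_id not in self.records:
            self.file_opens += 1
            self.records[file_id] = extract_metadata(file_id)
        return self.records[file_id]


def run_naive(files: List[str], workflows: List[str]) -> int:
    """Each workflow opens every file itself; returns the number of file opens."""
    opens = 0
    for _workflow in workflows:
        for file_id in files:
            extract_metadata(file_id)
            opens += 1
    return opens


def run_with_proxy(files: List[str], workflows: List[str]) -> int:
    """Every workflow reads from one shared proxy; returns the number of file opens."""
    proxy = MetadataProxy()
    for _workflow in workflows:
        for file_id in files:
            proxy.get(file_id)
    return proxy.file_opens


if __name__ == "__main__":
    files = [f"video_{i:03d}.mp4" for i in range(100)]
    workflows = ["transcription", "sentiment", "compliance"]
    print("naive file opens:", run_naive(files, workflows))
    print("proxy file opens:", run_with_proxy(files, workflows))
```

With 100 files and the three workflows named in the episode, the naive path opens files 300 times and the shared proxy opens them 100 times. Real savings depend on per-file processing cost, how many workflows exist, and how much of each workflow can actually run from metadata alone, none of which this toy model captures.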