Bob DuCharme—semantic technology veteran and author of the definitive O’Reilly guide Learning SPARQL—recently shared his thoughts on helping companies adopt knowledge graph technologies.
“I’ve always been interested in data that doesn’t fit neatly into tables,” he explained, describing how RDF’s flexibility enables organizations to start small and scale iteratively—unlike traditional database schemas that require complete upfront design.
DuCharme’s journey began with SGML (Standard Generalized Markup Language) and XML (Extensible Markup Language) in the early days of electronic publishing. He watched XML evolve from document markup to data exchange, eventually leading to RDF (Resource Description Framework, a standard for describing both structured and less structured data as subject-predicate-object triples) and the semantic web (a term Bob suggests we avoid nowadays — see his post “Let’s Stop Using the Term Semantic Web”).
Today’s enterprise knowledge graph builders combine multiple approaches: taxonomies or Linked Open Vocabularies like SKOS (Simple Knowledge Organization System), validation through SHACL (Shapes Constraint Language), and knowledge graphs that integrate diverse data sources. DuCharme emphasizes that the key question is always “What are you trying to do?” The right combination of tools, he says in this interview, depends on the specific problem.
At Graphwise (formed from the merger of Ontotext and Semantic Web Company), DuCharme works with technologies that extract structured knowledge from documents, validate data quality, and enable both SPARQL (SPARQL Protocol and RDF Query Language) queries and natural language interaction through tools like Talk to Your Graph, a key component of the company’s Graph AI Suite.
Rather than treating AI (another term Bob wants to avoid) as monolithic, DuCharme points out that “graphs just seem to do better than chunking and vectors.” The structured data provides guardrails and disambiguation that improve AI responses.
His advice for organizations? Use open standards like schema.org for interoperability, implement human-in-the-loop workflows, and remember that quality metadata is essential for getting value from your data. Follow Bob’s explorations in semantic metadata technology and music at bobdc.com.
Interview with Bob DuCharme
Technical Writer at Graphwise
The full recording of our conversation can be found in this YouTube embed:
For your convenience, the edited transcript of our conversation follows.
Alan Morrison: It’s Alan Morrison again with another issue of the GraphRAG Curator podcast, and I’m delighted to have with us today Bob DuCharme, who’s been in semantic technology for many years. You can follow his blog at bobdc.com. He’s been blogging there for years and he’s always such a clear writer—just a pleasure to read. Bob’s behind a book called Learning SPARQL that is another real model of clarity. So I’m happy to have Bob on with us. We’re going to talk about working inside companies and helping them understand and adopt semantic technology. So welcome, Bob.
Bob DuCharme: Well, thanks Alan, and thanks for your kind words.
Alan Morrison: I always start with a question about your background and how you got interested in this stuff, because it’s not obvious sometimes how people get started.
Bob DuCharme: Well, years ago I had a tech writer role at a software company in New York City. We may have even been using dedicated Wangs or something at that point, but we started to learn more about SGML—a structured but non-proprietary way to store documents so that you could easily turn them into other things. So I started going to SGML conferences and getting to know the key people. Then some of those key people thought, “Wouldn’t it be great if we had a simplified version of SGML?” First they called it WebSGML, then they thought of a catchier name: XML. That’s where XML came from.
So I found myself at a consulting firm, and also at companies like Moody’s Investors Service and LexisNexis, helping them transition to and take advantage of XML. I’ve always said I’m interested in data that doesn’t fit neatly into tables—so XML for tree structures. And then, for better or worse, the RDF effort grew out of the XML world. It was very helpful when they jettisoned the XML part. A lot of that work came out of there, and I liked RDF’s flexibility and simplicity—and as more and more commercial and open source implementations were out there to play with, I just started playing with them and doing more things.
I even wrote on my personal blog about some cool features of GraphDB even before I was an employee of the company, and I heard they were having some of their employees read my book. So that was nice too. It’s great that things I was doing for fun and writing about on my blog I now have a job at one of the key companies doing that sort of thing for a living.
Alan Morrison: Absolutely. And you mentioned XML and it’s still being used in some areas. I think immediately of Michael Iantosca at Avalara and what he’s doing with DITA (Darwin Information Typing Architecture) in a hybrid AI context today. Can you give us a take on how these vestiges of XML are still around and people are still using them to benefit?
Bob DuCharme: When XML first became big during the dotcom boom, there were these companies trying to figure out how they could send data—you know, seamless e-commerce from one company to another—that didn’t necessarily fit into tables. There was this new standard XML and there were parsers, so they started using that to send transaction data. Then a lot of these people started to complain that XML was badly designed. It’s like, well, it wasn’t designed for what you guys are doing with it. Eventually JSON (JavaScript Object Notation) started doing that sort of thing, but XML was originally designed for electronic publishing—to store content that can then easily be converted to multiple different formats as they come up.
The DTD or the schema for a document class—there are various specialized ones, but two of the big ones are DITA and DocBook. I used DocBook a lot. In fact, I wrote my SPARQL book using DITA, but then O’Reilly preferred DocBook because they were one of the original companies behind that, so I just wrote some XSLT (Extensible Stylesheet Language Transformations) to convert it. These were designed to be flexible structures for publishing but also to have lots of metadata—lots of slots for metadata to help you. I’ll say it: the whole point of metadata is to get more value out of your data, right? So there were lots of slots for metadata in the popular successful schemas like DocBook and DITA.
I think what Michael Iantosca and Avalara are doing is taking advantage of those slots—not only to put in descriptive terms but to plug them into curated taxonomies, managed metadata, so that you can manage at an even more granular level: not just documents but down to the paragraph level if you want. So that’s the relationship I see between the two—well, the RDF can be stored in there as well, or just the keywords—but there’s lots of flexibility in how you store metadata and take advantage of its relationship to the data it’s about.
Alan Morrison: Yeah. And I’m thinking when you’re talking about others who are in this hybrid world that Graphwise, for example, occupies—where you’ve got the so-called symbolic knowledge representation and the semantic metadata, and then you’ve got the probabilistic methods that data scientists and engineers are familiar with, and you’ve got vector embeddings in the mix. It seems like we’ve had a kind of monolithic approach to AI in much of what’s happened over the past five years, particularly with generative AI. The data science teams don’t seem to be that aware of semantic metadata and the value of it.
Michael’s an example of somebody who’s been on both sides of this and he’s putting these things together. Can you help us understand how these things are coming together to the extent that they are? How are we working with data science teams that might have things in tensors and they might have been stripping out the context of the data before they even start with training a dataset?
Bob DuCharme: Well, I think we can think in terms of a toolbox, right? There are these various tools and you can pick out the first, fourth, fifth, and seventh tools in your box to do one thing. Maybe it turns out the fifth one wasn’t what you wanted. But with things like vector embeddings—I’m even trying to avoid using the term AI anymore because it has meant so many different things over the years.
Pursuant to your point, what it often meant five years ago was machine learning with neural networks—deep learning—to create vectors to help show similarity, with cosine similarity or something like that. When people said AI five or six years ago, I think that’s pretty much what they meant. One particular application of that was to analyze—people were analyzing, and still are—chest X-rays or the paths of boats on the water. But then to analyze all the English available on the web and model that—a very large model, a large language model. Nowadays when people say AI, they mean the use of these large language models that have been created.
I think for people like data scientists, they want the data that’s going to help them solve their problem. Managing the data longer term, as far as I know, is typically less of an issue for data scientists. And it’s that management side—as a database manager, if you’re going to store data and try to get more and different kinds of value out of that data over time, you want to store it in an organized way, with some metadata. Because the metadata will help you get more value out of it—to navigate and find what you need, track provenance, and all the good things we’ve seen.
I think with data scientists, these things are less of an issue. They have a specific problem to solve, so they’re going to get the data to do it and then maybe move on, or maybe build a database. I’m not sure if that addresses part of what you asked.
Alan Morrison: Well, it seems that the methods of standard database query and retrieval have been somehow adjacent to, but not part of, what the data scientists are doing to train data. How are they bringing the datasets together? Should they be doing it in a different way to get richer relationships in the data? How would we change the processes to make this more of a smooth, holistic, cohesive operation?
Bob DuCharme: Well, I think like I said about picking out the first, fourth, and fifth tools in the toolbox, it depends on your goal. If I was a sales guy and someone was asking me about that, I’d say, “Well, what are you trying to do?” What is it that you’re trying to—oh, maybe the second and fourth and fifth tools here are what we want.
One of the advantages of the merger of Ontotext and Semantic Web Company into Graphwise is bringing together a wide selection of tools—the PoolParty taxonomy and ontology manager to help curate standards-based metadata in taxonomies and thesauruses. They also have many tools, as did Ontotext, for analyzing PDFs and text and pulling out facts in the form of triples.
So these are just some of the tools that can be combined. GraphDB, the triple store graph database manager, is where you put all these various things together—it can hold them together to build an application where you can store the data. It could be in clusters, it can be distributed—all to take advantage of these different tools, and then to pick: vectors, full-text search, or finding concepts identified by the kind of machine learning we were discussing. All of these different things are options that can be combined in different ways into what we call the Graph AI Suite. But there are other subsets of that as well that may meet certain customers’ needs. Getting back to: what are you trying to do for a given customer? How can we help?
And another thing I like in particular about an RDF-based approach is that it makes it a lot easier to start small and gradually build up from there. One of the things about RDF schemas: with so many systems for defining the structure of a set of data, you have to have a whole schema—figure out all your classes and all your properties and all the relationships—before you even input the first record. Other systems are proud of the opposite: “Oh, no schema, it’s schemaless, it’s so flexible.” Like with XML, it’s either a complete schema or no schema.
But with RDF, you can actually have a complete schema, you can have no schema, and you can have a partial schema. You can just describe a little bit of it in your schema, build a little application, take advantage of it, prove the value, and then add a little more to your schema to take advantage of a little more of that. So that kind of flexibility to iteratively build from a proof of concept to a robust enterprise application—that’s another thing I like about how all of the RDF-based tools fit together.
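[To make that concrete, here is a minimal sketch of a partial schema in Turtle. The class, property, and identifiers are our made-up illustration, not a Graphwise example: the schema declares just one class and one property, while the data already uses a property the schema doesn’t describe yet.]

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.com/ns/> .
    @prefix d:    <http://example.com/data/> .

    # Partial schema: declare only what we need so far.
    ex:Employee a rdfs:Class ;
        rdfs:subClassOf ex:Person .
    ex:hireDate a rdf:Property ;
        rdfs:domain ex:Employee .

    # Data can carry properties the schema doesn't describe yet.
    d:emp42 a ex:Employee ;
        ex:hireDate   "2024-03-01" ;
        ex:deskNumber "B-17" .   # add this to the schema later, or never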
Alan Morrison: And so once you’ve done that, once you’ve done a POC and you’re impressed with the results, the validation part of it comes into play as well. You want to check your own work to make sure you’re doing things in the most beneficial way you can, and that it’s actually working and doing what you thought it was doing. How does that happen?
Bob DuCharme: Well, one of the things in the early days of semantic web that kind of scared people off was the Web Ontology Language or OWL, which built on a lot of research in the knowledge representation world. It provided ways to do all kinds of complex inferencing, which is cool—we’re coming up with new facts here, we’re inferring them. But it was a lot of work for the computer for an OWL system to do that.
A lot of people saw OWL and said, “Let me say that an employee is a subclass of person and an employee has to have a given name and a family name and a start date”—you know, a schema. But OWL actually didn’t really do that. It let you infer things and it was looking at it from a different angle. So that was kind of off-putting to people—this very complicated thing that didn’t let them do certain basics.
So over time, a new W3C standard to help with all this called SHACL—S-H-A-C-L, a pun on the shackles that hold people or things in place—was developed to give people a very simple way to say, “A family name, given name, and hire date are required fields,” or “This value has to be in this range,” or “A given value has to match this regular expression,” or “This value has to be between this number and this number.”
So SHACL is part of the family of standards, and with a SHACL processor—which of course is built into GraphDB—I also like how you define the rules with triples, in RDF. If you ask, “Does this dataset meet these rules?”—whether it does or it doesn’t, the response is in triples, in RDF, which means it’s easier to fit that data checking into a pipeline.
So when you’re gathering all these tools together to meet certain needs, SHACL can fit into a pipeline to read from one process and feed into another—or maybe a separate process if there were errors. SHACL really has met a lot of needs, and I think a lot of people have found that it replaces what they were originally unable to get out of OWL.
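[For a sense of what those rules look like, here is a minimal SHACL shape in Turtle, with hypothetical property names of our own rather than ones from an actual deployment. A SHACL processor’s validation report comes back as RDF triples too, which is what makes it easy to slot into a pipeline.]

    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:  <http://example.com/ns/> .

    # "Family name, given name, and hire date are required fields."
    ex:EmployeeShape a sh:NodeShape ;
        sh:targetClass ex:Employee ;
        sh:property [ sh:path ex:familyName ; sh:minCount 1 ; sh:datatype xsd:string ] ;
        sh:property [ sh:path ex:givenName  ; sh:minCount 1 ; sh:datatype xsd:string ] ;
        sh:property [ sh:path ex:hireDate   ; sh:minCount 1 ; sh:datatype xsd:date ] .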
Alan Morrison: Makes sense. And I know that people like Jessie Talisman, for example, advocate starting with something other than OWL. If you’re just getting started, SKOS, for example, is something that Semantic Web Company, and now Graphwise, recommends people start with.
It’s also the case that when I look at the capabilities of the stack as it exists now, what I’m really impressed by is the fact that you get all these heterogeneous kinds of data together in this graph form where it’s all accessible, and it could be FAIR (findable, accessible, interoperable, reusable) information in there that can be reused. There are so many different challenges that can be addressed with this technology, and I try to get people interested in exploring the different use cases.
One of the things that Semantic Arts talks about—Dave McComb—is, well, you don’t really need to have all this code in the applications. You could take most of the declarations in these applications and just put it in the graph. You talked about rules being in the graph. Dave makes the assertion in his books that maybe you could get rid of 85% of the code in the applications and just slim your apps down to 15%. I’m thinking that’s in an agent-based environment—that’s the same case, right? A lot of this could end up in the graph where it’s more manageable to begin with. Am I wrong about this?
Bob DuCharme: Sure. I mean, I think the ease of managing it gets back to good quality metadata. There are ways—and part of the fun is that LLMs can help generate that sort of thing. But to have good metadata—that’s why the graph modeling parts of the Graphwise platform are there: to build the models that represent what you have so that people can take advantage of the relationships better. Just to say, “Oh, this is a subclass of this, so whenever you have an employee it’s actually a person too.” I mean, that’s a very simple example, but you can go even further—because that is the nice thing about schemas: they give you structure. You look over the structure before you look at the data and you get some idea of what’s trying to be done. And so I think quality metadata, of which schemas are a part, is important for navigating that.
Alan Morrison: Yeah. So what kinds of use cases excite you? When you’re doing your work, can you give us an example of the kinds of things that people are working on specifically to solve a problem, and where’s this technology becoming really useful that it hasn’t been in the past, for example?
Bob DuCharme: Things like what you mentioned with Michael Iantosca at Avalara—the work they’re doing. There are various organizations doing really interesting stuff. They have taxonomies—a lot of these organizations have had them for years—but they’re getting people to do more things with them. Instead of just having an official list of terms to use, when their storage conforms to standards like SKOS, it can then plug into more applications, so they can do more things with it and get more out of their data.
I mentioned earlier that the value in any given project comes from starting really small and scaling up from there. That’s why I’ve always tried on my blog to show just tiny little examples to build up from. I mean, I’m trying to think of some of our other customers—
Alan Morrison: Well, let’s talk about Michael for a minute, since people may not be familiar with what he’s doing at Avalara. I know that he has this platform that he’s put together. Avalara is a huge resource for state tax information. I’m not clear on how the platform [i.e., the one that Iantosca’s working on] relates now to what Avalara does commercially. Is it still in R&D land, or is it commercial? Do you know?
Bob DuCharme: No, I think it’s commercial. I mean, not all 50 states, but it provides another good example of the technology I mentioned. You take information that is in PDFs or whatever—just narrative text in paragraphs—and tax code and treasury regulations and things like that certainly count. But those do have a very specific structure as far as what pieces have to come in what order and how they relate. So you take advantage of that structure, and then entity extraction to know when the term “Charles Schwab” refers to the company versus the guy, or what things are related to other things—to pull that out of a set of documents and then encode it as triples, as RDF, so that you can do that kind of querying, whether with the SPARQL query language or the natural language querying that our “Talk to Your Graph” and various other things support.
So pulling information out of those documents and then making that part of the knowledge graph lets them do more things with the knowledge graph as opposed to just doing a full-text search of, “All right, here’s a document, let’s see if it mentions deductions”—of course it will if it’s tax—but “a specific kind of deduction” or “a yacht”—you know, “let’s search these documents for the word yacht.” To have a more structured search because we’re getting some structure by pulling things out that conform to a certain ontology, and we can take advantage of that either with SPARQL queries or natural language queries, or from another application that is reading this and has its own front end to what its users see.
So it’s not all just about people using it directly, but to power other applications that people can then build on. And that’s why these products all have APIs too, so that people can build entirely new products and tools for use by people who may not know that it comes from Graphwise.
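[As a rough sketch of what that extraction step produces, using a hypothetical vocabulary rather than Avalara’s actual model: the extractor emits triples that disambiguate the entity, and a SPARQL query can then search the structure instead of matching keywords.]

    @prefix ex: <http://example.com/ns/> .

    # Facts pulled out of narrative document text:
    ex:doc123 ex:mentions ex:CharlesSchwabCorporation .
    ex:CharlesSchwabCorporation a ex:Company .

A structured search over those triples, rather than a full-text search for the string “Charles Schwab”:

    PREFIX ex: <http://example.com/ns/>
    SELECT ?doc
    WHERE {
      ?doc ex:mentions ?entity .
      ?entity a ex:Company .   # the company, not the person
    }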
Alan Morrison: Right, right. It seems like it’s getting more and more possible for the interface to provide some level of automation. You mentioned Talk to Your Graph. TTYG is generating SPARQL queries, correct?
Bob DuCharme: Often. I mean, I think when you ask it a question, it gives you a response. You can even say, “How did you get that response?” It can show you the query. It might have been a full-text search to look for things.
I did a demo that’s on our website where I found a great site—I think it was called dummyjson.com or .org. It’s fake data, but it was for an online company selling watches and furniture, and there were ratings and all that, and it was all free fake data. So I did a video called “Build a Shopping Chatbot in Four Minutes,” and I walked through it—I think I even show the creation of the repository. I created an RDF version of that data, pulled it into a GraphDB repository, and then did the configuration—including full-text search—so that I could ask, “Show me all the watches where the average rating is higher than three.” This is in plain English—I would type that out. And I could say, “Show me the SPARQL query that you ran.” It was a toy example of a real-world problem—a shopping chatbot.
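[Behind a plain-English question like that, the generated SPARQL might look roughly like this. The property names are our guess for illustration, not the query the demo actually produced.]

    PREFIX ex: <http://example.com/ns/>
    SELECT ?watch ?rating
    WHERE {
      ?watch a ex:Watch ;
             ex:averageRating ?rating .
      FILTER (?rating > 3)
    }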
Alan Morrison: Yeah. I watched the TTYG webinar recently that Andreas Blumauer did, and there was a snippet of video from Semantic Partners where they were talking about the different classes of search that you could specify. So you could do, for example, the graph query. You could also do a similarity search, for example. Is it essential that you specify that sort of thing up front with TTYG so that the mechanism that you’re using knows where you want to go to retrieve the information that you’re looking for?
Bob DuCharme: Not necessarily. I know that when you create a new repository in GraphDB, you give it a name and so forth—whether you want to do any kind of inferencing, what inferencing profile you might be using with it. And one of the checkboxes is “enable full-text search,” and I always just check it. TTYG can really take advantage of that full-text search, which also hooks in with some of the connectors that GraphDB has for Elasticsearch and Lucene and some of the others. Those can be used to do Lucene-based searches of the GraphDB data—I think I did a blog entry or a video about that—so Lucene can be used to take advantage of things like “find this word within three words of that word.” It’s a nice example of using another tool with a different focus on the same data—more tools in the toolbox, tools number 11 and 12: Lucene and the others. And I believe the Lucene support is built in, and some of the other full-text ones require another license, I forget. But it’s more tools to pick from to do more cool stuff with the data.
Alan Morrison: Yeah, right, exactly. So when you bring agents into the mix, how are companies doing that most effectively considering this kind of platform? And I’m thinking in particular about sidestepping the risks that are associated with agents that might get into trouble if they’re not guided in the right way, or maybe given too much autonomy to begin with given the circumstance. Are you seeing a lot of that activity? Because there’s such a buzz about it right now.
Bob DuCharme: There is. I also think—I saw an article recently where they quoted some artificial intelligence researcher saying, “Oh, when it figures all this stuff out, it might be wrong and there’s no human in the loop.” And it made me think, you know, this must be from some researcher, because it seems like part of the buzz is so many companies saying, “And we can do human-in-the-loop, right? We’re happy to add that into the system. Please give us money.”
And along with the human-in-the-loop, I think with GraphRAG—if you’re going to ask an LLM a question, we have this opportunity to say, “And here’s some extra data, Mr. LLM.” You know, maybe that helps. If you gave it a bunch of PDFs—well, that’s too big. We’ll split the PDFs into chunks, and even that’s still not enough. So what we’re finding, of all the various ways to provide data to the LLM via GraphRAG, is that a knowledge graph works best—RDF data with structure, because they understand the structure. I mean, I’ve had fun just asking ChatGPT, “Show me the Turtle results if I had this RDF data and sent it a SPARQL query that did this kind of thing. What would the result look like?” And it does it.
So handing it maybe some RDFS or OWL and some data that goes with it—all of it relevant to what you’re going to ask—is important. And this assumes we’re talking about companies with a specialty and their own special data that helps people do jobs in their domain better. For a company like that to work with one of the LLMs and provide their own data as a graph—the graphs just seem to do better than chunking, vectors, and some of the other approaches. So that’s one way to really help it give more sensible answers, along with human-in-the-loop and making it iterative as well. Here’s one thing I like about the LLMs: if I enter a question and I don’t really like the answer, I can say why I don’t like the answer and ask for another one. But providing this data helps a lot with the guardrails—it can be the guardrails in many cases. Data that gives more background, that flags things that might need to be disambiguated, and so on—you just tell it up front in a structured way, and that’s going to be easier for it to understand than your own natural language sentences.
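[One common GraphRAG pattern along these lines, sketched generically rather than as a specific Graphwise pipeline: a SPARQL CONSTRUCT query pulls out just the subgraph relevant to the user’s question, and the resulting Turtle goes to the LLM along with the question, giving it a small, structured, relevant graph instead of loose chunks of text.]

    PREFIX ex: <http://example.com/ns/>
    # Hand the LLM only the triples about products in the relevant category.
    CONSTRUCT { ?product ?p ?o }
    WHERE {
      ?product a ex:Product ;
               ex:category ex:Watches ;
               ?p ?o .
    }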
Alan Morrison: Yeah. I saw you post something related to the Adam Kimball interview that I did not too long ago, and Adam was talking about this kind of feedback loop again, and recursion, as I recall. But it seems like the feedback mechanisms need to be in more places inside organizations. Once you get this thing going where you’ve got the machine learning capability and it’s learning and perceiving, you could set this up in every department in a company, so that you have a human from that department in the loop providing the specific information that department needs and refining it in this iterative process. I mean, I’m surprised that companies aren’t talking about this more.
Bob DuCharme: Well, it was fun for me the first time I saw that something like ChatGPT could just give you triples—syntactically correct triples. If you can add those back into your knowledge graph as part of a workflow that has someone review it, the person who reviews might have ideas: “Yeah, this information was pretty good, but it could be better.” Well, that can lead to tweaking the model so that it is better. So not just a thumbs up or thumbs down, but this human-in-the-loop can take part in tuning so that it gives even better information.
Alan Morrison: For sure. So it just seems like there are a lot of tools in the toolbox. Where are you seeing people use the kinds of standards that we’ve talked about more often? What are you seeing from your vantage point?
Bob DuCharme: I think all the ones we discussed. I mean, of course SHACL—though I don’t see as much awareness of RDFS as I’d like. Look at schema.org: schema.org is RDFS. Because, like I said, for years people were put off by OWL—“It’s too complicated.” Well, here’s a simpler version that can really do basic modeling, one that anyone who’s modeled data with any other system can get used to.
So it’s nice, hopefully, to see more of that—more of the standards-based tools. I mean, that and SHACL—the modeling and then SHACL—and then SKOS, of course, which has always been popular but is getting more popular. Because, as people can do more, where a taxonomy or thesaurus may have just been a list of words before, now it’s something that can be plugged into a system and really help that system. So that’s really helping with SKOS—people are using more of it. So yeah: SKOS, SHACL, RDFS, RDF itself. I’ll probably kick myself later for forgetting some, but—
Alan Morrison: I interviewed Tony Seale for the podcast. Tony talked when he was at UBS about using schema.org inside of UBS to get people on board with it, and it seemed like a good starting point from their perspective—something that wasn’t too intimidating to begin with.
Bob DuCharme: And you can prove the value. The disambiguation—if I’m going to have an employee but I need to represent that as a URI, I could have my company’s domain.com/employee. But my data is going to be much more interoperable with other data if I go with the schema.org version of that.
And that interoperability, that ability to plug in—I mean, that’s one of the great things about RDF: when people use common, well-published URIs to represent basic things, the more people are using them, the stronger the network effect that makes data easier to plug in with other data and to integrate within a growing knowledge graph that has more and more stuff to take advantage of.
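[A small illustration of that choice, with made-up identifiers on our part:]

    @prefix schema: <https://schema.org/> .
    @prefix my:     <https://mycompany.example/ns/> .

    # Less interoperable: a private term only this company understands.
    <https://mycompany.example/id/emp42> a my:Employee .

    # More interoperable: a shared, well-published vocabulary.
    <https://mycompany.example/id/emp42> a schema:Person ;
        schema:worksFor <https://mycompany.example/id/mycompany> .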
Alan Morrison: And you can just disambiguate with the help of these external resources that you’re pointing to. And it seems like many will be familiar with schema.org for website optimization to begin with, and then you just use its adjacent cousins, I guess if you will, to sort of expand the use of the same methodology. Is that correct?
Bob DuCharme: Yeah. And I think those cousins—or children—you know, there are people that have taken schema.org and then made subclasses but made that available for other people. I think the IPTC (International Press Telecommunications Council) did for sport—I think that’s where they came up with their sports schemas to represent data from everything from a basketball game to tennis to Formula 1 racing. They made a sort of standard based on that standard that they encourage other people to have input on and then share and use. So it’s really grown beyond just the classes and properties that you see defined on the schema.org website.
Alan Morrison: You were talking about sports, and I was thinking about other endeavors. You’re a music aficionado—you’re a jazz bassist. Have you seen a lot of this kind of thing in what used to be called the music industry? MusicBrainz, for example, was an example of early semantic technology and the application of it. Are you seeing anything today that’s music-related that would be interesting? Because I play around with the guitar. I studied piano growing up. So I’m just curious.
Bob DuCharme: There’s Wikidata. And there was some database of Miles Davis-related stuff that was available in RDF, and there are some jazz ones out there, but those are more for record collectors—“who played bass on this album in 1952” kind of things.
I had fun—there’s a website called BeatlesBible.com or .org which lists pretty much everyone who played every instrument on every Beatles song. And they listed it in such a regular way—it was all natural language English, but with colons and semicolons and whatever, and it was so regular that I thought, “I can just turn this into RDF.” So I wrote a script to turn it all into RDF. And it was so much fun to ask questions that people maybe never asked: “Who played piano more than anyone else?” It turns out, I think, Paul and then George Martin, but all the Beatles played piano on one song or another. “Where did Ringo play piano?” Oh, here. You know, I was doing all that with SPARQL queries.
I should really put that in Talk to Your Graph with natural language. “How many songs does Eric Clapton play on?” Oh, two: “While My Guitar Gently Weeps” and what else? Oh, “All You Need Is Love,” when they had all their friends in the background singing—they got credit for that. That’s not really playing, but it was a fun bit of data.
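[A “who played piano the most” question comes down to a SPARQL query roughly like this one, sketched against a hypothetical vocabulary rather than Bob’s actual dataset:]

    PREFIX ex: <http://example.com/beatles/>
    SELECT ?player (COUNT(?song) AS ?songCount)
    WHERE {
      ?song ex:piano ?player .   # ?player played piano on ?song
    }
    GROUP BY ?player
    ORDER BY DESC(?songCount)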
And several years ago I had some fun with—someone had written a library to round-trip between RDF and MIDI—you know, MIDI for encoding music—converting it to RDF and back. So I thought, “Well, I could write some RDF, convert it to MIDI, and then hear it played.” So I did a couple of little things, but one was a little sort of jazz piece with three instruments. There was a ride cymbal just going tsss-ts-ts-tsss, a bass playing fairly random quarter notes within certain parameters—either stepwise or thirds, but otherwise random—and then a muted trumpet.
Because with cheap MIDI, a regular trumpet and a regular sax just sound like cheap keyboards. A muted trumpet’s a little better—picking random notes too, but with sort of bebop soloing rhythms. And it was a lot of fun. I mean, it gets boring after two minutes, I will admit, but to generate RDF to then generate music—that was kind of fun.
Alan Morrison: Yeah, you made me think of one of my favorite recordings—James P. Johnson’s 1925 piano roll of “Charleston.” And I was just amazed that in 1925, a lot of the liveliness of the way he played came through on the piano roll. So what you’re talking about is like, okay, we’re just trying to do this through MIDI now with all sorts of different instruments and all sorts of different tonalities.
Bob DuCharme: There’s music people call “black MIDI” where they’ll generate a MIDI file that’s just so incredibly dense that if you did show it on a regular score, the page would be pretty much black. And it’s obviously something that people can’t play, but you can generate it if you use good samples of a nice piano. You can hear it back. And so it’s more about coming up with algorithms to populate these MIDI files and then to listen back. It’s sort of like algorithmic composition, I guess.
One story I was going to tell about James P. Johnson—my mother once said that her dad, who was into jazz, had these old 78s. And as a young guitar player listening to a lot of Eric Clapton and people like that, I was like, “Was it Robert Johnson? Was it Robert Johnson?” She said, “I think there was a lot of piano.” I’m like, “Oh, okay. Yeah, probably James P. Johnson.” That would be cool.
Alan Morrison: Yeah, for sure. So what are you listening to lately? In terms of music, are you performing regularly as a bassist? Are you part of a combo or something?
Bob DuCharme: Actually, when I was young I played loud rock guitar, lead guitar in bands in New York trying to get signed.
And then we moved to Charlottesville. I’d played a little electric bass. When we moved here I thought, “Oh, it’d be fun to play upright bass.” You know, it’s tougher in a New York apartment.
So I learned and I ended up playing in the local minor league here, playing at this Italian restaurant every Wednesday night. It’s background music. And then a friend told me about a local community orchestra that was so desperate for bass players they would probably take me. And I had never really played classical music. I listened to a lot, but I had worked on my bowing and my reading a little more than some jazz beginners.
So I went and they didn’t kick me out, and pretty soon I was playing bass in Beethoven’s Fifth and Ninth. And then I thought, “Well, I always had this fantasy of writing string quartets.” So I switched to viola, and I’ve actually got a gig this Saturday afternoon with another community orchestra playing viola.
And a lot of what I listen to is on Bandcamp, where you can tag things. Here’s one of the perils of metadata: on some of these websites where you can add your own tags to your own stuff, people will say “this is classical music” when they’re just rambling up and down the white keys, maybe after a few bong hits, and it’s pretty boring.
But every other month, this one music writer picks out the best of what they call contemporary classical—interesting new stuff—and he’ll write a little essay with links to it. And I’ve even sent him some fan mail thanking him for doing this, because it’s a good lesson in the downsides of trusting the automated assignment of metadata, or trusting people to add their own. Curation by a real person with a point of view who says, “Here’s the good new stuff to listen to this month”—that has guided a lot of my listening. And then I hear certain people and go listen to their whole album on Bandcamp.
Alan Morrison: Absolutely. Just talking about governing the metadata—if you could make an organizational change, if there was some really common issue across companies that you knew you could change, what would you change to help us out with the governance problem that we have with data?
Bob DuCharme: You know, it might be a cop-out, but I’d have to go back to what I said before: for any given company, what are you trying to do? And then I would strongly encourage them to use standards. If they use schema.org, then it does interoperate with what the others are doing.
But several years ago, when I worked at a company that had a taxonomy manager, I got to go to the Taxonomy Boot Camp conference in Washington every year, and I learned a lot more about the taxonomy world.
One thing I learned was that many people with library science degrees—or as we would say now, information school degrees—became full-time taxonomists. Their job would be to go to a big company where—let’s say there’s a meeting in the company and you have a certain project, and I say, “Alan, that project you’re working on—like three years ago, this guy who used to work here wrote a report. I think that would be really useful for you.” How are you supposed to find that report, that PDF?
Well, the full-time taxonomist—it’s their job to get people to assign the metadata so that you can find that. That’s the work that so many of these people are doing: organizing the taxonomies and the assignment of metadata.
But I heard stories at the conference where people just wouldn’t add metadata. “Okay, we’ll make it required. You have to add at least one keyword.” Everyone would add the keyword “aardvark” because alphabetically it was the first thing on the list, right? And obviously that doesn’t help. [Graphwise offers a machine-assisted approach to automation.]
So that was really interesting, hearing about full-time taxonomists describing their challenges and success stories of managing metadata. So I learned a lot from them, and it’s interesting to work with people like that again. So I’m sorry if that was kind of a cop-out to your question, but a single thing—you know, use standards.
Alan Morrison: Well, I was just thinking about role changes, for example. I was thinking about, okay, how do we get people to collaborate better? Because there are obviously people in organizations who are more attuned to the metadata side of things and they like working with it.
My mom was a librarian. She loved cataloging. She loved classifying things. Those people could be a lot better utilized in a big organization if they weren’t so siloed in departments.
And that’s sort of my dream: you’ve got this little AI team over here doing all these things and they think they have to do everything themselves. They don’t have to do everything themselves.
They’ve got all these other resources—they need to know how to tap into them. So there’s like a metadata sort of structure, a schema of sorts, to help them navigate their own organization. I used to do that at PwC—how do you navigate this organization informally to get to the people who can help you out?
Bob DuCharme: Yeah. And with a librarian like your mom, it shows that structured metadata goes back to the 19th century—was it Melvil Dewey, of Dewey Decimal fame? I think that was his first name. People were working out structured metadata as a way to navigate data over 100 years ago, and we can take advantage of those things and others like them.
Alan Morrison: It’s like there’s so much out there and things that were created 100 years ago still have utility today and perhaps are underutilized. So just—
Bob DuCharme: I think I’ll have to do a search later today. There must be someone who has put together a SKOS representation of the Dewey Decimal System. I mean, I know Dewey has kind of fallen out of favor and people use the Library of Congress classification more now, which is also available in SKOS—it can be downloaded from the Library of Congress. But the Dewey Decimal one in SKOS—that would be interesting to see.
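[For readers curious what a classification scheme looks like in SKOS, here is a hand-made sketch of two Dewey classes (our illustration, not the published data):]

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix ddc:  <http://example.com/ddc/> .

    ddc:700 a skos:Concept ;
        skos:prefLabel "The arts"@en ;
        skos:notation  "700" .

    ddc:780 a skos:Concept ;
        skos:prefLabel "Music"@en ;
        skos:notation  "780" ;
        skos:broader   ddc:700 .   # Music sits under The arts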
Alan Morrison: For sure. Well, are there other things we need to talk about today? Other related topics that you wanted to cover before we sign off?
Bob DuCharme: No, I got to talk about Robert Johnson and generating music and lots of fun stuff, as well as this technology—a lot of which is fun, which is why I had a blog. It was fun just to play with it. And it was sort of a dream come true to end up with a full-time job at a company making some of the better products in that area—products that were also free, by the way. I always made it a point on my blog not to write about commercial software, just stuff that was free. And Graphwise had, and still has, a free version of GraphDB for trying a lot of these technologies.
Alan Morrison: Yeah. And just to reiterate, bobdc.com is your blog site.
Bob DuCharme: Yeah. The blog is there, and you can see more about my music activities and things like that.
Alan Morrison: A really fun site to read. And Bob, I appreciate the time so much today. It’s really helpful to have you walk us through some of these issues.
Bob DuCharme: Oh, I love it. Thank you for inviting me. I love spouting off opinions.