Editing Out the Real World With ‘Diminished Reality’



In the 2014 Christmas special of the dystopian sci-fi anthology series Black Mirror, writer and series creator Charlie Brooker envisaged the possibility of cybernetic augmentations capable of ‘blocking out’ upsetting or banned material for a particular user – not only online, but in real-world interactions. In the episode, one of the main characters was ‘blocked’ by his ex-wife, so that she could literally no longer ‘see’ him.

The fictional ‘Z-Eye’ interface in the 2014 Black Mirror Christmas special allowed a user to ‘block’ people in the real world. Source: https://www.youtube.com/watch?v=AOHy4Ca9bkw

Towards the episode’s conclusion, Jon Hamm’s character was also ‘banned’ from being able to clearly see anyone else at all, due to his ‘offender’ status.

Now, an extraordinary project from Oxford University envisages just such a system, in which the end user can ‘blur out’ upsetting or triggering images, objects and text as they walk around the world, using an open source, platform-agnostic ‘diminished reality’ framework that could potentially be used with any available portable head-worn VR/AR glasses, or else be incorporated directly into an existing commercial or closed-source system.

In this example, we see below a) a definition of the target user, who has developed a phobia about dogs, b) the circumstances in which they have to deal with real-world or represented images of dogs, and c) the prescribed solution – dogs, real or otherwise, are occluded from the view of the user, whose re-rendered viewpoint contains whatever 'masks' that user desires. Source: http://export.arxiv.org/pdf/2211.08005

The process and supporting technologies are described in the new paper Cross-Reality Re-Rendering: Manipulating between Digital and Physical Realities. The system is the culmination of three earlier works from the same pool of researchers – GreaseDroid, a framework designed to defy manipulative dark patterns in commercial user interfaces; GreaseTerminator, a further attempt to ‘mind-proof’ invasive or exploitative tendencies in the mobile space; and GreaseVision, the first attempt in this research strand to actively apply visual blocking through server-side re-rendering, fed back to the user in real time.

Adblocking for Reality

The system has a working implementation, and could potentially be deployed on any device that’s capable of either replacing or overlaying the user’s field of view (VR and AR, respectively).

In the case of head-worn VR equipment, the system operates more like AR, in that the user’s actual real-world viewpoint is streamed back to them live, with offending objects omitted or inpainted, from a platform-agnostic central server that has been outfitted with ‘hooks’ – models and algorithms trained on a minimal number of user-chosen images, and capable of intervening in and overwriting their real-world analogues, from the point of view of the user.
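In outline, each ‘hook’ can be thought of as a detector/rewriter pair that the server applies to every frame before streaming it back to the wearer. The sketch below is an assumption about how that plumbing might be organized, not code from the paper; the function names and the Gaussian-blur intervention are illustrative.

```python
import cv2

def blur_regions(frame, regions):
    # Illustrative intervention: Gaussian-blur every region flagged by a detector.
    for (x1, y1, x2, y2) in regions:
        frame[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], (51, 51), 0)
    return frame

def re_render(frame, hooks):
    # Server-side step: each hook pairs a detector with a rewrite function, and
    # the frame passes through all registered hooks before being streamed back.
    for detect, rewrite in hooks:
        frame = rewrite(frame, detect(frame))
    return frame

# Example registration: a placeholder detector (returns no regions) paired with
# the blur rewrite; a real hook would supply a trained object or text detector.
hooks = [(lambda frame: [], blur_regions)]
```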

An example of the kind of simple rig that can turn a mobile device into a 'diminished/augmented' reality overlay for the new system.

The framework also allows the user to review their ‘perception history’, and to wind back to events or seen objects that they might wish not to see again, or to ‘paint out’ of their view later on.

The 'view history' interface is a web-hosted portal where the user can annotate digital (i.e. 2D) or real-world entities that they have encountered, and take action against them, by developing 'hooks' that will intervene if the objects or entities recur. Some of the hooks are based on simple fine-tuning of existing models, while others are more explicitly trained by the user.

The system specifically addresses the imposition of advertising material, both in the 2D realm (i.e. browsing on a computer or mobile device) and in the real world. Since it would be burdensome for a user to create hooks or filters even for common disturbances or annoyances that a large number of people might want to omit from their view, the framework is intended to provide crowdsourced hooks that relate not only to abstract forms of advertising or intrusion (such as advertising in websites or in apps), but also to specific locations or contexts within the viewer’s zone of activity.

In this sense, the system is designed to operate in a similar way to an adblocking system such as uBlock Origin, where the user can select from publicly available blocklists, and later add personalized blocking of elements specific to their needs and case.

The approach, which is partly based on work designed to thwart dark patterns, is also capable of recognizing tacit or ‘native’ in-context advertising, such as ‘Recommended items’ on platforms like YouTube and TikTok.

Diverse examples of the way that the system can intervene in both 'official' and user-contributed content in sharing platforms and social networks, via optical character recognition and subsequent (voluntary and user-desired) censorship.

As seen in the image above, the system accommodates ‘text hooks’, using character-level optical recognition to feed interpreted text into models capable of flagging text that has either been identified by a community as generically offensive (racism, sexism, etc.), or that the user themselves has trained the system to recognize.

This part of the approach first uses the EAST text-detection system to locate text regions, which are then passed to Google’s Tesseract OCR framework, which extracts each character in the region until the opaque (i.e. rasterized) text exists as operable text that can be evaluated by filters, and obscured as necessary.

Individuation of text using the EAST system. Source: https://arxiv.org/pdf/1704.03155.pdf
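The paper’s pipeline runs EAST for detection before handing the regions to Tesseract; the snippet below is a simplified, hedged sketch that instead lets Tesseract’s own layout analysis locate the words (via pytesseract) and then draws censor boxes over any word matching a user-defined blocklist. The blocklist contents and function name are illustrative, not taken from the paper.

```python
import cv2
import pytesseract
from pytesseract import Output

BLOCKLIST = {"exampleslur", "anotherterm"}  # illustrative terms only

def censor_text(frame, blocklist=BLOCKLIST, min_conf=60):
    # Word-level OCR: image_to_data returns one bounding box per recognized word.
    data = pytesseract.image_to_data(frame, output_type=Output.DICT)
    for i, word in enumerate(data["text"]):
        if float(data["conf"][i]) < min_conf or not word.strip():
            continue
        if word.strip().lower() in blocklist:
            x, y, w, h = (data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i])
            # Draw a solid censor box over the offending word.
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 0), -1)
    return frame

frame = cv2.imread("screenshot.png")
cv2.imwrite("censored.png", censor_text(frame))
```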

The paper states:

‘A sample application of this hook include interventions against text of specific conditions (e.g. placing censor boxes over hate speech, or generating new text personalized to the user). Another example is the identification and highlighting of specific text used in one interface (e.g. product ads on Facebook) and appearing in another (e.g. search results in Amazon, or appearing in real-life when in a store).’

Besides a possible phobia of dogs (see earlier image above), the paper provides other examples of use cases for the system. For instance, a person who has had an unusual and traumatic encounter with death, and who is triggered by the sight of gravestones or graveyards, could create a mask hook that will blur out such imagery, whether watched in a 2D environment, or encountered when traveling in the real world.

Below, three use-cases for the system; above, imposed occlusions on the offending items in an egocentric viewpoint.

As we can see from the example ‘persona 2’, the system is intended to combat the ubiquity of commercial imperatives to buy – in this case, obscuring prices from a shopaholic, whether in store windows that they pass, or in online environments.

The Power of the Crowd, and the Need for Caution

User-contributed hooks are envisioned as one of the most potent features of the architecture. At the individual level, a sole user can review their day in the form of egocentric video, perhaps (the paper suggests) with the help of automated interpretation systems to help the viewer/user find key events containing elements that they do not wish to see or experience again.

Once such hooks have been created, the user would be encouraged to categorize them in some way and offer them for community use. For very specific phobias or aversions, such hooks might experience low uptake, or else the user may not wish to share them for other reasons.

However, the static positioning of advertising material in the real world, such as large billboards or intrusive video placards placed strategically on highways or at intersections, seems likely to be an easy and early target for the rapid development of ‘reality adblocking’, which would probably emerge first in major urban centers, with slowly-growing coverage of more marginal or rural environments.

Regarding this, the paper’s author states:

‘Rather than waiting for a feedback loop for altruistic developers (e.g. app modifications for digital reality, or dedicated AR software for physical reality) to craft broad-spectrum interventions that may not fit their personal needs, the user can enjoy a personalized loop of crafting and deploying interventions, almost instantly for certain interventions such as element masks.

‘The user can enter metadata pertaining to each annotated object, and not only contribute to their own experience improvement, but also contribute to the improvement of others who may not have encountered the object yet.’

The paper comments further that ‘The primary utility of collaboration to an individual user is the scaled reduction of effort in intervention development.’

However, it must be considered that the very same unfettered public behavior that may inspire users of the system to apply blocks (e.g. to offensive text) is likely to make the system itself subject to some of the risks of unregulated crowdsourcing.

For instance, since ‘offensive’ is a relative term, there would technically be nothing to stop a user from developing and disseminating hooks and filters that exclude content they consider to be too ‘woke’; that screen out material combating conspiracy theories, such as those regarding the 1969 lunar landing; that discount the theory of global warming; or that in some way actively block out elements which, in a healthy society, should perhaps ideally be dealt with internally by those who encounter them.

The system is designed to be extensible, and to take in a wide range of available models, from sources such as model zoos, AdapterHub, Hugging Face, and GitHub. In theory, there is nothing to impede users from blocking out people, real or represented in images, who don’t accord with what they wish to see, such as people of particular races, genders or ages, or who have other characteristics a user may consider ‘undesirable’.

Indeed, since the system is capable of inpainting – a procedure that estimates the likely background behind an object or person and replicates it over them, so that the object or person is erased – such ‘offending’ items, controversially, would effectively either disappear from the literal world-view of the end user, or (in the case of overlaid augmented reality technologies) be replaced by something that they would prefer to see.

At a practical level, the paper notes that it is important for users not to be able to omit or occlude objects which could threaten them if hidden. Thus, in the case of users who wish not to see bicycles, some safeguards would need to be locked into the system:

‘As a user can manipulate their physical realities, there may be some critical situations where the re-render needs to be undone, or the user should be informed of the non-overlayed reality. For example, when crossing the road, though bicycles are occluded, they should not be completely inpainted. They should be slightly blurred, or at least a big arrow should above the cyclist to inform the user that an object exists and is approaching them. This also means certain objects that are intended to be used for physical safety, such as fire extinguishers or traffic lights, should not be overlay-able.

‘We could insert safety checker models to verify that non-overlayable objects are not manipulated, or alternatively we could prompt the user to re-consider their decision (e.g. doing a sample playback in the view history of what happens when this object is occluded).’
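The paper does not specify an implementation for such a safety checker; one hedged way to sketch the idea is a whitelist of protected object classes that a hook is never allowed to inpaint away entirely, with those detections downgraded to a light blur plus a warning instead. All names below are hypothetical.

```python
# Safety-critical classes that must never be fully inpainted away (illustrative list).
PROTECTED_CLASSES = {"bicycle", "car", "truck", "traffic light", "fire extinguisher"}

def safety_check(detections, requested_interventions):
    """Downgrade any intervention that would fully hide a safety-critical object.

    detections: list of (class_name, box) pairs from the detector.
    requested_interventions: dict mapping class_name to 'inpaint', 'blur' or 'none'.
    """
    approved = []
    for class_name, box in detections:
        action = requested_interventions.get(class_name, "none")
        if class_name in PROTECTED_CLASSES and action == "inpaint":
            # Never erase a protected object outright; fall back to a light blur
            # plus an on-screen warning so the user still knows it is there.
            action = "blur_and_warn"
        approved.append((class_name, box, action))
    return approved
```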

Method and Tests

The system was tested not on a user group, but by evaluating persona-based use cases and observing performance. Since the idea and implementation are novel, there are no equivalent systems against which to compare it, with any possible metrics applicable only to granular aspects such as latency and recognition accuracy.

The use of ‘templated’ virtual users is a long-established practice, and in this regard the work follows the methodology of a prior initiative from 2012.

To create custom hooks, or ‘masks’ capable of blurring, inpainting or otherwise obscuring undesired objects, the researcher (who collaborated with peers in the previous three ‘ancestral’ projects to this, but who has authored the new paper alone) used 100 images per object, trawled from Google Images and annotated with object labels and bounding boxes.

The author eschewed the possibility of using curated datasets such as CIFAR10, because such collections may not contain all the required elements (for instance, they may be a good generic source of ‘dogs’, but only contain US-style phone booths).

A Faster R-CNN model, pre-trained on the MS COCO object detection dataset, was used for fine-tuning (i.e., the gathered data was added to an already-trained model, potentially lowering its overall performance in favor of recognizing the newly introduced objects).

From 2015, the Faster R-CNN model in action, now incorporated into the new paper's recognition system at 'editing time' (i.e. the point at which the user traverses their history and creates hooks to address objects that they want to be masked or in some way altered in future encounters). Source: https://arxiv.org/pdf/1506.01497.pdf

The pre-trained head of the model was thus replaced by a new one containing the new class (i.e., ‘graves’, ‘bicycles’, etc.). Ultimately, four image-based models were created. For each of them, the remote system was now capable of applying a Gaussian blur live, with the amended video streamed back to the headset in real time.
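A hedged sketch of that head-replacement step, using torchvision’s stock Faster R-CNN implementation (the class count is illustrative, and the paper does not publish its exact training code):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a Faster R-CNN pre-trained on COCO and swap the box-predictor head
# for one sized to the new classes. The paper trains one model per target object
# from ~100 web-gathered images; the class count here is illustrative.
num_classes = 2  # background + one new class, e.g. 'grave'
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Fine-tuning on the annotated web images then follows the standard torchvision
# detection training loop; at inference time, each confident detection box feeds
# the Gaussian-blur intervention sketched earlier.
```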

The author then donned the headset and visited physical locations that would contain the trained object classes, such as a cemetery, a supermarket, a dog park and a cycle path.

It was found that most objects could be successfully occluded in real time, with failure cases limited to objects viewed at acute or unusually rotated angles (where, in any case, they would not present their ‘emblematic’ aspect). Objects that were very small in the distance also sometimes failed to be occluded, until they were near enough to trigger automated recognition.

To test the scalability of the system, the author used a scalability testing methodology formulated in 2010, though real subjects were again not used, due to the challenge of scaling itself, which would have entailed a large (and growing) set of test users.

Results for the removal (or obfuscation) of objects across various platforms and environments. In some cases, the desired blockable elements were not available in the interface, which could be a common user experience if the user is also employing secondary methods, such as adblocking.

In terms of blocking ‘banned’ elements in mobile environments, the author notes that such elements are not always consistent enough across mobile and desktop versions for blocking or inpainting to be uniformly effective, and that per-platform masks were needed to ensure full coverage. On the other hand, it was found that the GUI elements were consistent enough across iOS and Android for a single mask to be effective.

Also evaluated was the censoring of hate speech. To test this, a group of simulated users was employed, each of whom contributed some type of hate speech (in this case, mostly misogynistic comments), with the system challenged to identify and obscure the comments.

The recognition model for this aspect was trained on the Dynamically Generated Hate Speech Dataset, with the ‘women’ target subset specified. For this, the training set contained 1,652 examples, against a test set of 187 examples. The classifier used was RoBERTa, pre-trained on English-language corpora including Wikipedia and BookCorpus.
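A minimal sketch of how such a text hook might be fine-tuned, using the Hugging Face transformers Trainer API. The CSV file names and hyperparameters are illustrative assumptions, since the paper does not publish its training configuration; the files are assumed to contain 'text' and 'label' columns.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Binary classifier: hateful vs. not hateful, starting from pre-trained RoBERTa.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Illustrative CSV files with 'text' and 'label' columns, e.g. drawn from the
# 'women' subset of the Dynamically Generated Hate Speech Dataset.
dataset = load_dataset("csv", data_files={"train": "hate_train.csv",
                                          "test": "hate_test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hate-speech-hook", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```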

The paper comments:

‘The empirical results from the scalability tests indicate that the ease of mask generation and model fine-tuning, further catalyzed by performance improvements from more users, enable the scalable generation of interventions.’

Going Local

The system proposed is designed to be resistant to platform specificity or user lock-in to dedicated commercial hardware systems. To this end, the approach boldly relies on server-side processing and the achievement of latency adequate to real-time interaction. It is also intended to be user-configurable, rather than forcing its user-base to await the generosity of open source developers in implementing desired features.

The project’s self-stated stipulation that the system should not require a ‘high-end smartphone’ has precluded, for the time being, the possibility of edge computing, though systems such as YOLO are now quite capable of performant streaming and identification on moderately-specced smartphones (with power usage a trade-off against the need to stream from exterior sources).
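As an aside, an on-device variant of the masking step might look something like the sketch below, using the Ultralytics YOLO package; this is an assumption of mine for illustration, since the paper keeps detection server-side and does not use this library. The blocked class names are likewise illustrative.

```python
import cv2
from ultralytics import YOLO

# Small pre-trained detector suitable for moderately-specced hardware.
model = YOLO("yolov8n.pt")
BLOCKED_CLASSES = {"dog", "bicycle"}  # illustrative classes to 'diminish'

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, verbose=False)[0]
    for box in result.boxes:
        if result.names[int(box.cls)] in BLOCKED_CLASSES:
            x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
            frame[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], (51, 51), 0)
    cv2.imshow("on-device diminished reality", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```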

The paper concludes that the proposed system is intended to inspire further work into diminished reality systems, where the user may block elements of the world that they find objectionable.

Assuming that safety issues were to be well-handled in such a framework, removing the risk that a user might walk in front of a ‘painted out’ truck or an electric car (the system currently has no capability to intervene in audio), the obvious road forward would seem to be that some grass-roots version of this real-life Z-Eye could stimulate interest and demand to a point where an innovative AI company could popularize an implementation, perhaps with optional consumer or platform-specific hardware support.

The ethical and psychological merits of such a system could remain in the balance, however, since it offers users the possibility of 360-degree escapism and disengagement from the world around them. Arguably, the coming wave of augmented reality systems, such as the one promised by Apple, will in any case bring these issues into the public consciousness.
